32C3 CTF sandbox writeup
13 Jan 2016
Sandbox was a 300-point exploitation challenge from 32C3 that executes our shellcode in something very similar to the old seccomp-legacy sandbox in Chromium. It was mostly me working on it, with some help from @kt. Even though we didn’t manage to solve the challenge during the CTF, it was surprisingly enjoyable. There are two possible solutions; both are covered here.
Rough sandbox arch
The basic idea is to set PR_SET_NO_NEW_PRIVS and PR_SET_SECCOMP via prctl to confine the sandboxed process.
- PR_SET_NO_NEW_PRIVS: With no_new_privs set, execve promises not to grant the privilege to do anything that could not have been done without the execve call. For example, the setuid and setgid bits will no longer change the uid or gid; file capabilities will not add to the permitted set.
- PR_SET_SECCOMP: With arg2 set to SECCOMP_MODE_STRICT, the only system calls that the thread is permitted to make are read(2), write(2), _exit(2), and sigreturn(2). Other system calls result in the delivery of a SIGKILL signal. Strict secure computing mode is useful for number-crunching applications that may need to execute untrusted byte code, perhaps obtained by reading from a pipe or socket.
This limits the process to those 4 syscalls without the ability to gain new privs via execve. If it needs anything else, it must make a request to its parent process using some form of IPC (shared memory and pipes in this case). The parent verifies the syscall number and arguments, then does the syscall. The problem is, there are operations that the parent cannot do for its child, like allocating memory. To overcome this, there is a trusted thread alongside the sandboxed one, which isn’t restricted by seccomp. The trusted thread runs in a hostile environment because the sandboxed thread has access to its address space. Therefore, the trusted thread only uses CPU registers and executes carefully handwritten assembly code.
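For reference, the two prctl calls described above could look roughly like this. It is only a sketch in Python via ctypes (the challenge binary does this natively), and the function name `confine` is made up for illustration. The constants are the values from the Linux headers.

```python
import ctypes

# Values from <linux/prctl.h> and <linux/seccomp.h>.
PR_SET_SECCOMP = 22
PR_SET_NO_NEW_PRIVS = 38
SECCOMP_MODE_STRICT = 1


def confine():
    """Hypothetical helper: drop the calling thread into strict seccomp mode.

    This is irreversible. Afterwards only read, write, _exit and sigreturn
    are permitted, so calling it from a normal Python interpreter will
    most likely get the interpreter killed with SIGKILL soon after.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(PR_SET_NO_NEW_PRIVS, ctypes.c_ulong(1), zero, zero, zero) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_NO_NEW_PRIVS) failed")
    if libc.prctl(PR_SET_SECCOMP, ctypes.c_ulong(SECCOMP_MODE_STRICT), zero, zero, zero) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_SECCOMP) failed")
```

The function is deliberately not called here; it is only meant to show the order and arguments of the two calls.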
The hierarchy is the following:
- parent process: unconfined, clones the child process (sandboxee) and enters a loop waiting for syscall requests
- sandboxee thread: created by parent, creates the trusted thread, sets seccomp mode and executes our shellcode
- trusted thread: handwritten assembly routine executing the syscall requests verified by parent
The parent process mmaps two 4096-byte regions, mmap1 and mmap2, both shared with the children, which are used to pass syscall requests back and forth.
- mmap1 is read-write in the children; sandboxee can request a syscall by placing the syscall number/arguments here and signaling parent.
- mmap2 is read-only in the children; the parent process places the syscall information here after validation and signals trusted to execute the syscall.
The signaling happens via three pipes, p0, p1 and p2, used for the sandboxee->parent, parent->trusted and trusted->sandboxee directions, respectively. No actual data is sent through the pipes.
The syscall structure passed in mmap1 and mmap2 can be seen below.
00000000 syscall struc ; (sizeof=0x38, mappedto_10)
00000000 rax_ dq ?
00000008 rdi_ dq ?
00000010 rsi_ dq ?
00000018 rdx_ dq ?
00000020 rcx_ dq ?
00000028 r8_ dq ?
00000030 r9_ dq ?
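To make the layout concrete, here is a small Python mirror of that struct using ctypes. The class name `SyscallRequest` and the helper `pack_request` are made up for illustration, and reading rax_ as the syscall number is an assumption based on the usual x86-64 convention.

```python
import ctypes


class SyscallRequest(ctypes.LittleEndianStructure):
    """Mirror of the 0x38-byte struct: seven consecutive 64-bit slots."""
    _fields_ = [
        ("rax_", ctypes.c_uint64),  # assumed: syscall number
        ("rdi_", ctypes.c_uint64),
        ("rsi_", ctypes.c_uint64),
        ("rdx_", ctypes.c_uint64),
        ("rcx_", ctypes.c_uint64),
        ("r8_", ctypes.c_uint64),
        ("r9_", ctypes.c_uint64),
    ]


def pack_request(number, *args):
    """Serialize a request the way it would sit at the start of mmap1."""
    if len(args) > 6:
        raise ValueError("at most six arguments fit in the struct")
    padded = args + (0,) * (6 - len(args))
    return bytes(SyscallRequest(number, *padded))
```

For example, `pack_request(80, 0x1000)` yields 56 bytes whose first two quadwords are 80 and 0x1000.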
parent enters a loop. It reads on the p0 pipe and, when the read returns, makes a local copy of the syscall struct in mmap1 to prevent us from messing with it. It then checks the syscall number against a list of verifiers. Extracting the syscall numbers and matching them up with names via a simple python script gave the list of allowed syscalls. If the requested syscall is not in the list or the verification function fails, parent kills the children and quits. If verification succeeds, parent copies the local syscall struct to mmap2. The verifier for most of these simply allows the call; only two, open and chdir, have a common handler that checks whether the path to open/chdir contains dev, proc or sys.
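The dispatch logic described above can be sketched in a few lines of Python. This is not the decompiled code: the table only contains syscalls that this writeup mentions as being let through (the real list is longer), and the names `VERIFIERS`, `check_path` and `verify` are invented here. The syscall numbers are the standard x86-64 ones.

```python
# x86-64 syscall numbers for the calls mentioned in this writeup.
SYS_OPEN, SYS_MPROTECT, SYS_PIPE = 2, 10, 22
SYS_DUP, SYS_DUP2, SYS_GETPID, SYS_CHDIR = 32, 33, 39, 80

FORBIDDEN = ("dev", "proc", "sys")


def allow_all(path):
    """Verifier used by most syscalls: no checks at all."""
    return True


def check_path(path):
    """Common open/chdir verifier: reject paths containing dev, proc or sys."""
    return not any(word in path for word in FORBIDDEN)


# Hypothetical subset of the verifier table; mprotect is intentionally absent.
VERIFIERS = {
    SYS_OPEN: check_path,
    SYS_CHDIR: check_path,
    SYS_PIPE: allow_all,
    SYS_DUP: allow_all,
    SYS_DUP2: allow_all,
    SYS_GETPID: allow_all,
}


def verify(number, path=""):
    """Return True if parent would forward the request to trusted."""
    verifier = VERIFIERS.get(number)
    return verifier is not None and verifier(path)
```

Note that the check is a plain substring match, so a path made only of ‘/.’ components passes it.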
The code that trusted executes is a simple loop: it waits for a signal on the p1 pipe, fills the registers from the syscall struct in mmap2, executes the syscall, stores the return value and signals sandboxee via p2. The code seems robust, even considering our access to the address space in which it executes. It doesn’t use anything of interest from writable locations, doesn’t call library functions, only syscalls, and exits without returning on a failure.
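The whole round trip (sandboxee to parent to trusted and back) can be modelled with threads and pipes. The sketch below is a toy model, not the challenge code: the bytearrays stand in for the shared mappings, each role handles a single request instead of looping, a dummy byte is used as the wake-up signal, only getpid is "executed", and storing the return value in the rax_ slot of mmap1 is an assumption.

```python
import os
import struct
import threading

REQUEST = struct.Struct("<7Q")  # rax_, rdi_, rsi_, rdx_, rcx_, r8_, r9_
SYS_GETPID = 39

mmap1 = bytearray(4096)  # request area, writable by sandboxee
mmap2 = bytearray(4096)  # validated copy (read-only for the children in reality)
p0, p1, p2 = os.pipe(), os.pipe(), os.pipe()


def signal(pipe):
    os.write(pipe[1], b"\0")  # dummy byte, used purely as a wake-up


def wait(pipe):
    os.read(pipe[0], 1)


def parent():
    wait(p0)
    local = bytes(mmap1[:REQUEST.size])  # snapshot before verification
    # ... the verifier would run on `local` here ...
    mmap2[:REQUEST.size] = local
    signal(p1)


def trusted():
    wait(p1)
    number, *args = REQUEST.unpack_from(mmap2)
    result = os.getpid() if number == SYS_GETPID else 0  # stand-in for the syscall
    struct.pack_into("<Q", mmap1, 0, result)  # assumed location of the return value
    signal(p2)


def sandboxee_request(number, *args):
    padded = args + (0,) * (6 - len(args))
    REQUEST.pack_into(mmap1, 0, number, *padded)
    signal(p0)
    wait(p2)
    return struct.unpack_from("<Q", mmap1, 0)[0]


def run_demo():
    workers = [threading.Thread(target=parent), threading.Thread(target=trusted)]
    for worker in workers:
        worker.start()
    result = sandboxee_request(SYS_GETPID)
    for worker in workers:
        worker.join()
    return result
```

Calling `run_demo()` performs one getpid request through all three pipes and returns the pid.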
sandboxee is the thread executing our shellcode, restricted by SECCOMP_MODE_STRICT. Before our code is executed, sandboxee requests an open syscall for the file ready.txt, reads its contents and writes them to stdout, demonstrating how the sandbox works and how to request syscalls from parent.
An obvious way to break out would be to modify the code of the trusted thread. This would, however, require changing page protection attributes, and parent won’t let mprotect syscalls through. However…
Solution I: overflow in the open/chdir handler
The decompiled code of the handler can be seen below. The check for the path containing proc should hint at a possible direction to take: somehow bypass it and open /proc/self/mem for writing. This would allow us to modify the code of trusted from sandboxee and break out.
Looking at the code, there is a rather simple buffer overflow. The handler checks that rdi (the path argument of the syscall) points inside the mmap1 region, but at the end of the function it copies the path to mmap2+56 via strcpy, which does not bound the length. Both mmap1 and mmap2 are 4096 bytes in size, so we could make the filename point to mmap1+16, for example, and have a long string starting there. Since mmap2 is mmapped right after mmap1, it is placed directly below it in memory, meaning the copy overflows back into mmap1. At first, this doesn’t buy us much, considering that parent has already made a copy of the syscall struct from mmap1. What we get, however, is the ability to modify the end of the path, since mmap1 is writable by us and the path overflows into it. So by requesting a chdir into a long path consisting only of ‘/.’s, the checks will pass and we might be able to append /proc to the path before trusted executes the syscall. But we don’t even need to win a race, since trusted and sandboxee share file descriptors and parent lets pipe, dup and dup2 through without validation. This means we can take over the pipe on which trusted waits for the signal from parent to execute a syscall and trigger it at our leisure.
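The geometry of the overflow is easier to follow with numbers. The following Python model treats the two mappings as one 8192-byte buffer with mmap2 directly below mmap1, as described above. The path length of 4060 bytes and the helper names are picked for illustration only; the real exploit may use different values.

```python
import posixpath

PAGE = 4096
STRUCT_SIZE = 0x38
MMAP2 = 0              # model address of mmap2
MMAP1 = MMAP2 + PAGE   # mmap1 sits directly above mmap2


def c_strcpy(mem, dst, src):
    """Byte-wise copy including the terminating NUL, like strcpy."""
    while True:
        byte = mem[src]
        mem[dst] = byte
        if byte == 0:
            return
        src += 1
        dst += 1


def c_string(mem, addr):
    """Read a NUL-terminated string starting at addr."""
    return bytes(mem[addr:mem.index(0, addr)])


def simulate():
    mem = bytearray(2 * PAGE)
    src = MMAP1 + 16
    path = b"/." * 2030                    # 4060 bytes, passes the substring check
    mem[src:src + len(path)] = path        # buffer is zero-filled, so NUL follows
    passes_check = not any(w in path for w in (b"dev", b"proc", b"sys"))
    # parent's handler: copy the verified path right behind the struct in mmap2
    c_strcpy(mem, MMAP2 + STRUCT_SIZE, src)
    overflow = MMAP2 + STRUCT_SIZE + len(path) + 1 - MMAP1
    # sandboxee: rewrite the tail of the copy, which now lives in writable mmap1
    mem[MMAP1:MMAP1 + 6] = b"/proc\0"
    seen_by_trusted = c_string(mem, MMAP2 + STRUCT_SIZE)
    return passes_check, overflow, seen_by_trusted
```

In this model 21 bytes of the copy (including the NUL) spill into mmap1, and the path that trusted would read ends in /proc, which normalizes to plain /proc.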
Once we have /proc/self/mem open for writing, we modify trusted to simply jump to our code and spawn a shell. Executing the final exploit (the live services were still up at the time of writing this):
Solution II: a race condition
- Request an unverified syscall with the arguments that you want for open/chdir, e.g. getpid(“/proc”)
- Request an open/chdir with bogus arguments, e.g. chdir(“/”)
- Hope the scheduler is nice to you and preempts parent at just the right time: after it has written the syscall number of open/chdir into mmap2 (0x00401294 in the binary) but before it writes the new arguments, so that trusted ends up executing chdir(“/proc”)
- “It takes quite a lot of tries.”
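The window being raced for can be shown with a deterministic toy model. In the sketch below, parent's copy into mmap2 is a generator that pauses after every field, so we can "preempt" it right after the syscall number is stored. Strings stand in for the path pointers, a dict stands in for mmap2, and all names are invented for illustration.

```python
SYS_GETPID, SYS_CHDIR = 39, 80
FIELDS = ("rax_", "rdi_", "rsi_", "rdx_", "rcx_", "r8_", "r9_")


def parent_copy(mmap2, request):
    """Store the request field by field; every yield is a possible preemption point."""
    for name in FIELDS:
        mmap2[name] = request.get(name, 0)
        yield name


def race():
    mmap2 = dict.fromkeys(FIELDS, 0)

    # Step 1: unverified getpid carrying the argument we actually want.
    for _ in parent_copy(mmap2, {"rax_": SYS_GETPID, "rdi_": "/proc"}):
        pass

    # Step 2: harmless chdir("/"); parent is preempted after storing rax_.
    copy = parent_copy(mmap2, {"rax_": SYS_CHDIR, "rdi_": "/"})
    next(copy)
    seen_in_window = (mmap2["rax_"], mmap2["rdi_"])

    # If parent gets to finish first, the window is gone.
    for _ in copy:
        pass
    seen_afterwards = (mmap2["rax_"], mmap2["rdi_"])
    return seen_in_window, seen_afterwards
```

Inside the window mmap2 holds the chdir number together with the stale "/proc" argument; one store later it holds the harmless chdir("/"), which is why the real race needs so many attempts.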