32C3 CTF sandbox writeup
13 Jan 2016
Sandbox was an exploitation challenge worth 300 points at 32C3 CTF. It executes our shellcode in something very similar to the old seccomp-legacy sandbox in Chromium. It was mostly me working on it, with some help from @kt. Even though we didn’t manage to solve the challenge during the CTF, it was surprisingly enjoyable. There are two possible solutions; both will be covered.
Rough sandbox arch
The basic idea is to set PR_SET_NO_NEW_PRIVS and PR_SET_SECCOMP via prctl to confine the sandboxed process.
PR_SET_NO_NEW_PRIVS
With no_new_privs set, execve promises not to grant the privilege to do anything that could not have been done without the execve call. For example, the setuid and setgid bits will no longer change the uid or gid; file capabilities will not add to the permitted set
PR_SET_SECCOMP
With arg2 set to SECCOMP_MODE_STRICT, the only system calls that the thread is permitted to make are read(2), write(2), _exit(2), and sigreturn(2). Other system calls result in the delivery of a SIGKILL signal. Strict secure computing mode is useful for number-crunching applications that may need to execute untrusted byte code, perhaps obtained by reading from a pipe or socket.
This limits the process to those 4 syscalls without the ability to gain new privs via execve. If it needs anything else, it must make a request to its parent process using some form of IPC (shared memory and pipes in this case). The parent verifies the syscall number and arguments, then does the syscall. The problem is, there are operations that the parent cannot do for its child, like allocating memory. To overcome this, there is a trusted thread alongside the sandboxed one, which isn’t restricted by seccomp. The trusted thread runs in a hostile environment because the sandboxed thread has access to its address space. Therefore, the trusted thread only uses CPU registers and executes carefully handwritten assembly code.
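The two prctl calls can be sketched as follows. This is only an illustration in Python with ctypes, not the challenge's code (the binary does this natively); the helper name confine_current_thread is made up, and the constants are the values from the Linux headers. Actually calling it from a Python interpreter is pointless, since the interpreter would be killed on its next syscall outside the four allowed ones.

```python
import ctypes

# Values from <linux/prctl.h> and <linux/seccomp.h>.
PR_SET_SECCOMP = 22
PR_SET_NO_NEW_PRIVS = 38
SECCOMP_MODE_STRICT = 1


def confine_current_thread():
    """Hypothetical helper: apply both settings to the calling thread (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    ul = ctypes.c_ulong
    steps = (
        (PR_SET_NO_NEW_PRIVS, 1),               # execve can no longer add privileges
        (PR_SET_SECCOMP, SECCOMP_MODE_STRICT),  # only read, write, _exit, sigreturn
    )
    for option, arg2 in steps:
        if libc.prctl(option, ul(arg2), ul(0), ul(0), ul(0)) != 0:
            raise OSError(ctypes.get_errno(), "prctl(%d) failed" % option)
```

The order matters only in the sense that nothing useful can be done after the second call: from that point on the thread has to ask someone else for every other syscall.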
The hierarchy is the following:
- parent process: unconfined, clones the child process (sandboxee) and enters a loop waiting for syscall requests
- child process
  - sandboxee thread: created by parent, creates the trusted thread, sets seccomp mode and executes our shellcode
  - trusted thread: handwritten assembly routine executing the syscall requests verified by parent
Communication
The parent process mmaps two 4096-byte regions, mmap1 and mmap2, both shared with the children, which are used to pass syscall requests back and forth.
- mmap1 is read-write in the children; sandboxee can request a syscall by placing the syscall number and arguments here and signaling parent.
- mmap2 is read-only in the children; the parent process places the syscall information here after validation and signals trusted to execute the syscall.
The signaling happens via three pipes, p0, p1 and p2, used for the sandboxee->parent, parent->trusted and trusted->sandboxee directions, respectively. No actual data is sent through the pipes.
The syscall structure passed in mmap1 and mmap2 can be seen below.
00000000 syscall struc ; (sizeof=0x38, mappedto_10)
00000000 rax_ dq ?
00000008 rdi_ dq ?
00000010 rsi_ dq ?
00000018 rdx_ dq ?
00000020 rcx_ dq ?
00000028 r8_ dq ?
00000030 r9_ dq ?
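For building requests from a script, the layout above maps onto seven little-endian 64-bit fields. A minimal sketch, assuming x86-64 byte order; the helper names pack_request and unpack_request are mine, not from the binary:

```python
import struct

# Seven unsigned 64-bit fields, in the order of the listing above.
SYSCALL_STRUCT = struct.Struct("<7Q")
FIELDS = ("rax", "rdi", "rsi", "rdx", "rcx", "r8", "r9")


def pack_request(rax, rdi=0, rsi=0, rdx=0, rcx=0, r8=0, r9=0):
    """Serialize one syscall request as it would sit at the start of mmap1."""
    return SYSCALL_STRUCT.pack(rax, rdi, rsi, rdx, rcx, r8, r9)


def unpack_request(raw):
    """Parse the first 0x38 bytes of a region back into named fields."""
    return dict(zip(FIELDS, SYSCALL_STRUCT.unpack(raw[:SYSCALL_STRUCT.size])))
```

The struct size comes out as 0x38, matching the sizeof in the listing, which leaves the rest of each 4096-byte region free for things like path strings.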
Parent
After initialization, parent enters a loop. It reads on the p0 pipe and, when the read returns, makes a local copy of the syscall struct in mmap1 to prevent us from messing with it. It then checks the syscall number against a list of verifiers. Extracting the syscall numbers and matching them up with names via a simple Python script gave the following list of allowed syscalls. If the requested syscall is not in the list or the verification function fails, parent kills the children and quits. If verification succeeds, parent copies the local syscall struct to mmap2. The verifier for most of these simply allows the call; only two, open and chdir, have a common handler that checks whether the path to open/chdir contains dev, proc or sys.
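The path check can be modeled in a couple of lines. This is a sketch of the behaviour as described, assuming a plain substring match; it is not the decompiled handler, and the name path_allowed is hypothetical:

```python
FORBIDDEN = (b"dev", b"proc", b"sys")


def path_allowed(path: bytes) -> bool:
    """Model of the open/chdir verifier: reject paths containing a forbidden word."""
    return not any(word in path for word in FORBIDDEN)
```

Under that assumption the check only looks at the string it is given, at the time it is given; it says nothing about what the path resolves to or whether the string changes later.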
Trusted
The code that trusted executes (seen below) is a simple loop: it waits for a signal on the p1 pipe, fills the registers from the syscall struct in mmap2, executes the syscall, stores the return value and signals sandboxee via p2. The code seems robust, even considering our access to the address space in which it executes. It doesn’t use anything of interest from writable locations, doesn’t call library functions, only syscalls, and exits without returning on a failure.
Sandboxee
The thread executing our shellcode, restricted by PR_SET_NO_NEW_PRIVS and SECCOMP_MODE_STRICT. Before our code is executed, sandboxee requests an open syscall for the file ready.txt, reads its contents and writes them to stdout, demonstrating how the sandbox works and how to request syscalls from parent.
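To make the round trip concrete, here is a toy model of a single request in Python. It is not the challenge code: threads stand in for the three parties, bytearrays stand in for the shared mappings, verification is skipped, no real syscall is executed, and the retval list is a made-up place for the return value. I also assume the signal is one dummy byte written to the pipe.

```python
import os
import struct
import threading

REQ = struct.Struct("<7Q")          # rax, rdi, rsi, rdx, rcx, r8, r9 (0x38 bytes)
mmap1 = bytearray(4096)             # writable by the sandboxee
mmap2 = bytearray(4096)             # written only by the parent
p0, p1, p2 = os.pipe(), os.pipe(), os.pipe()   # each is (read_fd, write_fd)
retval = []                         # hypothetical slot for the return value


def parent_once():
    os.read(p0[0], 1)                       # wait for a request
    local = bytes(mmap1[:REQ.size])         # private copy, taken before any check
    # ... the verifier for local's syscall number would run here ...
    mmap2[:REQ.size] = local                # publish the request for trusted
    os.write(p1[1], b"!")                   # wake trusted


def trusted_once():
    os.read(p1[0], 1)                       # wait for the parent
    regs = REQ.unpack_from(mmap2)           # load "registers" from mmap2
    retval.append(("executed", regs[0], regs[1]))   # placeholder for the syscall
    os.write(p2[1], b"!")                   # wake the sandboxee


def sandboxee_request(nr, *args):
    regs = (nr, *args) + (0,) * (6 - len(args))
    REQ.pack_into(mmap1, 0, *regs)          # place number and arguments in mmap1
    os.write(p0[1], b"!")                   # signal the parent
    os.read(p2[0], 1)                       # block until trusted is done
    return retval.pop()


def demo():
    workers = [threading.Thread(target=parent_once),
               threading.Thread(target=trusted_once)]
    for w in workers:
        w.start()
    out = sandboxee_request(2, 0x1000)      # an open-like request with a fake pointer
    for w in workers:
        w.join()
    return out
```

Calling demo() pushes one request through all three pipes. Note that the sandboxee side needs nothing but read and write, which is exactly what strict seccomp leaves it with.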
An obvious way to break out would be to modify the code of the trusted thread. This would, however, require changing page protection attributes, and parent won’t let mprotect syscalls through. However…
Solution I: overflow in the open/chdir handler
The decompiled code of the handler can be seen below. The check for the path containing proc should hint at a possible direction to take: somehow bypass it and open /proc/self/mem for writing. This would allow us to modify the code of trusted from sandboxee and break out.
Looking at the code, there is a rather simple buffer overflow. The handler checks that rdi (the path arg of the syscall) points inside the mmap1 region, but at the end of the function it copies the path to mmap2+56 via strcpy. Both mmap1 and mmap2 are 4096 bytes long, so we could make the filename point to mmap1+16, for example, and have a long string starting there. Since mmap2 is mmapped right after mmap1, it is placed directly below mmap1 in memory, so a copy that runs past the end of mmap2 overflows back into mmap1. At first, this doesn’t buy us much, considering that parent has already made a copy of the syscall struct from mmap1. What we get, however, is the ability to modify the end of the path, since mmap1 is writable by us and the path overflows into it. So by requesting a chdir into a long path consisting only of ‘/.’s, the checks will pass and we might be able to append /proc to the path before trusted executes the syscall. But we don’t even need to win a race, since trusted and sandboxee share file descriptors and parent lets pipe, dup and dup2 through without validation. This means we can take over the pipe on which trusted waits for the signal from parent to execute a syscall, and trigger it at our leisure.
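The arithmetic of the overflow is easy to get wrong, so here is a small simulation. It models the two regions as one 8192-byte buffer with mmap2 directly below mmap1, as described above; the path length of 4060 bytes and the helper names are my own choices for illustration, not values from the real exploit.

```python
PAGE = 4096
MMAP2 = 0                  # lower region, read-only for the children
MMAP1 = MMAP2 + PAGE       # directly above it, writable by the sandboxee
PATH_DST = MMAP2 + 56      # where the handler copies the path


def c_string(mem, start):
    """Read a NUL-terminated string the way C code would."""
    return bytes(mem[start:mem.index(0, start)])


def strcpy(mem, dst, src):
    """Unbounded copy including the terminator; returns the string length."""
    s = c_string(mem, src)
    mem[dst:dst + len(s) + 1] = s + b"\0"
    return len(s)


def demo():
    mem = bytearray(2 * PAGE)
    path = b"/." * 2030                        # 4060 bytes of harmless '/.'
    src = MMAP1 + 16
    mem[src:src + len(path)] = path            # zero fill provides the terminator
    assert all(w not in path for w in (b"dev", b"proc", b"sys"))  # check passes

    n = strcpy(mem, PATH_DST, src)             # parent copies the verified path
    end = PATH_DST + n                         # address of the copied terminator
    spilled = end - MMAP1                      # path bytes that landed in mmap1

    mem[end:end + 6] = b"/proc\0"              # sandboxee rewrites the tail it owns
    return spilled, c_string(mem, PATH_DST)    # what trusted would now see
```

With these numbers the last 20 bytes of the copied path and its terminator end up in mmap1, so the string that starts in read-only mmap2 finishes in memory we can write, and the already-verified path can be extended to end in /./proc.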
Once we have /proc/self/mem open for writing, we modify trusted to simply jump to our code and spawn a shell. Executing the final exploit (the live services were still up at the time of writing this):
Solution II: a race condition
There is a tight race when parent starts copying a newly verified syscall struct over the previous one in mmap2. This was used by ricky in his exploit. The steps are the following:
- Request an unverified syscall with the arguments that you want for open/chdir, e.g. getpid(“/proc”)
- Request an open/chdir with bogus arguments, e.g. chdir(“/”)
- Hope the scheduler will be nice to you and preempts parent at just the right time, after it has written the syscall number of open/chdir into mmap2 (0x00401294 in the binary) but before it writes the new arguments, so that trusted ends up executing chdir(“/proc”)
- “It takes quite a lot of tries.”
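The torn write behind the steps above can be shown deterministically. The following is a simplified model, not ricky's exploit: the struct is a dict, pointers are replaced by the strings they would point to, and a generator marks every point at which parent could be preempted. The syscall numbers are the x86-64 Linux values for getpid and chdir; the function names are hypothetical.

```python
FIELDS = ("rax", "rdi", "rsi", "rdx", "rcx", "r8", "r9")
SYS_GETPID, SYS_CHDIR = 39, 80     # x86-64 Linux syscall numbers


def copy_stepwise(dst, src):
    """Copy field by field, yielding after each store (a possible preemption point)."""
    for name in FIELDS:
        dst[name] = src[name]
        yield name


def torn_read():
    blank = dict.fromkeys(FIELDS, 0)
    mmap2 = dict(blank)

    # Step 1: getpid("/proc") passes through untouched and is copied completely.
    first = {**blank, "rax": SYS_GETPID, "rdi": "/proc"}
    for _ in copy_stepwise(mmap2, first):
        pass

    # Step 2: chdir("/") is verified, but parent is preempted after one store.
    second = {**blank, "rax": SYS_CHDIR, "rdi": "/"}
    copier = copy_stepwise(mmap2, second)
    next(copier)                   # only the syscall number has been written

    # Step 3: what trusted would load if it read mmap2 at this instant.
    return mmap2["rax"], mmap2["rdi"]
```

In the model the window is exactly one store wide, which matches the remark that it takes quite a lot of tries on the real target.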