thread local caching in glibc malloc08 Jul 2017
Welcome to the fifth episode of the ptmalloc fanzine, in which we look at thread local caching, a recent addition to glibc malloc.
This episode consists of:
- an overview of the changes brought by per-thread caching
- an exploration of how tcaching affects some old techniques
- the House of Spirit: fewer and looser prerequisites
- creating overlapping chunks via size corruption of allocated chunks later passed into free
- tcache poisoning: forcing malloc to return completely arbitrary chunks in a similar fashion to fastbin poisoning with fewer prereqs.
- new possibilities
All analysis was done on this state of the glibc tree on Ubuntu 16.04 (x86-64).
The patch (see commit) offers significant performance gains (see benchmarks) by creating per-thread caches for chunks up to a certain size (practically below largebin sizes). Modifying these bins requires no locking, hence the speed improvements. It’s important to note that there are no distros using it currrently, since it will be only released as part of glibc 2.26 (scheduled in August), so things may change before it sees widespread use.
There are 2 new structures of interest, tcache_entry and tcache_perthread_struct. Both are rather simple, see them below. There are 64 singly-linked bins per thread by default, for chunksizes from 24 to 1032 (12 to 516 on x86) bytes, in 16 (8) byte increments. A single tcache bin contains at most 7 chunks by default.
Chunks can end up in the thread caches multiple ways:
- upon free: before the fastbin code in _int_free, if the chunk has an appropriate size and the corresponding bin isn’t full
- upon malloc, there are 3 places where caches are filled
Chunks are taken from the tcache:
__libc_malloc, before _int_malloc.
- after the binning code, if at least one exact match was found.
- there can also be a limit on the number chunks that are put in the tcache in a run of the binning code. If that’s reached, the last one found is returned. However, this is unlimited by default.
- the tcache fill code in the fast path of malloc will reverse the order of the chunks.
- cached chunks won’t be coalesced
2.1 neither on free of neighboring chunks
2.2 nor with top when they are freed
An offense-focused analysis
The tcache handling code is very early in both free and malloc, as it should be, meaning that most of the free/malloc code is bypassed for non-large chunksizes until the corresponding tcache bins are full. This is by design but has some ramifications. As a direct result, most integrity checks are bypassed. In
_int_free, a corrupted chunk only has to pass the alignment and wrapping checks before being cached. The caching happens in __libc_malloc on the malloc side.
The House of Spirit
Consider what is needed to make the caching code accept a region as a chunk in the beginning of _int_free. A fake size value is enough that satisfies the following :
- it’s at a
- its value is between
MINSIZEand the maximum cached chunksize (1032/516 bytes).
- on x64 it also mustn’t have its 4th LSB set.
This makes the House of Spirit much more powerful than it used to be: there are no nextsize checks and now it works for smallbin sizes, too. The tcache_house_of_spirit.c example shows this in practice by building a fake chunk with a smallbin size and an invalid nextchunk on the stack, passing it to free and getting it back from malloc.
tukan@farm:~/work/libc/build/b2$ ./testrun.sh ../../../ptmalloc-fanzine/05-tcache/tcache_house_of_spirit This example showcases how the House of Spirit became more powerful after the tcache patch Filling space at and after the fake chunk with invalid data Building fake chunk on the stack at 0x7fff781b50e0 Passed chunk to free, let's make an allocation for the fake size malloc(0x100) returned: 0x7fff781b50f0
Creating overlapping chunks via the binning code in
_int_malloc by corrupting the size of a freed chunk has already been rather easy but the caching mechanisms brings this possibility to the
_int_free side, too. Any size that passes the checks discussed above will result in the chunk being placed into the tcache bin corresponding to the fake size. The overlapping_chunks_by_caching.c shows this by enlarging a chunk.
tukan@farm:~/work/libc/build/b2$ ./testrun.sh ../../../ptmalloc-fanzine/05-tcache/overlapping_chunks_by_caching This example showcases the possibility to create overlapping chunks via the tcaching code in _int_free Allocated victim chunk with requested size 0x48 at 0x560e374c4670 Allocated sentry element after victim (not strictly necessary): 0x560e374c46c0 Emulating corruption of the victim's size to 0x110 Freed victim chunk to put it in a different tcache bin Requested a chunk of 0x100 bytes, it is at: 0x560e374c4670
Bins in a tcache behave rather similar to fastbins. Below is the code for
tcache_get, responsible for removing a chunk from a tcache bin. Corrupting the
next pointer in a
tcache_entry yields the ability to return completely arbitrary chunks. Compared to the requirements for fastbin poisoning (a
size_t value with the same
fastbin_index as the poisoned fastbin to act as the size of the fake chunk), this is very attacker-friendly.
The tcache_poisoning.c example shows this in practice.
tukan@farm:~/work/libc/build/b2$ ./testrun.sh ../../../ptmalloc-fanzine/05-tcache/tcache_poisoning This example showcases tcache poisoning by forcing malloc to return an arbitrary chunk after the corruption of a tcache_entry Our target is a stack region at 0x7fff0faa62c0 Allocated victim chunk with requested size 0x48 at 0x55f6a8fb9670 Freed victim chunk to put it in a tcache bin Emulating corruption of the next ptr of victim (while also corrupting its size for good measure) Now we need to make two requests for the appropriate size so that malloc returns a chunk overlapping our target The first malloc(0x48) returned 0x55f6a8fb9670, the second one: 0x7fff0faa62c0
Making a tcache bin circular by a double free is also a bit simpler than fastbin duplication because there is no double free check against the first member of the bin upon free.
I consider these the more serious issues, what follows is a theoretical treatment of a couple of other primitives.
Smallbin cache filling bck write
The cache filling code of the smallbin path in
_int_malloc mentioned previously traverses the smallbin corresponding to the requested size and places chunks into the corresponding tcache bin (until the smallbin is empty or the tcache bin is full). It does the same unlinking as a couple of lines above to remove the victim chunk from the smallbin but lacks the
bck->fd != victim check. This means that
- the House of Lore could be made more practical again (though there are much more useful techniques now)
- an uncontrolled write similar to the unsorted bck write could be achieved
tcache_perthread_structs as corruption targets
tcache_perthread_struct of a thread is allocated via
_int_malloc, so it resides on the heap. The
counts member is mostly uninteresting but corrupting the
entries array would make it possible to do the previous tcache poisoning in fewer steps. Since allocation of the structure happens before any other allocation, the viability of this approach will highly depend on the target:
- the type of corruption
- once the number of threads reach the arena number limit, arena sharing between threads might make these structures more interleaved with other data.
Per-thread caching is an interesting addition to glibc malloc providing significant performance benefits. However, it also seems to be a few steps backwards regarding the security posture of the allocator and shows that striking a good balance between performance and security is hard.
Special thanks to gym again for the rigorous proofreading.