thread local caching in glibc malloc

08 Jul 2017

Welcome to the fifth episode of the ptmalloc fanzine, in which we look at thread local caching, a recent addition to glibc malloc.

TLDR

This episode consists of:

an overview of the changes brought by per-thread caching
an exploration of how tcaching affects some old techniques
- the House of Spirit: fewer and looser prerequisites
- creating overlapping chunks via size corruption of allocated chunks later passed into free
- tcache poisoning: forcing malloc to return completely arbitrary chunks in a similar fashion to fastbin poisoning with fewer prereqs.
new possibilities

All analysis was done on this state of the glibc tree on Ubuntu 16.04 (x86-64).

Overview

The patch (see commit) offers significant performance gains (see benchmarks) by creating per-thread caches for chunks up to a certain size (practically below largebin sizes). Modifying these bins requires no locking, hence the speed improvements. It’s important to note that there are no distros using it currrently, since it will be only released as part of glibc 2.26 (scheduled in August), so things may change before it sees widespread use.

New structures

There are 2 new structures of interest, tcache_entry and tcache_perthread_struct. Both are rather simple, see them below. There are 64 singly-linked bins per thread by default, for chunksizes from 24 to 1032 (12 to 516 on x86) bytes, in 16 (8) byte increments. A single tcache bin contains at most 7 chunks by default.

/* We overlay this structure on the user-data portion of a chunk when the chunk is stored in the per-thread cache.  */
typedef struct tcache_entry
{
  struct tcache_entry *next;
} tcache_entry;

/* There is one of these for each thread, which contains the per-thread cache (hence "tcache_perthread_struct").  Keeping overall size low is mildly important.  Note that COUNTS and ENTRIES are redundant (we could have just counted the linked list each time), this is for performance reasons.  */
typedef struct tcache_perthread_struct
{
  char counts[TCACHE_MAX_BINS];
  tcache_entry *entries[TCACHE_MAX_BINS];
} tcache_perthread_struct;

static __thread tcache_perthread_struct *tcache = NULL;

tcache usage

Chunks can end up in the thread caches multiple ways:

upon free: before the fastbin code in _int_free, if the chunk has an appropriate size and the corresponding bin isn’t full
upon malloc, there are 3 places where caches are filled
- if a fast chunk is returned, the other chunks from the corresponding fastbin are used to fill the appropriate tcache bin.
- the same is done if a small chunk is returned by malloc.
- in the binning code, exact size matches are first put in the tcache instead of returning immediately.

Chunks are taken from the tcache:

in __libc_malloc, before _int_malloc.
after the binning code, if at least one exact match was found.
there can also be a limit on the number chunks that are put in the tcache in a run of the binning code. If that’s reached, the last one found is returned. However, this is unlimited by default.

Some observations:

the tcache fill code in the fast path of malloc will reverse the order of the chunks.
cached chunks won’t be coalesced
2.1 neither on free of neighboring chunks
2.2 nor with top when they are freed

An offense-focused analysis

The tcache handling code is very early in both free and malloc, as it should be, meaning that most of the free/malloc code is bypassed for non-large chunksizes until the corresponding tcache bins are full. This is by design but has some ramifications. As a direct result, most integrity checks are bypassed. In _int_free, a corrupted chunk only has to pass the alignment and wrapping checks before being cached. The caching happens in __libc_malloc on the malloc side.

The House of Spirit

Consider what is needed to make the caching code accept a region as a chunk in the beginning of _int_free. A fake size value is enough that satisfies the following :

it’s at a 2*SIZE_SZ aligned address
its value is between MINSIZE and the maximum cached chunksize (1032/516 bytes).
on x64 it also mustn’t have its 4th LSB set.

This makes the House of Spirit much more powerful than it used to be: there are no nextsize checks and now it works for smallbin sizes, too. The tcache_house_of_spirit.c example shows this in practice by building a fake chunk with a smallbin size and an invalid nextchunk on the stack, passing it to free and getting it back from malloc.

tukan@farm:~/work/libc/build/b2$ ./testrun.sh ../../../ptmalloc-fanzine/05-tcache/tcache_house_of_spirit
This example showcases how the House of Spirit became more powerful  after the tcache patch
Filling space at and after the fake chunk with invalid data
Building fake chunk on the stack at 0x7fff781b50e0
Passed chunk to free, let's make an allocation for the fake size
malloc(0x100) returned: 0x7fff781b50f0

Overlapping chunks

Creating overlapping chunks via the binning code in _int_malloc by corrupting the size of a freed chunk has already been rather easy but the caching mechanisms brings this possibility to the allocated chunk/_int_free side, too. Any size that passes the checks discussed above will result in the chunk being placed into the tcache bin corresponding to the fake size. The overlapping_chunks_by_caching.c shows this by enlarging a chunk.

tukan@farm:~/work/libc/build/b2$ ./testrun.sh ../../../ptmalloc-fanzine/05-tcache/overlapping_chunks_by_caching
This example showcases the possibility to create overlapping chunks             via the tcaching code in _int_free
Allocated victim chunk with requested size 0x48 at 0x560e374c4670
Allocated sentry element after victim (not strictly necessary): 0x560e374c46c0
Emulating corruption of the victim's size to 0x110
Freed victim chunk to put it in a different tcache bin
Requested a chunk of 0x100 bytes, it is at: 0x560e374c4670

tcache poisoning

Bins in a tcache behave rather similar to fastbins. Below is the code for tcache_get, responsible for removing a chunk from a tcache bin. Corrupting the next pointer in a tcache_entry yields the ability to return completely arbitrary chunks. Compared to the requirements for fastbin poisoning (a size_t value with the same fastbin_index as the poisoned fastbin to act as the size of the fake chunk), this is very attacker-friendly.

static void *
tcache_get (size_t tc_idx)
{
  tcache_entry *e = tcache->entries[tc_idx];
  assert (tc_idx < TCACHE_MAX_BINS);
  assert (tcache->entries[tc_idx] > 0);
  tcache->entries[tc_idx] = e->next;
  --(tcache->counts[tc_idx]);
  return (void *) e;
}

The tcache_poisoning.c example shows this in practice.

tukan@farm:~/work/libc/build/b2$ ./testrun.sh ../../../ptmalloc-fanzine/05-tcache/tcache_poisoning
This example showcases tcache poisoning by forcing malloc to return an arbitrary chunk after the corruption of a tcache_entry
Our target is a stack region at 0x7fff0faa62c0
Allocated victim chunk with requested size 0x48 at 0x55f6a8fb9670
Freed victim chunk to put it in a tcache bin
Emulating corruption of the next ptr of victim (while also corrupting its size for good measure)
Now we need to make two requests for the appropriate size so that malloc returns a chunk overlapping our target
The first malloc(0x48) returned 0x55f6a8fb9670, the second one: 0x7fff0faa62c0

Making a tcache bin circular by a double free is also a bit simpler than fastbin duplication because there is no double free check against the first member of the bin upon free.

I consider these the more serious issues, what follows is a theoretical treatment of a couple of other primitives.

Smallbin cache filling bck write

The cache filling code of the smallbin path in _int_malloc mentioned previously traverses the smallbin corresponding to the requested size and places chunks into the corresponding tcache bin (until the smallbin is empty or the tcache bin is full). It does the same unlinking as a couple of lines above to remove the victim chunk from the smallbin but lacks the bck->fd != victim check. This means that

the House of Lore could be made more practical again (though there are much more useful techniques now)
an uncontrolled write similar to the unsorted bck write could be achieved

tcache_perthread_structs as corruption targets

The tcache_perthread_struct of a thread is allocated via _int_malloc, so it resides on the heap. The counts member is mostly uninteresting but corrupting the entries array would make it possible to do the previous tcache poisoning in fewer steps. Since allocation of the structure happens before any other allocation, the viability of this approach will highly depend on the target:

the type of corruption
once the number of threads reach the arena number limit, arena sharing between threads might make these structures more interleaved with other data.

Conclusion

Per-thread caching is an interesting addition to glibc malloc providing significant performance benefits. However, it also seems to be a few steps backwards regarding the security posture of the allocator and shows that striking a good balance between performance and security is hard.

Special thanks to gym again for the rigorous proofreading.

Online tukan sanctuary