Heap overflow in the necp_client_action syscall

Heap overflow in the necp_client_action syscall

One of the things that is important to us at GRIMM is making sure there is time to experiment, and explore new ways of approaching problems. We want to answer the big questions like “How can we find vulnerabilities that other tools and manual analysis has overlooked?” This is what we are passionate about. So when one of our engineers has an idea for a new fuzzer, we try to make time for them to put their idea to the test. This bug was found when Jeffball wrote a syscall fuzzer for MacOS and spotted this bug while looking into another crash. It is safe to say that the new fuzzer was a smashing success!

The following is a write-up of a heap overflow vulnerability found while Fuzzing the macOS necp_client_action syscall. The necp_client_action syscall is part of the Network Extension Control Policy (NECP) kernel subsystem. This bug was first found in the XNU kernel version 4570.1.46 and was patched in the 10.13.4 kernel update (version 4570.51.1). Exercising the bug results in a heap overflow which can be turned into an information leak and eventually arbitrary code execution in the kernel.

The write-up and the code can be found on our NotQuite0DayFriday repository here:
https://github.com/grimm-co/NotQuite0DayFriday/tree/master/2018.04.06-macos

Affected Versions: XNU kernel versions 4570.1.46 and later until 4570.51.1

Discovery Environment: macOS High Sierra inside VMware Fusion running a custom version of XNU 4570.1.46

Exercising

$ clang poc.c -o poc
$ ./poc
$ clang leak.c -o leak
$ ./leak
$ clang uaf.c -o uaf
$ ./uaf

Triage

While investigating the NECP subsystem during the development of a syscall fuzzer for the necp_client_action syscall, we traced one code path that lead to a memcpy with an unchecked user-provided length.

Looking through the necp_client_action function in bsd/net/necp_client.c, we can see that it is merely a wrapper for several different actions. The vulnerable case occurs when the action parameter to necp_client_action is set to NECP_CLIENT_UPDATE_CACHE.  necp_client_action then calls the function necp_client_update_cache.

The necp_client_update_cache function is meant to manage TCP related information for a connection. Thus, this function will error out if not called with a NECP client UUID that has a valid connection assigned to it. After checking the UUID, this function copies in a user provided necp_cache_buffer struct onto the stack. This struct, looks as follows:

typedef struct necp_cache_buffer {
    u_int8_t                necp_cache_buf_type;    // NECP_CLIENT_CACHE_TYPE_*
    u_int8_t                necp_cache_buf_ver;     // NECP_CLIENT_CACHE_TYPE_*_VER
    u_int32_t               necp_cache_buf_size;
    mach_vm_address_t       necp_cache_buf_addr;
} necp_cache_buffer;

If the buf_type and buf_ver are NECP_CLIENT_CACHE_TYPE_TFO and NECP_CLIENT_CACHE_TYPE_TFO_VER_1 respectively, necp_client_update_cache will attempt to read a necp_tcp_tfo_cache struct from the user and call tcp_heuristics_tfo_update with it. This struct, shown below, is notable as it includes a user controlled buffer of up to 0x10 characters and a length forthat buffer.

typedef struct necp_tcp_tfo_cache {
    u_int8_t            necp_tcp_tfo_cookie[NECP_TFO_COOKIE_LEN_MAX];
    u_int8_t                necp_tcp_tfo_cookie_len;
    u_int8_t                necp_tcp_tfo_heuristics_success:1;
    u_int8_t                necp_tcp_tfo_heuristics_loss:1;
    u_int8_t                necp_tcp_tfo_heuristics_middlebox:1;
    u_int8_t                necp_tcp_tfo_heuristics_success_req:1;
    u_int8_t                necp_tcp_tfo_heuristics_loss_req:1;
    u_int8_t                necp_tcp_tfo_heuristics_rst_data:1;
    u_int8_t                necp_tcp_tfo_heuristics_rst_req:1;
} necp_tcp_tfo_cache;

The tcp_heuristics_tfo_update (in bsd/netinet/tcp_cache.c) processes the necp_tcp_tfo_cache struct, in the following lines of code:

if (necp_buffer->necp_tcp_tfo_cookie_len != 0) {
    tcp_cache_set_cookie_common(&tcks,
        necp_buffer->necp_tcp_tfo_cookie,
        necp_buffer->necp_tcp_tfo_cookie_len);
}

If the cookie length is not zero, it will call tcp_cache_set_cookie_common, shown below. This function looks up tcp_cache struct associated with the current connection and copies the cookie to that struct.

static void tcp_cache_set_cookie_common(struct tcp_cache_key_src
    *tcks, u_char *cookie, u_int8_t len)
{
    struct tcp_cache_head *head;
    struct tcp_cache *tpcache;
    /* Call lookup/create function */
    tpcache = tcp_getcache_with_lock(tcks, 1, &head);
    if (tpcache == NULL)
        return;
    tpcache->tc_tfo_cookie_len = len;
    memcpy(tpcache->tc_tfo_cookie, cookie, len);
    tcp_cache_unlock(head);
}

However, the tc_tfo_cookie is at most 0x10 bytes long. Thus, if an attacker specified a parameter larger than 0x10 bytes, the memcpy would overflow the tcp_cache struct on the heap and write data past it. As the source cookie is located on the stack, the heap will be overflown with the stack contents. Immediately after the cookie on the stack is the rest of the Necp_tcp_tfo_cache struct.  However, the remaining fields in the struct only account for two bytes. After the end of the necp_tcp_tfo_cache struct, the user is not able to Control any other bytes that will be used in the overflow.

The tcp_getcache_with_lock function attempts to find an existing tcp_cache struct for the local and remote hosts associated with a connection.  If one is not found, a new one will be created. However, there is a limit on the Number of tcp_cache structures that will be created.

Bug Fix

So how was this bug fixed? While the source code for the XNU kernel has not been updated to reflect the latest available XNU kernel binary, we can take a look at the disassembly to understand the fix. In the 4570.51.1 kernel at address 0xFFFFFF800060DE59, the following instructions were added:

cmp ebx, 0x10
mov edx, 0x10
cmovb edx, ebx
...
copies edx bytes from necp_tcp_tfo_cookie (rsi) to tpcache (rdi)

This assembly compares the necp_tcp_tfo_cookie_len parameter to 0x10, and only if it is less than 0x10 does that value get used in the following memcpy. Thus any attempt to write more than 0x10 bytes results in 0x10 bytes being written.

Heap Analysis

This bug allows for a heap overflow in the kernel heap. More specifically, the overflown tcp_cache struct is in the kalloc.80 zone.  As such, any exploitation attempts will naturally look to overflow the tcp_cache struct into another allocation in the kalloc.80 zone. We began our analysis by looking for kalloc.80 allocations. While the number of kalloc.80 allocations was quite limited, we were able to find three useful allocations.

The first allocation is made by the necp_set_socket_attributes function (in bsd/net/necp_client.c). This function allows a user to tag a socket with a domain and account attribute. These strings are allowed to be any size and can be created, read, and freed whenever the user wants. These allocations are great for reading out any leaked content. Further they can also be used to groom the kernel heap. However, each socket can only create two NECP attribute strings. Thus the total number of NECP attribute strings is limited by the limit on open file descriptors.

The second allocation found was the IO vector metadata struct (struct uio, defined in bsd/sys/uio_internal.h). This struct is used for doing scatter/gather memory operations in the kernel. When the uio struct is Created via uio_create (in bsd/kern/kern_subr.c), the IO vector information array is stored immediately after the uio struct. As such, by adjusting the number of IO vectors stored in the uio struct, we can control size. By creating a uio struct with a single IO vector, the uio struct will be declared in the kalloc.80 zone. These structures can be declared and freed through the use of the recvmsg_x and sendmsg_x syscall. This struct is more useful for heap grooming than the NECP allocation strings, as one struct per thread can be created.

The third allocation is used in the POSIX shared memory subsystem (in bsd/kern/posix_shm.c). This subsystem allows processes to setup shared memory regions that are mapped in both processes. These shared memory regions are tracked with the pshminfo struct, shown below. This struct will be useful during exploitation due to the pshm_usecount reference counter that is early on in the struct.

#define PSHMNAMLEN  31  /* maximum name segment length we bother with */
struct pshminfo {
    unsigned int  pshm_flags;
    unsigned int  pshm_usecount;
    off_t   pshm_length;
    mode_t  pshm_mode;
    uid_t   pshm_uid;
    gid_t   pshm_gid;
    char    pshm_name[PSHMNAMLEN + 1];  /* segment name */
    struct pshmobj *pshm_memobjects;
    struct label* pshm_label;
};

Obtaining an information Leak

Now that we have our target allocations setup, we can begin working on the exploit. Our first goal will be to obtain an information leak and deduce the kernel slide. This is achieved by flooding the kalloc.80 zone with NECP attribute strings, freeing every other NECP attribute string, and then causing the overflow. Once the overflow has occurred, we will read out the NECP attribute strings looking for one that has had its contents changed. The new contents will contain the bytes from the kernel stack. This information leak is demonstrated in the included leak.c file.

Exploit Development

Unfortunately, we were unable to finish our exploit prior to the vulnerability being fixed. This section will describe our initial exploitation development. The proof of concept for this exploit development is contained in the Included uaf.c file. This proof of concept is not-deterministic and may require a few runs to execute properly.

Due to the design of the XNU kernel zone allocator, it is not possible to overflow into kernel heap metadata. As such, we must try to overwrite the contents of another kernel heap chunk. Further, as the heap overflow does not control the contents that are written, we must find a way to create more powerful exploit primitives.

As described previously, we will be concentrating on the pshminfo struct for exploitation. This struct contains a reference counter at byte 4. When SHM regions are deleted, this reference counter is decremented.  If the counter reaches zero, then the pshminfo struct is freed. By overflowing into the pshminfo’s reference counter and corrupting it, we will be able to turn this uncontrolled heap overflow into a use after free vulnerability.

Before we can cause the use after free, we must deal with a few Implementation details. First, in order to free the struct, the pshm_flags field must have the PSHM_DEFINED (0x2) or PSHM_ALLOCATED flags (0x4) set and the PSHM_INDELETE (0x80) and PSHM_ALLOCATING (0x100) flags not set. As the pshm_flags field is before the reference counter, we will need to overflow it as well.

The next issue that we will need to deal with is obtaining a pshminfo struct that is immediately after the tcp_cache struct. This is rather easy to achieve as the kalloc.80 zone is rather static and not used by many things in the kernel. However, we will need to identify which of the pshminfo structures is immediately after the overflown tcp_cache struct. However, as the pshm_flags field will be corrupted, it will most likely not be valid. Further, if the pshm_flags field is not set properly, shm_unlink will fail when called. Thus, we can detect the overflown SHM region, by calling shm_unlink on each of the SHM regions and looking for a failure.

Once we’ve identified the SHM region that was overflown, we’ll next need to correct the region’s pshminfo’s pshm_flags field. Unfortunately, we do cannot control the contents of the heap overflow, so we cannot directly set the value. However, during our testing, the value overflown into the pshm_flags field is uninitialized stack data. As such, the value can be different for each necp_client_action syscall. Thus, our proof of concept continually triggers the heap overflow until the flag is set to an acceptable value.

Now that we’ve managed to set a valid pshm_flags value and corrupt the Reference count, we can cause the use after free.  Prior to the most recent corruption, our proof of concept opens the overflown SHM region 252 more times. As such, we will most likely be able to close the SHM region more times than the Corrupted reference counter. Our proof of concept closes each of the SHM regions and creates a NECP attribute string immediately after. Once the pshminfo’s reference counter hits 0, it will be freed, and the next attribute string will be placed at the same memory location. As the kernel still maintains otherreferences to the SHM region, we will have caused a use after free condition on the pshminfo structure. To illustrate the use after free, our proof of concept prints out the information about the SHM region obtained via the proc_info syscall. As the pshminfo struct for the SHM region and our NECP attribute string are on the same memory region, the pshminfo struct will have been overwritten with the contents of our NECP attribute string (0x41).

The next steps in our exploitation plan was to setup the NECP attribute string such that the remaining references to the SHM region could cause another use after free. In the freed memory, a uio struct would be allocated which we could control and read via the NECP attribute string also occupying that region of memory. With the ability to read and modify a uio struct, the exploit should be able to achieve arbitrary read/write of kernel memory.

Timeline

10/24/17 - Bug discovered and began bug triage to understand the bug’s impact.

03/29/18 - Apple released macOS 10.13.4 with the patch included.

04/06/18 - Finished bug triage and analysis. Write-up made public.