A Brief Introduction to BPF and eBPF

Hi, Habr! We would like to let you know that we are preparing the release of the book "Linux Observability with BPF".

As the BPF virtual machine continues to evolve and is actively used in practice, we have translated an article for you describing its main features and current state.

In recent years, programming tools and techniques designed to work around the limitations of the Linux kernel in high-performance packet processing have gained popularity. One of the most popular techniques of this kind is kernel bypass: the kernel's networking layer is skipped entirely and all packet processing is performed from user space. Bypassing the kernel also means managing the network card from user space; in other words, we rely on a user-space driver to work with the network card.

By transferring full control of the network card to a user-space program, we reduce the overhead caused by the kernel (context switches, network-layer processing, interrupts, etc.), which matters a great deal when running at speeds of 10 Gb/s or higher. Kernel bypass, combined with other features (batch processing) and careful performance tuning (NUMA awareness, CPU isolation, etc.), forms the foundation of high-performance networking in user space. Perhaps the flagship example of this new approach to packet processing is DPDK (Data Plane Development Kit) from Intel, although there are other well-known tools and techniques, including VPP (Vector Packet Processing) from Cisco, Netmap and, of course, Snabb.

The organization of network interactions in user space has a number of disadvantages:

  • The OS kernel is an abstraction layer over hardware resources. Because user-space programs have to manage their resources directly, they also have to manage their own hardware. In practice, this often means writing your own drivers.
  • Since we are completely giving up kernel space, we are also giving up all the networking functionality provided by the kernel. User-space programs have to re-implement features that may already be provided by the kernel or operating system.
  • Programs work in a sandbox mode, which seriously limits their interaction and prevents them from integrating with other parts of the operating system.

In essence, when networking is done in user space, performance gains come from moving packet processing out of the kernel and into user space. XDP does exactly the opposite: it moves network programs (filters, translators, routing, etc.) from user space into kernel space. XDP lets us execute a network function as soon as a packet hits the network interface, before it starts travelling up the kernel's network subsystem. As a result, packet processing speed increases significantly. But how does the kernel let the user run their programs in kernel space? Before answering this question, let's look at what BPF is.
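
To make this concrete, below is a minimal sketch of an XDP program, written in the same style as the kernel samples discussed later in this article (the SEC macro comes from bpf_helpers.h, struct xdp_md and XDP_PASS from linux/bpf.h). It does nothing except hand every packet to the regular network stack; the program and section names are placeholders, the exact include paths depend on your build setup, and such an object file would typically be attached to an interface with iproute2, e.g. ip link set dev eth0 xdp obj prog.o.

#include <linux/bpf.h>
#include "bpf_helpers.h"

/* Minimal XDP program: it runs for every packet received on the
 * interface it is attached to, before the kernel network stack
 * sees the packet. XDP_PASS hands the packet to the stack;
 * returning XDP_DROP here instead would discard it immediately. */
SEC("xdp")
int xdp_pass_all(struct xdp_md *ctx)
{
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";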

BPF and eBPF

Despite the somewhat confusing name, BPF (Berkeley Packet Filter) is, in fact, a virtual machine model. This virtual machine was originally designed to handle packet filtering, hence the name.

One of the better-known tools that uses BPF is tcpdump. When capturing packets with tcpdump, the user can specify a packet-filtering expression; only packets that match this expression will be captured. For example, the expression "tcp dst port 80" matches all TCP packets arriving at port 80. This expression can be compiled down to BPF bytecode.

$ sudo tcpdump -d "tcp dst port 80"
(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 6
(002) ldb      [20]
(003) jeq      #0x6             jt 4    jf 15
(004) ldh      [56]
(005) jeq      #0x50            jt 14   jf 15
(006) jeq      #0x800           jt 7    jf 15
(007) ldb      [23]
(008) jeq      #0x6             jt 9    jf 15
(009) ldh      [20]
(010) jset     #0x1fff          jt 15   jf 11
(011) ldxb     4*([14]&0xf)
(012) ldh      [x + 16]
(013) jeq      #0x50            jt 14   jf 15
(014) ret      #262144
(015) ret      #0

This is basically what the above program does:

  • Instruction (000): loads the 16-bit word at packet offset 12 into the accumulator. Offset 12 corresponds to the packet's ethertype.
  • Instruction (001): compares the accumulator value with 0x86dd, i.e. the ethertype value for IPv6. If the result is true, the program counter jumps to instruction (002); otherwise it jumps to (006).
  • Instruction (006): compares the value with 0x800 (the ethertype value for IPv4). If the result is true, the program jumps to (007); otherwise to (015).

And so on, until the packet-filtering program returns a result, which is usually boolean. Returning a non-zero value (instruction (014)) means that the packet matched, while returning zero (instruction (015)) means that it did not.
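
That same classic BPF bytecode is not limited to tcpdump. With tcpdump -dd the filter can be dumped as a C array of struct sock_filter, and such a program can be attached to a socket with the SO_ATTACH_FILTER socket option, so that the kernel filters packets before they ever reach user space. Below is a minimal sketch, assuming root privileges; the single instruction in filter[] is only a placeholder standing in for the full output of tcpdump -dd "tcp dst port 80".

#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/filter.h>

int main(void)
{
    /* Classic BPF program; in practice this array would be the full
     * output of `tcpdump -dd "tcp dst port 80"`. The single instruction
     * below accepts every packet and is only a placeholder. */
    struct sock_filter filter[] = {
        { 0x06, 0, 0, 0x00040000 },  /* ret #262144 (accept) */
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* The kernel runs this cBPF program for every packet arriving on
     * the socket; non-matching packets never get copied to user space. */
    if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
        perror("setsockopt(SO_ATTACH_FILTER)");
        return 1;
    }

    /* ... recvfrom(fd, ...) would now only see filtered packets ... */
    return 0;
}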

The BPF virtual machine and its bytecode were proposed by Steven McCanne and Van Jacobson in late 1992, in their paper "The BSD Packet Filter: A New Architecture for User-level Packet Capture"; the technology was first presented at the Usenix conference in the winter of 1993.

Because BPF is a virtual machine, it defines the environment in which programs run. Besides the bytecode, it also defines a packet memory model (load instructions are implicitly applied to the packet), registers (A and X: the accumulator and the index register), scratch memory and an implicit program counter. Interestingly, the BPF bytecode was modeled after the Motorola 6502 ISA. As Steven McCanne recalled in his Sharkfest '11 keynote, he knew 6502 assembly from programming the Apple II in high school, and that knowledge influenced his design of the BPF bytecode.

BPF support appeared in the Linux kernel in version v2.5, added mainly by Jay Schulist. The BPF code remained largely unchanged until 2011, when Eric Dumazet reworked the BPF interpreter to run in JIT mode (source: A JIT for packet filters). After that, instead of interpreting BPF bytecode, the kernel could translate BPF programs directly into the target architecture: x86, ARM, MIPS, etc.

Later, in 2014, Alexei Starovoitov proposed a new JIT mechanism for BPF. In fact, this new JIT became a new architecture based on BPF and was called eBPF. I believe both VMs coexisted for some time, but packet filtering is currently implemented on top of eBPF. In much of the modern documentation and many examples, BPF now refers to eBPF, while classic BPF is known today as cBPF.

eBPF extends the classic BPF virtual machine in several ways:

  • It relies on modern 64-bit architectures. eBPF uses 64-bit registers and increases the number of available registers from 2 (accumulator and X) to 10. eBPF also adds new opcodes (BPF_MOV, BPF_JNE, BPF_CALL…).
  • It is decoupled from the networking subsystem. BPF was tied to the packet data model; since it was used for packet filtering, its code lived in the subsystem that handles networking. The eBPF virtual machine, however, is no longer bound to a particular data model and can be used for any purpose, so an eBPF program can now be attached to a tracepoint or to a kprobe. This opens the door to eBPF instrumentation, performance analysis and many other use cases in the context of other kernel subsystems. The eBPF code now lives in its own path: kernel/bpf.
  • Global data stores called maps. Maps are key-value stores that allow data exchange between user space and kernel space. eBPF provides several map types.
  • Helper functions, in particular for rewriting a packet, computing a checksum or cloning a packet. These functions run inside the kernel and are not part of user-space programs. In addition, system calls are available for interacting with eBPF programs.
  • Tail calls. An eBPF program is limited to 4096 instructions. The tail-call feature allows one eBPF program to hand control over to another eBPF program, thereby working around this limitation (up to 32 programs can be chained this way); a minimal sketch follows this list.
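
To illustrate the last point, here is a minimal sketch of a tail call, written in the same style as the kernel samples discussed later (struct bpf_map_def and the SEC macro from bpf_helpers.h, bpf_tail_call from the helper set). The probe point and all names are placeholders; the program array would be populated from user space with the file descriptors of the programs to jump to.

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

/* A program array: its values are file descriptors of other eBPF
 * programs, filled in from user space with bpf_map_update_elem(). */
struct bpf_map_def SEC("maps") jmp_table = {
    .type = BPF_MAP_TYPE_PROG_ARRAY,
    .key_size = sizeof(u32),
    .value_size = sizeof(u32),
    .max_entries = 8,
};

SEC("kprobe/sys_write")
int dispatcher(struct pt_regs *ctx)
{
    /* Jump to the program stored at index 0 of jmp_table. On success
     * control never returns here; if the slot is empty, execution
     * simply falls through to the return below. */
    bpf_tail_call(ctx, &jmp_table, 0);
    return 0;
}

char _license[] SEC("license") = "GPL";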

eBPF example

There are several eBPF examples in the Linux kernel sources; they are available under samples/bpf/. To compile these examples, just type:

$ sudo make samples/bpf/

I will not write a new eBPF example myself; instead, I will use one of the samples available in samples/bpf/. I will look at some parts of the code and explain how it works. As an example, I chose the tracex4 program.

In general, each of the examples in samples/bpf/ consists of two files. In this case:

  • tracex4_kern.c contains the source code to be executed in the kernel as eBPF bytecode.
  • tracex4_user.c contains the user-space program.

In this case, we need to compile tracex4_kern.c into eBPF bytecode. At the moment, gcc has no backend for eBPF; fortunately, clang can produce eBPF bytecode. The Makefile uses clang to compile tracex4_kern.c into an object file.
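
In a simplified form (the real Makefile adds kernel include paths, defines and a few extra flags), that compilation step looks roughly like this:

$ clang -O2 -target bpf -c tracex4_kern.c -o tracex4_kern.o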

I mentioned above that one of the most interesting features of eBPF is maps. tracex4_kern defines one map:

struct pair {
    u64 val;
    u64 ip;
};  

struct bpf_map_def SEC("maps") my_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(long),
    .value_size = sizeof(struct pair),
    .max_entries = 1000000,
};

BPF_MAP_TYPE_HASH is one of the many map types offered by eBPF. In this case, it is just a hash map. You may also have noticed the SEC("maps") declaration. SEC is a macro used to create a new section in the binary file. Two more sections are actually defined in the tracex4_kern example:

SEC("kprobe/kmem_cache_free")
int bpf_prog1(struct pt_regs *ctx)
{   
    long ptr = PT_REGS_PARM2(ctx);

    bpf_map_delete_elem(&my_map, &ptr); 
    return 0;
}
    
SEC("kretprobe/kmem_cache_alloc_node") 
int bpf_prog2(struct pt_regs *ctx)
{
    long ptr = PT_REGS_RC(ctx);
    long ip = 0;

    // get the IP address of the caller of kmem_cache_alloc_node()
    BPF_KRETPROBE_READ_RET_IP(ip, ctx);

    struct pair v = {
        .val = bpf_ktime_get_ns(),
        .ip = ip,
    };
    
    bpf_map_update_elem(&my_map, &ptr, &v, BPF_ANY);
    return 0;
}   

These two functions let us remove an entry from the map (kprobe/kmem_cache_free) and add a new entry to it (kretprobe/kmem_cache_alloc_node). All function names written in capital letters correspond to macros defined in bpf_helpers.h.
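
For orientation, this is roughly what some of those macros look like for x86_64 (simplified; the exact definitions depend on the kernel version and the target architecture):

/* Place the annotated object into a named ELF section,
 * which the loader later looks up by name. */
#define SEC(NAME) __attribute__((section(NAME), used))

/* Architecture-specific accessors for function arguments and
 * return values, shown here for x86_64 (struct pt_regs fields). */
#define PT_REGS_PARM2(x) ((x)->si)
#define PT_REGS_RC(x)    ((x)->ax)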

If I dump the sections of the object file, I should see that these new sections are already defined:

$ objdump -h tracex4_kern.o

tracex4_kern.o: file format elf64-little

Sections:
Idx Name                              Size      VMA               LMA               File off  Algn
  0 .text                             00000000  0000000000000000  0000000000000000  00000040  2**2
                                      CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 kprobe/kmem_cache_free            00000048  0000000000000000  0000000000000000  00000040  2**3
                                      CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  2 kretprobe/kmem_cache_alloc_node   000000c0  0000000000000000  0000000000000000  00000088  2**3
                                      CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  3 maps                              0000001c  0000000000000000  0000000000000000  00000148  2**2
                                      CONTENTS, ALLOC, LOAD, DATA
  4 license                           00000004  0000000000000000  0000000000000000  00000164  2**0
                                      CONTENTS, ALLOC, LOAD, DATA
  5 version                           00000004  0000000000000000  0000000000000000  00000168  2**2
                                      CONTENTS, ALLOC, LOAD, DATA
  6 .eh_frame                         00000050  0000000000000000  0000000000000000  00000170  2**3
                                      CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA

There is also tracex4_user.c, the main program. Basically, this program listens for kmem_cache_alloc_node events. When such an event occurs, the corresponding eBPF code is executed. The code saves the object's IP attribute into a map, and the main program then prints these objects in a loop. Example:

$ sudo ./tracex4
obj 0xffff8d6430f60a00 is 2sec old was allocated at ip ffffffff9891ad90
obj 0xffff8d6062ca5e00 is 23sec old was allocated at ip ffffffff98090e8f
obj 0xffff8d5f80161780 is 6sec old was allocated at ip ffffffff98090e8f

How are the user-space program and the eBPF program related? At initialization, tracex4_user.c loads the tracex4_kern.o object file using the load_bpf_file function.

int main(int ac, char **argv)
{
    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
    char filename[256];
    int i;

    snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);

    if (setrlimit(RLIMIT_MEMLOCK, &r)) {
        perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
        return 1;
    }

    if (load_bpf_file(filename)) {
        printf("%s", bpf_log_buf);
        return 1;
    }

    for (i = 0; ; i++) {
        print_old_objects(map_fd[1]);
        sleep(1);
    }

    return 0;
}

When load_bpf_file is executed, the probes defined in the eBPF file are added to /sys/kernel/debug/tracing/kprobe_events. Now we are listening for these events, and our program can do something when they happen.

$ sudo cat /sys/kernel/debug/tracing/kprobe_events
p:kprobes/kmem_cache_free kmem_cache_free
r:kprobes/kmem_cache_alloc_node kmem_cache_alloc_node
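
The per-object lines printed by tracex4 come from print_old_objects, which walks my_map from user space. The actual sample code differs, but a loop over an eBPF hash map could look roughly like the sketch below; the dump_map name is made up, and the bpf_map_get_next_key / bpf_map_lookup_elem wrappers are the ones provided by libbpf (older samples used their own thin wrappers around the bpf() syscall).

#include <stdio.h>
#include <bpf/bpf.h>   /* bpf_map_get_next_key(), bpf_map_lookup_elem() */

/* Mirrors the struct pair defined on the kernel side. */
struct pair {
    unsigned long long val;
    unsigned long long ip;
};

/* Walk every entry of the hash map referred to by fd and print it. */
static void dump_map(int fd)
{
    long key = 0, next_key;
    struct pair v;

    /* Asking for the successor of a key that is not in the map makes
     * the kernel return the first key, so starting from 0 covers the
     * whole map. */
    while (bpf_map_get_next_key(fd, &key, &next_key) == 0) {
        if (bpf_map_lookup_elem(fd, &next_key, &v) == 0)
            printf("obj 0x%lx allocated at ip %llx (ts %llu ns)\n",
                   (unsigned long)next_key, v.ip, v.val);
        key = next_key;
    }
}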

All other programs in samples/bpf/ are structured similarly. They always contain two files:

  • XXX_kern.c: eBPF program.
  • XXX_user.c: main program.

The eBPF program defines maps and the functions attached to a section. When the kernel emits an event of a certain type (for example, a tracepoint), the attached functions are executed. Maps provide communication between the kernel program and the user-space program.

Conclusion

In this article, BPF and eBPF were discussed in general terms. I know that there are a lot of resources and a great deal of information about eBPF today, so I will recommend a few more materials for further study.

Recommended reading:
