BPF for the little ones, part zero: classic BPF

Berkeley Packet Filter (BPF) is a Linux kernel technology that has been on the front pages of English-language technical publications for several years now. Conferences are full of talks about using and developing BPF. David Miller, the Linux networking subsystem maintainer, calls his Linux Plumbers 2018 talk "This talk is not about XDP" (XDP is one use case for BPF). Brendan Gregg gives talks called Linux BPF Superpowers. Toke Høiland-Jørgensen jokes that the kernel is now a microkernel. Thomas Graf promotes the idea that BPF is JavaScript for the kernel.

There is still no systematic description of BPF on Habr, so in this series of articles I will try to cover the history of the technology, describe its architecture and development tools, and outline its areas of application and practical use. This zeroth article of the series covers the history and architecture of classic BPF, and also reveals how tcpdump, seccomp, strace, and much more work under the hood.

The development of BPF is steered by the Linux networking community, and the main existing applications of BPF are network-related, so, with the permission of @eucariot, I called the series "BPF for the little ones", in honor of the great series "Networks for the little ones".

A brief history of classic BPF

Modern BPF technology is an improved and expanded version of an older technology with the same name, now called classic BPF to avoid confusion. Classic BPF is the basis of the well-known tcpdump utility, the seccomp mechanism, and the lesser-known xt_bpf module for iptables and the cls_bpf classifier. In modern Linux, classic BPF programs are automatically translated into the new form; however, from the user's point of view the API has stayed in place, and new uses of classic BPF, as we will see in this article, still turn up. For this reason, and also because following the history of classic BPF's development in Linux makes it clearer how and why it evolved into its modern form, I decided to start with an article about classic BPF.

In the late eighties of the last century, engineers at the famous Lawrence Berkeley Laboratory became interested in how to filter network packets properly on hardware that was modern by late-eighties standards. The basic idea of filtering, originally implemented in CSPF (CMU/Stanford Packet Filter), was to filter out unneeded packets as early as possible, i.e., in kernel space, since this avoids copying unnecessary data into user space. To provide runtime safety for user code running in kernel space, a sandboxed virtual machine was used.

However, the virtual machines of existing filters were designed to run on stack-based machines and did not run efficiently on the newer RISC machines. As a result, the engineers at Berkeley Labs developed the new BPF (Berkeley Packet Filter) technology, whose virtual machine architecture was designed after the MOS Technology 6502 processor, the workhorse of such well-known products as the Apple II and the NES. The new virtual machine increased filter performance dozens of times over compared to existing solutions.

BPF Machine Architecture

We will get to know the architecture by working through examples. But first, let's say that the machine had two 32-bit registers available to the user, the accumulator A and the index register X, 64 bytes of memory (16 words) available for writing and subsequent reading, and a small set of instructions for working with these objects. Jump instructions were also available for implementing conditional expressions, but to guarantee that a program finishes promptly, jumps could only go forward; in particular, creating loops was forbidden.

The general scheme for running the machine is as follows. The user creates a program for the BPF architecture and, using some kernel mechanism (for example, a system call), loads it and connects it to some event generator in the kernel (for example, an event is the arrival of the next packet on a network card). When the event occurs, the kernel runs the program (for example, in an interpreter), with the machine's memory mapped to some region of kernel memory (for example, the data of the incoming packet).

That will be enough for us to start analyzing examples: we will get acquainted with the instruction set and format as needed. If you want to study the virtual machine's instruction set right away and learn about all its capabilities, you can read the original article The BSD Packet Filter and/or the first half of the file Documentation/networking/filter.txt in the kernel documentation. You can also look at the presentation libpcap: An Architecture and Optimization Methodology for Packet Capture, in which McCanne, one of the authors of BPF, talks about the history of libpcap's creation.

Let us now consider all the significant examples of classic BPF usage in Linux: tcpdump (libpcap), seccomp, xt_bpf, and cls_bpf.

tcpdump

BPF was developed in parallel with a front end for packet filtering, the well-known tcpdump utility. And since this is the oldest and most famous example of classic BPF in use, available on many operating systems, we will begin our study of the technology with it.

(I ran all the examples in this article on Linux 5.6.0-rc6. The output of some commands has been edited for readability.)

Example: Watching IPv6 Packets

Imagine we want to look at all IPv6 packets on the interface eth0. To do this, we can run tcpdump with the simple filter ip6:

$ sudo tcpdump -i eth0 ip6

In this case, tcpdump compiles the filter ip6 into BPF bytecode and sends it to the kernel (see details in the section tcpdump: loading the filter). The loaded filter will be run for every packet passing through the interface eth0. If the filter returns a nonzero value n, then up to n bytes of the packet are copied to user space and we see them in tcpdump's output.


It turns out that we can easily find out exactly which bytecode tcpdump sends to the kernel with the help of tcpdump itself, by running it with the -d option:

$ sudo tcpdump -i eth0 -d ip6
(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 3
(002) ret      #262144
(003) ret      #0

On line 0 we run the instruction ldh [12], which means "load into register A the half word (16 bits) located at address 12". The only question is, what memory are we addressing? The answer: address x is where the (x+1)-th byte of the analyzed network packet begins. We read packets from the Ethernet interface eth0, which means the packet looks like this (for simplicity, we assume there are no VLAN tags in the packet):

       6              6          2
|Destination MAC|Source MAC|Ether Type|...|

So after executing the instruction ldh [12], register A contains the Ether Type field, the type of packet carried by this Ethernet frame. On line 1 we compare the contents of register A (the packet type) with 0x86dd, which is exactly the IPv6 type we are interested in. On line 1, besides the comparison instruction, there are two more columns, jt 2 and jf 3: the labels to jump to on a successful comparison (A == 0x86dd) and on a failed one. So in the successful case (IPv6) we go to line 2, and in the unsuccessful case to line 3. On line 3 the program exits with code 0 (do not copy the packet); on line 2 the program exits with code 262144 (copy at most 256 kilobytes of the packet).

A more complicated example: watching TCP packets by destination port

Let's see what a filter looks like that copies all TCP packets with destination port 666. We will consider the IPv4 case, since the IPv6 case is simpler. After studying this example, you can examine the IPv6 filter (ip6 and tcp dst port 666) and the filter for the general case (tcp dst port 666) yourself as an exercise. So, the filter we are interested in looks like this:

$ sudo tcpdump -i eth0 -d ip and tcp dst port 666
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 10
(002) ldb      [23]
(003) jeq      #0x6             jt 4    jf 10
(004) ldh      [20]
(005) jset     #0x1fff          jt 10   jf 6
(006) ldxb     4*([14]&0xf)
(007) ldh      [x + 16]
(008) jeq      #0x29a           jt 9    jf 10
(009) ret      #262144
(010) ret      #0

We already know what lines 0 and 1 do. On line 2, having already verified that this is an IPv4 packet (Ether Type = 0x800), we load the 24th byte of the packet into register A. Our packet now looks like this:

       14            8      1     1
|ethernet header|ip fields|ttl|protocol|...|

which means we load into the register A the Protocol field of the IP header, which is logical, because we only want to copy TCP packets. We compare Protocol with 0x6 (IPPROTO_TCP) on line 3.

On lines 4 and 5 we load the half word at address 20 and use the jset instruction to check the Flags and Fragment Offset field of the IPv4 header. The top three bits of the mask passed to jset are zeroed, so the flag bits themselves are ignored and we are testing the 13-bit fragment offset: if it is nonzero, the packet is a non-first fragment of a fragmented IP packet, which contains no TCP header, so we drop it (jump to line 10).

Line 6 is the most interesting in this listing. The expression ldxb 4*([14]&0xf) means: load into register X the four least significant bits of the fifteenth byte of the packet, multiplied by 4. The four least significant bits of the fifteenth byte are the Internet Header Length field of the IPv4 header, which stores the header length in 32-bit words, hence the multiplication by 4. Interestingly, the expression 4*([14]&0xf) denotes a special addressing scheme that can be used only in this form and only for register X, i.e., we can say neither ldb 4*([14]&0xf) nor ldxb 5*([14]&0xf) (we can only specify a different offset, for example, ldxb 4*([16]&0xf)). Clearly, this addressing scheme was added to BPF precisely to load the IPv4 header length into X, the index register.

So, on line 7 we load the half word at address X+16. Remembering that the Ethernet header occupies 14 bytes and X contains the length of the IPv4 header, we see that the TCP destination port is loaded into A:

       14           X           2             2
|ethernet header|ip header|source port|destination port|

Finally, on line 8 we compare the destination port with the value we are looking for, and on line 9 or 10 we return the result: whether to copy the packet or not.

tcpdump: loading the filter

In the previous examples we deliberately did not dwell on exactly how the BPF bytecode is loaded into the kernel for packet filtering. Generally speaking, tcpdump has been ported to many systems, and to work with filters it uses the libpcap library. In short, to put a filter on an interface with libpcap, you need to do the following: create a capture handle (for example, with pcap_open_live), compile the filter text into bytecode with pcap_compile, and attach the result with pcap_setfilter.
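
For illustration, here is a minimal sketch of this flow in C. It is not tcpdump's actual code, just the standard libpcap calls with error handling trimmed to the essentials; the interface eth0 and the filter ip6 are taken from our first example:

#include <pcap/pcap.h>
#include <stdio.h>

int main(void)
{
        char errbuf[PCAP_ERRBUF_SIZE];
        struct bpf_program prog;

        /* 1. Create a capture handle on eth0 */
        pcap_t *p = pcap_open_live("eth0", 65535, 0, 1000, errbuf);
        if (!p) {
                fprintf(stderr, "pcap_open_live: %s\n", errbuf);
                return 1;
        }

        /* 2. Compile the textual filter into classic BPF bytecode */
        if (pcap_compile(p, &prog, "ip6", 1, PCAP_NETMASK_UNKNOWN) == -1) {
                fprintf(stderr, "pcap_compile: %s\n", pcap_geterr(p));
                return 1;
        }

        /* 3. Attach the bytecode; on Linux this ends up in setsockopt(SO_ATTACH_FILTER) */
        if (pcap_setfilter(p, &prog) == -1) {
                fprintf(stderr, "pcap_setfilter: %s\n", pcap_geterr(p));
                return 1;
        }

        pcap_freecode(&prog);
        /* ... a capture loop such as pcap_loop() would go here ... */
        pcap_close(p);
        return 0;
}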

To see how the pcap_setfilter function is implemented on Linux, we use strace (some lines have been removed):

$ sudo strace -f -e trace=%network tcpdump -p -i eth0 ip
socket(AF_PACKET, SOCK_RAW, 768)        = 3
bind(3, {sa_family=AF_PACKET, sll_protocol=htons(ETH_P_ALL), sll_ifindex=if_nametoindex("eth0"), sll_hatype=ARPHRD_NETROM, sll_pkttype=PACKET_HOST, sll_halen=0}, 20) = 0
setsockopt(3, SOL_SOCKET, SO_ATTACH_FILTER, {len=4, filter=0xb00bb00bb00b}, 16) = 0
...

On the first two lines of output we create a raw socket to read all Ethernet frames and bind it to the interface eth0. From our first example we know that the ip filter consists of four BPF instructions, and on the third line we see how, using the SO_ATTACH_FILTER option of the setsockopt system call, we load and attach a filter of length 4. This is our filter.

It is worth noting that in classic BPF, loading a filter and attaching it always happen as a single atomic operation, while in the new version of BPF, loading the program and binding it to an event generator are separated in time.

Hidden truth

A slightly more complete version of the output looks like this:

$ sudo strace -f -e trace=%network tcpdump -p -i eth0 ip
socket(AF_PACKET, SOCK_RAW, 768)        = 3
bind(3, {sa_family=AF_PACKET, sll_protocol=htons(ETH_P_ALL), sll_ifindex=if_nametoindex("eth0"), sll_hatype=ARPHRD_NETROM, sll_pkttype=PACKET_HOST, sll_halen=0}, 20) = 0
setsockopt(3, SOL_SOCKET, SO_ATTACH_FILTER, {len=1, filter=0xbeefbeefbeef}, 16) = 0
recvfrom(3, 0x7ffcad394257, 1, MSG_TRUNC, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
setsockopt(3, SOL_SOCKET, SO_ATTACH_FILTER, {len=4, filter=0xb00bb00bb00b}, 16) = 0
...

As mentioned above, we load and attach our filter to the socket on line 5, but what happens on lines 3 and 4? It turns out that libpcap is taking care of us: so that our filter's output does not include packets that do not satisfy it, the library attaches the dummy filter ret #0 (drop all packets), switches the socket into non-blocking mode, and drains any packets that may have been queued before our filter was attached.

In total, to filter packets on Linux using classic BPF, you need a filter in the form of a struct sock_fprog structure and an open socket, after which the filter can be attached to the socket with the setsockopt system call.

Interestingly, the filter can be attached to any socket, not just a raw one. Here is an example program that strips all but the first two bytes from all incoming UDP datagrams. (I added comments in the code so as not to clutter the article.)
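
The linked program is not reproduced here, but a minimal sketch of the same idea could look as follows. It assumes that for UDP sockets the filter sees the datagram starting at the UDP header, so ret #10 keeps the 8-byte header plus two bytes of payload; the port number 7777 is arbitrary:

#include <arpa/inet.h>
#include <linux/filter.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        /* a single instruction, ret #10: accept the datagram, but keep only
         * the 8-byte UDP header plus the first 2 bytes of payload (assumption:
         * for UDP sockets the filter data starts at the UDP header) */
        struct sock_filter code[] = {
                { 0x06, 0, 0, 10 },
        };
        struct sock_fprog prog = {
                .len = 1,
                .filter = code,
        };
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port = htons(7777),
                .sin_addr = { htonl(INADDR_LOOPBACK) },
        };
        char buf[512];

        int sk = socket(AF_INET, SOCK_DGRAM, 0);
        bind(sk, (struct sockaddr *)&addr, sizeof(addr));
        setsockopt(sk, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog));

        /* however long the incoming datagram was, we receive at most 2 bytes */
        ssize_t n = recvfrom(sk, buf, sizeof(buf), 0, NULL, NULL);
        printf("received %zd bytes\n", n);

        close(sk);
        return 0;
}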

For more about using setsockopt to attach filters, see socket(7); we will talk about writing our own filters of the struct sock_fprog kind without tcpdump's help in the section Programming BPF by hand.

Classic BPF and the 21st century

BPF was added to Linux in 1997 and for a long time remained the workhorse of libpcap without much change (Linux-specific changes did happen, of course, but they did not alter the global picture). The first serious signs that BPF would evolve came in 2011, when Eric Dumazet proposed a patch adding a Just In Time compiler to the kernel: a translator of BPF bytecode into native x86_64 code.

The JIT compiler was the first in a chain of changes: in 2012 the ability to write filters for seccomp using BPF appeared, in January 2013 the xt_bpf module was added, allowing iptables rules to be written with BPF, and in October 2013 the cls_bpf module was added, allowing traffic classifiers to be written with BPF.

We will look at all these examples in more detail shortly, but first it will be useful to learn how to write and compile arbitrary BPF programs, since the capabilities provided by the libpcap library are limited (a simple example: a filter generated by libpcap can return only two values, 0 or 0x40000) or, as in the case of seccomp, not applicable at all.

Programming BPF by hand

Let's get acquainted with the binary format of BPF instructions; it is very simple:

   16    8    8     32
| code | jt | jf |  k  |

Each instruction occupies 64 bits: the first 16 bits are the instruction code, then come two eight-bit jump offsets, jt and jf, and 32 bits for the argument K, whose purpose varies from instruction to instruction. For example, the instruction ret, which terminates the program, has code 6, and the return value is taken from the constant K. In C, a single BPF instruction is represented by the structure

struct sock_filter {
        __u16   code;
        __u8    jt;
        __u8    jf;
        __u32   k;
};

and the whole program - in the form of a structure

struct sock_fprog {
        unsigned short len;
        struct sock_filter *filter;
};

Thus, we can already write programs (knowing the instruction codes, for example, from [1]). This is what the ip6 filter from our first example looks like:

struct sock_filter code[] = {
        { 0x28, 0, 0, 0x0000000c },
        { 0x15, 0, 1, 0x000086dd },
        { 0x06, 0, 0, 0x00040000 },
        { 0x06, 0, 0, 0x00000000 },
};
struct sock_fprog prog = {
        .len = ARRAY_SIZE(code), /* e.g. #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) */
        .filter = code,
};

The prog program can now legitimately be used in the call

setsockopt(sk, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog))

Writing programs as raw machine code is not very convenient, but sometimes it is necessary (for example, for debugging, creating unit tests, writing articles on Habr, etc.). For convenience, helper macros are defined in the file <linux/filter.h>; the same example as above could be rewritten as

struct sock_filter code[] = {
        BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 12),
        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_IPV6, 0, 1), /* ETH_P_IPV6 (0x86DD) is in <linux/if_ether.h> */
        BPF_STMT(BPF_RET|BPF_K, 0x00040000),
        BPF_STMT(BPF_RET|BPF_K, 0),
};

However, this option is not all that convenient either. The Linux kernel programmers reasoned the same way, and so in the kernel's tools/bpf directory you can find an assembler and a debugger for working with classic BPF.

The assembly language is very similar to tcpdump's debug output, but in addition we can specify symbolic labels. For example, here is a program that drops all packets except TCP/IPv4:

$ cat /tmp/tcp-over-ipv4.bpf
ldh [12]
jne #0x800, drop
ldb [23]
jneq #6, drop
ret #-1
drop: ret #0

By default, the assembler generates code in the format <number of instructions>,<code1> <jt1> <jf1> <k1>,...; for our TCP example it produces

$ tools/bpf/bpf_asm /tmp/tcp-over-ipv4.bpf
6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 6,6 0 0 4294967295,6 0 0 0,

For the convenience of C programmers, another output format can be used:

$ tools/bpf/bpf_asm -c /tmp/tcp-over-ipv4.bpf
{ 0x28,  0,  0, 0x0000000c },
{ 0x15,  0,  3, 0x00000800 },
{ 0x30,  0,  0, 0x00000017 },
{ 0x15,  0,  1, 0x00000006 },
{ 0x06,  0,  0, 0xffffffff },
{ 0x06,  0,  0, 0000000000 },

This text can be pasted into a struct sock_filter array definition, as we did at the beginning of this section.

Linux extensions and netsniff-ng

In addition to the standard BPF instructions, Linux and tools/bpf/bpf_asm also support a non-standard set. Mostly these instructions are used to access fields of the struct sk_buff structure, which describes a network packet inside the kernel. However, there are also other kinds of helper instructions; for example, ld cpu loads into register A the result of running the kernel function raw_smp_processor_id(). (In the new version of BPF, these non-standard extensions grew into a set of kernel helpers that give programs access to memory and structures and the ability to generate events.) Here is an interesting example of a filter in which we copy only packet headers to user space, using the poff extension, the payload offset:

ld poff
ret a
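
In raw form such an extension is encoded as a word load from a special negative offset, SKF_AD_OFF plus the extension's index, which the kernel recognizes as an ancillary load rather than a packet access. A sketch of the same two-instruction filter as a C array, using the constants from <linux/filter.h>:

/* ld poff: A = payload offset (the ancillary load SKF_AD_PAY_OFFSET) */
/* ret a:   accept the packet, copying only A bytes, i.e. only the headers */
struct sock_filter code[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_PAY_OFFSET),
        BPF_STMT(BPF_RET | BPF_A, 0),
};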

BPF extensions cannot be used in tcpdump, but this is a good excuse to get acquainted with the netsniff-ng package. It contains, among other things, the advanced netsniff-ng tool which, besides filtering with BPF, also includes an efficient traffic generator, and bpfc, a BPF assembler more advanced than tools/bpf/bpf_asm. The package includes fairly detailed documentation; see also the links at the end of this article.

seccomp

So, we already know how to write BPF programs of arbitrary complexity and are ready to look at new examples, the first of which is the seccomp technology, which allows BPF filters to be used to control the set of available system calls, and their arguments, for a given process and its descendants.

The first version of seccomp was added to the kernel in 2005 and was not very popular, since it provided just one option: limiting the set of system calls available to a process to the following: read, write, _exit and sigreturn, while a process violating the rules was killed with SIGKILL. In 2012, however, seccomp gained the ability to use BPF filters, letting you define the set of allowed system calls and even perform checks on their arguments. (Interestingly, Chrome was among the first users of this functionality, and the people from Chrome are currently developing the KRSI mechanism, based on the new version of BPF, which allows customizing Linux Security Modules.) Links to additional documentation can be found at the end of the article.

Note that there have already been articles on Habr about using seccomp; perhaps someone will want to read them before (or instead of) the following subsections. The article Containers and security: seccomp gives examples of using seccomp, both the 2007 version and the BPF version (with filters generated via libseccomp), talks about seccomp's connection with Docker, and provides many useful links. The article Isolate daemons with systemd or "you don't need Docker for this!" covers, in particular, how to blacklist or whitelist system calls for systemd daemons.

Next, we will see how to write and load filters for seccomp in plain C and with the help of the libseccomp library, what the pros and cons of each option are, and finally how seccomp is used by the strace program.

Writing and loading filters for seccomp

We already know how to write BPF programs, so let's look at the seccomp API first. You can set a filter at the process level, and all child processes will inherit the restrictions. This is done with the seccomp(2) system call:

seccomp(SECCOMP_SET_MODE_FILTER, flags, &filter)

where &filter is a pointer to the already familiar struct sock_fprog structure, i.e., a BPF program.

How do seccomp programs differ from socket programs? By the context that is passed in. In the case of sockets we were given a memory region containing the packet, while in the case of seccomp we are given a structure of the form

struct seccomp_data {
    int   nr;
    __u32 arch;
    __u64 instruction_pointer;
    __u64 args[6];
};

Here nr is the number of the system call being invoked, arch is the current architecture (more on that below), args holds up to six system call arguments, and instruction_pointer is a pointer to the user-space instruction that made the system call. So, for example, to load the system call number into register A, we would say

ld [0]

Seccomp programs have other peculiarities as well; for example, the context can only be accessed with 32-bit alignment, and you cannot load a half word or a byte: trying to load the filter ldh [0] will make the seccomp system call return EINVAL. Loaded filters are checked by the kernel function seccomp_check_filter(). (Amusingly, the original commit adding this functionality forgot to include in this function permission to use the instruction mod (remainder of division), so it is now unavailable to seccomp BPF programs; adding it would break the ABI.)

At this point we know essentially everything needed to write and read seccomp programs. Usually the program logic is arranged as a whitelist or blacklist of system calls; for example, the program

ld [0]
jeq #304, bad
jeq #176, bad
jeq #239, bad
jeq #279, bad
good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */
bad: ret #0

checks a blacklist of four system calls numbered 304, 176, 239, 279. Which system calls are these? We cannot say for sure, since we do not know which architecture the program was written for. Therefore the seccomp authors suggest starting every program with an architecture check (the current architecture is given in the context as the arch field of struct seccomp_data). With the architecture checked, the beginning of the example would look like this:

ld [4]
jne #0xc000003e, bad_arch ; SCMP_ARCH_X86_64

and then our system call numbers would acquire definite meanings.
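
Putting the pieces together, here is a minimal sketch of loading such a filter in plain C. It is only an illustration: x86_64 is assumed, SECCOMP_RET_KILL_PROCESS requires kernel 4.14 or newer, and chroot(2) was picked arbitrarily as the call to forbid:

#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        struct sock_filter code[] = {
                /* ld [4]: A = seccomp_data.arch */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
                /* jne #0xc000003e, bad */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 3),
                /* ld [0]: A = seccomp_data.nr */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
                /* jeq #__NR_chroot, bad */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_chroot, 1, 0),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
                /* bad: */
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        };
        struct sock_fprog prog = {
                .len = sizeof(code) / sizeof(code[0]),
                .filter = code,
        };

        /* without this, loading a filter requires CAP_SYS_ADMIN */
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

        if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog) != 0) {
                perror("seccomp");
                return 1;
        }

        printf("filter loaded, calling chroot...\n");
        chroot("/");        /* the filter kills the process right here */
        printf("never printed\n");
        return 0;
}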

Writing and loading filters for seccomp with libseccomp

Writing filters in raw codes or in BPF assembler gives you full control over the result, but sometimes portable and/or readable code is preferable. The libseccomp library helps here, providing a standard interface for writing blacklist or whitelist filters.

Let's, for example, write a program that runs a binary of the user's choice after installing a blacklist of system calls from the article mentioned above (the program has been simplified for readability; the full version can be found here):

#include <seccomp.h>
#include <unistd.h>
#include <err.h>

static int sys_numbers[] = {
        __NR_mount,
        __NR_umount2,
        // ... 40 more system calls ...
        __NR_vmsplice,
        __NR_perf_event_open,
};

int main(int argc, char **argv)
{
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);

        for (size_t i = 0; i < sizeof(sys_numbers)/sizeof(sys_numbers[0]); i++)
                seccomp_rule_add(ctx, SCMP_ACT_TRAP, sys_numbers[i], 0);

        seccomp_load(ctx);

        execvp(argv[1], &argv[1]);
        err(1, "execvp: %s", argv[1]);
}

First we define the sys_numbers array of 40+ system call numbers to block. Then we initialize the context ctx and tell the library that by default we want to allow (SCMP_ACT_ALLOW) all system calls (it is easier to build blacklists this way). Then we add all the blacklisted system calls one by one. As the response to a system call from the list we request SCMP_ACT_TRAP; in this case seccomp will send the process a SIGSYS signal with a description of which particular system call violated the rules. Finally, we load the program into the kernel with seccomp_load, which compiles it and attaches it to the process via the seccomp(2) system call.
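
libseccomp can also express the argument checks mentioned earlier. As a hedged illustration (this rule is not part of the program above): trapping mount(2) only when its flags argument contains MS_BIND might look like this, using the library's SCMP_A3 and SCMP_CMP_MASKED_EQ helpers:

/* trap mount(2) only when its flags argument (argument index 3)
 * includes MS_BIND; MS_BIND comes from <sys/mount.h> */
seccomp_rule_add(ctx, SCMP_ACT_TRAP, SCMP_SYS(mount), 1,
                 SCMP_A3(SCMP_CMP_MASKED_EQ, MS_BIND, MS_BIND));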

To compile successfully, the program must be linked against the libseccomp library, for example:

cc -std=c17 -Wall -Wextra -c -o seccomp_lib.o seccomp_lib.c
cc -o seccomp_lib seccomp_lib.o -lseccomp

Successful launch example:

$ ./seccomp_lib echo ok
ok

An example of a blocked system call:

$ sudo ./seccomp_lib mount -t bpf bpf /tmp
Bad system call

Let's use strace for the details:

$ sudo strace -e seccomp ./seccomp_lib mount -t bpf bpf /tmp
seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=50, filter=0x55d8e78428e0}) = 0
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_call_addr=0xboobdeadbeef, si_syscall=__NR_mount, si_arch=AUDIT_ARCH_X86_64} ---
+++ killed by SIGSYS (core dumped) +++
Bad system call

from which we can see that the program was terminated for using the forbidden system call mount(2).

So, we wrote a filter using the libseccomp library, fitting non-trivial logic into four lines of code. In the example above, with a large number of system calls on the list, performance may degrade noticeably, since each check is just a linear chain of comparisons. To address this, libseccomp recently merged a patch adding support for the filter attribute SCMP_FLTATR_CTL_OPTIMIZE. Setting this attribute to 2 makes the filter be converted into a binary search program.
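
In our program above this would be one extra call before seccomp_load; a sketch, assuming libseccomp 2.5+ where the attribute is available:

/* convert the linear list of rules into a binary search program
 * (requires libseccomp >= 2.5) */
seccomp_attr_set(ctx, SCMP_FLTATR_CTL_OPTIMIZE, 2);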

If you want to see how binary search filters work, take a look at a simple script that generates such programs in BPF assembler from a set of system call numbers, for example:

$ echo 1 3 6 8 13 | ./generate_bin_search_bpf.py
ld [0]
jeq #6, bad
jgt #6, check8
jeq #1, bad
jeq #3, bad
ret #0x7fff0000
check8:
jeq #8, bad
jeq #13, bad
ret #0x7fff0000
bad: ret #0

We can hardly write anything faster, since BPF programs cannot do indirect jumps (we cannot say, for example, jmp A or jmp [label+X]), so all jumps are static.

seccomp and strace

Everyone knows the strace utility, an indispensable tool for studying process behavior on Linux. However, many have also heard of the performance problems that come with it. The point is that strace is implemented using ptrace(2), and that mechanism gives no way to specify the set of system calls on which the traced process should stop, so, for example, the commands

$ time strace du /usr/share/ >/dev/null 2>&1

real    0m3.081s
user    0m0.531s
sys     0m2.073s

and

$ time strace -e open du /usr/share/ >/dev/null 2>&1

real    0m2.404s
user    0m0.193s
sys     0m1.800s

take roughly the same time to run, even though in the second case we want to trace only one system call.

The new --seccomp-bpf option, added in strace version 5.3, speeds the process up many times over, bringing the run time under a single-syscall trace close to that of a normal run:

$ time strace --seccomp-bpf -e open du /usr/share/ >/dev/null 2>&1

real    0m0.148s
user    0m0.017s
sys     0m0.131s

$ time du /usr/share/ >/dev/null 2>&1

real    0m0.140s
user    0m0.024s
sys     0m0.116s

(There is, of course, a slight cheat here: we are tracing a system call that this particular command rarely uses. If we traced, say, newfstatat, then strace would slow down just as much as without --seccomp-bpf.)

How does this option work? Without it, strace attaches to the process and starts it with PTRACE_SYSCALL. When the traced process makes (any) system call, control is transferred to strace, which looks at the system call arguments and restarts the process with PTRACE_SYSCALL. Some time later the process completes the system call, and on exit from it control is again transferred to strace, which looks at the return values and restarts the process with PTRACE_SYSCALL, and so on.


With seccomp, however, this loop can be optimized exactly the way we would like: if we only want to watch the system call X, we can write a BPF filter that returns SECCOMP_RET_TRACE for X and SECCOMP_RET_ALLOW for the calls we are not interested in:

ld [0]
jneq #X, ignore
trace: ret #0x7ff00000
ignore: ret #0x7fff0000

In this case strace initially starts the process with PTRACE_CONT, and our filter is processed for every system call. If the system call is not X, the process simply continues; if it is X, seccomp transfers control to strace, which looks at the arguments and restarts the process with PTRACE_SYSCALL (seccomp has no way to run a program on exit from a system call). When the system call returns, strace restarts the process with PTRACE_CONT and waits for the next notification from seccomp.


Two restrictions apply when using the --seccomp-bpf option. First, it is not possible to attach to an already existing process (strace's -p option), since seccomp does not support this. Second, there is no way to avoid looking at child processes, since seccomp filters are inherited by all child processes and this cannot be disabled.

More details on how strace works with seccomp can be found in a recent talk. For us, the most interesting fact is that classic BPF, in the form of seccomp, still finds applications today.

xt_bpf

Let's go back to the world of networks.

Background: long ago, in 2007, the xt_u32 module for netfilter was added to the kernel. It was written by analogy with the even older traffic classifier cls_u32 and allows writing arbitrary binary rules for iptables using the following simple operations: load 32 bits from the packet and perform a set of arithmetic operations on them. For example,

sudo iptables -A INPUT -m u32 --u32 "6&0xFF=1" -j LOG --log-prefix "seen-by-xt_u32"

loads the 32 bits of the IP header starting at offset 6 and applies the mask 0xFF to them (taking the low byte). This is the protocol field of the IP header, and we compare it with 1 (ICMP). Many checks can be combined in a single rule, and there is also the @ operator, which means "move X bytes to the right". For example, the rule

iptables -m u32 --u32 "6&0xFF=0x6 && 0>>22&0x3C@4=0x29"

checks whether the TCP Sequence Number equals 0x29. I will not go into further detail, since it is already clear that writing such rules by hand is not much fun. The article BPF - the forgotten bytecode has several links with examples of using and generating xt_u32 rules. See also the links at the end of this article.

Since 2013, instead of the xt_u32 module you can use the BPF-based xt_bpf module. Anyone who has read this far should already understand how it works: BPF bytecode runs as an iptables rule. You can create a new rule, for example, like this:

iptables -A INPUT -m bpf --bytecode <bytecode> -j LOG

Here <bytecode> is code in the default output format of the bpf_asm assembler, for example:

$ cat /tmp/test.bpf
ldb [9]
jneq #17, ignore
ret #1
ignore: ret #0

$ bpf_asm /tmp/test.bpf
4,48 0 0 9,21 0 1 17,6 0 0 1,6 0 0 0,

# iptables -A INPUT -m bpf --bytecode "$(bpf_asm /tmp/test.bpf)" -j LOG

In this example we are filtering all UDP packets. The context for a BPF program in the xt_bpf module, naturally, points to the packet data, in the case of iptables, to the beginning of the IPv4 header. The return value of a BPF program is a boolean, where false means the packet did not match.

Clearly, the xt_bpf module supports filters more complex than the example above. Let's look at real examples from Cloudflare. Until recently, they used the xt_bpf module to protect against DDoS attacks. In the article Introducing the BPF Tools they explain how (and why) they generate BPF filters and publish links to a set of utilities for creating such filters. For example, with the bpfgen utility you can create a BPF program that matches a DNS query for the name habr.com:

$ ./bpfgen --assembly dns -- habr.com
ldx 4*([0]&0xf)
ld #20
add x
tax

lb_0:
    ld [x + 0]
    jneq #0x04686162, lb_1
    ld [x + 4]
    jneq #0x7203636f, lb_1
    ldh [x + 8]
    jneq #0x6d00, lb_1
    ret #65535

lb_1:
    ret #0

In the program, we first load into register X the address of the beginning of the string \x04habr\x03com\x00 inside the UDP datagram, and then check the query: 0x04686162 <-> "\x04hab", and so on.

A little later, Cloudflare published the code of their p0f -> BPF compiler. In the article Introducing the p0f BPF compiler they explain what p0f is and how to convert p0f signatures into BPF:

$ ./bpfgen p0f -- 4:64:0:0:*,0::ack+:0
39,0 0 0 0,48 0 0 8,37 35 0 64,37 0 34 29,48 0 0 0,
84 0 0 15,21 0 31 5,48 0 0 9,21 0 29 6,40 0 0 6,
...

These days Cloudflare no longer uses xt_bpf, having moved to XDP, one of the applications of the new version of BPF; see L4Drop: XDP DDoS Mitigations.

cls_bpf

The last example of classic BPF usage in the kernel is the cls_bpf classifier for the Linux traffic control subsystem, added to Linux at the end of 2013 and conceptually replacing the ancient cls_u32.

However, we will not describe how cls_bpf works right now, since from the standpoint of classic BPF knowledge it would give us nothing new: we have already covered all the functionality. Besides, in subsequent articles about Extended BPF we will meet this classifier more than once.

Another reason not to talk about using classic BPF with cls_bpf is that, compared to Extended BPF, the scope of applicability here is drastically narrower: classic programs cannot modify the contents of packets and cannot save state between invocations.

So it's time to say goodbye to classic BPF and look into the future.

Farewell to classic BPF

We looked at how BPF technology, developed in the early nineties, successfully lived for a quarter of a century, finding new applications all the while. However, much like the move from stack machines to RISC that once spurred the development of classic BPF, the 2000s brought the move from 32-bit to 64-bit machines, and classic BPF began to grow obsolete. Besides the dated architecture, the capabilities of classic BPF are very limited: there is no way to save state between BPF program invocations, no way to interact with the user directly, no way to interact with the kernel other than reading a limited number of sk_buff fields and running the simplest helper functions, and no way to modify packet contents or redirect packets.

In fact, at present only the API remains of classic BPF in Linux: inside the kernel, all classic programs, whether socket filters or seccomp filters, are automatically translated into the new format, Extended BPF. (We will explain exactly how this happens in the next article.)

The transition to the new architecture began in 2013, when Alexei Starovoitov proposed a BPF modernization scheme. In 2014 the corresponding patches began to land in the kernel. As far as I understand, the original plan was merely to optimize the architecture and the JIT compiler to run more efficiently on 64-bit machines, but instead these optimizations marked the beginning of a new chapter in Linux development.

Later articles in this series will cover the architecture and applications of the new technology, originally known as internal BPF, then extended BPF, and now just BPF.

References

  1. Steven McCanne and Van Jacobson, "The BSD Packet Filter: A New Architecture for User-level Packet Capture", https://www.tcpdump.org/papers/bpf-usenix93.pdf
  2. Steven McCanne, "libpcap: An Architecture and Optimization Methodology for Packet Capture", https://sharkfestus.wireshark.org/sharkfest.11/presentations/McCanne-Sharkfest'11_Keynote_Address.pdf
  3. tcpdump, libpcap: https://www.tcpdump.org/
  4. IPtable U32 Match Tutorial.
  5. BPF - the forgotten bytecode: https://blog.cloudflare.com/bpf-the-forgotten-bytecode/
  6. Introducing the BPF Tools: https://blog.cloudflare.com/introducing-the-bpf-tools/
  7. cls_bpf: http://man7.org/linux/man-pages/man8/tc-bpf.8.html
  8. A seccomp overview: https://lwn.net/Articles/656307/
  9. https://github.com/torvalds/linux/blob/master/Documentation/userspace-api/seccomp_filter.rst
  10. habr: Containers and security: seccomp
  11. habr: Isolate daemons with systemd or "you don't need Docker for this!"
  12. Paul Chaignon, "strace --seccomp-bpf: a look under the hood", https://fosdem.org/2020/schedule/event/debugging_strace_bpf/
  13. netsniff-ng: http://netsniff-ng.org/

Source: habr.com
