A Brief Introduction to BPF and eBPF

Hello, Habr! We would like to let you know that we are preparing the book "Linux Observability with BPF" for release.

Since the BPF virtual machine keeps evolving and is actively used in practice, we have translated an article for you that describes its main capabilities and current state.

In recent years, programming tools and techniques that compensate for the limitations of the Linux kernel in cases where high-performance packet processing is required have become increasingly popular. One of the most popular techniques of this kind is called kernel bypass. It skips the kernel networking layer and performs all packet processing in user space. Bypassing the kernel also means controlling the network card from user space. In other words, when working with a network card, we rely on a user-space driver.

By handing full control of the network card to a user-space program, we reduce kernel overhead (context switches, network-layer processing, interrupts, and so on), which matters a great deal at speeds of 10 Gb/s and higher. Kernel bypass, combined with other features (batch processing) and careful performance tuning (NUMA awareness, CPU isolation, and so on), forms the basis of high-performance user-space networking. Perhaps the textbook example of this new approach to packet processing is Intel's DPDK (Data Plane Development Kit), although there are other well-known tools and techniques, including Cisco's VPP (Vector Packet Processing), Netmap and, of course, Snabb.

Organizing networking in user space has a number of drawbacks:

  • The OS kernel is an abstraction layer for hardware resources. Since user-space programs have to manage their resources directly, they also have to manage their own hardware. This often means programming your own drivers.
  • Since we give up kernel space entirely, we also give up all the networking functionality the kernel provides. User-space programs have to reimplement features that the kernel or the operating system may already offer.
  • Programs run in sandbox mode, which severely limits their interaction and prevents them from integrating with other parts of the operating system.

In essence, user-space networking gains performance by moving packet processing from the kernel to user space. XDP (eXpress Data Path) does exactly the opposite: it moves networking programs (filters, resolvers, routing, and so on) from user space into kernel space. XDP lets us perform a networking function as soon as a packet arrives on a network interface and before it starts travelling up into the kernel networking subsystem. As a result, packet processing speed increases significantly. But how does the kernel allow a user to run their programs in kernel space? Before answering that question, let's look at what BPF is.

BPF and eBPF

Despite its confusing name, BPF (Berkeley Packet Filter) is in fact a virtual machine model. This virtual machine was originally designed to handle packet filtering, hence the name.

One of the best-known tools that uses BPF is tcpdump. When capturing packets with tcpdump, the user can specify an expression for filtering packets. Only packets that match this expression are captured. For example, the expression "tcp dst port 80" refers to all TCP packets arriving at port 80. A compiler can lower this expression by translating it into BPF bytecode.

$ sudo tcpdump -d "tcp dst port 80"
(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 6
(002) ldb      [20]
(003) jeq      #0x6             jt 4    jf 15
(004) ldh      [56]
(005) jeq      #0x50            jt 14   jf 15
(006) jeq      #0x800           jt 7    jf 15
(007) ldb      [23]
(008) jeq      #0x6             jt 9    jf 15
(009) ldh      [20]
(010) jset     #0x1fff          jt 15   jf 11
(011) ldxb     4*([14]&0xf)
(012) ldh      [x + 16]
(013) jeq      #0x50            jt 14   jf 15
(014) ret      #262144
(015) ret      #0

Here is what the program above does:

  • Instruction (000): loads the 16-bit word at packet offset 12 into the accumulator. Offset 12 corresponds to the packet's ethertype.
  • Instruction (001): compares the value in the accumulator with 0x86dd, that is, with the ethertype value for IPv6. If the result is true, the program counter moves to instruction (002), otherwise to (006).
  • Instruction (006): compares the value with 0x800 (the ethertype value for IPv4). If the result is true, the program moves to (007), otherwise to (015).

And so on, until the packet filtering program returns a result. The result is usually a Boolean. Returning a non-zero value (instruction (014)) means the packet was accepted, and returning zero (instruction (015)) means the packet was rejected.
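To make the control flow easier to follow, here is a minimal sketch of the same decision logic written as plain C over a raw Ethernet frame. The function name filter_tcp_dst_port_80 and the helper load_half are hypothetical names chosen for this illustration; the comments map each check back to the instruction numbers in the listing above. The explicit length checks stand in for the rule that a classic BPF load beyond the end of the packet rejects it.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 16-bit value; the caller guarantees off + 1 < len. */
static uint16_t load_half(const uint8_t *pkt, size_t off)
{
    return (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
}

/*
 * Hypothetical C equivalent of the "tcp dst port 80" filter.
 * Returns 262144 (accept) or 0 (reject), like instructions (014) and (015).
 */
unsigned int filter_tcp_dst_port_80(const uint8_t *pkt, size_t len)
{
    if (len < 14)
        return 0;
    uint16_t ethertype = load_half(pkt, 12);            /* (000) */

    if (ethertype == 0x86dd) {                          /* (001) IPv6 */
        if (len < 58)
            return 0;
        if (pkt[20] != 0x06)                            /* (002)-(003) next header is TCP? */
            return 0;
        return load_half(pkt, 56) == 0x50 ? 262144 : 0; /* (004)-(005) dst port */
    }

    if (ethertype != 0x0800)                            /* (006) IPv4 */
        return 0;
    if (len < 24)
        return 0;
    if (pkt[23] != 0x06)                                /* (007)-(008) protocol is TCP? */
        return 0;
    if (load_half(pkt, 20) & 0x1fff)                    /* (009)-(010) skip non-first fragments */
        return 0;

    size_t x = 4 * (size_t)(pkt[14] & 0x0f);            /* (011) IPv4 header length */
    if (len < x + 18)
        return 0;
    return load_half(pkt, x + 16) == 0x50 ? 262144 : 0; /* (012)-(013) dst port */
}
```

Calling it on a 64-byte IPv4/TCP frame whose destination port bytes are 0x00 0x50 yields 262144, and any other destination port yields 0.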

The BPF virtual machine and its bytecode were introduced by Steven McCanne and Van Jacobson in late 1992, when their paper "The BSD Packet Filter: A New Architecture for User-level Packet Capture" came out. The technology was first presented at the Usenix conference in the winter of 1993.

Since BPF is a virtual machine, it defines the environment in which programs run. In addition to the bytecode, it defines a packet-based memory model (load instructions are implicitly applied to the packet), registers (A and X, the accumulator and the index register), a scratch memory store, and an implicit program counter. Interestingly, the BPF bytecode was modeled on the MOS Technology 6502 ISA. As Steven McCanne recalled in his keynote at Sharkfest '11, he knew the 6502 architecture from programming the Apple II in his high-school days, and this knowledge influenced his work on the design of the BPF bytecode.
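The pieces listed above (accumulator A, index register X, implicit program counter, loads that read from the packet) can be sketched as a tiny interpreter. This is only an illustration under stated assumptions: the opcode names, the struct insn layout and mini_bpf_run are hypothetical and do not match the kernel's real encodings, and scratch memory is left out. As in classic BPF, jump offsets are relative to the next instruction, whereas tcpdump -d prints absolute targets. The tcp_dst_port_80 array encodes the tcpdump listing shown earlier.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for this sketch; they are not the kernel's encodings. */
enum op { LD_H_ABS, LD_B_ABS, LDX_MSH, LD_H_IND, JEQ, JSET, RET };

struct insn {
    enum op  code;
    uint8_t  jt;   /* relative jump taken when the test is true  */
    uint8_t  jf;   /* relative jump taken when the test is false */
    uint32_t k;    /* immediate operand or packet offset */
};

/* Run prog over pkt; an out-of-bounds load ends the run with 0 (reject). */
uint32_t mini_bpf_run(const struct insn *prog, size_t n,
                      const uint8_t *pkt, size_t len)
{
    uint32_t a = 0, x = 0;                  /* accumulator and index register */

    for (size_t pc = 0; pc < n; pc++) {     /* implicit program counter */
        const struct insn *i = &prog[pc];
        size_t off;

        switch (i->code) {
        case LD_H_ABS:                      /* A = 16-bit word at pkt[k] */
            if (len < 2 || i->k > len - 2) return 0;
            a = (uint32_t)((pkt[i->k] << 8) | pkt[i->k + 1]);
            break;
        case LD_B_ABS:                      /* A = byte at pkt[k] */
            if (i->k >= len) return 0;
            a = pkt[i->k];
            break;
        case LDX_MSH:                       /* X = 4 * (pkt[k] & 0xf) */
            if (i->k >= len) return 0;
            x = 4u * (pkt[i->k] & 0x0fu);
            break;
        case LD_H_IND:                      /* A = 16-bit word at pkt[X + k] */
            off = (size_t)x + i->k;
            if (len < 2 || off > len - 2) return 0;
            a = (uint32_t)((pkt[off] << 8) | pkt[off + 1]);
            break;
        case JEQ:
            pc += (a == i->k) ? i->jt : i->jf;
            break;
        case JSET:
            pc += (a & i->k) ? i->jt : i->jf;
            break;
        case RET:
            return i->k;
        }
    }
    return 0;                               /* fell off the end: reject */
}

/* "tcp dst port 80" with tcpdump's absolute targets turned into offsets. */
const struct insn tcp_dst_port_80[16] = {
    { LD_H_ABS, 0, 0,  12 },     /* (000) */
    { JEQ,      0, 4,  0x86dd }, /* (001) */
    { LD_B_ABS, 0, 0,  20 },     /* (002) */
    { JEQ,      0, 11, 0x06 },   /* (003) */
    { LD_H_ABS, 0, 0,  56 },     /* (004) */
    { JEQ,      8, 9,  0x50 },   /* (005) */
    { JEQ,      0, 8,  0x0800 }, /* (006) */
    { LD_B_ABS, 0, 0,  23 },     /* (007) */
    { JEQ,      0, 6,  0x06 },   /* (008) */
    { LD_H_ABS, 0, 0,  20 },     /* (009) */
    { JSET,     4, 0,  0x1fff }, /* (010) */
    { LDX_MSH,  0, 0,  14 },     /* (011) */
    { LD_H_IND, 0, 0,  16 },     /* (012) */
    { JEQ,      0, 1,  0x50 },   /* (013) */
    { RET,      0, 0,  262144 }, /* (014) */
    { RET,      0, 0,  0 },      /* (015) */
};
```

Keeping the whole machine state in two registers and a program counter is what makes such an interpreter short, and the same simplicity is what later made JIT compilation straightforward.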

BPF support was implemented in the Linux kernel in v2.5 and later, added mainly through the efforts of Jay Schulist. The BPF code remained unchanged until 2011, when Eric Dumazet reworked the BPF interpreter to run in JIT mode (source: "A JIT for packet filters"). After that, instead of interpreting BPF bytecode, the kernel could translate BPF programs directly to the target architecture: x86, ARM, MIPS, and so on.

Later, in 2014, Alexei Starovoitov proposed a new JIT mechanism for BPF. In fact, this new JIT became a new BPF-based architecture and was named eBPF. I believe both VMs coexisted for some time, but nowadays packet filtering is implemented on top of eBPF. In fact, in many examples in modern documentation BPF is understood to mean eBPF, and classic BPF is now known as cBPF.

eBPF extends the classic BPF virtual machine in several ways:

  • It builds on modern 64-bit architectures. eBPF uses 64-bit registers and raises the number of available registers from 2 (the accumulator and X) to 10. eBPF also provides additional opcodes (BPF_MOV, BPF_JNE, BPF_CALL...).
  • It is decoupled from the networking subsystem. BPF was tied to the packet data model. Since it was used for packet filtering, its code lived in the networking subsystem. The eBPF virtual machine, however, is no longer tied to a data model and can be used for any purpose. An eBPF program can now be attached to a tracepoint or a kprobe. This opens the door to eBPF instrumentation, performance analysis, and many other use cases in the context of other kernel subsystems. The eBPF code now lives in its own path: kernel/bpf.
  • Global data stores called maps. Maps are key-value stores that allow data to be exchanged between user space and kernel space. eBPF provides several types of maps.
  • Helper functions, for example to rewrite a packet, compute a checksum, or clone a packet. These functions run inside the kernel rather than in user space, and eBPF programs can call them directly.
  • Tail calls. The size of an eBPF program is limited to 4096 instructions. The tail call feature lets an eBPF program pass control to a new eBPF program and so get around this limit (up to 32 programs can be chained this way).

eBPF: an example

There are several eBPF examples in the Linux kernel sources. They are available in samples/bpf/. To compile these examples, just run:

$ sudo make samples/bpf/

I won't write a new eBPF example myself, but will use one of the samples available in samples/bpf/. I'll look at some parts of the code and explain how it works. As an example, I picked the tracex4 program.

In general, each example in samples/bpf/ consists of two files. In this case:

  • tracex4_kern.c, which contains the source code to be executed in the kernel as eBPF bytecode.
  • tracex4_user.c, which contains the user-space program.

In this case, we need to compile tracex4_kern.c to eBPF bytecode. At the time of writing, gcc has no backend for eBPF. Fortunately, clang can emit eBPF bytecode. The Makefile uses clang to compile tracex4_kern.c into an object file.

I mentioned above that one of the most interesting features of eBPF is maps. tracex4_kern defines one map:

struct pair {
    u64 val;
    u64 ip;
};  

struct bpf_map_def SEC("maps") my_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(long),
    .value_size = sizeof(struct pair),
    .max_entries = 1000000,
};

BPF_MAP_TYPE_HASH is one of the many map types offered by eBPF. In this case, it is simply a hash. You may also have noticed the SEC("maps") declaration. SEC is a macro used to create a new section in the binary file. In fact, the tracex4_kern example defines two more sections:

SEC("kprobe/kmem_cache_free")
int bpf_prog1(struct pt_regs *ctx)
{   
    long ptr = PT_REGS_PARM2(ctx);

    bpf_map_delete_elem(&my_map, &ptr); 
    return 0;
}
    
SEC("kretprobe/kmem_cache_alloc_node") 
int bpf_prog2(struct pt_regs *ctx)
{
    long ptr = PT_REGS_RC(ctx);
    long ip = 0;

    /* get ip address of kmem_cache_alloc_node() caller */
    BPF_KRETPROBE_READ_RET_IP(ip, ctx);

    struct pair v = {
        .val = bpf_ktime_get_ns(),
        .ip = ip,
    };
    
    bpf_map_update_elem(&my_map, &ptr, &v, BPF_ANY);
    return 0;
}   

These two functions let us delete an entry from the map (kprobe/kmem_cache_free) and add a new entry to the map (kretprobe/kmem_cache_alloc_node). All function names written in capital letters correspond to macros defined in bpf_helpers.h.

If I dump the sections of the object file, I should see that these new sections are already defined:

$ objdump -h tracex4_kern.o

tracex4_kern.o:     file format elf64-little

Sections:
Idx Name                             Size      VMA               LMA               File off  Algn
  0 .text                            00000000  0000000000000000  0000000000000000  00000040  2**2
                                     CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 kprobe/kmem_cache_free           00000048  0000000000000000  0000000000000000  00000040  2**3
                                     CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  2 kretprobe/kmem_cache_alloc_node  000000c0  0000000000000000  0000000000000000  00000088  2**3
                                     CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  3 maps                             0000001c  0000000000000000  0000000000000000  00000148  2**2
                                     CONTENTS, ALLOC, LOAD, DATA
  4 license                          00000004  0000000000000000  0000000000000000  00000164  2**0
                                     CONTENTS, ALLOC, LOAD, DATA
  5 version                          00000004  0000000000000000  0000000000000000  00000168  2**2
                                     CONTENTS, ALLOC, LOAD, DATA
  6 .eh_frame                        00000050  0000000000000000  0000000000000000  00000170  2**3
                                     CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
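The custom section names in this listing come from the SEC macro. As a rough sketch, the macro is just a compiler attribute that places a symbol in a named ELF section. The example below assumes GCC or Clang targeting ELF; demo_license and demo_license_size are hypothetical names used only for illustration. The 4-byte size it reports lines up with the size of the license section in the listing, since "GPL" plus the terminating NUL is four bytes.

```c
#include <stddef.h>

/*
 * Same shape as the SEC macro from bpf_helpers.h: put the symbol in the
 * named section and keep it even if nothing appears to reference it.
 */
#define SEC(NAME) __attribute__((section(NAME), used))

/* Hypothetical demo symbol placed in a section called "license". */
char demo_license[] SEC("license") = "GPL";

/* Size of the demo symbol in bytes, including the trailing NUL. */
size_t demo_license_size(void)
{
    return sizeof(demo_license);
}
```

A loader can then locate programs, maps and metadata by section name alone, without knowing the symbol names chosen in the source file.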

There is also tracex4_user.c, the main program. Essentially, this program listens for kmem_cache_alloc_node events. When such an event occurs, the corresponding eBPF code is executed. The code stores the instruction pointer (IP) of the caller that allocated the object in a map, and the main program then loops over the entries of that map. Example:

$ sudo ./tracex4
obj 0xffff8d6430f60a00 is 2sec old was allocated at ip ffffffff9891ad90
obj 0xffff8d6062ca5e00 is 23sec old was allocated at ip ffffffff98090e8f
obj 0xffff8d5f80161780 is 6sec old was allocated at ip ffffffff98090e8f
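Each output line can be derived from one map entry: the key is the object address, and the value holds the allocation timestamp from bpf_ktime_get_ns() together with the caller's IP. Below is a minimal sketch of that step, assuming the user-space side already has the key, the value and the current time in nanoseconds. The function format_old_object and the constant NSEC_PER_SEC are hypothetical, the struct mirrors struct pair with standard integer types, and skipping objects younger than one second is an assumption of this sketch.

```c
#include <stdint.h>
#include <stdio.h>

/* User-space mirror of struct pair from tracex4_kern.c. */
struct pair {
    uint64_t val;   /* allocation time in nanoseconds */
    uint64_t ip;    /* instruction pointer of the allocating caller */
};

#define NSEC_PER_SEC 1000000000ull

/*
 * Hypothetical helper: format one report line into buf.
 * Returns 0 without writing when the object is younger than one second
 * (an assumption of this sketch), otherwise the result of snprintf().
 */
int format_old_object(char *buf, size_t size, uint64_t key,
                      const struct pair *v, uint64_t now_ns)
{
    if (now_ns < v->val + NSEC_PER_SEC)
        return 0;

    uint64_t age_sec = (now_ns - v->val) / NSEC_PER_SEC;

    return snprintf(buf, size,
                    "obj 0x%llx is %llusec old was allocated at ip %llx\n",
                    (unsigned long long)key,
                    (unsigned long long)age_sec,
                    (unsigned long long)v->ip);
}
```

Because bpf_prog1 deletes the entry when the object is freed, whatever is still in the map at reporting time corresponds to objects that are still allocated.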

How do a user-space program and an eBPF program connect? On initialization, tracex4_user.c loads the object file tracex4_kern.o using the load_bpf_file function.

int main(int ac, char **argv)
{
    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
    char filename[256];
    int i;

    snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);

    if (setrlimit(RLIMIT_MEMLOCK, &r)) {
        perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
        return 1;
    }

    if (load_bpf_file(filename)) {
        printf("%s", bpf_log_buf);
        return 1;
    }

    for (i = 0; ; i++) {
        print_old_objects(map_fd[1]);
        sleep(1);
    }

    return 0;
}

When load_bpf_file runs, the probes defined in the eBPF file are added to /sys/kernel/debug/tracing/kprobe_events. We now listen for these events, and our program can do something when they occur.

$ sudo cat /sys/kernel/debug/tracing/kprobe_events
p:kprobes/kmem_cache_free kmem_cache_free
r:kprobes/kmem_cache_alloc_node kmem_cache_alloc_node

All the other programs in samples/bpf/ are structured in a similar way. They always contain two files:

  • XXX_kern.c: the eBPF program.
  • XXX_user.c: the main program.

The eBPF program defines maps and functions, each associated with a section. When the kernel emits an event of a certain type (for example, a tracepoint), the bound functions are executed. Maps provide the communication channel between the kernel program and the user-space program.

Conclusion

This article discussed BPF and eBPF in general terms. I know there is a great deal of information and many resources about eBPF today, so I will recommend a few resources for further study.

I recommend reading:

Source: www.habr.com
