A Brief Introduction to BPF and eBPF

Hello, Habr! We would like to let you know that we are preparing the book "Linux Observability with BPF" for release.

Since the BPF virtual machine keeps evolving and is actively used in practice, we have translated for you an article that describes its main capabilities and current state.

In recent years, programming tools and techniques that compensate for the limitations of the Linux kernel have become increasingly popular in cases where high-performance packet processing is required. One of the most popular techniques of this kind is called kernel bypass: it skips the kernel network layer and performs all packet processing from user space. Bypassing the kernel also means controlling the network card from user space. In other words, when working with the network card we rely on a user-space driver.

By handing full control of the network card to a user-space program, we reduce kernel overhead (context switches, network layer processing, interrupts, etc.), which matters a great deal at speeds of 10 Gb/s or higher. Kernel bypass, combined with other features (batch processing) and careful performance tuning (NUMA awareness, CPU isolation, and so on), forms the foundation of high-performance network processing in user space. Perhaps the exemplary instance of this new approach to packet processing is DPDK (Data Plane Development Kit) from Intel, although there are other well-known tools and techniques, including Cisco's VPP (Vector Packet Processing), Netmap and, of course, Snabb.

Organizing network interaction in user space has several drawbacks:

  • The OS kernel is an abstraction layer for hardware resources. Because user-space programs have to manage their resources directly, they also have to manage their own hardware. This often means writing your own drivers.
  • Because we give up kernel space entirely, we also give up all the networking functionality the kernel provides. User-space programs have to reimplement features that the kernel or the operating system may already offer.
  • Programs run in sandbox mode, which severely limits their interaction and prevents them from integrating with other parts of the operating system.

In short, when networking is done in user space, the performance gain comes from moving packet processing from the kernel to user space. XDP does exactly the opposite: it moves networking programs (filters, resolvers, routing, etc.) from user space into kernel space. XDP lets us perform a network function as soon as a packet hits the network interface and before it starts travelling up into the kernel network subsystem. As a result, packet processing speed increases significantly. But how does the kernel allow a user to run their programs in kernel space? Before answering this question, let's look at what BPF is.

BPF and eBPF

Despite the confusing name, BPF (Berkeley Packet Filter) is, in fact, a virtual machine model. This virtual machine was originally designed to handle packet filtering, hence the name.

One of the best-known tools that uses BPF is tcpdump. When capturing packets with tcpdump, the user can specify a packet-filtering expression. Only packets that match this expression are captured. For example, the expression "tcp dst port 80" refers to all TCP packets arriving at port 80. A compiler can reduce this expression to BPF bytecode:

$ sudo tcpdump -d "tcp dst port 80"
(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 6
(002) ldb      [20]
(003) jeq      #0x6             jt 4    jf 15
(004) ldh      [56]
(005) jeq      #0x50            jt 14   jf 15
(006) jeq      #0x800           jt 7    jf 15
(007) ldb      [23]
(008) jeq      #0x6             jt 9    jf 15
(009) ldh      [20]
(010) jset     #0x1fff          jt 15   jf 11
(011) ldxb     4*([14]&0xf)
(012) ldh      [x + 16]
(013) jeq      #0x50            jt 14   jf 15
(014) ret      #262144
(015) ret      #0

Here is what the program above does:

  • Instruction (000): loads the 16-bit word at packet offset 12 into the accumulator. Offset 12 corresponds to the packet's ethertype.
  • Instruction (001): compares the value in the accumulator with 0x86dd, that is, with the ethertype value for IPv6. If the result is true, the program counter moves to instruction (002); otherwise it moves to (006).
  • Instruction (006): compares the value with 0x800 (the ethertype value for IPv4). If the result is true, the program moves to (007); otherwise it moves to (015).

And so on, until the packet-filtering program returns a result. The result is usually a Boolean. Returning a non-zero value (instruction (014)) means the packet was accepted, and returning zero (instruction (015)) means the packet was not accepted.
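To make the walkthrough concrete, here is a minimal sketch of the same decision logic written by hand in plain C, with each branch annotated by the instruction numbers from the listing. It assumes an Ethernet frame held in a byte buffer; the function name tcp_dst_port_80 and the helper load_half are made up for this illustration and are not part of tcpdump or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 16-bit word from the packet (what "ldh" does). */
static uint16_t load_half(const uint8_t *pkt, size_t off)
{
    return (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
}

/*
 * Hypothetical hand-written equivalent of the filter "tcp dst port 80".
 * Returns 262144 (accept) or 0 (reject), like the two "ret" instructions.
 * Reads past the end of the packet are treated as a reject.
 */
static uint32_t tcp_dst_port_80(const uint8_t *pkt, size_t len)
{
    assert(pkt != NULL);
    if (len < 14)
        return 0;

    uint16_t ethertype = load_half(pkt, 12);            /* (000) */

    if (ethertype == 0x86dd) {                          /* (001) IPv6 */
        if (len < 58)
            return 0;
        if (pkt[20] != 0x06)                            /* (002)-(003) next header is TCP? */
            return 0;
        return load_half(pkt, 56) == 0x50 ? 262144u : 0; /* (004)-(005) */
    }

    if (ethertype != 0x0800)                            /* (006) IPv4? */
        return 0;
    if (len < 24)
        return 0;
    if (pkt[23] != 0x06)                                /* (007)-(008) protocol is TCP? */
        return 0;
    if (load_half(pkt, 20) & 0x1fff)                    /* (009)-(010) not the first fragment */
        return 0;

    size_t ihl = 4u * (pkt[14] & 0x0f);                 /* (011) IPv4 header length -> X */
    if (len < ihl + 18)
        return 0;
    return load_half(pkt, ihl + 16) == 0x50 ? 262144u : 0; /* (012)-(013) */
}
```

Calling tcp_dst_port_80(frame, frame_len) on an IPv4 TCP frame whose destination port is 80 yields the non-zero "accept" value; any other frame yields 0.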

The BPF virtual machine and its bytecode were proposed by Steven McCanne and Van Jacobson in late 1992, when their paper "The BSD Packet Filter: A New Architecture for User-level Packet Capture" came out. The technology was first presented at the Usenix conference in the winter of 1993.

Because BPF is a virtual machine, it defines the environment in which programs run. Besides the bytecode, it also defines a packet-based memory model (load instructions are implicitly applied to the packet), registers (A and X, the accumulator and the index register), scratch memory storage, and an implicit program counter. Interestingly, the BPF bytecode was modeled after the MOS 6502 ISA. As Steven McCanne recalled in his plenary talk at Sharkfest '11, he knew the 6502 architecture from programming the Apple II in his high-school days, and this knowledge influenced his work on the design of the BPF bytecode.
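The machine described above can be sketched as a very small interpreter. The following is an illustration only, not the kernel's implementation: the opcode names and the struct insn layout are invented for this sketch, jump targets are absolute (as tcpdump -d prints them) rather than the relative offsets the real encoding uses, and scratch memory is left out. The program table is a transcription of the "tcp dst port 80" listing shown earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for this sketch; real cBPF encodes them numerically. */
enum op { LDH_ABS, LDB_ABS, LDXB_MSH, LDH_IND, JEQ, JSET, RET };

struct insn {
    enum op  op;
    uint32_t k;       /* immediate value or packet offset */
    uint8_t  jt, jf;  /* absolute jump targets, as printed by tcpdump -d */
};

/* Run prog over pkt. Out-of-bounds loads reject the packet (return 0). */
static uint32_t run_filter(const struct insn *prog, size_t n,
                           const uint8_t *pkt, size_t len)
{
    uint32_t A = 0, X = 0;  /* accumulator and index register */
    size_t pc = 0;          /* implicit program counter */

    assert(prog != NULL && pkt != NULL);
    while (pc < n) {
        const struct insn *i = &prog[pc++];
        size_t off;

        switch (i->op) {
        case LDH_ABS:
        case LDH_IND:       /* loads always read from the packet */
            off = (size_t)i->k + (i->op == LDH_IND ? X : 0);
            if (off + 2 > len)
                return 0;
            A = (uint32_t)((pkt[off] << 8) | pkt[off + 1]);
            break;
        case LDB_ABS:
            if (i->k >= len)
                return 0;
            A = pkt[i->k];
            break;
        case LDXB_MSH:      /* X = 4 * (low nibble): the IPv4 header length */
            if (i->k >= len)
                return 0;
            X = 4u * (pkt[i->k] & 0x0f);
            break;
        case JEQ:
            pc = (A == i->k) ? i->jt : i->jf;
            break;
        case JSET:
            pc = (A & i->k) ? i->jt : i->jf;
            break;
        case RET:
            return i->k;
        }
    }
    return 0; /* fell off the end of the program: reject */
}

/* "tcp dst port 80", transcribed from the tcpdump -d listing. */
static const struct insn tcp_dst_port_80_prog[] = {
    { LDH_ABS,  12,      0,  0 },  /* (000) */
    { JEQ,      0x86dd,  2,  6 },  /* (001) */
    { LDB_ABS,  20,      0,  0 },  /* (002) */
    { JEQ,      0x6,     4, 15 },  /* (003) */
    { LDH_ABS,  56,      0,  0 },  /* (004) */
    { JEQ,      0x50,   14, 15 },  /* (005) */
    { JEQ,      0x800,   7, 15 },  /* (006) */
    { LDB_ABS,  23,      0,  0 },  /* (007) */
    { JEQ,      0x6,     9, 15 },  /* (008) */
    { LDH_ABS,  20,      0,  0 },  /* (009) */
    { JSET,     0x1fff, 15, 11 },  /* (010) */
    { LDXB_MSH, 14,      0,  0 },  /* (011) */
    { LDH_IND,  16,      0,  0 },  /* (012) */
    { JEQ,      0x50,   14, 15 },  /* (013) */
    { RET,      262144,  0,  0 },  /* (014) */
    { RET,      0,       0,  0 },  /* (015) */
};
```

The whole state of the machine is two registers and a program counter, which is why such a filter is cheap to interpret and, later, straightforward to JIT-compile.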

BPF support has been available in the Linux kernel since v2.5, added mainly through the efforts of Jay Schulist. The BPF code remained unchanged until 2011, when Eric Dumazet reworked the BPF interpreter to run in JIT mode (source: "A JIT for packet filters"). After that, instead of interpreting BPF bytecode, the kernel could translate BPF programs directly to the target architecture: x86, ARM, MIPS, and so on.

Later, in 2014, Alexei Starovoitov proposed a new JIT mechanism for BPF. In fact, this new JIT became a new BPF-based architecture and was named eBPF. I think both VMs coexisted for a while, but today packet filtering is implemented on top of eBPF. In fact, in many examples of current documentation BPF is understood to mean eBPF, and classic BPF is now known as cBPF.

eBPF extends the classic BPF virtual machine in several ways:

  • It is based on modern 64-bit architectures. eBPF uses 64-bit registers and increases the number of available registers from 2 (the accumulator and X) to 10. eBPF also provides additional opcodes (BPF_MOV, BPF_JNE, BPF_CALL...).
  • It is decoupled from the networking subsystem. BPF was tied to the packet-based data model. Since it was used for packet filtering, its code lived in the subsystem that provides networking. The eBPF virtual machine, however, is no longer tied to a data model and can be used for any purpose. So an eBPF program can now be attached to a tracepoint or a kprobe. This opens the door to eBPF instrumentation, performance analysis, and many other use cases in the context of other kernel subsystems. The eBPF code now lives in its own path: kernel/bpf.
  • Global data stores called maps. Maps are key-value stores that enable data exchange between user space and kernel space. eBPF provides several types of maps.
  • Helper functions, in particular for rewriting a packet, computing a checksum, or cloning a packet. These functions run inside the kernel and are not user-space programs. You can also make system calls from eBPF programs.
  • Tail calls. The size of an eBPF program is limited to 4096 instructions. The tail call feature lets an eBPF program hand control over to a new eBPF program and thereby get around this limit (up to 32 programs can be chained this way).

eBPF: an example

There are several eBPF examples in the Linux kernel sources. They are available in samples/bpf/. To compile these examples, just run:

$ sudo make samples/bpf/

I am not going to write a new eBPF example myself; instead, I will use one of the samples available in samples/bpf/. I will look at some parts of the code and explain how it works. As an example, I chose the tracex4 program.

In general, each example in samples/bpf/ consists of two files. In this case:

  • tracex4_kern.c, which contains the source code to be executed in the kernel as eBPF bytecode.
  • tracex4_user.c, which contains the user-space program.

In this case, we need to compile tracex4_kern.c to eBPF bytecode. At the time of writing, gcc had no backend for eBPF. Fortunately, clang can emit eBPF bytecode. The Makefile uses clang to compile tracex4_kern.c into an object file.

I mentioned above that maps are one of the most interesting features of eBPF. tracex4_kern defines a single map:

struct pair {
    u64 val;
    u64 ip;
};  

struct bpf_map_def SEC("maps") my_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(long),
    .value_size = sizeof(struct pair),
    .max_entries = 1000000,
};

BPF_MAP_TYPE_HASH is one of the many map types offered by eBPF. In this case, it is simply a hash. You may also have noticed the SEC("maps") declaration. SEC is a macro used to create a new section in the binary file. In fact, the tracex4_kern example defines two more sections:

SEC("kprobe/kmem_cache_free")
int bpf_prog1(struct pt_regs *ctx)
{   
    long ptr = PT_REGS_PARM2(ctx);

    bpf_map_delete_elem(&my_map, &ptr); 
    return 0;
}
    
SEC("kretprobe/kmem_cache_alloc_node") 
int bpf_prog2(struct pt_regs *ctx)
{
    long ptr = PT_REGS_RC(ctx);
    long ip = 0;

    // get the ip address of the caller of kmem_cache_alloc_node()
    BPF_KRETPROBE_READ_RET_IP(ip, ctx);

    struct pair v = {
        .val = bpf_ktime_get_ns(),
        .ip = ip,
    };
    
    bpf_map_update_elem(&my_map, &ptr, &v, BPF_ANY);
    return 0;
}   

These two functions let us delete an entry from the map (kprobe/kmem_cache_free) and add a new entry to the map (kretprobe/kmem_cache_alloc_node). All function names written in capital letters correspond to macros defined in bpf_helpers.h.
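SEC is one such macro. To my knowledge, the kernel samples define it as a thin wrapper around the compiler's section attribute. The sketch below reproduces that definition in ordinary user-space C, assuming gcc or clang on an ELF platform such as Linux. The function demo_prog and the section name demo_section are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Assumed to match the definition used by the kernel samples:
 * place the annotated symbol in the named ELF section and keep it
 * even if nothing references it. */
#define SEC(NAME) __attribute__((section(NAME), used))

/* Data placed in a section called "license" instead of .data. */
char _license[] SEC("license") = "GPL";

/* Hypothetical function placed in its own section, the same way
 * bpf_prog1 ends up in "kprobe/kmem_cache_free". */
SEC("demo_section")
int demo_prog(int x)
{
    return x + 1;
}
```

Compiling this file with -c and running objdump -h on the result should list license and demo_section next to the usual sections. A loader can then find each program by its section name, which is how the kprobe/... and kretprobe/... names tell it where to attach the code.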

If I dump the sections of the object file, I should see that these new sections are defined:

$ objdump -h tracex4_kern.o

tracex4_kern.o: file format elf64-little

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000000  0000000000000000  0000000000000000  00000040  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 kprobe/kmem_cache_free 00000048  0000000000000000  0000000000000000  00000040  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  2 kretprobe/kmem_cache_alloc_node 000000c0  0000000000000000  0000000000000000  00000088  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  3 maps          0000001c  0000000000000000  0000000000000000  00000148  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  4 license       00000004  0000000000000000  0000000000000000  00000164  2**0
                  CONTENTS, ALLOC, LOAD, DATA
  5 version       00000004  0000000000000000  0000000000000000  00000168  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  6 .eh_frame     00000050  0000000000000000  0000000000000000  00000170  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA

There is also tracex4_user.c, the main program. Essentially, this program listens for kmem_cache_alloc_node events. When such an event occurs, the corresponding eBPF code is executed. The code stores the object's IP attribute in the map, and the main program then prints the contents of the map in a loop. Example:

$ sudo ./tracex4
obj 0xffff8d6430f60a00 is 2sec old was allocated at ip ffffffff9891ad90
obj 0xffff8d6062ca5e00 is 23sec old was allocated at ip ffffffff98090e8f
obj 0xffff8d5f80161780 is 6sec old was allocated at ip ffffffff98090e8f
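The ages in this output follow from the timestamp that bpf_prog2 stores in the map with bpf_ktime_get_ns(). As a rough sketch of how one such line could be produced in user space: the helper name format_old_object and the one-second reporting threshold are assumptions made for this illustration, and now_ns is assumed to come from the same monotonic clock that the kernel timestamp uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC 1000000000ULL

/* Same layout as the value stored in my_map by tracex4_kern.c. */
struct pair {
    uint64_t val;  /* allocation time in ns, from bpf_ktime_get_ns() */
    uint64_t ip;   /* return address of the allocating caller */
};

/*
 * Hypothetical helper: format one report line for a map entry.
 * key is the object's address (the map key), now_ns the current time.
 * Returns the number of characters written, or 0 when the object is
 * younger than one second and is skipped (assumed threshold).
 */
static int format_old_object(char *buf, size_t size, uint64_t key,
                             const struct pair *v, uint64_t now_ns)
{
    assert(buf != NULL && v != NULL);

    uint64_t age_ns = now_ns - v->val;
    if (age_ns < NSEC_PER_SEC)
        return 0;

    return snprintf(buf, size,
                    "obj 0x%llx is %llusec old was allocated at ip %llx",
                    (unsigned long long)key,
                    (unsigned long long)(age_ns / NSEC_PER_SEC),
                    (unsigned long long)v->ip);
}
```

An entry disappears from the report as soon as kmem_cache_free fires for that address, because bpf_prog1 deletes the key from the map. What remains are objects that were allocated and have not been freed yet.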

How are the user-space program and the eBPF program related? On startup, tracex4_user.c loads the object file tracex4_kern.o using the load_bpf_file function.

int main(int ac, char **argv)
{
    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
    char filename[256];
    int i;

    snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);

    if (setrlimit(RLIMIT_MEMLOCK, &r)) {
        perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
        return 1;
    }

    if (load_bpf_file(filename)) {
        printf("%s", bpf_log_buf);
        return 1;
    }

    for (i = 0; ; i++) {
        print_old_objects(map_fd[1]);
        sleep(1);
    }

    return 0;
}

While load_bpf_file runs, the probes defined in the eBPF file are added to /sys/kernel/debug/tracing/kprobe_events. We are now listening for these events, and our program can do something when they occur.

$ sudo cat /sys/kernel/debug/tracing/kprobe_events
p:kprobes/kmem_cache_free kmem_cache_free
r:kprobes/kmem_cache_alloc_node kmem_cache_alloc_node

All the other programs in samples/bpf/ are structured in the same way. They always consist of two files:

  • XXX_kern.c: the eBPF program.
  • XXX_user.c: the main program.

The eBPF program declares maps and functions bound to a section. When the kernel emits an event of a certain type (for example, a tracepoint), the bound functions are executed. Maps provide communication between the kernel program and the user-space program.

Conclusion

This article discussed BPF and eBPF in general terms. I know there is a lot of information and many resources about eBPF today, so I will recommend a few more resources for further study.

I recommend reading:

Source: www.habr.com
