A Brief Introduction to BPF and eBPF

Hello, Habr! We would like to let you know that we are preparing to publish the book "Linux Observability with BPF".

Because the BPF virtual machine keeps evolving and is actively used in practice, we have translated an article for you that describes its main features and current state.

In recent years, programming tools and techniques that compensate for the limitations of the Linux kernel have become popular in cases where high-performance packet processing is required. One of the most popular techniques of this kind is called kernel bypass. It skips the kernel's network layer and performs all packet processing in user space. Bypassing the kernel also means controlling the network card from user space. In other words, when working with the network card we rely on a user-space driver.

By handing full control of the network card to a user-space program, we reduce the overhead imposed by the kernel (context switches, network-layer processing, interrupts, and so on), which matters a great deal at speeds of 10 Gb/s or higher. Kernel bypass, combined with other features (batch processing) and careful performance tuning (NUMA awareness, CPU isolation, and so on), forms the basis of high-performance user-space networking. Perhaps the exemplary instance of this new approach to packet processing is Intel's DPDK (Data Plane Development Kit), although there are other well-known tools and techniques, including Cisco's VPP (Vector Packet Processing), Netmap and, of course, Snabb.

Organizing network interaction in user space has a number of drawbacks:

  • The OS kernel is an abstraction layer for hardware resources. Because user-space programs must manage their resources directly, they must also manage their own hardware. This often means writing your own drivers.
  • Since we give up kernel space entirely, we also give up all the networking functionality the kernel provides. User-space programs have to reimplement features that are already provided by the kernel or the operating system.
  • Programs run in a sandbox mode, which seriously limits their interaction and prevents them from integrating with other parts of the operating system.

In essence, with user-space networking the performance gain comes from moving packet processing from the kernel to user space. XDP does exactly the opposite: it moves networking programs (filters, translators, routing, and so on) from user space into the kernel. XDP lets us run a network function as soon as a packet hits the network interface and before it starts travelling up into the kernel's network subsystem. As a result, packet-processing speed increases significantly. But how does the kernel allow a user to run their programs in kernel space? Before answering that question, let's look at what BPF is.

BPF and eBPF

Despite its somewhat misleading name, BPF (Berkeley Packet Filter) is in fact a virtual machine model. This virtual machine was originally designed to handle packet filtering, hence the name.

One of the best-known tools that uses BPF is tcpdump. When capturing packets with tcpdump, the user can specify a packet-filtering expression. Only packets that match this expression are captured. For example, the expression "tcp dst port 80" refers to all TCP packets arriving at port 80. The compiler can reduce this expression by translating it into BPF bytecode.

$ sudo tcpdump -d "tcp dst port 80"
(000) ldh [12]
(001) jeq #0x86dd jt 2 jf 6
(002) ldb [20]
(003) jeq #0x6 jt 4 jf 15
(004) ldh [56]
(005) jeq #0x50 jt 14 jf 15
(006) jeq #0x800 jt 7 jf 15
(007) ldb [23]
(008) jeq #0x6 jt 9 jf 15
(009) ldh [20]
(010) jset #0x1fff jt 15 jf 11
(011) ldxb 4*([14]&0xf)
(012) ldh [x + 16]
(013) jeq #0x50 jt 14 jf 15
(014) ret #262144
(015) ret #0

Here is, in essence, what the program above does:

  • Instruction (000): loads the packet data at offset 12, as a 16-bit word, into the accumulator. Offset 12 corresponds to the packet's ethertype.
  • Instruction (001): compares the value in the accumulator with 0x86dd, that is, with the ethertype value for IPv6. If the result is true, the program counter jumps to instruction (002); otherwise it jumps to (006).
  • Instruction (006): compares the value with 0x800 (the ethertype value for IPv4). If the result is true, the program jumps to (007); otherwise it jumps to (015).

And so on, until the packet-filtering program returns a result. Usually the result is a boolean. Returning a non-zero value (instruction (014)) means the packet matched, and returning zero (instruction (015)) means it did not.
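To make the control flow easier to follow, here is a rough C rendering of the same listing. It is only a sketch: the function name filter_tcp_dst_port_80 and the helper load_half are made up for illustration, and the explicit length checks are my own addition to keep the code memory-safe. The comments map each branch back to the instruction numbers above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 16-bit word at a byte offset (what "ldh" does). */
static uint16_t load_half(const uint8_t *pkt, size_t off)
{
    return (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
}

/*
 * Hypothetical helper mirroring the listing for "tcp dst port 80".
 * Returns 262144 (match) or 0 (no match), like instructions (014)/(015).
 * Packets too short for a load are rejected (an assumption of this sketch).
 */
unsigned int filter_tcp_dst_port_80(const uint8_t *pkt, size_t len)
{
    assert(pkt != NULL);
    if (len < 14)
        return 0;                                   /* no full Ethernet header */

    uint16_t ethertype = load_half(pkt, 12);        /* (000) */

    if (ethertype == 0x86dd) {                      /* (001) IPv6 */
        if (len < 58)
            return 0;
        if (pkt[20] != 0x6)                         /* (002)-(003) next header is TCP? */
            return 0;
        return load_half(pkt, 56) == 0x50 ? 262144 : 0;   /* (004)-(005) */
    }

    if (ethertype != 0x800)                         /* (006) IPv4? */
        return 0;
    if (len < 24 || pkt[23] != 0x6)                 /* (007)-(008) protocol is TCP? */
        return 0;
    if (load_half(pkt, 20) & 0x1fff)                /* (009)-(010) not the first fragment */
        return 0;

    size_t x = 4 * (size_t)(pkt[14] & 0xf);         /* (011) IPv4 header length into X */
    if (len < x + 18)
        return 0;
    return load_half(pkt, x + 16) == 0x50 ? 262144 : 0;   /* (012)-(013) */
}
```

Calling it on an IPv4 TCP frame whose destination port is 80 yields 262144; changing the port yields 0.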

The BPF virtual machine and its bytecode were proposed by Steve McCanne and Van Jacobson in late 1992, when their paper "The BSD Packet Filter: A New Architecture for User-level Packet Capture" came out. The technology was first presented at the Usenix conference in the winter of 1993.

Because BPF is a virtual machine, it defines the environment in which programs run. In addition to the bytecode, it also defines a packet-based memory model (load instructions are applied implicitly to the packet), registers (A and X; the accumulator and the index register), a scratch memory store, and an implicit program counter. Interestingly, BPF bytecode was modeled after the Motorola 6502 ISA. As Steve McCanne recalled in his Sharkfest '11 keynote, he had been familiar with 6502 assembly since high school, when he programmed on the Apple II, and this knowledge influenced his work on designing the BPF bytecode.
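The pieces of that environment can be sketched as a very small interpreter. This is only an illustration under my own assumptions: the opcode names, the struct layouts and run_filter are invented here and are not the kernel's encoding, only a handful of instructions are covered, and jump targets are absolute instruction indices, the way tcpdump -d prints them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative opcodes; these are NOT the kernel's numeric encodings. */
enum op { LDH_ABS, LDB_ABS, LDXB_MSH, LDH_IND, JEQ, JSET, ST, RET };

struct insn {
    enum op  op;
    uint8_t  jt, jf;   /* absolute jump targets (true / false) */
    uint32_t k;        /* immediate operand or packet offset */
};

struct vm {
    uint32_t a;        /* accumulator */
    uint32_t x;        /* index register */
    uint32_t mem[16];  /* scratch memory store */
    size_t   pc;       /* implicit program counter */
};

static int in_bounds(size_t off, size_t width, size_t len)
{
    return off <= len && width <= len - off;
}

/* Runs the program against one packet; out-of-range loads return 0. */
uint32_t run_filter(const struct insn *prog, size_t n,
                    const uint8_t *pkt, size_t len)
{
    struct vm vm = {0};
    assert(prog != NULL && pkt != NULL);

    while (vm.pc < n) {
        const struct insn *i = &prog[vm.pc++];
        switch (i->op) {
        case LDH_ABS:                    /* A = 16-bit word at packet[k] */
            if (!in_bounds(i->k, 2, len)) return 0;
            vm.a = (uint32_t)((pkt[i->k] << 8) | pkt[i->k + 1]);
            break;
        case LDB_ABS:                    /* A = byte at packet[k] */
            if (!in_bounds(i->k, 1, len)) return 0;
            vm.a = pkt[i->k];
            break;
        case LDXB_MSH:                   /* X = 4 * (packet[k] & 0xf) */
            if (!in_bounds(i->k, 1, len)) return 0;
            vm.x = 4u * (pkt[i->k] & 0xfu);
            break;
        case LDH_IND: {                  /* A = 16-bit word at packet[X + k] */
            size_t off = (size_t)vm.x + i->k;
            if (!in_bounds(off, 2, len)) return 0;
            vm.a = (uint32_t)((pkt[off] << 8) | pkt[off + 1]);
            break;
        }
        case JEQ:  vm.pc = (vm.a == i->k) ? i->jt : i->jf; break;
        case JSET: vm.pc = (vm.a & i->k) ? i->jt : i->jf; break;
        case ST:   vm.mem[i->k & 15] = vm.a; break;   /* scratch[k] = A */
        case RET:  return i->k;
        }
    }
    return 0;
}

/* A four-instruction program that accepts only IPv4 frames. */
const struct insn ipv4_only[] = {
    { LDH_ABS, 0, 0, 12 },       /* (000) load the ethertype */
    { JEQ,     2, 3, 0x800 },    /* (001) IPv4? */
    { RET,     0, 0, 262144 },   /* (002) match */
    { RET,     0, 0, 0 },        /* (003) no match */
};
```

Running ipv4_only through run_filter on a 14-byte frame with ethertype 0x0800 returns 262144, and any other ethertype returns 0.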

BPF support was implemented in the Linux kernel in v2.5 and later, added mainly by Jay Schulist. The BPF code remained unchanged until 2011, when Eric Dumazet reworked the BPF interpreter to run in JIT mode (source: "A JIT for packet filters"). After that, instead of interpreting BPF bytecode, the kernel could translate BPF programs directly to the target architecture: x86, ARM, MIPS, and so on.

Later, in 2014, Alexei Starovoitov proposed a new JIT mechanism for BPF. In fact, this new JIT became a new BPF-based architecture and was called eBPF. I believe both VMs coexisted for some time, but packet filtering is now implemented on top of eBPF. In fact, in much of today's documentation BPF means eBPF, and classic BPF is now known as cBPF.

eBPF extends the classic BPF virtual machine in several ways:

  • It relies on modern 64-bit architectures. eBPF uses 64-bit registers and increases the number of available registers from 2 (the accumulator and X) to 10. eBPF also provides additional opcodes (BPF_MOV, BPF_JNE, BPF_CALL…).
  • It is decoupled from the networking subsystem. BPF was tied to a packet-based data model. Since it was used for filtering packets, its code lived in the networking subsystem. The eBPF virtual machine, however, is no longer tied to a data model and can be used for any purpose. So an eBPF program can now be attached to a tracepoint or a kprobe. This opens the door to eBPF instrumentation, performance analysis, and many other uses in the context of other kernel subsystems. The eBPF code now lives in its own path: kernel/bpf.
  • Global data stores called maps. Maps are key-value stores that allow data exchange between user space and kernel space. eBPF provides several types of maps.
  • Helper functions. In particular, to rewrite a packet, compute a checksum, or clone a packet. These functions run inside the kernel and are not user-space programs. In addition, system calls can be made from eBPF programs.
  • Tail calls. The size of an eBPF program is limited to 4096 instructions. The tail call feature allows one eBPF program to hand control over to a new eBPF program and thus get around this limit (up to 32 programs can be chained this way).
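To give the first item in the list above a concrete shape, here is a small sketch of how a single eBPF instruction can be represented in C. Treat it as an illustration rather than the kernel's definition: struct ebpf_insn, mov64_imm and exit_insn are names I made up, the register byte is packed by hand (destination register in the low nibble, which is my assumption for little-endian hosts), and the two opcode values are the ones I understand correspond to a 64-bit move of an immediate and to program exit.

```c
#include <assert.h>
#include <stdint.h>

/* Local, illustrative mirror of a fixed-size 8-byte eBPF instruction. */
struct ebpf_insn {
    uint8_t code;   /* opcode */
    uint8_t regs;   /* low nibble: destination register, high nibble: source */
    int16_t off;    /* signed offset for jumps and memory access */
    int32_t imm;    /* signed 32-bit immediate */
};

/* Opcode values assumed for this sketch. */
enum { EBPF_MOV64_IMM = 0xb7, EBPF_EXIT = 0x95 };

/* Build "rDST = imm" for one of the ten registers r0..r9. */
struct ebpf_insn mov64_imm(unsigned dst, int32_t imm)
{
    assert(dst <= 9);
    struct ebpf_insn i = { EBPF_MOV64_IMM, (uint8_t)(dst & 0xf), 0, imm };
    return i;
}

/* Build "exit": return from the program with the value held in r0. */
struct ebpf_insn exit_insn(void)
{
    struct ebpf_insn i = { EBPF_EXIT, 0, 0, 0 };
    return i;
}
```

A two-instruction program built from mov64_imm(0, 0) followed by exit_insn() would simply return 0.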

eBPF example

There are several eBPF examples in the Linux kernel sources. They are available in samples/bpf/. To build these examples, just type:

$ sudo make samples/bpf/

I won't write a new eBPF example myself; instead I'll use one of the samples available in samples/bpf/. I'll look at some parts of the code and explain how it works. As an example, I chose the tracex4 program.

In general, each example in samples/bpf/ consists of two files. In this case:

  • tracex4_kern.c, which contains the source code to be executed in the kernel as eBPF bytecode.
  • tracex4_user.c, which contains the user-space program.

In this case, we need to compile tracex4_kern.c to eBPF bytecode. At the time of writing, gcc has no backend for eBPF. Fortunately, clang can emit eBPF bytecode. The Makefile uses clang to compile tracex4_kern.c into an object file.

I mentioned above that one of the most interesting features of eBPF is maps. tracex4_kern defines one map:

struct pair {
    u64 val;
    u64 ip;
};  

struct bpf_map_def SEC("maps") my_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(long),
    .value_size = sizeof(struct pair),
    .max_entries = 1000000,
};

BPF_MAP_TYPE_HASH is one of the many map types offered by eBPF. In this case, it is simply a hash. You may also have noticed the SEC("maps") declaration. SEC is a macro used to create a new section in the binary file. In fact, the tracex4_kern example defines two more sections:

SEC("kprobe/kmem_cache_free")
int bpf_prog1(struct pt_regs *ctx)
{   
    long ptr = PT_REGS_PARM2(ctx);

    bpf_map_delete_elem(&my_map, &ptr); 
    return 0;
}
    
SEC("kretprobe/kmem_cache_alloc_node") 
int bpf_prog2(struct pt_regs *ctx)
{
    long ptr = PT_REGS_RC(ctx);
    long ip = 0;

    // get the IP address of the kmem_cache_alloc_node() caller
    BPF_KRETPROBE_READ_RET_IP(ip, ctx);

    struct pair v = {
        .val = bpf_ktime_get_ns(),
        .ip = ip,
    };
    
    bpf_map_update_elem(&my_map, &ptr, &v, BPF_ANY);
    return 0;
}   

These two functions let us delete an entry from the map (kprobe/kmem_cache_free) and add a new entry to the map (kretprobe/kmem_cache_alloc_node). All function names written in capital letters correspond to macros defined in bpf_helpers.h.
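The bookkeeping these two probes perform can be modelled in plain user-space C. The sketch below is not kernel code: toy_map and its functions are hypothetical stand-ins, and a small linear-scan table replaces the kernel's hash map. It only shows the idea that an allocation records a timestamp and caller address under the object's pointer, and a free removes that record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pair {
    uint64_t val;   /* allocation timestamp in nanoseconds */
    uint64_t ip;    /* address of the caller that allocated the object */
};

#define SLOTS 64    /* tiny capacity for the sketch */

struct slot { int used; long key; struct pair value; };
struct toy_map { struct slot slots[SLOTS]; };

static struct slot *find(struct toy_map *m, long key)
{
    for (size_t i = 0; i < SLOTS; i++)
        if (m->slots[i].used && m->slots[i].key == key)
            return &m->slots[i];
    return NULL;
}

/* Stand-in for bpf_map_update_elem(..., BPF_ANY): insert or overwrite. */
int toy_update(struct toy_map *m, long key, struct pair v)
{
    assert(m != NULL);
    struct slot *s = find(m, key);
    for (size_t i = 0; !s && i < SLOTS; i++)
        if (!m->slots[i].used)
            s = &m->slots[i];
    if (!s)
        return -1;                       /* table is full */
    s->used = 1;
    s->key = key;
    s->value = v;
    return 0;
}

/* Stand-in for bpf_map_delete_elem(). */
int toy_delete(struct toy_map *m, long key)
{
    assert(m != NULL);
    struct slot *s = find(m, key);
    if (!s)
        return -1;
    s->used = 0;
    return 0;
}

const struct pair *toy_lookup(struct toy_map *m, long key)
{
    assert(m != NULL);
    struct slot *s = find(m, key);
    return s ? &s->value : NULL;
}

/* What bpf_prog2 does when kmem_cache_alloc_node() returns ptr. */
void on_alloc_return(struct toy_map *m, long ptr, uint64_t now_ns, uint64_t caller_ip)
{
    struct pair v = { .val = now_ns, .ip = caller_ip };
    toy_update(m, ptr, v);
}

/* What bpf_prog1 does when kmem_cache_free() is entered with ptr. */
void on_free(struct toy_map *m, long ptr)
{
    toy_delete(m, ptr);
}
```

Whatever is still in the table at any moment corresponds to objects that were allocated and not yet freed, which is exactly what the user-space side later reports.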

If I dump the sections of the object file, I should see that these new sections are defined:

$ objdump -h tracex4_kern.o

tracex4_kern.o: file format elf64-little

Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000000 0000000000000000 0000000000000000 00000040 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
1 kprobe/kmem_cache_free 00000048 0000000000000000 0000000000000000 00000040 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
2 kretprobe/kmem_cache_alloc_node 000000c0 0000000000000000 0000000000000000 00000088 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
3 maps 0000001c 0000000000000000 0000000000000000 00000148 2**2
CONTENTS, ALLOC, LOAD, DATA
4 license 00000004 0000000000000000 0000000000000000 00000164 2**0
CONTENTS, ALLOC, LOAD, DATA
5 version 00000004 0000000000000000 0000000000000000 00000168 2**2
CONTENTS, ALLOC, LOAD, DATA
6 .eh_frame 00000050 0000000000000000 0000000000000000 00000170 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA

There is also tracex4_user.c, the main program. Basically, this program listens for kmem_cache_alloc_node events. When such an event occurs, the corresponding eBPF code is executed. The code stores the object's IP attribute in the map, and the main program then loops over the stored objects. Example:

$ sudo ./tracex4
obj 0xffff8d6430f60a00 is 2sec old was allocated at ip ffffffff9891ad90
obj 0xffff8d6062ca5e00 is 23sec old was allocated at ip ffffffff98090e8f
obj 0xffff8d5f80161780 is 6sec old was allocated at ip ffffffff98090e8f
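Each output line can be derived from one map entry: the key is the object's address, and the value holds the allocation timestamp and the caller's address. The sketch below shows that calculation only. The helper format_old_object is hypothetical, the rule of skipping objects younger than one second is my assumption, and the walk over the map itself is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct pair {
    uint64_t val;   /* allocation timestamp in nanoseconds */
    uint64_t ip;    /* address of the allocating caller */
};

#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper: formats one report line for the object stored
 * under `key`. Returns the number of characters written, or 0 when the
 * object is younger than one second (an assumption of this sketch).
 */
int format_old_object(char *buf, size_t size, uint64_t key,
                      uint64_t now_ns, const struct pair *v)
{
    assert(buf != NULL && v != NULL);
    if (now_ns < v->val || now_ns - v->val < NSEC_PER_SEC)
        return 0;

    return snprintf(buf, size,
                    "obj 0x%llx is %llusec old was allocated at ip %llx\n",
                    (unsigned long long)key,
                    (unsigned long long)((now_ns - v->val) / NSEC_PER_SEC),
                    (unsigned long long)v->ip);
}
```

In the real program the current time would come from a monotonic clock, so that it is comparable with the timestamps taken by bpf_ktime_get_ns() in the kernel.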

How are the user-space program and the eBPF program related? On startup, tracex4_user.c loads the object file tracex4_kern.o using the load_bpf_file function.

int main(int ac, char **argv)
{
    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
    char filename[256];
    int i;

    snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);

    if (setrlimit(RLIMIT_MEMLOCK, &r)) {
        perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
        return 1;
    }

    if (load_bpf_file(filename)) {
        printf("%s", bpf_log_buf);
        return 1;
    }

    for (i = 0; ; i++) {
        print_old_objects(map_fd[1]);
        sleep(1);
    }

    return 0;
}

When load_bpf_file runs, the probes defined in the eBPF file are added to /sys/kernel/debug/tracing/kprobe_events. We are now listening for these events, and our program can do something when they occur.

$ sudo cat /sys/kernel/debug/tracing/kprobe_events
p:kprobes/kmem_cache_free kmem_cache_free
r:kprobes/kmem_cache_alloc_node kmem_cache_alloc_node
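The two lines above follow directly from the section names used in tracex4_kern.c: a "kprobe/" section becomes a "p:" entry and a "kretprobe/" section becomes an "r:" entry. The sketch below reproduces that mapping. The helper kprobe_event_line is hypothetical and only mirrors the format visible in the output; how load_bpf_file builds the line internally is not shown in this article.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: turns an ELF section name such as
 * "kprobe/kmem_cache_free" into a kprobe_events line such as
 * "p:kprobes/kmem_cache_free kmem_cache_free".
 * Returns the length written, or -1 if the section is not a probe.
 */
int kprobe_event_line(char *buf, size_t size, const char *section)
{
    assert(buf != NULL && section != NULL);

    char kind;
    const char *func;

    if (strncmp(section, "kprobe/", 7) == 0) {
        kind = 'p';                 /* probe on function entry */
        func = section + 7;
    } else if (strncmp(section, "kretprobe/", 10) == 0) {
        kind = 'r';                 /* probe on function return */
        func = section + 10;
    } else {
        return -1;                  /* e.g. "maps" or "license" */
    }
    if (*func == '\0')
        return -1;                  /* no function name after the prefix */

    return snprintf(buf, size, "%c:kprobes/%s %s", kind, func, func);
}
```

Sections such as maps, license and version do not describe probes, so the helper rejects them.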

All other programs in samples/bpf/ are structured similarly. They always consist of two files:

  • XXX_kern.c: the eBPF program.
  • XXX_user.c: the main program.

The eBPF program defines maps and functions tied to a section. When the kernel emits an event of a certain type (for example, a tracepoint), the bound functions are executed. Maps provide communication between the kernel program and the user-space program.

Conclusion

This article covered BPF and eBPF in general terms. I know there is a great deal of information and many resources about eBPF today, so I will recommend a few more materials for further study.

I recommend reading:

Source: www.hab.com
