An Introduction to BPF and eBPF

Hi Habr! We'd like to let you know that we are preparing to release the book "Linux Observability with BPF".

Since the BPF virtual machine keeps evolving and is actively used in practice, we have translated an article for you that describes its main features and current state.

In recent years, programming tools and techniques have become popular that compensate for the limitations of the Linux kernel in cases where high-performance packet processing is required. One of the most popular techniques of this kind is called kernel bypass: it skips the kernel's network stack and performs all packet processing in user space. Bypassing the kernel also means managing the network card from user space. In other words, when working with the network card we rely on a user-space driver.

By handing full control of the network card to a user-space program, we reduce kernel overhead (context switches, network-stack processing, interrupts, etc.), which matters a great deal when running at 10 Gb/s or more. Kernel bypass, combined with other features (batch processing) and careful performance tuning (NUMA awareness, CPU isolation, etc.), forms the basis of high-performance user-space networking. Perhaps the textbook example of this new approach to packet processing is DPDK from Intel (Data Plane Development Kit), although there are other well-known tools and techniques, including VPP from Cisco (Vector Packet Processing), Netmap and, of course, Snabb.

Organizing network interaction in user space has a number of drawbacks:

  • An OS kernel is an abstraction layer over hardware resources. Because user-space programs have to manage their resources directly, they also have to manage their own hardware. This often means writing your own drivers.
  • Since we give up kernel space entirely, we also give up all the networking functionality the kernel provides. User-space programs have to re-implement features that the kernel or the operating system may already provide.
  • Programs run in sandbox mode, which severely limits their interaction and prevents them from integrating with other parts of the operating system.

In essence, with user-space networking the performance gain comes from moving packet processing out of the kernel and into user space. XDP does exactly the opposite: it moves networking programs (filters, mappers, routing, etc.) from user space into the kernel. XDP lets us run a network function as soon as a packet reaches the network interface and before it starts travelling up into the kernel's networking subsystem. As a result, packet-processing speed increases significantly. But how does the kernel let a user run their programs in kernel space? Before answering that question, let's look at what BPF is.

BPF and eBPF

Despite its somewhat misleading name, BPF (Berkeley Packet Filter) is in fact a virtual machine model. This virtual machine was originally designed to handle packet filtering, hence the name.

One of the best-known tools that uses BPF is tcpdump. When capturing packets with tcpdump, the user can specify a packet-filtering expression. Only packets that match this expression are captured. For example, the expression "tcp dst port 80" refers to all TCP packets arriving at port 80. A compiler can reduce this expression by translating it into BPF bytecode.

$ sudo tcpdump -d "tcp dst port 80"
(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 6
(002) ldb      [20]
(003) jeq      #0x6             jt 4    jf 15
(004) ldh      [56]
(005) jeq      #0x50            jt 14   jf 15
(006) jeq      #0x800           jt 7    jf 15
(007) ldb      [23]
(008) jeq      #0x6             jt 9    jf 15
(009) ldh      [20]
(010) jset     #0x1fff          jt 15   jf 11
(011) ldxb     4*([14]&0xf)
(012) ldh      [x + 16]
(013) jeq      #0x50            jt 14   jf 15
(014) ret      #262144
(015) ret      #0

Here is, in essence, what the program above does:

  • Instruction (000): loads the packet's value at offset 12, as a 16-bit word, into the accumulator. Offset 12 corresponds to the packet's ethertype.
  • Instruction (001): compares the value in the accumulator with 0x86dd, that is, with the ethertype value for IPv6. If the result is true, the program counter jumps to instruction (002); if not, to (006).
  • Instruction (006): compares the value with 0x800 (the ethertype value for IPv4). If the result is true, the program jumps to (007); if not, to (015).

And so on, until the packet-filtering program returns a result. The result is usually a boolean. Returning a non-zero value (instruction (014)) means the packet matched, and returning zero (instruction (015)) means the packet did not match.
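To make the walkthrough concrete, here is a minimal sketch of an interpreter that executes the listing above over a packet buffer. It is not the kernel's implementation: the opcode names, `struct insn`, `run_filter` and `filter_ipv4_tcp_to_port` are hypothetical names invented for this illustration, and only the handful of instructions that appear in the listing are supported. Jump targets are stored as absolute instruction numbers, the way tcpdump -d prints them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode names for this sketch only. */
enum op { LDH_ABS, LDB_ABS, LDH_IND, LDXB_MSH, JEQ, JSET, RET };

struct insn {
    enum op op;
    uint32_t k;      /* offset, constant, or return value */
    uint8_t jt, jf;  /* absolute jump targets, as printed by tcpdump -d */
};

/* The "tcp dst port 80" program from the listing above. */
static const struct insn tcp_dst_port_80[] = {
    /* 000 */ { LDH_ABS, 12, 0, 0 },
    /* 001 */ { JEQ, 0x86dd, 2, 6 },
    /* 002 */ { LDB_ABS, 20, 0, 0 },
    /* 003 */ { JEQ, 0x6, 4, 15 },
    /* 004 */ { LDH_ABS, 56, 0, 0 },
    /* 005 */ { JEQ, 0x50, 14, 15 },
    /* 006 */ { JEQ, 0x800, 7, 15 },
    /* 007 */ { LDB_ABS, 23, 0, 0 },
    /* 008 */ { JEQ, 0x6, 9, 15 },
    /* 009 */ { LDH_ABS, 20, 0, 0 },
    /* 010 */ { JSET, 0x1fff, 15, 11 },
    /* 011 */ { LDXB_MSH, 14, 0, 0 },
    /* 012 */ { LDH_IND, 16, 0, 0 },
    /* 013 */ { JEQ, 0x50, 14, 15 },
    /* 014 */ { RET, 262144, 0, 0 },
    /* 015 */ { RET, 0, 0, 0 },
};

/* Runs a filter over one packet. Returns 0 when the packet is rejected;
 * this sketch also rejects the packet on any out-of-bounds load. */
static uint32_t run_filter(const struct insn *prog, size_t n,
                           const uint8_t *pkt, size_t len)
{
    uint32_t a = 0, x = 0;  /* accumulator and index register */
    size_t pc = 0;          /* program counter */

    while (pc < n) {
        const struct insn *i = &prog[pc++];
        size_t off;

        switch (i->op) {
        case LDH_ABS:
        case LDH_IND:
            off = i->k + (i->op == LDH_IND ? x : 0);
            if (off + 2 > len)
                return 0;
            a = (uint32_t)pkt[off] << 8 | pkt[off + 1];  /* big-endian */
            break;
        case LDB_ABS:
            if (i->k >= len)
                return 0;
            a = pkt[i->k];
            break;
        case LDXB_MSH:
            if (i->k >= len)
                return 0;
            x = 4 * (pkt[i->k] & 0xf);  /* IPv4 header length in bytes */
            break;
        case JEQ:
            pc = (a == i->k) ? i->jt : i->jf;
            break;
        case JSET:
            pc = (a & i->k) ? i->jt : i->jf;
            break;
        case RET:
            return i->k;
        }
    }
    return 0;
}

/* Builds a minimal Ethernet + IPv4 + TCP frame and runs the filter on it. */
static uint32_t filter_ipv4_tcp_to_port(uint16_t port)
{
    uint8_t pkt[54] = { 0 };

    pkt[12] = 0x08;                    /* ethertype 0x0800: IPv4 */
    pkt[14] = 0x45;                    /* IPv4, header length 5 words */
    pkt[23] = 6;                       /* IP protocol: TCP */
    pkt[36] = (uint8_t)(port >> 8);    /* TCP destination port, */
    pkt[37] = (uint8_t)(port & 0xff);  /* big-endian */
    return run_filter(tcp_dst_port_80,
                      sizeof tcp_dst_port_80 / sizeof tcp_dst_port_80[0],
                      pkt, sizeof pkt);
}
```

Calling `filter_ipv4_tcp_to_port(80)` follows the IPv4 path (000, 001, 006 through 014) and returns the non-zero snapshot length, while any other port ends at instruction (015) and returns zero.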

The BPF virtual machine and its bytecode were proposed by Steve McCanne and Van Jacobson in late 1992, when their paper "The BSD Packet Filter: A New Architecture for User-level Packet Capture" came out, and the technology was first presented at the Usenix conference in the winter of 1993.

Because BPF is a virtual machine, it defines an environment in which programs run. Besides the bytecode, it also defines a packet-based memory model (load instructions are implicitly applied to the packet being processed), registers (A and X: the accumulator and the index register), a scratch memory store, and an implicit program counter. Interestingly, BPF bytecode was modeled after the 6502 ISA (the 6502 was designed by MOS Technology, although it is often attributed to Motorola). As Steve McCanne recalled in his keynote at Sharkfest '11, he had known 6502 assembly since secondary school, when he programmed on the Apple II, and that knowledge influenced his work on designing the BPF bytecode.

BPF support has been implemented in the Linux kernel since v2.5, added mainly by Jay Schulist. The BPF code remained unchanged until 2011, when Eric Dumazet reworked the BPF interpreter to run in JIT mode (source: "A JIT for packet filters"). After that, instead of interpreting BPF bytecode, the kernel could translate BPF programs directly into the target architecture: x86, ARM, MIPS, etc.

Later, in 2014, Alexei Starovoitov proposed a new JIT mechanism for BPF. In fact, this new JIT became a new BPF-based architecture and was named eBPF. I believe the two VMs coexisted for some time, but packet filtering is now implemented on top of eBPF. In fact, in many modern documentation examples BPF means eBPF, and classic BPF is known today as cBPF.

eBPF extends the classic BPF virtual machine in several ways:

  • It relies on modern 64-bit architectures. eBPF uses 64-bit registers and increases the number of available registers from 2 (accumulator and X) to 10. eBPF also provides additional opcodes (BPF_MOV, BPF_JNE, BPF_CALL…).
  • It is decoupled from the networking subsystem. BPF was tied to a packet-based data model. Since it was used for packet filtering, its code lived in the networking subsystem. The eBPF virtual machine, however, is no longer bound to a data model and can be used for any purpose. So an eBPF program can now be attached to a tracepoint or a kprobe. This opens the door to eBPF instrumentation, performance analysis, and many other use cases in the context of other kernel subsystems. The eBPF code now lives in its own path: kernel/bpf.
  • Global data stores called maps. Maps are key-value stores that allow data to be exchanged between user space and kernel space. eBPF provides several types of maps.
  • Helper functions, for example to rewrite a packet, compute a checksum, or clone a packet. Unlike user-space code, these functions run inside the kernel. An eBPF program reaches them through the BPF_CALL opcode.
  • Tail calls. The size of an eBPF program is limited to 4096 instructions. The tail-call feature lets an eBPF program hand control over to another eBPF program and thereby get past this limit (up to 32 programs can be chained this way).
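As a rough illustration of the wider instruction format, the sketch below re-declares the layout of the kernel's `struct bpf_insn` (from `<linux/bpf.h>`) so that it builds without kernel headers. The names `ebpf_insn`, `EBPF_MOV64_IMM`, `EBPF_EXIT`, `return_zero` and `max_program_bytes` are made up for this example. It also assumes the classic program-size limit, BPF_MAXINSNS, which counts instructions (4096 of them) rather than bytes.

```c
#include <stdint.h>

/* Same field layout as struct bpf_insn in <linux/bpf.h>; relies on the
 * GCC/Clang extension that allows bit-fields of type uint8_t. */
struct ebpf_insn {
    uint8_t code;         /* opcode */
    uint8_t dst_reg : 4;  /* destination register, r0..r10 */
    uint8_t src_reg : 4;  /* source register */
    int16_t off;          /* signed offset */
    int32_t imm;          /* signed immediate constant */
};

enum {
    EBPF_MOV64_IMM = 0xb7,  /* BPF_ALU64 | BPF_MOV | BPF_K */
    EBPF_EXIT      = 0x95,  /* BPF_JMP | BPF_EXIT */
    EBPF_MAXINSNS  = 4096   /* classic per-program instruction limit */
};

/* "r0 = 0; exit": about the smallest program that returns 0. */
static const struct ebpf_insn return_zero[] = {
    { .code = EBPF_MOV64_IMM, .dst_reg = 0, .imm = 0 },
    { .code = EBPF_EXIT },
};

/* Size in bytes of a program that uses the whole instruction budget. */
static unsigned long max_program_bytes(void)
{
    return EBPF_MAXINSNS * sizeof(struct ebpf_insn);
}
```

Each instruction occupies 8 bytes, so a program that uses all 4096 slots takes 32 KiB of bytecode.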

An eBPF example

The Linux kernel sources contain several eBPF examples. They are located in samples/bpf/. To build these examples, just type:

$ sudo make samples/bpf/

I won't write a new eBPF example myself; instead, I'll use one of the samples available in samples/bpf/. I'll go through some parts of the code and explain how it works. As an example, I chose the tracex4 program.

In general, each example in samples/bpf/ consists of two files. In this case:

  • tracex4_kern.c contains the source code that will be executed in the kernel as eBPF bytecode.
  • tracex4_user.c contains the user-space program.

Here we need to compile tracex4_kern.c to eBPF bytecode. At the time of writing, gcc had no eBPF backend. Fortunately, clang can emit eBPF bytecode. The Makefile uses clang to compile tracex4_kern.c into an object file.

I mentioned above that one of the most interesting features of eBPF is maps. tracex4_kern defines one map:

struct pair {
    u64 val;
    u64 ip;
};  

struct bpf_map_def SEC("maps") my_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(long),
    .value_size = sizeof(struct pair),
    .max_entries = 1000000,
};

BPF_MAP_TYPE_HASH is one of the many map types that eBPF offers. In this case it is simply a hash. You may also have noticed the SEC("maps") declaration. SEC is a macro used to create a new section in the binary file. In fact, the tracex4_kern example defines two more sections:

SEC("kprobe/kmem_cache_free")
int bpf_prog1(struct pt_regs *ctx)
{   
    long ptr = PT_REGS_PARM2(ctx);

    bpf_map_delete_elem(&my_map, &ptr); 
    return 0;
}
    
SEC("kretprobe/kmem_cache_alloc_node") 
int bpf_prog2(struct pt_regs *ctx)
{
    long ptr = PT_REGS_RC(ctx);
    long ip = 0;

    // get the instruction pointer (ip) of the caller of kmem_cache_alloc_node()
    BPF_KRETPROBE_READ_RET_IP(ip, ctx);

    struct pair v = {
        .val = bpf_ktime_get_ns(),
        .ip = ip,
    };
    
    bpf_map_update_elem(&my_map, &ptr, &v, BPF_ANY);
    return 0;
}   

These two functions delete an entry from the map (kprobe/kmem_cache_free) and add a new entry to the map (kretprobe/kmem_cache_alloc_node). All function names written in capital letters correspond to macros defined in bpf_helpers.h.
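The bookkeeping these two probes perform can be imitated in plain user-space C. The sketch below is only a model: `toy_map`, `toy_update`, `toy_delete`, `on_alloc_return` and `on_free` are hypothetical names, the table is a small linear-scan array rather than a real hash, and the timestamp and caller address are passed in as arguments instead of coming from bpf_ktime_get_ns() and the probe context.

```c
#include <stddef.h>
#include <stdint.h>

struct pair {
    uint64_t val;  /* allocation timestamp in nanoseconds */
    uint64_t ip;   /* instruction pointer of the caller */
};

#define TOY_MAX_ENTRIES 64  /* the sample's map allows 1000000 */

struct toy_entry {
    int used;
    long key;  /* object address, like `ptr` in the probes */
    struct pair value;
};

static struct toy_entry toy_map[TOY_MAX_ENTRIES];

static struct toy_entry *toy_find(long key)
{
    for (size_t i = 0; i < TOY_MAX_ENTRIES; i++)
        if (toy_map[i].used && toy_map[i].key == key)
            return &toy_map[i];
    return NULL;
}

/* Insert or overwrite, similar in spirit to an update with BPF_ANY. */
static int toy_update(long key, const struct pair *v)
{
    struct toy_entry *e = toy_find(key);

    for (size_t i = 0; !e && i < TOY_MAX_ENTRIES; i++)
        if (!toy_map[i].used)
            e = &toy_map[i];
    if (!e)
        return -1;  /* map is full */
    e->used = 1;
    e->key = key;
    e->value = *v;
    return 0;
}

static int toy_delete(long key)
{
    struct toy_entry *e = toy_find(key);

    if (!e)
        return -1;  /* no such key */
    e->used = 0;
    return 0;
}

/* Models bpf_prog2: runs when kmem_cache_alloc_node() returns `ptr`. */
static void on_alloc_return(long ptr, uint64_t caller_ip, uint64_t now_ns)
{
    struct pair v = { .val = now_ns, .ip = caller_ip };

    toy_update(ptr, &v);
}

/* Models bpf_prog1: runs when kmem_cache_free() is called on `ptr`. */
static void on_free(long ptr)
{
    toy_delete(ptr);
}

static size_t toy_count(void)
{
    size_t n = 0;

    for (size_t i = 0; i < TOY_MAX_ENTRIES; i++)
        n += toy_map[i].used ? 1 : 0;
    return n;
}
```

At any moment the table therefore holds exactly the objects that have been allocated but not yet freed, which is what lets the user-space side report how old each live object is.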

If I dump the sections of the object file, I should see that these new sections are indeed defined:

$ objdump -h tracex4_kern.o

tracex4_kern.o: file format elf64-little

Sections:
Idx Name                             Size      VMA               LMA               File off  Algn
  0 .text                            00000000  0000000000000000  0000000000000000  00000040  2**2
                                     CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 kprobe/kmem_cache_free           00000048  0000000000000000  0000000000000000  00000040  2**3
                                     CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  2 kretprobe/kmem_cache_alloc_node  000000c0  0000000000000000  0000000000000000  00000088  2**3
                                     CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  3 maps                             0000001c  0000000000000000  0000000000000000  00000148  2**2
                                     CONTENTS, ALLOC, LOAD, DATA
  4 license                          00000004  0000000000000000  0000000000000000  00000164  2**0
                                     CONTENTS, ALLOC, LOAD, DATA
  5 version                          00000004  0000000000000000  0000000000000000  00000168  2**2
                                     CONTENTS, ALLOC, LOAD, DATA
  6 .eh_frame                        00000050  0000000000000000  0000000000000000  00000170  2**3
                                     CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
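The size of the maps section (0x1c, that is 28 bytes) can be cross-checked against the map definition. The sketch below rests on an assumption: that the `struct bpf_map_def` used by the samples of that era has seven 32-bit fields, of which the listing above initializes only the first four. The names `map_def_sketch` and `SKETCH_MAP_TYPE_HASH` are hypothetical and exist only in this example.

```c
#include <stddef.h>

/* Assumed field list (seven 32-bit fields); only the first four are
 * visible in the article's listing. */
struct map_def_sketch {
    unsigned int type;
    unsigned int key_size;
    unsigned int value_size;
    unsigned int max_entries;
    unsigned int map_flags;
    unsigned int inner_map_idx;
    unsigned int numa_node;
};

struct pair {
    unsigned long long val;
    unsigned long long ip;
};

enum { SKETCH_MAP_TYPE_HASH = 1 };  /* numeric value of BPF_MAP_TYPE_HASH */

/* Same initializer as my_map in tracex4_kern.c, minus the SEC("maps")
 * placement, which only matters to the eBPF loader. */
static const struct map_def_sketch my_map = {
    .type = SKETCH_MAP_TYPE_HASH,
    .key_size = sizeof(long),
    .value_size = sizeof(struct pair),
    .max_entries = 1000000,
};

/* One definition of this shape fills the 0x1c bytes objdump reports. */
_Static_assert(sizeof(struct map_def_sketch) == 0x1c,
               "one map definition should be 28 bytes");
```

Under that assumption, a single map definition accounts for the whole section, which is consistent with tracex4_kern.c declaring exactly one map.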

There is also tracex4_user.c, the main program. Basically, this program listens for kmem_cache_alloc_node events. When such an event occurs, the corresponding eBPF code is executed. That code stores the allocation time and the caller's instruction pointer (IP) for each object in the map, and the main program then prints the map's contents in a loop. For example:

$ sudo ./tracex4
obj 0xffff8d6430f60a00 is 2sec old was allocated at ip ffffffff9891ad90
obj 0xffff8d6062ca5e00 is 23sec old was allocated at ip ffffffff98090e8f
obj 0xffff8d5f80161780 is 6sec old was allocated at ip ffffffff98090e8f
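Each output line is derived from one map entry: the key is the object's address, and the value holds the allocation timestamp and the caller's instruction pointer. The following sketch shows how such a line could be produced. `format_old_object` is a hypothetical helper, not the sample's own function, and skipping objects younger than one second is an assumption made for this illustration.

```c
#include <stdint.h>
#include <stdio.h>

struct pair {
    uint64_t val;  /* allocation timestamp in nanoseconds */
    uint64_t ip;   /* instruction pointer of the caller */
};

#define NSEC_PER_SEC 1000000000ull

/* Formats one line in the style of the output above. Returns the number
 * of characters snprintf reports, or 0 when the object is younger than
 * one second and is therefore skipped. */
static int format_old_object(char *buf, size_t size, uint64_t key,
                             const struct pair *v, uint64_t now_ns)
{
    uint64_t age_ns = now_ns - v->val;

    if (age_ns < NSEC_PER_SEC)
        return 0;
    return snprintf(buf, size,
                    "obj 0x%llx is %llusec old was allocated at ip %llx",
                    (unsigned long long)key,
                    (unsigned long long)(age_ns / NSEC_PER_SEC),
                    (unsigned long long)v->ip);
}
```

The age is simply the difference between the current time and the stored timestamp, truncated to whole seconds.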

How are the user-space program and the eBPF program related? On initialization, tracex4_user.c loads the tracex4_kern.o object file using the load_bpf_file function.

int main(int ac, char **argv)
{
    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
    char filename[256];
    int i;

    snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);

    if (setrlimit(RLIMIT_MEMLOCK, &r)) {
        perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
        return 1;
    }

    if (load_bpf_file(filename)) {
        printf("%s", bpf_log_buf);
        return 1;
    }

    for (i = 0; ; i++) {
        print_old_objects(map_fd[1]);
        sleep(1);
    }

    return 0;
}

Inside load_bpf_file, the probes defined in the eBPF file are added to /sys/kernel/debug/tracing/kprobe_events. From then on we are listening for these events, and our program can do something when they occur.

$ sudo cat /sys/kernel/debug/tracing/kprobe_events
p:kprobes/kmem_cache_free kmem_cache_free
r:kprobes/kmem_cache_alloc_node kmem_cache_alloc_node
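The two lines above follow the kprobe_events syntax: `p` marks a probe on function entry, `r` marks a return probe, and the text after the colon is the group and event name. As a sketch of how a section name maps onto such a line, the hypothetical helper `probe_definition` below derives the text shown by cat from names like "kprobe/kmem_cache_free". It illustrates the format only and is not the code of load_bpf_file.

```c
#include <stdio.h>
#include <string.h>

/* Builds a kprobe_events-style line from an ELF section name.
 * Returns 0 on success, or -1 when the section is not a kprobe or
 * kretprobe section or the buffer is too small. */
static int probe_definition(const char *section, char *buf, size_t size)
{
    char kind;
    const char *func;
    int n;

    if (strncmp(section, "kprobe/", 7) == 0) {
        kind = 'p';  /* probe on function entry */
        func = section + 7;
    } else if (strncmp(section, "kretprobe/", 10) == 0) {
        kind = 'r';  /* probe on function return */
        func = section + 10;
    } else {
        return -1;
    }

    n = snprintf(buf, size, "%c:kprobes/%s %s", kind, func, func);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Sections such as maps or license do not describe probes, so the helper rejects them.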

All the other programs in samples/bpf/ are structured the same way. They always consist of two files:

  • XXX_kern.c: the eBPF program.
  • XXX_user.c: the main program.

The eBPF program defines maps and functions, each tied to a section. When the kernel emits an event of a certain type (for example, a tracepoint), the functions bound to it are executed. Maps provide communication between the kernel program and the user-space program.

Conclusion

This article discussed BPF and eBPF in general terms. I know there is a lot of information and many resources about eBPF today, so I'll recommend a few more materials for further study.

I recommend reading:

Source: www.habr.com
