From High Ceph Latency to Kernel Patch using eBPF/BCC

Linux has a large number of tools for debugging the kernel and applications. Most of them have a negative impact on application performance and cannot be used in production.

A few years ago another tool was developed: eBPF. It makes it possible to trace the kernel and user applications with low overhead, without rebuilding programs or loading third-party modules into the kernel.

There are already many utilities that use eBPF, and in this article we will look at how to write your own profiling utility based on the PythonBCC library. The article is based on real events. We will go from the problem to the fix to show how existing utilities can be used in specific situations.

Ceph is slow

A new host was added to the Ceph cluster. After migrating some of the data to it, we noticed that it processed write requests much more slowly than the other servers.

Unlike the other platforms, this host used bcache and the new Linux 4.15 kernel. This was the first time a host with this configuration was used here. At that point it was clear that the root of the problem could, in theory, be anything.

Investigating the host

Let's start by looking at what happens inside the ceph-osd process. For this we will use perf and flamescope (you can read more about it here):

The flame graph shows that the fdatasync() function spent a lot of time submitting requests to generic_make_request(). This means the cause of the problem most likely lies somewhere outside the osd daemon itself. It could be either the kernel or the disks. The iostat output showed high latency in request processing by the bcache disks.

While checking the host, we found that the systemd-udevd daemon was consuming a large amount of CPU time: about 20% on several cores. This is strange behavior, so we need to find out why. Since systemd-udevd works with uevents, we decided to look at them with udevadm monitor. It turned out that a large number of change events were being generated for every block device in the system. This is quite unusual, so we will need to find out what generates all these events.

Using the BCC toolkit

As we have already found out, the kernel (and the ceph daemon inside the system call) spends a lot of time in generic_make_request(). Let's try to measure how fast this function runs. BCC already has a great utility for this: funclatency. We will trace the daemon by its PID, with a 1-second interval between outputs, and print the result in milliseconds.

Normally this function runs quickly. All it does is pass the request to the device driver's queue.

Bcache is a complex device that actually consists of three disks:

  • the backing device (the cached disk), in this case a slow HDD;
  • the caching device (the caching disk), here a partition of an NVMe device;
  • the bcache virtual device that the application works with.

We know that request submission is slow, but for which of these devices? We will deal with this a little later.

We now know that the uevents are likely to be causing problems. Finding out what exactly generates them is not so easy. Let's assume it is some piece of software that is launched periodically. Let's see what software runs on the system using the execsnoop script from the same BCC toolkit. We run it and send the output to a file.

For example, like this:

/usr/share/bcc/tools/execsnoop  | tee ./execdump

We will not show the full execsnoop output here, but the line that interested us looked like this:

sh 1764905 5802 0 sudo arcconf getconfig 1 AD | grep Temperature | awk -F '[:/]' '{print $2}' | sed 's/^ \([0-9]*\) C.*/\1/'

The third column is the PPID (parent PID) of the process. The process with PID 5802 turned out to be one of the threads of our monitoring system. When we checked the monitoring system's configuration, we found incorrect parameters. The temperature of the HBA adapter was being read every 30 seconds, which is far more often than necessary. After changing the check interval to a longer one, we found that request processing latency on this host no longer stood out compared to the other hosts.
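With a large dump, finding the parent that spawns the most commands by eye is tedious. The following is a minimal sketch of how the saved file could be summarized by PPID. It assumes the default execsnoop column order (PCOMM, PID, PPID, RET, ARGS) and command names without spaces; the helper name and all sample lines except the arcconf one are made up for illustration.

```python
from collections import Counter

def count_by_ppid(lines):
    """Count executed commands per parent PID in execsnoop-style output."""
    counts = Counter()
    for line in lines:
        fields = line.split(None, 4)  # PCOMM PID PPID RET ARGS
        if len(fields) < 4 or not fields[2].isdigit():
            continue  # skip the header and malformed lines
        counts[int(fields[2])] += 1
    return counts

# Made-up sample in the same shape as the dump written by tee
sample = [
    "PCOMM PID PPID RET ARGS",
    "sh 1764905 5802 0 sudo arcconf getconfig 1 AD",
    "sh 1764911 5802 0 sudo arcconf getconfig 1 AD",
    "ls 1764920 4411 0 /bin/ls",
]
top = count_by_ppid(sample).most_common(1)
print(top)
```

On a real dump, the same function can be fed the lines of ./execdump; the parent with the highest count is the first candidate to inspect.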

But it is still unclear why the bcache device was so slow. We prepared a test platform with an identical configuration and tried to reproduce the problem by running fio on bcache while periodically running udevadm trigger to generate uevents.

Writing BCC-based tools

Let's try to write a simple utility that traces and displays the slowest calls to generic_make_request(). We are also interested in the name of the disk for which the function was called.

The plan is simple:

  • Register a kprobe on generic_make_request():
    • Save the disk name, which is available through the function argument;
    • Save the timestamp.

  • Register a kretprobe for the return from generic_make_request():
    • Get the current timestamp;
    • Look up the saved timestamp and compare it with the current one;
    • If the result exceeds the specified threshold, find the saved disk name and print it to the terminal.
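Before moving to BPF, the plan above can be modeled in plain Python to make the data flow explicit. This is only a sketch with no kernel involved: the class name, the method names and the sample PIDs and disks are hypothetical, and timestamps are passed in by hand instead of being read from a clock.

```python
class LatencyTracer:
    """Pure-Python model of the kprobe/kretprobe pair; no kernel involved."""

    def __init__(self, min_us):
        self.min_us = min_us
        self.in_flight = {}   # pid -> (disk name, entry timestamp in ns)
        self.events = []      # what the real tool would send to user space

    def on_entry(self, pid, disk, now_ns):
        # kprobe step: remember the disk name and the entry timestamp
        self.in_flight[pid] = (disk, now_ns)

    def on_return(self, pid, now_ns):
        # kretprobe step: find the saved entry and compare timestamps
        entry = self.in_flight.pop(pid, None)
        if entry is None:
            return
        disk, start_ns = entry
        lat_us = (now_ns - start_ns) // 1000
        if lat_us > self.min_us:
            self.events.append((pid, disk, lat_us))

tracer = LatencyTracer(min_us=1000)
tracer.on_entry(pid=42, disk="sdb", now_ns=0)
tracer.on_return(pid=42, now_ns=5_000_000)     # 5000 us: reported
tracer.on_entry(pid=43, disk="nvme0n1", now_ns=0)
tracer.on_return(pid=43, now_ns=200_000)       # 200 us: below the threshold
print(tracer.events)
```

The dictionary plays the role of the hash table shared by the two probes, and the list stands in for the channel to user space.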

Kprobes and kretprobes use a breakpoint mechanism to change function code on the fly. You can read the documentation and a good article on this topic. If you look at the code of the various utilities in BCC, you will see that they share an identical structure. So in this article we will skip the parsing of script arguments and move on to the BPF program itself.

The eBPF text inside the Python script looks like this:

bpf_text = """ # Here will be the bpf program code """

To exchange data between functions, eBPF programs use hash tables. We will do the same. We will use the process PID as the key and define a structure as the value:

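#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>	/* struct bio, DISK_NAME_LEN */

struct data_t {
	u64 pid;			/* value of bpf_get_current_pid_tgid() */
	u64 ts;				/* entry timestamp, ns */
	char comm[TASK_COMM_LEN];	/* process name */
	u64 lat;			/* measured latency */
	char disk[DISK_NAME_LEN];	/* disk the request was sent to */
};

BPF_HASH(p, u64, struct data_t);	/* state shared by both probes */
BPF_PERF_OUTPUT(events);		/* channel to user space */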

Here we register a hash table called p, with a key of type u64 and a value of type struct data_t. The table will be available in the context of our BPF program. The BPF_PERF_OUTPUT macro registers another table called events, which is used to pass data to user space.

When measuring delays between a function call and its return, or between calls to different functions, keep in mind that the data obtained must belong to the same context. In other words, you need to remember that functions may run in parallel. We could measure the latency between a function call in the context of one process and the return from that function in the context of another process, but that is most likely useless. A good example here is the biolatency utility, where the hash table key is a pointer to struct request, which represents a single disk request.
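The effect of the key choice can be illustrated with a small pure-Python sketch. It is not taken from any BCC tool: the function name and the event tuples are hypothetical, and the interleaving is invented to show what happens when two calls from the same PID overlap.

```python
def slowest_by_key(calls, key):
    """Return the worst latency seen, pairing entry and return events by `key`."""
    started, worst = {}, 0
    for kind, pid, req, ts in calls:
        k = key(pid, req)
        if kind == "entry":
            started[k] = ts          # a second entry with the same key overwrites
        elif k in started:
            worst = max(worst, ts - started.pop(k))
    return worst

# Hypothetical interleaving: pid 7 enters for request A, then again for B
calls = [
    ("entry", 7, "A", 0),
    ("entry", 7, "B", 90),
    ("return", 7, "B", 100),
    ("return", 7, "A", 100),
]
by_pid = slowest_by_key(calls, key=lambda pid, req: pid)
by_req = slowest_by_key(calls, key=lambda pid, req: req)
print(by_pid, by_req)
```

Keyed by PID, the second entry overwrites the first timestamp and the long call is lost; keyed by the request, both calls are measured separately.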

Next, we need to write the code that will run when the function under study is called:

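void start(struct pt_regs *ctx, struct bio *bio) {
	/* TGID in the upper 32 bits, thread ID in the lower 32 bits */
	u64 pid = bpf_get_current_pid_tgid();
	struct data_t data = {};
	u64 ts = bpf_ktime_get_ns();	/* entry timestamp, ns */
	data.pid = pid;
	data.ts = ts;
	/* copy the disk name out of the bio passed to generic_make_request() */
	bpf_probe_read_str(&data.disk, sizeof(data.disk), (void*)bio->bi_disk->disk_name);
	p.update(&pid, &data);		/* keep the entry until the call returns */
}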

Here the first argument of the traced function generic_make_request() is passed in as the second argument of our handler. After that, we get the PID of the process in whose context we are running and the current timestamp in nanoseconds. We write all of this into a freshly allocated struct data_t data. We take the disk name from the bio structure, which is passed in the call to generic_make_request(), and save it in the same data structure. The last step is to add an entry to the hash table mentioned earlier.

The following function will be called on return from generic_make_request():

void stop(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    /* find the entry saved by start() for this thread */
    struct data_t* data = p.lookup(&pid);
    if (data != 0 && data->ts > 0) {
        bpf_get_current_comm(&data->comm, sizeof(data->comm));
        data->lat = (ts - data->ts)/1000;   /* ns -> us */
        if (data->lat > MIN_US) {           /* MIN_US is replaced from Python */
            FACTOR
            data->pid >>= 32;               /* keep only the TGID */
            events.perf_submit(ctx, data, sizeof(struct data_t));
        }
        p.delete(&pid);
    }
}

This function is similar to the previous one: we find out the PID of the process and the timestamp, but we do not allocate memory for a new data structure. Instead, we search the hash table for an existing structure using the key == current PID. If the structure is found, we find out the name of the running process and add it to the structure.

The binary shift we use here is needed to get the thread's TGID, i.e. the PID of the main process that started the thread in whose context we are working. The bpf_get_current_pid_tgid() function we call returns both the TGID and the thread's own PID in a single 64-bit value.

When printing to the terminal, we are not interested in the thread but in the main process. After comparing the resulting delay with the given threshold, we pass our data structure to user space through the events table, and then delete the entry from p.
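The layout of that 64-bit value can be checked with ordinary integer arithmetic. The sketch below mirrors the `>>= 32` shift from the BPF code; the helper name and the thread ID 5810 are made up for illustration.

```python
def split_pid_tgid(value):
    """Split the 64-bit value the way the BPF program does with `>>= 32`."""
    tgid = value >> 32            # upper half: thread group ID (process PID)
    tid = value & 0xFFFFFFFF      # lower half: ID of the individual thread
    return tgid, tid

packed = (5802 << 32) | 5810      # made-up thread 5810 of process 5802
print(split_pid_tgid(packed))
```

The hash table key keeps the full 64-bit value, so each thread gets its own entry, while the reported PID is only the upper half.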

In the Python script that will load this code, we need to replace MIN_US and FACTOR with the delay threshold and the time units, which we pass in through the arguments:

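# substitute the threshold (in microseconds) into the C source
bpf_text = bpf_text.replace('MIN_US', str(min_usec))
if args.milliseconds:
    # convert the reported latency from microseconds to milliseconds
    bpf_text = bpf_text.replace('FACTOR', 'data->lat /= 1000;')
    label = "msec"
else:
    bpf_text = bpf_text.replace('FACTOR', '')
    label = "usec"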
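The argument parsing itself is skipped in this article, so here is one way it could look. This is a sketch under assumptions: the option names, the default threshold and the render() helper are hypothetical, and only the names min_usec and args.milliseconds come from the snippet above.

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Report slow generic_make_request() calls")
    # Hypothetical positional argument and default value
    parser.add_argument("min_usec", nargs="?", type=int, default=1000,
                        help="latency threshold in microseconds")
    parser.add_argument("-m", "--milliseconds", action="store_true",
                        help="print latency in milliseconds")
    return parser.parse_args(argv)

def render(bpf_text, args):
    """Fill the MIN_US and FACTOR placeholders in the C source."""
    bpf_text = bpf_text.replace("MIN_US", str(args.min_usec))
    factor = "data->lat /= 1000;" if args.milliseconds else ""
    return bpf_text.replace("FACTOR", factor)

args = parse_args(["5000", "-m"])
rendered = render("if (data->lat > MIN_US) { FACTOR }", args)
print(rendered)
```

Note that the threshold is always compared in microseconds; the FACTOR line only changes the unit of the value that is reported afterwards.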

Now we need to prepare the BPF program through the BPF class and register the probes:

from bcc import BPF

# compile and load the program, then attach both probes
b = BPF(text=bpf_text)
b.attach_kprobe(event="generic_make_request", fn_name="start")
b.attach_kretprobe(event="generic_make_request", fn_name="stop")

We will also have to define struct data_t in our script, otherwise we will not be able to read anything:

import ctypes as ct

TASK_COMM_LEN = 16    # linux/sched.h
DISK_NAME_LEN = 32    # linux/genhd.h

# field order and sizes must match struct data_t in the BPF program
class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]
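A quick way to gain confidence in such a mirror structure is to check its size and round-trip a record through raw bytes, which is roughly what happens when an event arrives from the kernel. The sketch below needs no BPF at all; the sample values are made up.

```python
import ctypes as ct

TASK_COMM_LEN = 16
DISK_NAME_LEN = 32

class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]

# Fill a record in a raw buffer, then decode a copy of those bytes
raw = bytearray(ct.sizeof(Data))
event = Data.from_buffer(raw)
event.pid, event.lat = 5802, 1500
event.comm, event.disk = b"ceph-osd", b"sdb"
copy = Data.from_buffer_copy(bytes(raw))
print(ct.sizeof(Data), copy.comm, copy.disk)
```

Three 8-byte integers plus 16 and 32 bytes of character data give 72 bytes with no padding; if the C structure changes, this number is the first thing to re-check.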

The last step is to print the data to the terminal:

def print_event(cpu, data, size):
    global start
    event = ct.cast(data, ct.POINTER(Data)).contents
    if start == 0:
        start = event.ts
    # seconds elapsed since the first reported event
    time_s = (float(event.ts - start)) / 1000000000
    # comm and disk are bytes under Python 3, so decode them before printing
    comm = event.comm.decode("utf-8", "replace")
    disk = event.disk.decode("utf-8", "replace")
    print("%-18.9f %-16s %-6d   %-1s %s   %s" % (time_s, comm, event.pid, event.lat, label, disk))

b["events"].open_perf_buffer(print_event)
# format output
start = 0
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()

The script itself is available on GitHub. Let's try to run it on the test platform where fio is running and writing to bcache, and call udevadm monitor:

Finally! Now we see that what looked like a stalling bcache device is actually a stalling call to generic_make_request() for the cached disk.

Digging into the kernel

What exactly slows down during request submission? We see that the delay occurs even before request accounting starts, i.e. the accounting of a specific request for later statistics output (/proc/diskstats or iostat) has not yet begun. This is easy to verify by running iostat while reproducing the problem, or the BCC script biolatency, which is based on the start and end of request accounting. Neither of these utilities will show problems for requests to the cached disk.

If we look at the generic_make_request() function, we will see that two more functions are called before request accounting starts. The first, generic_make_request_checks(), checks that the request is legitimate with respect to the disk settings. The second, blk_queue_enter(), contains an interesting call to wait_event_interruptible():

ret = wait_event_interruptible(q->mq_freeze_wq,
	(atomic_read(&q->mq_freeze_depth) == 0 &&
	(preempt || !blk_queue_preempt_only(q))) ||
	blk_queue_dying(q));

In it, the kernel waits for the queue to be unfrozen. Let's measure the latency of blk_queue_enter():

~# /usr/share/bcc/tools/funclatency blk_queue_enter -i 1 -m
Tracing 1 functions for "blk_queue_enter"... Hit Ctrl-C to end.

     msecs               : count     distribution
         0 -> 1          : 341      |****************************************|

     msecs               : count     distribution
         0 -> 1          : 316      |****************************************|

     msecs               : count     distribution
         0 -> 1          : 255      |****************************************|
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 1        |                                        |
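The row labels in this output are power-of-two ranges, which is why a single slow call ends up in the 8 -> 15 row. The following sketch reproduces that bucketing in plain Python. It mirrors the labels printed above and is not a copy of how BCC builds its histograms; the function names and the sample latencies are made up.

```python
def log2_bucket(value):
    """Return the (low, high) bounds of the power-of-two bucket for value."""
    if value < 2:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)
    return (low, 2 * low - 1)

def histogram(latencies_ms):
    """Count integer latencies per power-of-two bucket."""
    counts = {}
    for v in latencies_ms:
        bucket = log2_bucket(v)
        counts[bucket] = counts.get(bucket, 0) + 1
    return dict(sorted(counts.items()))

hist = histogram([0, 1, 1, 3, 9, 12])   # made-up latencies in milliseconds
print(hist)
```

Such buckets keep the table short even when latencies span several orders of magnitude, at the cost of precision within each row.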

It looks like we are close to a solution. The functions used to freeze and unfreeze a queue are blk_mq_freeze_queue and blk_mq_unfreeze_queue. They are used when it is necessary to change request queue settings that are potentially dangerous for the requests in that queue. When blk_mq_freeze_queue() is called, the function blk_freeze_queue_start() increments the counter q->mq_freeze_depth. After that, the kernel waits for the queue to empty in blk_mq_freeze_queue_wait().

The time it takes to drain the queue is equivalent to the disk latency, because the kernel waits for all queued operations to complete. Once the queue is empty, the settings changes are applied. After that, blk_mq_unfreeze_queue() is called, which decrements the mq_freeze_depth counter.
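The sequence described above can be followed step by step in a toy model. This is a single-threaded sketch, not kernel code: the class and method names are hypothetical, and where the real blk_queue_enter() sleeps, the model simply refuses the request.

```python
class QueueModel:
    """Toy single-threaded model of a freezable request queue."""

    def __init__(self):
        self.freeze_depth = 0   # plays the role of q->mq_freeze_depth
        self.in_flight = 0      # requests accepted but not yet completed

    def try_enter(self):
        # The real blk_queue_enter() sleeps here; the model just refuses.
        if self.freeze_depth > 0:
            return False
        self.in_flight += 1
        return True

    def complete(self):
        self.in_flight -= 1

    def freeze_start(self):
        self.freeze_depth += 1

    def drained(self):
        return self.in_flight == 0

    def unfreeze(self):
        self.freeze_depth -= 1

q = QueueModel()
assert q.try_enter()          # request 1 is accepted
q.freeze_start()              # a settings change begins
blocked = not q.try_enter()   # request 2 has to wait
q.complete()                  # request 1 finishes, the queue drains
ready = q.drained()
q.unfreeze()
resumed = q.try_enter()       # request 2 can proceed now
print(blocked, ready, resumed)
```

The model makes the cost visible: every new request is held back for as long as the slowest request already in the queue takes to finish.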

Now we know enough to fix the situation. The udevadm trigger command causes the settings for the block device to be applied. These settings are described in udev rules. We can find out which settings freeze the queue by trying to change them through sysfs or by looking at the kernel source code. We can also try the BCC utility trace, which prints the kernel and user space stack traces for each call to blk_freeze_queue, for example:

~# /usr/share/bcc/tools/trace blk_freeze_queue -K -U
PID 	TID 	COMM        	FUNC        	 
3809642 3809642 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	elevator_switch+0x29 [kernel]
    	elv_iosched_store+0x197 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

3809631 3809631 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	queue_requests_store+0xb6 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

Udev rules change quite rarely, and usually this happens in a controlled manner. So we see that even applying values that are already set causes a spike in the delay of passing a request from the application to the disk. Of course, generating udev events when there are no changes in the disk configuration (for example, no device is being attached or detached) is not good practice. However, we can help the kernel avoid unnecessary work and not freeze the request queue when it is not needed. Three small commits fix the situation.
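The commits themselves are not shown here, so the following is only an illustration of the idea of skipping the freeze when the written value equals the current one. It is not the kernel code: the class, its fields and the scheduler names used as sample values are hypothetical.

```python
class Setting:
    """Toy model of a queue attribute whose update requires a freeze."""

    def __init__(self, value):
        self.value = value
        self.freezes = 0

    def store(self, new_value):
        if new_value == self.value:
            return False          # nothing to change: skip the freeze
        self.freezes += 1         # stands in for freeze, apply, unfreeze
        self.value = new_value
        return True

scheduler = Setting("deadline")
for _ in range(3):
    scheduler.store("deadline")   # the same value is applied again
scheduler.store("noop")
print(scheduler.freezes)
```

With the comparison in place, repeated writes of an unchanged value cost nothing, and only a real change pays for draining the queue.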

Conclusion

eBPF is a very flexible and powerful tool. In this article we looked at one practical case and demonstrated a small part of what can be done. If you are interested in developing BCC utilities, it is worth taking a look at the official tutorial, which describes the basics well.

There are other interesting debugging and profiling tools based on eBPF. One of them is bpftrace, which lets you write powerful one-liners and small programs in an awk-like language. Another is ebpf_exporter, which lets you collect low-level, high-resolution metrics directly into your prometheus server, so that you can later get nice visualizations and even alerts.

Source: www.habr.com
