From High Ceph Latency to a Kernel Patch Using eBPF/BCC

Linux has many tools for debugging the kernel and applications. Most of them hurt application performance and cannot be used in production.

A couple of years ago another tool was developed: eBPF. It makes it possible to trace the kernel and user applications with low overhead, without rebuilding programs or loading third-party modules into the kernel.

Many utilities already use eBPF, and in this article we will look at how to write your own profiling utility based on the PythonBCC library. The article is based on real events. We will go from the problem to the fix to show how existing utilities can be used in specific situations.

Ceph Is Slow

A new host was added to the Ceph cluster. After migrating some of the data to it, we noticed that it processed write requests much more slowly than the other servers.

Unlike the other hosts, this one used bcache and the new linux 4.15 kernel. This was the first time a host with this configuration was used here, and at that point it was clear that the root of the problem could, in theory, be anything.

Investigating the Host

Let's start by looking at what happens inside the ceph-osd process. For this we will use perf and flamescope:

The picture tells us that fdatasync() spent a lot of time submitting requests to generic_make_request(). This means the cause of the problems most likely lies outside the osd daemon itself: either in the kernel or in the disks. The iostat output showed high latency in request processing by the bcache disks.

While checking the host, we found that the systemd-udevd daemon was consuming a large amount of CPU time, about 20% on several cores. This is strange behavior, so we need to find out why. Since systemd-udevd works with uevents, we decided to look at them with udevadm monitor. It turned out that a large number of change events were being generated for every block device in the system. This is quite unusual, so we need to find out what generates all these events.
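To put a number on "a large number of change events", the output of udevadm monitor can be tallied per device. The sketch below is a minimal illustration, assuming the usual `KERNEL[timestamp] action devpath (subsystem)` line shape; the sample lines and device paths are made up for the example.

```python
import re
from collections import Counter

# Matches lines shaped like udevadm monitor output, e.g.
#   KERNEL[1234.567890] change   /devices/.../block/sda (block)
EVENT_RE = re.compile(r"^(KERNEL|UDEV)\s*\[\s*[\d.]+\]\s+(\w+)\s+(\S+)\s+\((\w+)\)$")

def count_block_changes(lines):
    """Count kernel 'change' uevents per block device path."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:
            continue  # header lines, blank lines
        source, action, devpath, subsystem = m.groups()
        if source == "KERNEL" and action == "change" and subsystem == "block":
            counts[devpath] += 1
    return counts

# Hypothetical sample, not captured from the host in the article.
sample = [
    "KERNEL[100.000001] change   /devices/virtual/block/bcache0 (block)",
    "UDEV  [100.000500] change   /devices/virtual/block/bcache0 (block)",
    "KERNEL[130.000002] change   /devices/virtual/block/bcache0 (block)",
    "KERNEL[130.000003] add      /devices/virtual/net/veth0 (net)",
]
changes = count_block_changes(sample)
print(changes.most_common())
```

Only the KERNEL lines are counted so that each uevent is tallied once, since udev re-emits the same event after processing its rules.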

Using the BCC Toolkit

As we have already found out, the kernel (and the ceph daemon inside the system call) spends a lot of time in generic_make_request(). Let's try to measure the speed of this function. BCC already has a wonderful utility for this: funclatency. We will trace the daemon by its PID, with a 1-second interval between outputs, and print the result in milliseconds.
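The invocation described above can be assembled as follows. This is only a sketch: the PID is a placeholder, the tool path is the one used elsewhere in this article, and actually running the command requires root and an installed BCC.

```python
def funclatency_argv(pid, function="generic_make_request", interval_s=1):
    """Build an argv list for BCC's funclatency: one PID, periodic output, msec buckets."""
    return [
        "/usr/share/bcc/tools/funclatency",
        "-p", str(pid),         # trace only this process
        "-i", str(interval_s),  # print a histogram every interval_s seconds
        "-m",                   # report latency in milliseconds
        function,
    ]

# 12345 is a placeholder PID for the ceph-osd daemon.
argv = funclatency_argv(12345)
print(" ".join(argv))
```

The list form can be passed directly to subprocess.run() without going through a shell.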

This function usually runs quickly. All it does is pass the request to the device driver's queue.

Bcache is a complex device that actually consists of three disks:

  • backing device (cached disk), in this case a slow HDD;
  • caching device (caching disk), here one partition of the NVMe device;
  • the bcache virtual device that the application works with.

We know that request submission is slow, but for which of these devices? We will deal with that a little later.

We now know that the uevents are likely to be causing problems. Finding out what exactly generates them is not so easy. Let's assume it is some piece of software that is launched periodically. We can see what software runs on the system using the execsnoop script from the same BCC utility kit. Let's run it and send the output to a file.

For example, like this:

/usr/share/bcc/tools/execsnoop  | tee ./execdump

We will not show the full execsnoop output here, but one line of interest looked like this:

sh 1764905 5802 0 sudo arcconf getconfig 1 AD | grep Temperature | awk -F '[:/]' '{print $2}' | sed 's/^ \([0-9]*\) C.*/\1/'

The third column is the PPID (parent PID) of the process. The process with PID 5802 turned out to be one of the threads of our monitoring system. Checking the monitoring system's configuration revealed erroneous parameters: the temperature of the HBA adapter was being read every 30 seconds, which is far more often than necessary. After changing the check interval to a longer one, we found that the request processing latency on this host no longer stood out compared with the other hosts.
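Picking the parent out of a saved execsnoop dump can be scripted. The sketch below assumes the default column order PCOMM, PID, PPID, RET, ARGS and uses a shortened copy of the line shown above; the function name is ours.

```python
def parse_execsnoop_line(line):
    """Split one execsnoop line into its columns: PCOMM PID PPID RET ARGS."""
    pcomm, pid, ppid, ret, args = line.split(None, 4)
    return {"pcomm": pcomm, "pid": int(pid), "ppid": int(ppid),
            "ret": int(ret), "args": args}

line = "sh 1764905 5802 0 sudo arcconf getconfig 1 AD | grep Temperature"
rec = parse_execsnoop_line(line)
print(rec["ppid"], rec["args"])
```

Limiting the split to four cuts keeps the whole command line, spaces included, in the last field.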

But it was still unclear why the bcache device was so slow. We prepared a test platform with an identical configuration and tried to reproduce the problem by running fio on bcache while periodically running udevadm trigger to generate uevents.

Writing BCC-Based Tools

Let's try to write a simple utility that traces and displays the slowest calls to generic_make_request(). We are also interested in the name of the drive for which the function was called.

The plan is simple:

  • Register a kprobe on generic_make_request():
    • Save the disk name in memory; it is accessible through the function argument;
    • Save the timestamp.

  • Register a kretprobe for the return from generic_make_request():
    • Get the current timestamp;
    • Look up the saved timestamp and compare it with the current one;
    • If the result is greater than the specified threshold, find the saved disk name and display it on the terminal.
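Before moving to real probes, the plan can be modelled in plain Python. The sketch below involves no kernel at all: the class, its method names, and the sample timestamps are invented for illustration, and a dictionary stands in for the storage shared between the entry and return handlers.

```python
class SlowCallTracer:
    """Pure-Python model of the kprobe/kretprobe plan (no kernel involved)."""

    def __init__(self, threshold_ns):
        self.threshold_ns = threshold_ns
        self.inflight = {}   # caller id -> (disk name, entry timestamp)
        self.reported = []   # (disk name, latency in ns) for slow calls

    def on_entry(self, caller, disk, now_ns):
        # kprobe step: remember the disk name and the entry timestamp.
        self.inflight[caller] = (disk, now_ns)

    def on_return(self, caller, now_ns):
        # kretprobe step: compare the stored timestamp with the current one.
        entry = self.inflight.pop(caller, None)
        if entry is None:
            return  # return without a recorded entry, nothing to compare
        disk, started = entry
        latency = now_ns - started
        if latency > self.threshold_ns:
            self.reported.append((disk, latency))

tracer = SlowCallTracer(threshold_ns=1_000_000)      # 1 ms threshold
tracer.on_entry(caller=101, disk="sdb", now_ns=0)
tracer.on_return(caller=101, now_ns=5_000_000)       # 5 ms: reported
tracer.on_entry(caller=102, disk="nvme0n1", now_ns=10)
tracer.on_return(caller=102, now_ns=20_010)          # 20 us: ignored
print(tracer.reported)
```

The entry is removed on every return, slow or not, so the table does not grow with the number of calls.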

Kprobes and kretprobes use a breakpoint mechanism to change function code on the fly. The documentation and a good article on the topic explain this in detail. If you look at the code of the various utilities in BCC, you will see that they share the same structure. So in this article we will skip argument parsing and move straight to the BPF program itself.

The eBPF text inside the python script looks like this:

bpf_text = """ # Here will be the bpf program code """

To exchange data between functions, eBPF programs use hash tables. We will do the same. We will use the process PID as the key and define a structure as the value:

struct data_t {
	u64 pid;
	u64 ts;
	char comm[TASK_COMM_LEN];
	u64 lat;
	char disk[DISK_NAME_LEN];
};

BPF_HASH(p, u64, struct data_t);
BPF_PERF_OUTPUT(events);

Here we register a hash table called p, with a key of type u64 and a value of type struct data_t. The table is available in the context of our BPF program. The BPF_PERF_OUTPUT macro registers another table, called events, which is used to pass data to user space.

When measuring delays between a function call and its return, or between calls to different functions, keep in mind that the data you collect must belong to the same context. In other words, remember that functions may run in parallel. It is possible to measure the latency between a function call in the context of one process and the return from that function in the context of another process, but this is most likely useless. A good example here is the biolatency utility, where the hash table key is a pointer to struct request, which represents one disk request.
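The effect of the key choice can be seen in a small model. The sketch below is a plain-Python illustration with invented event records: a request is issued in one context and completed in another, so pairing by PID finds nothing while pairing by request identity works.

```python
def match_latency(events, key_field):
    """Pair 'start' and 'done' events that share the same key; return latencies."""
    started, latencies = {}, []
    for ev in events:
        key = ev[key_field]
        if ev["kind"] == "start":
            started[key] = ev["ts"]
        elif key in started:
            latencies.append(ev["ts"] - started.pop(key))
    return latencies

# Hypothetical trace: the request is issued by PID 4242 but completed in
# another context (PID 0 here), as happens with asynchronous disk I/O.
events = [
    {"kind": "start", "pid": 4242, "req": 0xA1, "ts": 100},
    {"kind": "done",  "pid": 0,    "req": 0xA1, "ts": 350},
]
by_pid = match_latency(events, "pid")   # contexts differ, nothing pairs up
by_req = match_latency(events, "req")   # the request identity pairs them
print(by_pid, by_req)
```

For generic_make_request() the entry and the return happen in the same thread, which is why a PID-based key is sufficient for the tool in this article.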

Next, we need to write the code that runs when the function under study is called:

void start(struct pt_regs *ctx, struct bio *bio) {
	u64 pid = bpf_get_current_pid_tgid();
	struct data_t data = {};
	u64 ts = bpf_ktime_get_ns();
	data.pid = pid;
	data.ts = ts;
	bpf_probe_read_str(&data.disk, sizeof(data.disk), (void*)bio->bi_disk->disk_name);
	p.update(&pid, &data);
}

Here the first argument of the traced function generic_make_request() is passed as the second argument of our handler. After that, we get the PID of the process in whose context we are running and the current timestamp in nanoseconds. We write all of this into a freshly allocated struct data_t data. We take the disk name from the bio structure, which is passed in the call to generic_make_request(), and save it in the same data structure. The last step is to add an entry to the hash table mentioned earlier.

The following function is called on return from generic_make_request():

void stop(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    struct data_t* data = p.lookup(&pid);
    if (data != 0 && data->ts > 0) {
        bpf_get_current_comm(&data->comm, sizeof(data->comm));
        data->lat = (ts - data->ts)/1000;
        if (data->lat > MIN_US) {
            FACTOR
            data->pid >>= 32;
            events.perf_submit(ctx, data, sizeof(struct data_t));
        }
        p.delete(&pid);
    }
}

This function is similar to the previous one: we get the PID of the process and the timestamp, but we do not allocate memory for a new data structure. Instead, we search the hash table for an existing structure using the key == current PID. If the structure is found, we get the name of the running process and add it to the structure.

The binary shift we use here is needed to get the thread group ID (TGID), i.e. the PID of the main process that started the thread in whose context we are running. The function bpf_get_current_pid_tgid() returns both the TGID and the thread's own PID in a single 64-bit value.
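The packing can be shown with ordinary integer arithmetic. In the sketch below the helper name and the sample IDs are hypothetical; the layout (thread group ID in the upper 32 bits, thread ID in the lower 32 bits) is the one bpf_get_current_pid_tgid() uses.

```python
def split_pid_tgid(value):
    """Split the 64-bit result of bpf_get_current_pid_tgid() into (tgid, tid)."""
    tgid = value >> 32           # upper half: thread group ID (the PID seen in user space)
    tid = value & 0xFFFFFFFF     # lower half: kernel PID, i.e. the thread ID
    return tgid, tid

# Hypothetical value: process 5802 with a worker thread 5810.
packed = (5802 << 32) | 5810
print(split_pid_tgid(packed))
```

Using the full 64-bit value as the hash key keeps threads of the same process apart, while the shifted value is what gets printed.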

When printing to the terminal, we are not interested in the thread for now, only in the main process. After comparing the resulting delay with the given threshold, we pass our data structure to user space through the events table and then delete the entry from p.

In the python script that loads this code, we need to replace MIN_US and FACTOR with the delay threshold and the time unit, which we pass in through the arguments:

bpf_text = bpf_text.replace('MIN_US',str(min_usec))
if args.milliseconds:
	bpf_text = bpf_text.replace('FACTOR','data->lat /= 1000;')
	label = "msec"
else:
	bpf_text = bpf_text.replace('FACTOR','')
	label = "usec"
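The argument parsing that this article skips could look roughly like the sketch below. The option names and the default threshold are assumptions made for the example, not taken from the original script, and the template is a one-line stand-in for the real BPF text.

```python
import argparse

def parse_args(argv):
    # Option names here are illustrative assumptions, not the original script's.
    parser = argparse.ArgumentParser(description="Trace slow generic_make_request() calls")
    parser.add_argument("-m", "--milliseconds", action="store_true",
                        help="print latency in milliseconds instead of microseconds")
    parser.add_argument("min_usec", nargs="?", type=int, default=1000,
                        help="report only calls slower than this many microseconds")
    return parser.parse_args(argv)

def render(template, args):
    """Fill the MIN_US and FACTOR placeholders of the BPF program text."""
    text = template.replace("MIN_US", str(args.min_usec))
    factor = "data->lat /= 1000;" if args.milliseconds else ""
    return text.replace("FACTOR", factor)

template = "if (data->lat > MIN_US) { FACTOR }"
args = parse_args(["-m", "2000"])
rendered = render(template, args)
print(rendered)
```

Substituting text before compilation means the threshold becomes a constant in the BPF program, so no extra map lookup is needed at run time.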

Now we need to prepare the BPF program through the BPF class and register the probes:

b = BPF(text=bpf_text)
b.attach_kprobe(event="generic_make_request",fn_name="start")
b.attach_kretprobe(event="generic_make_request",fn_name="stop")

We also have to define struct data_t in our script, otherwise we will not be able to read anything:

import ctypes as ct

TASK_COMM_LEN = 16    # linux/sched.h
DISK_NAME_LEN = 32    # linux/genhd.h

class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]
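The ctypes class has to match the C structure byte for byte. The sketch below checks this without BCC by packing a fake record with the struct module and reading it back; the field values are invented for the example.

```python
import ctypes as ct
import struct

TASK_COMM_LEN = 16
DISK_NAME_LEN = 32

class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]

# Fake record laid out like struct data_t: u64, u64, char[16], u64, char[32].
raw = struct.pack("=QQ16sQ32s", 5802, 123456789, b"ceph-osd", 42, b"sdb")
event = Data.from_buffer_copy(raw)
print(ct.sizeof(Data), event.pid, event.comm, event.lat, event.disk)
```

If a field were missing or misordered on the Python side, every field after it would be read from the wrong offset, which is why the definition cannot be skipped.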

The last step is to print the data to the terminal:

def print_event(cpu, data, size):
    global start
    event = ct.cast(data, ct.POINTER(Data)).contents
    if start == 0:
        start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
    print("%-18.9f %-16s %-6d   %-1s %s   %s" % (time_s, event.comm, event.pid, event.lat, label, event.disk))

b["events"].open_perf_buffer(print_event)
# format output
start = 0
while 1:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()

The script itself is available on GitHub. Let's try running it on a test platform where fio is running and writing to bcache, and call udevadm monitor:

Finally! Now we see that what looked like a stalling bcache device is actually a stalling call to generic_make_request() for the cached disk.

Digging into the Kernel

What exactly slows down during request submission? We see that the delay occurs even before request accounting starts, i.e. accounting of the specific request for the statistics later reported about it (/proc/diskstats or iostat) has not yet begun. This is easy to verify by running iostat while reproducing the problem, or the BCC script biolatency, which is based on the start and end of request accounting. Neither of these utilities shows any problem for requests to the cached disk.

If we look at generic_make_request(), we see that two more functions are called before request accounting starts. The first, generic_make_request_checks(), checks the legitimacy of the request against the disk settings. The second, blk_queue_enter(), contains an interesting call to wait_event_interruptible():

ret = wait_event_interruptible(q->mq_freeze_wq,
	(atomic_read(&q->mq_freeze_depth) == 0 &&
	(preempt || !blk_queue_preempt_only(q))) ||
	blk_queue_dying(q));

Here the kernel waits for the queue to unfreeze. Let's measure the latency of blk_queue_enter():

~# /usr/share/bcc/tools/funclatency  blk_queue_enter -i 1 -m               	 
Tracing 1 functions for "blk_queue_enter"... Hit Ctrl-C to end.

 	msecs           	: count 	distribution
     	0 -> 1      	: 341  	|****************************************|

 	msecs           	: count 	distribution
     	0 -> 1      	: 316  	|****************************************|

 	msecs           	: count 	distribution
     	0 -> 1      	: 255  	|****************************************|
     	2 -> 3      	: 0    	|                                    	|
     	4 -> 7      	: 0    	|                                    	|
     	8 -> 15     	: 1    	|                                    	|
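The rows in this output are power-of-two latency buckets. The sketch below imitates that bucketing in plain Python to show why a single call of around 9 ms lands in the 8 -> 15 row; it is an illustration of the idea, not the code funclatency runs, and the sample values are invented.

```python
from collections import Counter

def bucket_range(value):
    """Return the (low, high) power-of-two bucket that holds value."""
    if value < 2:
        return (0, 1)
    k = value.bit_length() - 1          # floor(log2(value))
    return (1 << k, (1 << (k + 1)) - 1)

def histogram(latencies_ms):
    """Count how many samples fall into each bucket."""
    return Counter(bucket_range(v) for v in latencies_ms)

# Hypothetical samples: many fast calls and one slow one.
hist = histogram([0] * 255 + [9])
for (low, high), count in sorted(hist.items()):
    print(f"{low:>6} -> {high:<6} : {count}")
```

A histogram like this makes a rare slow call visible even when the average stays close to zero.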

It looks like we are close to a solution. The functions used to freeze and unfreeze a queue are blk_mq_freeze_queue and blk_mq_unfreeze_queue. They are used when request queue settings need to be changed in a way that is potentially dangerous for requests already in the queue. When blk_mq_freeze_queue() is called, blk_freeze_queue_start() increments the counter q->mq_freeze_depth. After that, the kernel waits for the queue to drain in blk_mq_freeze_queue_wait().

The time it takes to drain the queue is comparable to disk latency, because the kernel waits for all queued operations to complete. Once the queue is empty, the settings changes are applied. Then blk_mq_unfreeze_queue() is called, which decrements the mq_freeze_depth counter.
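The interplay between the freeze counter and the wait condition in blk_queue_enter() can be modelled in a few lines. The sketch below is a toy with invented class and method names; only the boolean condition is taken from the kernel snippet quoted earlier, and the draining of the queue is left out.

```python
class RequestQueue:
    """Toy model of the freeze counter described above (not kernel code)."""

    def __init__(self):
        self.mq_freeze_depth = 0
        self.dying = False

    def freeze(self):
        self.mq_freeze_depth += 1      # what blk_freeze_queue_start() does to the counter

    def unfreeze(self):
        self.mq_freeze_depth -= 1      # what blk_mq_unfreeze_queue() does to the counter

    def can_enter(self, preempt=False, preempt_only=False):
        # Mirrors the wait condition in blk_queue_enter(): a caller proceeds
        # only when this is true, otherwise it sleeps on mq_freeze_wq.
        return (self.mq_freeze_depth == 0
                and (preempt or not preempt_only)) or self.dying

q = RequestQueue()
before = q.can_enter()
q.freeze()
during = q.can_enter()
q.unfreeze()
after = q.can_enter()
print(before, during, after)
```

Every new request arriving between freeze and unfreeze has to wait, which is how a settings write turns into latency for unrelated I/O.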

Now we know enough to fix the situation. The udevadm trigger command causes the settings for a block device to be applied. These settings are described in udev rules. We can find out which settings freeze the queue by trying to change them through sysfs or by looking at the kernel source code. We can also try the BCC utility trace, which prints kernel and userspace stack traces for every call to blk_freeze_queue, for example:

~# /usr/share/bcc/tools/trace blk_freeze_queue -K -U
PID 	TID 	COMM        	FUNC        	 
3809642 3809642 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	elevator_switch+0x29 [kernel]
    	elv_iosched_store+0x197 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

3809631 3809631 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	queue_requests_store+0xb6 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]
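With many such traces, the interesting frame is the sysfs store handler just above the generic queue_attr_store() dispatcher. The sketch below extracts those handlers from text shaped like the output above; the helper name is ours, and the sample is a shortened copy of the two traces.

```python
def store_handlers(trace_text):
    """Return the sysfs store functions found in kernel stack frames."""
    handlers = []
    for line in trace_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[-1] != "[kernel]":
            continue
        func = parts[0].split("+")[0]          # drop the +0x... offset
        # queue_attr_store is the generic dispatcher, so skip it.
        if func.endswith("_store") and func != "queue_attr_store":
            handlers.append(func)
    return handlers

sample = """\
    blk_freeze_queue+0x1 [kernel]
    elevator_switch+0x29 [kernel]
    elv_iosched_store+0x197 [kernel]
    queue_attr_store+0x5c [kernel]
    blk_freeze_queue+0x1 [kernel]
    queue_requests_store+0xb6 [kernel]
    queue_attr_store+0x5c [kernel]
"""
found = store_handlers(sample)
print(found)
```

In the kernel source these two handlers appear to back the queue/scheduler and queue/nr_requests attributes in sysfs, which points at the udev rules that write to them.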

Udev rules change quite rarely, and usually in a controlled manner. So we see that even applying values that are already set causes a spike in the delay of passing a request from the application to the disk. Of course, generating udev events when nothing has changed in the disk configuration (for example, no device has been attached or removed) is not good practice. However, we can help the kernel avoid unnecessary work and not freeze the request queue when it is not needed. Three small commits fix the situation.
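The idea of skipping the freeze when nothing changes can be illustrated with a toy model. The sketch below is not the content of the commits: the class, the method, and the numbers are hypothetical and only show the effect of an early return when the new value equals the current one.

```python
class Queue:
    """Toy queue that counts how often a settings write freezes it."""

    def __init__(self, nr_requests):
        self.nr_requests = nr_requests
        self.freezes = 0

    def set_nr_requests(self, value):
        if value == self.nr_requests:
            return                      # nothing changes: skip the freeze
        self.freezes += 1               # stands in for freeze, wait, unfreeze
        self.nr_requests = value

q = Queue(nr_requests=128)
for _ in range(10):
    q.set_nr_requests(128)              # repeated udev trigger, same value
q.set_nr_requests(256)                  # a real change still freezes once
print(q.freezes)
```

With the early return, re-applying unchanged udev rules no longer stalls requests that are entering the queue.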

Conclusion

eBPF is a very flexible and powerful tool. In this article we looked at one practical case and demonstrated a small part of what can be done with it. If you are interested in developing BCC utilities, the official tutorial is worth a look; it explains the basics well.

There are other interesting debugging and profiling tools based on eBPF. One of them is bpftrace, which lets you write powerful one-liners and small programs in an awk-like language. Another, ebpf_exporter, lets you collect low-level, high-resolution metrics directly into your prometheus server, with the option of getting nice visualizations and even alerts later.

Source: www.habr.com
