From High Ceph Latency to a Kernel Patch with eBPF/BCC

Linux has a large number of tools for debugging the kernel and applications. Most of them hurt application performance and cannot be used in production.

A couple of years ago another tool appeared: eBPF. It makes it possible to trace the kernel and user applications with low overhead, without rebuilding programs and without loading third-party modules into the kernel.

Many application utilities already use eBPF. In this article we look at how to write your own profiling utility based on the PythonBCC library. The article is based on real events. We go from the problem to the fix to show how existing utilities can be used in specific situations.

Ceph Is Slow

A new host was added to the Ceph cluster. After migrating some of the data to it, we noticed that it processed write requests much more slowly than the other servers.

Unlike the other platforms, this host used bcache and the new Linux 4.15 kernel. This was the first time a host with this configuration had been used here, and at that point it was clear that the root of the problem could theoretically be anything.

Investigating the Host

Let's start by looking at what happens inside the ceph-osd process. For this we will use perf and flamescope (more about which you can read here):

[Figure: flame graph of the ceph-osd process]
The picture tells us that the function fdatasync() spent a lot of time submitting a request to generic_make_request(). This means that the cause of the problem is most likely somewhere outside the osd daemon itself: either the kernel or the disks. The iostat output showed high request latency on the bcache disks.

While checking the host, we found that the systemd-udevd daemon was consuming a large amount of CPU time: about 20% on several cores. This is strange behavior, so we needed to find out why. Since systemd-udevd works with uevents, we decided to look at them through udevadm monitor. It turned out that a large number of change events were being generated for every block device in the system. This is quite unusual, so we had to find out what was generating all these events.
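To make the "large number of change events" observation concrete, captured udevadm monitor output can be tallied per device. The following is only a minimal sketch: the sample lines are made up for illustration, and the regular expression assumes the usual `KERNEL[timestamp] action devpath (subsystem)` line layout.

```python
import re
from collections import Counter

# Assumed line layout of `udevadm monitor`, for example:
#   KERNEL[1234.567890] change   /devices/virtual/block/bcache0 (block)
EVENT_RE = re.compile(
    r"^(KERNEL|UDEV)\s*\[\s*[\d.]+\]\s+(\w+)\s+(\S+)\s+\((\w+)\)$"
)


def count_block_events(lines):
    """Count kernel uevents per (action, device) for the block subsystem."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:
            continue  # banner lines, blank lines, anything unexpected
        source, action, devpath, subsystem = m.groups()
        if source == "KERNEL" and subsystem == "block":
            device = devpath.rsplit("/", 1)[-1]
            counts[(action, device)] += 1
    return counts


# Hypothetical capture, not taken from the host described in the article.
SAMPLE = """\
KERNEL[100.000001] change   /devices/virtual/block/bcache0 (block)
UDEV  [100.010000] change   /devices/virtual/block/bcache0 (block)
KERNEL[100.020000] change   /devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda (block)
KERNEL[130.000001] change   /devices/virtual/block/bcache0 (block)
KERNEL[130.500000] add      /devices/virtual/net/veth0 (net)
"""

if __name__ == "__main__":
    for (action, dev), n in count_block_events(SAMPLE.splitlines()).most_common():
        print(f"{n:4d}  {action:8s} {dev}")
```

Only KERNEL lines are counted so that each uevent is tallied once rather than once per source.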

Using the BCC Toolkit

As we have already found out, the kernel (and the ceph daemon in the system call) spends a lot of time in generic_make_request(). Let's try to measure the speed of this function. BCC already has a wonderful utility for this: funclatency. We will trace the daemon by its PID, with a 1-second interval between outputs, and print the result in milliseconds.

[Figure: funclatency output for generic_make_request()]
This function usually works quickly. All it does is pass the request to the device driver queue.

Bcache is a complex device that actually consists of three disks:

  • the backing device (cached disk), in this case a slow HDD;
  • the caching device (caching disk), here one partition of the NVMe device;
  • the bcache virtual device that the application works with.

We know that request submission is slow, but for which of these devices? We will deal with this a little later.

We now know that the uevents are likely to be causing problems. Finding out what exactly generates them is not so easy. Let's assume that it is some kind of software that is launched periodically. Let's see what software runs on the system using the execsnoop script from the same BCC utility kit. We run it and send the output to a file.

For example, like this:

/usr/share/bcc/tools/execsnoop  | tee ./execdump

We will not show the full execsnoop output here, but one line of interest looked like this:

sh 1764905 5802 0 sudo arcconf getconfig 1 AD | grep Temperature | awk -F '[:/]' '{print $2}' | sed 's/^ \([0-9]*\) C.*/\1/'

The third column is the PPID (parent PID) of the process. The process with PID 5802 turned out to be one of the threads of our monitoring system. When we checked the monitoring configuration, we found erroneous parameters: the temperature of the HBA adapter was being read every 30 seconds, which is far more often than necessary. After changing the check interval to a longer one, the request processing latency on this host no longer stood out compared with the other hosts.
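Finding the noisy parent by eye is tedious in a long capture, so the saved execsnoop output can be grouped by PPID. This is a rough sketch: it assumes the column order PCOMM, PID, PPID, RET, ARGS, and the sample lines other than the arcconf one are invented for illustration.

```python
from collections import Counter


def parse_execsnoop_line(line):
    """Split one execsnoop line into (pcomm, pid, ppid, ret, args), or None."""
    parts = line.split(None, 4)
    if len(parts) < 4 or not (parts[1].isdigit() and parts[2].isdigit()):
        return None  # header line or something malformed
    pcomm, pid, ppid, ret = parts[:4]
    args = parts[4] if len(parts) > 4 else ""
    return pcomm, int(pid), int(ppid), int(ret), args


def top_parents(lines, n=5):
    """Return the n parent PIDs that spawned the most processes."""
    parents = Counter()
    for line in lines:
        parsed = parse_execsnoop_line(line)
        if parsed is not None:
            parents[parsed[2]] += 1
    return parents.most_common(n)


# Hypothetical capture; only the first data line is modelled on the article.
SAMPLE = [
    "PCOMM            PID    PPID   RET ARGS",
    "sh 1764905 5802 0 sudo arcconf getconfig 1 AD",
    "sudo 1764906 1764905 0 /usr/bin/sudo arcconf getconfig 1 AD",
    "sh 1764990 5802 0 sudo arcconf getconfig 1 AD",
    "ls 1765001 4242 0 /bin/ls",
]

if __name__ == "__main__":
    for ppid, count in top_parents(SAMPLE):
        print(f"PPID {ppid}: {count} exec() calls")
```

On a real capture the lines would come from the file written by tee, for example `open("./execdump")`.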

But it was still not clear why the bcache device was so slow. We prepared a test platform with an identical configuration and tried to reproduce the problem by running fio on bcache while periodically running udevadm trigger to generate uevents.

Writing a Utility with BCC

Let's try to write a simple utility that traces and displays the slowest calls to generic_make_request(). We are also interested in the name of the disk for which the function was called.

The plan is simple:

  • Register a kprobe on generic_make_request():
    • Save the disk name, which is accessible through the function argument, in memory;
    • Save the timestamp.

  • Register a kretprobe on the return from generic_make_request():
    • Get the current timestamp;
    • Look up the saved timestamp and compare it with the current one;
    • If the difference is greater than the specified threshold, look up the saved disk name and print it to the terminal.
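The plan above can be modelled in plain Python before any BPF code is written. This is only a user-space sketch of the bookkeeping: the class and method names are hypothetical, and timestamps are passed in explicitly instead of being read from the kernel.

```python
class LatencyTracer:
    """Toy model of the kprobe/kretprobe pairing described in the plan."""

    def __init__(self, threshold_us):
        self.threshold_us = threshold_us
        self.inflight = {}  # pid -> (entry timestamp in ns, disk name)
        self.slow_calls = []  # (pid, disk, latency in us)

    def on_entry(self, pid, disk, now_ns):
        # kprobe step: remember the disk name and the entry timestamp.
        self.inflight[pid] = (now_ns, disk)

    def on_return(self, pid, now_ns):
        # kretprobe step: find the saved entry and compare timestamps.
        entry = self.inflight.pop(pid, None)
        if entry is None:
            return  # return without a recorded entry, nothing to measure
        start_ns, disk = entry
        latency_us = (now_ns - start_ns) // 1000
        if latency_us > self.threshold_us:
            self.slow_calls.append((pid, disk, latency_us))


if __name__ == "__main__":
    tracer = LatencyTracer(threshold_us=1000)
    tracer.on_entry(pid=1, disk="sdb", now_ns=0)
    tracer.on_entry(pid=2, disk="nvme0n1", now_ns=0)
    tracer.on_return(pid=2, now_ns=500_000)    # 500 us, below the threshold
    tracer.on_return(pid=1, now_ns=5_000_000)  # 5000 us, reported
    print(tracer.slow_calls)
```

Keying the dictionary by PID mirrors the choice made for the real tool; it works because the entry and the return of a call happen in the same task.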

Kprobes and kretprobes use a breakpoint mechanism to change function code on the fly. You can read the documentation and a good article on this topic. If you look at the code of the various BCC utilities, you will see that they share the same structure. So in this article we will skip the parsing of script arguments and move on to the BPF program itself.

The eBPF text inside the Python script looks like this:

bpf_text = """ # Here will be the bpf program code """

To exchange data between functions, eBPF programs use hash tables. We will do the same. We will use the process PID as the key and define a structure as the value:

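#include <linux/sched.h>    // TASK_COMM_LEN
#include <linux/genhd.h>    // DISK_NAME_LEN
#include <linux/blkdev.h>   // struct bio

struct data_t {
	u64 pid;                    // value returned by bpf_get_current_pid_tgid()
	u64 ts;                     // entry timestamp, ns
	char comm[TASK_COMM_LEN];   // process name
	u64 lat;                    // measured latency
	char disk[DISK_NAME_LEN];   // disk name taken from the bio
};

BPF_HASH(p, u64, struct data_t);   // in-flight calls, keyed by pid
BPF_PERF_OUTPUT(events);           // channel to user space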

Here we register a hash table called p, with a key of type u64 and a value of type struct data_t. The table is available in the context of our BPF program. The BPF_PERF_OUTPUT macro registers another table called events, which is used to pass data to user space.

When measuring the delay between a function call and its return, or between calls to different functions, keep in mind that the data you collect must belong to the same context. In other words, remember that functions may run in parallel. It is possible to measure the latency between a function call in the context of one process and the return from that function in the context of another process, but this is most likely useless. A good example here is the biolatency utility, where the hash table key is a pointer to struct request, which represents a single disk request.

Next, we need to write the code that will run when the function under study is called:

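// kprobe handler: runs on entry to generic_make_request(bio).
void start(struct pt_regs *ctx, struct bio *bio) {
	u64 pid = bpf_get_current_pid_tgid();   // TGID in the upper 32 bits, thread ID in the lower
	struct data_t data = {};
	u64 ts = bpf_ktime_get_ns();
	data.pid = pid;
	data.ts = ts;
	// Copy the disk name out of kernel memory.
	bpf_probe_read_str(&data.disk, sizeof(data.disk), (void*)bio->bi_disk->disk_name);
	p.update(&pid, &data);   // remember this call until the kretprobe fires
}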

Here the first argument of generic_make_request() is passed as the second argument of our handler (the first is always the pt_regs context). After that, we get the PID of the process in whose context we are running and the current timestamp in nanoseconds. We write all of this into a freshly allocated struct data_t data. We take the disk name from the bio structure, which is passed when generic_make_request() is called, and save it in the same data structure. The last step is to add an entry to the hash table mentioned earlier.

The following function will be called on return from generic_make_request():

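// kretprobe handler: runs on return from generic_make_request().
void stop(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    struct data_t* data = p.lookup(&pid);
    if (data != 0 && data->ts > 0) {
        bpf_get_current_comm(&data->comm, sizeof(data->comm));
        data->lat = (ts - data->ts)/1000;   // ns -> us
        if (data->lat > MIN_US) {           // the threshold is compared in us
            FACTOR                          // substituted from Python: optional us -> ms
            data->pid >>= 32;               // keep only the TGID (process ID)
            events.perf_submit(ctx, data, sizeof(struct data_t));
        }
        p.delete(&pid);
    }
}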

This function is similar to the previous one: we find out the PID of the process and the timestamp, but we do not allocate memory for a new data structure. Instead, we search the hash table for an existing structure using the key == current PID. If the structure is found, we find out the name of the running process and add it to the structure.

The binary shift used here is needed to get the thread group ID (TGID), i.e. the PID of the main process that started the thread in whose context we are running. The function we call, bpf_get_current_pid_tgid(), returns both the TGID and the thread's own PID in a single 64-bit value.
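The packing of the two IDs into one 64-bit value is easy to check with ordinary integer arithmetic. The sketch below uses a hypothetical helper name and made-up IDs; it only illustrates the shift and the mask, with the TGID in the upper 32 bits and the thread ID in the lower 32 bits.

```python
def split_pid_tgid(value):
    """Split a 64-bit pid_tgid value into (tgid, thread_id)."""
    tgid = value >> 32              # upper 32 bits: process (thread group) ID
    thread_id = value & 0xFFFFFFFF  # lower 32 bits: kernel PID of the thread
    return tgid, thread_id


if __name__ == "__main__":
    # Made-up example: process 5802, thread 5810.
    combined = (5802 << 32) | 5810
    print(split_pid_tgid(combined))
```

This is why the BPF code applies `>>= 32` before submitting the event: only the upper half is kept for display.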

When printing to the terminal, we are currently not interested in the thread, only in the main process. After comparing the resulting delay with the given threshold, we pass our data structure to user space via the events table and then delete the entry from p.

In the Python script that will load this code, we need to replace MIN_US and FACTOR with the delay threshold and the time unit, which we pass in through the arguments:

bpf_text = bpf_text.replace('MIN_US', str(min_usec))
if args.milliseconds:
    bpf_text = bpf_text.replace('FACTOR', 'data->lat /= 1000;')
    label = "msec"
else:
    bpf_text = bpf_text.replace('FACTOR', '')
    label = "usec"
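The substitution step can be tried in isolation, without loading anything into the kernel. The following sketch wraps the same string replacements in a hypothetical helper and applies them to a shortened stand-in template, not to the real program text.

```python
def render_bpf_text(template, min_usec, milliseconds=False):
    """Fill in the MIN_US and FACTOR placeholders; return (text, unit label)."""
    text = template.replace("MIN_US", str(min_usec))
    if milliseconds:
        # Convert the already computed microsecond latency to milliseconds.
        return text.replace("FACTOR", "data->lat /= 1000;"), "msec"
    return text.replace("FACTOR", ""), "usec"


# Shortened stand-in for the BPF source, for illustration only.
TEMPLATE = "if (data->lat > MIN_US) { FACTOR submit(); }"

if __name__ == "__main__":
    print(render_bpf_text(TEMPLATE, 100, milliseconds=True))
    print(render_bpf_text(TEMPLATE, 100))
```

Note that in the BPF code the comparison with MIN_US happens before FACTOR runs, so the threshold is interpreted in microseconds regardless of the output unit.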

Now we need to prepare the BPF program via the BPF macro and register the probes:

from bcc import BPF

b = BPF(text=bpf_text)   # compile and load the program
b.attach_kprobe(event="generic_make_request", fn_name="start")
b.attach_kretprobe(event="generic_make_request", fn_name="stop")

We also have to define struct data_t in our script, otherwise we will not be able to read anything:

import ctypes as ct

TASK_COMM_LEN = 16  # linux/sched.h
DISK_NAME_LEN = 32  # linux/genhd.h

# Must mirror struct data_t field for field.
class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]
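To see why the Python-side definition has to match the C structure, it helps to round-trip a record through raw bytes. The sketch below repeats the structure so that it runs on its own; the field values are invented, and the byte buffer merely stands in for what a perf buffer would deliver.

```python
import ctypes as ct

TASK_COMM_LEN = 16
DISK_NAME_LEN = 32


class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]


# Build a record with made-up values and serialise it to raw bytes.
original = Data(pid=5802, ts=123456789, comm=b"ceph-osd", lat=4200, disk=b"sdb")
raw = bytes(original)

# Decode the bytes again, as the event callback does with the kernel's data.
event = Data.from_buffer_copy(raw)

if __name__ == "__main__":
    print(ct.sizeof(Data), event.pid, event.comm.decode(), event.lat, event.disk.decode())
```

If a field were missing or had a different size on one side, every field after it would be read from the wrong offset.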

The last step is to print the data to the terminal:

start = 0  # timestamp of the first event, used as time zero

def print_event(cpu, data, size):
    global start
    event = ct.cast(data, ct.POINTER(Data)).contents
    if start == 0:
        start = event.ts
    time_s = float(event.ts - start) / 1000000000
    # comm and disk arrive as bytes; decode them for printing
    print("%-18.9f %-16s %-6d   %-1s %s   %s" % (
        time_s, event.comm.decode("utf-8", "replace"), event.pid,
        event.lat, label, event.disk.decode("utf-8", "replace")))

b["events"].open_perf_buffer(print_event)
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()

The script itself is available on GitHub. Let's try to run it on the test platform where fio is running and writing to bcache, and call udevadm monitor:

[Figure: output of the script on the test platform]
Finally! Now we see that what looked like a stalling bcache device is actually a stalling call to generic_make_request() for the cached disk.

Digging into the Kernel

What exactly slows down during request submission? We see that the delay occurs even before request accounting starts, i.e. the accounting of a specific request for later output of statistics about it (/proc/diskstats or iostat) has not yet begun. This is easy to verify by running iostat while reproducing the problem, or the BCC script biolatency, which is based on the start and end of request accounting. Neither of these utilities shows any problem for requests to the cached disk.

If we look at generic_make_request(), we see that two more functions are called before request accounting starts. The first, generic_make_request_checks(), checks that the request is legitimate with respect to the disk settings. The second, blk_queue_enter(), contains an interesting call to wait_event_interruptible():

ret = wait_event_interruptible(q->mq_freeze_wq,
	(atomic_read(&q->mq_freeze_depth) == 0 &&
	(preempt || !blk_queue_preempt_only(q))) ||
	blk_queue_dying(q));

In it, the kernel waits for the queue to unfreeze. Let's measure the latency of blk_queue_enter():

~# /usr/share/bcc/tools/funclatency  blk_queue_enter -i 1 -m               	 
Tracing 1 functions for "blk_queue_enter"... Hit Ctrl-C to end.

 	msecs           	: count 	distribution
     	0 -> 1      	: 341  	|****************************************|

 	msecs           	: count 	distribution
     	0 -> 1      	: 316  	|****************************************|

 	msecs           	: count 	distribution
     	0 -> 1      	: 255  	|****************************************|
     	2 -> 3      	: 0    	|                                    	|
     	4 -> 7      	: 0    	|                                    	|
     	8 -> 15     	: 1    	|                                    	|
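The ranges in this output (0 -> 1, 2 -> 3, 4 -> 7, 8 -> 15) are power-of-two buckets. As a rough illustration of how a latency value falls into such a bucket, here is a small sketch; it is not BCC's actual code, the function names are hypothetical, and the sample values only imitate the last interval shown above.

```python
from collections import Counter


def bucket_bounds(value):
    """Return the (low, high) power-of-two bucket that holds value."""
    slot = max(int(value).bit_length(), 1)  # 0 and 1 share the first bucket
    low = 0 if slot == 1 else 1 << (slot - 1)
    high = (1 << slot) - 1
    return low, high


def log2_histogram(values):
    """Count values per bucket, ordered by bucket."""
    return dict(sorted(Counter(bucket_bounds(v) for v in values).items()))


if __name__ == "__main__":
    # 255 fast calls and one call that took 12 ms.
    samples_ms = [0] * 255 + [12]
    for (low, high), count in log2_histogram(samples_ms).items():
        print(f"{low:6d} -> {high:<6d} : {count}")
```

A single outlier in the 8 -> 15 ms bucket is exactly the kind of rare stall that an average would hide.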

It looks like we are close to a solution. The functions used to freeze and unfreeze a queue are blk_mq_freeze_queue and blk_mq_unfreeze_queue. They are used when request queue settings that are potentially dangerous for requests already in the queue need to be changed. When blk_mq_freeze_queue() is called, blk_freeze_queue_start() increments the counter q->mq_freeze_depth. After that, the kernel waits for the queue to drain in blk_mq_freeze_queue_wait().

The time it takes to drain the queue is equivalent to the disk latency, because the kernel waits for all queued operations to complete. Once the queue is empty, the settings changes are applied. After that blk_mq_unfreeze_queue() is called, which decrements the mq_freeze_depth counter.
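The freeze counter logic can be imitated with a condition variable. This is a toy user-space model under stated assumptions, not kernel code: the class and method names are hypothetical, and only the two ideas from the text are kept, namely that new requests wait while the depth is non-zero and that a freeze waits for in-flight requests to finish.

```python
import threading


class FreezableQueue:
    """Toy model of a request queue with a freeze depth counter."""

    def __init__(self):
        self._cond = threading.Condition()
        self.freeze_depth = 0
        self.inflight = 0

    def enter(self, timeout=None):
        """Admit a request once the queue is not frozen; False on timeout."""
        with self._cond:
            admitted = self._cond.wait_for(lambda: self.freeze_depth == 0, timeout)
            if admitted:
                self.inflight += 1
            return admitted

    def exit(self):
        """Mark one request as completed."""
        with self._cond:
            self.inflight -= 1
            self._cond.notify_all()

    def freeze(self):
        """Raise the depth, then wait until all in-flight requests finish."""
        with self._cond:
            self.freeze_depth += 1
            self._cond.wait_for(lambda: self.inflight == 0)

    def unfreeze(self):
        """Lower the depth and wake waiting requests when it reaches zero."""
        with self._cond:
            self.freeze_depth -= 1
            if self.freeze_depth == 0:
                self._cond.notify_all()


if __name__ == "__main__":
    q = FreezableQueue()
    q.freeze()
    print("admitted while frozen:", q.enter(timeout=0))
    q.unfreeze()
    print("admitted after unfreeze:", q.enter(timeout=0))
```

In this model, as in the description above, every freeze stalls new submissions for as long as the slowest in-flight request takes, even when the setting being written does not actually change.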

Now we know enough to fix the situation. The udevadm trigger command causes the settings for the block devices to be applied. These settings are described in the udev rules. We can find out which settings freeze the queue by trying to change them through sysfs or by looking at the kernel source code. We can also try the BCC utility trace, which prints the kernel and user-space stack traces to the terminal for each call to blk_freeze_queue, for example:

~# /usr/share/bcc/tools/trace blk_freeze_queue -K -U
PID 	TID 	COMM        	FUNC        	 
3809642 3809642 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	elevator_switch+0x29 [kernel]
    	elv_iosched_store+0x197 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

3809631 3809631 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	queue_requests_store+0xb6 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

Udev rules change quite rarely, and usually this happens in a controlled manner. So we see that even applying values that are already set causes a spike in the delay of passing a request from the application to the disk. Of course, generating udev events when there are no changes in the disk configuration (for example, no device is being attached or detached) is not good practice. However, we can help the kernel avoid unnecessary work and not freeze the request queue when it is not needed. Three small commits fix the situation.

Conclusion

eBPF is a very flexible and powerful tool. In this article we looked at one practical case and demonstrated a small part of what can be done. If you are interested in developing BCC utilities, it is worth taking a look at the official tutorial, which describes the basics well.

There are other interesting debugging and profiling tools based on eBPF. One of them is bpftrace, which lets you write powerful one-liners and small programs in an awk-like language. Another, ebpf_exporter, lets you collect low-level, high-resolution metrics directly into your prometheus server, with the ability to get nice visualizations and even alerts later.

Source: www.hab.com
