From High Ceph Latency to Kernel Patch with eBPF/BCC

Linux has a large number of tools for debugging the kernel and applications. Most of them have a negative impact on application performance and cannot be used in production.

A few years ago another tool was developed: eBPF. It makes it possible to trace the kernel and user applications with low overhead, without rebuilding programs or loading third-party modules into the kernel.

There are already many utilities that use eBPF, and in this article we will look at how to write your own profiling utility based on the PythonBCC library. The article is based on real events. We will go from the problem to the fix to show how existing utilities can be used in specific situations.

Ceph Is Slow

A new host was added to the Ceph cluster. After migrating some data to it, we noticed that it processed write requests much more slowly than the other servers.

Unlike the other platforms, this host used bcache and the new Linux 4.15 kernel. This was the first time a host with this configuration had been used here. At that point it was clear that the root of the problem could be anything.

Investigating the Host

Let's start by looking at what happens inside the ceph-osd process. For this we will use perf and flamescope (you can read more about it here):

The picture tells us that the function fdatasync() spent a lot of time submitting a request to the function generic_make_request(). This means that the cause of the problems is most likely somewhere outside the osd daemon itself. It could be either the kernel or the disks. The iostat output showed high latency in request processing by the bcache disks.

While checking the host, we found that the systemd-udevd daemon was consuming a large amount of CPU time: about 20% on several cores. This is strange behavior, so we need to find out why. Since systemd-udevd works with uevents, we decided to watch them with udevadm monitor. It turned out that a large number of change events were being generated for every block device in the system. This is quite unusual, so we will have to find out what generates all these events.

Using the BCC Toolkit

As we have already found out, the kernel (and the ceph daemon inside a system call) spends a lot of time in generic_make_request(). Let's try to measure how fast this function runs. BCC already has a wonderful utility for this: funclatency. We will trace the daemon by its PID with a 1-second interval between outputs and print the result in milliseconds.

This function usually runs quickly. All it does is pass the request to the device driver queue.

Bcache is a complex device that actually consists of three disks:

  • the backing device (cached disk), in this case a slow HDD;
  • the caching device (caching disk), here one partition of the NVMe device;
  • the bcache virtual device that the application works with.

We know that request submission is slow, but for which of these devices? We will deal with this a little later.

We now know that the uevents are likely causing problems. Finding out what exactly generates them is not so easy. Let's assume that it is some kind of software that is launched periodically. Let's see what software runs on the system using the execsnoop script from the same BCC toolkit. We will run it and send the output to a file.

For example, like this:

/usr/share/bcc/tools/execsnoop  | tee ./execdump

We will not show the full execsnoop output here, but one line of interest to us looked like this:

sh 1764905 5802 0 sudo arcconf getconfig 1 AD | grep Temperature | awk -F '[:/]' '{print $2}' | sed 's/^ \([0-9]*\) C.*/\1/'

The third column is the PPID (parent PID) of the process. The process with PID 5802 turned out to be one of the threads of our monitoring system. When we checked the monitoring system configuration, we found erroneous parameters. The temperature of the HBA adapter was being read every 30 seconds, which is much more often than necessary. After changing the check interval to a longer one, we found that the request processing latency on this host no longer stood out compared to the other hosts.
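A saved execsnoop dump can be summarized per parent PID to spot a process that launches commands unusually often. The sketch below is only an illustration: it assumes whitespace-separated columns in the order PCOMM, PID, PPID, RET, ARGS, as in the line above. The helper name count_by_ppid is hypothetical, and the third sample line is made up.

```python
from collections import Counter


def count_by_ppid(lines):
    """Count exec events per parent PID in execsnoop-style lines.

    Assumes whitespace-separated columns: PCOMM PID PPID RET ARGS.
    """
    counts = Counter()
    for line in lines:
        fields = line.split(None, 4)
        # Skip the header and any line that is too short or non-numeric.
        if len(fields) < 4 or not fields[2].isdigit():
            continue
        counts[int(fields[2])] += 1
    return counts


# The first line is a header, the second is taken from the output above
# (arguments shortened), the third is a made-up line for illustration.
sample = [
    "PCOMM PID PPID RET ARGS",
    "sh 1764905 5802 0 sudo arcconf getconfig 1 AD",
    "ls 1764910 4242 0 /bin/ls",
]
counts = count_by_ppid(sample)
print(counts.most_common())
```

On a real dump, the parent PIDs at the top of the list are the first candidates to inspect.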

But it is still unclear why the bcache device was so slow. We prepared a test platform with an identical configuration and tried to reproduce the problem by running fio on bcache while periodically running udevadm trigger to generate uevents.

Writing BCC-Based Tools

Let's try to write a simple utility that traces and displays the slowest calls to generic_make_request(). We are also interested in the name of the disk for which the function was called.

The plan is simple:

  • Register a kprobe on generic_make_request():
    • Save the disk name, which is accessible through the function argument, in memory;
    • Save the timestamp.

  • Register a kretprobe for the return from generic_make_request():
    • Get the current timestamp;
    • Look up the saved timestamp and compare it with the current one;
    • If the result is greater than the specified threshold, find the saved disk name and print it to the terminal.
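The bookkeeping in this plan can be modeled in plain Python before writing any BPF code. This is only a sketch of the logic, not BCC code. The names on_entry, on_return, and THRESHOLD_US are hypothetical, the disk name is made up, and timestamps are passed in explicitly to keep the example deterministic.

```python
THRESHOLD_US = 1000  # hypothetical latency threshold in microseconds

# Maps a caller id (the PID in the plan above) to (disk name, entry time).
in_flight = {}


def on_entry(pid, disk, ts_ns):
    """Mimics the kprobe: remember the disk name and the entry timestamp."""
    in_flight[pid] = (disk, ts_ns)


def on_return(pid, ts_ns):
    """Mimics the kretprobe: return (disk, latency_us) for slow calls only."""
    entry = in_flight.pop(pid, None)
    if entry is None:
        return None  # a return without a recorded entry, nothing to compare
    disk, start_ns = entry
    latency_us = (ts_ns - start_ns) // 1000
    if latency_us > THRESHOLD_US:
        return disk, latency_us
    return None


on_entry(100, "sdb", 0)
slow = on_return(100, 5_000_000)   # returns 5 ms after entry
on_entry(100, "sdb", 6_000_000)
fast = on_return(100, 6_010_000)   # returns 10 us after entry
print(slow, fast)
```

Only the first call crosses the threshold, so only it would be reported. The entry is removed from the dictionary in both cases.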

Kprobes and kretprobes use a breakpoint mechanism to change function code on the fly. You can read the documentation and a good article on this topic. If you look at the code of the various utilities in BCC, you will see that they all have the same structure. So in this article we will skip the parsing of script arguments and move on to the BPF program itself.

The eBPF text inside the python script looks like this:

bpf_text = """ # Here will be the bpf program code """

To exchange data between functions, eBPF programs use hash tables. We will do the same. We will use the process PID as the key and define a structure as the value:

struct data_t {
	u64 pid;
	u64 ts;
	char comm[TASK_COMM_LEN];
	u64 lat;
	char disk[DISK_NAME_LEN];
};

BPF_HASH(p, u64, struct data_t);
BPF_PERF_OUTPUT(events);

Here we register a hash table called p, with a key of type u64 and a value of type struct data_t. The table will be available in the context of our BPF program. The BPF_PERF_OUTPUT macro registers another table called events, which is used to pass data to user space.

When measuring delays between a function call and its return, or between calls to different functions, you need to make sure that the collected data belongs to the same context. In other words, you need to keep in mind that functions may run in parallel. It is possible to measure the latency between a function call in the context of one process and the return from that function in the context of another process, but this is most likely useless. A good example here is the biolatency utility, where the hash table key is a pointer to struct request, which represents a single disk request.

Next, we need to write the code that will run when the function under study is called:

void start(struct pt_regs *ctx, struct bio *bio) {
	u64 pid = bpf_get_current_pid_tgid();
	struct data_t data = {};
	u64 ts = bpf_ktime_get_ns();
	data.pid = pid;
	data.ts = ts;
	bpf_probe_read_str(&data.disk, sizeof(data.disk), (void*)bio->bi_disk->disk_name);
	p.update(&pid, &data);
}

Here the first argument of generic_make_request() is passed in as the second argument of our function. After that, we get the PID of the process in whose context we are running, and the current timestamp in nanoseconds. We write all of this into a freshly allocated struct data_t data. We get the disk name from the bio structure, which is passed when generic_make_request() is called, and save it in the same data structure. The last step is to add an entry to the hash table mentioned earlier.

The following function will be called on return from generic_make_request():

void stop(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    struct data_t* data = p.lookup(&pid);
    if (data != 0 && data->ts > 0) {
        bpf_get_current_comm(&data->comm, sizeof(data->comm));
        data->lat = (ts - data->ts)/1000;
        if (data->lat > MIN_US) {
            FACTOR
            data->pid >>= 32;
            events.perf_submit(ctx, data, sizeof(struct data_t));
        }
        p.delete(&pid);
    }
}

This function is similar to the previous one: we get the PID of the process and the timestamp, but we do not allocate memory for a new data structure. Instead, we search the hash table for an existing structure using key == current PID. If the structure is found, we get the name of the running process and add it to the structure.

The binary shift we use here is needed to get the TGID (thread group ID), i.e. the PID of the main process that started the thread in whose context we are running. The function we call, bpf_get_current_pid_tgid(), returns both the TGID and the thread's own PID in a single 64-bit value: the TGID in the upper 32 bits and the thread PID in the lower 32 bits.
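The layout of that 64-bit value can be checked in plain Python. This is a minimal sketch: the helper name split_pid_tgid and the IDs are made up for illustration.

```python
def split_pid_tgid(value):
    """Split the 64-bit value into (tgid, tid).

    The upper 32 bits hold the thread group ID (the process PID as seen
    from user space), the lower 32 bits hold the kernel thread ID.
    """
    tgid = value >> 32
    tid = value & 0xFFFFFFFF
    return tgid, tid


# Made-up IDs for illustration: process 5802 with a thread 5810.
packed = (5802 << 32) | 5810
print(split_pid_tgid(packed))
```

The `data->pid >>= 32` statement in the stop() function performs the first half of this split and keeps only the process ID.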

When printing to the terminal, we are currently interested in the main process rather than the thread. After comparing the measured latency with the given threshold, we pass our data structure to user space through the events table, and then delete the entry from p.

In the python script that will load this code, we need to replace MIN_US and FACTOR with the latency threshold and the time unit conversion, which we pass in through the arguments:

bpf_text = bpf_text.replace('MIN_US',str(min_usec))
if args.milliseconds:
	bpf_text = bpf_text.replace('FACTOR','data->lat /= 1000;')
	label = "msec"
else:
	bpf_text = bpf_text.replace('FACTOR','')
	label = "usec"

Now we need to prepare the BPF program using the BPF class and register the probes:

b = BPF(text=bpf_text)
b.attach_kprobe(event="generic_make_request",fn_name="start")
b.attach_kretprobe(event="generic_make_request",fn_name="stop")

We will also need to define struct data_t in our script, otherwise we will not be able to read anything:

import ctypes as ct

TASK_COMM_LEN = 16  # linux/sched.h
DISK_NAME_LEN = 32  # linux/genhd.h

# The field order and sizes must match struct data_t in the BPF program.
class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]
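To see why the Python-side definition has to mirror the C structure, the same layout can be exercised without BCC. The sketch below packs a record by hand with made-up values and decodes it through the ctypes class. The use of struct.pack and from_buffer_copy is only for illustration and is not how BCC delivers events.

```python
import ctypes as ct
import struct

TASK_COMM_LEN = 16
DISK_NAME_LEN = 32


class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]


# Build a raw record by hand with made-up values, in the same field order:
# two u64, a 16-byte name, one u64, a 32-byte name (72 bytes in total).
raw = struct.pack("=QQ16sQ32s", 5802, 123456789, b"ceph-osd", 42, b"sdb")
event = Data.from_buffer_copy(raw)
print(ct.sizeof(Data), event.pid, event.comm, event.lat, event.disk)
```

If a field were missing or reordered on the Python side, the later fields would be read from the wrong offsets.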

The last step is to output the data to the terminal:

def print_event(cpu, data, size):
    global start
    event = ct.cast(data, ct.POINTER(Data)).contents
    if start == 0:
        start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
    print("%-18.9f %-16s %-6d   %-1s %s   %s" % (time_s, event.comm, event.pid, event.lat, label, event.disk))

b["events"].open_perf_buffer(print_event)
# format output
start = 0
while 1:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()

The script itself is available on GitHub. Let's try running it on the test platform where fio is running and writing to bcache, and call udevadm monitor:

Finally! Now we see that what looked like a stalling bcache device is actually a stalling call to generic_make_request() for the cached disk.

Digging into the Kernel

What exactly slows down during request submission? We see that the delay occurs even before request accounting starts, i.e. the accounting of a specific request for the later output of statistics on it (/proc/diskstats or iostat) has not yet begun. This can easily be verified by running iostat while reproducing the problem, or the BCC script biolatency, which is based on the start and end of request accounting. Neither of these utilities will show problems for requests to the cached disk.

If we look at the function generic_make_request(), we will see that two more functions are called before request accounting begins. The first, generic_make_request_checks(), checks the legitimacy of the request with respect to the disk settings. The second, blk_queue_enter(), contains an interesting call to wait_event_interruptible():

ret = wait_event_interruptible(q->mq_freeze_wq,
	(atomic_read(&q->mq_freeze_depth) == 0 &&
	(preempt || !blk_queue_preempt_only(q))) ||
	blk_queue_dying(q));

In it, the kernel waits for the queue to unfreeze. Let's measure the latency of blk_queue_enter():

~# /usr/share/bcc/tools/funclatency  blk_queue_enter -i 1 -m               	 
Tracing 1 functions for "blk_queue_enter"... Hit Ctrl-C to end.

 	msecs           	: count 	distribution
     	0 -> 1      	: 341  	|****************************************|

 	msecs           	: count 	distribution
     	0 -> 1      	: 316  	|****************************************|

 	msecs           	: count 	distribution
     	0 -> 1      	: 255  	|****************************************|
     	2 -> 3      	: 0    	|                                    	|
     	4 -> 7      	: 0    	|                                    	|
     	8 -> 15     	: 1    	|                                    	|

It looks like we are close to a solution. The functions used to freeze/unfreeze the queue are blk_mq_freeze_queue and blk_mq_unfreeze_queue. They are used when it is necessary to change request queue settings that are potentially dangerous for the requests in that queue. When blk_mq_freeze_queue() is called, the function blk_freeze_queue_start() increments the q->mq_freeze_depth counter. After that, the kernel waits for the queue to drain in blk_mq_freeze_queue_wait().

The time it takes to drain this queue is equivalent to the disk latency, since the kernel waits for all queued operations to complete. Once the queue is empty, the settings changes are applied. After that blk_mq_unfreeze_queue() is called, which decrements the mq_freeze_depth counter.
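The interaction between the freeze counter and callers trying to enter the queue can be illustrated with a toy model in Python. This is only a conceptual sketch, not the kernel's implementation. The class FreezableQueue and its methods are hypothetical, and the model leaves out the draining of in-flight requests entirely.

```python
import threading


class FreezableQueue:
    """Toy model of a request queue guarded by a freeze-depth counter."""

    def __init__(self):
        self._cond = threading.Condition()
        self.freeze_depth = 0

    def freeze(self):
        # Each freeze increments the counter; freezes may nest.
        with self._cond:
            self.freeze_depth += 1

    def unfreeze(self):
        # Waiters are woken only when the last freeze is released.
        with self._cond:
            self.freeze_depth -= 1
            if self.freeze_depth == 0:
                self._cond.notify_all()

    def enter(self, timeout=None):
        """Block until the queue is not frozen; return False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self.freeze_depth == 0, timeout)


q = FreezableQueue()
q.freeze()
blocked = not q.enter(timeout=0.05)   # frozen: the caller has to wait
q.unfreeze()
entered = q.enter(timeout=0.05)       # unfrozen: the caller proceeds at once
print(blocked, entered)
```

In this model, every caller of enter() is delayed for as long as any freeze is outstanding. This mirrors the latency seen in blk_queue_enter().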

Now we know enough to fix the situation. The udevadm trigger command causes the settings for the block device to be applied. These settings are described in the udev rules. We can find out which settings freeze the queue by trying to change them through sysfs or by looking at the kernel source code. We can also try the BCC utility trace, which prints kernel and user-space stack traces to the terminal for each call to blk_freeze_queue, for example:

~# /usr/share/bcc/tools/trace blk_freeze_queue -K -U
PID 	TID 	COMM        	FUNC        	 
3809642 3809642 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	elevator_switch+0x29 [kernel]
    	elv_iosched_store+0x197 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

3809631 3809631 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	queue_requests_store+0xb6 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

Udev rules change quite rarely, and usually this happens in a controlled manner. So we see that even applying values that are already set causes a spike in the delay of passing a request from the application to the disk. Of course, generating udev events when there are no changes in the disk configuration (for example, no device is being mounted or disconnected) is not good practice. However, we can help the kernel avoid unnecessary work and not freeze the request queue when there is no need. Three small commits fix the situation.

Conclusion

eBPF is a very flexible and powerful tool. In this article we looked at one practical case and demonstrated a small part of what can be done. If you are interested in developing BCC utilities, it is worth taking a look at the official tutorial, which describes the basics well.

There are other interesting debugging and profiling tools based on eBPF. One of them is bpftrace, which lets you write powerful one-liners and small programs in an awk-like language. Another is ebpf_exporter, which lets you collect low-level, high-resolution metrics directly into your prometheus server, with the ability to later get nice visualizations and even alerts.

Source: www.habr.com
