From High Ceph Latency to Kernel Patch using eBPF/BCC


Linux has a large number of tools for debugging the kernel and applications. Most of them have a negative impact on application performance and cannot be used in production.

A few years ago another tool was developed: eBPF. It makes it possible to trace the kernel and user applications with low overhead, without rebuilding programs or loading third-party modules into the kernel.

There are already many utilities built on eBPF, and in this article we will look at how to write your own profiling utility based on the PythonBCC library. The article is based on real events. We will go from the problem to the fix to show how existing utilities can be used in specific situations.

Ceph Is Slow

A new host was added to the Ceph cluster. After migrating some of the data to it, we noticed that it processed write requests much more slowly than the other servers.

Unlike the other platforms, this host used bcache and the new Linux 4.15 kernel. This was the first time a host with this configuration had been used here, and at that point it was clear that the root of the problem could be anything.

Investigating the Host

Let's start by looking at what happens inside the ceph-osd process. For this we will use perf and flamescope (which you can read more about here):

The picture tells us that the function fdatasync() spent a lot of time sending requests to the function generic_make_request(). This means the cause of the problems is most likely somewhere outside the osd daemon itself: either the kernel or the disks. The iostat output showed high latency in processing requests on the bcache disks.

While checking the host, we found that the systemd-udevd daemon was consuming a large amount of CPU time, about 20% on several cores. This is strange behavior, so we needed to find out why. Since systemd-udevd works with uevents, we decided to look at them through udevadm monitor. It turned out that a large number of change events were being generated for every block device in the system. This is quite unusual, so we had to find out what was generating all these events.

Using the BCC Toolkit

As we have already found out, the kernel (and the ceph daemon inside the system call) spends a lot of time in generic_make_request(). Let's try to measure how fast this function runs. BCC already has a wonderful utility for this: funclatency. We will trace the daemon by its PID, with a 1 second interval between outputs, and print the result in milliseconds.

This function usually runs quickly. All it does is pass the request to the device driver queue.

Bcache is a complex device that actually consists of three disks:

  • backing device (cached disk), in this case a slow HDD;
  • caching device (caching disk), here one partition of the NVMe device;
  • the bcache virtual device that the application works with.

We know that request submission is slow, but for which of these devices? We will deal with this a little later.

We now know that the uevents are likely to be causing problems. Finding out what exactly generates them is not so easy. Let's assume it is some kind of software that is launched periodically. We can see what software runs on the system using the execsnoop script from the same BCC utility kit. Let's run it and send the output to a file.

For example, like this:

/usr/share/bcc/tools/execsnoop  | tee ./execdump

We will not show the full execsnoop output here, but one line of interest to us looked like this:

sh 1764905 5802 0 sudo arcconf getconfig 1 AD | grep Temperature | awk -F '[:/]' '{print $2}' | sed 's/^ \([0-9]*\) C.*/\1/'

The third column is the PPID (parent PID) of the process. The process with PID 5802 turned out to be one of the threads of our monitoring system. When we checked the monitoring configuration, we found incorrect parameters: the temperature of the HBA adapter was being read every 30 seconds, which is far more often than necessary. After changing the check interval to a longer one, the request processing latency on this host no longer stood out compared to the other hosts.
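To find the noisiest parent in a long execsnoop dump, the saved file can be summarized per PPID. The following is a minimal sketch, assuming the default execsnoop column order (PCOMM, PID, PPID, RET, ARGS) and a command name without spaces; the helper names are hypothetical and not part of BCC.

```python
from collections import Counter

def parse_execsnoop_line(line):
    """Split one execsnoop line into (pcomm, pid, ppid, ret, args)."""
    pcomm, pid, ppid, ret, args = line.split(None, 4)
    return pcomm, int(pid), int(ppid), int(ret), args

def count_by_ppid(lines):
    """Count how many commands each parent PID launched."""
    counts = Counter()
    for line in lines:
        try:
            _, _, ppid, _, _ = parse_execsnoop_line(line)
        except ValueError:
            continue  # header line or a line with too few columns
        counts[ppid] += 1
    return counts

# A header plus a shortened version of the line shown above.
sample = [
    "PCOMM PID PPID RET ARGS",
    "sh 1764905 5802 0 sudo arcconf getconfig 1 AD",
]
print(count_by_ppid(sample).most_common(1))
```

In practice the list would come from reading ./execdump line by line; a parent PID that dominates the counts is a good candidate for closer inspection.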

But it was still unclear why the bcache device was so slow. We prepared a test platform with an identical configuration and tried to reproduce the problem by running fio on bcache while periodically running udevadm trigger to generate uevents.

Writing BCC-Based Tools

Let's try to write a simple utility that traces and displays the slowest calls to generic_make_request(). We are also interested in the name of the disk for which the function was called.

The plan is simple:

  • Register a kprobe on generic_make_request():
    • Save the disk name, which is available through the function argument, in memory;
    • Save the timestamp.

  • Register a kretprobe for the return from generic_make_request():
    • Get the current timestamp;
    • Look up the saved timestamp and compare it with the current one;
    • If the result is greater than the specified threshold, find the saved disk name and display it on the terminal.
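The plan above can be rehearsed in plain Python before touching the kernel. This is only a sketch of the bookkeeping: the function names, the threshold, and the disk names are hypothetical, and the real probes are written in C further below.

```python
MIN_US = 1000    # hypothetical latency threshold in microseconds

in_flight = {}   # key: thread id, value: (start_ns, disk_name)
events = []      # stands in for the channel to user space

def on_entry(tid, disk_name, now_ns):
    """Entry probe: remember when the call started and for which disk."""
    in_flight[tid] = (now_ns, disk_name)

def on_return(tid, now_ns):
    """Return probe: compute the latency and report it if it is too high."""
    entry = in_flight.pop(tid, None)
    if entry is None:
        return  # a return without a recorded entry is ignored
    start_ns, disk_name = entry
    lat_us = (now_ns - start_ns) // 1000
    if lat_us > MIN_US:
        events.append((tid, disk_name, lat_us))

on_entry(1, "sdb", 0)
on_return(1, 5_000_000)    # 5000 us, above the threshold
on_entry(2, "nvme0n1", 0)
on_return(2, 200_000)      # 200 us, below the threshold
print(events)
```

Only the slow call ends up in events, and the entry for each thread is removed as soon as its return has been handled.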

Kprobes and kretprobes use a breakpoint mechanism to change function code on the fly. You can read the documentation and a good article on this topic. If you look at the code of the various utilities in BCC, you will see that they share an identical structure. So in this article we will skip the parsing of script arguments and move on to the BPF program itself.

The eBPF text inside the python script looks like this:

bpf_text = """ # Here will be the bpf program code """

To exchange data between functions, eBPF programs use hash tables. We will do the same. We will use the process PID as the key and define a structure as the value:

struct data_t {
	u64 pid;
	u64 ts;
	char comm[TASK_COMM_LEN];
	u64 lat;
	char disk[DISK_NAME_LEN];
};

BPF_HASH(p, u64, struct data_t);
BPF_PERF_OUTPUT(events);

Here we register a hash table called p, with key type u64 and values of type struct data_t. The table will be available in the context of our BPF program. The BPF_PERF_OUTPUT macro registers another table called events, which is used to pass data to user space.

When measuring the delay between a function call and its return, or between calls to different functions, you need to make sure that the data you collect belongs to the same context. In other words, you have to remember that functions may run in parallel. We are able to measure the latency between a function call in the context of one process and the return from that function in the context of another process, but this is most likely useless. A good example here is the biolatency utility, where the hash table key is a pointer to struct request, which represents one disk request.
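The effect of the key choice can be seen in a small replay, sketched here under assumptions: the call sequence is invented, and measure() is a hypothetical helper rather than anything from BCC. Two threads overlap in the same function, once with a single shared key and once with one key per thread.

```python
def measure(calls, key_of):
    """Replay (kind, tid, time_us) tuples and return (tid, latency_us) pairs."""
    started = {}
    results = []
    for kind, tid, t in calls:
        key = key_of(tid)
        if kind == "entry":
            started[key] = t          # a second entry overwrites the first
        elif key in started:
            results.append((tid, t - started.pop(key)))
    return results

# Thread 1 runs from 0 to 100 us, thread 2 from 90 to 120 us.
calls = [("entry", 1, 0), ("entry", 2, 90), ("return", 1, 100), ("return", 2, 120)]

shared = measure(calls, key_of=lambda tid: 0)        # one slot for everyone
per_thread = measure(calls, key_of=lambda tid: tid)  # one slot per thread

print(shared)
print(per_thread)
```

With the shared key, thread 1 is paired with thread 2's start time and thread 2's measurement is lost; with a per-thread key both latencies come out as expected.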

Next, we need to write the code that runs when the function under study is called:

void start(struct pt_regs *ctx, struct bio *bio) {
	u64 pid = bpf_get_current_pid_tgid();
	struct data_t data = {};
	u64 ts = bpf_ktime_get_ns();
	data.pid = pid;
	data.ts = ts;
	bpf_probe_read_str(&data.disk, sizeof(data.disk), (void*)bio->bi_disk->disk_name);
	p.update(&pid, &data);
}

Here the first argument of the traced function generic_make_request() is passed in as the second argument of our handler. After that, we get the PID of the process in whose context we are running and the current timestamp in nanoseconds. We write all of this into the freshly allocated struct data_t data. We take the disk name from the bio structure, which is passed in the call to generic_make_request(), and save it in the same data structure. The last step is to add an entry to the hash table mentioned earlier.

The following function will be called on return from generic_make_request():

void stop(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    struct data_t* data = p.lookup(&pid);
    if (data != 0 && data->ts > 0) {
        bpf_get_current_comm(&data->comm, sizeof(data->comm));
        data->lat = (ts - data->ts)/1000;
        if (data->lat > MIN_US) {
            FACTOR
            data->pid >>= 32;
            events.perf_submit(ctx, data, sizeof(struct data_t));
        }
        p.delete(&pid);
    }
}

This function is similar to the previous one: we get the PID of the process and the timestamp, but we do not allocate memory for a new data structure. Instead, we search the hash table for an existing structure using the current PID as the key. If the structure is found, we get the name of the running process and add it to the structure.

The binary shift we use here is needed to get the thread group ID (TGID), i.e. the PID of the main process that started the thread in whose context we are running. The function we call, bpf_get_current_pid_tgid(), returns both the TGID and the thread's own ID in a single 64-bit value: the TGID in the upper 32 bits and the thread ID in the lower 32 bits.
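The same split can be tried out in user space. This is a small sketch with invented IDs; split_pid_tgid() is a hypothetical helper that only mirrors the bit layout described above.

```python
def split_pid_tgid(pid_tgid):
    """Split a 64-bit pid_tgid value into (tgid, tid)."""
    tgid = pid_tgid >> 32          # upper 32 bits: thread group (process) ID
    tid = pid_tgid & 0xFFFFFFFF    # lower 32 bits: ID of the thread itself
    return tgid, tid

# Hypothetical values: process 5802 with a worker thread 5810.
value = (5802 << 32) | 5810
print(split_pid_tgid(value))
```

The hash table key keeps the full 64-bit value, so parallel threads of one process do not overwrite each other; only the value sent to user space is shifted down to the process ID.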

When printing to the terminal, we are currently not interested in the thread but in the main process. After comparing the resulting delay with the given threshold, we pass our data structure to user space through the events table, and then delete the entry from p.

In the python script that will load this code, we need to replace MIN_US and FACTOR with the delay threshold and the time unit conversion, which we pass in through the arguments:

bpf_text = bpf_text.replace('MIN_US',str(min_usec))
if args.milliseconds:
	bpf_text = bpf_text.replace('FACTOR','data->lat /= 1000;')
	label = "msec"
else:
	bpf_text = bpf_text.replace('FACTOR','')
	label = "usec"
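The substitution step can be checked without loading anything into the kernel. Below is a minimal sketch that wraps the same replacements in a function; render_bpf_text() and the one-line template are hypothetical and exist only for this illustration.

```python
def render_bpf_text(template, min_usec, milliseconds):
    """Fill in the MIN_US and FACTOR placeholders and pick the unit label."""
    text = template.replace("MIN_US", str(min_usec))
    if milliseconds:
        # convert microseconds to milliseconds inside the BPF program
        return text.replace("FACTOR", "data->lat /= 1000;"), "msec"
    return text.replace("FACTOR", ""), "usec"

# A one-line stand-in for the real BPF program text.
template = "if (data->lat > MIN_US) { FACTOR }"
text, label = render_bpf_text(template, 1000, True)
print(text)
print(label)
```

Because the threshold comparison happens before FACTOR runs, the threshold is always given in microseconds, whichever unit is used for display.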

Now we need to prepare the BPF program using the BPF class and register the probes:

b = BPF(text=bpf_text)
b.attach_kprobe(event="generic_make_request",fn_name="start")
b.attach_kretprobe(event="generic_make_request",fn_name="stop")

We also have to define struct data_t in our script, otherwise we will not be able to read anything:

import ctypes as ct

TASK_COMM_LEN = 16	# linux/sched.h
DISK_NAME_LEN = 32	# linux/genhd.h

# Must mirror struct data_t field for field, in the same order
class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]
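A quick way to convince yourself that the Python structure lines up with the C one is to round-trip a value through a raw pointer, which is how the event reaches the callback. This is a sketch under assumptions: the field values and the device name bcache0 are invented, and the structure is repeated only to keep the example self-contained.

```python
import ctypes as ct

TASK_COMM_LEN = 16
DISK_NAME_LEN = 32

class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]

# Hypothetical event, as the kernel side might have filled it in.
raw = Data(pid=5802, ts=1, comm=b"ceph-osd", lat=42, disk=b"bcache0")

# Hand it over as an untyped pointer, then cast it back to Data.
void_ptr = ct.cast(ct.pointer(raw), ct.c_void_p)
event = ct.cast(void_ptr, ct.POINTER(Data)).contents

print(ct.sizeof(Data), event.pid, event.comm, event.lat, event.disk)
```

If a field is missing or in a different order than in struct data_t, the size and the decoded values stop matching, which is usually the first sign of a layout mismatch.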

The last step is to output the data to the terminal:

def print_event(cpu, data, size):
    global start
    event = ct.cast(data, ct.POINTER(Data)).contents
    if start == 0:
        start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
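    # comm and disk arrive as bytes; decode them so they print cleanly
    comm = event.comm.decode('utf-8', 'replace')
    disk = event.disk.decode('utf-8', 'replace')
    print("%-18.9f %-16s %-6d   %-1s %s   %s" % (time_s, comm, event.pid, event.lat, label, disk))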

b["events"].open_perf_buffer(print_event)
# format output
start = 0
while 1:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()

The script itself is available on GitHub. Let's try to run it on the test platform where fio is running, writing to bcache, and call udevadm monitor:

Finally! Now we see that what looked like a stalling bcache device is actually a stalling call to generic_make_request() for the cached disk.

Digging into the Kernel

What exactly slows down during request submission? We see that the delay occurs even before request accounting starts, i.e. the accounting of a specific request for the statistics later reported on it (/proc/diskstats or iostat) has not yet begun. This is easy to verify by running iostat while reproducing the problem, or the BCC script biolatency, which is based on the start and end of request accounting. Neither of these utilities will show problems for requests to the cached disk.

If we look at the function generic_make_request(), we will see that two more functions are called before request accounting starts. The first, generic_make_request_checks(), checks the legitimacy of the request against the disk settings. The second, blk_queue_enter(), contains an interesting call to wait_event_interruptible():

ret = wait_event_interruptible(q->mq_freeze_wq,
	(atomic_read(&q->mq_freeze_depth) == 0 &&
	(preempt || !blk_queue_preempt_only(q))) ||
	blk_queue_dying(q));

In it, the kernel waits for the queue to unfreeze. Let's measure the delay of blk_queue_enter():

~# /usr/share/bcc/tools/funclatency  blk_queue_enter -i 1 -m               	 
Tracing 1 functions for "blk_queue_enter"... Hit Ctrl-C to end.

 	msecs           	: count 	distribution
     	0 -> 1      	: 341  	|****************************************|

 	msecs           	: count 	distribution
     	0 -> 1      	: 316  	|****************************************|

 	msecs           	: count 	distribution
     	0 -> 1      	: 255  	|****************************************|
     	2 -> 3      	: 0    	|                                    	|
     	4 -> 7      	: 0    	|                                    	|
     	8 -> 15     	: 1    	|                                    	|
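The ranges in this output are power-of-two buckets, so a single slow call stands out even among hundreds of fast ones. The sketch below imitates how a value maps to the printed range; log2_bucket() is a hypothetical helper written for this illustration and is not taken from the funclatency source.

```python
def log2_bucket(value):
    """Return the (low, high) bounds of the power-of-two bucket holding value."""
    if value < 2:
        return (0, 1)
    high_bit = value.bit_length() - 1   # position of the highest set bit
    return (1 << high_bit, (1 << (high_bit + 1)) - 1)

for msecs in (0, 1, 2, 5, 9):
    low, high = log2_bucket(msecs)
    print("%d ms falls into %d -> %d" % (msecs, low, high))
```

A call that took anywhere from 8 to 15 milliseconds is counted in the last row of the third sample above.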

It looks like we are close to a solution. The functions used to freeze and unfreeze a queue are blk_mq_freeze_queue and blk_mq_unfreeze_queue. They are used when the request queue settings need to be changed in a way that is potentially dangerous for requests already in the queue. When blk_mq_freeze_queue() is called, the function blk_freeze_queue_start() increments the counter q->mq_freeze_depth. After that, the kernel waits for the queue to empty in blk_mq_freeze_queue_wait().

The time it takes to drain this queue is equivalent to the disk latency, because the kernel waits for all queued operations to complete. Once the queue is empty, the settings changes are applied. After that, blk_mq_unfreeze_queue() is called, which decrements the q->mq_freeze_depth counter.
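The interplay between the freeze counter and new requests can be captured in a toy model. This is a deliberately simplified sketch, not kernel code: ToyQueue and its methods are hypothetical, and there is no real blocking, only a check of whether a request would have to wait.

```python
class ToyQueue:
    """Toy model of a request queue with a freeze depth counter."""

    def __init__(self):
        self.freeze_depth = 0   # how many freezes are currently active
        self.in_flight = 0      # requests that entered and have not finished

    def freeze(self):
        self.freeze_depth += 1

    def unfreeze(self):
        if self.freeze_depth == 0:
            raise RuntimeError("unfreeze without a matching freeze")
        self.freeze_depth -= 1

    def try_enter(self):
        """Return True if a new request may enter, False if it must wait."""
        if self.freeze_depth > 0:
            return False
        self.in_flight += 1
        return True

    def complete(self):
        self.in_flight -= 1

    def drained(self):
        """A settings change may be applied only once nothing is in flight."""
        return self.in_flight == 0

q = ToyQueue()
print(q.try_enter())   # queue not frozen: the request enters
q.freeze()
print(q.try_enter())   # frozen: a new request would have to wait
print(q.drained())     # the earlier request is still in flight
q.complete()
print(q.drained())     # now the settings change could be applied
q.unfreeze()
print(q.try_enter())   # requests flow again
```

In this model the waiting time of a blocked request is bounded below by how long the in-flight requests take to finish, which is why the stall tracks disk latency.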

Now we know enough to correct the situation. The udevadm trigger command causes the settings for the block device to be applied. These settings are described in the udev rules. We can find out which settings freeze the queue by trying to change them through sysfs or by looking at the kernel source code. We can also try the BCC utility trace, which prints kernel and userspace stack traces to the terminal for each call to blk_freeze_queue, for example:

~# /usr/share/bcc/tools/trace blk_freeze_queue -K -U
PID 	TID 	COMM        	FUNC        	 
3809642 3809642 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	elevator_switch+0x29 [kernel]
    	elv_iosched_store+0x197 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

3809631 3809631 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	queue_requests_store+0xb6 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

Udev rules change quite rarely, and usually this happens in a controlled manner. So we see that even applying values that are already set causes a spike in the delay when passing a request from the application to the disk. Of course, generating udev events when nothing has changed in the disk configuration (for example, no device has been attached or detached) is not good practice. However, we can help the kernel avoid unnecessary work and not freeze the request queue when it is not needed. Three small commits correct the situation.

Conclusion

eBPF is a very flexible and powerful tool. In this article we looked at one practical case and demonstrated a small part of what can be done with it. If you are interested in developing BCC utilities, it is worth taking a look at the official tutorial, which explains the basics well.

There are other interesting debugging and profiling tools based on eBPF. One of them is bpftrace, which lets you write powerful one-liners and small programs in an awk-like language. Another is ebpf_exporter, which lets you collect low-level, high-resolution metrics directly into your prometheus server, with the option of later getting nice visualizations and even alerts.

Source: www.habr.com
