From High Ceph Latency to a Kernel Patch with eBPF/BCC


Linux has many tools for debugging the kernel and applications. Most of them hurt application performance and cannot be used in production.

A few years ago another tool appeared: eBPF. It makes it possible to trace the kernel and user applications with low overhead, without rebuilding programs or loading third-party modules into the kernel.

Many utilities already use eBPF, and in this article we will look at how to write your own profiling utility with the PythonBCC library. The article is based on real events. We will go from the problem to its fix to show how existing utilities can be applied in specific situations.

Ceph Is Slow

A new host was added to the Ceph cluster. After migrating some of the data to it, we noticed that it processed requests much more slowly than the other servers.

Unlike the other platforms, this host used bcache and the new Linux 4.15 kernel. It was the first time a host with this configuration had been used here, and at that point it was clear that the root of the problem could, in theory, be anything.

Investigating the Host

Let's start by looking at what happens inside the ceph-osd process. For this we will use perf and flamescope (you can read more about them here):

The flame graph tells us that fdatasync() spent a lot of time submitting requests in generic_make_request(). This means the cause of the problem most likely lies somewhere outside the osd daemon itself: in the kernel or in the disks. The iostat output showed high request latency on the bcache disks.

While examining the host, we found that the systemd-udevd daemon was consuming a lot of CPU time: about 20% on several cores. This is odd behavior, so we need to find out why. Since systemd-udevd works with uevents, we decided to watch them with udevadm monitor. It turned out that a large number of change events were being generated for every block device in the system. This is quite unusual, so we need to find out what generates all these events.
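To make the udevadm monitor observation concrete, here is a minimal Python sketch that counts change events per block device from captured monitor output. The sample lines and device paths are made up for illustration, and the line layout (a source tag with a timestamp, the action, the device path, and the subsystem in parentheses) is an assumption that should be checked against what udevadm monitor prints on your system.

```python
import re
from collections import Counter

# Assumed line layout of `udevadm monitor` output:
#   KERNEL[1234.567890] change   /devices/.../block/sda (block)
LINE_RE = re.compile(r"^(KERNEL|UDEV)\s*\[[\d.]+\]\s+(\w+)\s+(\S+)\s+\((\w+)\)")

def count_change_events(lines, source="UDEV"):
    """Count 'change' uevents per block device path for one event source."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip the banner and blank lines
        src, action, devpath, subsystem = m.groups()
        if src == source and action == "change" and subsystem == "block":
            counts[devpath] += 1
    return counts

# Hypothetical captured output
sample = [
    "monitor will print the received events for:",
    "KERNEL[100.000001] change   /devices/virtual/block/bcache0 (block)",
    "UDEV  [100.000201] change   /devices/virtual/block/bcache0 (block)",
    "UDEV  [100.500201] change   /devices/virtual/block/bcache0 (block)",
    "UDEV  [100.600000] add      /devices/virtual/net/veth0 (net)",
]
counts = count_change_events(sample)
print(counts.most_common())
```

A device that collects many change events within a short window is a good candidate for a closer look.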

Using the BCC Toolkit

As we have already found out, the kernel (and the ceph daemon, inside a system call) spends a lot of time in generic_make_request(). Let's measure how fast this function runs. BCC already has a great utility for this: funclatency. We will trace the daemon by its PID, with a 1-second interval between outputs, and print the result in milliseconds.
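The invocation described above can be sketched as a small Python helper that builds the command line. The tool path is the one used elsewhere in this article, the PID is hypothetical, and -p, -i and -m are funclatency's options for the target PID, the output interval in seconds, and millisecond units.

```python
import shlex

FUNCLATENCY = "/usr/share/bcc/tools/funclatency"  # path as used in this article

def funclatency_cmd(function, pid=None, interval=1, milliseconds=True):
    """Build an argv list for BCC's funclatency tool."""
    argv = [FUNCLATENCY]
    if pid is not None:
        argv += ["-p", str(pid)]
    argv += ["-i", str(interval)]
    if milliseconds:
        argv.append("-m")
    argv.append(function)
    return argv

# 12345 is a hypothetical PID of a ceph-osd process
cmd = funclatency_cmd("generic_make_request", pid=12345)
print(" ".join(shlex.quote(a) for a in cmd))
# To actually run it (needs root and BCC installed):
#   subprocess.run(cmd, check=True)
```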

This function usually runs quickly. All it does is pass the request to the device driver's queue.

Bcache is a complex device that actually consists of three devices:

  • a backing device (the cached disk), in this case a slow HDD;
  • a caching device (the caching disk), here a partition of an NVMe device;
  • the bcache virtual device that the application works with.

We know that request submission is slow, but for which of these devices? We will deal with that a little later.

We now know that the uevents are probably causing the problems. Finding out what generates them is not so easy. Let's assume it is some software that runs periodically. To see what software gets launched on the system, we will use the execsnoop script from the same BCC toolkit. We run it and send the output to a file.

For example, like this:

/usr/share/bcc/tools/execsnoop  | tee ./execdump

We will not show the full execsnoop output here, but one line of interest to us looked like this:

sh 1764905 5802 0 sudo arcconf getconfig 1 AD | grep Temperature | awk -F '[:/]' '{print $2}' | sed 's/^ \([0-9]*\) C.*/\1/'

The third column is the PPID (parent PID) of the process. The process with PID 5802 turned out to be one of the threads of our monitoring system. Checking the monitoring system's configuration revealed incorrect parameters: the temperature of the HBA adapter was being read every 30 seconds, which is far more often than necessary. After changing the check interval to a longer one, we found that request latency on this host no longer stood out from the other hosts.
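Picking the parent out of an execsnoop line can be sketched as follows. The column order (command, PID, PPID, return value, arguments) is taken from the line shown above; the helper name is hypothetical and the argument string is shortened.

```python
from collections import namedtuple

ExecEvent = namedtuple("ExecEvent", "pcomm pid ppid ret args")

def parse_execsnoop_line(line):
    """Split one execsnoop output line into its five columns."""
    pcomm, pid, ppid, ret, args = line.split(None, 4)
    return ExecEvent(pcomm, int(pid), int(ppid), int(ret), args)

line = "sh 1764905 5802 0 sudo arcconf getconfig 1 AD | grep Temperature"
event = parse_execsnoop_line(line)
print(event.ppid, event.args)
```

The parent can then be identified with, for example, ps -p 5802 -o comm=.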

But it is still unclear why the bcache device was so slow. We prepared a test platform with an identical configuration and tried to reproduce the problem by running fio against bcache while periodically running udevadm trigger to generate uevents.

Writing BCC-Based Tools

Let's try to write a simple utility that traces and prints the slowest calls to generic_make_request(). We are also interested in the name of the disk the function was called for.

The plan is simple:

  • Register a kprobe on generic_make_request():
    • Save the disk name, available through the function's argument;
    • Save the timestamp.

  • Register a kretprobe on the return from generic_make_request():
    • Get the current timestamp;
    • Look up the saved timestamp and compare it with the current one;
    • If the difference is greater than the specified threshold, find the saved disk name and print it to the terminal.
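Before turning to BPF, the plan above can be modeled in plain Python. This is only a toy simulation with made-up timestamps and names; the dictionary plays the role that a BPF hash table plays in the real tool.

```python
class LatencyTracer:
    """Toy model of the kprobe/kretprobe pair; timestamps are in microseconds."""

    def __init__(self, threshold_us):
        self.threshold_us = threshold_us
        self.inflight = {}   # thread id -> (start timestamp, disk name)
        self.reported = []   # (thread id, disk name, latency) above the threshold

    def on_entry(self, tid, disk, ts_us):
        # kprobe: remember the disk name and when the call started
        self.inflight[tid] = (ts_us, disk)

    def on_return(self, tid, ts_us):
        # kretprobe: look up the saved entry and compare timestamps
        saved = self.inflight.pop(tid, None)
        if saved is None:
            return  # the call started before tracing began
        start_us, disk = saved
        latency = ts_us - start_us
        if latency > self.threshold_us:
            self.reported.append((tid, disk, latency))

tracer = LatencyTracer(threshold_us=1000)
tracer.on_entry(tid=1, disk="bcache0", ts_us=0)
tracer.on_entry(tid=2, disk="sdb", ts_us=10)
tracer.on_return(tid=1, ts_us=50)      # 50 us: below the threshold
tracer.on_return(tid=2, ts_us=5010)    # 5000 us: reported
print(tracer.reported)
```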

Kprobes and kretprobes use a breakpoint mechanism to change the function's code on the fly. You can read the documentation and a good article on this topic. If you look at the code of the various utilities in BCC, you will see that they share the same structure. So in this article we will skip the parsing of script arguments and go straight to the BPF program itself.

The eBPF program text inside the Python script looks like this:

bpf_text = """ # Here will be the bpf program code """

To exchange data between functions, eBPF programs use hash tables. We will do the same. We will use the process PID as the key and define our own structure as the value:

struct data_t {
	u64 pid;
	u64 ts;
	char comm[TASK_COMM_LEN];
	u64 lat;
	char disk[DISK_NAME_LEN];
};

BPF_HASH(p, u64, struct data_t);
BPF_PERF_OUTPUT(events);

Here we register a hash table named p, with a key of type u64 and a value of type struct data_t. The table will be accessible from the context of our BPF program. The BPF_PERF_OUTPUT macro registers another table named events, which is used to pass data to user space.

When measuring the delay between a function call and its return, or between calls to different functions, you must make sure that the data you saved and the data you read back belong to the same context. In other words, you need to keep in mind that the function may be executing in parallel. We could measure the delay between a call in the context of one process and the return in the context of another, but that would most likely be useless. A good example here is the biolatency utility, where the hash table key is a pointer to struct request, which represents a single disk request.
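Why the choice of key matters can be shown with a small simulation. This is an illustrative sketch with made-up events: two threads call the same function on the same disk with overlapping lifetimes, and the pairs are matched once per thread and once per disk.

```python
def measure(events, key_of):
    """Pair entry/return events using key_of(event) as the hash table key."""
    table, latencies = {}, []
    for ev in events:
        key = key_of(ev)
        if ev["kind"] == "entry":
            table[key] = ev["ts"]   # a later entry with the same key overwrites
        elif key in table:
            latencies.append((ev["tid"], ev["ts"] - table.pop(key)))
    return latencies

# Thread 1 makes a slow call; thread 2 makes a fast call in the middle of it
events = [
    {"kind": "entry",  "tid": 1, "disk": "sdb", "ts": 0},
    {"kind": "entry",  "tid": 2, "disk": "sdb", "ts": 90},
    {"kind": "return", "tid": 2, "disk": "sdb", "ts": 95},
    {"kind": "return", "tid": 1, "disk": "sdb", "ts": 100},
]
by_thread = measure(events, key_of=lambda ev: ev["tid"])
by_disk = measure(events, key_of=lambda ev: ev["disk"])
print(by_thread)
print(by_disk)
```

Keyed per thread, both calls are measured correctly. Keyed per disk, the second entry overwrites the first, and the slow call disappears from the results.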

Next, we need to write the code that will run when the function under study is called:

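void start(struct pt_regs *ctx, struct bio *bio) {
	// Hash table key: the 64-bit value from bpf_get_current_pid_tgid()
	u64 pid = bpf_get_current_pid_tgid();
	struct data_t data = {};
	u64 ts = bpf_ktime_get_ns();	// entry timestamp, in nanoseconds
	data.pid = pid;
	data.ts = ts;
	// Copy the disk name out of the bio passed to generic_make_request()
	bpf_probe_read_str(&data.disk, sizeof(data.disk), (void*)bio->bi_disk->disk_name);
	p.update(&pid, &data);
}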

Here the first argument of generic_make_request() is passed in as the second argument of our function. Then we get the PID of the process we are running in and the current timestamp in nanoseconds, and write both into a freshly allocated struct data_t data. We take the disk name from the bio structure that is passed to generic_make_request() and store it in the same data structure. The last step is to add an entry to the hash table mentioned earlier.

The following function will be called on return from generic_make_request():

void stop(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    // Find the entry that start() saved for this thread
    struct data_t* data = p.lookup(&pid);
    if (data != 0 && data->ts > 0) {
        bpf_get_current_comm(&data->comm, sizeof(data->comm));
        data->lat = (ts - data->ts)/1000;   // nanoseconds -> microseconds
        if (data->lat > MIN_US) {
            FACTOR
            data->pid >>= 32;               // keep only the upper 32 bits
            events.perf_submit(ctx, data, sizeof(struct data_t));
        }
        p.delete(&pid);
    }
}

This function is similar to the previous one: we get the PID of the process and the timestamp, but we do not allocate memory for a new data structure. Instead, we search the hash table for an existing structure using key == current PID. If the structure is found, we get the name of the running process and add it to the structure.

The binary shift we use here is needed to obtain the thread group ID (TGID), that is, the PID of the main process that started the thread in whose context we are running. The bpf_get_current_pid_tgid() function we call returns both the TGID and the thread's own ID in a single 64-bit value: the TGID in the upper 32 bits and the thread ID in the lower 32 bits.
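The same split can be tried out in Python. This is a small sketch with hypothetical IDs; the helper name is not part of BCC.

```python
def split_pid_tgid(pid_tgid):
    """Split the 64-bit value into (tgid, tid)."""
    tgid = pid_tgid >> 32          # upper 32 bits: thread group ID (process PID)
    tid = pid_tgid & 0xFFFFFFFF    # lower 32 bits: thread ID
    return tgid, tid

# Hypothetical process 5802 with a thread 5810
value = (5802 << 32) | 5810
print(split_pid_tgid(value))
```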

When printing to the terminal we are not interested in the thread, only in the main process. After comparing the measured latency with the given threshold, we pass our data structure to user space through the events table, and then delete the entry from p.

In the Python script that will load this code, we need to replace MIN_US and FACTOR with the latency threshold and the time unit conversion, which we pass in through the arguments:

bpf_text = bpf_text.replace('MIN_US',str(min_usec))
if args.milliseconds:
	bpf_text = bpf_text.replace('FACTOR','data->lat /= 1000;')
	label = "msec"
else:
	bpf_text = bpf_text.replace('FACTOR','')
	label = "usec"
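The article skips argument parsing, so args and min_usec appear above without a definition. One way they could be produced is sketched here; the option names and the default threshold are assumptions made for this example, not taken from the original script.

```python
import argparse

def parse_args(argv=None):
    # Option names are assumptions for this sketch
    parser = argparse.ArgumentParser(
        description="Trace slow generic_make_request() calls")
    parser.add_argument("-m", "--milliseconds", action="store_true",
                        help="report latency in milliseconds instead of microseconds")
    parser.add_argument("min_latency", nargs="?", type=int, default=1000,
                        help="minimum latency to report, in the chosen unit")
    return parser.parse_args(argv)

def threshold_usec(args):
    """The BPF code compares in microseconds, so convert a millisecond threshold."""
    return args.min_latency * 1000 if args.milliseconds else args.min_latency

args = parse_args(["-m", "5"])
min_usec = threshold_usec(args)
print(args.milliseconds, min_usec)
```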

Now we need to compile the BPF program through the BPF() constructor and register the probes:

from bcc import BPF

b = BPF(text=bpf_text)
b.attach_kprobe(event="generic_make_request", fn_name="start")
b.attach_kretprobe(event="generic_make_request", fn_name="stop")

We also have to define struct data_t in our script, otherwise we will not be able to read anything:

import ctypes as ct

TASK_COMM_LEN = 16	# linux/sched.h
DISK_NAME_LEN = 32	# linux/genhd.h
class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]
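To check that the Python-side layout lines up with the C structure, the ctypes class can be exercised without BCC at all. The sketch below packs a fake record with made-up values and decodes it; it assumes the fields are laid out back to back, which holds here because every field starts on an 8-byte boundary.

```python
import ctypes as ct
import struct

TASK_COMM_LEN = 16
DISK_NAME_LEN = 32

class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]

# Fake record in the same field order as struct data_t (made-up values)
raw = struct.pack("=QQ16sQ32s", 5802, 123456789, b"ceph-osd", 42, b"sdb")
event = Data.from_buffer_copy(raw)
print(ct.sizeof(Data), event.pid, event.comm, event.lat, event.disk)
```

Note that under Python 3 the comm and disk fields come back as bytes and need decoding before they are printed.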

The last step is to print the data to the terminal:

def print_event(cpu, data, size):
    global start
    event = ct.cast(data, ct.POINTER(Data)).contents
    if start == 0:
        start = event.ts
    # Seconds elapsed since the first reported event
    time_s = (float(event.ts - start)) / 1000000000
    # comm and disk arrive as bytes; decode them for printing
    comm = event.comm.decode("utf-8", "replace")
    disk = event.disk.decode("utf-8", "replace")
    print("%-18.9f %-16s %-6d   %-1s %s   %s" % (time_s, comm, event.pid, event.lat, label, disk))

start = 0
b["events"].open_perf_buffer(print_event)
while 1:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()

The script itself is available on GitHub. Let's run it on the test platform while fio is writing to bcache, and call udevadm trigger:

Finally! Now we can see that what looked like a stalling bcache device is actually a stall in the generic_make_request() call for the backing disk.

Digging into the Kernel

What exactly is slowing down during request submission? We can see that the delay occurs even before request accounting begins, i.e. the accounting of a specific request for the statistics later reported in /proc/diskstats or by iostat has not yet started. This is easy to verify by running iostat while reproducing the problem, or with the BCC script biolatency, which is based on the start and end of request accounting. Neither of these utilities will show any problem with requests to the backing disk.

If we look at generic_make_request(), we will see that two more functions are called before request accounting starts. The first, generic_make_request_checks(), checks that the request is valid with respect to the disk settings. The second, blk_queue_enter(), contains an interesting call to wait_event_interruptible():

ret = wait_event_interruptible(q->mq_freeze_wq,
	(atomic_read(&q->mq_freeze_depth) == 0 &&
	(preempt || !blk_queue_preempt_only(q))) ||
	blk_queue_dying(q));

Here the kernel waits for the queue to be unfrozen. Let's measure the latency of blk_queue_enter():

~# /usr/share/bcc/tools/funclatency  blk_queue_enter -i 1 -m               	 
Tracing 1 functions for "blk_queue_enter"... Hit Ctrl-C to end.

 	msecs           	: count 	distribution
     	0 -> 1      	: 341  	|****************************************|

 	msecs           	: count 	distribution
     	0 -> 1      	: 316  	|****************************************|

 	msecs           	: count 	distribution
     	0 -> 1      	: 255  	|****************************************|
     	2 -> 3      	: 0    	|                                    	|
     	4 -> 7      	: 0    	|                                    	|
     	8 -> 15     	: 1    	|                                    	|
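The rows in this output are power-of-two buckets. A minimal Python sketch of that bucketing, using made-up latency samples, shows how a single slow call ends up alone in the 8 -> 15 row:

```python
from collections import Counter

def log2_bucket(value):
    """Return the (low, high) power-of-two bucket that holds value."""
    if value <= 1:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)
    return (low, 2 * low - 1)

# Made-up latencies in milliseconds
samples_ms = [0, 0, 1, 1, 0, 9]
histogram = Counter(log2_bucket(v) for v in samples_ms)
for (low, high), count in sorted(histogram.items()):
    print("%6d -> %-6d : %d" % (low, high, count))
```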

It looks like we are close to a solution. The functions used to freeze and unfreeze a queue are blk_mq_freeze_queue and blk_mq_unfreeze_queue. They are used when the settings of the request queue need to be changed in a way that could be dangerous for requests already in the queue. When blk_mq_freeze_queue() is called, blk_freeze_queue_start() increments the q->mq_freeze_depth counter. After that, the kernel waits for the queue to drain in blk_mq_freeze_queue_wait().

The time it takes to drain the queue is equivalent to the disk latency, because the kernel waits for all queued operations to complete. Once the queue is empty, the settings change is applied. After that blk_mq_unfreeze_queue() is called, which decrements the freeze_depth counter.
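The freeze, drain, and unfreeze sequence can be modeled with a toy Python class. This is a simplified illustration, not the kernel's implementation: the class and method names are hypothetical, and a real submitter would sleep instead of getting False back.

```python
class RequestQueue:
    """Toy model of queue freezing; not the kernel's implementation."""

    def __init__(self):
        self.freeze_depth = 0
        self.in_flight = 0

    def try_enter(self):
        # Models the condition blk_queue_enter() waits for: freeze depth == 0
        if self.freeze_depth > 0:
            return False          # a real caller would sleep here
        self.in_flight += 1
        return True

    def complete(self):
        self.in_flight -= 1

    def freeze_start(self):
        self.freeze_depth += 1    # new requests are held back from now on

    def is_drained(self):
        return self.in_flight == 0  # the freezer waits until this is true

    def unfreeze(self):
        self.freeze_depth -= 1

q = RequestQueue()
q.try_enter()                     # one request in flight on a slow disk
q.freeze_start()                  # a settings change begins
blocked = not q.try_enter()       # new submissions now have to wait
drained_early = q.is_drained()    # still waiting for the slow request
q.complete()
q.unfreeze()
admitted_again = q.try_enter()
print(blocked, drained_early, admitted_again)
```

The model shows why a slow disk makes things worse: every submitter stays blocked for as long as the slowest in-flight request takes to complete.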

Now we know enough to fix the situation. The udevadm trigger command causes the settings for a block device to be applied. These settings are described in udev rules. We can find out which settings freeze the queue by trying to change them through sysfs or by looking at the kernel source code. We can also try the BCC trace utility, which prints the kernel and user space stack traces for every call to blk_freeze_queue, for example:

~# /usr/share/bcc/tools/trace blk_freeze_queue -K -U
PID 	TID 	COMM        	FUNC        	 
3809642 3809642 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	elevator_switch+0x29 [kernel]
    	elv_iosched_store+0x197 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

3809631 3809631 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	queue_requests_store+0xb6 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]
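With many such traces, it helps to pull out just the sysfs store handler from each stack. The following Python sketch does that for text in the format shown above. The mapping from handler to attribute name reflects the kernel's queue sysfs attributes (scheduler and nr_requests); treat it as an assumption to verify against your kernel version.

```python
import re

FRAME_RE = re.compile(r"^\s*(\S+)\+0x[0-9a-f]+ \[kernel\]")

# Handler -> attribute under /sys/block/<dev>/queue/ (assumed, check your kernel)
SYSFS_ATTR = {
    "elv_iosched_store": "scheduler",
    "queue_requests_store": "nr_requests",
}

def store_handlers(trace_text):
    """Return the attribute-specific *_store functions found in a stack trace."""
    frames = [m.group(1) for m in map(FRAME_RE.match, trace_text.splitlines()) if m]
    return [f for f in frames if f.endswith("_store") and f != "queue_attr_store"]

sample = """\
3809642 3809642 systemd-udevd   blk_freeze_queue
    blk_freeze_queue+0x1 [kernel]
    elevator_switch+0x29 [kernel]
    elv_iosched_store+0x197 [kernel]
    queue_attr_store+0x5c [kernel]
    sysfs_kf_write+0x3c [kernel]
"""
handlers = store_handlers(sample)
print([(h, SYSFS_ATTR.get(h, "?")) for h in handlers])
```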

Udev rules change quite rarely, and usually this happens in a controlled way. So we see that even applying values that are already set causes a spike in the delay in passing a request from the application to the disk. Of course, generating udev events when nothing in the disk configuration has changed (for example, no device was attached or detached) is not good practice. However, we can help the kernel avoid unnecessary work and not freeze the request queue when it is not needed. Three small commits fix the situation.
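The idea of not freezing the queue when nothing changes can be illustrated with a toy model. This sketch is not the code of the commits mentioned above; the class, the flag, and the values are hypothetical and only show the effect of comparing the new value with the current one before doing any work.

```python
class QueueSetting:
    """Toy model of a queue attribute whose update requires a freeze."""

    def __init__(self, value):
        self.value = value
        self.freezes = 0

    def store(self, new_value, skip_if_unchanged=True):
        if skip_if_unchanged and new_value == self.value:
            return            # nothing to change, so leave the queue alone
        self.freezes += 1     # stands in for freeze -> drain -> unfreeze
        self.value = new_value

naive = QueueSetting(128)
patched = QueueSetting(128)
for _ in range(3):            # the same value is re-applied three times
    naive.store(128, skip_if_unchanged=False)
    patched.store(128)
patched.store(256)            # a real change still freezes the queue
print(naive.freezes, patched.freezes)
```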

Conclusion

eBPF is a very flexible and powerful tool. In this article we looked at one practical case and showed a small part of what can be done. If you are interested in developing BCC utilities, the official tutorial is worth a look; it describes the basics well.

There are other interesting debugging and profiling tools based on eBPF. One of them is bpftrace, which lets you write powerful one-liners and small programs in an awk-like language. Another is ebpf_exporter, which lets you collect low-level, high-resolution metrics directly into your Prometheus server, with the ability to get nice visualizations and even alerts later.

Source: www.habr.com
