From High Ceph Latency to a Kernel Patch Using eBPF/BCC

Linux has many tools for debugging the kernel and applications. Most of them hurt application performance and cannot be used in production.

A few years ago another tool appeared: eBPF. It makes it possible to trace the kernel and user applications with low overhead, without rebuilding programs or loading third-party modules into the kernel.

There are already many utilities built on eBPF, and in this article we will look at how to write your own profiling utility based on the PythonBCC library. The article is based on real events. We will go from the problem to the fix to show how existing utilities can be used in specific situations.

Ceph Is Slow

A new host was added to the Ceph cluster. After migrating some of the data to it, we noticed that it processed write requests much more slowly than the other servers.

Unlike the other platforms, this host used bcache and the new Linux 4.15 kernel. This was the first time a host with this configuration had been used here, and at that point it was clear that the root of the problem could be anything.

Investigating the Host

Let's start by looking at what happens inside the ceph-osd process. For this we will use perf and flamescope (you can read more about it here).

The flame graph tells us that fdatasync() spent a lot of time submitting requests in generic_make_request(). This means the cause of the problem is most likely somewhere outside the osd daemon itself: it could be the kernel or the disks. The iostat output showed high latency for requests handled by the bcache disks.

While examining the host, we found that the systemd-udevd daemon was consuming a lot of CPU time, about 20% on several cores. This is strange behaviour, so we need to find out why. Since systemd-udevd works with uevents, we decided to watch them with udevadm monitor. It turned out that a large number of change events were being generated for every block device in the system. This is quite unusual, so we need to find out what generates all these events.

Using the BCC Toolkit

As we have already found, the kernel (and the ceph daemon inside a system call) spends a lot of time in generic_make_request(). Let's try to measure how fast this function runs. BCC already has an excellent utility for this: funclatency. We will trace the daemon by its PID, with a one-second interval between outputs, and print the result in milliseconds.
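As a rough sketch, the invocation described above could be assembled as follows. The -p, -i and -m options are funclatency's PID filter, output interval in seconds, and millisecond histogram switch. The tool path is the one used later in this article, and the PID is a placeholder rather than a value from our cluster.

```python
import shlex

# Path as used elsewhere in this article; adjust it for your distribution.
FUNCLATENCY = "/usr/share/bcc/tools/funclatency"

def funclatency_cmd(pid, func="generic_make_request", interval_s=1):
    """Build the funclatency command line for a single process."""
    return [FUNCLATENCY, "-p", str(pid), "-i", str(interval_s), "-m", func]

# 12345 is a placeholder PID standing in for a ceph-osd process.
cmd = funclatency_cmd(12345)
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting command has to be run as root on a host with BCC installed.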

This function usually runs quickly. All it does is pass the request to the device driver's queue.

Bcache is a complex device that actually consists of three disks:

  • the backing device (cached disk), in this case a slow HDD;
  • the caching device (caching disk), here a single partition of an NVMe device;
  • the bcache virtual device that the application works with.

We know that request submission is slow, but for which of these devices? We will deal with that a little later.

We now know that uevents are likely to cause problems. Finding out what exactly generates them is not so easy. Let's assume it is some kind of software that is launched periodically. Let's see what software is being run on the system using the execsnoop script from the same BCC toolkit. We run it and send the output to a file.

For example:

/usr/share/bcc/tools/execsnoop  | tee ./execdump

We will not show the full execsnoop output here, but one line of interest to us looked like this:

sh 1764905 5802 0 sudo arcconf getconfig 1 AD | grep Temperature | awk -F '[:/]' '{print $2}' | sed 's/^ \([0-9]*\) C.*/\1/'

The third column is the PPID (parent PID) of the process. The process with PID 5802 turned out to be one of the threads of our monitoring system. When we checked the monitoring configuration, we found incorrect parameters. The temperature of the HBA adapter was being read every 30 seconds, which is far more often than necessary. After changing the check interval to a longer one, we found that request latency on this host no longer stood out compared with the other hosts.
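To find the most frequent parent in a long execsnoop dump, the columns can be split and counted. This is a minimal sketch that assumes the default execsnoop column order (PCOMM, PID, PPID, RET, ARGS), a header line that has already been skipped, and command names without spaces. The first sample line is a shortened form of the line above; the second is made up for illustration.

```python
from collections import Counter

def parse_execsnoop_line(line):
    """Split one execsnoop line into PCOMM, PID, PPID, RET and ARGS."""
    pcomm, pid, ppid, ret, args = line.split(None, 4)
    return {"pcomm": pcomm, "pid": int(pid), "ppid": int(ppid),
            "ret": int(ret), "args": args}

lines = [
    "sh 1764905 5802 0 sudo arcconf getconfig 1 AD",
    "sh 1764911 5802 0 sudo arcconf getconfig 1 AD",  # invented second entry
]
events = [parse_execsnoop_line(line) for line in lines]

# Count how many processes each parent has spawned.
spawners = Counter(event["ppid"] for event in events)
print(spawners.most_common(1))
```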

But it is still unclear why the bcache device was so slow. We prepared a test platform with an identical configuration and tried to reproduce the problem by running fio against the bcache device while periodically running udevadm trigger to generate uevents.

Writing BCC-Based Tools

Let's try to write a simple utility that traces and prints the slowest generic_make_request() calls. We are also interested in the name of the disk for which the function was called.

The plan is simple:

  • Register a kprobe on generic_make_request():
    • save the disk name, available through the function argument, in memory;
    • save a timestamp.

  • Register a kretprobe for the return from generic_make_request():
    • get the current timestamp;
    • look up the saved timestamp and compare it with the current one;
    • if the difference is greater than the given threshold, look up the saved disk name and print it to the terminal.
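Before moving to BPF, the plan above can be modelled in plain Python. This is only a simulation of the logic: a dict stands in for the BPF hash table, timestamps are passed in by hand, and the disk names are invented.

```python
in_flight = {}  # stands in for the BPF hash table: key -> (start_ns, disk)

def on_entry(key, disk, now_ns):
    """kprobe side: remember the disk name and the entry timestamp."""
    in_flight[key] = (now_ns, disk)

def on_return(key, now_ns, min_us):
    """kretprobe side: report (disk, latency_us) if the call was slow."""
    entry = in_flight.pop(key, None)
    if entry is None:
        return None  # a return without a recorded entry
    start_ns, disk = entry
    lat_us = (now_ns - start_ns) // 1000
    return (disk, lat_us) if lat_us > min_us else None

on_entry(key=100, disk="sdb", now_ns=1_000_000)
slow = on_return(key=100, now_ns=6_000_000, min_us=1000)
print(slow)
```

The entry is removed on return whether or not the call was slow, so the table does not grow with every traced call.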

Kprobes and kretprobes use a breakpoint mechanism to change function code on the fly. You can read the documentation and a good article on this topic. If you look at the code of the various BCC utilities, you will see that they share the same structure. So in this article we will skip parsing the script's arguments and go straight to the BPF program itself.

The eBPF program text inside the Python script looks like this:

bpf_text = """ # Here will be the bpf program code """

To exchange data between functions, eBPF programs use hash tables. We will do the same. We will use the process PID as the key and define a structure as the value:

struct data_t {
	u64 pid;
	u64 ts;
	char comm[TASK_COMM_LEN];
	u64 lat;
	char disk[DISK_NAME_LEN];
};

BPF_HASH(p, u64, struct data_t);
BPF_PERF_OUTPUT(events);

Here we register a hash table called p, with a key of type u64 and a value of type struct data_t. The table will be accessible in the context of our BPF program. The BPF_PERF_OUTPUT macro registers another table called events, which is used to pass data to user space.

When measuring the delay between a function call and its return, or between calls to different functions, you need to make sure that the data you read belongs to the same context. In other words, you have to keep in mind that the same function may be running in parallel. We could measure the delay between a function call in the context of one process and the return from that function in the context of another, but this is unlikely to be useful. A good example here is the biolatency utility, where the hash table key is a pointer to struct request, which represents a single disk request.
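The effect of the key choice can be illustrated with a small model. The trace below is hypothetical: a request is issued in the context of one PID and completed in a different context, so pairing by PID finds nothing while pairing by a per-request identifier works.

```python
def measure(events, key_of):
    """Pair 'start' and 'end' events using key_of(event) as the hash key."""
    started, latencies = {}, []
    for ev in events:
        key = key_of(ev)
        if ev["kind"] == "start":
            started[key] = ev["ts"]
        elif key in started:
            latencies.append(ev["ts"] - started.pop(key))
    return latencies

# Invented trace: issued by PID 4242, completed in another context (PID 0).
events = [
    {"kind": "start", "pid": 4242, "req": 0xA, "ts": 10},
    {"kind": "end", "pid": 0, "req": 0xA, "ts": 35},
]
by_pid = measure(events, key_of=lambda ev: ev["pid"])
by_req = measure(events, key_of=lambda ev: ev["req"])
print(by_pid, by_req)
```

In our case the call to generic_make_request() and its return happen in the same thread, so keying by the value of bpf_get_current_pid_tgid() is sufficient.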

Next, we need to write the code that will run when the function under study is called:

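// kprobe handler: runs on entry to generic_make_request()
void start(struct pt_regs *ctx, struct bio *bio) {
	// upper 32 bits: thread group ID, lower 32 bits: thread ID
	u64 pid = bpf_get_current_pid_tgid();
	struct data_t data = {};
	u64 ts = bpf_ktime_get_ns();
	data.pid = pid;
	data.ts = ts;
	// copy the disk name out of the bio passed to generic_make_request()
	bpf_probe_read_str(&data.disk, sizeof(data.disk), (void*)bio->bi_disk->disk_name);
	// remember the entry until the matching return
	p.update(&pid, &data);
}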

Here the first argument of generic_make_request() is passed as the second argument of our function. We then get the PID of the process in whose context we are running and the current timestamp in nanoseconds, and write both into the newly allocated struct data_t data. We take the disk name from the bio structure that is passed to generic_make_request() and store it in the same data structure. The last step is to add an entry to the hash table mentioned earlier.

The following function will be called on return from generic_make_request():

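// kretprobe handler: runs on return from generic_make_request()
void stop(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    // find the entry saved by start() for this thread
    struct data_t* data = p.lookup(&pid);
    if (data != 0 && data->ts > 0) {
        bpf_get_current_comm(&data->comm, sizeof(data->comm));
        // latency in microseconds
        data->lat = (ts - data->ts)/1000;
        if (data->lat > MIN_US) {
            // FACTOR is replaced from Python with an optional unit conversion
            FACTOR
            // keep only the thread group ID for printing
            data->pid >>= 32;
            events.perf_submit(ctx, data, sizeof(struct data_t));
        }
        p.delete(&pid);
    }
}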

This function is similar to the previous one: we get the process PID and the timestamp, but we do not allocate memory for a new data structure. Instead, we look in the hash table for an existing structure using key == current PID. If the structure is found, we get the name of the running process and add it to the structure.

The bit shift we use here is needed to obtain the thread group ID (TGID), i.e. the PID of the main process that started the thread in whose context we are running. The bpf_get_current_pid_tgid() function we call returns both the thread group ID and the thread's own PID in a single 64-bit value.
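The same split can be shown in Python. The sketch assumes the usual layout of the value, with the thread group ID in the upper 32 bits and the thread ID in the lower 32 bits. The numbers are examples only.

```python
def split_pid_tgid(pid_tgid):
    """Split the 64-bit value into (thread group ID, thread ID)."""
    tgid = pid_tgid >> 32        # what user space calls the PID
    tid = pid_tgid & 0xFFFFFFFF  # the individual thread
    return tgid, tid

value = (5802 << 32) | 5810  # example values, not taken from a real trace
print(split_pid_tgid(value))
```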

When printing to the terminal, we are currently not interested in the thread but in the main process. After comparing the measured delay with the given threshold, we pass our data structure to user space through the events table, and then delete the entry from p.

In the Python script that will load this code, we need to replace MIN_US and FACTOR with the delay threshold and the time unit, which we will pass in as arguments:

bpf_text = bpf_text.replace('MIN_US',str(min_usec))
if args.milliseconds:
	bpf_text = bpf_text.replace('FACTOR','data->lat /= 1000;')
	label = "msec"
else:
	bpf_text = bpf_text.replace('FACTOR','')
	label = "usec"
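Argument parsing is skipped in this article, so here is one possible way min_usec and args.milliseconds could be produced. The option names and the positional threshold are hypothetical and chosen for illustration only; the published script may differ.

```python
import argparse

def parse_args(argv=None):
    """Hypothetical command line: an optional -m flag and a threshold."""
    parser = argparse.ArgumentParser(
        description="Trace slow generic_make_request() calls")
    parser.add_argument("-m", "--milliseconds", action="store_true",
                        help="interpret and print latency in milliseconds")
    parser.add_argument("min_latency", type=int, nargs="?", default=1,
                        help="minimum latency to report, in the chosen unit")
    return parser.parse_args(argv)

def threshold_usec(args):
    """Convert the threshold to microseconds, the unit MIN_US expects."""
    return args.min_latency * 1000 if args.milliseconds else args.min_latency

args = parse_args(["-m", "10"])
min_usec = threshold_usec(args)
print(min_usec)
```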

Now we need to prepare the BPF program with BPF() and register the probes:

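from bcc import BPF

# compile and load the BPF program
b = BPF(text=bpf_text)
# run start() on entry and stop() on return of generic_make_request()
b.attach_kprobe(event="generic_make_request", fn_name="start")
b.attach_kretprobe(event="generic_make_request", fn_name="stop")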

We also have to define struct data_t in our script, otherwise we will not be able to read anything:

import ctypes as ct

TASK_COMM_LEN = 16  # linux/sched.h
DISK_NAME_LEN = 32  # linux/genhd.h

# must mirror struct data_t from the BPF program, field by field
class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]
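To check that the Python-side structure matches the C layout, it can be filled from a hand-packed byte string. This sketch repeats the structure so that it runs on its own; the record contents are fabricated, and the pack format assumes the fields are laid out without padding, which holds for this field order on common 64-bit platforms.

```python
import ctypes as ct
import struct

TASK_COMM_LEN = 16
DISK_NAME_LEN = 32

class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]

# Fabricated record: pid, timestamp, comm, latency, disk name.
raw = struct.pack("=QQ16sQ32s", 5802, 123456789, b"ceph-osd", 2500, b"sdb")
event = Data.from_buffer_copy(raw)
print(ct.sizeof(Data), event.pid, event.comm, event.lat, event.disk)
```

If the sizes disagree, from_buffer_copy raises an error, which is a quick way to notice a field that was added on one side only.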

The last step is to print the data to the terminal:

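def print_event(cpu, data, size):
    """Decode one perf-buffer record and print it as a table row."""
    global start
    event = ct.cast(data, ct.POINTER(Data)).contents
    if start == 0:
        start = event.ts  # the first event becomes time zero
    time_s = float(event.ts - start) / 1000000000
    # comm and disk arrive as bytes; decode them so Python 3 prints plain text
    print("%-18.9f %-16s %-6d   %-1s %s   %s" % (
        time_s, event.comm.decode("utf-8", "replace"), event.pid,
        event.lat, label, event.disk.decode("utf-8", "replace")))

start = 0  # timestamp of the first event, set in print_event
b["events"].open_perf_buffer(print_event)
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()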

The script itself is available on GitHub. Let's try running it on the test platform where fio is running and writing to bcache, and call udevadm monitor.

At last! Now we can see that what looked like a stalling bcache device is actually a stalling generic_make_request() call for the cached disk.

Digging into the Kernel

What exactly slows down during request submission? We can see that the delay occurs even before request accounting starts, i.e. before the accounting of the specific request for later output in statistics (/proc/diskstats or iostat) has begun. This is easy to verify by running iostat while reproducing the problem, or the BCC biolatency script, which is based on the start and end of request accounting. Neither of these utilities will show problems for requests to the cached disk.

If we look at generic_make_request(), we see that two more functions are called before request accounting starts. The first, generic_make_request_checks(), checks that the request is valid for the disk settings. The second, blk_queue_enter(), contains an interesting wait_event_interruptible() call:

ret = wait_event_interruptible(q->mq_freeze_wq,
	(atomic_read(&q->mq_freeze_depth) == 0 &&
	(preempt || !blk_queue_preempt_only(q))) ||
	blk_queue_dying(q));

In it, the kernel waits for the queue to be unfrozen. Let's measure the latency of blk_queue_enter():

~# /usr/share/bcc/tools/funclatency blk_queue_enter -i 1 -m
Tracing 1 functions for "blk_queue_enter"... Hit Ctrl-C to end.

     msecs               : count     distribution
         0 -> 1          : 341      |****************************************|

     msecs               : count     distribution
         0 -> 1          : 316      |****************************************|

     msecs               : count     distribution
         0 -> 1          : 255      |****************************************|
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 1        |                                        |

It looks like we are close to a solution. The functions used to freeze/unfreeze the queue are blk_mq_freeze_queue and blk_mq_unfreeze_queue. They are used when request queue settings need to be changed in a way that could be dangerous for requests already in the queue. When blk_mq_freeze_queue() is called, blk_freeze_queue_start() increments the q->mq_freeze_depth counter. After that, the kernel waits for the queue to drain in blk_mq_freeze_queue_wait().

The time it takes to drain the queue is equivalent to the disk latency, since the kernel waits for all queued operations to complete. Once the queue is empty, the settings changes are applied. After that blk_mq_unfreeze_queue() is called, which decrements the mq_freeze_depth counter.
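The freeze sequence can be pictured with a toy model. This is not kernel code: it only tracks a freeze counter and the number of requests in flight, and where the real blk_queue_enter() would sleep, the model simply reports that a new request cannot enter.

```python
class QueueModel:
    """Toy model of the freeze counter of a request queue."""

    def __init__(self):
        self.freeze_depth = 0
        self.in_flight = 0

    def try_enter(self):
        # a new request may enter only while the queue is not frozen
        if self.freeze_depth > 0:
            return False
        self.in_flight += 1
        return True

    def complete(self):
        self.in_flight -= 1

    def freeze_start(self):
        self.freeze_depth += 1

    def is_drained(self):
        return self.in_flight == 0

    def unfreeze(self):
        self.freeze_depth -= 1

q = QueueModel()
steps = [q.try_enter()]       # a request enters the unfrozen queue
q.freeze_start()              # a settings change begins
steps.append(q.try_enter())   # new requests now have to wait
steps.append(q.is_drained())  # the earlier request is still in flight
q.complete()                  # the disk finishes it
steps.append(q.is_drained())  # settings can now be applied
q.unfreeze()
steps.append(q.try_enter())   # submissions proceed again
print(steps)
```

The gap between freeze_start() and unfreeze() is bounded below by how long the in-flight requests take, which is why a slow backing disk turns every freeze into a visible stall.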

Now we know enough to fix the situation. The udevadm trigger command causes the block device settings to be applied. These settings are described in udev rules. We can find out which settings freeze the queue by trying to change them through sysfs or by looking at the kernel source code. We can also try the BCC trace utility, which will print kernel and user-space stack traces to the terminal for every blk_freeze_queue call, for example:

~# /usr/share/bcc/tools/trace blk_freeze_queue -K -U
PID 	TID 	COMM        	FUNC        	 
3809642 3809642 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	elevator_switch+0x29 [kernel]
    	elv_iosched_store+0x197 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

3809631 3809631 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	queue_requests_store+0xb6 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

Udev rules change quite rarely, and usually in a controlled way. So we see that even applying values that are already set causes a spike in the delay of passing a request from the application to the disk. Of course, generating udev events when there are no changes in the disk configuration (for example, no device is being attached or detached) is not good practice. However, we can help the kernel avoid unnecessary work and not freeze the request queue when it is not needed. Three small commits fix the situation.
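The same idea can be applied from user space: do not write a setting that already has the wanted value. The sketch below is an illustration of that idea only, not a description of the kernel commits. It assumes that elv_iosched_store and queue_requests_store in the stack traces above correspond to the queue/scheduler and queue/nr_requests sysfs attributes, and the demo uses a temporary file in place of a real sysfs path.

```python
import os
import tempfile

def active_scheduler(text):
    """Return the scheduler shown in [brackets], e.g. 'noop [deadline] cfq'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return text.strip()

def write_if_changed(path, wanted, current_of=str.strip):
    """Write wanted to path only if the current value differs."""
    with open(path) as f:
        current = current_of(f.read())
    if current == wanted:
        return False  # nothing to do, so no write and no queue freeze
    with open(path, "w") as f:
        f.write(wanted)
    return True

# Demo on a temporary file standing in for /sys/block/<dev>/queue/nr_requests.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nr_requests")
    with open(path, "w") as f:
        f.write("128\n")
    results = [write_if_changed(path, "128"),
               write_if_changed(path, "256"),
               write_if_changed(path, "256")]
print(results)
```

For the scheduler attribute, active_scheduler can be passed as current_of, since that file lists all available schedulers and marks the active one with brackets.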

Conclusion

eBPF is a very flexible and powerful tool. In this article we looked at one practical case and showed a small part of what can be done. If you are interested in developing BCC utilities, it is worth looking at the official tutorial, which explains the basics well.

There are other interesting debugging and profiling tools based on eBPF. One of them is bpftrace, which lets you write powerful one-liners and small programs in an awk-like language. Another is ebpf_exporter, which lets you collect low-level, high-resolution metrics directly into your prometheus server, with the ability to get nice visualisations and alerts later.

Source: www.habr.com