Linux has many tools for debugging the kernel and applications. Most of them hurt application performance and cannot be used in production.
A couple of years ago another tool appeared: eBPF.
Many utilities now use eBPF, and in this article we look at how to write your own utility based on the BCC library.
Ceph is slow
A new host was added to the Ceph cluster. After migrating some of the data to it, we noticed that it processed write requests much more slowly than the other servers.
Unlike the other platforms, this host used bcache and the new Linux 4.15 kernel. It was the first time a host with this configuration had been used here, and at that point the root of the problem could have been anything.
Investigating the host
Let's start by looking at what is happening inside the ceph-osd process. To do that, we profile it.
The resulting graph shows that fdatasync() spent a long time submitting requests in generic_make_request(). This means the cause of the problem most likely lies outside the osd daemon itself: in the kernel or in the disks. The iostat output showed high request latency on the bcache disks.
While checking the host, we noticed that the systemd-udevd daemon was consuming a lot of CPU time: about 20% on several cores. This is odd behavior, so we needed to find out why. Since systemd-udevd works with uevents, we decided to watch them with udevadm monitor. It turned out that a large number of change events were being generated for every block device in the system. This is quite unusual, so we needed to find out what was generating all these events.
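To get a feel for how noisy the system is, the output of udevadm monitor can be tallied per device. The sketch below is a hypothetical helper, not something used in the original investigation. It assumes the usual `SOURCE[timestamp] action devpath (subsystem)` line layout, and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout of a `udevadm monitor` line:
#   KERNEL[12345.678901] change   /devices/.../block/sdb (block)
LINE_RE = re.compile(
    r"^(?P<source>KERNEL|UDEV)\s*\[(?P<ts>[\d.]+)\]\s+"
    r"(?P<action>\w+)\s+(?P<devpath>\S+)\s+\((?P<subsystem>[^)]+)\)$"
)

def count_block_changes(lines, source="KERNEL"):
    """Count 'change' uevents per block device path."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # header lines, blanks, anything unexpected
        if (m["source"] == source and m["action"] == "change"
                and m["subsystem"] == "block"):
            counts[m["devpath"]] += 1
    return counts

# Made-up sample input, for illustration only.
SAMPLE = """\
KERNEL[100.000001] change   /devices/virtual/block/bcache0 (block)
UDEV  [100.000900] change   /devices/virtual/block/bcache0 (block)
KERNEL[100.100002] change   /devices/virtual/block/bcache0 (block)
KERNEL[100.200003] add      /devices/virtual/net/veth0 (net)
""".splitlines()

counts = count_block_changes(SAMPLE)
for devpath, n in counts.most_common():
    print(n, devpath)
```

Counting only one source (KERNEL by default) avoids counting the same event twice, since each uevent normally shows up once from the kernel and once after udev has processed it.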
Using the BCC Toolkit
As we already found out, the kernel (and the ceph daemon inside its system call) spends a lot of time in generic_make_request(). Let's measure how fast this function runs.
This function normally runs fast. All it does is hand the request over to the device driver's queue.
Bcache is a complex device that actually consists of three disks:
- the backing device (cached disk), in this case a slow HDD;
- the caching device (caching disk), here a partition of an NVMe device;
- the bcache virtual device that the application works with.
We know that request submission is slow, but for which of these devices? We will deal with that a little later.
We now know that uevents are likely to cause problems. Finding out what generates them is not that easy. Let's assume it is some software that runs periodically. To see what software runs on the system, we use the execsnoop script from the same BCC toolkit.
For example, like this:
/usr/share/bcc/tools/execsnoop | tee ./execdump
We won't show the full execsnoop output here, but one line of interest looked like this:
sh 1764905 5802 0 sudo arcconf getconfig 1 AD | grep Temperature | awk -F '[:/]' '{print $2}' | sed 's/^ \([0-9]*\) C.*/\1/'
The third column is the PPID (parent PID) of the process. The process with PID 5802 turned out to be one of the threads of our monitoring system. When we checked the monitoring configuration, we found incorrect parameters. The temperature of the HBA adapter was being read every 30 seconds, far more often than necessary. After changing the check interval to a longer one, we saw that request-processing latency on this host no longer stood out compared with the other hosts.
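Picking the parent PID out of a saved execsnoop dump can be scripted. The following is a minimal sketch that assumes the default column order seen in the line above (command, PID, PPID, return value, arguments) and a command name without spaces. The helper and type names are hypothetical.

```python
from typing import NamedTuple

class ExecEvent(NamedTuple):
    comm: str
    pid: int
    ppid: int
    ret: int
    args: str

def parse_execsnoop_line(line):
    """Split one execsnoop line into its five columns."""
    # maxsplit=4 keeps the whole command line, spaces included, in `args`
    comm, pid, ppid, ret, args = line.split(None, 4)
    return ExecEvent(comm, int(pid), int(ppid), int(ret), args)

line = "sh 1764905 5802 0 sudo arcconf getconfig 1 AD | grep Temperature"
event = parse_execsnoop_line(line)
print(event.ppid, event.args.split()[1])
```

Grouping such events by `ppid` makes a frequently firing parent, like the monitoring thread here, easy to spot.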
But it was still unclear why the bcache device was so slow. We prepared a test platform with an identical configuration and tried to reproduce the problem by running fio on bcache while periodically running udevadm trigger to generate uevents.
Writing BCC-based tools
Let's try to write a simple utility that traces and prints the slowest generic_make_request() calls. We are also interested in the name of the disk for which the function was called.
The plan is simple:
- Register a kprobe on generic_make_request():
  - save the disk name, which is available through the function's argument;
  - save the timestamp.
- Register a kretprobe on the return from generic_make_request():
  - get the current timestamp;
  - look up the saved timestamp and compare it with the current one;
  - if the difference is greater than the given threshold, look up the saved disk name and print it to the terminal.
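The plan above can be modelled in plain Python before touching the kernel. This is only a sketch of the bookkeeping, with a dictionary standing in for the BPF hash table; the class and method names are invented for the example.

```python
class LatencyTracer:
    """Pure-Python model of the kprobe/kretprobe plan (no kernel involved)."""

    def __init__(self, min_us):
        self.min_us = min_us
        self.pending = {}   # caller id -> (entry timestamp in ns, disk name)
        self.events = []    # slow calls that would be printed to the terminal

    def on_entry(self, caller, disk, now_ns):
        # kprobe step: remember the disk name and the timestamp
        self.pending[caller] = (now_ns, disk)

    def on_return(self, caller, now_ns):
        # kretprobe step: look up the saved entry and compare timestamps
        saved = self.pending.pop(caller, None)
        if saved is None:
            return          # return without a recorded entry: ignore
        start_ns, disk = saved
        lat_us = (now_ns - start_ns) // 1000
        if lat_us > self.min_us:
            self.events.append((caller, disk, lat_us))

tracer = LatencyTracer(min_us=1000)
tracer.on_entry(caller=101, disk="sdb", now_ns=0)
tracer.on_entry(caller=102, disk="nvme0n1", now_ns=10_000)
tracer.on_return(caller=102, now_ns=60_000)        # 50 us: below threshold
tracer.on_return(caller=101, now_ns=5_000_000)     # 5000 us: reported
print(tracer.events)
```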
Kprobes and kretprobes use a breakpoint mechanism to change function code on the fly.
The eBPF text inside the python script looks like this:
bpf_text = """ # Here will be the bpf program code """
To exchange data between functions, eBPF programs use hash tables. Ours is declared as follows:
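#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>       // struct bio, DISK_NAME_LEN

// Record shared between the entry and return probes.
struct data_t {
    u64 pid;                    // pid_tgid on entry, TGID when submitted
    u64 ts;                     // entry timestamp, ns
    char comm[TASK_COMM_LEN];   // process name
    u64 lat;                    // measured latency
    char disk[DISK_NAME_LEN];   // disk name
};

BPF_HASH(p, u64, struct data_t);
BPF_PERF_OUTPUT(events);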
Here we register a hash table named p, with key type u64 and value type struct data_t. The table is available in the context of our BPF program. The BPF_PERF_OUTPUT macro registers another table named events, which is used to pass data to user space.
When measuring delays between a function call and its return, or between calls to different functions, you have to make sure that the data you saved and the data you read later belong to the same context. In other words, you have to remember that the same function may be running in parallel in several places. We are able to measure the latency between a function call in the context of one process and the return from that function in the context of another process, but that is most likely useless.
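The effect of choosing the wrong key is easy to show with a small simulation. The sketch below is illustrative only: timestamps are made up, and the table is an ordinary dictionary rather than a BPF map.

```python
def measure(events, key_fn):
    """Pair 'enter'/'exit' events using key_fn(thread_id) as the table key."""
    table, latencies = {}, []
    for kind, thread_id, ts in events:
        key = key_fn(thread_id)
        if kind == "enter":
            table[key] = ts
        elif key in table:
            latencies.append((thread_id, ts - table.pop(key)))
    return latencies

# Two threads call the same function at overlapping times (made-up timestamps).
events = [
    ("enter", 1, 0),
    ("enter", 2, 90),
    ("exit", 1, 100),   # thread 1 really took 100
    ("exit", 2, 110),   # thread 2 really took 20
]

shared = measure(events, key_fn=lambda tid: "one-slot")   # context ignored
per_thread = measure(events, key_fn=lambda tid: tid)      # keyed by context
print(shared)
print(per_thread)
```

With a single shared slot, thread 2's entry overwrites thread 1's, so thread 1 appears to have taken 10 instead of 100 and thread 2's measurement is lost. Keying by the calling context gives the real values.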
Next we need to write the code that runs when the function under study is called:
// kprobe handler: runs on entry to generic_make_request().
void start(struct pt_regs *ctx, struct bio *bio) {
    u64 pid = bpf_get_current_pid_tgid();
    struct data_t data = {};
    u64 ts = bpf_ktime_get_ns();
    data.pid = pid;
    data.ts = ts;
    // Copy the disk name out of kernel memory.
    bpf_probe_read_str(&data.disk, sizeof(data.disk), (void*)bio->bi_disk->disk_name);
    p.update(&pid, &data);
}
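Here the first argument of the traced function is passed in as the second argument of our handler. We take the PID and the current timestamp in nanoseconds, store them in a new struct data_t, read the disk name from the bio structure, and finally add the entry to the hash table p.
This function is called on return from generic_make_request():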
// kretprobe handler: runs on return from generic_make_request().
void stop(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    struct data_t* data = p.lookup(&pid);
    if (data != 0 && data->ts > 0) {
        bpf_get_current_comm(&data->comm, sizeof(data->comm));
        data->lat = (ts - data->ts)/1000;   // ns -> us
        if (data->lat > MIN_US) {
            FACTOR
            data->pid >>= 32;               // keep only the TGID
            events.perf_submit(ctx, data, sizeof(struct data_t));
        }
        p.delete(&pid);
    }
}
This function is similar to the previous one: we get the PID of the process and the timestamp, but we do not allocate memory for a new data structure. Instead, we search the hash table for an existing structure using key == current PID. If the structure is found, we get the name of the running process and add it to the structure.
The binary shift we use here is needed to get the thread GID, that is, the PID of the main process that started the thread in whose context we are working. The function we call, bpf_get_current_pid_tgid(), returns both the thread group ID and the thread's own ID in a single 64-bit value.
When printing to the terminal we are not interested in the thread for now, only in the main process. After comparing the resulting latency with the given threshold, we pass our data structure to user space via the events table and then delete the entry from p.
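The same split can be tried outside the kernel. A minimal sketch, using invented IDs, of how the 64-bit value breaks down into its two halves:

```python
def split_pid_tgid(pid_tgid):
    """Split the packed 64-bit value into (tgid, tid)."""
    tgid = pid_tgid >> 32            # upper 32 bits: the PID user space sees
    tid = pid_tgid & 0xFFFFFFFF      # lower 32 bits: the individual thread's ID
    return tgid, tid

# Hypothetical values: thread 3809700 belonging to process 3809642.
packed = (3809642 << 32) | 3809700
print(split_pid_tgid(packed))
```

This is why the hash table is keyed by the full 64-bit value (unique per thread), while only the upper half is kept for display.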
In the python script that loads this code, we need to replace MIN_US and FACTOR with the latency threshold and the time unit, which we pass in through arguments:
bpf_text = bpf_text.replace('MIN_US',str(min_usec))
if args.milliseconds:
    bpf_text = bpf_text.replace('FACTOR','data->lat /= 1000;')
    label = "msec"
else:
    bpf_text = bpf_text.replace('FACTOR','')
    label = "usec"
Now we need to prepare the BPF program via BPF() and attach the probes:
from bcc import BPF  # normally placed at the top of the script
b = BPF(text=bpf_text)
b.attach_kprobe(event="generic_make_request",fn_name="start")
b.attach_kretprobe(event="generic_make_request",fn_name="stop")
We also need to define struct data_t in our script, otherwise we won't be able to read anything:
import ctypes as ct

TASK_COMM_LEN = 16 # linux/sched.h
DISK_NAME_LEN = 32 # linux/genhd.h

class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]
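To check that the ctypes mirror really matches the C layout, it can be fed a hand-built record. The sketch below assumes the same field order as struct data_t above. The byte string is fabricated with `struct.pack` purely for illustration and stands in for what the perf buffer would deliver.

```python
import ctypes as ct
import struct

TASK_COMM_LEN = 16
DISK_NAME_LEN = 32

class Data(ct.Structure):
    _fields_ = [("pid", ct.c_ulonglong),
                ("ts", ct.c_ulonglong),
                ("comm", ct.c_char * TASK_COMM_LEN),
                ("lat", ct.c_ulonglong),
                ("disk", ct.c_char * DISK_NAME_LEN)]

# Build a fake 72-byte record in native byte order
# (all values are invented for the example).
raw = struct.pack("=QQ16sQ32s", 4242, 1_000_000, b"ceph-osd", 1500, b"sdb")

event = Data.from_buffer_copy(raw)
print(ct.sizeof(Data), event.pid, event.comm, event.lat, event.disk)
```

If the Python field order or sizes drift from the C struct, the decoded values turn into garbage, which is what the text means by not being able to read anything.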
The last step is to print the data to the terminal:
def print_event(cpu, data, size):
    global start
    event = ct.cast(data, ct.POINTER(Data)).contents
    if start == 0:
        start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
    # comm and disk arrive as bytes; decode them for printing under Python 3
    print("%-18.9f %-16s %-6d %-1s %s %s" % (time_s, event.comm.decode(), event.pid, event.lat, label, event.disk.decode()))

b["events"].open_perf_buffer(print_event)
# format output
start = 0
while 1:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()
The full script is available online.
Finally! Now we see that what looked like a stalling bcache device is actually a stalling generic_make_request() call for the cached (backing) disk.
Digging into the kernel
What exactly is slow during request submission? We see that the delay occurs even before request accounting starts, that is, before the accounting that feeds the statistics (/proc/diskstats or iostat) has begun for a given request. This is easy to verify by running iostat while reproducing the problem.
If we look at generic_make_request(), we see that two more functions are called before request accounting starts. The first, generic_make_request_checks(), checks the validity of the request against the disk settings. The second is blk_queue_enter(), which contains a call to wait_event_interruptible():
ret = wait_event_interruptible(q->mq_freeze_wq,
        (atomic_read(&q->mq_freeze_depth) == 0 &&
         (preempt || !blk_queue_preempt_only(q))) ||
        blk_queue_dying(q));
Here the kernel waits for the queue to be unfrozen. Let's measure the latency of blk_queue_enter():
~# /usr/share/bcc/tools/funclatency blk_queue_enter -i 1 -m
Tracing 1 functions for "blk_queue_enter"... Hit Ctrl-C to end.
msecs : count distribution
0 -> 1 : 341 |****************************************|
msecs : count distribution
0 -> 1 : 316 |****************************************|
msecs : count distribution
0 -> 1 : 255 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 | |
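The histogram rows are power-of-two ranges, so the single hit in the 8 -> 15 row means one call waited between 8 and 15 ms. The helper below is a hypothetical illustration of how a measured value maps to such a row. It mirrors the labels in the output above and is not BCC's own implementation.

```python
def log2_bucket(value):
    """Return the (low, high) range of the power-of-two bucket holding value."""
    if value < 2:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)   # largest power of two <= value
    return (low, 2 * low - 1)

for ms in (0, 1, 3, 9, 15):
    print(ms, "->", log2_bucket(ms))
```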
It looks like we are close to a solution. The kernel has dedicated functions for freezing and unfreezing a queue.
The time it takes to drain the queue is equal to the disk latency, because the kernel waits for all queued operations to complete. Once the queue is empty, the settings changes are applied, and after that the queue is unfrozen.
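The freeze, drain, apply, unfreeze sequence can be sketched as a toy state machine. The class and method names are illustrative and do not correspond to kernel functions. The model only shows why a new request stalls while a settings change waits for an older request on a slow disk.

```python
class BlockQueue:
    """Toy model of queue freezing; names are illustrative, not kernel API."""

    def __init__(self):
        self.freeze_depth = 0
        self.in_flight = 0

    def try_enter(self):
        # A new request may enter only while the queue is not frozen.
        if self.freeze_depth > 0:
            return False        # the real kernel would sleep here
        self.in_flight += 1
        return True

    def complete(self):
        self.in_flight -= 1

    def freeze(self):
        self.freeze_depth += 1  # from now on, new requests are held back

    def drained(self):
        return self.in_flight == 0

    def unfreeze(self):
        self.freeze_depth -= 1

q = BlockQueue()
assert q.try_enter()            # request A is accepted
q.freeze()                      # a settings change begins
blocked = not q.try_enter()     # request B has to wait
waiting_for_a = not q.drained() # the change itself waits for A
q.complete()                    # A finishes on the slow disk
q.unfreeze()                    # settings applied, queue reopened
accepted_after = q.try_enter()  # B can finally proceed
print(blocked, waiting_for_a, accepted_after)
```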
Now we know enough to fix the situation. The udevadm trigger command causes the settings for the block device to be applied. These settings are described in udev rules. We can find out which settings freeze the queue by trying to change them via sysfs or by looking at the kernel source code. We can also use the BCC trace utility, which prints the kernel and user-space stack for each call to blk_freeze_queue, for example:
~# /usr/share/bcc/tools/trace blk_freeze_queue -K -U
PID TID COMM FUNC
3809642 3809642 systemd-udevd blk_freeze_queue
blk_freeze_queue+0x1 [kernel]
elevator_switch+0x29 [kernel]
elv_iosched_store+0x197 [kernel]
queue_attr_store+0x5c [kernel]
sysfs_kf_write+0x3c [kernel]
kernfs_fop_write+0x125 [kernel]
__vfs_write+0x1b [kernel]
vfs_write+0xb8 [kernel]
sys_write+0x55 [kernel]
do_syscall_64+0x73 [kernel]
entry_SYSCALL_64_after_hwframe+0x3d [kernel]
__write_nocancel+0x7 [libc-2.23.so]
[unknown]
3809631 3809631 systemd-udevd blk_freeze_queue
blk_freeze_queue+0x1 [kernel]
queue_requests_store+0xb6 [kernel]
queue_attr_store+0x5c [kernel]
sysfs_kf_write+0x3c [kernel]
kernfs_fop_write+0x125 [kernel]
__vfs_write+0x1b [kernel]
vfs_write+0xb8 [kernel]
sys_write+0x55 [kernel]
do_syscall_64+0x73 [kernel]
entry_SYSCALL_64_after_hwframe+0x3d [kernel]
__write_nocancel+0x7 [libc-2.23.so]
[unknown]
Udev rules change rarely, and usually in a controlled way. So we see that even re-applying values that are already set causes a spike in the latency of passing a request from the application to the disk. Of course, generating udev events when nothing has changed in the disk configuration (for example, no device was attached or detached) is not good practice. Still, we can help the kernel avoid pointless work and not freeze the request queue when there is no need to.
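The idea of not freezing the queue when nothing changes can be illustrated with a small model. This is a sketch of the principle only, not the kernel change itself; the class, the setting names and the values are made up.

```python
class QueueSettings:
    """Toy model: count how many times a settings write freezes the queue."""

    def __init__(self, **initial):
        self.values = dict(initial)
        self.freezes = 0

    def store_always(self, name, value):
        # Behaviour described above: every write freezes the queue.
        self.freezes += 1
        self.values[name] = value

    def store_if_changed(self, name, value):
        # Idea being illustrated: skip the freeze when nothing would change.
        if self.values.get(name) == value:
            return
        self.freezes += 1
        self.values[name] = value

naive = QueueSettings(scheduler="deadline", nr_requests=128)
careful = QueueSettings(scheduler="deadline", nr_requests=128)

# Ten rounds of udev re-applying the same rules.
for _ in range(10):
    naive.store_always("scheduler", "deadline")
    naive.store_always("nr_requests", 128)
    careful.store_if_changed("scheduler", "deadline")
    careful.store_if_changed("nr_requests", 128)

careful.store_if_changed("nr_requests", 256)   # a real change still freezes
print(naive.freezes, careful.freezes)
```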
Conclusion
eBPF is a very flexible and powerful tool. In this article we looked at one practical case and demonstrated a small part of what it can do.
There are other interesting eBPF-based debugging and profiling tools as well.
Source: www.habr.com