Debugging network latency in Kubernetes

A couple of years ago, Kubernetes was first discussed on the official GitHub blog. Since then it has become the standard technology for deploying services, and Kubernetes now runs a significant share of our internal and public services. As our clusters grew and performance requirements became stricter, we began to notice that some services on Kubernetes sporadically experienced latency that could not be explained by the load of the application itself.

In essence, applications saw seemingly random network latency of 100ms or more, which resulted in timeouts or retries. Services were expected to respond to requests much faster than 100ms, but that is impossible if the connection itself takes that long. Separately, we observed very fast MySQL queries that should take milliseconds, and MySQL did complete them in milliseconds, but from the point of view of the requesting application the response took 100ms or more.

It quickly became clear that the problem only occurred when connecting to a Kubernetes node, even if the call came from outside Kubernetes. The simplest way to reproduce the problem is a Vegeta test that runs from any internal host, tests a Kubernetes service on a specific port, and sporadically records high latency. In this article we look at how we tracked down the cause of this problem.

Eliminating unnecessary complexity in the chain leading to the failure

Having reproduced the same example, we wanted to narrow the scope of the problem and remove unnecessary layers of complexity. Initially there were too many elements in the flow between Vegeta and the Kubernetes pods. To identify a deeper network problem, you need to rule some of them out.

The client (Vegeta) creates a TCP connection to any node in the cluster. Kubernetes operates as an overlay network (on top of the existing data center network) that uses IPIP, that is, it encapsulates the IP packets of the overlay network inside IP packets of the data center. On connecting to the first node, stateful Network Address Translation (NAT) is performed to translate the IP address and port of the Kubernetes node into the IP address and port in the overlay network (specifically, the pod running the application). For return packets, the reverse sequence is performed. It is a complex system with a lot of state and many elements that are constantly updated and changed as services are deployed and moved.

Running tcpdump during the Vegeta test showed a delay during the TCP handshake (between SYN and SYN-ACK). To remove this unnecessary complexity, you can use hping3 for simple "pings" with SYN packets. We check whether the response packet is delayed and then reset the connection. We can filter the data to include only packets slower than 100ms, which gives us an easier way to reproduce the problem than a full network layer 7 test in Vegeta. Here are Kubernetes node "pings" using TCP SYN/SYN-ACK on the service "node port" (30927) at 10ms intervals, filtered to the slowest responses:

theojulienne@shell ~ $ sudo hping3 172.16.47.27 -S -p 30927 -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms

The first observation can be made immediately. Judging by the sequence numbers and timings, these are not one-off congestion events. The delay often accumulates and is eventually worked off.
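The pattern is easier to see once consecutive sequence numbers are grouped. The following is a minimal sketch, not part of the original investigation: it parses the hping3 lines shown above (the helper names `parse_hping` and `group_bursts` are made up for illustration) and groups replies whose sequence numbers are adjacent.

```python
import re

# Sample lines copied from the hping3 output above.
SAMPLE = """\
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms
"""

# "." does not match newlines, so each match stays within one output line.
LINE_RE = re.compile(r"seq=(\d+) .*rtt=([\d.]+) ms")


def parse_hping(text):
    """Return (seq, rtt_ms) pairs for every reply line in hping3 output."""
    return [(int(m.group(1)), float(m.group(2))) for m in LINE_RE.finditer(text)]


def group_bursts(samples):
    """Group samples whose sequence numbers are consecutive."""
    bursts = []
    for seq, rtt in samples:
        if bursts and seq == bursts[-1][-1][0] + 1:
            bursts[-1].append((seq, rtt))
        else:
            bursts.append([(seq, rtt)])
    return bursts


bursts = group_bursts(parse_hping(SAMPLE))
for burst in bursts:
    print([seq for seq, _ in burst], [rtt for _, rtt in burst])
```

Within the first burst the RTT falls by roughly 10ms per packet, which matches the 10ms send interval: the packets appear to have been held up together and released at about the same moment.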

Next, we want to find out which components may be involved in the congestion. Could it be some of the hundreds of iptables rules in NAT? Or is there a problem with IPIP tunneling on the network? One way to check is to test each step of the system by eliminating it. What happens if we strip out the NAT and firewall logic, leaving only the IPIP part?

Fortunately, Linux makes it easy to access the IP overlay layer directly if the machine is on the same network:

theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 ms

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 ms

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms

Judging by the results, the problem remains! That rules out iptables and NAT. So is TCP the problem? Let's see how a regular ICMP ping behaves:

theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'

len=28 ip=10.125.20.64 ttl=64 id=42594 icmp_seq=104 rtt=110.0 ms

len=28 ip=10.125.20.64 ttl=64 id=49448 icmp_seq=4022 rtt=141.3 ms

len=28 ip=10.125.20.64 ttl=64 id=49449 icmp_seq=4023 rtt=131.3 ms

len=28 ip=10.125.20.64 ttl=64 id=49450 icmp_seq=4024 rtt=121.2 ms

len=28 ip=10.125.20.64 ttl=64 id=49451 icmp_seq=4025 rtt=111.2 ms

len=28 ip=10.125.20.64 ttl=64 id=49452 icmp_seq=4026 rtt=101.1 ms

len=28 ip=10.125.20.64 ttl=64 id=50023 icmp_seq=4343 rtt=126.8 ms

len=28 ip=10.125.20.64 ttl=64 id=50024 icmp_seq=4344 rtt=116.8 ms

len=28 ip=10.125.20.64 ttl=64 id=50025 icmp_seq=4345 rtt=106.8 ms

len=28 ip=10.125.20.64 ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms

The results show that the problem has not gone away. Could it be the IPIP tunnel? Let's simplify the test further:

Is it every packet sent between these two hosts?

theojulienne@kube-node-client ~ $ sudo hping3 172.16.47.27 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'

len=46 ip=172.16.47.27 ttl=61 id=41127 icmp_seq=12564 rtt=140.9 ms

len=46 ip=172.16.47.27 ttl=61 id=41128 icmp_seq=12565 rtt=130.9 ms

len=46 ip=172.16.47.27 ttl=61 id=41129 icmp_seq=12566 rtt=120.8 ms

len=46 ip=172.16.47.27 ttl=61 id=41130 icmp_seq=12567 rtt=110.8 ms

len=46 ip=172.16.47.27 ttl=61 id=41131 icmp_seq=12568 rtt=100.7 ms

len=46 ip=172.16.47.27 ttl=61 id=9062 icmp_seq=31443 rtt=134.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9063 icmp_seq=31444 rtt=124.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9064 icmp_seq=31445 rtt=114.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms

We have simplified the situation to two Kubernetes nodes sending each other any packet at all, even an ICMP ping. They still see the latency if the target host is "bad" (some are worse than others).

Now the last question: why does the delay only occur on kube-node servers? And does it happen when the kube-node is the sender or the receiver? Fortunately, this is also fairly easy to work out by sending a packet from a host outside Kubernetes to the same "bad" receiver. As you can see, the problem has not disappeared:

theojulienne@shell ~ $ sudo hping3 172.16.47.27 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms

We then run the same requests from the previous source kube-node to the external host (which rules out the source host, since a ping involves both an RX and a TX component):

theojulienne@kube-node-client ~ $ sudo hping3 172.16.33.44 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
^C
--- 172.16.33.44 hping statistic ---
22352 packets transmitted, 22350 packets received, 1% packet loss
round-trip min/avg/max = 0.2/7.6/1010.6 ms

Examining packet captures of the latency gave us some additional information. Specifically, the sender (bottom) sees this delay, but the receiver (top) does not; see the Delta column (in seconds):

[Figure: packet captures on the receiver (top) and the sender (bottom), with a Delta column in seconds]

In addition, if you look at the difference in the ordering of TCP and ICMP packets (by sequence number) on the receiver side, ICMP packets always arrive in the same order in which they were sent, but with different timing. TCP packets, on the other hand, sometimes interleave, and some of them get stuck. In particular, if you examine the ports of the SYN packets, they are in order on the sender's side but not on the receiver's side.

There is a subtle difference in how the network cards of modern servers (like those in our data center) process packets containing TCP or ICMP. When a packet arrives, the network adapter "hashes it per connection", that is, it tries to split connections across queues and send each queue to a separate processor core. For TCP, this hash includes both the source and destination IP address and port. In other words, each connection is (potentially) hashed differently. For ICMP, only the IP addresses are hashed, since there are no ports.
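To make the consequence concrete, here is a small illustrative sketch of per-connection queue selection. It is a toy model under stated assumptions: real NICs typically use a Toeplitz hash configured by the driver, whereas this sketch substitutes `zlib.crc32`, and `NUM_QUEUES`, `rx_queue`, and the client port range are hypothetical.

```python
import zlib

NUM_QUEUES = 16  # hypothetical number of RX queues (one per core)


def rx_queue(proto, src_ip, dst_ip, src_port=None, dst_port=None):
    """Pick an RX queue from the fields that are hashed for this protocol."""
    if proto == "tcp":
        # TCP: addresses and ports all contribute to the hash.
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}"
    else:
        # ICMP has no ports, so only the two addresses are hashed.
        key = f"{src_ip}|{dst_ip}"
    # crc32 is a stand-in for the NIC's real hash function.
    return zlib.crc32(key.encode()) % NUM_QUEUES


# 200 TCP connections from different source ports between the same two hosts.
tcp_queues = {
    rx_queue("tcp", "172.16.33.44", "172.16.47.27", port, 30927)
    for port in range(40000, 40200)
}
# 200 ICMP echo requests between the same two hosts.
icmp_queues = {
    rx_queue("icmp", "172.16.33.44", "172.16.47.27") for _ in range(200)
}

print("TCP flows spread over", len(tcp_queues), "queues")
print("ICMP packets land on", len(icmp_queues), "queue")
```

In this model every ICMP packet between a given pair of hosts lands on the same queue, and therefore the same core, while TCP connections with different source ports are spread across several queues.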

Another new observation: during this period we see ICMP delays on all communication between the two hosts, but not on TCP. This tells us that the cause is probably related to RX queue hashing: the congestion is almost certainly in the processing of RX packets, not in the sending of responses.

This removes packet sending from the list of possible causes. We now know that the packet processing problem is on the receive side on some kube-node servers.

Understanding packet processing in the Linux kernel

To understand why the problem occurs at the receiver on some kube-node servers, let's look at how the Linux kernel processes packets.

Going back to the simplest traditional implementation, the network card receives a packet and sends an interrupt to the Linux kernel to say that there is a packet to process. The kernel stops other work, switches context to the interrupt handler, processes the packet, and then returns to its current tasks.

This context switching is slow: the latency may not have been noticeable on 10Mbps network cards in the '90s, but on modern 10G cards with a maximum throughput of 15 million packets per second, each core of a small eight-core server can be interrupted millions of times per second.

To avoid handling interrupts constantly, Linux added NAPI many years ago: a networking API that modern drivers use to improve performance at high speeds. At low rates the kernel still receives interrupts from the network card in the old way. Once enough packets arrive to exceed a threshold, the kernel disables interrupts and instead starts polling the network adapter and picking up packets in batches. Processing is done in softirq, that is, in the context of software interrupts that run after system calls and hardware interrupts, when the kernel (as opposed to user space) is already running.

This is much faster, but it causes a different problem. If there are too many packets, all the time is spent processing packets from the network card, and user-space processes have no time to actually drain these queues (reading from TCP connections, and so on). Eventually the queues fill up and we start dropping packets. In an attempt to find a balance, the kernel sets a budget for the maximum number of packets processed in the softirq context. Once this budget is exceeded, a separate thread, ksoftirqd, is woken up (you will see one of them per core in ps) that handles these softirqs outside the normal syscall/interrupt path. This thread is scheduled by the standard process scheduler, which tries to allocate resources fairly.

Having studied how the kernel processes packets, you can see that there is a certain potential for congestion. If softirq runs less often, packets have to wait for some time in the RX queue on the network card before they are processed. This may be because some task is blocking the processor core, or because something else is preventing the core from running softirq.
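A toy model shows what such a blocked core would look like from the outside. This is only a sketch under simplifying assumptions (packets arrive at a fixed interval, per-packet processing time is ignored, and the core is unavailable for one fixed window); the function name `simulate` and the chosen numbers are illustrative, not measurements.

```python
def simulate(interval_ms, n_packets, block_start_ms, block_len_ms):
    """Return the queueing delay (ms) of each packet arriving at a fixed interval.

    Packets that arrive while the core is blocked wait in the RX queue
    until the block ends; all other packets are handled immediately.
    """
    block_end = block_start_ms + block_len_ms
    delays = []
    for i in range(n_packets):
        arrival = i * interval_ms
        if block_start_ms <= arrival < block_end:
            delays.append(block_end - arrival)  # waits until softirq can run again
        else:
            delays.append(0)
    return delays


# One packet every 10ms, with the core blocked for 100ms starting at t=45ms.
delays = simulate(interval_ms=10, n_packets=20, block_start_ms=45, block_len_ms=100)
print(delays)
```

The delays inside the blocked window step down by the send interval (95, 85, 75, ...), which is the same staircase shape as the RTTs in the hping3 output: the first packet to arrive waits the longest, and every queued packet is released at the same moment.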

Narrowing the processing down to a core or method

Softirq delays are only a guess for now. But it is a plausible one, and we know that we are seeing something very similar. So the next step is to confirm this theory and, if it is confirmed, to find the reason for the delays.

Let's return to our slow packets:

len=46 ip=172.16.53.32 ttl=61 id=29573 icmp_seq=1953 rtt=99.3 ms

len=46 ip=172.16.53.32 ttl=61 id=29574 icmp_seq=1954 rtt=89.3 ms

len=46 ip=172.16.53.32 ttl=61 id=29575 icmp_seq=1955 rtt=79.2 ms

len=46 ip=172.16.53.32 ttl=61 id=29576 icmp_seq=1956 rtt=69.1 ms

len=46 ip=172.16.53.32 ttl=61 id=29577 icmp_seq=1957 rtt=59.1 ms

len=46 ip=172.16.53.32 ttl=61 id=29790 icmp_seq=2070 rtt=75.7 ms

len=46 ip=172.16.53.32 ttl=61 id=29791 icmp_seq=2071 rtt=65.6 ms

len=46 ip=172.16.53.32 ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms

As discussed earlier, these ICMP packets are hashed into a single NIC RX queue and processed by a single CPU core. To understand what Linux is doing, it helps to know where (on which CPU core) and how (softirq, ksoftirqd) these packets are processed, so that we can follow the process.

Now it is time to use tools that let you observe the Linux kernel in real time. Here we used bcc. This toolkit lets you write small C programs that hook arbitrary functions in the kernel and buffer the events into a user-space Python program that can process them and return the result to you. Hooking arbitrary functions in the kernel is a tricky business, but the toolkit is designed for maximum safety and is meant for tracking down exactly the kind of production issues that are not easy to reproduce in a test or development environment.

The plan here is simple: we know the kernel processes these ICMP pings, so we put a hook on the kernel function icmp_echo, which accepts an incoming ICMP echo request packet and initiates sending an ICMP echo response. We can identify a packet by its increasing icmp_seq number, which hping3 shows above.

The bcc script code looks complicated, but it is not as scary as it seems. The icmp_echo function is passed struct sk_buff *skb: this is the packet containing the "echo request". We can follow it, pull out the echo.sequence field (which corresponds to the icmp_seq from hping3 above), and send it to user space. It is also convenient to capture the current process name and id. Below are the results we see live while the kernel processes packets:

TGID    PID     PROCESS NAME     ICMP_SEQ
0       0       swapper/11       770
0       0       swapper/11       771
0       0       swapper/11       772
0       0       swapper/11       773
0       0       swapper/11       774
20041   20086   prometheus       775
0       0       swapper/11       776
...     ...     spokes-report-s  ...

Note that in the softirq context, processes that made system calls appear as the "process" even though it is really the kernel that is safely processing packets in kernel context.

With this tool we can associate specific processes with the specific packets that show a delay in hping3. Let's run a simple grep on this capture for certain icmp_seq values. Packets matching the icmp_seq values above are marked with the RTT we observed above (in parentheses are the expected RTT values for packets we filtered out because their RTT was below 50ms):

TGID    PID     PROCESS NAME    ICMP_SEQ ** RTT
--
10137   10436   cadvisor        1951
10137   10436   cadvisor        1952
76      76      ksoftirqd/11    1953 ** 99ms
76      76      ksoftirqd/11    1954 ** 89ms
76      76      ksoftirqd/11    1955 ** 79ms
76      76      ksoftirqd/11    1956 ** 69ms
76      76      ksoftirqd/11    1957 ** 59ms
76      76      ksoftirqd/11    1958 ** (49ms)
76      76      ksoftirqd/11    1959 ** (39ms)
76      76      ksoftirqd/11    1960 ** (29ms)
76      76      ksoftirqd/11    1961 ** (19ms)
76      76      ksoftirqd/11    1962 ** (9ms)
--
10137   10436   cadvisor        2068
10137   10436   cadvisor        2069
76      76      ksoftirqd/11    2070 ** 75ms
76      76      ksoftirqd/11    2071 ** 65ms
76      76      ksoftirqd/11    2072 ** 55ms
76      76      ksoftirqd/11    2073 ** (45ms)
76      76      ksoftirqd/11    2074 ** (35ms)
76      76      ksoftirqd/11    2075 ** (25ms)
76      76      ksoftirqd/11    2076 ** (15ms)
76      76      ksoftirqd/11    2077 ** (5ms)

The results tell us several things. First, all these packets are processed in the context of ksoftirqd/11. This means that for this particular pair of machines, ICMP packets are hashed to core 11 on the receiving end. We also see that whenever there is a stall, there are packets processed in the context of a cadvisor system call. Then ksoftirqd takes over and works through the accumulated queue: exactly the number of packets that piled up after cadvisor.
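The "which process ran just before ksoftirqd took over" check can also be done mechanically. The sketch below is illustrative only: `TRACE` is an abbreviated copy of the rows above (TGID, PID, process name, ICMP_SEQ, with the process name written as cadvisor), and `processes_before_ksoftirqd` is a hypothetical helper, not part of bcc.

```python
# Abbreviated rows from the trace above: TGID, PID, process name, ICMP_SEQ.
TRACE = """\
10137 10436 cadvisor 1951
10137 10436 cadvisor 1952
76 76 ksoftirqd/11 1953
76 76 ksoftirqd/11 1954
76 76 ksoftirqd/11 1955
10137 10436 cadvisor 2068
10137 10436 cadvisor 2069
76 76 ksoftirqd/11 2070
76 76 ksoftirqd/11 2071
"""


def processes_before_ksoftirqd(trace):
    """Return the process name seen immediately before each ksoftirqd burst."""
    rows = [line.split() for line in trace.splitlines() if line.strip()]
    names = [row[2] for row in rows]
    before = []
    for prev, cur in zip(names, names[1:]):
        # A burst starts where ksoftirqd follows a different process.
        if cur.startswith("ksoftirqd") and not prev.startswith("ksoftirqd"):
            before.append(prev)
    return before


print(processes_before_ksoftirqd(TRACE))
```

On these rows both bursts are preceded by the same process, which is the pattern the next paragraphs investigate.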

The fact that cadvisor is always running immediately beforehand implies that it is involved in the problem. Ironically, the purpose of cadvisor is to "analyze resource usage and performance characteristics of running containers", not to cause this performance problem.

As with other aspects of containers, these are all highly advanced tools, and it can be expected that they will run into performance problems under some unforeseen circumstances.

What does cadvisor do that slows down the packet queue?

We now have a fairly good understanding of how the stall happens, which process causes it, and on which CPU. We see that because of hard blocking, the Linux kernel does not get a chance to schedule ksoftirqd. And we see packets being processed in the context of cadvisor. It is logical to assume that cadvisor starts a slow syscall, after which all the packets that accumulated in the meantime are processed.

That is a theory, but how do we test it? What we can do is trace the CPU core throughout this process, find the point where the number of packets goes over budget and ksoftirqd is called, and then look a little further back to see what exactly was running on that CPU core just before that point. It is like taking an x-ray of the CPU every few milliseconds.

Conveniently, all of this can be done with existing tools. For example, perf record samples a given CPU core at a specified frequency and can produce a record of the call stacks of the running system, including both user space and the Linux kernel. You can take this record and process it with a small fork of Brendan Gregg's FlameGraph program that preserves the order of the stack traces. We can save single-line stack traces every 1 ms, and then pick out and keep the 100 milliseconds of samples before ksoftirqd appears in the trace:

# record 999 times a second, or every 1ms with some offset so not to align exactly with timers
sudo perf record -C 11 -g -F 999
# take that recording and make a simpler stack trace.
sudo perf script 2>/dev/null | ./FlameGraph/stackcollapse-perf-ordered.pl | grep ksoftir -B 100

Here are the results:

(hundreds of traces that look similar)

cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll;ixgbe_clean_rx_irq;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;bond_handle_frame;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;ipip_tunnel_xmit;ip_tunnel_xmit;iptunnel_xmit;ip_local_out;dst_output;__ip_local_out;nf_hook_slow;nf_iterate;nf_conntrack_in;generic_packet;ipt_do_table;set_match_v4;ip_set_test;hash_net4_kadt;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;hash_net4_test
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;gro_cell_poll;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;dev_queue_xmit_nit;packet_rcv;tpacket_rcv;sch_direct_xmit;validate_xmit_skb_list;validate_xmit_skb;netif_skb_features;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;__dev_queue_xmit;dev_hard_start_xmit;__bpf_prog_run;__bpf_prog_run

There is a lot here, but the main point is that we find the "cadvisor before ksoftirqd" pattern that we saw earlier in the ICMP tracer. What does it mean?

Each line is a trace of the CPU at a specific point in time. Each call down the stack is separated on the line by a semicolon. In the middle of the lines we can see the syscall being made, read(): .... ;do_syscall_64;sys_read; .... So cadvisor spends a lot of time in the read() system call, in the mem_cgroup_* functions (top of the call stack, end of the line).
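Reading these lines can be automated in a few lines of code. The following is a rough sketch rather than part of the FlameGraph tooling: `STACK` is a shortened copy of one cadvisor line from above (the repeated [cadvisor] frames are collapsed to one), and `summarize` is a hypothetical helper that assumes syscall frames start with `sys_`, as they do in these traces.

```python
# Shortened copy of one collapsed stack line from the output above.
STACK = (
    "cadvisor;[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;"
    "sys_read;vfs_read;seq_read;memcg_stat_show;"
    "mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages"
)


def summarize(stack_line):
    """Split a collapsed stack into process, syscall frame, and leaf function."""
    frames = stack_line.split(";")
    # First frame that looks like a syscall entry, or None for kernel threads.
    syscall = next((f for f in frames if f.startswith("sys_")), None)
    return {"process": frames[0], "syscall": syscall, "leaf": frames[-1]}


print(summarize(STACK))
```

The first frame is the process, the `sys_` frame is the system call, and the last frame is where the CPU was actually spending time when the sample was taken.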

A call trace is an inconvenient way to see what exactly is being read, so let's run strace, see what cadvisor is doing, and find the system calls that take longer than 100ms:

theojulienne@kube-node-bad ~ $ sudo strace -p 10137 -T -ff 2>&1 | egrep '<0.[1-9]'
[pid 10436] <... futex resumed> ) = 0 <0.156784>
[pid 10432] <... futex resumed> ) = 0 <0.258285>
[pid 10137] <... futex resumed> ) = 0 <0.678382>
[pid 10384] <... futex resumed> ) = 0 <0.762328>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 658 <0.179438>
[pid 10384] <... futex resumed> ) = 0 <0.104614>
[pid 10436] <... futex resumed> ) = 0 <0.175936>
[pid 10436] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.228091>
[pid 10427] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.207334>
[pid 10411] <... epoll_ctl resumed> ) = 0 <0.118113>
[pid 10382] <... pselect6 resumed> ) = 0 (Timeout) <0.117717>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 660 <0.159891>
[pid 10417] <... futex resumed> ) = 0 <0.917495>
[pid 10436] <... futex resumed> ) = 0 <0.208172>
[pid 10417] <... futex resumed> ) = 0 <0.190763>
[pid 10417] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 576 <0.154442>

As you might expect, we see slow read() calls here. From the contents of the reads and the mem_cgroup context, it is clear that these read() calls refer to the memory.stat file, which shows memory usage and cgroup limits (cgroups being Docker's resource isolation technology). The cadvisor tool queries this file to obtain resource usage information for containers. Let's check whether it is the kernel or cadvisor that is doing something unexpected:

theojulienne@kube-node-bad ~ $ time cat /sys/fs/cgroup/memory/memory.stat >/dev/null

real 0m0.153s
user 0m0.000s
sys 0m0.152s
theojulienne@kube-node-bad ~ $

Now we can reproduce the bug, and we understand that the Linux kernel is hitting a pathological case.
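The same measurement can be repeated programmatically, for example to compare nodes. This is a minimal sketch under assumptions: `time_read` and `is_slow` are hypothetical helpers, the 0.1 second threshold simply mirrors the 100ms figure used throughout this article, and `os.devnull` is used only so the sketch runs anywhere.

```python
import os
import time


def time_read(path, repeats=5):
    """Return the wall-clock duration in seconds of each full read of `path`."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as f:
            f.read()
        durations.append(time.perf_counter() - start)
    return durations


def is_slow(durations, threshold_s=0.1):
    """Flag a node as suspicious if any single read exceeded the threshold."""
    return max(durations) > threshold_s


# os.devnull is only a stand-in; on an affected node the path of interest
# would be /sys/fs/cgroup/memory/memory.stat, as in the command above.
sample = time_read(os.devnull, repeats=3)
print(sample, is_slow(sample))
```

A read that takes about 0.15 seconds, like the one timed above, would be flagged by `is_slow`, while a healthy node would be expected to stay well under the threshold.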

Why is the read operation so slow?

At this stage it is much easier to find reports from other users about similar problems. As it turned out, this bug had been reported in the cadvisor tracker as a problem of excessive CPU usage; it was just that nobody had noticed that the latency also shows up, seemingly at random, in the network stack. It had indeed been noticed that cadvisor was consuming more CPU time than expected, but this was not considered important because our servers have plenty of CPU resources, so the problem was never studied carefully.

The problem is that cgroups account for memory usage inside the namespace (container). When all processes in a cgroup exit, Docker releases the memory cgroup. However, "memory" is not just process memory. Although the process memory itself is no longer in use, it turns out that the kernel also charges cached content, such as dentries and inodes (directory and file metadata), to the memory cgroup. From the problem description:

zombie cgroups: cgroups that have no processes and have been deleted, but that still have memory charged to them (in my case, from the dentry cache, but it can also be charged from the page cache or tmpfs).

Having the kernel check every page in the cache when a cgroup is released could be very slow, so a lazy approach was chosen: wait until these pages are reclaimed, and only then finally clean up the cgroup once the memory is actually needed. Until that point, the cgroup is still included when statistics are collected.

From a performance standpoint, this trades memory for speed: the initial cleanup is faster because some cached memory is left behind. That is fine. When the kernel reclaims the last of the cached memory, the cgroup is eventually cleaned up, so it cannot be called a "leak". Unfortunately, the specific implementation of the memory.stat lookup in this kernel version (4.9), combined with the huge amount of memory on our servers, means that it takes much longer for the last of the cached data to be reclaimed and for the zombie cgroups to be cleaned up.

It turned out that some of our nodes had so many zombie cgroups that the read, and with it the latency, exceeded a second.

The workaround for the cadvisor problem is to immediately free the dentries/inodes caches system-wide, which immediately eliminates the read latency as well as the network latency on the host, since clearing the cache reclaims the cached pages of the zombie cgroups and they are freed as well. This is not a solution, but it confirms the cause of the problem.
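The source does not show the exact command used for this workaround. One standard way to free dentries and inodes on Linux is to write `2` to `/proc/sys/vm/drop_caches`, which requires root. The sketch below assumes that mechanism; `drop_dentry_inode_caches` is a hypothetical helper, and the demonstration deliberately writes to a temporary file so that running it changes nothing.

```python
import os
import tempfile

# Standard Linux sysctl file; writing "2" frees reclaimable slab objects,
# which include dentries and inodes. Assumed here, not shown in the source.
DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_dentry_inode_caches(path=DROP_CACHES):
    """Ask the kernel to free dentries and inodes (needs root on a real node)."""
    with open(path, "w") as f:
        f.write("2\n")


# Dry run against a temporary file so that running the sketch changes nothing.
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "drop_caches")
    drop_dentry_inode_caches(fake)
    with open(fake) as f:
        written = f.read()

print(repr(written))
```

On a real node the function would be called without arguments as root. Dropping caches costs performance until they are repopulated, which is one more reason to treat this as a diagnostic step rather than a fix.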

It turned out that in newer kernel versions (4.19+) the performance of the memory.stat call has been improved, so moving to such a kernel fixed the problem. In the meantime, we had tooling to detect problematic nodes in Kubernetes clusters, drain them gracefully, and reboot them. We combed through all the clusters, found the nodes with sufficiently high latency, and rebooted them. This gave us time to update the OS on the remaining servers.

Summing up

Because this bug stalled NIC RX queue processing for hundreds of milliseconds, it caused both high latency on short connections and latency in the middle of connections, such as between MySQL requests and response packets.

Understanding and maintaining the performance of the most fundamental systems, such as Kubernetes, is critical to the reliability and speed of all the services built on them. Every system you run benefits from Kubernetes performance improvements.

Source: www.habr.com
