A couple of years ago, Kubernetes
Essentially, applications see seemingly random network latency of 100ms or more, which results in timeouts or retries. Services were expected to respond to requests much faster than 100ms, but that is impossible if the connection itself takes that long. Separately, we saw very fast MySQL queries that should take milliseconds, and MySQL did complete them in milliseconds, but from the point of view of the requesting application the response took 100ms or more.
The problem was quickly narrowed down to connections involving a Kubernetes node, even when the call came from outside Kubernetes. The simplest way to reproduce the problem is in a test
Removing unnecessary complexity from the chain leading to the failure
By reproducing the same example, we wanted to narrow the focus of the problem and strip away unnecessary layers of complexity. Initially, there were too many elements in the flow between Vegeta and the Kubernetes pods. To identify a deeper network problem, you need to rule some of them out.
The client (Vegeta) creates a TCP connection to any node in the cluster. Kubernetes operates as an overlay network (on top of the existing data center network) that uses
In the Vegeta test, tcpdump shows a delay during the TCP handshake (between SYN and SYN-ACK). To remove this unnecessary complexity, you can use hping3 for simple "pings" with SYN packets. We check whether the response packet is delayed and then reset the connection. We can filter the data to include only packets slower than 100ms, which gives a simpler way to reproduce the problem than the full network layer 7 test in Vegeta. Here are the Kubernetes node "pings" using TCP SYN/SYN-ACK on the service "node port" (30927) at 10ms intervals, filtered to the slowest responses:
theojulienne@shell ~ $ sudo hping3 172.16.47.27 -S -p 30927 -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms
Some observations can be made right away. Judging by the sequence numbers and timings, these are clearly not one-off congestion events. The delay often accumulates and is eventually processed.
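To make the "accumulates and is eventually processed" pattern easier to see, the slow replies can be grouped by consecutive sequence numbers. The following is a minimal sketch, not part of the original investigation: the function names are hypothetical, and the sample lines are copied from the filtered hping3 output above.

```python
import re

# Matches the seq= (TCP) or icmp_seq= (ICMP) field and the rtt= field of one hping3 reply line.
LINE_RE = re.compile(r"\b(?:icmp_)?seq=(\d+).*?\brtt=([\d.]+) ms")


def parse_replies(text):
    """Return (sequence_number, rtt_ms) pairs for every hping3 reply line in text."""
    return [(int(m.group(1)), float(m.group(2))) for m in LINE_RE.finditer(text)]


def group_bursts(replies):
    """Group replies whose sequence numbers are consecutive into bursts."""
    bursts = []
    for seq, rtt in replies:
        if bursts and seq == bursts[-1][-1][0] + 1:
            bursts[-1].append((seq, rtt))
        else:
            bursts.append([(seq, rtt)])
    return bursts


# Sample lines copied from the filtered output above.
SAMPLE = """\
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms
"""

for burst in group_bursts(parse_replies(SAMPLE)):
    seqs = [seq for seq, _ in burst]
    rtts = [rtt for _, rtt in burst]
    print(f"seq {seqs[0]}-{seqs[-1]}: {len(burst)} slow replies, rtt {rtts[0]} -> {rtts[-1]} ms")
```

Within a burst the RTT falls from one packet to the next, which is what one would expect if several packets sent 10ms apart were held up and then released at roughly the same moment.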
Next, we want to find out which components may be involved in the congestion. Could it be one of the hundreds of iptables rules in NAT? Or is there a problem with IPIP tunneling on the network? One way to check is to test each step of the system by eliminating it. What happens if you remove the NAT and firewall logic, leaving only the IPIP part:
Fortunately, Linux makes it easy to access the IP overlay layer directly if the machine is on the same network:
theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms
Judging by the results, the problem still remains! That rules out iptables and NAT. So is TCP the problem? Let's see how a regular ICMP ping behaves:
theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=28 ip=10.125.20.64 ttl=64 id=42594 icmp_seq=104 rtt=110.0 ms
len=28 ip=10.125.20.64 ttl=64 id=49448 icmp_seq=4022 rtt=141.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49449 icmp_seq=4023 rtt=131.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49450 icmp_seq=4024 rtt=121.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49451 icmp_seq=4025 rtt=111.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49452 icmp_seq=4026 rtt=101.1 ms
len=28 ip=10.125.20.64 ttl=64 id=50023 icmp_seq=4343 rtt=126.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50024 icmp_seq=4344 rtt=116.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50025 icmp_seq=4345 rtt=106.8 ms
len=28 ip=10.125.20.64 ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms
The results show that the problem has not gone away. Perhaps it is the IPIP tunnel? Let's simplify the test further:
Is it every packet sent between these two hosts?
theojulienne@kube-node-client ~ $ sudo hping3 172.16.47.27 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=46 ip=172.16.47.27 ttl=61 id=41127 icmp_seq=12564 rtt=140.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41128 icmp_seq=12565 rtt=130.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41129 icmp_seq=12566 rtt=120.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41130 icmp_seq=12567 rtt=110.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41131 icmp_seq=12568 rtt=100.7 ms
len=46 ip=172.16.47.27 ttl=61 id=9062 icmp_seq=31443 rtt=134.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9063 icmp_seq=31444 rtt=124.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9064 icmp_seq=31445 rtt=114.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms
We have simplified the situation to two Kubernetes nodes sending each other any packet, even an ICMP ping. They still see the latency if the target host is "bad" (some are worse than others).
Now the last question: why does the delay only occur on kube-node servers? And does it happen when the kube-node is the sender or the receiver? Fortunately, this is also quite easy to figure out by sending a packet from a host outside Kubernetes, but with the same "bad" receiver. As you can see, the problem has not disappeared:
theojulienne@shell ~ $ sudo hping3 172.16.47.27 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms
Then we run the same requests from the previous source kube-node to the external host (this rules out the source host, because a ping includes both an RX and a TX component):
theojulienne@kube-node-client ~ $ sudo hping3 172.16.33.44 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
^C
--- 172.16.33.44 hping statistic ---
22352 packets transmitted, 22350 packets received, 1% packet loss
round-trip min/avg/max = 0.2/7.6/1010.6 ms
Examining packet captures of the latency gave us some additional information. Specifically, the sender (bottom) sees this timeout, but the receiver (top) does not; see the Delta column (in seconds):
In addition, if you look at the difference in the ordering of TCP and ICMP packets (by sequence number) on the receiver side, ICMP packets always arrive in the same order in which they were sent, but with different timing. TCP packets, by contrast, sometimes interleave, and some of them get stuck. In particular, if you examine the ports of the SYN packets, they are in order on the sender side but not on the receiver side.
There is a subtle difference in how
Another new observation: during this period we see ICMP delays on all communication between the two hosts, but not on TCP. This tells us that the cause is probably related to RX queue hashing: the congestion is almost certainly in the processing of RX packets, not in the sending of responses.
This removes packet sending from the list of possible causes. We now know that the packet-processing problem is on the receive side on some kube-node servers.
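The effect of RX queue hashing can be illustrated with a toy model. This is a sketch under stated assumptions, not the NIC's real algorithm: it assumes the card hashes source and destination IP plus ports for TCP, and only the IP pair for ICMP (which has no ports). CRC32 stands in for the hardware hash, and the queue count and source address are hypothetical.

```python
import zlib

NUM_QUEUES = 16  # hypothetical RX queue count; real NICs vary


def rx_queue(src_ip, dst_ip, src_port=None, dst_port=None):
    """Pick an RX queue from the flow key. CRC32 is a stand-in for the NIC's real hash."""
    key = f"{src_ip}>{dst_ip}"
    if src_port is not None:
        # TCP: the ports are part of the flow key (assumption about the NIC's configuration).
        key += f":{src_port}>{dst_port}"
    return zlib.crc32(key.encode()) % NUM_QUEUES


# ICMP pings between one pair of hosts: the key never changes, so the queue never changes.
icmp_queues = {rx_queue("172.16.33.44", "172.16.47.27") for _ in range(1000)}

# TCP SYNs with a different source port per packet: the key changes every time.
tcp_queues = {
    rx_queue("172.16.33.44", "172.16.47.27", port, 9876) for port in range(1024, 2024)
}

print("queues used by ICMP:", len(icmp_queues))
print("queues used by TCP: ", len(tcp_queues))
```

Under these assumptions, every ICMP packet between two hosts lands on one queue, and therefore on one CPU core, so a stall on that core delays all of them. TCP probes with varying source ports are spread over several queues, so only some are delayed and they can arrive out of order.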
Understanding packet processing in the Linux kernel
To understand why the problem occurs at the receiver on some kube-node servers, let's look at how the Linux kernel processes packets.
Going back to the simplest traditional implementation, the network card receives a packet and sends
Context switching is slow: the latency may not have been noticeable on 10Mbps network cards in the '90s, but on modern 10G cards with a maximum throughput of 15 million packets per second, each core of a small eight-core server can be interrupted millions of times per second.
To avoid constantly handling interrupts, many years ago Linux added
This is much faster, but it causes a different problem. If there are too many packets, all the time is spent processing packets from the network card, and user-space processes have no time to actually drain these queues (reading from TCP connections, etc.). Eventually the queues fill up and we start dropping packets. In an attempt to find a balance, the kernel sets a budget for the maximum number of packets processed in the softirq context. Once this budget is exceeded, a separate thread, ksoftirqd, is woken up (you will see one of them per core in ps) to handle these softirqs outside of the normal syscall/interrupt path. This thread is scheduled by the standard process scheduler, which tries to allocate resources fairly.
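One place where budget exhaustion becomes visible on a Linux host is /proc/net/softnet_stat. This is an addition for illustration and was not used in the investigation described here. The sketch below assumes the common layout: one row per CPU, hexadecimal columns, with the first three being packets processed, packets dropped, and "time squeeze" (how often packet processing stopped with work still pending because the budget or time ran out). The sample values are made up.

```python
def parse_softnet_stat(text):
    """Parse the contents of /proc/net/softnet_stat into one dict per row.

    Assumes the first three hexadecimal columns are processed, dropped
    and time_squeeze; later columns differ between kernel versions and are ignored.
    """
    stats = []
    for row, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        stats.append(
            {
                "row": row,  # normally corresponds to the CPU number
                "processed": int(fields[0], 16),
                "dropped": int(fields[1], 16),
                "time_squeeze": int(fields[2], 16),
            }
        )
    return stats


# Hypothetical sample contents, for illustration only.
SAMPLE = """\
0004c3a1 00000000 0000001f 00000000 00000000
0001b2c4 00000002 00000000 00000000 00000000
"""

for entry in parse_softnet_stat(SAMPLE):
    print(entry)
```

On a real host the input would come from reading /proc/net/softnet_stat. A time_squeeze counter that keeps growing on one core hints that softirq processing on that core regularly runs out of budget.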
Having studied how the kernel processes packets, you can see that there is a real possibility of congestion. If softirq calls run less often, packets have to wait some time in the RX queue on the network card before being processed. This may be due to some task blocking the processor core, or to something else preventing the core from running softirq.
Narrowing the processing down to a core or method
Softirq delays are only a guess for now. But it makes sense, and we know that we are seeing something very similar. So the next step is to confirm this theory and, if it is confirmed, to find the cause of the delays.
Let's return to our slow packets:
len=46 ip=172.16.53.32 ttl=61 id=29573 icmp_seq=1953 rtt=99.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29574 icmp_seq=1954 rtt=89.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29575 icmp_seq=1955 rtt=79.2 ms
len=46 ip=172.16.53.32 ttl=61 id=29576 icmp_seq=1956 rtt=69.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29577 icmp_seq=1957 rtt=59.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29790 icmp_seq=2070 rtt=75.7 ms
len=46 ip=172.16.53.32 ttl=61 id=29791 icmp_seq=2071 rtt=65.6 ms
len=46 ip=172.16.53.32 ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms
As discussed earlier, these ICMP packets are hashed into a single NIC RX queue and processed by a single CPU core. If we want to understand what Linux is doing, it is useful to know where (on which CPU core) and how (softirq, ksoftirqd) these packets are processed, so that we can trace the process.
Now it is time to use tools that let you monitor the Linux kernel in real time. Here we used
The plan here is simple: we know that the kernel processes these ICMP pings, so we put a hook on the kernel function that handles them and match the packets against the hping3 output above.
The function icmp_echo is passed a struct sk_buff *skb: this is the packet containing the "echo request". We can trace it, pull out the echo.sequence number (which corresponds to the icmp_seq from hping3 above), and send it to user space. It is also convenient to capture the current process name and ID. Below are the results we see directly while the kernel processes packets:
TGID    PID     PROCESS NAME    ICMP_SEQ
0       0       swapper/11      770
0       0       swapper/11      771
0       0       swapper/11      772
0       0       swapper/11      773
0       0       swapper/11      774
20041   20086   prometheus      775
0       0       swapper/11      776
0       0       swapper/11
                spokes-report-s
It should be noted here that in the softirq context, processes that made system calls show up as the "process", when in fact it is the kernel that is safely processing the packets in kernel context.
With this tool we can associate specific processes with the specific packets that show a delay in hping3. Let's do a simple grep on this capture for certain icmp_seq values. Packets matching the icmp_seq values above are marked with the RTT we observed above (in parentheses are the expected RTT values for packets that we filtered out because their RTT was under 50ms):
TGID    PID     NAME            ICMP_SEQ ** RTT
--
10137   10436   cadvisor        1951
10137   10436   cadvisor        1952
76      76      ksoftirqd/11    1953 ** 99ms
76      76      ksoftirqd/11    1954 ** 89ms
76      76      ksoftirqd/11    1955 ** 79ms
76      76      ksoftirqd/11    1956 ** 69ms
76      76      ksoftirqd/11    1957 ** 59ms
76      76      ksoftirqd/11    1958 ** (49ms)
76      76      ksoftirqd/11    1959 ** (39ms)
76      76      ksoftirqd/11    1960 ** (29ms)
76      76      ksoftirqd/11    1961 ** (19ms)
76      76      ksoftirqd/11    1962 ** (9ms)
--
10137   10436   cadvisor        2068
10137   10436   cadvisor        2069
76      76      ksoftirqd/11    2070 ** 75ms
76      76      ksoftirqd/11    2071 ** 65ms
76      76      ksoftirqd/11    2072 ** 55ms
76      76      ksoftirqd/11    2073 ** (45ms)
76      76      ksoftirqd/11    2074 ** (35ms)
76      76      ksoftirqd/11    2075 ** (25ms)
76      76      ksoftirqd/11    2076 ** (15ms)
76      76      ksoftirqd/11    2077 ** (5ms)
The results tell us several things. First, all of these packets are processed in the ksoftirqd/11 context. This means that for this particular pair of machines, the ICMP packets were hashed to core 11 on the receiving end. We also see that every time there is a stall, there are packets processed in the context of a cadvisor system call. Then ksoftirqd takes over and works through the accumulated queue: exactly the number of packets that piled up after cadvisor.
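The same reading of the trace can be done mechanically. The sketch below is an illustration with hypothetical function names: it takes (process name, icmp_seq) rows matching the trace excerpt above and reports, for each run of packets handled by ksoftirqd, which process context came immediately before it and how many packets the run contained.

```python
from itertools import groupby

# (process name, icmp_seq) rows matching the trace excerpt above.
TRACE = (
    [("cadvisor", 1951), ("cadvisor", 1952)]
    + [("ksoftirqd/11", seq) for seq in range(1953, 1963)]
    + [("cadvisor", 2068), ("cadvisor", 2069)]
    + [("ksoftirqd/11", seq) for seq in range(2070, 2078)]
)


def ksoftirqd_bursts(trace):
    """Return (preceding process, first seq, packet count) for each ksoftirqd run."""
    # Collapse consecutive rows with the same process name into runs.
    runs = [
        (name, [seq for _, seq in rows])
        for name, rows in groupby(trace, key=lambda row: row[0])
    ]
    bursts = []
    for previous, current in zip(runs, runs[1:]):
        if current[0].startswith("ksoftirqd"):
            bursts.append((previous[0], current[1][0], len(current[1])))
    return bursts


for preceding, first_seq, count in ksoftirqd_bursts(TRACE):
    print(f"{count} packets from icmp_seq {first_seq} drained by ksoftirqd after {preceding}")
```

With pings sent every 10ms, a run of ten queued packets corresponds to a stall of roughly 100ms, which is consistent with the 99ms RTT of the first delayed packet.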
The fact that cadvisor is always running immediately beforehand implies that it is involved in the problem. Ironically, the purpose
As with other aspects of containers, these are all highly advanced tools, and they can be expected to run into performance problems under some unforeseen circumstances.
What does cadvisor do to slow down the packet queue?
We now have a pretty good understanding of how the stall occurs, which process causes it, and on which CPU. We see that because of hard blocking, the Linux kernel does not have time to schedule ksoftirqd. And we see that packets are processed in the cadvisor context. It is logical to assume that cadvisor starts a slow syscall, after which all the packets accumulated during that time are processed:
This is a theory, but how do we test it? What we can do is trace the CPU core throughout this process, find the point where the number of packets goes over budget and ksoftirqd is called, and then look a little further back to see what exactly was running on the CPU core just before that point. It is like x-raying the CPU every few milliseconds. It would look something like this:
Conveniently, all of this can be done with existing tools. For example, the commands below sample the core and keep the stack traces leading up to ksoftirqd:
# record 999 times a second, or every 1ms with some offset so not to align exactly with timers
sudo perf record -C 11 -g -F 999
# take that recording and make a simpler stack trace.
sudo perf script 2>/dev/null | ./FlameGraph/stackcollapse-perf-ordered.pl | grep ksoftir -B 100
Here are the results:
(hundreds of traces that look similar)
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll;ixgbe_clean_rx_irq;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;bond_handle_frame;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;ipip_tunnel_xmit;ip_tunnel_xmit;iptunnel_xmit;ip_local_out;dst_output;__ip_local_out;nf_hook_slow;nf_iterate;nf_conntrack_in;generic_packet;ipt_do_table;set_match_v4;ip_set_test;hash_net4_kadt;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;hash_net4_test
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;gro_cell_poll;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;dev_queue_xmit_nit;packet_rcv;tpacket_rcv;sch_direct_xmit;validate_xmit_skb_list;validate_xmit_skb;netif_skb_features;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;__dev_queue_xmit;dev_hard_start_xmit;__bpf_prog_run;__bpf_prog_run
There is a lot here, but the main thing is that we find the "cadvisor before ksoftirqd" pattern that we saw earlier in the ICMP tracer. What does it mean?
Each line is a trace of the CPU at a specific point in time. Each call down the stack is separated by a semicolon within the line. In the middle of the lines we see the syscall being called, read(): .... ;do_syscall_64;sys_read; ... So cadvisor spends a lot of time in the read() system call, in the mem_cgroup_* functions (top of the call stack, end of the line).
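When there are hundreds of such lines, counting the innermost frame per process gives a quick summary of where the core spends its time. This is a minimal sketch with a hypothetical function name. The sample stacks are shortened versions of the traces above: the repeated [cadvisor] frames and the middle of the ksoftirqd stack are omitted.

```python
from collections import Counter


def leaf_counts(collapsed_lines):
    """Count (process, innermost frame) pairs in semicolon-separated stack lines."""
    counts = Counter()
    for line in collapsed_lines:
        frames = line.strip().split(";")
        if len(frames) < 2:
            continue  # not a collapsed stack line
        # The first field is the process name, the last is the top of the call stack.
        counts[(frames[0], frames[-1])] += 1
    return counts


# Shortened sample stacks for illustration.
SAMPLE = [
    "cadvisor;[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages",
    "cadvisor;[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages",
    "cadvisor;[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter",
    "ksoftirqd/11;ret_from_fork;kthread;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll",
]

for (process, leaf), count in leaf_counts(SAMPLE).most_common():
    print(f"{count:3d}  {process}  {leaf}")
```

Feeding the full perf script output through such a counter would show how strongly the mem_cgroup_* functions dominate the samples taken just before ksoftirqd appears.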
A call trace does not make it easy to see what exactly is being read, so let's run strace to see what cadvisor is doing and find system calls that take longer than 100ms:
theojulienne@kube-node-bad ~ $ sudo strace -p 10137 -T -ff 2>&1 | egrep '<0.[1-9]'
[pid 10436] <... futex resumed> ) = 0 <0.156784>
[pid 10432] <... futex resumed> ) = 0 <0.258285>
[pid 10137] <... futex resumed> ) = 0 <0.678382>
[pid 10384] <... futex resumed> ) = 0 <0.762328>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 658 <0.179438>
[pid 10384] <... futex resumed> ) = 0 <0.104614>
[pid 10436] <... futex resumed> ) = 0 <0.175936>
[pid 10436] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.228091>
[pid 10427] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.207334>
[pid 10411] <... epoll_ctl resumed> ) = 0 <0.118113>
[pid 10382] <... pselect6 resumed> ) = 0 (Timeout) <0.117717>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 660 <0.159891>
[pid 10417] <... futex resumed> ) = 0 <0.917495>
[pid 10436] <... futex resumed> ) = 0 <0.208172>
[pid 10417] <... futex resumed> ) = 0 <0.190763>
[pid 10417] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 576 <0.154442>
As you might expect, we see slow read() calls here. From the contents of the read operations and the mem_cgroup context, it is clear that these read() calls refer to the memory.stat file, which shows memory usage and cgroup limits (cgroups being Docker's resource isolation technology). The cadvisor tool polls this file to obtain resource usage information for containers. Let's check whether it is the kernel or cadvisor that is doing something unexpected:
theojulienne@kube-node-bad ~ $ time cat /sys/fs/cgroup/memory/memory.stat >/dev/null
real 0m0.153s
user 0m0.000s
sys 0m0.152s
theojulienne@kube-node-bad ~ $
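The same measurement can be repeated programmatically to see whether the slow read is consistent. A minimal sketch, with a hypothetical function name; it measures wall-clock time only, whereas the time output above also separates user and system time.

```python
import time


def time_read(path, repeats=5):
    """Return the wall-clock seconds taken by each full read of path."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as handle:
            handle.read()  # read the whole file, like cat does
        timings.append(time.perf_counter() - start)
    return timings


def report(path, repeats=5):
    """Print the slowest and fastest read of path over several attempts."""
    timings = time_read(path, repeats)
    print(f"{path}: min {min(timings) * 1000:.1f} ms, max {max(timings) * 1000:.1f} ms")
```

On an affected node, calling report("/sys/fs/cgroup/memory/memory.stat") would be expected to show reads in the range of 100ms or more, in line with the 0.153s measured above. On a healthy node the read should be far quicker.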
Now we can reproduce the bug and understand that the Linux kernel is hitting a pathology.
Why is the read operation so slow?
At this stage it is much easier to find reports from other users about similar problems. As it turned out, this bug had been reported in the cadvisor tracker
The problem is that cgroups account for memory usage inside the namespace (container). When all processes in the cgroup exit, Docker releases the memory cgroup. However, "memory" is not just process memory. Although the process memory itself is no longer in use, it turns out that the kernel also charges cached contents, such as dentries and inodes (directory and file metadata), to the memory cgroup. From the issue description:
zombie cgroups: cgroups that have no processes and have been deleted, but still have memory charged to them (in my case, from the dentry cache, but it can also be charged from the page cache or tmpfs).
Checking every page in the cache when freeing a cgroup could be very slow, so the kernel takes the lazy approach: it waits until those pages are requested again, and only then cleans up the cgroup, when the memory is actually needed. Until that point, the cgroup is still taken into account when collecting statistics.
From a performance standpoint, they traded memory for performance: the initial cleanup is faster because some cached memory is left behind. This is fine. When the kernel reclaims the last of the cached memory, the cgroup is eventually cleaned up, so it cannot be called a "leak". Unfortunately, the specific implementation of the memory.stat lookup in this kernel version (4.9), combined with the huge amount of memory on our servers, means that it takes much longer to restore the latest cached data and clear out the cgroup zombies.
It turned out that some of our nodes had so many cgroup zombies that the read time and the latency exceeded a second.
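One way to estimate how many zombie cgroups a node carries is to compare the kernel's own count of memory cgroups with the number of cgroup directories that are still visible. The sketch below rests on assumptions that should be checked on the target system: that the node uses cgroup v1 mounted at /sys/fs/cgroup/memory, and that the num_cgroups column of /proc/cgroups still counts cgroups that were deleted but are kept alive by charged memory. The function names and sample numbers are hypothetical.

```python
import os


def memory_cgroup_count(proc_cgroups_text):
    """Return num_cgroups for the memory controller from /proc/cgroups contents."""
    for line in proc_cgroups_text.splitlines():
        if line.startswith("#"):
            continue  # header: subsys_name hierarchy num_cgroups enabled
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "memory":
            return int(fields[2])
    return None


def count_cgroup_dirs(root="/sys/fs/cgroup/memory"):
    """Count cgroup directories visible under root (root itself included)."""
    return sum(1 for _ in os.walk(root))


# Hypothetical /proc/cgroups contents, for illustration only.
SAMPLE = """\
#subsys_name\thierarchy\tnum_cgroups\tenabled
cpu\t3\t120\t1
memory\t9\t2500\t1
"""

print("memory cgroups known to the kernel:", memory_cgroup_count(SAMPLE))
```

If the kernel's count is far larger than the number of visible directories, the difference is a rough estimate of the zombie cgroups whose statistics still have to be walked on every memory.stat read.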
The workaround for the cadvisor problem is to immediately free the dentries/inodes caches system-wide, which immediately eliminates the read latency as well as the network latency on the host, because clearing the cache touches the cached zombie cgroup pages and frees them too. This is not a solution, but it confirms the cause of the problem.
It turned out that newer kernel versions (4.19+) improved the performance of the memory.stat call, so moving to that kernel fixed the problem. In the meantime, we had tooling to detect problematic nodes in Kubernetes clusters, drain them gracefully, and reboot them. We combed through all the clusters, found the nodes with high enough latency, and rebooted them. This gave us time to upgrade the OS on the remaining servers.
Summing up
Because this bug stalled RX NIC queue processing for hundreds of milliseconds, it caused both high latency on short connections and latency in the middle of connections, such as between MySQL requests and response packets.
Understanding and maintaining the performance of the most fundamental systems, such as Kubernetes, is critical to the reliability and speed of all the services built on them. Every system you run benefits from Kubernetes performance improvements.
Source: www.habr.com