Debugging network latency in Kubernetes


Kubernetes was already discussed on the official GitHub blog two years ago. Since then it has become the standard technology for deploying services, and it now runs a significant share of internal and public workloads. As our clusters grew and performance requirements tightened, we started noticing that some services on Kubernetes were sporadically experiencing latency that could not be explained by the load on the application itself.

Essentially, applications were experiencing seemingly random network latency of up to 100ms or more, leading to timeouts and retries. Services were expected to answer requests much faster than in 100ms, which is impossible if the connection itself takes that long. Separately, we saw MySQL queries that should take milliseconds, and that MySQL did complete in milliseconds, but from the requesting application's point of view the response took 100ms or more.

It quickly became clear that the problem occurred only when connecting to a Kubernetes node, even if the call originated outside Kubernetes. The easiest way to reproduce it was a Vegeta test that runs from any internal host, hits a Kubernetes service on a specific port, and sporadically records very high latency. In this article we will look at how we managed to track down the cause of this problem.

Eliminating unnecessary complexity from the failure chain

By reproducing a single example, we wanted to narrow down the problem and strip away unnecessary layers. At first there were far too many moving parts in the flow between Vegeta and pods on Kubernetes; to pin down a deeper network problem, some of them have to be ruled out.

[figure: the full request path between the Vegeta client and a pod on a Kubernetes node]

The client (Vegeta) opens a TCP connection to any node in the cluster. Kubernetes operates as an overlay network (on top of the existing data-center network) that uses IPIP, i.e. it encapsulates the overlay's IP packets inside the data center's IP packets. When a connection arrives at the first node, stateful Network Address Translation (NAT) translates the node's IP address and port to the IP address and port in the overlay network (specifically, of the pod running the application); for incoming packets, the reverse sequence of operations is performed. It is a complex system with lots of state and many components that are constantly updated and changed as services get deployed and moved around.
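Each of these layers can be poked at directly on a node. As a hedged illustration (the KUBE-SERVICES chain name comes from kube-proxy's iptables mode, and tunl0 from an IPIP-based overlay; your setup may differ):

ip -d link show tunl0                          # the IPIP tunnel device of the overlay
sudo iptables -t nat -L KUBE-SERVICES | head   # the stateful NAT rules kube-proxy maintains
sudo conntrack -L | head                       # the connection-tracking state behind that NAT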

Running tcpdump during the Vegeta test showed the delay occurring during the TCP handshake (between SYN and SYN-ACK). To strip away unnecessary complexity, we can use hping3 for simple "pings" with SYN packets: we check whether the response packet is delayed, then reset the connection. By filtering out everything but packets over 100ms, we get a much simpler way to reproduce the problem than the full Layer 7 Vegeta test. Here are "pings" of a Kubernetes node using TCP SYN/SYN-ACK against the service's node port (30927) at 10ms intervals, filtered to the slowest responses:

theojulienne@shell ~ $ sudo hping3 172.16.47.27 -S -p 30927 -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms

A first observation can be made right away. Judging by the sequence numbers and timings, these are not one-off congestion events: the delay builds up and is eventually drained.

Next we want to figure out which components might be contributing to the congestion. Maybe it's some of the hundreds of iptables NAT rules? Or a problem with IPIP tunneling on the network? One way to check is to test each step of the system by eliminating it. What happens if we remove NAT and the firewall, leaving only the IPIP part:

[figure: the simplified test path with NAT and firewall removed, leaving only IPIP]

Conveniently, Linux makes it easy to talk to the IP overlay layer directly if the machine is on the same network:

theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms

Judging by the results, the problem is still there! That rules out iptables and NAT. So is the problem TCP itself? Let's see how a regular ICMP ping behaves:

theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=28 ip=10.125.20.64 ttl=64 id=42594 icmp_seq=104 rtt=110.0 ms
len=28 ip=10.125.20.64 ttl=64 id=49448 icmp_seq=4022 rtt=141.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49449 icmp_seq=4023 rtt=131.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49450 icmp_seq=4024 rtt=121.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49451 icmp_seq=4025 rtt=111.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49452 icmp_seq=4026 rtt=101.1 ms
len=28 ip=10.125.20.64 ttl=64 id=50023 icmp_seq=4343 rtt=126.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50024 icmp_seq=4344 rtt=116.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50025 icmp_seq=4345 rtt=106.8 ms
len=28 ip=10.125.20.64 ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms

The results show the problem has not gone away. Maybe it's the IPIP tunnel, then? Let's simplify the test further:

[figure: the test reduced to direct traffic between the two hosts, bypassing the IPIP tunnel]

Could it be any packet sent between these two hosts?

theojulienne@kube-node-client ~ $ sudo hping3 172.16.47.27 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=46 ip=172.16.47.27 ttl=61 id=41127 icmp_seq=12564 rtt=140.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41128 icmp_seq=12565 rtt=130.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41129 icmp_seq=12566 rtt=120.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41130 icmp_seq=12567 rtt=110.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41131 icmp_seq=12568 rtt=100.7 ms
len=46 ip=172.16.47.27 ttl=61 id=9062 icmp_seq=31443 rtt=134.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9063 icmp_seq=31444 rtt=124.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9064 icmp_seq=31445 rtt=114.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms

We have now simplified the situation down to two Kubernetes nodes sending each other any packet at all, even ICMP pings. They still see the latency if the target host is a "bad" one (some are worse than others).

One last question: why does the delay occur only on kube-node servers? And does it happen when the kube-node is the sender or the receiver? Fortunately, this too is easy to figure out, by sending packets from a host outside Kubernetes to a "known bad" recipient. As you can see, the problem does not disappear:

theojulienne@shell ~ $ sudo hping3 172.16.47.27 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms

We then run the same requests from the previous source kube-node to an external host (which lets us isolate the sending side, since a ping involves both an RX and a TX component):

theojulienne@kube-node-client ~ $ sudo hping3 172.16.33.44 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
^C
--- 172.16.33.44 hping statistic ---
22352 packets transmitted, 22350 packets received, 1% packet loss
round-trip min/avg/max = 0.2/7.6/1010.6 ms

Examining captures of the delayed packets gave us more information. Specifically, the sender (bottom) sees the delay, but the receiver (top) does not; see the Delta column (in seconds):

[figure: packet captures from the receiver (top) and the sender (bottom); only the sender's Delta column shows the delay]

Furthermore, if you look at the difference in ordering of TCP and ICMP packets (by sequence number) on the receiver side, ICMP packets always arrive in the same order they were sent, only with uneven timing, while TCP packets sometimes interleave and some of them stall. In particular, if you examine the ports of the SYN packets, they are in order on the sender side but not on the receiver side.
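One way to observe this reordering directly (a sketch only; the interface name and port are illustrative) is to capture just the SYN packets on the receiver and compare the source-port order with the sender's:

# capture only SYNs for the test port; compare source-port order on both ends
sudo tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-syn != 0 and dst port 9876'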

There is a subtle difference in how the network cards of modern servers (such as the ones in our data center) handle packets carrying TCP versus ICMP. When a packet arrives, the network adapter hashes it per connection, i.e. it tries to split connections across queues and dispatch each queue to a separate processor core. For TCP, this hash includes the source and destination IP addresses as well as the ports; in other words, each connection is (potentially) hashed differently. For ICMP, only the IP addresses are hashed, since there are no ports.
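Most NICs let you inspect this hashing with ethtool. A sketch, assuming an RSS-capable driver such as ixgbe (output varies by driver; note there is no ICMP flow type, which is why ICMP falls back to an address-only hash):

sudo ethtool -n eth0 rx-flow-hash tcp4   # fields hashed for IPv4 TCP flows: IPs and ports
sudo ethtool -x eth0                     # RSS indirection table: hash buckets -> RX queues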

Another new observation: at this point we see the ICMP delay on all communication between the two hosts, but not on all TCP connections. This tells us the cause is likely related to RX queue hashing: the congestion is almost certainly in processing RX packets, not in sending responses.

This removes the transmit side from the list of suspects. We now know the packet-processing problem is on the receive side of some kube-node servers.

Understanding packet processing in the Linux kernel

To understand why the problem shows up on the receiver of some kube-node servers, let's look at how the Linux kernel processes packets.

Going back to the simplest classical implementation: the network card receives a packet and sends an interrupt to the Linux kernel signaling that a packet needs processing. The kernel stops whatever it is doing, switches context to the interrupt handler, processes the packet, and then returns to what it was doing.

[figure: classical interrupt-driven packet processing]

This context switching is slow: the latency may not have been noticeable on 10Mbps network cards in the '90s, but on modern 10G cards with a maximum throughput of 15 million packets per second, each core of a small eight-core server can be interrupted millions of times per second.

To avoid handling interrupts continuously, many years ago Linux added NAPI, the networking API all modern drivers use to improve performance at high speeds. At low rates the kernel still takes interrupts from the network card the old way. Once enough packets arrive to cross a threshold, the kernel disables interrupts and instead starts polling the network adapter, picking packets up in batches. The processing happens in a softirq, i.e. in software interrupt context after system calls and hardware interrupts, while the kernel (as opposed to userspace) is already running.

[figure: NAPI-style batched packet processing in softirq context]

This is much faster, but it creates a different problem. If there are too many packets, then all the time is spent processing packets from the network card, and user-space processes never get a chance to actually drain those queues (read from TCP connections, etc.). Eventually the queues fill up and we start dropping packets. Trying to strike a balance, the kernel sets a budget for the maximum number of packets processed in softirq context. Once this budget is exceeded, a separate ksoftirqd thread is woken up (you can see one per core in ps) which processes these softirqs outside the normal syscall/interrupt path. This thread is scheduled by the standard process scheduler, which tries to allocate resources fairly.

[figure: packet processing handed off to the ksoftirqd thread once the softirq budget is exhausted]
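The budget and the resulting deferrals are visible from userspace through standard kernel interfaces (the third hex column of softnet_stat counts "time squeeze" events, i.e. poll rounds cut short by the budget):

sysctl net.core.netdev_budget    # max packets per softirq poll round (typically 300)
cat /proc/net/softnet_stat       # per-CPU counters; 3rd column: budget exhausted
ps -e | grep ksoftirqd           # one ksoftirqd thread per core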

Having studied how the kernel processes packets, you can see that there is a potential for congestion here. If softirq calls happen less often, packets will have to wait for a while in the RX queue on the network card. This may be due to some task blocking the processor core, or something else preventing the core from running softirqs.

Narrowing down the processing to a core and a method

Softirq delays are only a hypothesis for now. But it makes sense, and we know we're seeing something very similar. So the next step is to confirm this theory, and if it holds, to find the reason for the delays.

Let's go back to our delayed packets:

len=46 ip=172.16.53.32 ttl=61 id=29573 icmp_seq=1953 rtt=99.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29574 icmp_seq=1954 rtt=89.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29575 icmp_seq=1955 rtt=79.2 ms
len=46 ip=172.16.53.32 ttl=61 id=29576 icmp_seq=1956 rtt=69.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29577 icmp_seq=1957 rtt=59.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29790 icmp_seq=2070 rtt=75.7 ms
len=46 ip=172.16.53.32 ttl=61 id=29791 icmp_seq=2071 rtt=65.6 ms
len=46 ip=172.16.53.32 ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms

As discussed earlier, these ICMP packets are hashed to a single RX NIC queue and processed by a single CPU core. If we want to understand how Linux works here, it's useful to know where (on which CPU core) and how (softirq or ksoftirqd) these packets are processed, so we can trace the whole flow.
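Finding that core is routine. A sketch using standard interfaces (the interrupt names are driver-specific, and <irq> is a placeholder):

grep eth0 /proc/interrupts             # per-RX-queue interrupt counts, per CPU
cat /proc/irq/<irq>/smp_affinity_list  # which core(s) service a given queue's IRQ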

Now it's time for tools that let you observe the Linux kernel in real time. Here we used bcc. This toolkit lets you write small C programs that hook arbitrary kernel functions and buffer the events to a user-space Python program, which can process them and return the results to you. Hooking arbitrary kernel functions is tricky business, but the toolkit is designed for maximum safety and is built precisely to trace the kind of production issues that are not easy to reproduce in a test or development environment.

The plan here is simple: we know the kernel processes these ICMP pings, so we hook the kernel function icmp_echo, which takes an incoming ICMP echo request packet and initiates the ICMP echo reply. We can identify a packet by its incrementing icmp_seq number, the same one hping3 shows above.
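To sketch the idea before showing the real thing: the same hook could be written as a bpftrace one-liner (an illustration only, not what we ran; it assumes a kernel with BTF so the struct casts resolve, and swaps bytes because the sequence field arrives in network byte order):

sudo bpftrace -e '
kprobe:icmp_echo
{
    $skb  = (struct sk_buff *)arg0;
    $icmp = (struct icmphdr *)($skb->head + $skb->transport_header);
    $seq  = $icmp->un.echo.sequence;
    // pid/tid/comm identify whatever task was on-CPU when the packet was handled
    printf("%-8d %-8d %-16s seq=%d\n", pid, tid, comm,
           (($seq & 0xff) << 8) | ($seq >> 8));
}'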

The code of the actual bcc script looks complicated, but it's not as scary as it seems. The function icmp_echo is passed a struct sk_buff *skb: this is the packet with the echo request. We can trace it, extract the echo.sequence field (which corresponds to the icmp_seq from hping3 above), and send it to user space. It's also convenient to capture the current process name/id. Below are the results we see live while the kernel is processing packets:

TGID    PID     PROCESS NAME    ICMP_SEQ
0       0       swapper/11      770
0       0       swapper/11      771
0       0       swapper/11      772
0       0       swapper/11      773
0       0       swapper/11      774
20041   20086   prometheus      775
0       0       swapper/11      776
0       0       spokes-report-s 777

It should be noted here that in softirq context, the process that happened to be running at the time shows up as the "process", even though it is really the kernel processing packets within kernel context.

With this tool we can correlate specific processes with the specific packets that show up as delays in hping3. Let's run a simple grep over this capture for the icmp_seq values of interest. Packets matching the icmp_seq values above are marked with the RTT we observed earlier (values in parentheses are the expected RTTs of packets we had filtered out because their RTT was under 50ms):

TGID    PID     PROCESS NAME    ICMP_SEQ ** RTT
--
10137   10436   cadvisor        1951
10137   10436   cadvisor        1952
76      76      ksoftirqd/11    1953 ** 99ms
76      76      ksoftirqd/11    1954 ** 89ms
76      76      ksoftirqd/11    1955 ** 79ms
76      76      ksoftirqd/11    1956 ** 69ms
76      76      ksoftirqd/11    1957 ** 59ms
76      76      ksoftirqd/11    1958 ** (49ms)
76      76      ksoftirqd/11    1959 ** (39ms)
76      76      ksoftirqd/11    1960 ** (29ms)
76      76      ksoftirqd/11    1961 ** (19ms)
76      76      ksoftirqd/11    1962 ** (9ms)
--
10137   10436   cadvisor        2068
10137   10436   cadvisor        2069
76      76      ksoftirqd/11    2070 ** 75ms
76      76      ksoftirqd/11    2071 ** 65ms
76      76      ksoftirqd/11    2072 ** 55ms
76      76      ksoftirqd/11    2073 ** (45ms)
76      76      ksoftirqd/11    2074 ** (35ms)
76      76      ksoftirqd/11    2075 ** (25ms)
76      76      ksoftirqd/11    2076 ** (15ms)

The results tell us several things. First, all of these packets are processed in the ksoftirqd/11 context, meaning that for this particular pair of machines, ICMP packets hash to core 11 on the receiving end. We also see that every time there is a stall, some packets are processed in the context of cadvisor's system call, and then ksoftirqd takes over and works off the accumulated queue: exactly the number of packets that piled up behind cadvisor.

The fact that cadvisor is always the thing running immediately beforehand points to its involvement in the problem. Ironically, the purpose of cadvisor is to "analyze resource usage and performance characteristics of running containers", not to cause this performance issue.

As with other container-related components, these are all fairly cutting-edge tools, and performance problems can be expected under certain unforeseen circumstances.

What is cadvisor doing that stalls the packet queue?

We now have a good picture of how the stall happens, which process causes it, and on which CPU. We see that due to hard blocking, the Linux kernel doesn't manage to schedule ksoftirqd in time. And we see that packets get processed in cadvisor's context. It's reasonable to assume that cadvisor starts a slow syscall, at the end of which all the packets accumulated up to that point get processed:

[figure: packets queuing on the RX queue while a slow cadvisor syscall occupies the core, then being worked off all at once]

That's a theory, but how do we test it? What we can do is trace the CPU core through this whole process, find the point where the packet count blows the budget and ksoftirqd gets called, and then look back a little to see what exactly was running on that CPU core just before that point. It's like X-raying the CPU every few milliseconds. It would look something like this:

[figure: sampling the CPU core every few milliseconds to see what ran right before ksoftirqd]

Conveniently, all of this can be done with existing tools. For example, perf record samples a given CPU core at a specified frequency and can generate a call graph of the live system, covering both userspace and the Linux kernel. You can take this recording and process it with a small fork of Brendan Gregg's FlameGraph tooling, which preserves the ordering of stack traces. We can save one-line stack traces every 1ms, then highlight and save a sample from 100 milliseconds before the trace hits ksoftirqd:

# record 999 times a second, or every 1ms with some offset so not to align exactly with timers
sudo perf record -C 11 -g -F 999
# take that recording and make a simpler stack trace.
sudo perf script 2>/dev/null | ./FlameGraph/stackcollapse-perf-ordered.pl | grep ksoftir -B 100

Here are the results:

(hundreds of traces that look similar)

cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll;ixgbe_clean_rx_irq;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;bond_handle_frame;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;ipip_tunnel_xmit;ip_tunnel_xmit;iptunnel_xmit;ip_local_out;dst_output;__ip_local_out;nf_hook_slow;nf_iterate;nf_conntrack_in;generic_packet;ipt_do_table;set_match_v4;ip_set_test;hash_net4_kadt;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;hash_net4_test
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;gro_cell_poll;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;dev_queue_xmit_nit;packet_rcv;tpacket_rcv;sch_direct_xmit;validate_xmit_skb_list;validate_xmit_skb;netif_skb_features;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;__dev_queue_xmit;dev_hard_start_xmit;__bpf_prog_run;__bpf_prog_run

There is a lot here, but the main thing is that we find the "cadvisor before ksoftirqd" pattern we saw earlier in the ICMP tracer. What does it mean?

Each line is a trace of the CPU at a specific moment in time. Each call down the stack in a line is separated by a semicolon. In the middle of the lines we see the syscall being made: read(): .... ;do_syscall_64;sys_read; .... So cadvisor spends a lot of time in the read() system call, in functions related to mem_cgroup_* (the top of the call stack / end of the line).

A stack trace is not a convenient place to see what exactly is being read, so let's run strace, watch what cadvisor does, and find the system calls longer than 100ms:

theojulienne@kube-node-bad ~ $ sudo strace -p 10137 -T -ff 2>&1 | egrep '<0.[1-9]'
[pid 10436] <... futex resumed> ) = 0 <0.156784>
[pid 10432] <... futex resumed> ) = 0 <0.258285>
[pid 10137] <... futex resumed> ) = 0 <0.678382>
[pid 10384] <... futex resumed> ) = 0 <0.762328>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 658 <0.179438>
[pid 10384] <... futex resumed> ) = 0 <0.104614>
[pid 10436] <... futex resumed> ) = 0 <0.175936>
[pid 10436] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.228091>
[pid 10427] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.207334>
[pid 10411] <... epoll_ctl resumed> ) = 0 <0.118113>
[pid 10382] <... pselect6 resumed> ) = 0 (Timeout) <0.117717>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 660 <0.159891>
[pid 10417] <... futex resumed> ) = 0 <0.917495>
[pid 10436] <... futex resumed> ) = 0 <0.208172>
[pid 10417] <... futex resumed> ) = 0 <0.190763>
[pid 10417] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 576 <0.154442>

As you might expect, we see slow read() calls here. From the content of the reads and the mem_cgroup context it is clear that these read() calls are against the memory.stat file, which reports memory usage and limits of a cgroup (the resource-isolation technology Docker relies on). The cadvisor tool queries this file to obtain resource-usage data for containers. Let's check whether it's the kernel or cadvisor doing something unexpected:

theojulienne@kube-node-bad ~ $ time cat /sys/fs/cgroup/memory/memory.stat >/dev/null

real 0m0.153s
user 0m0.000s
sys 0m0.152s
theojulienne@kube-node-bad ~ $

Now we can reproduce the bug with a single command and see that it is the Linux kernel that exhibits the pathological behavior.

Why is the read so slow?

At this stage it's much easier to find reports from other users about similar problems. As it turned out, this bug had already been reported in the cadvisor issue tracker as a problem of excessive CPU usage; nobody had noticed that the latency also randomly shows up in the network stack. It had indeed been observed that cadvisor consumed more CPU time than expected, but this wasn't given much weight, since our servers have plenty of CPU resources, so the problem was never studied carefully.

The problem is that cgroups account for memory usage within their namespace (container). When all processes in a cgroup exit, Docker releases the memory cgroup. But "memory" is not just process memory. Although the process memory itself is no longer in use, it turns out the kernel also charges cached content, such as dentries and inodes (directory and file metadata), to the memory cgroup. From the problem description:

Zombie cgroups: cgroups that have no processes and have been deleted, but still have memory charged to them (in my case, from the dentry cache, but it can also come from the page cache or tmpfs).

Walking all the pages in the cache when freeing a cgroup can be very slow, so the kernel takes the lazy approach: it waits until those pages are reclaimed, and only then, when the memory is actually needed, finally clears the cgroup. Up to that point, the cgroup is still counted when statistics are collected.

From a performance standpoint, they traded memory for speed: the initial cleanup is faster because some cached memory is left behind. That's fine; when the kernel reclaims the last of that cached memory, the cgroup is eventually cleared, so it cannot be called a "leak". Unfortunately, the specific implementation of the lookup in memory.stat in this kernel version (4.9), combined with the huge amount of memory on our servers, means it takes much longer to reclaim the last cached data and clear out the zombie cgroups.

It turned out that some of our nodes had accumulated so many zombie cgroups that the read latency exceeded a full second.
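A rough way to gauge how many zombie cgroups a node has accumulated (a hypothetical check, assuming cgroup v1 with the memory controller mounted at the usual path) is to compare the kernel-side count with what is actually visible:

awk '$1 == "memory" {print "kernel-side memory cgroups:", $3}' /proc/cgroups
find /sys/fs/cgroup/memory -type d | wc -l   # cgroup directories still visible
# a kernel-side count far above the directory count points to zombies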

The cadvisor workaround is to free the dentry/inode caches system-wide immediately, which instantly eliminates both the read latency and the network latency on the host, since dropping the cache touches the cached zombie-cgroup pages and frees them too. This is not a fix, but it confirms the cause of the problem.
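The system-wide cache drop boils down to a single write (the value 2 asks the kernel to reclaim slab objects, which is where dentries and inodes live):

sync
echo 2 | sudo tee /proc/sys/vm/drop_caches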

It turned out that newer kernel versions (4.19+) improved the performance of the memory.stat call, so moving to that kernel fixed the problem. In parallel, we had tooling to detect problem nodes in the Kubernetes clusters, gracefully drain them, and reboot them. We combed through all the clusters, found the nodes with high enough latency, and rebooted them. That bought us time to update the OS on the remaining servers.
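The drain-and-reboot cycle itself is standard Kubernetes housekeeping; per node it looks roughly like this ($NODE is a placeholder, and --delete-local-data reflects kubectl of that era):

kubectl cordon $NODE
kubectl drain $NODE --ignore-daemonsets --delete-local-data
# reboot the machine, then put it back into rotation:
kubectl uncordon $NODE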

To summarize

Because this bug stalled RX NIC queue processing for hundreds of milliseconds, it simultaneously produced high latency on short-lived connections and mid-connection delays, such as between a MySQL request and its response packet.

Understanding and maintaining the performance of the most fundamental systems, such as Kubernetes, is critical to the reliability and speed of every service built on top of them. Every system you run benefits from Kubernetes performance improvements.

source: www.habr.com
