A couple of years ago, Kubernetes was already being discussed on the official GitHub blog. Since then, it has become the standard technology for deploying services, and debugging it has become an important part of operations. This article is about one such debugging story.
Critically, applications were experiencing seemingly random network latency of up to 100ms or more, which caused timeouts and retries. Services were expected to answer requests well under 100ms, and that is impossible when the connection itself takes that long. Separately, we observed very fast MySQL queries that should take milliseconds, and MySQL did complete them in milliseconds, but from the requesting application's perspective the response took 100ms or more.
It quickly became clear that the problem only occurred on connections to a Kubernetes node, even when the call originated outside Kubernetes. The easiest way to reproduce the problem was a load test with Vegeta.
Eliminating unnecessary complexity in the chain leading to the failure
By reproducing a single example, we wanted to narrow the problem's scope and strip away unnecessary layers. At first there were far too many moving parts in the path between Vegeta and pods running on Kubernetes. To pin down a deeper network problem, some of them had to be eliminated.
The client (Vegeta) creates a TCP connection with any node in the cluster. Kubernetes operates as an overlay network (on top of the existing data-center network) that uses IPIP, that is, it encapsulates the overlay network's IP packets inside the data center's IP packets.
The tcpdump utility showed that in the Vegeta test there was a delay during the TCP handshake (between SYN and SYN-ACK). To strip away that unneeded complexity, you can use hping3 for simple "pings" with SYN packets: we check whether there is a delay in the response packet, then reset the connection. We can filter the output down to packets over 100ms and get a much simpler way to reproduce the problem than the full Layer 7 Vegeta test. Here is a Kubernetes node being "pinged" with TCP SYN/SYN-ACK on the service's node port (30927) at 10ms intervals, filtered for the slowest responses:
theojulienne@shell ~ $ sudo hping3 172.16.47.27 -S -p 30927 -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms
A first observation can be made right away. Judging by the sequence numbers and timings, these are clearly not one-off stalls. The delay accumulates and is eventually worked off.
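That drain pattern is visible in the numbers themselves: with a packet sent every 10ms, consecutive RTTs that fall by roughly 10ms each mean one stall was released all at once. A small sketch (not part of the original investigation) parsing the sample lines above:

```python
# Sketch: hping3 sent a SYN every 10ms, so a single stall that drains in one
# burst should show consecutive RTTs falling by ~10ms per packet.
# The sample lines are copied from the capture above.
import re

lines = [
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms",
]

rtts = [float(re.search(r"rtt=([\d.]+)", line).group(1)) for line in lines]
steps = [round(a - b, 1) for a, b in zip(rtts, rtts[1:])]
print(steps)  # steps close to the 10ms send interval -> one stall, drained in a burst
```

The first two steps match the 10ms send interval almost exactly, which is what a queued backlog being released at once looks like.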
Next, we want to identify which components could be contributing to the congestion. Maybe it's some of the hundreds of iptables NAT rules? Or some problem with IPIP tunneling on the network? One way to check is to test each stage of the system by eliminating it. What happens if we remove the NAT and firewall logic, leaving only the IPIP part:
Conveniently, Linux makes it easy to address the IP overlay layer directly if the machine is on the same network:
theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms
Judging by the results, the problem is still there! That rules out iptables and NAT. So is the problem in TCP? Let's see how a regular ICMP ping behaves:
theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=28 ip=10.125.20.64 ttl=64 id=42594 icmp_seq=104 rtt=110.0 ms
len=28 ip=10.125.20.64 ttl=64 id=49448 icmp_seq=4022 rtt=141.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49449 icmp_seq=4023 rtt=131.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49450 icmp_seq=4024 rtt=121.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49451 icmp_seq=4025 rtt=111.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49452 icmp_seq=4026 rtt=101.1 ms
len=28 ip=10.125.20.64 ttl=64 id=50023 icmp_seq=4343 rtt=126.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50024 icmp_seq=4344 rtt=116.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50025 icmp_seq=4345 rtt=106.8 ms
len=28 ip=10.125.20.64 ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms
The results show the problem has not gone away. Maybe it's the IPIP tunnel? Let's simplify the test further:
What if we send arbitrary packets between these two hosts directly?
theojulienne@kube-node-client ~ $ sudo hping3 172.16.47.27 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=46 ip=172.16.47.27 ttl=61 id=41127 icmp_seq=12564 rtt=140.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41128 icmp_seq=12565 rtt=130.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41129 icmp_seq=12566 rtt=120.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41130 icmp_seq=12567 rtt=110.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41131 icmp_seq=12568 rtt=100.7 ms
len=46 ip=172.16.47.27 ttl=61 id=9062 icmp_seq=31443 rtt=134.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9063 icmp_seq=31444 rtt=124.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9064 icmp_seq=31445 rtt=114.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms
We have simplified the scenario down to two Kubernetes nodes sending each other any packet at all, even an ICMP ping. They still see the latency if the target host is a "bad" one (some are worse than others).
Now the last question: why does the delay occur only on kube-node servers? And does it happen when the kube-node is the sender or the receiver? Fortunately, this is also easy to work out by sending packets from a host outside Kubernetes to a "known bad" receiver. As you can see, the problem does not disappear:
theojulienne@shell ~ $ sudo hping3 172.16.47.27 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms
Then we run the same requests from the previous source kube-node to an external host (which rules out the source host, since a ping includes both an RX and a TX component):
theojulienne@kube-node-client ~ $ sudo hping3 172.16.33.44 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
^C
--- 172.16.33.44 hping statistic ---
22352 packets transmitted, 22350 packets received, 1% packet loss
round-trip min/avg/max = 0.2/7.6/1010.6 ms
Examining packet captures of the delayed packets gave us more information. Specifically, the sender (bottom) sees the timeout, but the receiver (top) does not; see the Delta column (in seconds):
In addition, if you look at the difference in the ordering of TCP and ICMP packets (by sequence number) on the receiver side, ICMP packets always arrive in the same order they were sent, but with uneven timing, while TCP packets sometimes interleave and some of them stall. In particular, if you examine the source ports of the SYN packets, they are in order on the sender side but not on the receiver side.
There is a subtle difference in how the network cards of modern servers handle packets containing TCP versus ICMP. When a packet arrives, the network adapter hashes it per connection, trying to spread connections across RX queues, each assigned to a separate CPU core. For TCP, this hash includes both source and destination IPs and ports, so each connection can hash differently. For ICMP there are no ports, so only the IPs are hashed.
Another new observation: at this point we see delays in ICMP on all communications between the two hosts, but not in TCP. This tells us the cause is almost certainly tied to the RX queue hashing: the congestion is in processing received packets, not in sending responses.
This rules out sending as a cause. We now know that the packet-processing problem is on the receive side, on some kube-node servers.
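The queue-selection behavior can be sketched as a toy model. This is not the NIC's actual Toeplitz/RSS hash, just an illustration of why all ICMP traffic between a fixed pair of hosts lands on one RX queue (and thus one CPU core), while TCP connections spread out:

```python
# Toy model of NIC receive-side scaling (RSS). TCP hashes the 4-tuple
# (IPs + ports), so different connections land on different RX queues;
# ICMP has no ports, so a fixed host pair always maps to the same queue.
N_QUEUES = 12  # illustrative queue count, not from the real hardware

def rx_queue(src_ip, dst_ip, src_port=0, dst_port=0):
    # Stand-in hash; the real NIC uses a Toeplitz hash over the same fields.
    return hash((src_ip, dst_ip, src_port, dst_port)) % N_QUEUES

# 100 ICMP pings between the same two hosts: always one queue.
icmp_queues = {rx_queue("172.16.33.44", "172.16.47.27") for _ in range(100)}
# 100 TCP connections with varying source ports: spread across queues.
tcp_queues = {rx_queue("172.16.33.44", "172.16.47.27", sport, 30927)
              for sport in range(40000, 40100)}

print(len(icmp_queues))  # 1: every ping hits the same queue/core
print(len(tcp_queues))   # >1: connections spread across queues
```

This is why an overloaded single core stalls every ICMP ping between the two hosts, while only some TCP connections (the ones hashed to that core) are affected.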
Understanding packet processing in the Linux kernel
To understand why the problem occurs on the receive side on some kube-node servers, let's look at how the Linux kernel processes packets.
Going back to the simplest traditional implementation: the network card receives a packet and sends an interrupt to the Linux kernel saying there is a packet to handle. The kernel stops other work, switches context to the interrupt handler, processes the packet, and then returns to what it was doing.
This context switching is slow: the latency may not have been noticeable on 10Mbps network cards in the '90s, but on modern 10G cards with a maximum throughput of 15 million packets per second, every core of a small eight-core server could be interrupted millions of times per second.
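A quick back-of-the-envelope check of that claim, using the figures from the text:

```python
# Back-of-the-envelope: if every packet raised an interrupt, how many
# interrupts per second would each core of an 8-core server see at line rate?
pps = 15_000_000   # ~15M packets/s, peak rate of a modern 10G NIC (from the text)
cores = 8
per_core = pps / cores
print(f"{per_core:,.0f} interrupts per core per second")
```

Nearly two million interrupts per core per second, each one forcing a context switch, which is exactly why per-packet interrupts do not scale.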
To avoid handling interrupts continuously, Linux gained NAPI many years ago: the networking API that all modern drivers use to improve performance at high packet rates. At low rates, the kernel still takes interrupts from the network card the old way. Once enough packets arrive to cross a threshold, the kernel disables interrupts and instead starts polling the adapter, pulling packets in batches. This processing happens in a softirq, a software-interrupt context that runs after system calls and hardware interrupts, while the kernel (not userspace) is already running.
This is much faster, but it creates a different problem. If there are very many packets, then all the time goes to processing packets from the network card, and userspace processes never get a chance to actually drain those queues (read from TCP connections, and so on). Eventually the queues fill up and we start dropping packets. Trying to strike a balance, the kernel sets a budget for the maximum number of packets processed in softirq context. Once that budget is exceeded, a separate ksoftirqd thread is woken (you can see one of them in ps for each core), which processes these softirqs outside the normal syscall/interrupt path. This thread is scheduled by the standard process scheduler, which tries to share resources fairly.
Having studied how the kernel processes packets, you can see there is real potential for congestion here. If softirqs are serviced less often, packets will sit for a while in the RX queue on the network card waiting to be processed. That can happen because some task is hogging the CPU core, or because something else is preventing the core from running softirqs.
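The budget-and-ksoftirqd handoff can be modeled as a toy event loop. The numbers here are made up (the real budget is tunable via the net.core.netdev_budget sysctl), but the shape of the problem is the same:

```python
# Toy model of NAPI softirq budgeting: packets are drained in softirq context
# up to a fixed budget; the remainder waits for ksoftirqd, which competes with
# other runnable tasks for the CPU. All numbers are illustrative only.
BUDGET = 300  # max packets processed per softirq invocation (made-up value)

def drain(rx_queue_len, cpu_busy_ms):
    """Return (packets deferred to ksoftirqd, extra latency they incur)."""
    processed_in_softirq = min(rx_queue_len, BUDGET)
    deferred = rx_queue_len - processed_in_softirq
    # Deferred packets wait until ksoftirqd gets scheduled; if another task
    # is hogging the core (e.g., a long syscall), that wait becomes latency.
    extra_latency_ms = cpu_busy_ms if deferred else 0
    return deferred, extra_latency_ms

print(drain(250, cpu_busy_ms=100))   # under budget: no deferral, no stall
print(drain(1000, cpu_busy_ms=100))  # over budget: backlog waits out the busy core
```

In the second case the 700 deferred packets all absorb the 100ms the core was busy, and then drain as a burst, which is exactly the RTT staircase seen in the hping3 output.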
Narrowing it down to a specific core or method
Softirq delays are only a hypothesis at this point. But it makes sense, and we know we are seeing something very much like it. So the next step is to confirm the theory, and if it holds, find the cause of the delays.
Let's go back to our slow packets:
len=46 ip=172.16.53.32 ttl=61 id=29573 icmp_seq=1953 rtt=99.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29574 icmp_seq=1954 rtt=89.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29575 icmp_seq=1955 rtt=79.2 ms
len=46 ip=172.16.53.32 ttl=61 id=29576 icmp_seq=1956 rtt=69.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29577 icmp_seq=1957 rtt=59.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29790 icmp_seq=2070 rtt=75.7 ms
len=46 ip=172.16.53.32 ttl=61 id=29791 icmp_seq=2071 rtt=65.6 ms
len=46 ip=172.16.53.32 ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms
As discussed earlier, these ICMP packets hash to a single RX NIC queue and are processed by a single CPU core. If we want to understand how Linux behaves, it is useful to know where (on which CPU core) and how (softirq, ksoftirqd) these packets are processed, so we can follow the whole sequence.
Now it's time for tools that let you watch the Linux kernel in real time. Here we used bcc. This toolkit lets you write small C programs that hook arbitrary kernel functions and buffer events up to a userspace Python program, which can process them and return results. The kernel hooks and the buffering are handled by eBPF, which is well supported in the kernel and adds very little overhead.
The plan here is simple: we know the kernel handles these ICMP pings, so let's hook the kernel function icmp_echo, which accepts an incoming ICMP "echo request" packet and initiates sending the ICMP "echo reply". We can identify a packet by the incrementing icmp_seq number that hping3 shows above.
The icmp_echo function is passed a struct sk_buff *skb: the packet containing the "echo request". We can trace it, pull out the echo.sequence field (which maps to the icmp_seq reported by hping3 above), and send it up to user space. It is also convenient to capture the current process name/id. Below are the results we see live while the kernel is processing these packets:
TGID    PID     PROCESS NAME     ICMP_SEQ
0       0       swapper/11       770
0       0       swapper/11       771
0       0       swapper/11       772
0       0       swapper/11       773
0       0       swapper/11       774
20041   20086   prometheus       775
0       0       swapper/11       776
0       0       swapper/11       777
4512    4542    spokes-report-s  778
Note that in the softirq context, whichever process made the system call shows up as the "process", when in fact it is the kernel safely processing packets within its own kernel context.
With this tool we can correlate specific processes with the specific packets that show delay in hping3. A simple grep over this capture for certain icmp_seq values does the job. Packets matching the icmp_seq values above are flagged along with the RTT we observed for them (in parentheses are the expected RTT values for packets we had filtered out because their RTT was below 50ms):
TGID    PID     PROCESS NAME    ICMP_SEQ ** RTT
--
10137   10436   cadvisor        1951
10137   10436   cadvisor        1952
76      76      ksoftirqd/11    1953 ** 99ms
76      76      ksoftirqd/11    1954 ** 89ms
76      76      ksoftirqd/11    1955 ** 79ms
76      76      ksoftirqd/11    1956 ** 69ms
76      76      ksoftirqd/11    1957 ** 59ms
76      76      ksoftirqd/11    1958 ** (49ms)
76      76      ksoftirqd/11    1959 ** (39ms)
76      76      ksoftirqd/11    1960 ** (29ms)
76      76      ksoftirqd/11    1961 ** (19ms)
76      76      ksoftirqd/11    1962 ** (9ms)
--
10137   10436   cadvisor        2068
10137   10436   cadvisor        2069
76      76      ksoftirqd/11    2070 ** 75ms
76      76      ksoftirqd/11    2071 ** 65ms
76      76      ksoftirqd/11    2072 ** 55ms
76      76      ksoftirqd/11    2073 ** (45ms)
76      76      ksoftirqd/11    2074 ** (35ms)
76      76      ksoftirqd/11    2075 ** (25ms)
76      76      ksoftirqd/11    2076 ** (15ms)
These results tell us several things. First, all of these packets are handled in the ksoftirqd/11 context. That means that for this particular pair of machines, ICMP packets hashed to core 11 on the receiving side. We also see that on every stall, packets are first processed in the context of the cadvisor system call; then ksoftirqd takes over and works off the accumulated queue, exactly as many packets as piled up behind cadvisor.
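The grep-style correlation above amounts to a join on icmp_seq between two streams: the kernel-hook events and the hping3 RTT samples. A minimal sketch of that step, with a few rows standing in for the real captures (RTT values taken from the slow-packet listing above):

```python
# Sketch: join kernel-hook events (process name per icmp_seq) against
# the hping3 RTT samples on icmp_seq. Rows are abbreviated stand-ins
# for the real captures shown above.
trace = {
    1951: "cadvisor",
    1952: "cadvisor",
    1953: "ksoftirqd/11",
    1954: "ksoftirqd/11",
}
rtts = {1953: 99.3, 1954: 89.3}  # only the slow packets passed the filter

annotated = []
for seq, proc in sorted(trace.items()):
    annotated.append((seq, proc, rtts.get(seq)))  # None = RTT was filtered out

for seq, proc, rtt in annotated:
    print(f"{seq} {proc}" + (f" ** {rtt}ms" if rtt is not None else ""))
```

The pattern that falls out is the one in the table: cadvisor rows with no RTT flag, immediately followed by ksoftirqd/11 rows carrying the large RTTs.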
The fact that cadvisor is always running immediately beforehand points to its involvement in the problem. Ironically, cadvisor's purpose is to analyze resource usage and performance characteristics of running containers, not to cause this kind of performance problem.
As with other aspects of containers, these are all fairly cutting-edge tools, and performance problems can be expected under certain unforeseen circumstances.
What is cadvisor doing that stalls the packet queue?
We now have a good understanding of how the stall happens, which process causes it, and on which CPU. We see that because of the hard blocking, the Linux kernel does not get a chance to schedule ksoftirqd in time. And we see that packets are processed in cadvisor's context. It is reasonable to assume that cadvisor launches a slow syscall, after which all the packets accumulated up to that moment get processed:
That's a theory, but how do we test it? What we can do is trace what runs on this CPU core throughout the process, find the point where the packet count blows through the budget and ksoftirqd is woken, and then look back a little to see exactly what was running on the CPU core just before that point. It's like X-raying the CPU every few milliseconds. It would look something like this:
Conveniently, all of this can be done with existing tools. For example, perf record samples a given CPU core at a specified frequency and can generate a call graph of the live system, covering both user space and the kernel. Taking that recording through a small fork of Brendan Gregg's FlameGraph stackcollapse that preserves stack-trace ordering, we can get a one-line stack trace for every 1ms sample and then look at the 100ms of samples just before ksoftirqd enters the trace:
# record 999 times a second, or every 1ms with some offset so not to align exactly with timers
sudo perf record -C 11 -g -F 999
# take that recording and make a simpler stack trace.
sudo perf script 2>/dev/null | ./FlameGraph/stackcollapse-perf-ordered.pl | grep ksoftir -B 100
Here are the results:
(hundreds of similar-looking traces)
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll;ixgbe_clean_rx_irq;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;bond_handle_frame;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;ipip_tunnel_xmit;ip_tunnel_xmit;iptunnel_xmit;ip_local_out;dst_output;__ip_local_out;nf_hook_slow;nf_iterate;nf_conntrack_in;generic_packet;ipt_do_table;set_match_v4;ip_set_test;hash_net4_kadt;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;hash_net4_test
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;gro_cell_poll;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;dev_queue_xmit_nit;packet_rcv;tpacket_rcv;sch_direct_xmit;validate_xmit_skb_list;validate_xmit_skb;netif_skb_features;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;__dev_queue_xmit;dev_hard_start_xmit;__bpf_prog_run;__bpf_prog_run
There is a lot here, but the main thing is that we found the "cadvisor before ksoftirqd" pattern we saw earlier in the ICMP tracer. What does it mean?
Each line is a sample of the CPU at a particular moment in time. Each call down the stack on a line is separated from the next by a semicolon. In the middle of the lines we can see the syscall being made: read(): .... ;do_syscall_64;sys_read; .... So cadvisor spends a lot of time in read() syscalls related to the mem_cgroup_* functions (the top of the call stack / end of the line).
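Each collapsed line can be picked apart mechanically. A small sketch that checks one of the hot traces above for exactly the properties just described (process name, read() syscall, mem_cgroup leaf):

```python
# Sketch: stackcollapse output is one semicolon-separated stack per line,
# with the leaf-most frame last. We can verify mechanically that the hot
# cadvisor stacks are read() syscalls ending in mem_cgroup_* accounting.
stack = ("cadvisor;[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;"
         "sys_read;vfs_read;seq_read;memcg_stat_show;"
         "mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages")

frames = stack.split(";")
print(frames[0])                            # process name: cadvisor
print("sys_read" in frames)                 # True: it's inside a read() syscall
print(frames[-1].startswith("mem_cgroup"))  # True: leaf is cgroup accounting
```

Applied across the hundreds of sampled lines, this is exactly what the grep over the stackcollapse output was doing by eye.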
A call trace doesn't conveniently show what exactly is being read, so let's run strace, watch what cadvisor is doing, and find the system calls longer than 100ms:
theojulienne@kube-node-bad ~ $ sudo strace -p 10137 -T -ff 2>&1 | egrep '<0.[1-9]'
[pid 10436] <... futex resumed> ) = 0 <0.156784>
[pid 10432] <... futex resumed> ) = 0 <0.258285>
[pid 10137] <... futex resumed> ) = 0 <0.678382>
[pid 10384] <... futex resumed> ) = 0 <0.762328>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 658 <0.179438>
[pid 10384] <... futex resumed> ) = 0 <0.104614>
[pid 10436] <... futex resumed> ) = 0 <0.175936>
[pid 10436] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.228091>
[pid 10427] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.207334>
[pid 10411] <... epoll_ctl resumed> ) = 0 <0.118113>
[pid 10382] <... pselect6 resumed> ) = 0 (Timeout) <0.117717>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 660 <0.159891>
[pid 10417] <... futex resumed> ) = 0 <0.917495>
[pid 10436] <... futex resumed> ) = 0 <0.208172>
[pid 10417] <... futex resumed> ) = 0 <0.190763>
[pid 10417] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 576 <0.154442>
As you would expect, we see slow read() calls here. From the content of the reads and the mem_cgroup context, it is clear that these read() calls target the memory.stat file, which reports memory usage and cgroup limits (the resource-isolation technology used by Docker). The cadvisor tool queries this file to obtain resource-usage data for the containers. Let's check whether it is the kernel or cadvisor doing something unexpected:
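memory.stat (cgroup v1) is a flat "key value" text file, which is exactly the format visible in the truncated strace reads above. A small sketch parsing it, with sample values copied from that strace output:

```python
# Sketch: cgroup v1 memory.stat is plain "key value" lines; the strace output
# above shows cadvisor reading exactly this format. Sample content mirrors
# the "cache 154234880 / rss 507904" fragment seen in the trace.
sample = """cache 154234880
rss 507904
rss_huge 0
mapped_file 0
"""

stats = {}
for line in sample.splitlines():
    key, value = line.split()
    stats[key] = int(value)

print(stats["cache"], stats["rss"])
```

Reading and parsing the file is trivial; as the `time cat` below shows, the cost is entirely inside the kernel producing its contents.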
theojulienne@kube-node-bad ~ $ time cat /sys/fs/cgroup/memory/memory.stat >/dev/null
real 0m0.153s
user 0m0.000s
sys 0m0.152s
theojulienne@kube-node-bad ~ $
Now we can reproduce the bug with a single command, and it's clear that the pathology is in the Linux kernel.
Why is the read so slow?
At this stage it is much easier to find reports of similar problems from other users. As it turned out, this bug had already been filed in the cadvisor tracker as a case of excessive CPU usage; nobody had simply noticed that the latency also shows up, at random, in the network stack.
The problem is that cgroups account for memory usage within their namespace (container). When all the processes in a cgroup exit, Docker releases the memory cgroup. However, "memory" is not just process memory. Even though the process memory itself is no longer in use, it turns out the kernel also charges cached content, such as dentries and inodes (directory and file metadata), to the memory cgroup. From the problem description:
Zombie cgroups: cgroups that have no processes and have been deleted, but which still have memory charged to them (in my case, from the dentry cache, though it can also be allocated from the page cache or tmpfs).
Having the kernel walk every page in the cache when a cgroup is freed could be very slow, so a lazy approach is taken: wait until those pages are needed again, and only then, when memory is actually required, finally clear the cgroup. Until that point, the cgroup is still counted when statistics are gathered.
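The consequence for memory.stat can be caricatured in a few lines: the statistics walk visits every cgroup, including zombies, so read time grows with the zombie count. This is a toy model with arbitrary cost units, not the kernel's actual data structures:

```python
# Toy model of the zombie-cgroup effect: aggregating memory.stat walks every
# cgroup, including deleted ones that still pin cached pages, so read time
# grows with the number of zombies. Costs are arbitrary units, not kernel code.
def stat_walk_cost_us(live_cgroups, zombie_cgroups, cost_per_cgroup_us=2):
    return (live_cgroups + zombie_cgroups) * cost_per_cgroup_us

print(stat_walk_cost_us(200, 0))        # healthy node: microseconds
print(stat_walk_cost_us(200, 100_000))  # zombie-ridden node: ~200ms per read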
From a performance standpoint, this trades memory for speed: the initial cleanup is faster because some cached memory is left behind. That's fine: when the kernel uses up the last of that cached memory, the cgroup is eventually cleared, so it cannot be called a "leak". Unfortunately, the specific implementation of the lookup mechanism behind memory.stat in this kernel version (4.9), combined with the huge amount of memory on our servers, meant it took a very long time for the last cached data to be reclaimed and the cgroup zombies to be cleaned up.
It turned out that some of our nodes had accumulated so many cgroup zombies that a read showed latency exceeding a full second.
A workaround for the cadvisor issue is to immediately free the dentry/inode caches system-wide (e.g. with `echo 2 > /proc/sys/vm/drop_caches`, which drops reclaimable dentries and inodes). That instantly eliminated both the read latency and the network latency on the host, since dropping the cache touches the zombie cgroups' cached pages and frees them too. This is not a fix, but it confirms the cause of the problem.
It turned out that newer kernels (4.19+) had improved the performance of the memory.stat lookup, so moving to that kernel fixed the problem. In parallel, we had tooling to detect problem nodes in our Kubernetes clusters and to drain and reboot them gracefully. We combed through all the clusters, found the nodes with high enough latency, and rebooted them. That bought us time to update the OS on the remaining servers.
To sum up
Because this bug stopped RX NIC queue processing for hundreds of milliseconds at a time, it simultaneously produced high latency on short connections and mid-connection latency, for example between a MySQL request and its response packet.
Understanding and maintaining the performance of the most fundamental systems, such as Kubernetes, is critical to the reliability and speed of every service built on top of them. And every system you run benefits as Kubernetes performance improves.
source: www.habr.com