A few years ago, Kubernetes
In essence, applications were seeing seemingly random network latency of 100ms or more, which led to timeouts or retries. Services were expected to respond to requests much faster than 100ms, but that is impossible if the connection itself takes that long. Separately, we observed very fast MySQL queries that should have taken milliseconds, and MySQL did finish them in milliseconds, yet from the requesting application's point of view the response took 100ms or more.
It quickly became clear that the problem only occurred when connecting to a Kubernetes node, even if the call came from outside Kubernetes. The easiest way to reproduce the problem is a test with Vegeta
Removing unnecessary complexity from the path to failure
By reproducing the same example, we wanted to narrow the focus of the problem and strip away unnecessary layers of complexity. Initially there were too many elements in the flow between Vegeta and the Kubernetes pods. To pin down a deeper network problem, some of them have to be ruled out.
The client (Vegeta) creates a TCP connection to any node in the cluster. Kubernetes runs as an overlay network (on top of the existing datacenter network) that uses IPIP
The tcpdump utility shows that in the Vegeta test there is a delay during the TCP handshake (between SYN and SYN-ACK). To remove this unnecessary complexity, we can use hping3 for simple "pings" with SYN packets. We check whether there is a delay in the response packet and then reset the connection. We can filter the data to include only packets slower than 100ms, which gives us a simpler way to reproduce the problem than Vegeta's full layer 7 test. Here are Kubernetes node "pings" using TCP SYN/SYN-ACK on the service "node port" (30927) at 10ms intervals, filtered to the slowest responses:
theojulienne@shell ~ $ sudo hping3 172.16.47.27 -S -p 30927 -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms
A first observation can be made right away. Judging by the sequence numbers and timings, these are clearly not one-off stalls. The delays often come in groups, like a backlog that builds up and is eventually processed.
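The grouping described above can be checked mechanically. The following is a minimal sketch, assuming the hping3 output has been saved as text lines; the helper names `parse_replies` and `group_bursts` are ours and are not part of hping3.

```python
import re

# Pulls the sequence number and round-trip time out of one hping3 reply line.
# Works for both TCP replies ("seq=") and ICMP replies ("icmp_seq=").
REPLY_RE = re.compile(r"(?:icmp_)?seq=(\d+).*?rtt=([\d.]+) ms")


def parse_replies(lines):
    """Return a list of (sequence, rtt_ms) tuples for the lines that parse."""
    replies = []
    for line in lines:
        match = REPLY_RE.search(line)
        if match:
            replies.append((int(match.group(1)), float(match.group(2))))
    return replies


def group_bursts(replies):
    """Group replies whose sequence numbers directly follow each other."""
    bursts = []
    for seq, rtt in replies:
        if bursts and seq == bursts[-1][-1][0] + 1:
            bursts[-1].append((seq, rtt))
        else:
            bursts.append([(seq, rtt)])
    return bursts


# The six slow replies from the node-port test above.
sample = [
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms",
]

bursts = group_bursts(parse_replies(sample))
for burst in bursts:
    first, last = burst[0][0], burst[-1][0]
    print(f"seq {first}-{last}: {[rtt for _, rtt in burst]}")
```

In the first burst the RTT drops by roughly 10ms per packet, which matches the 10ms send interval: the packets were sent at different times but appear to have been answered at about the same moment.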
Next, we want to find out which components might be involved in the stalls. Is it some of the hundreds of iptables rules in NAT? Are there problems with the IPIP tunneling on the network? One way to check is to test each step of the system by eliminating it. What happens if we remove NAT and the firewall logic and leave only the IPIP part?
Fortunately, Linux makes it easy to reach the IP overlay layer directly if the machine is on the same network:
theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms
Judging by the results, the problem is still there! That rules out iptables and NAT. So is TCP the problem? Let's see how a regular ICMP ping behaves:
theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
len=28 ip=10.125.20.64 ttl=64 id=42594 icmp_seq=104 rtt=110.0 ms
len=28 ip=10.125.20.64 ttl=64 id=49448 icmp_seq=4022 rtt=141.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49449 icmp_seq=4023 rtt=131.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49450 icmp_seq=4024 rtt=121.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49451 icmp_seq=4025 rtt=111.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49452 icmp_seq=4026 rtt=101.1 ms
len=28 ip=10.125.20.64 ttl=64 id=50023 icmp_seq=4343 rtt=126.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50024 icmp_seq=4344 rtt=116.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50025 icmp_seq=4345 rtt=106.8 ms
len=28 ip=10.125.20.64 ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms
The results show that the problem has not gone away. Could it be the IPIP tunnel? Let's simplify the test further:
Could it be every packet sent between these two hosts?
theojulienne@kube-node-client ~ $ sudo hping3 172.16.47.27 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
len=46 ip=172.16.47.27 ttl=61 id=41127 icmp_seq=12564 rtt=140.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41128 icmp_seq=12565 rtt=130.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41129 icmp_seq=12566 rtt=120.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41130 icmp_seq=12567 rtt=110.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41131 icmp_seq=12568 rtt=100.7 ms
len=46 ip=172.16.47.27 ttl=61 id=9062 icmp_seq=31443 rtt=134.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9063 icmp_seq=31444 rtt=124.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9064 icmp_seq=31445 rtt=114.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms
We have simplified the situation to two Kubernetes nodes sending each other any packet at all, even an ICMP ping. They still see latency when the target host is "bad" (some are worse than others).
Now the last question: why does the delay only occur on kube-node servers? And does it happen when the kube-node is the sender or the receiver? Fortunately, this is also easy to find out by sending a packet from a host outside Kubernetes to the same "known bad" receiver. As you can see, the problem has not disappeared:
theojulienne@shell ~ $ sudo hping3 172.16.47.27 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms
We then run the same requests from the previous source kube-node to the external host (which rules out the source host, since a ping includes both an RX and a TX component):
theojulienne@kube-node-client ~ $ sudo hping3 172.16.33.44 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
^C
--- 172.16.33.44 hping statistic ---
22352 packets transmitted, 22350 packets received, 1% packet loss
round-trip min/avg/max = 0.2/7.6/1010.6 ms
Examining packet captures of the latency gave us some additional information. Specifically, the sender (bottom) sees this timeout, but the receiver (top) does not; see the Delta column (in seconds):
In addition, if you compare the ordering of TCP and ICMP packets (by sequence numbers) on the receiver side, ICMP packets always arrive in the same order in which they were sent, but with different timing. TCP packets, meanwhile, are sometimes interleaved, and some of them stall. In particular, if you look at the ports of the SYN packets, they are in order on the sender side but not on the receiver side.
There is a subtle difference in how TCP and ICMP packets are handled on the receiving side.
Another new observation: during this period we see ICMP delays on all communication between the two hosts, but not on TCP. This tells us that the cause is probably related to RX queue hashing: the stall is almost certainly in the processing of RX packets, not in the sending of responses.
This removes packet sending from the list of possible causes. We now know that the packet processing problem is on the receive side on some kube-node servers.
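To make the RX queue hashing argument concrete, here is a toy sketch. It assumes, as is common for receive-side scaling, that the hash covers addresses and ports for TCP but only addresses for ICMP, which has no ports. The CRC32 hash and the queue count of 16 are stand-ins chosen for illustration, not what a real network card uses.

```python
import zlib

NUM_RX_QUEUES = 16  # assumed queue count, for illustration only


def rx_queue(src_ip, dst_ip, src_port=None, dst_port=None):
    """Pick an RX queue from the flow fields (toy hash, not a NIC's real one)."""
    fields = [src_ip, dst_ip]
    if src_port is not None and dst_port is not None:
        # TCP: ports are part of the hashed fields.
        fields += [str(src_port), str(dst_port)]
    key = "|".join(fields).encode()
    return zlib.crc32(key) % NUM_RX_QUEUES


src, dst = "172.16.33.44", "172.16.47.27"

# ICMP between two hosts: the hashed fields never change.
icmp_queues = {rx_queue(src, dst) for _ in range(1000)}

# TCP SYNs with a different source port each time, as hping3 sends them.
tcp_queues = {rx_queue(src, dst, sport, 9876) for sport in range(1024, 2024)}

print("queues used by ICMP:", len(icmp_queues))
print("queues used by TCP:", len(tcp_queues))
```

Under these assumptions every ICMP packet between the two hosts lands in the same queue, so a single stalled core delays all of them in order, while TCP probes are spread across queues and only some of them stall.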
Understanding packet processing in the Linux kernel
To understand why the problem occurs at the receiver on some kube-node servers, let's look at how the Linux kernel processes packets.
Going back to the simplest traditional implementation, the network card receives a packet and sends an interrupt to the kernel.
This context switching is slow: the latency may not have been noticeable on 10Mbps network cards in the 90s, but on modern 10G cards with a maximum throughput of 15 million packets per second, each core of a small eight-core server can be interrupted millions of times per second.
To avoid handling interrupts all the time, Linux added, many years ago, a mechanism that picks packets up from the network card in batches.
This is much faster, but it causes a different problem. If there are too many packets, all the time goes into processing packets from the network card, and user-space processes have no time to actually drain these queues (reading from TCP connections and so on). Eventually the queues fill up and we start dropping packets. In an attempt to strike a balance, the kernel sets a budget for the maximum number of packets processed in softirq context. Once this budget is exceeded, a separate thread, ksoftirqd, is woken up (you will see one of them per core in ps) to handle these softirqs outside the normal syscall/interrupt path. This thread is scheduled using the standard process scheduler, which tries to allocate resources fairly.
Looking at how the kernel processes packets, you can see that there is a real possibility of stalls. If softirq processing runs less often, packets have to wait for a while in the RX queue on the network card before they are processed. This may be caused by some task blocking the processor core, or by something else preventing the core from running softirqs.
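A toy model shows what such a stall would look like from the sender's side. The sketch below assumes a single core that processes nothing for a fixed window and then drains its whole backlog at once; the function name and all timings are hypothetical and chosen to resemble the 10ms probe interval used earlier.

```python
def simulate_stall(send_interval_ms, stall_start_ms, stall_length_ms, total_ms):
    """Return (send_time_ms, extra_delay_ms) for each probe in a toy model.

    Probes that arrive while the core is stalled wait in the RX queue until
    the stall ends; all other probes are handled immediately.
    """
    stall_end_ms = stall_start_ms + stall_length_ms
    results = []
    for sent_at in range(0, total_ms, send_interval_ms):
        if stall_start_ms <= sent_at < stall_end_ms:
            delay = stall_end_ms - sent_at
        else:
            delay = 0
        results.append((sent_at, delay))
    return results


probes = simulate_stall(
    send_interval_ms=10, stall_start_ms=100, stall_length_ms=100, total_ms=300
)
delayed = [delay for _, delay in probes if delay > 0]
print(delayed)  # [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]
```

The descending staircase in steps of one send interval is the same shape as the RTTs seen in the hping3 output, which is what makes a blocked core a plausible explanation.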
Narrowing the processing down to a core or method
Softirq delays are only a guess for now. But it makes sense, and we know we are seeing something very similar. So the next step is to confirm this theory, and if it is confirmed, to find the cause of the delays.
Let's go back to our slow packets:
len=46 ip=172.16.53.32 ttl=61 id=29573 icmp_seq=1953 rtt=99.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29574 icmp_seq=1954 rtt=89.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29575 icmp_seq=1955 rtt=79.2 ms
len=46 ip=172.16.53.32 ttl=61 id=29576 icmp_seq=1956 rtt=69.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29577 icmp_seq=1957 rtt=59.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29790 icmp_seq=2070 rtt=75.7 ms
len=46 ip=172.16.53.32 ttl=61 id=29791 icmp_seq=2071 rtt=65.6 ms
len=46 ip=172.16.53.32 ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms
As discussed earlier, these ICMP packets are hashed into a single RX NIC queue and processed by a single CPU core. If we want to understand what Linux is doing, it helps to know where (on which CPU core) and how (softirq, ksoftirqd) these packets are processed, so that we can trace the process.
Now it is time to use tools that let us observe the Linux kernel in real time, which is what we did here.
The plan here is simple: we know the kernel is processing these ICMP pings, so we put a hook on the kernel function icmp_echo, which is passed a struct sk_buff *skb: this is the packet containing the "echo request". We can trace it, pull out the sequence echo.sequence (which corresponds to the icmp_seq shown by hping3 above), and send it to user space. It is also convenient to capture the current process name/id. Below are the results we see directly while the kernel processes packets:
TGID    PID     PROCESS NAME    ICMP_SEQ
0       0       swapper/11      770
0       0       swapper/11      771
0       0       swapper/11      772
0       0       swapper/11      773
0       0       swapper/11      774
20041   20086   ...
Note that in softirq context, the process that made the system call shows up as the "process", when in fact it is the kernel safely processing packets in kernel context.
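The step of pulling the sequence number out of an echo request can be illustrated in user space. This is only a sketch of the ICMP header layout (type, code, checksum, identifier, sequence, as defined in RFC 792) and not the kernel hook itself; the helper names and the identifier value are hypothetical.

```python
import struct

ICMP_ECHO_REQUEST = 8  # ICMP type for "echo request" (RFC 792)


def build_echo_request(identifier, sequence, payload=b""):
    """Build an ICMP echo request with a zero checksum (enough for parsing)."""
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, identifier, sequence)
    return header + payload


def echo_sequence(icmp_packet):
    """Return the echo sequence number, or None if this is not an echo request."""
    icmp_type, _code, _checksum, _identifier, sequence = struct.unpack(
        "!BBHHH", icmp_packet[:8]
    )
    if icmp_type != ICMP_ECHO_REQUEST:
        return None
    return sequence


# Hypothetical identifier; the sequence matches one of the slow pings above.
packet = build_echo_request(identifier=1, sequence=1953)
print(echo_sequence(packet))  # 1953
```

A kernel-side hook would read the same field from the packet buffer and send it to user space together with the current process name and id.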
With this tool we can correlate specific processes with the specific packets that show a delay in hping3. Let's run a simple grep on this capture for certain icmp_seq values. Packets matching the icmp_seq values above are annotated with the RTT we observed above (in parentheses are the expected RTT values for packets we filtered out because their RTT was below 50ms):
TGID    PID     PROCESS NAME    ICMP_SEQ ** RTT
--
10137   10436   cadvisor        1951
10137   10436   cadvisor        1952
76      76      ksoftirqd/11    1953 ** 99ms
76      76      ksoftirqd/11    1954 ** 89ms
76      76      ksoftirqd/11    1955 ** 79ms
76      76      ksoftirqd/11    1956 ** 69ms
76      76      ksoftirqd/11    1957 ** 59ms
76      76      ksoftirqd/11    1958 ** (49ms)
76      76      ksoftirqd/11    1959 ** (39ms)
76      76      ksoftirqd/11    1960 ** (29ms)
76      76      ksoftirqd/11    1961 ** (19ms)
76      76      ksoftirqd/11    1962 ** (9ms)
--
10137   10436   cadvisor        2068
10137   10436   cadvisor        2069
76      76      ksoftirqd/11    2070 ** 75ms
76      76      ksoftirqd/11    2071 ** 65ms
76      76      ksoftirqd/11    2072 ** 55ms
76      76      ksoftirqd/11    2073 ** (45ms)
76      76      ksoftirqd/11    2074 ** (35ms)
76      76      ksoftirqd/11    2075 ** (25ms)
The results tell us several things. First, all of these packets are processed in the ksoftirqd/11 context. This means that for this particular pair of machines, ICMP packets were hashed to core 11 on the receiving end. We also see that whenever there is a stall, there are packets processed in the context of a cadvisor system call. Then ksoftirqd takes over and works through the accumulated queue: exactly the number of packets that piled up after cadvisor.
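The "who ran right before ksoftirqd" reading of the capture can be automated. The following is a minimal sketch, assuming the trace is available as (process name, icmp_seq) pairs; `blamed_processes` is a hypothetical helper, and the sample rows are taken from the capture above.

```python
def blamed_processes(trace):
    """For each run of ksoftirqd rows, return (process seen just before, run length).

    The run length is the number of packets ksoftirqd drained in one go.
    """
    runs = []
    previous = None
    for name, _seq in trace:
        if name.startswith("ksoftirqd"):
            if previous is not None and not previous.startswith("ksoftirqd"):
                # A new backlog starts: remember who held the core before it.
                runs.append([previous, 0])
            if runs:
                runs[-1][1] += 1
        previous = name
    return [(name, count) for name, count in runs]


trace = (
    [("cadvisor", 1951), ("cadvisor", 1952)]
    + [("ksoftirqd/11", seq) for seq in range(1953, 1963)]
    + [("cadvisor", 2068), ("cadvisor", 2069)]
    + [("ksoftirqd/11", seq) for seq in range(2070, 2076)]
)

print(blamed_processes(trace))  # [('cadvisor', 10), ('cadvisor', 6)]
```

With 10ms between probes, a backlog of ten packets corresponds to a stall of roughly 100ms, which agrees with the largest RTT in the first group.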
The fact that cadvisor is always running immediately beforehand suggests that it is involved in the problem. Ironically, the purpose of cadvisor is to report on the resource usage of containers, not to cause performance problems.
As with other aspects of containers, these are all highly advanced tools, and they can be expected to run into performance problems under some unforeseen circumstances.
What is cadvisor doing to stall packet processing?
We now have a fairly good understanding of how the stall happens, which process causes it, and on which CPU. We see that because of a hard block, the Linux kernel does not get a chance to schedule ksoftirqd. And we see packets being processed in the cadvisor context. It is reasonable to assume that cadvisor starts a slow syscall, after which all the packets accumulated in the meantime are processed.
This is a theory, but how do we test it? What we can do is trace the CPU core throughout this process, find the point where the number of packets goes over budget and ksoftirqd is called, and then look a little earlier to see what exactly was running on the CPU core just before that point. It is like taking an x-ray of the CPU every few milliseconds. It would look something like this:
Conveniently, all of this can be done with existing tools. For example, we can sample the core with perf and keep the stack traces leading up to ksoftirqd:
# record 999 times a second, or every 1ms with some offset so not to align exactly with timers
sudo perf record -C 11 -g -F 999
# take that recording and make a simpler stack trace.
sudo perf script 2>/dev/null | ./FlameGraph/stackcollapse-perf-ordered.pl | grep ksoftir -B 100
Here are the results:
(hundreds of traces that look similar)
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages 
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll;ixgbe_clean_rx_irq;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;bond_handle_frame;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;ipip_tunnel_xmit;ip_tunnel_xmit;iptunnel_xmit;ip_local_out;dst_output;__ip_local_out;nf_hook_slow;nf_iterate;nf_conntrack_in;generic_packet;ipt_do_table;set_match_v4;ip_set_test;hash_net4_kadt;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;hash_net4_test ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;gro_cell_poll;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;dev_queue_xmit_nit;packet_rcv;tpacket_rcv;sch_direct_xmit;validate_xmit_skb_list;validate_xmit_skb;netif_skb_features;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;__dev_queue_xmit;dev_hard_start_xmit;__bpf_prog_run;__bpf_prog_run
There is a lot here, but the main thing is that we find the "cadvisor before ksoftirqd" pattern that we saw earlier in the ICMP tracer. What does it mean?
Each trace is a snapshot of the CPU at a point in time. Each call down the stack is separated by a semicolon. In the middle of the traces we see the syscall being made, read(): .... ;do_syscall_64;sys_read; ... So cadvisor spends a lot of time in the read() system call, in the mem_cgroup_* functions (top of the call stack, end of the line).
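Reading collapsed stacks by eye gets tedious with hundreds of traces. Here is a small sketch of how one such line can be split into the pieces discussed above; `summarize_stack` is a hypothetical helper, and the sample lines are shortened versions of the traces shown earlier.

```python
def summarize_stack(line):
    """Split one collapsed stack into (process, syscall or None, leaf function)."""
    frames = line.strip().split(";")
    process = frames[0]
    # The syscall entry point shows up as a frame named sys_<name>.
    syscall = next((frame for frame in frames if frame.startswith("sys_")), None)
    leaf = frames[-1]
    return process, syscall, leaf


# Shortened for readability; the real traces have many more frames.
cadvisor_line = (
    "cadvisor;[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;"
    "vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;"
    "mem_cgroup_node_nr_lru_pages"
)
ksoftirqd_line = "ksoftirqd/11;ret_from_fork;kthread;run_ksoftirqd;__do_softirq;net_rx_action"

print(summarize_stack(cadvisor_line))
print(summarize_stack(ksoftirqd_line))
```

Counting the (process, syscall, leaf) triples over a whole recording would show which code paths dominate the samples just before ksoftirqd is scheduled.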
A stack trace does not make it easy to see what exactly is being read, so let's run strace, see what cadvisor is doing, and look for system calls that take longer than 100ms:
theojulienne@kube-node-bad ~ $ sudo strace -p 10137 -T -ff 2>&1 | egrep '<0\.[1-9]'
[pid 10436] <... futex resumed> ) = 0 <0.156784>
[pid 10432] <... futex resumed> ) = 0 <0.258285>
[pid 10137] <... futex resumed> ) = 0 <0.678382>
[pid 10384] <... futex resumed> ) = 0 <0.762328>
[pid 10436] <... read resumed> "cache 154234880nrss 507904nrss_h"..., 4096) = 658 <0.179438>
[pid 10384] <... futex resumed> ) = 0 <0.104614>
[pid 10436] <... futex resumed> ) = 0 <0.175936>
[pid 10436] <... read resumed> "cache 0nrss 0nrss_huge 0nmapped_"..., 4096) = 577 <0.228091>
[pid 10427] <... read resumed> "cache 0nrss 0nrss_huge 0nmapped_"..., 4096) = 577 <0.207334>
[pid 10411] <... epoll_ctl resumed> ) = 0 <0.118113>
[pid 10382] <... pselect6 resumed> ) = 0 (Timeout) <0.117717>
[pid 10436] <... read resumed> "cache 154234880nrss 507904nrss_h"..., 4096) = 660 <0.159891>
[pid 10417] <... futex resumed> ) = 0 <0.917495>
[pid 10436] <... futex resumed> ) = 0 <0.208172>
[pid 10417] <... futex resumed> ) = 0 <0.190763>
[pid 10417] <... read resumed> "cache 0nrss 0nrss_huge 0nmapped_"..., 4096) = 576 <0.154442>
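The egrep filter above can also be expressed as a small script, which makes it easier to group the slow calls by syscall. This is a sketch that assumes the `strace -T` line format shown above; `slow_calls` is a hypothetical helper, and the last sample line is invented to show a fast call being filtered out.

```python
import re
from collections import Counter

# Matches "<... NAME resumed> ... <SECONDS>" as printed by strace -T.
RESUMED_RE = re.compile(r"<\.\.\. (\w+) resumed>.*<(\d+\.\d+)>\s*$")


def slow_calls(lines, threshold_s=0.1):
    """Return (syscall, seconds) for resumed calls that took at least threshold_s."""
    result = []
    for line in lines:
        match = RESUMED_RE.search(line)
        if match and float(match.group(2)) >= threshold_s:
            result.append((match.group(1), float(match.group(2))))
    return result


sample = [
    '[pid 10436] <... futex resumed> ) = 0 <0.156784>',
    '[pid 10436] <... read resumed> "cache 154234880nrss 507904nrss_h"..., 4096) = 658 <0.179438>',
    '[pid 10436] <... read resumed> "cache 0nrss 0nrss_huge 0nmapped_"..., 4096) = 577 <0.228091>',
    '[pid 10382] <... pselect6 resumed> ) = 0 (Timeout) <0.117717>',
    '[pid 10436] <... read resumed> "", 4096) = 0 <0.000031>',  # hypothetical fast call
]

slow = slow_calls(sample)
print(Counter(name for name, _ in slow))
```

Calls such as futex and pselect6 are expected to block while a thread waits, so the slow read() entries are the ones worth a closer look.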
As you might expect, we see slow read() calls here. From the contents of the read operations and the mem_cgroup context, it is clear that these read() calls refer to the memory.stat file, which shows memory usage and cgroup limits (cgroups are Docker's resource isolation technology). The cadvisor tool queries this file to obtain resource usage information for containers. Let's check whether it is the kernel or cadvisor that is doing something unexpected:
theojulienne@kube-node-bad ~ $ time cat /sys/fs/cgroup/memory/memory.stat >/dev/null
real 0m0.153s
user 0m0.000s
sys 0m0.152s
theojulienne@kube-node-bad ~ $
Now we can reproduce the bug and see that the Linux kernel is hitting a pathological case.
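The same measurement as `time cat` can be taken from a script, which is handy for checking many nodes. The sketch below times a full read of a file and reports wall-clock time and system CPU time separately; `time_read` is a hypothetical helper, and it is demonstrated on a scratch file so that it runs anywhere. On an affected node the path would be the memory.stat file shown above.

```python
import os
import tempfile
import time


def time_read(path):
    """Read a file fully; return (bytes read, wall seconds, system CPU seconds)."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    with open(path, "rb") as handle:
        data = handle.read()
    wall = time.perf_counter() - wall_before
    cpu_after = os.times()
    return len(data), wall, cpu_after.system - cpu_before.system


# Scratch file standing in for /sys/fs/cgroup/memory/memory.stat.
with tempfile.NamedTemporaryFile("w", suffix=".stat", delete=False) as scratch:
    scratch.write("cache 0\nrss 0\n")
    scratch_path = scratch.name

size, wall_s, sys_s = time_read(scratch_path)
os.unlink(scratch_path)
print(f"read {size} bytes in {wall_s:.6f}s wall, {sys_s:.6f}s sys")
```

In the output above almost all of the 0.153s is sys time, meaning the time is spent inside the kernel while it builds the file contents, not in cadvisor or cat.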
Why is the read so slow?
At this stage it is much easier to find reports from other users about similar problems. As it turned out, this bug had already been reported in the cadvisor tracker.
The problem is that memory cgroups account for memory usage inside a namespace (container). When all processes in this cgroup exit, Docker releases the memory cgroup. However, "memory" is not just process memory. Although the process memory itself is no longer in use, it turns out that the kernel also charges cached content, such as dentries and inodes (directory and file metadata), to the memory cgroup. From the issue description:
zombie cgroups: cgroups that have no processes and have been deleted, but still have memory charged to them (in my case, from the dentry cache, but it can also be charged from the page cache or tmpfs).
Having the kernel check every page in the cache when a cgroup is freed could be very slow, so the lazy path is chosen: wait until these pages are reclaimed, and only then finally clean up the cgroup, when the memory is really needed. Until that happens, the cgroup is still counted when statistics are collected.
From a performance standpoint, they traded memory for performance: the initial cleanup is faster because some cached memory is left behind. This is fine. When the kernel reclaims the last of the cached memory, the cgroup is eventually cleaned up, so it cannot be called a "leak". Unfortunately, the specific implementation of the memory.stat lookup in this kernel version (4.9), combined with the huge amount of memory on our servers, means that it takes much longer for the last cached data to be reclaimed and for the zombie cgroups to be cleared.
It turned out that some of our nodes had so many zombie cgroups that the read, and the resulting latency, exceeded a second.
The workaround from the cadvisor issue is to immediately free the dentries/inodes caches system-wide, which immediately eliminates the read latency as well as the network latency on the host, because dropping the cache includes the cached zombie cgroup pages and frees them too. This is not a solution, but it confirms the cause of the problem.
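On Linux, reclaimable dentries and inodes can be freed by writing the value 2 to /proc/sys/vm/drop_caches. The sketch below assumes that this is the control the workaround refers to; the function name is ours. The write goes to a scratch file here so that the example runs without root and has no side effects; on a real node the default path would be used, as root.

```python
import os
import tempfile

DROP_CACHES_PATH = "/proc/sys/vm/drop_caches"


def drop_dentries_and_inodes(path=DROP_CACHES_PATH):
    """Ask the kernel to free reclaimable dentries and inodes (value 2)."""
    os.sync()  # flush dirty data first so that more of the cache is reclaimable
    with open(path, "w") as control:
        control.write("2\n")


# Dry run against a scratch file instead of the real control file.
with tempfile.TemporaryDirectory() as scratch_dir:
    fake_control = os.path.join(scratch_dir, "drop_caches")
    drop_dentries_and_inodes(fake_control)
    with open(fake_control) as handle:
        written = handle.read()

print(repr(written))  # '2\n'
```

Dropping caches is non-destructive but makes subsequent file access slower until the caches warm up again, which is one more reason to treat it as a diagnostic step and not as a fix.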
It turned out that newer kernel versions (4.19+) improved the performance of the memory.stat call, so moving to that kernel fixed the problem. In the meantime, we had tooling to detect problematic nodes in Kubernetes clusters, gracefully drain them, and reboot them. We combed through all the clusters, found the nodes with high enough latency, and rebooted them. This gave us time to update the OS on the remaining servers.
Summing up
Because this bug stalled RX NIC queue processing for hundreds of milliseconds, it caused both high latency on short connections and latency in the middle of connections, such as between MySQL queries and their response packets.
Understanding and maintaining the performance of the most fundamental systems, such as Kubernetes, is critical to the reliability and speed of every service built on top of them. Every system you run benefits from Kubernetes performance improvements.
Source: www.habr.com