Debugging network stalls on Kubernetes


A few years ago Kubernetes was discussed on the official GitHub blog. Since then it has become the standard technology for deploying services. Kubernetes now manages a significant portion of internal and public services. As our clusters grew and performance requirements became stricter, we began to notice that some services on Kubernetes were sporadically experiencing latency that could not be explained by the load of the application itself.

In essence, applications were seeing seemingly random network latency of 100 ms or more, resulting in timeouts or retries. Services were expected to respond to requests much faster than 100 ms, which is impossible if the connection itself takes that long. Separately, we observed very fast MySQL queries that should take milliseconds, and MySQL did complete them in milliseconds, but from the point of view of the requesting application the response took 100 ms or more.

It quickly became clear that the problem only occurred when connecting to a Kubernetes node, even if the call came from outside Kubernetes. The simplest way to reproduce the problem is a Vegeta test that runs from any internal host, tests a Kubernetes service on a specific port, and sporadically registers high latency. In this article, we look at how we tracked down the cause of this problem.

Removing unnecessary complexity in the chain leading to failure

By reproducing the same example, we wanted to narrow the focus of the problem and remove unnecessary layers of complexity. Initially, there were too many elements in the flow between Vegeta and the Kubernetes pods. To identify a deeper network problem, you need to rule some of them out.


The client (Vegeta) creates a TCP connection to any node in the cluster. Kubernetes operates as an overlay network (on top of the existing data center network) that uses IPIP, that is, it encapsulates the IP packets of the overlay network inside the IP packets of the data center. When connecting to the first node, stateful Network Address Translation (NAT) is performed to translate the IP address and port of the Kubernetes node into the IP address and port in the overlay network (specifically, the pod with the application). For return packets, the reverse sequence of actions is performed. It is a complex system with a lot of state and many elements that are constantly updated and changed as services are deployed and moved.

Running tcpdump during the Vegeta test shows a delay during the TCP handshake (between SYN and SYN-ACK). To remove this unnecessary complexity, you can use hping3 for simple "pings" with SYN packets. We check whether there is a delay in the response packet, and then reset the connection. We can filter the data to include only packets slower than 100 ms and get a simpler way to reproduce the problem than the full layer 7 network test in Vegeta. Here are Kubernetes node "pings" using TCP SYN/SYN-ACK on the service "node port" (30927) at 10 ms intervals, filtered by slowest responses:

theojulienne@shell ~ $ sudo hping3 172.16.47.27 -S -p 30927 -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms

The first observation can be made right away. Judging by the sequence numbers and timings, it is clear that these are not one-off congestion events. The delay often accumulates and is eventually processed.
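The accumulation is easier to see when slow replies are grouped by consecutive sequence numbers. The following is a minimal sketch, not part of the original investigation: it assumes hping3 reply lines in the format shown above, and the helper names `slow_replies` and `group_bursts` are made up for illustration.

```python
import re

# Matches both TCP ("seq=") and ICMP ("icmp_seq=") hping3 reply lines.
REPLY_RE = re.compile(r"(?:icmp_)?seq=(\d+).*?rtt=(\d+(?:\.\d+)?) ms")


def slow_replies(lines, threshold_ms=100.0):
    """Return (seq, rtt_ms) pairs for replies at or above the threshold."""
    found = []
    for line in lines:
        match = REPLY_RE.search(line)
        if match and float(match.group(2)) >= threshold_ms:
            found.append((int(match.group(1)), float(match.group(2))))
    return found


def group_bursts(replies):
    """Group replies whose sequence numbers are consecutive into bursts."""
    bursts = []
    for seq, rtt in replies:
        if bursts and seq == bursts[-1][-1][0] + 1:
            bursts[-1].append((seq, rtt))
        else:
            bursts.append([(seq, rtt)])
    return bursts


# Reply lines copied from the hping3 output above.
SAMPLE = [
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms",
    "len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms",
]

bursts = group_bursts(slow_replies(SAMPLE))
for burst in bursts:
    print([seq for seq, _ in burst], [rtt for _, rtt in burst])
```

Within a burst the RTT drops from one probe to the next, which is what you would expect if several probes were waiting for the same event.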

Next, we want to find out which components may be involved in the congestion. Maybe it is some of the hundreds of iptables rules in NAT? Or is there a problem with IPIP tunneling on the network? One way to check this is to test each step of the system by eliminating it. What happens if we remove the NAT and firewall logic, leaving only the IPIP part?


Fortunately, Linux makes it easy to access the IP overlay layer directly if the machine is on the same network:

theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 ms

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 ms

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms

Judging by the results, the problem still remains! This rules out iptables and NAT. So is the problem TCP? Let's see how a regular ICMP ping goes:

theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'

len=28 ip=10.125.20.64 ttl=64 id=42594 icmp_seq=104 rtt=110.0 ms

len=28 ip=10.125.20.64 ttl=64 id=49448 icmp_seq=4022 rtt=141.3 ms

len=28 ip=10.125.20.64 ttl=64 id=49449 icmp_seq=4023 rtt=131.3 ms

len=28 ip=10.125.20.64 ttl=64 id=49450 icmp_seq=4024 rtt=121.2 ms

len=28 ip=10.125.20.64 ttl=64 id=49451 icmp_seq=4025 rtt=111.2 ms

len=28 ip=10.125.20.64 ttl=64 id=49452 icmp_seq=4026 rtt=101.1 ms

len=28 ip=10.125.20.64 ttl=64 id=50023 icmp_seq=4343 rtt=126.8 ms

len=28 ip=10.125.20.64 ttl=64 id=50024 icmp_seq=4344 rtt=116.8 ms

len=28 ip=10.125.20.64 ttl=64 id=50025 icmp_seq=4345 rtt=106.8 ms

len=28 ip=10.125.20.64 ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms

The results show that the problem has not gone away. Perhaps it is the IPIP tunnel? Let's simplify the test further:


Could it be every packet sent between these two hosts?

theojulienne@kube-node-client ~ $ sudo hping3 172.16.47.27 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'

len=46 ip=172.16.47.27 ttl=61 id=41127 icmp_seq=12564 rtt=140.9 ms

len=46 ip=172.16.47.27 ttl=61 id=41128 icmp_seq=12565 rtt=130.9 ms

len=46 ip=172.16.47.27 ttl=61 id=41129 icmp_seq=12566 rtt=120.8 ms

len=46 ip=172.16.47.27 ttl=61 id=41130 icmp_seq=12567 rtt=110.8 ms

len=46 ip=172.16.47.27 ttl=61 id=41131 icmp_seq=12568 rtt=100.7 ms

len=46 ip=172.16.47.27 ttl=61 id=9062 icmp_seq=31443 rtt=134.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9063 icmp_seq=31444 rtt=124.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9064 icmp_seq=31445 rtt=114.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms

We have simplified the situation to two Kubernetes nodes sending each other any packet, even an ICMP ping. They still see latency if the target host is "bad" (some are worse than others).

Now the last question: why does the delay only occur on kube-node servers? And does it happen when the kube-node is the sender or the receiver? Luckily, this is also easy to figure out by sending a packet from a host outside Kubernetes, but with the same "known bad" recipient. As you can see, the problem has not disappeared:

theojulienne@shell ~ $ sudo hping3 172.16.47.27 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms

We then run the same requests from the previous source kube-node to the external host (which rules out the source host, since a ping includes both an RX and a TX component):

theojulienne@kube-node-client ~ $ sudo hping3 172.16.33.44 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
^C
--- 172.16.33.44 hping statistic ---
22352 packets transmitted, 22350 packets received, 1% packet loss
round-trip min/avg/max = 0.2/7.6/1010.6 ms

By examining packet captures of the latency, we obtained some additional information. Specifically, the sender (bottom) sees this timeout, but the receiver (top) does not; see the Delta column (in seconds):

[Figure: packet captures on the sender (bottom) and the receiver (top), with a Delta column in seconds]

In addition, if you look at the difference in the ordering of TCP and ICMP packets (by sequence numbers) on the receiver side, ICMP packets always arrive in the same sequence in which they were sent, but with different timing. At the same time, TCP packets sometimes interleave, and some of them get stuck. In particular, if you examine the ports of the SYN packets, they are in order on the sender side, but not on the receiver side.

There is a subtle difference in how the network cards of modern servers (like those in our data center) process packets containing TCP or ICMP. When a packet arrives, the network adapter "hashes it per connection", that is, it tries to split connections into queues and send each queue to a separate processor core. For TCP, this hash includes both the source and destination IP address and port. In other words, each connection is hashed (potentially) differently. For ICMP, only the IP addresses are hashed, since there are no ports.
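To make the difference concrete, here is a toy model of per-connection hashing. It is only a sketch under stated assumptions: real NICs typically use a different hash function (for example a Toeplitz-based one), and `NUM_RX_QUEUES`, `rx_queue` and the CRC32 stand-in are hypothetical choices made for illustration.

```python
import zlib

NUM_RX_QUEUES = 16  # hypothetical queue count, chosen for illustration


def rx_queue(src_ip, dst_ip, src_port=None, dst_port=None):
    """Pick an RX queue from the flow fields, as a stand-in for the NIC hash."""
    fields = [src_ip, dst_ip]
    if src_port is not None and dst_port is not None:
        # TCP: ports are part of the key, so each connection may land elsewhere.
        fields += [str(src_port), str(dst_port)]
    key = "|".join(fields).encode()
    return zlib.crc32(key) % NUM_RX_QUEUES


# ICMP between two hosts: only the addresses are hashed, so one queue is used.
icmp_queues = {rx_queue("172.16.33.44", "172.16.47.27") for _ in range(100)}

# TCP SYNs from 100 different source ports: spread over several queues.
tcp_queues = {
    rx_queue("172.16.33.44", "172.16.47.27", port, 9876)
    for port in range(40000, 40100)
}

print("ICMP queues used:", len(icmp_queues))
print("TCP queues used:", len(tcp_queues))
```

In this model every ICMP probe between the same pair of hosts depends on a single queue and core, while TCP probes only hit a stalled core some of the time.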

Another new observation: during this period we see ICMP delays on all communication between the two hosts, but TCP does not show them. This tells us that the cause is likely related to RX queue hashing: the congestion is almost certainly in the processing of RX packets, not in the sending of responses.

This eliminates the sending of packets from the list of possible causes. We now know that the packet processing problem is on the receive side on some kube-node servers.

Understanding packet processing in the Linux kernel

To understand why the problem occurs at the receiver on some kube-node servers, let's look at how the Linux kernel processes packets.

Going back to the simplest traditional implementation, the network card receives a packet and sends an interrupt to the Linux kernel saying that there is a packet to be processed. The kernel stops other work, switches context to the interrupt handler, processes the packet, and then returns to the current tasks.


This context switching is slow: the latency may not have been noticeable on 10 Mbps network cards in the 90s, but on modern 10G cards with a maximum throughput of 15 million packets per second, each core of a small eight-core server can be interrupted millions of times per second.
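The figure of roughly 15 million packets per second can be checked with a short calculation. This is a back-of-the-envelope sketch that assumes minimum-size Ethernet frames, an even spread across cores, and one interrupt per packet; none of these assumptions come from the original measurements.

```python
LINK_BITS_PER_SECOND = 10 * 10**9  # 10 Gbit/s link
MIN_FRAME_BYTES = 64               # smallest Ethernet frame
WIRE_OVERHEAD_BYTES = 20           # preamble (8 bytes) + inter-frame gap (12 bytes)
CORES = 8                          # the "small eight-core server" from the text

bits_per_packet = (MIN_FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8
packets_per_second = LINK_BITS_PER_SECOND / bits_per_packet
per_core = packets_per_second / CORES

print(f"{packets_per_second / 1e6:.2f} million packets/s in total")
print(f"{per_core / 1e6:.2f} million packets/s per core")
```

With one interrupt per packet, that is close to two million interrupts per second on every core.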

To avoid constantly handling interrupts, many years ago Linux added NAPI: the networking API that all modern drivers use to improve performance at high speeds. At low rates the kernel still receives interrupts from the network card in the old way. Once enough packets arrive to exceed a threshold, the kernel disables interrupts and instead starts polling the network adapter and picking up packets in batches. Processing is performed in softirq, that is, in the context of software interrupts after system calls and hardware interrupts, when the kernel (as opposed to user space) is already running.


This is much faster, but it causes a different problem. If there are too many packets, then all the time is spent processing packets from the network card, and user-space processes do not have time to actually empty these queues (reading from TCP connections, and so on). Eventually the queues fill up and we start dropping packets. In an attempt to find a balance, the kernel sets a budget for the maximum number of packets processed in the softirq context. Once this budget is exceeded, a separate thread, ksoftirqd, is woken up (you will see one of them per core in ps) that handles these softirqs outside of the normal syscall/interrupt path. This thread is scheduled by the standard process scheduler, which tries to allocate resources fairly.
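The budget mechanism can be pictured with a deliberately simplified model. This sketch is an illustration only: the function name `softirq_poll` and the numbers are hypothetical, and the real kernel logic has additional conditions that are not modelled here.

```python
from collections import deque


def softirq_poll(pending, budget):
    """Process up to `budget` packets inline; report whether work was left over."""
    processed = []
    while pending and len(processed) < budget:
        processed.append(pending.popleft())
    # In this model, any leftover work is what wakes the ksoftirqd thread.
    deferred_to_ksoftirqd = bool(pending)
    return processed, deferred_to_ksoftirqd


queue = deque(range(10))  # ten packets waiting in the RX queue
inline, deferred = softirq_poll(queue, budget=4)
print("processed inline:", inline)
print("deferred to ksoftirqd:", deferred, "remaining:", list(queue))
```

The point of the model is that leftover packets are no longer handled immediately. They wait until the scheduler gives ksoftirqd a turn on that core.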


Having studied how the kernel processes packets, you can see that there is a certain likelihood of congestion. If softirq calls run less frequently, packets will have to wait for some time in the RX queue on the network card before being processed. This may be due to some task blocking the processor core, or to something else preventing the core from running softirq.
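A small simulation shows why such a stall would produce the staircase of RTT values seen in the hping3 output, where each successive slow probe is about 10 ms faster than the previous one. It is a sketch under simplifying assumptions: the stall window is hypothetical, and normal (unstalled) processing is treated as instantaneous.

```python
def observed_delays(send_times_ms, stall_start_ms, stall_end_ms):
    """Delay each packet sees if the core handles nothing during the stall."""
    delays = []
    for sent in send_times_ms:
        if stall_start_ms <= sent < stall_end_ms:
            handled = stall_end_ms  # waits in the RX queue until the core is free
        else:
            handled = sent          # handled immediately in this model
        delays.append(handled - sent)
    return delays


sends = list(range(0, 150, 10))  # one probe every 10 ms, as with hping3 -i u10000
delays = observed_delays(sends, stall_start_ms=3, stall_end_ms=103)
print(delays)
```

All probes that arrive during the stall are released at the same moment, so the earliest one waited longest and each later one waited 10 ms less.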

Narrowing the processing down to a core or method

Softirq delays are only a guess for now. But the guess makes sense, and we know that we are seeing something very similar. So the next step is to confirm this theory. And if it is confirmed, to find the reason for the delays.

Let's return to our slow packets:

len=46 ip=172.16.53.32 ttl=61 id=29573 icmp_seq=1953 rtt=99.3 ms

len=46 ip=172.16.53.32 ttl=61 id=29574 icmp_seq=1954 rtt=89.3 ms

len=46 ip=172.16.53.32 ttl=61 id=29575 icmp_seq=1955 rtt=79.2 ms

len=46 ip=172.16.53.32 ttl=61 id=29576 icmp_seq=1956 rtt=69.1 ms

len=46 ip=172.16.53.32 ttl=61 id=29577 icmp_seq=1957 rtt=59.1 ms

len=46 ip=172.16.53.32 ttl=61 id=29790 icmp_seq=2070 rtt=75.7 ms

len=46 ip=172.16.53.32 ttl=61 id=29791 icmp_seq=2071 rtt=65.6 ms

len=46 ip=172.16.53.32 ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms

As discussed earlier, these ICMP packets are hashed into a single RX NIC queue and processed by a single CPU core. If we want to understand how Linux works, it is useful to know where (on which CPU core) and how (softirq, ksoftirqd) these packets are processed, so that we can track the process.

Now it is time to use tools that let you monitor the Linux kernel in real time. Here we used bcc. This toolset lets you write small C programs that hook arbitrary functions in the kernel and buffer the events into a user-space Python program that can process them and return the result to you. Hooking arbitrary functions in the kernel is tricky business, but the tool is designed for maximum safety and is meant to track down exactly the kind of production issues that are not easy to reproduce in a test or development environment.

The plan here is simple: we know that the kernel processes these ICMP pings, so we will put a hook on the kernel function icmp_echo, which accepts an incoming ICMP echo request packet and initiates the sending of an ICMP echo response. We can identify a packet by the increasing icmp_seq number that hping3 shows above.

The code of the bcc script looks complicated, but it is not as scary as it seems. The icmp_echo function is passed struct sk_buff *skb: this is the packet containing the "echo request". We can inspect it, pull out the sequence echo.sequence (which maps to the icmp_seq shown by hping3 above), and send it to user space. It is also convenient to capture the current process name/id. Below are the results that we see directly while the kernel processes packets:

TGID    PID     PROCESS NAME       ICMP_SEQ
0       0       swapper/11         770
0       0       swapper/11         771
0       0       swapper/11         772
0       0       swapper/11         773
0       0       swapper/11         774
20041   20086   spokes-report-s

It should be noted here that in the softirq context, the processes that made system calls will appear as "processes", when in fact it is the kernel that is safely processing packets in the kernel context.

With this tool we can associate specific processes with the specific packets that show a delay in hping3. Let's run a simple grep on this capture for certain icmp_seq values. Packets matching the icmp_seq values above are marked along with the RTT we observed above (in parentheses are the expected RTT values for packets we filtered out because their RTT was below 50 ms):

TGID    PID     PROCESS NAME    ICMP_SEQ ** RTT
--
10137   10436   cadvisor        1951
10137   10436   cadvisor        1952
76      76      ksoftirqd/11    1953 ** 99ms
76      76      ksoftirqd/11    1954 ** 89ms
76      76      ksoftirqd/11    1955 ** 79ms
76      76      ksoftirqd/11    1956 ** 69ms
76      76      ksoftirqd/11    1957 ** 59ms
76      76      ksoftirqd/11    1958 ** (49ms)
76      76      ksoftirqd/11    1959 ** (39ms)
76      76      ksoftirqd/11    1960 ** (29ms)
76      76      ksoftirqd/11    1961 ** (19ms)
76      76      ksoftirqd/11    1962 ** (9ms)
--
10137   10436   cadvisor        2068
10137   10436   cadvisor        2069
76      76      ksoftirqd/11    2070 ** 75ms
76      76      ksoftirqd/11    2071 ** 65ms
76      76      ksoftirqd/11    2072 ** 55ms
76      76      ksoftirqd/11    2073 ** (45ms)
76      76      ksoftirqd/11    2074 ** (35ms)
76      76      ksoftirqd/11    2075 ** (25ms)

The results tell us several things. First, all these packets are processed by the ksoftirqd/11 context. This means that for this particular pair of machines, ICMP packets were hashed to core 11 at the receiving end. We also see that whenever there is a stall, there are packets that are processed in the context of the cadvisor system call. Then ksoftirqd takes over the task and works through the accumulated queue: exactly the number of packets that accumulated after cadvisor.
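The "who ran just before ksoftirqd" pattern can also be extracted mechanically from such a trace. The following is a minimal sketch: the rows are a subset of the table above, and the helper name `processes_before_ksoftirqd` is hypothetical.

```python
def processes_before_ksoftirqd(rows):
    """Return the process seen right before each run of ksoftirqd entries."""
    culprits = []
    previous = None
    for process, icmp_seq in rows:
        is_ksoftirqd = process.startswith("ksoftirqd")
        if is_ksoftirqd and previous and not previous.startswith("ksoftirqd"):
            # First ksoftirqd entry of a run: remember who held the core before it.
            culprits.append((previous, icmp_seq))
        previous = process
    return culprits


# (process name, icmp_seq) pairs taken from the trace above.
ROWS = [
    ("cadvisor", 1951), ("cadvisor", 1952),
    ("ksoftirqd/11", 1953), ("ksoftirqd/11", 1954),
    ("cadvisor", 2068), ("cadvisor", 2069),
    ("ksoftirqd/11", 2070), ("ksoftirqd/11", 2071),
]

print(processes_before_ksoftirqd(ROWS))
```

On a longer capture, counting these pairs would show whether one process dominates the moments right before a stall.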

The fact that cadvisor is always running immediately beforehand implies that it is involved in the problem. Ironically, the purpose of cadvisor is to "analyze resource usage and performance characteristics of running containers", not to cause this performance problem.

As with other aspects of containers, these are all highly advanced tools, and they can be expected to run into performance problems under some unforeseen circumstances.

What does cadvisor do that slows down packet processing?

We now have a pretty good understanding of how the stall occurs, which process is causing it, and on which CPU. We see that due to hard blocking, the Linux kernel does not get a chance to schedule ksoftirqd. And we see that packets are processed in the context of cadvisor. It is logical to assume that cadvisor launches a slow syscall, after which all the packets accumulated during that time are processed.


This is a theory, but how do we test it? What we can do is trace the CPU core throughout this process, find the point where the number of packets goes over budget and ksoftirqd is called, and then look a little further back to see what exactly was running on the CPU core just before that point. It is like taking an X-ray of the CPU every few milliseconds.


Conveniently, all this can be done with existing tools. For example, perf record samples a given CPU core at a specified frequency and can generate a call graph of the running system, including both user space and the Linux kernel. You can take this recording and process it with a small fork of the FlameGraph program from Brendan Gregg, which preserves the order of the stack traces. We can save single-line stack traces every 1 ms, and then pick out and save the 100 milliseconds of samples before ksoftirqd shows up in the trace:

# record 999 times a second, or every 1ms with some offset so not to align exactly with timers
sudo perf record -C 11 -g -F 999
# take that recording and make a simpler stack trace.
sudo perf script 2>/dev/null | ./FlameGraph/stackcollapse-perf-ordered.pl | grep ksoftir -B 100

Here are the results:

(hundreds of traces that look similar)

cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages 
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll;ixgbe_clean_rx_irq;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;bond_handle_frame;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;ipip_tunnel_xmit;ip_tunnel_xmit;iptunnel_xmit;ip_local_out;dst_output;__ip_local_out;nf_hook_slow;nf_iterate;nf_conntrack_in;generic_packet;ipt_do_table;set_match_v4;ip_set_test;hash_net4_kadt;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;hash_net4_test ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;gro_cell_poll;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;dev_queue_xmit_nit;packet_rcv;tpacket_rcv;sch_direct_xmit;validate_xmit_skb_list;validate_xmit_skb;netif_skb_features;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;__dev_queue_xmit;dev_hard_start_xmit;__bpf_prog_run;__bpf_prog_run

There is a lot here, but the main thing is that we find the "cadvisor before ksoftirqd" pattern that we saw earlier in the ICMP tracer. What does it mean?

Each line is a CPU trace at a specific point in time. Each call down the stack on a line is separated by a semicolon. In the middle of the lines we see the syscall being called: read(): .... ;do_syscall_64;sys_read; .... So cadvisor spends a lot of time in the read() system call, in the mem_cgroup_* functions (top of the call stack, end of the line).
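Reading such a line programmatically is straightforward. The sketch below is an illustration only: the sample line is a shortened version of a trace above (most of the repeated [cadvisor] frames are dropped), and the helper names are hypothetical. The `sys_` prefix matches the frame names in this trace; other kernel versions may name syscall entry points differently.

```python
def parse_collapsed(line):
    """Split one collapsed stack line into process, frames, and leaf frame."""
    frames = line.strip().split(";")
    return {"process": frames[0], "frames": frames[1:], "leaf": frames[-1]}


def syscall_of(trace):
    """Return the first sys_* frame, or None if the sample was not in a syscall."""
    for frame in trace["frames"]:
        if frame.startswith("sys_"):
            return frame
    return None


# Shortened sample based on the cadvisor traces above.
LINE = (
    "cadvisor;[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;"
    "sys_read;vfs_read;seq_read;memcg_stat_show;"
    "mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages"
)

trace = parse_collapsed(LINE)
print(trace["process"], syscall_of(trace), trace["leaf"])
```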

A call trace does not make it easy to see what exactly is being read, so let's run strace, see what cadvisor is doing, and find system calls that take longer than 100 ms:

theojulienne@kube-node-bad ~ $ sudo strace -p 10137 -T -ff 2>&1 | egrep '<0.[1-9]'
[pid 10436] <... futex resumed> ) = 0 <0.156784>
[pid 10432] <... futex resumed> ) = 0 <0.258285>
[pid 10137] <... futex resumed> ) = 0 <0.678382>
[pid 10384] <... futex resumed> ) = 0 <0.762328>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 658 <0.179438>
[pid 10384] <... futex resumed> ) = 0 <0.104614>
[pid 10436] <... futex resumed> ) = 0 <0.175936>
[pid 10436] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.228091>
[pid 10427] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.207334>
[pid 10411] <... epoll_ctl resumed> ) = 0 <0.118113>
[pid 10382] <... pselect6 resumed> ) = 0 (Timeout) <0.117717>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 660 <0.159891>
[pid 10417] <... futex resumed> ) = 0 <0.917495>
[pid 10436] <... futex resumed> ) = 0 <0.208172>
[pid 10417] <... futex resumed> ) = 0 <0.190763>
[pid 10417] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 576 <0.154442>
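The egrep filter keeps every slow call, including futex waits that are expected to block. A slightly more targeted filter can be sketched as follows, assuming the "resumed" line format shown in the output above; the helper name `slow_calls` is hypothetical.

```python
import re

# Matches "[pid N] <... NAME resumed> ... <SECONDS>" lines from strace -ff -T.
LINE_RE = re.compile(r"\[pid\s+(\d+)\] <\.\.\. (\w+) resumed>.*<(\d+\.\d+)>\s*$")


def slow_calls(lines, syscall="read", threshold_s=0.1):
    """Return (pid, syscall, seconds) for matching calls at or above the threshold."""
    result = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        pid, name, seconds = int(match.group(1)), match.group(2), float(match.group(3))
        if name == syscall and seconds >= threshold_s:
            result.append((pid, name, seconds))
    return result


# Three lines in the same format as the strace output above.
SAMPLE = [
    "[pid 10436] <... futex resumed> ) = 0 <0.156784>",
    '[pid 10436] <... read resumed> "cache 0\\nrss 0\\nrss_huge 0\\nmapped_"..., 4096) = 577 <0.228091>',
    "[pid 10382] <... pselect6 resumed> ) = 0 (Timeout) <0.117717>",
]

print(slow_calls(SAMPLE))
```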

As you might expect, we see slow read() calls here. From the contents of the read operations and the mem_cgroup context, it is clear that these read() calls refer to the memory.stat file, which shows memory usage and cgroup limits (cgroups are the resource isolation technology used by Docker). The cadvisor tool polls this file to obtain resource usage information for containers. Let's check whether it is the kernel or cadvisor that is doing something unexpected:

theojulienne@kube-node-bad ~ $ time cat /sys/fs/cgroup/memory/memory.stat >/dev/null

real 0m0.153s
user 0m0.000s
sys 0m0.152s
theojulienne@kube-node-bad ~ $

Now we can reproduce the bug and see that the Linux kernel is facing a pathology.
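The same measurement can be scripted, for example to compare nodes. This is a minimal sketch, not the tooling used in the investigation: `time_read` is a hypothetical helper, and it is demonstrated on a temporary file so that it runs anywhere. On an affected node the path would be /sys/fs/cgroup/memory/memory.stat, as in the command above.

```python
import os
import tempfile
import time


def time_read(path):
    """Return (seconds, byte_count) for one full read of `path`."""
    start = time.perf_counter()
    with open(path, "rb") as handle:
        data = handle.read()
    return time.perf_counter() - start, len(data)


# Stand-in file with memory.stat-like content; replace the path on a real node.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"cache 0\nrss 0\n")

elapsed, size = time_read(tmp.name)
os.unlink(tmp.name)
print(f"read {size} bytes in {elapsed * 1000:.3f} ms")
```

A read of the real file that takes more than a few milliseconds, almost all of it in kernel time, points at the kernel rather than at cadvisor.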

Why is the read so slow?

At this stage, it is much easier to find reports from other users about similar problems. As it turned out, this bug had been reported in the cadvisor tracker as a problem of excessive CPU usage; it is just that no one had noticed that the latency also shows up randomly in the network stack. It had indeed been noticed that cadvisor was consuming more CPU time than expected, but this was not given much importance, since our servers have plenty of CPU resources, so the problem was not studied carefully.

The problem is that memory cgroups account for memory usage within the namespace (container). When all processes in this cgroup exit, Docker releases the memory cgroup. However, "memory" is not just process memory. Although the process memory itself is no longer in use, it turns out that the kernel also charges cached content, such as dentries and inodes (directory and file metadata), to the memory cgroup. From the problem description:

zombie cgroups: cgroups that have no processes and have been deleted, but still have memory charged to them (in my case, from the dentry cache, but it can also be charged from the page cache or tmpfs).

Checking all the pages in the cache when freeing a cgroup can be very slow for the kernel, so the lazy approach is chosen: wait until these pages are requested again, and only then finally clear the cgroup, when the memory is actually needed. Until that point, the cgroup is still taken into account when collecting statistics.

From a performance standpoint, they sacrificed memory for performance: the initial cleanup is faster because some cached memory is left behind. This is fine. When the kernel uses the last of the cached memory, the cgroup is eventually cleared, so it cannot be called a "leak". Unfortunately, the specific implementation of the memory.stat lookup mechanism in this kernel version (4.9), combined with the huge amount of memory on our servers, means that it takes much longer to restore the latest cached data and clear cgroup zombies.

It turns out that some of our nodes had so many cgroup zombies that the read and the latency exceeded a second.

The workaround for the cadvisor problem is to immediately free the dentries/inodes caches across the system, which immediately eliminates the read latency as well as the network latency on the host, since clearing the cache also touches the cached zombie cgroup pages and frees them too. This is not a solution, but it confirms the cause of the problem.

It turned out that in newer kernel versions (4.19+) the performance of the memory.stat call was improved, so switching to this kernel fixed the problem. In the meantime, we had tooling to detect problematic nodes in Kubernetes clusters, gracefully drain them, and reboot them. We combed through all the clusters, found nodes with high enough latency, and rebooted them. This gave us time to update the OS on the remaining servers.

Summing up

Because this bug stopped RX NIC queue processing for hundreds of milliseconds, it simultaneously caused both high latency on short connections and latency in the middle of connections, such as between MySQL requests and response packets.

Understanding and maintaining the performance of the most fundamental systems, such as Kubernetes, is critical to the reliability and speed of all the services built on them. Every system you run benefits from Kubernetes performance improvements.

Source: www.habr.com
