"Kubernetes ya karu latency da sau 10": wanene ke da alhakin wannan?

Translator's note: This article, written by Galo Navarro, Principal Software Engineer at the European company Adevinta, is a fascinating and instructive "investigation" in the field of infrastructure operations. Its original title has been slightly expanded in translation, for the reason the author explains right at the start.

"Kubernetes ya karu latency da sau 10": wanene ke da alhakin wannan?

A note from the author: This post attracted far more attention than expected. I still get annoyed comments that the title is misleading and that some readers are left disappointed. I understand the reasons, so, at the risk of spoiling the whole plot, I want to tell you right away what this article is about. A curious thing I have seen as teams migrate to Kubernetes is that whenever a problem arises (such as latency increasing after a migration), the first thing blamed is Kubernetes, but then it turns out the platform is not really at fault. This article describes one such case. Its title repeats the exclamation of one of our developers (later you will see that Kubernetes has nothing to do with it). You will not find any surprising revelations about Kubernetes here, but you can expect a couple of good lessons about complex systems.

A couple of weeks ago, my team was migrating a single microservice to a core platform that includes CI/CD, a Kubernetes-based runtime, metrics, and other goodies. The move was a trial run: we planned to use it as a baseline and migrate roughly 150 more services over the coming months. All of them are responsible for running some of the largest online marketplaces in Spain (Infojobs, Fotocasa, and so on).

After we deployed the application to Kubernetes and routed some traffic to it, an alarming surprise awaited us. Request latency in Kubernetes was 10x higher than on EC2. In short, we either had to find a solution to this problem or abandon the migration of the microservice (and possibly the entire project).

Why is latency so much higher in Kubernetes than on EC2?

To find the bottleneck, we collected metrics along the entire request path. Our architecture is simple: an API gateway (Zuul) proxies requests to microservice instances on EC2 or in Kubernetes. In Kubernetes we use the NGINX Ingress Controller, and the backends are ordinary Deployment objects running a JVM application on the Spring platform.

                                  EC2
                            +---------------+
                            |  +---------+  |
                            |  |         |  |
                       +-------> BACKEND |  |
                       |    |  |         |  |
                       |    |  +---------+  |                   
                       |    +---------------+
             +------+  |
Public       |      |  |
      -------> ZUUL +--+
traffic      |      |  |              Kubernetes
             +------+  |    +-----------------------------+
                       |    |  +-------+      +---------+ |
                       |    |  |       |  xx  |         | |
                       +-------> NGINX +------> BACKEND | |
                            |  |       |  xx  |         | |
                            |  +-------+      +---------+ |
                            +-----------------------------+

The problem appeared to be related to the first-hop latency at the backend (I marked the problem area in the diagram as "xx"). On EC2, the application responded in about 20ms. In Kubernetes, latency rose to 100-200ms.

We quickly ruled out the suspects related to the change of runtime. The JVM version was identical. Containerization was not to blame either: the application already ran successfully in containers on EC2. Load? We observed high latencies even at 1 request per second. Garbage-collection pauses could also be dismissed.

One of our Kubernetes administrators wondered whether the application had external dependencies, because DNS resolution had caused similar issues in the past.

Hypothesis 1: DNS resolution

For every request, our application makes one to three calls to an AWS Elasticsearch instance in a zone like elastic.spain.adevinta.com. We have a shell inside our containers, so we could check whether resolving the domain actually takes a long time.

DNS queries from the container:

[root@be-851c76f696-alf8z /]# while true; do dig "elastic.spain.adevinta.com" | grep time; sleep 2; done
;; Query time: 22 msec
;; Query time: 22 msec
;; Query time: 29 msec
;; Query time: 21 msec
;; Query time: 28 msec
;; Query time: 43 msec
;; Query time: 39 msec

Similar queries from one of the EC2 instances where the application runs:

bash-4.4# while true; do dig "elastic.spain.adevinta.com" | grep time; sleep 2; done
;; Query time: 77 msec
;; Query time: 0 msec
;; Query time: 0 msec
;; Query time: 0 msec
;; Query time: 0 msec

Given that lookups took about 30ms, it seemed clear that DNS resolution on the way to Elasticsearch was indeed contributing to the latency increase.

However, this was strange for two reasons:

  1. We already have a ton of Kubernetes applications that talk to AWS resources without suffering from high latency. Whatever the cause, it was specific to this case.
  2. We know that the JVM does in-memory DNS caching. In our images, the TTL value is written in $JAVA_HOME/jre/lib/security/java.security and set to 10 seconds: networkaddress.cache.ttl = 10. In other words, the JVM should cache all DNS lookups for 10 seconds.
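The JVM-side caching from point 2 is easy to verify from code. Here is a minimal sketch (the hostname is a stand-in, not our real endpoint): it prints the effective TTL property and times two consecutive lookups; the second should be served from the JVM's in-memory cache.

```java
import java.net.InetAddress;
import java.security.Security;

public class DnsCacheCheck {
    public static void main(String[] args) throws Exception {
        // Effective positive-lookup TTL in seconds; null means the platform
        // default applies rather than an explicit java.security override.
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));

        String host = "example.com"; // stand-in for elastic.spain.adevinta.com
        long t0 = System.nanoTime();
        InetAddress.getByName(host);                 // real resolver round-trip
        long firstMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        InetAddress.getByName(host);                 // served from the JVM cache
        long secondMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.println("first lookup: " + firstMs
                + " ms, cached lookup: " + secondMs + " ms");
    }
}
```

With a 10-second TTL, repeating the cached lookup inside that window should cost well under a millisecond, which is why per-request DNS latency alone could not explain what we saw.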

To test the first hypothesis, we decided to temporarily eliminate DNS calls and see whether the problem went away. At first we considered modifying the application to talk to Elasticsearch directly by IP address rather than by domain name. That would have required code changes and a new deployment, so we simply mapped the domain to its IP address in /etc/hosts:

34.55.5.111 elastic.spain.adevinta.com

Now the container resolved the IP almost instantly. This produced some improvement, but we were only slightly closer to the expected latency levels. Although DNS resolution was slow, the real cause still eluded us.

Network diagnostics

We decided to analyze the traffic from the container with tcpdump to see exactly what was happening on the network:

[root@be-851c76f696-alf8z /]# tcpdump -leni any -w capture.pcap

We then sent several requests and downloaded the capture (kubectl cp my-service:/capture.pcap capture.pcap) for further analysis in Wireshark.

There was nothing suspicious about the DNS queries (except for one small detail I will come back to later). But there was something odd in how our service handled each request. Below is a screenshot of the capture, showing a request being received up to the start of the response:

"Kubernetes ya karu latency da sau 10": wanene ke da alhakin wannan?

Packet numbers are shown in the first column. For clarity, I color-coded the different TCP streams.

The green stream starting with packet 328 shows the client (172.17.22.150) establishing a TCP connection to the container (172.17.36.147). After the initial handshake (328-330), packet 331 brings HTTP GET /v1/.. — the incoming request to our service. The whole process takes 1ms.

The gray stream (from packet 339) shows our service sending an HTTP request to the Elasticsearch instance (there is no TCP handshake because it reuses an existing connection). This takes 18ms.

So far so good, and the times roughly match the expected latency (20-30ms when measured from the client).

The blue section, however, takes 86ms. What is going on there? With packet 333, our service sends an HTTP GET request to /latest/meta-data/iam/security-credentials, and immediately after it, on the same TCP connection, another GET request to /latest/meta-data/iam/security-credentials/arn:...

We found that this was repeated for every request in the capture. DNS resolution is indeed somewhat slower in our containers (the explanation is quite interesting, but I will save it for a separate article). It turned out that the real cause of the long delays was the calls to the AWS Instance Metadata Service on every request.

Hypothesis 2: unnecessary calls to AWS

Both endpoints belong to the AWS Instance Metadata API. Our microservice uses this service while querying Elasticsearch; both calls are part of the basic authorization flow. The endpoint hit by the first request returns the IAM role associated with the instance:

/ # curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
arn:aws:iam::<account_id>:role/some_role

The second request queries the second endpoint for temporary credentials for that instance:

/ # curl http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::<account_id>:role/some_role
{
    "Code" : "Success",
    "LastUpdated" : "2012-04-26T16:39:16Z",
    "Type" : "AWS-HMAC",
    "AccessKeyId" : "ASIAIOSFODNN7EXAMPLE",
    "SecretAccessKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "Token" : "token",
    "Expiration" : "2017-05-17T15:09:54Z"
}

The client can use these for a short period and must obtain new credentials periodically (before their Expiration). The model is simple: AWS rotates temporary keys frequently for security reasons, but clients can cache them for a few minutes to offset the performance penalty of fetching new credentials.
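The caching contract described above can be sketched as follows. This is illustrative only — the class, fields, and 1-minute margin are invented for the sketch, not the AWS SDK's actual implementation:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the credential-caching model: reuse cached temporary
// credentials and refetch them shortly before their Expiration.
public class CachingCredentialsProvider {
    private String credentials;                  // cached temporary credentials
    private Instant expiration = Instant.EPOCH;  // forces a fetch on first use
    private final Duration margin = Duration.ofMinutes(1); // illustrative safety margin

    public synchronized String getCredentials() {
        if (!Instant.now().isBefore(expiration.minus(margin))) {
            // In the real flow this is the pair of HTTP GETs against
            // 169.254.169.254/latest/meta-data/iam/security-credentials/...
            credentials = fetchFromMetadataService();
            expiration = Instant.now().plus(Duration.ofMinutes(15));
        }
        return credentials;
    }

    private String fetchFromMetadataService() {
        return "session-token-" + Instant.now(); // placeholder for the HTTP calls
    }
}
```

As long as the cached credentials are comfortably far from expiry, repeated calls return the cached value and no network round-trips happen — which is precisely what was not happening in our case.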

The AWS Java SDK is supposed to take care of orchestrating this process, but for some reason it was not happening.

After searching the issues on GitHub, we came across issue #1921. It helped us figure out which direction to "dig" in next.

The AWS SDK refreshes credentials when either of the following conditions holds:

  • The expiration date (Expiration) falls within EXPIRATION_THRESHOLD, hardcoded to 15 minutes.
  • More time has passed since the last attempt to refresh the credentials than REFRESH_THRESHOLD, hardcoded to 60 minutes.
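Put together, the two thresholds behave roughly like this sketch. The constant names mirror the ones above; the method itself is a simplification for illustration, not the SDK's actual code:

```java
import java.time.Duration;
import java.time.Instant;

// Simplified sketch of the SDK's refresh decision described above.
public class SdkRefreshRule {
    static final Duration EXPIRATION_THRESHOLD = Duration.ofMinutes(15);
    static final Duration REFRESH_THRESHOLD = Duration.ofMinutes(60);

    static boolean needsRefresh(Instant now, Instant expiration, Instant lastRefresh) {
        // Refresh if less than 15 minutes of validity remain...
        boolean closeToExpiry =
                Duration.between(now, expiration).compareTo(EXPIRATION_THRESHOLD) < 0;
        // ...or if the last refresh attempt was more than an hour ago.
        boolean refreshedLongAgo =
                Duration.between(lastRefresh, now).compareTo(REFRESH_THRESHOLD) > 0;
        return closeToExpiry || refreshedLongAgo;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // Credentials valid for only ~14 more minutes: always refreshed.
        System.out.println(needsRefresh(now, now.plus(Duration.ofMinutes(14)), now));
        // Credentials with hours of validity left: reused.
        System.out.println(needsRefresh(now, now.plus(Duration.ofHours(6)), now));
    }
}
```

The first call prints true and the second false: a credential that never has more than 15 minutes of remaining validity is, by this rule, refreshed on every single use.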

To see the actual expiration date of the credentials we were receiving, we ran the curl commands above from both the container and an EC2 instance. The validity period of the credentials obtained from the container turned out to be much shorter: exactly 15 minutes.

Now everything made sense: on the first request, our service obtained temporary credentials. Since they were valid for no more than 15 minutes, the AWS SDK decided to refresh them on the next request. And this happened on every request.

Why was the credential validity period so short?

The AWS Instance Metadata Service is designed to work with EC2 instances, not Kubernetes. On the other hand, we did not want to change the application's interface. For this we used KIAM, a tool which, through agents on every Kubernetes node, lets users (engineers deploying applications to the cluster) assign IAM roles to containers in pods as if they were EC2 instances. KIAM intercepts calls to the AWS Instance Metadata Service and serves them from its own cache, having already fetched the data from AWS. From the application's point of view, nothing changes.

KIAM hands out short-lived credentials to pods. This makes sense given that the average lifetime of a pod is shorter than that of an EC2 instance. The default credential validity period is that same 15 minutes.

As a result, if you layer the two default values on top of each other, a problem arises. Every credential issued to the application expires after 15 minutes. Yet the AWS Java SDK forces a refresh of any credential that has less than 15 minutes left before its expiration date.

Consequently, the temporary credentials were forcibly refreshed on every request, which entails two calls to the AWS API and causes an enormous increase in latency. In the AWS Java SDK we found a feature request that mentions the same problem.

The fix turned out to be simple. We just reconfigured KIAM to request credentials with a longer validity period. Once this was done, requests started flowing without involving the AWS Metadata Service at all, and latency dropped to levels even lower than on EC2.

Conclusions

In our experience with migrations, one of the most common sources of problems is not bugs in Kubernetes or other platform components. Nor is it any fundamental flaw in the microservices we migrate. Problems often arise simply because we put different components together.

We mix complex systems that have never interacted with each other before, expecting that together they will form a single, larger system. Alas, the more components there are, the more room there is for error, and the higher the entropy.

In our case, the high latency was not the result of bugs or bad decisions in Kubernetes, KIAM, the AWS Java SDK, or our microservice. It was the result of combining two independent defaults: one in KIAM, the other in the AWS Java SDK. Taken separately, both parameters make sense: the credentials refresh policy in the AWS Java SDK, and the short credential validity period in KIAM. But put together, the results become unpredictable. Two independent and sensible decisions do not have to make sense when combined.

P.S. from the translator

You can learn more about the architecture of KIAM, the tool for integrating AWS IAM with Kubernetes, in this article by its creator.


source: www.habr.com
