"Kubernetes increased latency 10x": who is to blame?

Translator's note: This article, written by Galo Navarro, Principal Software Engineer at the European company Adevinta, is a fascinating and instructive "investigation" in the field of infrastructure operations. Its original title was slightly expanded in translation for a reason the author explains at the very beginning.

Author's note: It looks like this post got a lot more attention than I expected. I still get angry comments saying that the title is misleading and that some readers were disappointed. I understand why, so, at the risk of spoiling the whole surprise, I want to tell you right away what this article is about. A curious thing I have seen as teams migrate to Kubernetes is that whenever a problem appears (such as increased latency after a migration), the first thing to be blamed is Kubernetes, but then it turns out that the orchestrator was not really at fault. This article describes one such case. Its title repeats the exclamation of one of our developers (you will see later that Kubernetes had nothing to do with it). You won't find any surprising revelations about Kubernetes here, but you can expect a couple of good lessons about complex systems.

A few weeks ago, my team migrated a single microservice to a core platform that includes CI/CD, a Kubernetes-based runtime, metrics, and other goodies. The move was a trial run: we planned to use it as a blueprint and migrate roughly 150 more services in the coming months. All of them are responsible for running some of the largest online platforms in Spain (Infojobs, Fotocasa, and others).

After we deployed the application to Kubernetes and routed some traffic to it, an alarming surprise awaited us. Request latency in Kubernetes was 10 times higher than in EC2. We had to either find a solution to this problem or abandon the migration of the microservice (and, possibly, the entire project).

Why is latency higher in Kubernetes than in EC2?

To find the bottleneck, we collected metrics along the entire request path. Our architecture is simple: an API gateway (Zuul) proxies requests to microservice instances in EC2 or Kubernetes. In Kubernetes we use the NGINX Ingress Controller, and the backends are ordinary Deployment objects running a JVM application based on Spring.

                                  EC2
                            +---------------+
                            |  +---------+  |
                            |  |         |  |
                       +-------> BACKEND |  |
                       |    |  |         |  |
                       |    |  +---------+  |                   
                       |    +---------------+
             +------+  |
Public       |      |  |
      -------> ZUUL +--+
traffic      |      |  |              Kubernetes
             +------+  |    +-----------------------------+
                       |    |  +-------+      +---------+ |
                       |    |  |       |  xx  |         | |
                       +-------> NGINX +------> BACKEND | |
                            |  |       |  xx  |         | |
                            |  +-------+      +---------+ |
                            +-----------------------------+

The problem appeared to be related to upstream latency in the backend (I marked the problem area on the diagram as "xx"). On EC2, the application responded in about 20 ms. In Kubernetes, latency rose to 100-200 ms.

We quickly ruled out the likely suspects related to the change of runtime. The JVM version was the same. Containerization had nothing to do with it either: the application was already running in containers on EC2. Load? But we saw high latencies even at 1 request per second. Garbage collection pauses were negligible.

One of our Kubernetes admins asked whether the application had external dependencies, because DNS queries had caused similar problems in the past.

Hypothesis 1: DNS name resolution

On every request, our application accesses an AWS Elasticsearch instance one to three times at a domain like elastic.spain.adevinta.com. Our containers have a shell, so we could check whether looking up the domain really took a long time.

DNS queries from the container:

[root@be-851c76f696-alf8z /]# while true; do dig "elastic.spain.adevinta.com" | grep time; sleep 2; done
;; Query time: 22 msec
;; Query time: 22 msec
;; Query time: 29 msec
;; Query time: 21 msec
;; Query time: 28 msec
;; Query time: 43 msec
;; Query time: 39 msec

The same queries from one of the EC2 instances where the application runs:

bash-4.4# while true; do dig "elastic.spain.adevinta.com" | grep time; sleep 2; done
;; Query time: 77 msec
;; Query time: 0 msec
;; Query time: 0 msec
;; Query time: 0 msec
;; Query time: 0 msec
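
A similar measurement can be sketched from application code. The snippet below is only an illustration, not something the original investigation used: the function name time_lookups is hypothetical, and socket.getaddrinfo goes through the system resolver (including /etc/hosts and any local cache), so its timings are not directly comparable to dig.

```python
import socket
import time


def time_lookups(hostname, attempts=5, pause_s=0.0):
    """Resolve `hostname` several times; return each lookup's duration in ms."""
    durations_ms = []
    for _ in range(attempts):
        start = time.perf_counter()
        socket.getaddrinfo(hostname, None)  # uses the system resolver
        durations_ms.append((time.perf_counter() - start) * 1000.0)
        time.sleep(pause_s)
    return durations_ms


if __name__ == "__main__":
    # "localhost" keeps the sketch runnable anywhere; substitute the real domain.
    for ms in time_lookups("localhost", attempts=3):
        print(f"lookup took {ms:.1f} ms")
```

Running it both inside the container and on the EC2 instance would give a comparison in the same spirit as the two dig loops.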

Given that the lookup took about 30 ms, it became clear that DNS resolution when accessing Elasticsearch really was contributing to the increase in latency.

However, this was strange for two reasons:

  1. We already had plenty of Kubernetes applications that talk to AWS resources without suffering from high latency. Whatever the cause, it was specific to this case.
  2. We know that the JVM does in-memory DNS caching. In our images, the TTL value is written in $JAVA_HOME/jre/lib/security/java.security and set to 10 seconds: networkaddress.cache.ttl = 10. In other words, the JVM should cache all DNS queries for 10 seconds.
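
The second point can be made concrete with a toy model. The sketch below is not the JVM's cache implementation; it is a minimal time-to-live cache (the names TtlDnsCache and count_resolver_calls are hypothetical) that shows why, with a 10-second TTL and one request per second, only about one request in ten should pay for a real lookup.

```python
import time


class TtlDnsCache:
    """Caches resolver results for `ttl_s` seconds (a toy positive DNS cache)."""

    def __init__(self, resolve, ttl_s=10.0, clock=time.monotonic):
        self._resolve = resolve   # function: hostname -> address
        self._ttl_s = ttl_s
        self._clock = clock       # injectable so the sketch can be simulated
        self._entries = {}        # hostname -> (address, expires_at)

    def lookup(self, hostname):
        now = self._clock()
        entry = self._entries.get(hostname)
        if entry is not None and now < entry[1]:
            return entry[0]       # still fresh: no resolver call
        address = self._resolve(hostname)
        self._entries[hostname] = (address, now + self._ttl_s)
        return address


def count_resolver_calls(request_times_s, ttl_s=10.0):
    """Return how many real lookups happen for requests at the given times."""
    calls = []
    current = [0.0]

    def fake_resolve(hostname):
        calls.append(hostname)
        return "34.55.5.111"      # placeholder address taken from the article

    cache = TtlDnsCache(fake_resolve, ttl_s=ttl_s, clock=lambda: current[0])
    for t in request_times_s:
        current[0] = t
        cache.lookup("elastic.spain.adevinta.com")
    return len(calls)


if __name__ == "__main__":
    # One request per second for a minute.
    print(count_resolver_calls(range(60)), "real lookups for 60 requests")
```

If caching behaves like this, DNS alone cannot explain a latency increase that shows up on every single request.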

To confirm the first hypothesis, we decided to stop calling DNS for a while and see whether the problem went away. At first, we considered reconfiguring the application so that it talked to Elasticsearch directly by IP address rather than by domain name. That would have required code changes and a new deployment, so we simply mapped the domain to its IP address in /etc/hosts:

34.55.5.111 elastic.spain.adevinta.com

Now the container got the IP almost instantly. This brought some improvement, but we were only slightly closer to the expected latency levels. Although DNS resolution did take a long time, the real cause still eluded us.

Network diagnostics

We decided to analyze traffic from the container using tcpdump to see what exactly was happening on the network:

[root@be-851c76f696-alf8z /]# tcpdump -leni any -w capture.pcap

We then sent a few requests and downloaded the capture (kubectl cp my-service:/capture.pcap capture.pcap) for further analysis in Wireshark.

There was nothing suspicious about the DNS queries (except for one small detail that I will discuss later). But there were some oddities in the way our service handled each request. Below is a screenshot of the capture showing the request being accepted before the response begins:

[Screenshot: packet capture in Wireshark]

Packet numbers are shown in the first column. For clarity, I have color-coded the different TCP streams.

The green stream, starting at packet 328, shows how the client (172.17.22.150) established a TCP connection to the container (172.17.36.147). After the initial handshake (328-330), packet 331 carried an HTTP GET /v1/.., an incoming request to our service. The whole process took 1 ms.

The gray stream (from packet 339) shows that our service sent an HTTP request to the Elasticsearch instance (there is no TCP handshake because it uses an existing connection). This took 18 ms.

So far everything is fine, and the times roughly match the expected latency (20-30 ms when measured from the client).

However, the blue section takes 86 ms. What happens in it? With packet 333, our service sent an HTTP GET request to /latest/meta-data/iam/security-credentials, and immediately after it, over the same TCP connection, another GET request to /latest/meta-data/iam/security-credentials/arn:...

We found that this repeated on every request throughout the trace. DNS resolution is indeed a little slower in our containers (the explanation for this is quite interesting, but I will save it for a separate article). It turned out that the cause of the long delays was the calls to the AWS Instance Metadata service on every request.
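
As a quick sanity check on those numbers, the three durations read off the capture can be added up. The sketch below assumes the segments run one after another without overlapping, which the capture description suggests but does not state outright; the labels and the summarize helper are hypothetical.

```python
# Durations read off the capture, in milliseconds.
segments_ms = {
    "incoming request (green)": 1,
    "Elasticsearch call (gray)": 18,
    "metadata calls (blue)": 86,
}


def summarize(segments):
    """Return the total duration and each segment's share of it."""
    total = sum(segments.values())
    shares = {name: ms / total for name, ms in segments.items()}
    return total, shares


total_ms, shares = summarize(segments_ms)

if __name__ == "__main__":
    print(f"total: {total_ms} ms")
    for name, share in shares.items():
        print(f"{name}: {share:.0%}")
```

Under that assumption the total lands at the lower end of the 100-200 ms observed in Kubernetes, with the metadata calls accounting for most of it.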

Hypothesis 2: unnecessary calls to AWS

Both endpoints belong to the AWS Instance Metadata API. Our microservice uses this service while working with Elasticsearch. Both calls are part of the basic authorization flow. The endpoint accessed by the first request returns the IAM role associated with the instance.

/ # curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
arn:aws:iam::<account_id>:role/some_role

The second request asks the second endpoint for temporary credentials for this instance:

/ # curl http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::<account_id>:role/some_role
{
    "Code" : "Success",
    "LastUpdated" : "2012-04-26T16:39:16Z",
    "Type" : "AWS-HMAC",
    "AccessKeyId" : "ASIAIOSFODNN7EXAMPLE",
    "SecretAccessKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "Token" : "token",
    "Expiration" : "2017-05-17T15:09:54Z"
}
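
To see how much validity such a response leaves, the Expiration field can be compared with the current time. The sketch below is an illustration only: the remaining_validity helper is hypothetical, and the sample document carries just the fields it needs, with the timestamp taken from the example response.

```python
import json
from datetime import datetime, timezone


def remaining_validity(credentials_json, now):
    """Return how long the credentials in the JSON document stay valid."""
    doc = json.loads(credentials_json)
    expiration = datetime.strptime(
        doc["Expiration"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)  # the trailing "Z" means UTC
    return expiration - now


# Trimmed-down sample; a real response also contains the keys and token.
sample = '{"Code": "Success", "Expiration": "2017-05-17T15:09:54Z"}'

# Hypothetical "current time", chosen 15 minutes before the expiration.
sample_now = datetime(2017, 5, 17, 14, 54, 54, tzinfo=timezone.utc)

if __name__ == "__main__":
    print(remaining_validity(sample, sample_now))
```

In practice the second argument would be datetime.now(timezone.utc), and the JSON would come from the curl call shown above.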

The client can use them for a short period of time and must periodically obtain new credentials (before their Expiration). The model is simple: AWS rotates temporary keys frequently for security reasons, but clients can cache them for a few minutes to compensate for the performance penalty of obtaining new credentials.

The AWS Java SDK should take care of organizing this process, but for some reason this was not happening.

After searching the issues on GitHub, we came across issue #1921. It helped us determine the direction in which to "dig" further.

The AWS SDK refreshes credentials when one of the following conditions occurs:

  • The expiration date (Expiration) falls within EXPIRATION_THRESHOLD, hardcoded to 15 minutes.
  • More time has passed since the last attempt to refresh the credentials than REFRESH_THRESHOLD, hardcoded to 60 minutes.
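
The two conditions above can be written down as a small decision function. This is a sketch of the rule as described here, not the SDK's actual code: the function name needs_refresh and its parameters are hypothetical, and only the two threshold values come from the text.

```python
from datetime import datetime, timedelta, timezone

EXPIRATION_THRESHOLD = timedelta(minutes=15)  # value quoted in the article
REFRESH_THRESHOLD = timedelta(minutes=60)     # value quoted in the article


def needs_refresh(now, expiration, last_refresh_attempt):
    """Decide whether cached credentials should be fetched again."""
    if expiration is None:
        return True  # nothing cached yet
    if expiration - now < EXPIRATION_THRESHOLD:
        return True  # too close to expiry
    if (last_refresh_attempt is not None
            and now - last_refresh_attempt > REFRESH_THRESHOLD):
        return True  # too long since the last refresh attempt
    return False


if __name__ == "__main__":
    now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
    # Credentials with 14 minutes left are refreshed immediately.
    print(needs_refresh(now, now + timedelta(minutes=14), now))
    # Credentials with 6 hours left, refreshed 10 minutes ago, are kept.
    print(needs_refresh(now, now + timedelta(hours=6),
                        now - timedelta(minutes=10)))
```

The first condition is the one that matters for the rest of the story: any credential that arrives with less than 15 minutes of validity is considered stale from the start.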

To see the expiration date of the credentials we were receiving, we ran the cURL commands above from both the container and the EC2 instance. The validity period of the credentials obtained from the container turned out to be much shorter: exactly 15 minutes.

Now everything became clear: on the first request, our service obtained temporary credentials. Since they were valid for no more than 15 minutes, the AWS SDK decided to refresh them on the next request. And this happened on every request.

Why did the validity period of the credentials become shorter?

AWS Instance Metadata is designed to work with EC2 instances, not Kubernetes. On the other hand, we did not want to change the application interface. For this we used KIAM, a tool that, using agents on every Kubernetes node, allows users (engineers deploying applications to the cluster) to assign IAM roles to containers in pods as if they were EC2 instances. KIAM intercepts calls to the AWS Instance Metadata service and serves them from its own cache, which it has previously filled from AWS. From the application's point of view, nothing changes.

KIAM supplies short-lived credentials to pods. This makes sense given that the average lifetime of a pod is shorter than that of an EC2 instance. The default validity period for credentials is the same 15 minutes.

So, if you overlay the two default values on top of each other, a problem arises. Every credential issued to the application expires after 15 minutes. However, the AWS Java SDK forces a refresh of any credential that has less than 15 minutes left before its expiration.

As a result, the temporary credentials are forcibly refreshed on every request, which entails a couple of calls to the AWS API and leads to a significant increase in latency. In the AWS Java SDK we found a feature request that mentions a similar problem.
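
The interaction of the two defaults is easy to reproduce in a toy simulation. The sketch below is not KIAM or SDK code: count_metadata_fetches is a hypothetical helper, the thresholds are the two values quoted earlier, and the 60-minute lifetime in the second call is just an illustrative "longer" value, not the one we actually configured.

```python
EXPIRATION_THRESHOLD_S = 15 * 60  # SDK-side threshold quoted in the article
REFRESH_THRESHOLD_S = 60 * 60     # SDK-side threshold quoted in the article


def count_metadata_fetches(credential_lifetime_s,
                           request_interval_s=1,
                           duration_s=600):
    """Count credential fetches over a run of evenly spaced requests."""
    fetches = 0
    expires_at = None
    last_fetch_at = None
    for now in range(0, duration_s, request_interval_s):
        stale = (
            expires_at is None
            or expires_at - now < EXPIRATION_THRESHOLD_S
            or now - last_fetch_at > REFRESH_THRESHOLD_S
        )
        if stale:
            fetches += 1  # each fetch means two HTTP calls to the metadata API
            last_fetch_at = now
            expires_at = now + credential_lifetime_s
    return fetches


if __name__ == "__main__":
    # Ten minutes of traffic at one request per second.
    print("15-minute credentials:", count_metadata_fetches(15 * 60))
    print("60-minute credentials:", count_metadata_fetches(60 * 60))
```

With 15-minute credentials every request after the first finds less than 15 minutes of validity left and triggers a refresh, while with a longer lifetime the single initial fetch covers the whole run.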

The solution turned out to be simple. We just reconfigured KIAM to request credentials with a longer validity period. Once this was done, requests began to flow without involving the AWS Metadata service, and latency dropped to even lower levels than in EC2.

Conclusions

Based on our experience with migrations, one of the most common sources of problems is not bugs in Kubernetes or other elements of the platform. Nor is it any fundamental flaw in the microservices we are migrating. Problems often arise simply because we put different elements together.

We blend together complex systems that have never interacted with each other before, expecting that together they will form a single, larger system. Alas, the more elements, the more room for error and the higher the entropy.

In our case, the high latency was not the result of bugs or bad decisions in Kubernetes, KIAM, the AWS Java SDK, or our microservice. It was the result of combining two independent default settings: one in KIAM, the other in the AWS Java SDK. Taken separately, both settings make sense: the proactive credential refresh policy in the AWS Java SDK, and the short validity period of credentials in KIAM. But when you put them together, the results become unpredictable. Two independent and logical decisions do not necessarily make sense when combined.

P.S. from the translator

You can learn more about the architecture of the KIAM utility for integrating AWS IAM with Kubernetes in this article from its creators.


Source: www.hab.com
