Tinder's move to Kubernetes

Translator's note: The team behind the world-famous Tinder service recently shared some technical details of migrating their infrastructure to Kubernetes. The process took almost two years and resulted in a very large-scale platform on K8s, consisting of 200 services running on 48,000 containers. What interesting problems did Tinder's engineers run into, and what did they end up with? Read on in this translation.

Why?

Almost two years ago, Tinder decided to move its platform to Kubernetes. Kubernetes would let the Tinder team containerize its workloads and move them to production with minimal effort through immutable deployments. In this model, the application build, its deployment, and the infrastructure itself are all defined as code.

We were also looking for a solution to our scalability and stability problems. When scaling became critical, we often had to wait several minutes for new EC2 instances to spin up. The idea of launching containers and serving traffic within seconds instead of minutes was very appealing to us.

The process turned out to be hard. During our migration in early 2019, the Kubernetes cluster reached critical mass and we started running into a variety of problems caused by traffic volume, cluster size, and DNS. Along the way, we solved many interesting problems related to migrating 200 services and operating a Kubernetes cluster of 1,000 nodes, 15,000 pods, and 48,000 running containers.

How?

Starting in January 2018, we went through the various stages of the migration. We began by containerizing all of our services and deploying them to Kubernetes staging environments. Starting in October, we began methodically moving all existing services to Kubernetes. By March of the following year we had finished the migration, and the Tinder platform now runs exclusively on Kubernetes.

Building images for Kubernetes

We have more than 30 source code repositories for the microservices running in the Kubernetes cluster. The code in these repositories is written in different languages (for example, Node.js, Java, Scala, Go), with multiple runtime environments for the same language.

The build system is designed to provide a fully customizable "build context" for each microservice. It usually consists of a Dockerfile and a list of shell commands. Their contents are fully customizable, and at the same time all of these build contexts are written to a standardized format. Standardizing the build contexts lets a single build system handle all of the microservices.

Figure 1-1. A standardized build flow through the Builder container

To achieve maximum consistency between runtime environments, the same build process is used during development and testing. We faced an interesting challenge: we had to come up with a way to guarantee a consistent build environment across the whole platform. To achieve this, all build steps run inside a special container, Builder.

Implementing it as a container required advanced Docker techniques. Builder inherits the local user ID and the secrets (such as SSH keys, AWS credentials, and so on) needed to access Tinder's private repositories. It mounts local directories containing the source code so that build artifacts are stored naturally on the host. This approach improves performance, because it removes the need to copy build artifacts between the Builder container and the host, and cached build artifacts can be reused without any extra configuration.
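
As a rough illustration of this pattern (not Tinder's actual tooling), here is a minimal Python sketch of a wrapper that runs a hypothetical builder image with the source tree and artifact cache mounted in and the host user's ID and SSH agent passed through, so artifacts stay on the host; the image name, paths, and the build.sh entry point are all assumptions:

    import os
    import subprocess

    def run_builder(service_dir: str, image: str = "builder:latest") -> None:
        """Run the per-service build inside a builder container, mounting the
        source tree so cached build artifacts stay on the host."""
        uid, gid = os.getuid(), os.getgid()
        ssh_agent = os.environ.get("SSH_AUTH_SOCK", "")
        cmd = [
            "docker", "run", "--rm",
            "--user", f"{uid}:{gid}",                             # inherit the local user ID
            "-v", f"{os.path.abspath(service_dir)}:/workspace",   # sources + artifact cache
            "-v", f"{ssh_agent}:/ssh-agent",                      # SSH agent for private repos
            "-e", "SSH_AUTH_SOCK=/ssh-agent",
            "-w", "/workspace",
            image,
            "./build.sh",                                         # the service's shell build steps
        ]
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        run_builder("./services/example-service")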

For some services, we had to create a separate container to map the compile environment to the runtime environment (for example, the Node.js bcrypt library generates platform-specific binary artifacts during installation). Compile-time requirements can differ from service to service, and the final Dockerfile is composed on the fly.
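
The exact mechanics of assembling the final Dockerfile are not described in the article; purely as a sketch of the idea, the snippet below concatenates a shared build stage with a hypothetical per-service fragment (Dockerfile.runtime) and runs docker build on the result:

    import pathlib
    import subprocess

    BASE_STAGE = """\
    FROM node:8 AS build
    WORKDIR /app
    COPY . .
    # native modules such as bcrypt are compiled here
    RUN npm ci
    """

    def compose_dockerfile(service_dir: pathlib.Path) -> pathlib.Path:
        """Concatenate the shared build stage with the service's runtime fragment."""
        fragment = (service_dir / "Dockerfile.runtime").read_text()
        dockerfile = service_dir / "Dockerfile.generated"
        dockerfile.write_text(BASE_STAGE + "\n" + fragment)
        return dockerfile

    def build_image(service_dir: str, tag: str) -> None:
        path = pathlib.Path(service_dir)
        dockerfile = compose_dockerfile(path)
        subprocess.run(["docker", "build", "-f", str(dockerfile), "-t", tag, str(path)],
                       check=True)

    if __name__ == "__main__":
        build_image("./services/example-service", "example-service:latest")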

Kubernetes cluster architecture and migration

Managing cluster size

We decided to use kube-aws for automated cluster provisioning on Amazon EC2 instances. In the very beginning, everything ran in a single shared pool of nodes. We quickly recognized the need to split workloads by instance size and type in order to use resources more efficiently. The reasoning was that running several heavily multi-threaded pods together turned out to be more predictable in performance than letting them coexist with a large number of single-threaded pods.

In the end, we settled on:

  • m5.4xlarge - for monitoring (Prometheus);
  • c5.4xlarge - for the Node.js workload (a single-threaded workload);
  • c5.2xlarge - for Java and Go (multi-threaded workloads);
  • c5.4xlarge - for the control plane (3 nodes).

Migration

One of the preparatory steps for migrating from the old infrastructure to Kubernetes was to redirect existing direct service-to-service communication to the new Elastic Load Balancers (ELBs). They were created in a specific subnet of a Virtual Private Cloud (VPC), and this subnet was peered with the Kubernetes VPC. This allowed us to migrate services gradually, without worrying about the specific order of service dependencies.

These endpoints were created using weighted DNS record sets with CNAMEs pointing to each new ELB. To cut over, we added a new record pointing to the new ELB of the Kubernetes service with a weight of 0. We then set the record's Time To Live (TTL) to 0. After that, the old and new weights were slowly adjusted, until eventually 100% of the traffic was sent to the new destination. Once the cutover was complete, the TTL was returned to a more reasonable value.
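
A minimal sketch of that kind of weighted cutover using boto3 and Route53 follows; the hosted-zone ID, record name, and ELB hostnames are placeholders, not Tinder's real values:

    import boto3

    # Assumes AWS credentials and region are already configured in the environment.
    route53 = boto3.client("route53")

    def set_weighted_cname(zone_id: str, name: str, identifier: str,
                           target: str, weight: int, ttl: int) -> None:
        """Upsert one weighted CNAME record in the set (old ELB vs. new Kubernetes ELB)."""
        route53.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": name,
                        "Type": "CNAME",
                        "SetIdentifier": identifier,   # distinguishes records in the weighted set
                        "Weight": weight,
                        "TTL": ttl,
                        "ResourceRecords": [{"Value": target}],
                    },
                }]
            },
        )

    ZONE, RECORD = "Z123EXAMPLE", "example-service.tinder.internal."
    # Start the cutover: add the Kubernetes ELB with weight 0 and TTL 0.
    set_weighted_cname(ZONE, RECORD, "legacy", "legacy-elb.example.amazonaws.com", 100, 0)
    set_weighted_cname(ZONE, RECORD, "kubernetes", "k8s-elb.example.amazonaws.com", 0, 0)
    # ... later, shift the weights until all traffic goes to Kubernetes, then raise the TTL again.
    set_weighted_cname(ZONE, RECORD, "legacy", "legacy-elb.example.amazonaws.com", 0, 0)
    set_weighted_cname(ZONE, RECORD, "kubernetes", "k8s-elb.example.amazonaws.com", 100, 60)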

Our Java modules could cope with a low DNS TTL, but the Node applications could not. One of our engineers rewrote part of the connection-pool code and wrapped it in a manager that refreshed the pools every 60 seconds. This approach worked very well, with no noticeable performance penalty.
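
The actual fix lived in the Node.js connection-pool code, which is not public; as a language-neutral illustration of the idea, here is a small Python sketch that re-resolves the endpoint every 60 seconds and rebuilds the pool when the address set changes:

    import socket
    import threading
    import time

    class RefreshingPool:
        """Toy connection pool that re-resolves its endpoint's DNS every 60 seconds
        and rebuilds connections when the address set changes."""

        def __init__(self, host: str, port: int, refresh_interval: int = 60):
            self.host, self.port = host, port
            self.refresh_interval = refresh_interval
            self.addresses: set[str] = set()
            self._lock = threading.Lock()
            self._refresh()
            threading.Thread(target=self._loop, daemon=True).start()

        def _resolve(self) -> set[str]:
            infos = socket.getaddrinfo(self.host, self.port, type=socket.SOCK_STREAM)
            return {info[4][0] for info in infos}

        def _refresh(self) -> None:
            fresh = self._resolve()
            with self._lock:
                if fresh != self.addresses:
                    self.addresses = fresh
                    # In a real pool: close idle connections to stale addresses
                    # and open new ones toward the fresh address set.

        def _loop(self) -> None:
            while True:
                time.sleep(self.refresh_interval)
                self._refresh()

    if __name__ == "__main__":
        pool = RefreshingPool("example.com", 443)
        print(pool.addresses)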

Lessons learned

Network fabric limits

In the early morning of January 8, 2019, the Tinder platform suffered an unexpected outage. In response to an unrelated increase in platform latency earlier that morning, the number of pods and nodes in the cluster had been scaled up. This caused the ARP cache to be exhausted on all of our nodes.

There are three Linux parameters related to the ARP cache:

Figure: ARP cache garbage-collection settings (source)

gc_thresh3 is a hard limit. "Neighbor table overflow" entries appearing in the log mean that even after a synchronous garbage collection (GC), there was not enough room in the ARP cache to store the neighbor entry. In that case, the kernel simply drops the packet entirely.

We use Flannel as the network fabric in Kubernetes. Packets are transmitted over VXLAN. VXLAN is an L2 tunnel built on top of an L3 network. The technology uses MAC-in-UDP (MAC Address-in-User Datagram Protocol) encapsulation and allows Layer 2 network segments to be stretched. The transport protocol on the physical data center network is IP plus UDP.

Figure 2-1. The Flannel network scheme (source)

Figure 2-2. A VXLAN packet (source)

Each Kubernetes worker node allocates its own virtual /24 address space out of a larger /9 block. For each node, this means one entry in the routing table, one entry in the ARP table (on the flannel.1 interface), and one entry in the switching table (FDB). They are added the first time a worker node starts or whenever a new node is discovered.

In addition, node-to-pod (or pod-to-pod) communication ultimately flows through the eth0 interface (as shown in the Flannel diagram above). This produces an additional ARP table entry for each corresponding source and destination host.

In our environment, this kind of communication is very common. For Kubernetes service objects an ELB is created, and Kubernetes registers every node with that ELB. The ELB knows nothing about pods, and the node it picks may not be the packet's final destination. When a node receives a packet from the ELB, it evaluates it against the iptables rules for the particular service and randomly selects a pod, possibly on another node.

At the time of the outage there were 605 nodes in the cluster. For the reasons described above, this was enough to exceed the default gc_thresh3 value. When this happens, not only do packets start to be dropped, but entire Flannel /24 virtual address spaces vanish from the ARP table. Node-to-pod communication and DNS queries break (DNS is hosted inside the cluster; see further in this article for details).

To fix this, you need to raise the gc_thresh1, gc_thresh2, and gc_thresh3 values and restart Flannel so that it re-registers the missing networks.
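
A minimal sketch of raising those kernel settings on a node is shown below; the exact threshold values are illustrative, and in practice the change would be rolled out via node configuration management and followed by a Flannel restart:

    import pathlib

    # Illustrative values only -- pick thresholds that fit your cluster's node count.
    ARP_GC_SETTINGS = {
        "gc_thresh1": 80_000,
        "gc_thresh2": 90_000,
        "gc_thresh3": 100_000,
    }

    def raise_arp_limits() -> None:
        """Write the IPv4 neighbor-table GC thresholds (requires root)."""
        base = pathlib.Path("/proc/sys/net/ipv4/neigh/default")
        for name, value in ARP_GC_SETTINGS.items():
            (base / name).write_text(f"{value}\n")

    if __name__ == "__main__":
        raise_arp_limits()
        # After raising the limits, Flannel still has to be restarted so it
        # re-registers the /24 networks that were evicted from the ARP table.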

Unexpected DNS scaling

During the migration we made heavy use of DNS to manage traffic and move services from the old infrastructure to Kubernetes step by step. We set relatively low TTLs for the associated RecordSets in Route53. When the old infrastructure was running on EC2 instances, our resolver configuration pointed at Amazon DNS. We took this for granted, and the impact of the low TTL on our services and on Amazon services (such as DynamoDB) went largely unnoticed.

As we migrated to Kubernetes, we found ourselves handling 250,000 DNS requests per second. As a result, applications began to experience significant and severe timeouts on DNS queries. This happened despite considerable tuning efforts and switching the DNS provider to CoreDNS (which at peak load reached 1,000 pods running on 120 cores).

While investigating other possible causes and solutions, we found an article describing a race condition affecting netfilter, the packet filtering framework in Linux. The timeouts we observed, together with an incrementing insert_failed counter on the Flannel interface, matched the article's findings.

The problem occurs during Source and Destination Network Address Translation (SNAT and DNAT) and the subsequent insertion into the conntrack table. One of the workarounds discussed internally and proposed by the community was to move DNS onto the worker node itself. In that case:

  • SNAT is not needed, because the traffic stays inside the node. It does not have to be routed through the eth0 interface.
  • DNAT is not needed, since the destination IP is local to the node rather than a pod randomly selected according to the iptables rules.

We decided to go with this approach. CoreDNS was deployed as a DaemonSet in Kubernetes, and we injected the node-local DNS server into each pod's resolv.conf by setting the --cluster-dns flag on kubelet. This solution proved effective against the DNS timeouts.

However, we still saw packet loss and an incrementing insert_failed counter on the Flannel interface. This persisted even after the workaround above, because we had only eliminated SNAT and/or DNAT for DNS traffic. The race condition remained for other kinds of traffic. Fortunately, most of our packets are TCP, and when the problem occurs they are simply retransmitted. We are still looking for a proper solution for all types of traffic.

Using Envoy for better load balancing

As we migrated backend services to Kubernetes, we began to suffer from unbalanced load across pods. We found that HTTP Keepalive caused ELB connections to stick to the first ready pods of each rollout, so most of the traffic flowed through a small percentage of the available pods. The first mitigation we tried was setting MaxSurge to 100% on new deployments for the worst offenders. The effect turned out to be minor and not sustainable for the larger deployments.

Another mitigation we used was artificially inflating resource requests for critical services, so that neighboring pods would have more headroom compared to other heavy pods. This would not work in the long run either, because it wasted resources. On top of that, our Node applications were single-threaded and could therefore only use one core. The only real solution was better load balancing.

We had long wanted to properly evaluate Envoy. The situation at hand let us roll it out in a very limited way and get immediate results. Envoy is a high-performance, open-source, Layer 7 proxy designed for large SOA applications. It can apply advanced load-balancing techniques, including automatic retries, circuit breaking, and global rate limiting. (Translator's note: you can read more about this in this article about Istio, which is built on top of Envoy.)

We came up with the following design: run an Envoy sidecar next to each pod with a single route, and point its cluster at the container locally via its port. To minimize potential cascading failures and keep a small blast radius, we used a fleet of front-proxy Envoy pods, one per Availability Zone (AZ) for each service. They relied on a simple service-discovery mechanism written by one of our engineers that simply returned the list of pods in each AZ for a given service.
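
That discovery service is not public; as a rough sketch of what a per-AZ pod listing could look like with the official Kubernetes Python client, assuming an "app" label per service and the zone node label that was common at the time:

    from collections import defaultdict
    from kubernetes import client, config

    def pods_by_az(namespace: str, service_label: str) -> dict[str, list[str]]:
        """Return {availability_zone: [pod IPs]} for one service."""
        config.load_incluster_config()      # or config.load_kube_config() outside the cluster
        v1 = client.CoreV1Api()

        # Cache node -> zone so we don't query the API once per pod.
        zones = {
            node.metadata.name: node.metadata.labels.get(
                "failure-domain.beta.kubernetes.io/zone", "unknown")
            for node in v1.list_node().items
        }

        result: dict[str, list[str]] = defaultdict(list)
        pods = v1.list_namespaced_pod(namespace, label_selector=f"app={service_label}")
        for pod in pods.items:
            if pod.status.phase == "Running" and pod.status.pod_ip:
                result[zones.get(pod.spec.node_name, "unknown")].append(pod.status.pod_ip)
        return dict(result)

    if __name__ == "__main__":
        print(pods_by_az("default", "example-service"))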

The service front-Envoys used this service-discovery mechanism with a single upstream cluster and route. We set sensible timeouts, tuned all of the circuit-breaker settings, and added a minimal retry configuration to help with occasional failures and ensure smooth deployments. We put a TCP ELB in front of each of these service front-Envoys. Even if the keepalive from our main front-proxy layer got pinned to certain Envoy pods, they could handle the load much better and were configured to balance via least_request to the backends.
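
Envoy's least_request policy samples two random healthy hosts and sends the request to the one with fewer active requests (power of two choices). The toy sketch below only illustrates why this evens out load when some pods start out hot because of keepalive pinning; it is not Envoy's implementation:

    import random
    from collections import Counter

    def least_request_pick(active_requests: dict[str, int]) -> str:
        """Power-of-two-choices: sample two hosts, take the less-loaded one."""
        a, b = random.sample(list(active_requests), 2)
        return a if active_requests[a] <= active_requests[b] else b

    # Even if one host starts out hot (e.g. keepalive pinning),
    # new requests drift toward the least-loaded hosts.
    hosts = {"pod-a": 90, "pod-b": 5, "pod-c": 5, "pod-d": 0}
    picks = Counter()
    for _ in range(10_000):
        picks[least_request_pick(hosts)] += 1
    print(picks)   # pod-a receives far fewer new requests than the idle pods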

For deployments, we used a preStop hook on both the application pods and the sidecar pods. The hook triggered a failure of the health check on the sidecar's admin endpoint and then slept for a while so that in-flight keepalive connections could finish.
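
A minimal sketch of such a preStop hook as a small script: it POSTs to the Envoy admin interface's /healthcheck/fail endpoint and then sleeps so in-flight connections can drain; the admin port (9901) and the drain window are assumptions, not values from the article:

    import time
    import urllib.request

    ENVOY_ADMIN = "http://127.0.0.1:9901"   # admin port is configurable; 9901 is just a common choice
    DRAIN_SECONDS = 20                      # illustrative drain window

    def pre_stop() -> None:
        # Tell Envoy to start failing its health checks so load balancers
        # stop sending new connections to this pod.
        req = urllib.request.Request(f"{ENVOY_ADMIN}/healthcheck/fail", method="POST")
        urllib.request.urlopen(req)
        # Give in-flight keepalive connections time to complete before the
        # container receives SIGTERM.
        time.sleep(DRAIN_SECONDS)

    if __name__ == "__main__":
        pre_stop()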

One of the reasons we were able to move so quickly was the rich set of metrics that we could easily plug into our standard Prometheus setup. This let us see exactly what was happening while we tuned configuration parameters and redistributed traffic.

The results were immediate and obvious. We started with the most unbalanced services, and Envoy currently runs in front of the 12 most important services in the cluster. This year we plan to move to a full service mesh with more advanced service discovery, circuit breaking, outlier detection, rate limiting, and tracing.

Figure 3-1. CPU usage of one service during the cutover to Envoy

Final results

Through this experience and the follow-up research, we have built a strong infrastructure team with solid skills in designing, deploying, and operating large Kubernetes clusters. All Tinder engineers now have the knowledge and experience to package containers and deploy applications to Kubernetes.

When the old infrastructure needed additional capacity, we had to wait several minutes for new EC2 instances to launch. Now containers start and begin serving traffic within seconds instead of minutes. Scheduling multiple containers on a single EC2 instance also gives us better horizontal density. As a result, we forecast a significant reduction in EC2 costs in 2019 compared to the previous year.

The migration took almost two years, and we completed it in March 2019. Today the Tinder platform runs exclusively on a Kubernetes cluster consisting of 200 services, 1,000 nodes, 15,000 pods, and 48,000 running containers. Infrastructure is no longer the exclusive domain of the operations teams. All of our engineers share this responsibility and control the build and deployment of their applications using only code.

source: www.habr.com
