How we at CIAN tamed terabytes of logs

Hi everyone, my name is Alexander. I work at CIAN as an engineer and handle system administration and the automation of infrastructure processes. In the comments to one of our previous articles we were asked to explain where we get 4 TB of logs per day and what we do with them. Yes, we have a lot of logs, and a separate infrastructure cluster has been built to process them, which lets us resolve problems quickly. In this article I will describe how, over the course of a year, we adapted it to cope with a constantly growing stream of data.

Where we started

Over the past few years the load on cian.ru has grown very quickly, and by the third quarter of 2018 the site's traffic had reached 11.2 million unique users per month. At that time we were losing about 40% of logs at critical moments, which meant we could not deal with incidents quickly and spent a great deal of time and effort resolving them. We also often could not find the root cause of a problem, so it recurred some time later. It was hell, and something had to be done about it.

At that time we stored logs in a cluster of 10 data nodes running ElasticSearch 5.5.2 with default index settings. It had been introduced more than a year earlier as a popular and affordable solution: back then the log stream was not that large, and there was no point in devising non-standard configurations.

Incoming logs were processed by Logstash, listening on different ports on five ElasticSearch coordinator nodes. Every index, regardless of size, consisted of five shards. Rotation was hourly and daily, so about 100 new shards appeared in the cluster every hour. While there were not many logs, the cluster coped well and nobody paid attention to its settings.

The problems of rapid growth

The volume of generated logs grew very quickly because two processes overlapped. On the one hand, the number of users of the service kept growing. On the other, we began actively moving to a microservice architecture, sawing up our old monoliths written in C# and Python. Several dozen new microservices that replaced parts of the monolith generated noticeably more logs for the infrastructure cluster.

It was scaling that brought us to the point where the cluster became practically unmanageable. When logs started arriving at a rate of 20 thousand messages per second, the frequent and pointless rotation pushed the number of shards up to 6 thousand, with more than 600 on each node.

This caused problems with RAM allocation, and when a node went down all of its shards started relocating at once, multiplying traffic and loading the other nodes, which made it practically impossible to write data to the cluster. During that time we were left without logs. And whenever a server had a problem, we lost 1/10 of the cluster outright. The large number of small indices added to the complexity.
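The scale of the problem follows from simple arithmetic. The sketch below is only an illustration: it assumes roughly 6,000 shards in total (which is what more than 600 shards per node across 10 data nodes implies) and an even spread of shards, and the function names are made up for this example.

```python
def shards_per_node(total_shards: int, data_nodes: int) -> float:
    """Average number of shards per data node, assuming an even spread."""
    return total_shards / data_nodes


def shards_to_relocate(total_shards: int, data_nodes: int, failed_nodes: int = 1) -> float:
    """Shards that have to be recovered elsewhere when nodes drop out."""
    return shards_per_node(total_shards, data_nodes) * failed_nodes


print(shards_per_node(6000, 10))               # 600.0
print(shards_to_relocate(6000, 10))            # 600.0
print(shards_to_relocate(6000, 10) / 6000)     # 0.1, i.e. 1/10 of the cluster
```

Every one of those shards is recovered over the network at the same moment, which is why a single node failure was enough to stall writes.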

Without logs we did not understand the causes of an incident and could sooner or later step on the same rake again. In our team's ideology that was unacceptable, because all of our working practices are built to do the opposite: never repeat the same problems. For that we needed the full volume of logs, delivered almost in real time, since the on-call engineering team monitored alerts not only from metrics but also from logs. To give a sense of scale, the total volume of logs at the time was about 2 TB per day.

We set a goal: eliminate log loss completely and cut the time it takes to deliver logs to the ELK cluster to at most 15 minutes during force majeure events (we later relied on this figure as an internal KPI).

A new rotation mechanism and hot-warm nodes

We began transforming the cluster by upgrading ElasticSearch from 5.5.2 to 6.4.3. Our version 5 cluster had gone down yet again, so we decided to shut it down and upgrade it completely, since there were no logs anyway. As a result the migration took only a couple of hours.

The most significant change at this stage was deploying Apache Kafka on three nodes with a coordinator as an intermediate buffer. The message broker saved us from losing logs during problems with ElasticSearch. At the same time we added 2 nodes to the cluster and switched to a hot-warm architecture with three "hot" nodes placed in different racks in the data center. Using a mask, we routed to them the logs that must not be lost under any circumstances: nginx logs and application error logs. Minor logs (debug, warning and so on) went to the remaining nodes, and after 24 hours the "important" logs were moved there from the "hot" nodes as well.
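The mask-based split between tiers can be sketched roughly as follows. This is not the configuration we actually ran: the masks and the node attribute name `box_type` are assumptions made for the example. The only piece taken from ElasticSearch itself is the `index.routing.allocation.require.<attribute>` index setting, which pins an index to nodes carrying a given custom attribute.

```python
from fnmatch import fnmatchcase

# Hypothetical masks for logs that must never be lost; the article names only
# nginx logs and application error logs, not the real index patterns.
HOT_MASKS = ("nginx*", "*_error*")

# "box_type" is an assumed custom node attribute name, not taken from the article.
NODE_ATTRIBUTE = "box_type"


def tier_for(index_name: str) -> str:
    """Return "hot" for indices matching a critical mask, otherwise "warm"."""
    return "hot" if any(fnmatchcase(index_name, mask) for mask in HOT_MASKS) else "warm"


def allocation_settings(index_name: str) -> dict:
    """Index settings that pin an index to its tier via shard allocation filtering."""
    return {f"index.routing.allocation.require.{NODE_ATTRIBUTE}": tier_for(index_name)}


print(allocation_settings("nginx-000001"))
print(allocation_settings("python_debug-000001"))
```

Moving an "important" index off the hot nodes after 24 hours then amounts to updating the same setting from "hot" to "warm" and letting the cluster relocate the shards.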

To avoid increasing the number of small indices, we switched from time-based rotation to the rollover mechanism. There were many reports on forums that rotation by index size is very unreliable, so we decided to rotate by the number of documents in the index. We analyzed each index and recorded the document count after which rotation should trigger. That is how we arrived at the optimal shard size: no more than 50 GB.
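A document-count threshold can be derived from the target shard size roughly like this. The sketch is illustrative only: the average document sizes and shard counts below are made-up inputs, and in practice the average would be measured on an existing index (store size divided by document count), since on-disk size depends on mappings and compression.

```python
def max_docs_for(avg_doc_bytes: int, primary_shards: int, target_shard_gb: int = 50) -> int:
    """Document count at which each primary shard reaches the target size."""
    target_bytes = target_shard_gb * 10**9
    return (target_bytes * primary_shards) // avg_doc_bytes


# Hypothetical inputs: 1.5 KB per document, five primary shards.
print(max_docs_for(avg_doc_bytes=1500, primary_shards=5))   # 166666666
# Hypothetical inputs: 2.5 KB per document, a single primary shard.
print(max_docs_for(avg_doc_bytes=2500, primary_shards=1))   # 20000000
```

The result is what goes into the `max_docs` rollover condition for that index.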

Optimizing the cluster

However, we had not got rid of the problems completely. Unfortunately, small indices still appeared: they never reached the configured volume, were not rotated, and were deleted by the global cleanup of indices older than three days, because we had removed rotation by date. This led to data loss, since the index disappeared from the cluster entirely, and an attempt to write to a non-existent index broke the logic of the curator we used for management. The write alias was turned into a regular index and broke the rollover logic, causing some indices to grow uncontrollably to as much as 600 GB.

For example, for the rotation config:

curator-elk-rollover.yaml

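---
# Curator action file: roll over each write alias once the index behind it
# reaches the configured number of documents.
actions:
  1:
    action: rollover
    options:
      name: "nginx_write"          # write alias for nginx logs
      conditions:
        max_docs: 100000000        # 100 million documents
  2:
    action: rollover
    options:
      name: "python_error_write"   # write alias for Python error logs
      conditions:
        max_docs: 10000000         # 10 million documents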

If the rollover alias did not exist, an error occurred:

ERROR     alias "nginx_write" not found.
ERROR     Failed to complete action: rollover.  <type 'exceptions.ValueError'>: Unable to perform index rollover with alias "nginx_write".
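One way to guard against this failure is to check that every write alias exists before the rollover actions run, and to recreate the first index together with its alias when one is missing. The sketch below is an assumption about how such a check could look, not what we actually ran: the helper names and the `<prefix>-000001` index naming are hypothetical, and the set of existing aliases would in practice come from the cluster. The request body follows the ElasticSearch create-index API, which accepts an `aliases` section.

```python
from typing import Dict, List, Set, Tuple

REQUIRED_ALIASES = ["nginx_write", "python_error_write"]  # names from the config above


def missing_write_aliases(required: List[str], existing: Set[str]) -> List[str]:
    """Aliases the rollover actions expect but the cluster does not have."""
    return [alias for alias in required if alias not in existing]


def bootstrap_request(alias: str) -> Tuple[str, Dict]:
    """Path and body of a create-index request that also creates the write alias.

    The "<prefix>-000001" name is a hypothetical scheme; a name ending in a dash
    and a number lets rollover derive the next index name on its own.
    """
    suffix = "_write"
    prefix = alias[: -len(suffix)] if alias.endswith(suffix) else alias
    return f"/{prefix}-000001", {"aliases": {alias: {}}}


existing = {"python_error_write"}  # assumed result of listing the cluster's aliases
for alias in missing_write_aliases(REQUIRED_ALIASES, existing):
    print(bootstrap_request(alias))
```

Running such a check before curator means a deleted index is recreated behind its alias instead of being silently replaced by a plain index with the alias's name.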

We left the solution to this problem for the next iteration and took up another issue: we switched Logstash, which processes incoming logs (removing unnecessary information and enriching them), to pull logic. We put it in docker, launched via docker-compose, and placed logstash-exporter alongside it, which exposes metrics to Prometheus for operational monitoring of the log stream. This gave us the ability to smoothly change the number of logstash instances responsible for processing each type of log.

While we were improving the cluster, traffic to cian.ru grew to 12.8 million unique users per month. As a result, our changes lagged slightly behind the changes in production, and we found that the "warm" nodes could not cope with the load and slowed down the entire log delivery. We received "hot" data without failures, but we had to intervene in the delivery of the rest and perform a manual rollover to distribute indices evenly.

At the same time, scaling and reconfiguring the logstash instances in the cluster was complicated by the fact that it was a local docker-compose setup and every action was performed by hand (to add new endpoints, we had to go through all the servers manually and run docker-compose up -d on each of them).

Redistributing the logs

In September of this year we were still cutting up the monolith, the load on the cluster kept growing, and the log stream was approaching 30 thousand messages per second.

We began the next iteration with a hardware upgrade. We went from five coordinators to three, replaced the data nodes, and came out ahead in both cost and storage capacity. We use two configurations for the nodes:

  • For "hot" nodes: E3-1270 v6 / 960 GB SSD / 32 GB x 3 x 2 (3 for Hot1 and 3 for Hot2).
  • For "warm" nodes: E3-1230 v6 / 4 TB SSD / 32 GB x 4.

In this iteration we moved the index with microservice logs, which takes up as much space as the logs of the front-end nginx servers, to a second group of three "hot" nodes. We now keep data on the "hot" nodes for 20 hours and then move it to the "warm" nodes, where the rest of the logs live.

We solved the problem of disappearing small indices by reconfiguring their rotation. Indices are now rotated every 23 hours in any case, even if they contain little data. This slightly increased the number of shards (there are about 800 of them), but from the standpoint of cluster performance it is tolerable.
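The effect of adding a time condition next to the document-count condition can be sketched as below. Rollover conditions in ElasticSearch are alternatives, so meeting any one of them is enough. The function and parameter names are illustrative, not part of any real API.

```python
def should_rollover(doc_count: int, age_hours: float,
                    max_docs: int, max_age_hours: float = 23) -> bool:
    """True when either the document-count or the age condition is met."""
    return doc_count >= max_docs or age_hours >= max_age_hours


# A small index that never reaches max_docs is still rotated once it is old enough.
print(should_rollover(doc_count=1_000, age_hours=23.5, max_docs=10_000_000))       # True
# A young, small index is left alone.
print(should_rollover(doc_count=1_000, age_hours=5, max_docs=10_000_000))          # False
# A busy index still rotates on document count long before the age limit.
print(should_rollover(doc_count=10_000_000, age_hours=1, max_docs=10_000_000))     # True
```

With the age condition in place, the index behind a write alias is always much younger than the three-day cleanup threshold, so the cleanup no longer removes an index that is still being written to.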

As a result, the cluster has six "hot" nodes and only four "warm" ones. This causes a slight delay on queries over long time intervals, but increasing the number of nodes in the future will solve that.

This iteration also fixed the lack of semi-automatic scaling. To do this we deployed an infrastructure Nomad cluster, similar to the one we already run in production. For now the number of Logstash instances does not change automatically depending on load, but we will get there.

Plans for the future

The resulting configuration scales well, and we now store 13.3 TB of data: all logs for 4 days, which is what we need for urgent analysis of alerts. We convert some of the logs into metrics, which we send to Graphite. To make the engineers' work easier, we have metrics for the infrastructure cluster and scripts for automated repair of common problems. After we increase the number of data nodes, which is planned for next year, we will extend data retention from 4 to 7 days. That will be enough for operational work, since we always try to investigate incidents as soon as possible, and telemetry data is available for long-term investigations.
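A back-of-the-envelope estimate of what the longer retention means for storage, assuming the stored volume scales linearly with the number of days and the daily log volume stays at today's level (both are simplifying assumptions, and the function names are made up for the example):

```python
def stored_per_day_tb(stored_tb: float, retention_days: int) -> float:
    """Average volume kept per day of retention."""
    return stored_tb / retention_days


def storage_for_retention_tb(stored_tb: float, current_days: int, target_days: int) -> float:
    """Storage needed for a longer retention window at the same daily volume."""
    return stored_per_day_tb(stored_tb, current_days) * target_days


print(round(stored_per_day_tb(13.3, 4), 2))             # about 3.3 TB per day
print(round(storage_for_retention_tb(13.3, 4, 7), 1))   # about 23.3 TB for 7 days
```

Since traffic keeps growing, the real figure will likely be higher, which is why the change is tied to adding data nodes first.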

In October 2019, traffic to cian.ru had already grown to 15.3 million unique users per month. This was a serious test of the architecture we had built for log delivery.

We are now preparing to upgrade ElasticSearch to version 7. For that, however, we will have to update the mappings of many indices in ElasticSearch, since they were carried over from version 5.5 and were declared deprecated in version 6 (in version 7 they simply do not exist). This means that some kind of force majeure is bound to happen during the upgrade, leaving us without logs while the issue is resolved. In version 7, what we look forward to most is Kibana with its improved interface and new filters.

We achieved our main goal: we stopped losing logs and reduced the downtime of the infrastructure cluster from 2-3 crashes per week to a couple of hours of maintenance work per month. All of this work is practically invisible in production. But now we can determine exactly what is happening with our service, we can do so quickly and calmly, and we do not have to worry that logs will be lost. All in all, we are satisfied, happy, and getting ready for new feats, which we will talk about later.

source: www.habr.com
