Elasticsearch cluster 200 TB+

Many people struggle with Elasticsearch. But what happens when you want to use it to store logs "in a particularly large volume"? And to survive the failure of any one of several data centers painlessly? What kind of architecture should you build, and which pitfalls will you stumble over?

We at Odnoklassniki decided to use Elasticsearch for log management, and now we are sharing our experience with Habr: both the architecture and the pitfalls.

I am Pyotr Zaitsev, and I work as a system administrator at Odnoklassniki. Before that I was also an admin and worked with Manticore Search, Sphinx search, and Elasticsearch. If yet another ...search appears, I will probably work with it too. I also take part in a number of open source projects as a volunteer.

When I joined Odnoklassniki, I rashly said at the interview that I could work with Elasticsearch. Once I had settled in and completed a few simple tasks, I was given a big one: reworking the log management system that existed at the time.

Requirements

The requirements for the system were formulated as follows:

  • Graylog was to be used as the frontend. The company already had experience with this product, and programmers and testers knew it and were used to it.
  • Data volume: 50-80 thousand messages per second on average, but when something breaks the traffic is not limited by anything and can reach 2-3 million lines per second.
  • After discussing search latency requirements with our users, we understood that the typical usage pattern is this: people look for their application's logs from the last two days and do not want to wait more than a second for the result of a formulated query.
  • The administrators insisted that the system be easy to scale when necessary, without requiring them to dig deep into how it works.
  • The only maintenance these systems should need from time to time is replacing some hardware.
  • In addition, Odnoklassniki has a fine technical tradition: any service we launch must survive a data center failure (sudden, unplanned, and at absolutely any time).

The last requirement cost us the most effort in this project, and I will talk about it in more detail.

Environment

We operate in four data centers, while Elasticsearch data nodes can be placed in only three of them (for a number of non-technical reasons).

These four data centers contain roughly 18 thousand different log sources: hardware, containers, virtual machines.

An important detail: the cluster runs in podman containers, not on physical machines but on our own cloud product, one-cloud. Containers are guaranteed 2 cores, comparable to 2.0 GHz v4, and can use the remaining cores when those are idle.

In other words:

[Image]

Topology

Initially I saw the overall shape of the solution as follows:

  • 3-4 VIPs sit behind the A record of the Graylog domain; this is the address the logs are sent to.
  • Each VIP is an LVS balancer.
  • Behind it, the logs go to a battery of Graylog nodes; some of the data arrives in GELF format, some in syslog format.
  • All of this is then written in large batches to a battery of Elasticsearch coordinators.
  • They, in turn, send write and read requests to the relevant data nodes.

[Image]

Terminology

Perhaps not everyone understands the terminology in detail, so I would like to dwell on it briefly.

Elasticsearch has several node types: master, coordinator, and data node. There are two more types, for various log transformations and for communication between clusters, but we used only the ones listed.

Master
It pings all nodes in the cluster, maintains an up-to-date cluster map and distributes it to the nodes, processes event logic, and performs various kinds of cluster-wide housekeeping.

Coordinator
It performs a single task: it accepts read or write requests from clients and routes that traffic. For a write request it will, most likely, ask the master which shard of the relevant index to put it in, and then forward the request.

Data node
It stores data, executes search queries arriving from outside, and performs operations on the shards located on it.

Graylog
This is something like a fusion of Kibana and Logstash in the ELK stack. Graylog combines a UI and a log processing pipeline. Under the hood, Graylog runs Kafka and Zookeeper, which provide connectivity to Graylog as a cluster. Graylog can cache logs (Kafka) when Elasticsearch is unavailable, retry failed read and write requests, and group and tag logs according to specified rules. Like Logstash, Graylog can modify lines before writing them to Elasticsearch.

In addition, Graylog has built-in service discovery: starting from a single available Elasticsearch node, it obtains the whole cluster map and filters it by a specific tag, which makes it possible to direct requests to specific containers.

Visually it looks something like this:

[Image]

This is a screenshot from a specific instance. Here we build a histogram for the search query and show the matching lines.

Indices

Returning to the system architecture, I would like to describe in more detail how we built the index model so that it all works correctly.

In the diagram above, this is the lowest level: the Elasticsearch data nodes.

An index is a large virtual entity made up of Elasticsearch shards. Each shard is itself nothing more than a Lucene index. And each Lucene index, in turn, consists of one or more segments.

When designing, we reasoned that to meet the read-speed requirement on a large volume of data we had to "spread" this data evenly across the data nodes.

It followed that the number of shards per index (including replicas) should be strictly equal to the number of data nodes. First, to provide a replication factor of two (that is, we can lose half of the cluster). Second, so that read and write requests are processed on at least half of the cluster.

We initially set the retention period to 30 days.

The distribution of shards can be shown graphically as follows:

[Image]

The whole dark grey rectangle is an index. The red square on the left is the primary shard, the first one in the index. The blue square is a replica shard. They are located in different data centers.

When we add another shard, it goes to the third data center. In the end we get this structure, which makes it possible to lose a DC without losing data consistency:

[Image]
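
The placement rule described here (a primary and its replica always in different data centers, with shards spread over three DCs) can be checked with a small simulation. This is only a sketch: the data center names and the rotation scheme are assumptions, and the article does not say how the real cluster enforces the rule. In Elasticsearch this kind of constraint is usually expressed with shard allocation awareness.

```python
DATACENTERS = ["dc1", "dc2", "dc3"]  # hypothetical names


def place_shards(primaries: int) -> dict:
    """Put each primary in one DC and its replica in the next DC in the ring."""
    layout = {}
    for shard in range(primaries):
        primary_dc = DATACENTERS[shard % len(DATACENTERS)]
        replica_dc = DATACENTERS[(shard + 1) % len(DATACENTERS)]
        layout[shard] = {"primary": primary_dc, "replica": replica_dc}
    return layout


def survives_loss(layout: dict, lost_dc: str) -> bool:
    """True if every shard keeps at least one copy outside lost_dc."""
    return all(
        copies["primary"] != lost_dc or copies["replica"] != lost_dc
        for copies in layout.values()
    )


layout = place_shards(180)  # assumption: 180 primaries + 180 replicas = 360 copies
print(all(survives_loss(layout, dc) for dc in DATACENTERS))
```

With this layout any single data center can disappear and every shard still has one live copy, which is the property the requirements ask for.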

We set index rotation, that is, creating a new index and deleting the oldest one, to 48 hours (following the usage pattern: the last 48 hours are searched most often).

This rotation interval was chosen for the following reasons:

When a search request reaches a particular data node, it is more efficient, performance-wise, to query a single shard whose size is comparable to the size of the node's heap. This lets the node keep the "hot" part of the index in the heap and access it quickly. When there are many "hot parts", index search speed degrades.

When a node starts executing a search query on one shard, it allocates a number of threads equal to the number of hyper-threading cores of the physical machine. If a search query touches a large number of shards, the number of threads grows proportionally. This hurts search speed and also hurts the indexing of new data.

To provide the required search latency, we decided to use SSDs. To process requests quickly, the machines hosting these containers had to have at least 56 cores. The figure of 56 was chosen as a conditionally sufficient value that determines how many threads Elasticsearch will spawn during operation. In Elasticsearch, many thread pool parameters depend directly on the number of available cores, which in turn directly affects the number of nodes required in the cluster, on the principle "fewer cores, more nodes".
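
To make the dependence on core count concrete, here is a minimal sketch of how two of the fixed thread pools are sized. The formulas are the ones given in the Elasticsearch thread pool documentation for the 6.x and 7.x lines as far as I know them; treat them as an assumption and check the documentation for your version.

```python
def search_pool_size(cores: int) -> int:
    """Fixed 'search' pool: int((cores * 3) / 2) + 1 threads."""
    return (cores * 3) // 2 + 1


def write_pool_size(cores: int) -> int:
    """Fixed 'write' pool: one thread per available core."""
    return cores


for cores in (2, 16, 56):
    print(cores, search_pool_size(cores), write_pool_size(cores))
```

On a 56-core host this gives 85 search threads and 56 write threads per node, against 4 and 2 on a 2-core one, which is the "fewer cores, more nodes" trade-off in numbers.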

As a result, we found that a shard weighs about 20 gigabytes on average, and there are 360 shards per index. Accordingly, if we rotate indices once every 48 hours, we have 15 of them. Each index holds 2 days of data.
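
The arithmetic behind these figures can be written out as a worked example. The inputs come from the text (30-day retention, 48-hour rotation, one shard copy per data node with 360 data nodes, about 20 GB per shard); the resulting total is only an estimate at that average, not a measured number.

```python
RETENTION_DAYS = 30
ROTATION_HOURS = 48
SHARDS_PER_INDEX = 360   # primaries plus replicas, one per data node
AVG_SHARD_GB = 20        # average shard size quoted in the text

indices = RETENTION_DAYS * 24 // ROTATION_HOURS
days_per_index = ROTATION_HOURS // 24
total_shards = indices * SHARDS_PER_INDEX
total_gb = total_shards * AVG_SHARD_GB

print("indices kept:", indices)
print("days per index:", days_per_index)
print("shard copies in the cluster:", total_shards)
print("estimated data on disk, TB:", total_gb / 1000)
```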

Data write and read paths

Let's look at how data is written in this system.

Say a request arrives from Graylog at a coordinator. For example, we want to index 2-3 thousand lines.

Having received the request from Graylog, the coordinator asks the master: "The indexing request specifies an index, but not which shard to write to."

The master replies: "Write this to shard number 71", after which the request is sent straight to the relevant data node, where primary shard number 71 lives.

After that, the transaction log is replicated to the replica shard, which lives in another data center.
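
How a document ends up in "shard number 71" can be sketched with the routing rule from the Elasticsearch documentation, which is roughly: hash of the routing value (the document id by default) modulo the number of primary shards. The sketch below is an illustration under assumptions: CRC32 stands in for the Murmur3 hash Elasticsearch uses, and the shard count of 180 assumes half of the 360 shard copies are primaries.

```python
import zlib


def route_to_shard(routing_value: str, primary_shards: int) -> int:
    """Pick a primary shard number from the routing value.

    zlib.crc32 is only a stand-in here; Elasticsearch uses Murmur3.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % primary_shards


PRIMARY_SHARDS = 180  # assumption, see the lead-in

for doc_id in ("log-0001", "log-0002", "log-0003"):
    print(doc_id, "->", route_to_shard(doc_id, PRIMARY_SHARDS))
```

Because the result depends only on the routing value and the shard count, the same document always maps to the same primary shard.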

A search request arrives from Graylog at a coordinator. The coordinator routes it according to the index, and Elasticsearch distributes requests between the primary shard and the replica shard in round-robin fashion.

The 180 nodes respond unevenly, and while they are responding, the coordinator accumulates the results already "spat out" by the faster data nodes. Then, once either all the results have arrived or the request has hit its timeout, it returns everything to the client.
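
The scatter-gather behaviour described here can be illustrated with a toy coordinator. This is a sketch, not Elasticsearch code: the node names, delays, and the timeout value are made up, and a thread pool with sleeps stands in for real network calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def query_node(name: str, delay: float) -> str:
    """Stand-in for a data node: sleeps, then returns its partial hits."""
    time.sleep(delay)
    return f"hits from {name}"


def scatter_gather(nodes: dict, timeout: float):
    """Fan the query out and keep whatever arrived before the timeout."""
    pool = ThreadPoolExecutor(max_workers=len(nodes))
    futures = {pool.submit(query_node, n, d): n for n, d in nodes.items()}
    done, pending = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on the stragglers
    results = sorted(f.result() for f in done)
    late_nodes = sorted(futures[f] for f in pending)
    return results, late_nodes


results, late = scatter_gather(
    {"node-a": 0.01, "node-b": 0.02, "node-c": 0.5}, timeout=0.2
)
print(results)
print("timed out:", late)
```

Note that the partial results sit in the coordinator's memory for as long as the slowest node takes, which matters for the garbage collection problems discussed later in the article.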

On average, this whole system answers search queries over the last 48 hours in 300-400 ms, excluding queries with a leading wildcard.

"Flowers" with Elasticsearch: Java setup

To make it all work the way we wanted, we spent a long time debugging all sorts of things in the cluster.

The first group of problems we found was related to how Java is preconfigured by default in Elasticsearch.

Problem one
We saw a very large number of reports that, at the Lucene level, segment merges running as background jobs failed with an error. The logs showed that it was an OutOfMemoryError. Telemetry showed that the heap had free space, and it was not clear why the operation was failing.

It turned out that Lucene index merges happen outside the heap, and the containers are quite strictly limited in the resources they may consume. Only the heap fit into those resources (the heap.size value was roughly equal to RAM), and some off-heap operations crashed with a memory allocation error if for some reason they did not fit into the ~500 MB left before the limit.

The fix was trivial: we increased the amount of RAM available to the container, after which we forgot we had ever had such problems.

Problem two
4-5 days after launching the cluster, we noticed that data nodes began to drop out of the cluster periodically and rejoin it 10-20 seconds later.

When we dug in, it turned out that this off-heap memory in Elasticsearch is not controlled in any way. When we gave the container more memory, we were able to fill the direct buffer pools with various data, and they were cleared only after an explicit GC was triggered from Elasticsearch.

In some cases this operation took a long time, and in that time the cluster managed to mark the node as having left. This problem is described well here.

The solution was as follows: we limited Java's ability to use the bulk of the memory outside the heap for these operations. We capped it at 16 gigabytes (-XX:MaxDirectMemorySize=16g), which ensured that the explicit GC was invoked much more often and completed much faster, so it no longer destabilized the cluster.
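
The first two problems both come down to a memory budget: heap plus direct buffers must leave headroom inside the container limit. The helper below is a minimal sketch of that check. The 31 GB heap and the 16 GB direct memory cap are the figures from the article; the 52 GB container limit is a hypothetical number chosen for the example.

```python
def jvm_flags(heap_gb: int, direct_gb: int) -> list:
    """JVM options for a fixed-size heap and a capped direct-buffer pool."""
    return [
        f"-Xms{heap_gb}g",
        f"-Xmx{heap_gb}g",
        f"-XX:MaxDirectMemorySize={direct_gb}g",
    ]


def headroom_gb(container_gb: int, heap_gb: int, direct_gb: int) -> int:
    """Memory left for metaspace, thread stacks, and other native allocations."""
    return container_gb - heap_gb - direct_gb


CONTAINER_GB = 52  # hypothetical container limit, not a figure from the article

print(" ".join(jvm_flags(heap_gb=31, direct_gb=16)))
print("headroom, GB:", headroom_gb(CONTAINER_GB, heap_gb=31, direct_gb=16))
```

A negative or near-zero headroom is the situation from problem one, where only the heap fit into the container.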

Problem three
If you think the problems with "nodes leaving the cluster at the most unexpected moment" ended there, you are mistaken.

When we configured how indices are handled, we chose mmapfs to reduce search time on fresh, heavily segmented shards. This was quite a blunder, because with mmapfs the file is mapped into RAM and we then work with the mapped file. As a result, when the GC tries to stop the application's threads, we take a very long time to reach the safepoint, and on the way there the application stops answering the master's requests about whether it is alive. The master therefore concludes that the node is no longer in the cluster. 5-10 seconds later the garbage collector finishes, the node comes back to life, rejoins the cluster, and starts initializing shards. It all felt very much like "the production we deserved" and was not fit for anything serious.

To get rid of this behaviour, we first switched to the standard niofs, and then, when we migrated from the fifth versions of Elastic to the sixth, we tried hybridfs, where the problem did not reproduce. You can read more about storage types here.
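
For reference, the store type is selected per index through the index.store.type setting. The sketch below only builds the settings body that would be sent when creating an index; how the authors actually applied it (for example through an index template managed by Graylog) is not stated, so the helper name and workflow are assumptions.

```python
import json

KNOWN_STORE_TYPES = {"fs", "niofs", "mmapfs", "hybridfs"}


def index_settings(store_type: str) -> str:
    """Build a create-index settings body with the given store type."""
    if store_type not in KNOWN_STORE_TYPES:
        raise ValueError(f"unknown store type: {store_type}")
    return json.dumps({"settings": {"index.store.type": store_type}}, indent=2)


print(index_settings("niofs"))
```

Since this is a static index setting, it takes effect for newly created indices, which fits well with a 48-hour rotation.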

Problem four
Then there was another very interesting problem, which took us a record time to deal with. We chased it for 2-3 months because its pattern was absolutely incomprehensible.

Sometimes our coordinators went into Full GC, usually some time after lunch, and never came back. In the GC pause logs it looked like this: everything is fine, fine, fine, and then suddenly everything is very bad.

At first we thought we had a malicious user who was launching some kind of request that knocked the coordinator out of working mode. We logged requests for a long time, trying to work out what was happening.

In the end it turned out that when a user launches a huge request and it lands on a particular Elasticsearch coordinator, some nodes take longer to respond than others.

While the coordinator waits for a response from all the nodes, it accumulates the results sent by the nodes that have already responded. For the GC this means that our heap usage pattern changes very quickly, and the GC we were using could not cope with that.

The only fix we found to change the cluster's behaviour in this situation was migrating to JDK 13 and using the Shenandoah garbage collector. That solved the problem, and our coordinators stopped falling over.

This is where the problems with Java ended and the problems with throughput began.

"Berries" with Elasticsearch: throughput

Throughput problems mean that our cluster runs stably, but at peaks in the number of indexed documents and during maneuvers its performance is insufficient.

The first symptom we ran into: during some "explosions" in production, when a very large number of logs is suddenly generated, the es_rejected_execution error starts flashing frequently in Graylog.

This happened because thread_pool.write.queue on a single data node, until Elasticsearch manages to process the indexing request and flush the data to the shard on disk, can hold only 200 requests by default. The Elasticsearch documentation says very little about this parameter; only the maximum number of threads and the default size are given.

Naturally, we went to tune this value and found the following: in our setup, up to 300 requests are held quite well, while a higher value is fraught with us flying off into Full GC again.
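
The effect of the queue limit can be shown with a toy bounded queue. This is a simplified model, not Elasticsearch internals: it ignores the worker threads that drain the queue and simply counts how many requests of a sudden burst are rejected. The burst size of 350 is made up; 200 and 300 are the default and tuned sizes from the text. In the Elasticsearch documentation the setting appears, as far as I know, as thread_pool.write.queue_size.

```python
from collections import deque


class RejectedExecution(Exception):
    """Stand-in for the es_rejected_execution error seen in Graylog."""


class WriteQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = deque()

    def submit(self, request) -> None:
        if len(self.items) >= self.capacity:
            raise RejectedExecution("write queue is full")
        self.items.append(request)


def count_rejections(burst: int, capacity: int) -> int:
    """Send `burst` bulk requests at once, before any can be processed."""
    queue, rejected = WriteQueue(capacity), 0
    for i in range(burst):
        try:
            queue.submit(i)
        except RejectedExecution:
            rejected += 1
    return rejected


print(count_rejections(burst=350, capacity=200))  # default-sized queue
print(count_rejections(burst=350, capacity=300))  # tuned value
```

A larger queue rejects less but keeps more pending bulk requests in the heap, which matches the observation that values above 300 led back to Full GC.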

In addition, since these are batches of messages arriving within a single request, we had to tweak Graylog so that it writes not often and in small batches, but in huge batches, or once every 3 seconds if the batch is still not full. As a result, the data we write to Elasticsearch becomes available not in two seconds but in five (which suits us fine), but the number of retries needed to push a large pile of data through is reduced.
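
The "flush when full, or every 3 seconds" policy can be sketched as a small batcher. This illustrates the policy only and is not Graylog code; the class, the injected clock, and the batch size of 3 are assumptions made so the example is short and deterministic. Graylog exposes comparable knobs in its server configuration (output_batch_size and output_flush_interval); the batch size the authors chose is not given.

```python
class Batcher:
    """Flush when the batch is full or when flush_interval seconds have passed."""

    def __init__(self, batch_size: int, flush_interval: float, sink, clock):
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.sink = sink      # callable that receives a list of messages
        self.clock = clock    # callable returning the current time in seconds
        self.buffer = []
        self.last_flush = clock()

    def add(self, message) -> None:
        self.buffer.append(message)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        full = len(self.buffer) >= self.batch_size
        stale = self.clock() - self.last_flush >= self.flush_interval
        if self.buffer and (full or stale):
            self.sink(self.buffer)
            self.buffer = []
            self.last_flush = self.clock()


now = [0.0]   # fake clock so the example does not need to sleep
sent = []
batcher = Batcher(batch_size=3, flush_interval=3.0, sink=sent.append, clock=lambda: now[0])

batcher.add("a")
batcher.add("b")          # not full, not stale: nothing is sent yet
now[0] = 3.5
batcher.maybe_flush()     # 3 seconds elapsed: the partial batch goes out
for message in ("c", "d", "e"):
    batcher.add(message)  # the third message fills the batch
print(sent)
```

Fewer, larger bulk requests occupy fewer slots in the write queue for the same number of messages, at the cost of a few seconds of extra indexing delay.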

This is especially important at those moments when something has crashed somewhere and is furiously reporting it, so that we do not end up with a completely spammed Elastic and, some time later, Graylog nodes that are inoperable because of clogged buffers.

On top of that, whenever we had such explosions in production, we got complaints from programmers and testers: at the very moment they really needed those logs, they were served very slowly.

We started to investigate. On the one hand, it was clear that both search queries and indexing requests are processed, in essence, on the same physical machines, and one way or another there will be some slowdowns.

But this could be partly worked around, because the sixth versions of Elasticsearch introduced an algorithm that distributes queries among the relevant data nodes not by the random round-robin principle (the container that is indexing and holds the primary shard may be very busy and unable to respond quickly), but by forwarding the request to a less loaded container with a replica shard, which will respond noticeably faster. In other words, we arrived at use_adaptive_replica_selection: true.

The read path now looks like this:

[Image]

Switching to this algorithm significantly improved query times at those moments when we had a large stream of logs to write.
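
The idea behind adaptive replica selection can be sketched as "send the query to the copy that currently looks cheapest". The scoring function below is a deliberate simplification and is not the formula Elasticsearch uses; the node names and numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class ShardCopy:
    node: str
    ewma_response_ms: float  # smoothed response time seen by the coordinator
    outstanding: int         # requests sent to the node and not yet answered


def rank(copy: ShardCopy) -> float:
    """Lower is better. A simplified score, not the real Elasticsearch formula."""
    return copy.ewma_response_ms * (1 + copy.outstanding)


def pick_copy(copies):
    """Choose the shard copy with the lowest score."""
    return min(copies, key=rank)


copies = [
    ShardCopy("primary-on-busy-node", ewma_response_ms=120.0, outstanding=8),
    ShardCopy("replica-on-idle-node", ewma_response_ms=15.0, outstanding=1),
]
print(pick_copy(copies).node)
```

With round-robin, half of the queries would still land on the busy node; with a load-aware choice they go to the idle replica until the statistics change.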

Finally, the main problem was painlessly removing a data center.

What we wanted from the cluster immediately after losing the connection to one DC:

  • If the current master is in the failed data center, a new one is elected and the role moves to another node in another DC.
  • The master quickly removes all unreachable nodes from the cluster.
  • Based on the remaining ones, it understands which primary shards were in the lost data center, quickly promotes the complementary replica shards in the remaining data centers, and we carry on indexing data.
  • As a result, the cluster's write and read throughput degrades gracefully, but overall everything keeps working, slowly but stably.

As it turned out, we wanted something like this:

[Image]

And we got the following:

[Image]

How did that happen?

When the data center went down, our master became the bottleneck.

Why?

The thing is that the master has a TaskBatcher, which is responsible for distributing certain tasks and events in the cluster. Any node leaving, any promotion of a shard from replica to primary, any task to create a shard somewhere: all of this goes first to the TaskBatcher, where it is processed sequentially and in a single thread.

When one data center was withdrawn, all the data nodes in the surviving data centers considered it their duty to inform the master: "we have lost such-and-such shards and such-and-such data nodes".

The surviving data nodes sent all this information to the current master and tried to wait for confirmation that it had been accepted. They never got it, because the master was receiving tasks faster than it could answer. The nodes timed out and repeated their requests, and the master by then was not even trying to answer them, being completely absorbed in sorting requests by priority.

In the end, the data nodes spammed the master to the point where it went into Full GC. After that, the master role moved to the next node, exactly the same thing happened to it, and as a result the cluster collapsed completely.
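
Why a single-threaded task queue collapses under this kind of storm can be shown with a few lines of arithmetic. The rates below are made-up numbers for illustration; the point is only that once arrivals exceed what one thread can handle, the backlog grows without bound, and every retry adds to the arrivals.

```python
def backlog_over_time(arrivals_per_s: int, handled_per_s: int, seconds: int) -> list:
    """Queue length at the end of each second for a single-threaded worker."""
    backlog, history = 0, []
    for _ in range(seconds):
        backlog = max(0, backlog + arrivals_per_s - handled_per_s)
        history.append(backlog)
    return history


# Storm: nodes report failures (and retry) faster than the master can process.
print(backlog_over_time(arrivals_per_s=500, handled_per_s=200, seconds=5))
# Normal operation: the queue stays empty.
print(backlog_over_time(arrivals_per_s=150, handled_per_s=200, seconds=5))
```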

We took measurements: before version 6.4.0, where this was fixed, it was enough to take out just 10 data nodes out of 360 simultaneously to bring the whole cluster down.

It looked something like this:

[Image]

After version 6.4.0, where this terrible bug was fixed, the data nodes stopped killing the master. But that did not make it any "smarter". Namely: when we take out 2, 3, or 10 (any number other than one) data nodes, the master receives a first message saying that node A has left, and tries to tell node B, node C, and node D about it.

For now, this can only be dealt with by setting a timeout on attempts to tell someone about something, equal to roughly 20-30 seconds, and thereby controlling how fast a data center leaves the cluster.

In principle, this fits the requirements originally set for the final product within the project, but from the standpoint of "pure science" it is a bug. One which, by the way, the developers successfully fixed in version 7.2.

Moreover, when a data node went away, it turned out that spreading the news of its departure was more important than telling the whole cluster that it held such-and-such primary shards (so that a replica shard in another data center could be promoted to primary and data could be written to it).

Therefore, when everything has already died down, the departed data nodes are not immediately marked as stale. Accordingly, we are forced to wait until all the pings to the departed data nodes have timed out, and only then does the cluster start telling us that here, here, and here we need to carry on writing data. You can read more about this here.

As a result, withdrawing a data center today takes us about 5 minutes at peak hour. For such a large and unwieldy colossus, that is a pretty good result.

As a result, we arrived at the following setup:

  • We have 360 data nodes with 700-gigabyte disks.
  • 60 coordinators route traffic to these data nodes.
  • 40 masters, which we kept as a kind of legacy from versions before 6.4.0: to survive the withdrawal of a data center, we were mentally prepared to lose several machines in order to be guaranteed a quorum of masters even in the worst-case scenario.
  • Any attempt to combine roles in one container ran into the fact that sooner or later the node broke under load.
  • The whole cluster uses a heap.size of 31 gigabytes: every attempt to reduce it led either to some nodes being killed by heavy search queries with a leading wildcard or to tripping the circuit breaker in Elasticsearch itself.
  • In addition, to keep search fast, we tried to keep the number of objects in the cluster as small as possible, so as to process as few events as possible in the bottleneck that the master turned out to be.
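
The quorum reasoning behind the 40 masters can be checked with a short calculation. A majority of master-eligible nodes is floor(n / 2) + 1. How the 40 masters are split across data centers is not stated in the article, so both splits below are hypothetical.

```python
def quorum(master_eligible: int) -> int:
    """Majority needed to elect a master: floor(n / 2) + 1."""
    return master_eligible // 2 + 1


def survives_dc_loss(masters_per_dc: dict) -> dict:
    """For each DC, report whether the remaining masters still form a quorum."""
    total = sum(masters_per_dc.values())
    need = quorum(total)
    return {dc: total - count >= need for dc, count in masters_per_dc.items()}


print(quorum(40))
# Hypothetical even split: losing any one DC keeps a quorum.
print(survives_dc_loss({"dc1": 14, "dc2": 13, "dc3": 13}))
# Hypothetical uneven split: losing dc1 leaves 20 masters, one short of 21.
print(survives_dc_loss({"dc1": 20, "dc2": 10, "dc3": 10}))
```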

Finally, about monitoring

To make sure all of this works as intended, we monitor the following:

  • Each data node reports to our cloud that it exists and which shards are on it. When we switch something off somewhere, the cluster reports 2-3 seconds later that in center A we switched off nodes 2, 3, and 4, which means that in the other data centers we must under no circumstances switch off the nodes that hold the only remaining copy of a shard.
  • Knowing how the master behaves, we watch the number of pending tasks very closely. Even one stuck task, if it does not time out in time, can in theory, in some emergency, become the reason why, for example, promoting a replica shard to primary does not work, and indexing then stops.
  • We also look very closely at garbage collector pauses, because we already had great difficulties with them during optimization.
  • Rejections per thread pool, to understand in advance where the bottleneck is.
  • And the standard metrics such as heap, RAM, and I/O.
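
Two of the checks in this list, pending master tasks and thread pool rejections, can be sketched as small functions over already-fetched API responses. The payloads below are made-up samples shaped like the responses of GET _cluster/pending_tasks and GET _cat/thread_pool?format=json; the function names and the 30-second threshold are assumptions, not the authors' monitoring code.

```python
def stuck_tasks(pending: dict, max_wait_ms: int) -> list:
    """Sources of master tasks that have waited longer than max_wait_ms."""
    return [
        task["source"]
        for task in pending["tasks"]
        if task["time_in_queue_millis"] > max_wait_ms
    ]


def pools_with_rejections(rows: list) -> list:
    """(node, pool) pairs whose rejected counter is non-zero."""
    return [(r["node_name"], r["name"]) for r in rows if int(r["rejected"]) > 0]


# Made-up sample payloads for illustration.
pending = {"tasks": [
    {"insert_order": 101, "priority": "URGENT", "source": "shard-failed",
     "time_in_queue_millis": 45000},
    {"insert_order": 102, "priority": "HIGH", "source": "put-mapping",
     "time_in_queue_millis": 120},
]}
thread_pools = [
    {"node_name": "data-17", "name": "write", "active": "56", "queue": "200", "rejected": "312"},
    {"node_name": "data-18", "name": "search", "active": "3", "queue": "0", "rejected": "0"},
]

print(stuck_tasks(pending, max_wait_ms=30000))
print(pools_with_rejections(thread_pools))
```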

When building monitoring, you have to take into account the specifics of the thread pools in Elasticsearch. The Elasticsearch documentation describes the configuration options and default values for search and indexing, but says nothing at all about thread_pool.management. These threads handle, in particular, requests like _cat/shards and similar ones, which are convenient to use when writing monitoring. The larger the cluster, the more such requests are executed per unit of time, and the aforementioned thread_pool.management is not only missing from the official documentation but is also limited by default to 5 threads, which are used up very quickly, after which monitoring stops working correctly.
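
The "do not switch off a node that holds the last copy of a shard" rule can be derived from a single _cat/shards response, which keeps the number of management requests low. The sketch below works on made-up rows shaped like the output of GET _cat/shards?format=json; the index and node names are hypothetical.

```python
from collections import defaultdict


def nodes_holding_last_copy(shard_rows: list) -> set:
    """Nodes that hold the only STARTED copy of at least one shard."""
    copies = defaultdict(list)
    for row in shard_rows:
        if row["state"] == "STARTED":
            copies[(row["index"], row["shard"])].append(row["node"])
    return {nodes[0] for nodes in copies.values() if len(nodes) == 1}


# Made-up rows for illustration.
rows = [
    {"index": "graylog_42", "shard": "0", "prirep": "p", "state": "STARTED", "node": "data-1"},
    {"index": "graylog_42", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "node": None},
    {"index": "graylog_42", "shard": "1", "prirep": "p", "state": "STARTED", "node": "data-2"},
    {"index": "graylog_42", "shard": "1", "prirep": "r", "state": "STARTED", "node": "data-3"},
]
print(nodes_holding_last_copy(rows))
```

Fetching the shard list once and computing several checks from it locally is one way to stay within a small management thread pool.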

What I want to say in conclusion: we did it! We managed to give our programmers and developers a tool that, in almost any situation, can quickly and reliably provide information about what is happening in production.

Yes, it turned out to be rather complicated, but nevertheless we managed to fit our wishes into existing products, which we did not have to patch or rewrite for ourselves.

source: www.habr.com
