How we built a highly efficient, low-cost DataLake, and why we did it this way

We live in an amazing time, when you can quickly and easily bolt together several ready-made open-source tools, configure them with your "brain switched off" by following advice from Stack Overflow, without digging into the "many letters", and launch them into commercial operation. And when you need to update or expand, or someone accidentally reboots a couple of machines, you realize that some kind of obsessive bad dream has begun: everything has become complicated beyond recognition, there is no way back, the future is vague, and it would be safer to keep bees and make cheese instead of programming.

It is not for nothing that more experienced colleagues, whose heads are strewn with bugs and therefore already gray, smile politely as they watch packs of "containers" being deployed incredibly fast in "cubes" on dozens of servers in "trendy languages" with built-in support for asynchronous non-blocking I/O. And they silently keep rereading "man ps", dig into the "nginx" source code until their eyes bleed, and write, write, write unit tests. These colleagues know that the most interesting part will come when "all this" one day falls over at night on New Year's Eve. And the only things that will help them are a deep understanding of the nature of unix, a memorized TCP/IP state table, and basic sorting and searching algorithms. That is what brings the system back to life while the chimes are striking.

Yes, I got a little carried away, but I hope I managed to convey the mood of anticipation.
Today I want to share our experience of deploying a convenient and inexpensive stack for a DataLake that solves most of the analytical tasks in the company for completely different structural divisions.

Some time ago we came to understand that companies increasingly need the fruits of both product and technical analytics (not to mention the icing on the cake in the form of machine learning), and that to understand trends and risks we need to collect and analyze more and more metrics.

Basic technical analytics in Bitrix24

Several years ago, at the same time as the launch of the Bitrix24 service, we actively invested time and resources in creating a simple and reliable analytical platform that would help us quickly see problems in the infrastructure and plan the next step. Of course, it made sense to take ready-made tools that were as simple and understandable as possible. As a result, nagios was chosen for monitoring and munin for analytics and visualization. Now we have thousands of checks in nagios and hundreds of charts in munin, and our colleagues use them successfully every day. The metrics are clear, the graphs are clear, the system has been working reliably for several years, and new tests and graphs are added to it regularly: when we put a new service into operation, we add several tests and graphs. Good luck to it.

Finger on the pulse: advanced technical analytics

The desire to receive information about problems "as quickly as possible" led us to active experiments with simple and understandable tools: pinba and xhprof.

Pinba sent us statistics in UDP packets about how fast parts of our web pages ran in PHP, and we could see online, in MySQL storage (Pinba comes with its own MySQL engine for fast event analytics), a short list of problems and respond to them. And xhprof let us automatically collect execution graphs of the slowest PHP pages from clients and analyze what could have led to this, calmly, while pouring tea or something stronger.

Some time ago the toolkit was extended with another fairly simple and understandable engine based on the inverted-index algorithm, perfectly implemented in the legendary Lucene library: Elastic/Kibana. The simple idea of writing documents, built from events in the logs, into an inverted Lucene index in many threads, and then searching through them quickly using facets, turned out to be really useful.
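To make the idea more tangible, here is a minimal sketch of an inverted index with facet counting. It is a toy in plain Python, not how Lucene or Elastic are implemented; the class name `TinyIndex`, the field names and the sample log events are all made up for illustration.

```python
from collections import Counter, defaultdict


class TinyIndex:
    """Toy inverted index: maps each word to the ids of events containing it."""

    def __init__(self):
        self.docs = []                     # doc id -> original log event
        self.postings = defaultdict(set)   # word -> set of doc ids

    def add(self, event):
        doc_id = len(self.docs)
        self.docs.append(event)
        for word in event["message"].lower().split():
            self.postings[word].add(doc_id)
        return doc_id

    def search(self, query):
        """Return ids of events that contain every word of the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

    def facet(self, doc_ids, field):
        """Count the matching events per value of one field."""
        return Counter(self.docs[i][field] for i in doc_ids)


index = TinyIndex()
index.add({"portal": "p1", "level": "error", "message": "PHP Fatal error out of memory"})
index.add({"portal": "p1", "level": "warning", "message": "PHP Warning undefined index"})
index.add({"portal": "p2", "level": "error", "message": "PHP Fatal error segfault in worker"})

hits = index.search("fatal error")       # ids of events with both words
by_portal = index.facet(hits, "portal")  # how many hits each portal has
```

The lookup never scans the messages themselves: it intersects small sets of ids, which is why this kind of search stays fast as the number of events grows.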

Despite the rather technical look of visualizations in Kibana, with low-level concepts like "buckets" "flowing upward" and a reinvented language of the not yet completely forgotten relational algebra, the tool began to help us well with the following tasks:

  • How many PHP errors did a Bitrix24 client have on portal p1 in the last hour, and which ones? Understand, forgive and quickly fix.
  • How many video calls were made on portals in Germany in the previous 24 hours, with what quality, and were there any problems with the channel/network?
  • How well does the system functionality (our extension for PHP), compiled from source in the latest service update and rolled out to clients, work? Are there segfaults?
  • Does client data fit into PHP memory? Are there any errors about exceeding the memory allocated to processes: "out of memory"? Find and neutralize.

Here is a concrete example. Despite thorough, multi-level testing, a client with a very non-standard case and damaged input data hit an annoying and unexpected error; the siren sounded and the process of quickly fixing it began:

In addition, kibana lets you set up notifications for specified events, and in a short time the tool came to be used by dozens of employees from different departments, from technical support and development to QA.

The activity of any department in the company became convenient to track and measure: instead of analyzing logs on servers by hand, you just set up log parsing once and send the logs to the elastic cluster, and then enjoy, for example, contemplating in a kibana dashboard the number of two-headed kittens sold that were printed on a 3-D printer over the last lunar month.

Basic business analytics

Everyone knows that business analytics in companies often begins with extremely active use of, yes, Excel. But the main thing is that it does not end there. Cloud-based Google Analytics also adds fuel to the fire: you quickly get used to the good stuff.

In our harmoniously developing company, "prophets" of more intensive work with larger data began to appear here and there. The need for more in-depth and multifaceted reports began to come up regularly, and through the efforts of people from different departments, a simple and practical solution was put together some time ago: a combination of ClickHouse and PowerBI.

For quite a long time this flexible solution helped a lot, but gradually we came to understand that ClickHouse is not made of rubber and cannot be abused like that.

Here it is important to understand well that ClickHouse, like Druid, like Vertica, like Amazon RedShift (which is based on postgres), is an analytical engine optimized for fairly convenient analytics (sums, aggregations, minimum and maximum by column, and a few possible joins), because it is organized for efficient storage of the columns of relational tables, unlike MySQL and the other (row-oriented) databases we know.
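The difference between the two layouts can be sketched in a few lines. This is only an illustration in plain Python with a made-up table; real engines add compression, vectorized execution and much more. The point is how many values have to be touched to sum a single column.

```python
# Hypothetical table: one record per portal (all values are invented).
rows = [
    {"portal": "p1", "country": "DE", "calls": 12, "errors": 1},
    {"portal": "p2", "country": "DE", "calls": 7, "errors": 0},
    {"portal": "p3", "country": "US", "calls": 30, "errors": 4},
]
field_names = list(rows[0].keys())

# Row-oriented layout: all values of one record are stored together.
row_store = [tuple(row.values()) for row in rows]

# Column-oriented layout: all values of one column are stored together.
column_store = {name: [row[name] for row in rows] for name in field_names}


def sum_from_rows(store, names, field):
    """Sum one field from a row store and count the values that were read."""
    position = names.index(field)
    total = 0
    values_read = 0
    for record in store:
        values_read += len(record)   # the whole record has to be loaded
        total += record[position]
    return total, values_read


def sum_from_columns(store, field):
    """Sum one field from a column store and count the values that were read."""
    column = store[field]            # only this column is touched
    return sum(column), len(column)


row_total, row_reads = sum_from_rows(row_store, field_names, "calls")
col_total, col_reads = sum_from_columns(column_store, "calls")
```

Both functions return the same total, but the row store reads every field of every record, while the column store reads one value per record. With dozens of columns and billions of records, that ratio is what makes aggregations cheap in columnar engines.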

In essence, ClickHouse is just a more capacious "database", with not very convenient point-by-point insertion (that is by design, everything is fine), but with pleasant analytics and a set of interesting, powerful functions for working with data. Yes, you can even create a cluster, but you understand that hammering nails with a microscope is not quite right, and so we began to look for other solutions.

Demand for python and analysts

Our company has many developers who have been writing code almost every day for 10-20 years in PHP, JavaScript, C#, C/C++, Java, Go, Rust, Python, Bash. There are also many experienced system administrators who have lived through more than one absolutely incredible disaster that does not fit the laws of statistics (for example, when most of the disks in a raid-10 are destroyed by a strong lightning strike). In such circumstances, for a long time it was not clear what a "python analyst" was. Python is like PHP, only the name is a little longer and there are slightly fewer traces of mind-altering substances in the interpreter's source code. However, as more and more analytical reports were created, experienced developers began to appreciate the importance of narrow specialization in tools like numpy, pandas, matplotlib, seaborn.
The decisive role, most likely, was played by employees suddenly fainting at the combination of the words "logistic regression" and a demonstration of effective reporting on large data using, yes, yes, pyspark.

Apache Spark, its functional paradigm onto which relational algebra fits perfectly, and its capabilities made such an impression on developers accustomed to MySQL that the need to strengthen the ranks with experienced analysts became clear as day.

Further attempts of Apache Spark/Hadoop to take off, and what did not go quite according to the script

However, it soon became clear that something was systemically not quite right with Spark, or that we simply needed to wash our hands better. The Hadoop/MapReduce/Lucene stack was made by fairly experienced programmers, which is obvious if you look closely at the Java source code or at Doug Cutting's ideas in Lucene. Spark, on the other hand, is suddenly written in the exotic language Scala, which is very controversial from a practical point of view and, in our opinion, is currently not developing. And the regular crashes of calculations on the Spark cluster, caused by illogical and not very transparent memory allocation for reduce operations (many keys arrive at once), gave it the aura of something that still has room to grow. On top of that, the situation was aggravated by a large number of strange open ports, temporary files growing in the most incomprehensible places, and dependency hell, which caused system administrators to have one feeling well known since childhood: fierce hatred (or maybe they needed to wash their hands with soap).

As a result, we have "survived" several internal analytical projects that actively use Apache Spark (including Spark Streaming and Spark SQL) and the Hadoop ecosystem (and so on and so forth). Over time we learned to prepare and monitor "it" quite well, and "it" practically stopped crashing suddenly because of changes in the nature of the data and imbalance in uniform RDD hashing. Still, the desire to take something ready-made, updated and administered somewhere in the cloud grew stronger and stronger. It was at this time that we tried the ready-made cloud build from Amazon Web Services, EMR, and subsequently tried to solve our problems with it. EMR is Apache Spark prepared by Amazon with additional software from the ecosystem, much like the Cloudera/Hortonworks builds.

Rubber file storage for analytics is an urgent need

The experience of "cooking" Hadoop/Spark, with burns to various parts of the body, was not in vain. It became increasingly clear that we needed a single, inexpensive and reliable file storage that would be resistant to hardware failures, where we could store files in different formats from different systems and make efficient, time-efficient selections from this data for reports.

I also wanted updating the software of this platform not to turn into a New Year's nightmare of reading 20-page Java traces and analyzing kilometer-long detailed cluster logs with Spark History Server and a backlit magnifying glass. I wanted a simple and transparent tool that does not require regular diving under the hood whenever a developer's standard MapReduce request stops executing because the reduce worker ran out of memory due to a poorly chosen partitioning algorithm for the source data.

Is Amazon S3 a candidate for DataLake?

Experience with Hadoop/MapReduce taught us that we need a scalable, reliable file system and scalable workers on top of it that "come" closer to the data, so that the data is not driven over the network. Workers should be able to read data in different formats, but preferably should not read unnecessary information, and it should be possible to store data in advance in formats that are convenient for the workers.

Once again, the basic idea. There is no desire to "pour" big data into a single cluster analytical engine, which will sooner or later choke, and which you will then have to shard in some ugly way. I want to store files, just files, in an understandable format, and run efficient analytical queries on them using different but understandable tools. And there will be more and more files in different formats. And it is better to shard not the engine, but the source data. We need an extensible and universal DataLake, we decided...

What if you store files in the familiar and well-known scalable cloud storage Amazon S3, without having to cook your own chops out of Hadoop?

It is clear that personal data is off limits, but what about other data, if we move it there and "drive it efficiently"?

The cluster-bigdata-analytics ecosystem of Amazon Web Services, in very simple words

Judging by our experience with AWS, Apache Hadoop/MapReduce has been actively used there for a long time under various sauces, for example in the DataPipeline service (I envy my colleagues, they learned how to cook it correctly). Here we set up backups of various services from DynamoDB tables:

And they have been running regularly on Hadoop/MapReduce clusters like clockwork for several years now. "Set it and forget it":

You can also effectively engage in data satanism by setting up Jupyter notebooks in the cloud for analysts and using the AWS SageMaker service to train and deploy AI models into battle. Here is what it looks like for us:

And yes, you can spin up a notebook in the cloud for yourself or for an analyst and attach it to a Hadoop/Spark cluster, do the calculations, and then tear everything down:

This is really convenient for individual analytical projects, and for some of them we have successfully used the EMR service for large-scale calculations and analytics. But what about a system-wide solution for the DataLake, would it work? At that moment we were on the edge between hope and despair, and we continued the search.

AWS Glue: neatly packaged Apache Spark on steroids

It turned out that AWS has its own version of the "Hive/Pig/Spark" stack. The role of Hive, i.e. the catalog of files and their types in the DataLake, is played by the "Data catalog" service, which does not hide its compatibility with the Apache Hive format. You need to add information to this service about where your files are located and what format they are in. The data can live not only in s3 but also in a database, but that is not the subject of this post. Here is how our DataLake data catalog is organized:

The files are registered, great. If the files have been updated, we launch crawlers, either manually or on a schedule, which update the information about them from the lake and save it. Then the data from the lake can be processed and the results uploaded somewhere. In the simplest case, we upload them to s3 as well. Data processing can be done anywhere, but the suggested way is to configure the processing on an Apache Spark cluster using the advanced capabilities of the AWS Glue API. In fact, you can take good old familiar python code that uses the pyspark library and configure it to run on N nodes of a cluster of some capacity, with monitoring, without digging into the guts of Hadoop, dragging docker-moker containers around and resolving dependency conflicts.

Once again, a simple idea. There is no need to configure Apache Spark; you just write python code for pyspark, test it locally on your desktop, and then run it on a large cluster in the cloud, specifying where the source data is and where to put the result. Sometimes this is necessary and useful, and here is how we set it up:

So, if you need to calculate something on a Spark cluster using data in s3, we write code in python/pyspark, test it, and off it goes to the cloud.
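A job of this kind might look roughly like the sketch below. It uses only plain pyspark calls rather than the Glue-specific API, and everything concrete in it is an assumption made for illustration: the `--source` and `--target` arguments, the JSON input, and the `timestamp` and `status` fields are not taken from our real jobs.

```python
import argparse


def parse_args(argv):
    """Read where the source data lives and where the result should go."""
    parser = argparse.ArgumentParser(description="Daily HTTP status counts")
    parser.add_argument("--source", required=True, help="input path, local or s3")
    parser.add_argument("--target", required=True, help="output path for Parquet")
    return parser.parse_args(argv)


def run_job(spark, source_path, target_path):
    """Count log events per day and HTTP status, then write Parquet by day."""
    # Imported here so that the module can be loaded without pyspark installed.
    from pyspark.sql import functions as F

    logs = spark.read.json(source_path)  # assumes one JSON event per line
    daily = (
        logs.withColumn("day", F.to_date("timestamp"))
        .groupBy("day", "status")
        .count()
    )
    daily.write.mode("overwrite").partitionBy("day").parquet(target_path)


def main(argv):
    from pyspark.sql import SparkSession

    args = parse_args(argv)
    spark = SparkSession.builder.appName("daily-status-counts").getOrCreate()
    try:
        run_job(spark, args.source, args.target)
    finally:
        spark.stop()
```

Locally you would call `main(["--source", "sample.json", "--target", "out/"])` against a small sample file; in the cloud the same function gets s3 paths instead. The code does not know or care how many nodes sit behind the `SparkSession`.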

What about orchestration? What if a task crashed and disappeared? Yes, the suggestion is to build a beautiful pipeline in the Apache Pig style, and we even tried that, but for now we have decided to use our own deeply customized orchestration in PHP and JavaScript (I understand, there is cognitive dissonance here, but it works, for years and without errors).

The format of files stored in the lake is the key to performance

It is very, very important to understand two more key points. For queries on file data in the lake to run as quickly as possible, and for performance not to degrade as new information is added, you need to:

  • Store the columns of files separately (so that you do not have to read all the rows to find out what is in the columns). For this we took the parquet format with compression.
  • Shard the files into folders like: language, year, month, day, week. This is very important. Engines that understand this type of sharding will look only into the folders they need, without sifting through all the data in a row.

Essentially, this way you lay out the source data in the most efficient form for the analytical engines hung on top, which even inside the sharded folders can selectively go in and read only the necessary columns from the files. You do not need to "load" the data anywhere (the storage would simply burst): just put it into the file system wisely, in the right format, from the start. Of course, it should be clear that storing a huge csv file in the DataLake, which the cluster must first read line by line in order to extract the columns, is not very advisable. Think about the two points above again if it is not yet clear why all of this matters.
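The folder sharding described above can be sketched as follows. The `key=value` folder naming is the Hive-style convention and is an assumption here, as are the helper names `partition_prefix` and `prune`; the week folder is left out for brevity.

```python
from datetime import date


def partition_prefix(lang, day):
    """Build the folder prefix under which one day's files would be stored."""
    return f"lang={lang}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


def prune(prefixes, **wanted):
    """Keep only the folders whose key=value parts match every filter."""
    selected = []
    for prefix in prefixes:
        parts = dict(part.split("=", 1) for part in prefix.strip("/").split("/"))
        if all(parts.get(key) == str(value) for key, value in wanted.items()):
            selected.append(prefix)
    return selected


prefixes = [
    partition_prefix("en", date(2019, 5, 17)),
    partition_prefix("en", date(2019, 6, 1)),
    partition_prefix("de", date(2019, 5, 17)),
]

# A query filtered by language and month only needs to open one folder.
en_may = prune(prefixes, lang="en", month="05")
```

An engine that understands this layout does the same thing with the `WHERE` clause of a query: folders that cannot match are never listed or read, and inside the remaining parquet files only the requested columns are fetched.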

AWS Athena: the jack-in-the-box

And then, while building the lake, we somehow stumbled upon Amazon Athena by accident. It suddenly turned out that by carefully arranging our huge log files into folder shards in the right columnar format (parquet), you can very quickly make extremely informative selections from them and build reports WITHOUT an Apache Spark/Glue cluster at all.

The Athena engine that works on data in s3 is based on the legendary Presto, a representative of the MPP (massively parallel processing) family of approaches to data processing, which takes data wherever it lies, from s3 and Hadoop to Cassandra and ordinary text files. You just ask Athena to execute an SQL query, and then everything "works quickly and automatically". It is important to note that Athena is "smart": it goes only to the sharded folders it needs and reads only the columns required by the query.

The pricing of requests to Athena is also interesting. We pay for the volume of scanned data. That is, not for the number of machines in the cluster per minute, but... for the data actually scanned on 100-500 machines, and only the data needed to complete the request.

And because we request only the necessary columns from correctly sharded folders, it turned out that the Athena service costs us tens of dollars a month. Well, great, almost free compared to analytics on clusters!
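A back-of-the-envelope calculation shows why the bill stays small. All numbers here are assumptions made for illustration: the price of 5 US dollars per terabyte scanned is the list price as I remember it and may have changed, and the data volume, the one-week filter and the 3-of-40 columns are invented. The sketch also ignores compression and per-query minimums.

```python
PRICE_PER_TB_USD = 5.0   # assumption: check the current Athena pricing page
BYTES_PER_TB = 1024 ** 4


def query_cost_usd(scanned_bytes, price_per_tb=PRICE_PER_TB_USD):
    """Cost of one query when billing is based purely on bytes scanned."""
    return scanned_bytes / BYTES_PER_TB * price_per_tb


total_bytes = 10 * BYTES_PER_TB   # hypothetical: a year of logs, 10 TB
partition_fraction = 7 / 365      # the query is filtered to one week of folders
column_fraction = 3 / 40          # it reads 3 of 40 equally sized columns

# One huge unpartitioned csv: every query scans everything.
full_scan_cost = query_cost_usd(total_bytes)

# Sharded folders plus a columnar format: only the needed slice is scanned.
pruned_cost = query_cost_usd(total_bytes * partition_fraction * column_fraction)
```

Under these assumptions a full scan costs 50 dollars per query, while the pruned query costs roughly 7 cents. The ratio matters more than the absolute numbers: sharding and columnar storage multiply together.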

By the way, here is how we shard our data in s3:

As a result, in a short time completely different departments in the company, from information security to analytics, began to actively query Athena and to get useful answers from "big" data quickly, in seconds, over fairly long periods: months, half a year, and so on.

But we went further and began to go to the cloud for answers through an ODBC driver: an analyst writes an SQL query in a familiar console, and the query, running on 100-500 machines "for pennies", scans the data in s3 and usually returns an answer within a few seconds. Comfortable. And fast. I still cannot believe it.

As a result, having decided to store data in s3, in an efficient columnar format and with reasonable sharding of data into folders... we got a DataLake and a fast and cheap analytical engine, for free. And it became very popular in the company, because... it understands SQL and works orders of magnitude faster than starting/stopping/configuring clusters. "And if the result is the same, why pay more?"

A request to Athena looks something like this. If you wish, of course, you can write a sufficiently complex, multi-page SQL query, but we will limit ourselves to simple grouping. Let's see what response codes a client had in the web server logs a few weeks ago and make sure there are no errors:
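A query of that kind could be put together roughly as in the sketch below. The table name, the `portal`, `year`, `week` and `status` columns, and the helper names are all hypothetical; the only real API used is `start_query_execution` of the boto3 Athena client, and that function is not called here.

```python
def status_code_query(table, portal, year, week):
    """Build a grouping query that filters on partition columns first."""
    safe_portal = portal.replace("'", "''")   # minimal quoting for the sketch
    return (
        f"SELECT status, count(*) AS hits\n"
        f"FROM {table}\n"
        f"WHERE portal = '{safe_portal}' AND year = {int(year)} AND week = {int(week)}\n"
        f"GROUP BY status\n"
        f"ORDER BY hits DESC"
    )


def run_on_athena(sql, database, output_location):
    """Submit the query to Athena and return its execution id."""
    import boto3  # imported lazily: needs AWS credentials at call time

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical table and portal names, for illustration only.
sql = status_code_query("weblogs", "p1", 2019, 20)
```

The filters on `year` and `week` are what let the engine skip every folder outside that week, and selecting only `status` means a single column is read from the parquet files that remain.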

Conclusions

Having walked a path that was, if not long, then certainly painful, while constantly and soberly assessing the risks, the level of complexity and the cost of support, we found a solution for DataLake and analytics that never ceases to please us with both its speed and its cost of ownership.

It turned out that building an effective, fast and cheap-to-operate DataLake for the needs of completely different departments of the company is entirely within the reach of even experienced developers who have never worked as architects, cannot draw squares on squares with arrows, and do not know 50 terms from the Hadoop ecosystem.

At the beginning of the journey, my head was splitting from the many wild zoos of open and closed software and from the burden of responsibility to our descendants. Just start building your DataLake from simple tools: nagios/munin -> elastic/kibana -> Hadoop/Spark/s3..., collecting feedback and deeply understanding the physics of the processes taking place. Everything complex and murky you can leave to enemies and competitors.

If you do not want to go to the cloud and you like supporting, updating and patching open-source projects, you can build a scheme similar to ours locally, on inexpensive office machines with Hadoop and Presto on top. The main thing is not to stop: keep moving forward, keep counting, look for simple and clear solutions, and everything will definitely work out! Good luck to everyone and see you again!

source: www.habr.com
