How we built a highly efficient and inexpensive DataLake, and why we did it this way

We live in an amazing time when you can quickly and easily wire together several ready-made open-source tools, configure them with your "brain switched off" following Stack Overflow advice, without digging into the "many letters", and launch them into commercial operation. And when you need to update or scale, or someone accidentally reboots a couple of machines, you realize that some kind of obsessive bad dream has begun: everything has become complicated beyond recognition, there is no way back, the future is hazy, and it would be safer to breed bees and make cheese instead of programming.

It is not for nothing that more experienced colleagues, their heads strewn with bugs and therefore already gray, smile modestly as they watch the incredibly fast deployment of packs of "containers" in "cubes" across dozens of servers in "fashionable languages" with built-in support for asynchronous non-blocking I/O. And they silently keep re-reading "man ps", dig into the "nginx" source code until their eyes bleed, and write, write, write unit tests. These colleagues know that the most interesting part will come when "all of this" one day falls over on New Year's Eve. And the only things that will help them are a deep understanding of the nature of unix, a memorized TCP/IP state table, and basic sorting and searching algorithms. That is what brings the system back to life while the chimes are striking.

Oh yes, I got a little sidetracked, but I hope I managed to convey the mood of anticipation.
Today I want to share our experience of deploying a convenient and inexpensive DataLake stack that solves most of the analytical tasks in the company for completely different structural divisions.

Some time ago we came to understand that companies increasingly need the fruits of both product and technical analytics (not to mention the icing on the cake in the form of machine learning), and that to understand trends and risks we need to collect and analyze more and more metrics.

Basic technical analytics in Bitrix24

Several years ago, at the same time as the launch of the Bitrix24 service, we actively invested time and resources in creating a simple and reliable analytical platform that would help us quickly spot problems in the infrastructure and plan the next step. Of course, it made sense to take ready-made tools that were as simple and understandable as possible. As a result, nagios was chosen for monitoring and munin for analytics and visualization. Now we have thousands of checks in nagios and hundreds of charts in munin, and our colleagues use them successfully every day. The metrics are clear, the graphs are clear, the system has been working reliably for several years, and new checks and graphs are added to it regularly: when we put a new service into operation, we add several checks and graphs. And off it goes.

Finger on the pulse - advanced technical analytics

The desire to learn about problems "as quickly as possible" led us to active experiments with simple and understandable tools - pinba and xhprof.

Pinba sent us statistics in UDP packets about how fast parts of web pages run in PHP, and we could see online, in MySQL storage (Pinba comes with its own MySQL engine for fast event analytics), a short list of problems and respond to them. And xhprof let us automatically collect execution graphs of the slowest PHP pages from clients and analyze what could have led to this - calmly, pouring tea or something stronger.

Some time ago, the toolkit was extended with another fairly simple and understandable engine based on the inverted-index algorithm, superbly implemented in the legendary Lucene library - Elastic/Kibana. The simple idea of multi-threaded writing of documents into an inverted Lucene index based on events in the logs, with fast search over them using faceting, turned out to be really useful.
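The inverted-index idea can be illustrated with a minimal sketch. This is not how Lucene or Elastic are implemented; it is a toy in-memory version, and the log lines and function names are invented for illustration only.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase a log line and split it into simple word tokens."""
    return [tok for tok in text.lower().replace(":", " ").split() if tok]


def build_inverted_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical log events, invented for this example.
logs = {
    1: "PHP Fatal error: out of memory on portal p1",
    2: "video call finished on portal p2",
    3: "PHP Warning: undefined index on portal p1",
}
index = build_inverted_index(logs)
hits = search(index, "php p1")
```

The point of the structure is that a search touches only the posting sets of the query tokens instead of scanning every log line.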

Despite the rather technical look of visualizations in Kibana, with low-level concepts like a "bucket" "flowing upward" and a reinvented language of not-yet-completely-forgotten relational algebra, the tool began to help us well with the following tasks:

  • How many PHP errors did a Bitrix24 client have on portal p1 in the last hour, and which ones? Understand, forgive, and quickly fix.
  • How many video calls were made on portals in Germany in the past 24 hours, with what quality, and were there any difficulties with the channel or network?
  • How well does the system functionality (our C extension for PHP), compiled from source in the latest service update and rolled out to clients, work? Are there any segfaults?
  • Does customer data fit into PHP memory? Are there errors about exceeding the memory allocated to processes: "out of memory"? Find and neutralize.
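The first question in the list above could be expressed as an Elasticsearch query body roughly like this. It is a sketch under assumptions: the field names `portal`, `source`, `error_type` and `@timestamp` are hypothetical and depend entirely on how the logs were parsed; only the query DSL keywords themselves (`bool`, `filter`, `term`, `range`, `terms` aggregation) are standard.

```python
import json


def php_errors_last_hour(portal):
    """Build a query body: PHP errors for one portal in the last hour, grouped by type."""
    return {
        "size": 0,  # we only want the aggregation, not the individual documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"portal": portal}},        # hypothetical field
                    {"term": {"source": "php"}},         # hypothetical field
                    {"range": {"@timestamp": {"gte": "now-1h"}}},
                ]
            }
        },
        "aggs": {
            # hypothetical field holding the kind of PHP error
            "by_error": {"terms": {"field": "error_type", "size": 20}}
        },
    }


body = php_errors_last_hour("p1")
body_json = json.dumps(body, indent=2)
```

The resulting JSON would be sent to the search endpoint of the index that holds the parsed logs; Kibana builds similar requests behind its visualizations.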

Here is a concrete example. Despite thorough, multi-level testing, a client with a very non-standard case and damaged input data got an annoying and unexpected error, the siren went off, and the process of quickly fixing it began.

In addition, Kibana lets you set up notifications for specified events, and within a short time the tool was being used by dozens of employees from different departments - from technical support and development to QA.

The activity of any department within the company has become convenient to track and measure: instead of manually analyzing logs on servers, you just need to set up log parsing once and send the logs to the Elastic cluster, and then enjoy contemplating in a Kibana dashboard, for example, the number of two-headed kittens printed on a 3-D printer and sold over the last lunar month.

Basic business analytics

Everyone knows that business analytics in companies often starts with extremely active use of, yes, Excel. But the main thing is that it does not end there. Cloud-based Google Analytics also adds fuel to the fire - you quickly get used to the good stuff.

In our harmoniously developing company, "prophets" of more intensive work with larger data began to appear here and there. The need for more in-depth and multifaceted reports began to arise regularly, and through the efforts of people from different departments, some time ago a simple and practical solution was put together - a combination of ClickHouse and PowerBI.

For quite a long time this flexible solution helped a lot, but gradually the understanding came that ClickHouse is not made of rubber and cannot be abused like that.

Here it is important to understand that ClickHouse, like Druid, like Vertica, like Amazon RedShift (which is based on postgres), is an analytical engine optimized for fairly convenient analytics (sums, aggregations, minimum and maximum by column, and a few possible joins), because it is organized for efficient storage of the columns of relational tables, unlike MySQL and the other (row-oriented) databases we are used to.
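Why column-oriented storage suits this kind of analytics can be shown with a small sketch. It only models the layout in plain Python lists; real engines add compression, indexes and vectorized execution on top. The records and column names are invented for illustration.

```python
# Row-oriented layout: each record is stored together, as in MySQL.
rows = [
    {"portal": "p1", "status": 200, "bytes": 512},
    {"portal": "p2", "status": 500, "bytes": 128},
    {"portal": "p1", "status": 200, "bytes": 2048},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row store: every whole record is touched just to sum one field.
total_from_rows = sum(record["bytes"] for record in rows)

# Column store: only the single "bytes" column is read.
total_from_columns = sum(columns["bytes"])
max_bytes = max(columns["bytes"])
```

Both layouts give the same answer; the difference is how much data has to be read to get it, which is what matters once the table no longer fits in memory.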

In essence, ClickHouse is just a more capacious "database" with not very convenient point-by-point insertion (that is how it is intended, all is fine), but with pleasant analytics and a set of interesting, powerful functions for working with data. Yes, you can even create a cluster, but you understand that hammering nails with a microscope is not entirely right, so we began to look for other solutions.

The demand for python and analysts

Our company has many developers who have been writing code almost every day for 10-20 years in PHP, JavaScript, C#, C/C++, Java, Go, Rust, Python, Bash. There are also many experienced system administrators who have lived through more than one absolutely incredible disaster that does not fit into the laws of statistics (for example, when the majority of disks in a raid-10 are destroyed by a strong lightning strike). In such circumstances, for a long time it was not clear what a "python analyst" was. Python is like PHP, only the name is a little longer and there are slightly fewer traces of mind-altering substances in the interpreter's source code. However, as more and more analytical reports were created, experienced developers began to appreciate the importance of narrow specialization in tools like numpy, pandas, matplotlib, seaborn.
The decisive role, most likely, was played by employees suddenly fainting at the combination of the words "logistic regression" and at a demonstration of effective reporting on large data using, yes, yes, pyspark.

Apache Spark, its functional paradigm onto which relational algebra fits so well, and its capabilities made such an impression on developers accustomed to MySQL that the need to strengthen the ranks with experienced analysts became clear as day.

Further attempts of Apache Spark/Hadoop to take off, and what did not go quite according to the script

However, it soon became clear that something was systemically not quite right with Spark, or that we simply needed to wash our hands better. The Hadoop/MapReduce/Lucene stack was made by fairly experienced programmers, which is obvious if you look closely at the Java source code or at Doug Cutting's ideas in Lucene. Spark, by contrast, is suddenly written in the exotic language Scala, which is quite controversial in terms of practicality and, as it seemed to us, was not really evolving. And the regular failure of calculations on the Spark cluster because of illogical and not very transparent memory allocation for reduce operations (many keys arrive at once) created around it the halo of something that still has room to grow. The situation was made worse by a large number of strange open ports, temporary files growing in the most incomprehensible places, and a hell of jar dependencies, which gave system administrators one feeling well known from childhood: fierce hatred (or maybe they needed to wash their hands with soap).

As a result, we have "survived" several internal analytical projects that actively use Apache Spark (including Spark Streaming and Spark SQL) and the Hadoop ecosystem (and so on and so forth). Over time we learned to prepare and monitor "it" quite well, and "it" practically stopped crashing suddenly because of changes in the nature of the data and imbalance in uniform RDD hashing. Even so, the desire to take something already prepared, updated and administered somewhere in the cloud grew stronger and stronger. It was at this time that we tried the ready-made cloud build from Amazon Web Services, EMR, and subsequently tried to solve problems using it. EMR is Apache Spark prepared by Amazon with additional software from the ecosystem, much like the Cloudera/Hortonworks builds.

Rubber file storage for analytics is an urgent need

The experience of "cooking" Hadoop/Spark, with burns to various parts of the body, was not in vain. It became increasingly clear that we needed a single, inexpensive and reliable file storage that would be resistant to hardware failures, in which we could store files in different formats from different systems and make efficient, fast selections from this data for reports.

I also wanted software updates of this platform not to turn into a New Year's nightmare of reading 20-page Java traces and analyzing kilometer-long detailed cluster logs with Spark History Server and a backlit magnifying glass. I wanted a simple and transparent tool that did not require regular diving under the hood whenever a developer's standard MapReduce request stopped executing because the reduce worker ran out of memory due to a poorly chosen partitioning algorithm for the source data.
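The reduce-side memory problem mentioned above is easy to reproduce on paper. The sketch below is not Spark code; it only simulates a hash partitioner in plain Python, with an invented workload in which one "hot" key dominates, to show how a single reducer can end up with almost all the data.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Deterministic hash partitioner: the same key always goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def reducer_load(keys, num_partitions):
    """Count how many records each reducer would receive."""
    load = Counter({p: 0 for p in range(num_partitions)})
    for key in keys:
        load[partition_for(key, num_partitions)] += 1
    return load


# Hypothetical workload: one hot key accounts for 90% of the records.
keys = ["portal-hot"] * 9000 + [f"portal-{i}" for i in range(1000)]
load = reducer_load(keys, num_partitions=4)
busiest = max(load.values())
```

However many reducers are added, all records of the hot key still land on one of them, so that worker's memory is the limit. Choosing a different partitioning key, or splitting the hot key, is what actually helps.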

Is Amazon S3 a candidate for a DataLake?

Experience with Hadoop/MapReduce taught us that we need a scalable, reliable file system and scalable workers on top of it, "coming" closer to the data so as not to drive the data over the network. Workers should be able to read data in different formats, but preferably should not read unnecessary information, and it should be possible to store data in advance in formats convenient for the workers.

Once again, the basic idea. There is no desire to "pour" big data into a single clustered analytical engine, which will sooner or later choke and have to be sharded in an ugly way. I want to store files, just files, in an understandable format and run efficient analytical queries on them using different but understandable tools. And there will be more and more files in different formats. And it is better to shard not the engine but the source data. We need an extensible and universal DataLake, we decided...

What if we store the files in the familiar and well-known scalable cloud storage Amazon S3, without having to cook our own chops out of Hadoop?

It is clear that personal data is a "no-no", but what about the other data, if we move it there and "crunch it efficiently"?

The cluster-bigdata-analytics ecosystem of Amazon Web Services - in very simple words

Judging by our experience with AWS, Apache Hadoop/MapReduce has been actively used there for a long time under various sauces, for example in the DataPipeline service (I envy our colleagues, they learned how to cook it properly). This is where we have set up backups of DynamoDB tables from different services.

And they have been running regularly on embedded Hadoop/MapReduce clusters like clockwork for several years now. "Set it and forget it."

You can also effectively engage in data Satanism by setting up Jupyter notebooks in the cloud for analysts and using the AWS SageMaker service to train AI models and deploy them into battle. We use this setup ourselves.

And yes, you can spin up a notebook in the cloud for yourself or an analyst, attach it to a Hadoop/Spark cluster, do the calculations and then tear everything down.

This is really convenient for individual analytical projects, and for some of them we have successfully used the EMR service for large-scale calculations and analytics. But what about a systemic solution for a DataLake, will it work? At this point we were on the verge between hope and despair, and we continued the search.

AWS Glue - neatly packaged Apache Spark on steroids

It turned out that AWS has its own version of the "Hive/Pig/Spark" stack. The role of Hive, i.e. the catalog of files and their types in the DataLake, is played by the "Data catalog" service, which does not hide its compatibility with the Apache Hive format. You need to add information to this service about where your files are located and what format they are in. The data can live not only in s3 but also in a database, but that is not the subject of this post. Our own DataLake is registered in this data catalog.

The files are registered, great. If the files have been updated, we launch crawlers, either manually or on a schedule, which update the information about them in the lake and save it. Then the data from the lake can be processed and the results uploaded somewhere. In the simplest case, we upload them to s3 as well. Data processing can be done anywhere, but the suggested way is to set up processing on an Apache Spark cluster using the advanced capabilities of the AWS Glue API. In effect, you can take good old familiar python code that uses the pyspark library and configure it to run on N nodes of a cluster of a given capacity, with monitoring, without digging into the guts of Hadoop, dragging around docker-moker containers, or resolving dependency conflicts.

Once again, a simple idea. There is no need to configure Apache Spark: you just write python code for pyspark, test it locally on your desktop and then run it on a large cluster in the cloud, specifying where the source data is and where to put the result. Sometimes this is necessary and useful, and that is how we have it set up.
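A job of this kind might look like the sketch below. It is plain pyspark rather than anything Glue-specific, and it is not our production code: the column name `status`, the application name and the argument names are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Read where the source data lives and where the result should go."""
    parser = argparse.ArgumentParser(description="Count requests per status code")
    parser.add_argument("--source", required=True, help="input path, e.g. an s3:// prefix")
    parser.add_argument("--target", required=True, help="output path for the result")
    return parser.parse_args(argv)


def run(source, target):
    """Aggregate parquet logs by status code and write the result as parquet."""
    # Imported lazily so argument parsing can be tested without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("status-counts").getOrCreate()
    logs = spark.read.parquet(source)
    counts = logs.groupBy("status").count()  # "status" is an assumed column name
    counts.write.mode("overwrite").parquet(target)
    spark.stop()


def main(argv=None):
    """Entry point: the same code runs locally and on a cluster, only the paths differ."""
    args = parse_args(argv)
    run(args.source, args.target)
```

Locally it could be started with something like `main(["--source", "./logs", "--target", "./out"])` against a small sample, and on the cluster with s3 paths instead.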

So if we need to calculate something on a Spark cluster using data in s3, we write the code in python/pyspark, test it, and off to the cloud it goes.

What about orchestration? What if a task fell over and disappeared? Yes, the suggested way is to build a beautiful pipeline in the Apache Pig style, and we even tried that, but for now we decided to use our own deeply customized orchestration in PHP and JavaScript (I understand, there is some cognitive dissonance, but it works, for years and without errors).

The format of the files stored in the lake is the key to performance

It is very, very important to understand two more key points. For queries on file data in the lake to run as fast as possible, and for performance not to degrade as new information is added, you need to do the following:

  • Store the columns of files separately (so that you do not have to read every row to find out what is in the columns). For this we took the parquet format with compression.
  • It is very important to shard files into folders along the lines of: language, year, month, day, week. Engines that understand this kind of sharding will look only into the folders they need, without sifting through all the data in a row.

Essentially, in this way you lay out the source data in the most efficient form for the analytical engines hung on top, which even inside the sharded folders can selectively go in and read only the necessary columns from the files. You do not need to "load" the data anywhere (the storage would simply burst); just put it sensibly into the file system in the right format straight away. Of course, it should be clear that storing a huge csv file in the DataLake, which the cluster first has to read line by line in order to extract the columns, is not very advisable. Think about the two points above again if it is not yet clear why all this matters.
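The folder sharding described above can be sketched as follows. The `key=value` folder naming is the Hive-style convention that engines such as Spark and Presto understand; the bucket name, the chosen partition columns (week is left out for brevity) and the helper names are assumptions for illustration, not our actual layout.

```python
from datetime import date


def partition_prefix(root, lang, day):
    """Build a Hive-style folder path: one key=value segment per partition column."""
    return (
        f"{root}/lang={lang}/year={day.year:04d}"
        f"/month={day.month:02d}/day={day.day:02d}"
    )


def prune(prefixes, **wanted):
    """Keep only folders whose key=value segments match every requested value."""
    selected = []
    for prefix in prefixes:
        segments = dict(
            part.split("=", 1) for part in prefix.split("/") if "=" in part
        )
        if all(segments.get(key) == value for key, value in wanted.items()):
            selected.append(prefix)
    return selected


# Hypothetical bucket name and dates, for illustration only.
root = "s3://example-datalake/weblogs"
folders = [
    partition_prefix(root, "en", date(2019, 1, 14)),
    partition_prefix(root, "en", date(2019, 1, 15)),
    partition_prefix(root, "de", date(2019, 1, 15)),
]
to_scan = prune(folders, lang="en", month="01", day="15")
```

A query filtered on the partition columns only needs to open the folders in `to_scan`; the parquet files inside them then let the engine read just the columns the query mentions.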

AWS Athena - the jack-in-the-box

And then, while building the lake, we somehow came across Amazon Athena by accident. It suddenly turned out that once our huge log files are carefully arranged into folder shards in the right columnar format (parquet), you can very quickly make extremely informative selections from them and build reports WITHOUT an Apache Spark/Glue cluster at all.

The Athena engine, which works on data in s3, is based on the legendary Presto, a representative of the MPP (massively parallel processing) family of approaches to data processing, which takes data where it lies, from s3 and Hadoop to Cassandra and ordinary text files. You just ask Athena to execute an SQL query, and then everything "works quickly and automatically". It is important to note that Athena is "smart": it goes only into the necessary sharded folders and reads only the columns needed by the query.

The pricing of Athena queries is also interesting. We pay for the volume of data scanned. That is, not for the number of machines in the cluster per minute, but for the data actually scanned on 100-500 machines, and only the data needed to complete the query.

And by requesting only the necessary columns from correctly sharded folders, it turned out that the Athena service costs us tens of dollars a month. Well, great, that is almost free compared to analytics on clusters!
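The effect of pay-per-scan pricing can be estimated with simple arithmetic. The numbers below are assumptions, not our bills: a price of 5 USD per terabyte scanned (check the current AWS price list for your region), a hypothetical 2 TB log archive, and a query that, thanks to sharding and parquet, touches about 1% of it.

```python
TB = 1024 ** 4  # bytes in a terabyte, assuming binary units
PRICE_PER_TB_USD = 5.0  # assumed list price per TB scanned, not taken from the article


def query_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Cost of one query when billing is proportional to the bytes scanned."""
    return bytes_scanned / TB * price_per_tb


archive_bytes = 2 * TB  # hypothetical size of the whole log archive

# Unsharded csv: every query has to read the whole archive.
full_scan = query_cost_usd(archive_bytes)

# Sharded parquet: assume one day and two columns amount to ~1% of the bytes.
pruned_scan = query_cost_usd(archive_bytes * 0.01)
```

Under these assumptions a full scan costs about 10 USD per query and the pruned one about 10 cents, which is how a busy month of queries can stay within tens of dollars.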

By the way, we shard our own data in s3 along exactly these lines.

As a result, in a short time, completely different departments of the company, from information security to analytics, began to actively query Athena and quickly, in seconds, get useful answers from "big" data over fairly long periods: months, half a year, and so on.

But we went further and began going to the cloud for answers through an ODBC driver: an analyst writes an SQL query in a familiar console, and on 100-500 machines, "for pennies", it crunches the data in s3 and returns an answer, usually within a few seconds. Convenient. And fast. I still cannot believe it.

As a result, having decided to store data in s3, in an efficient columnar format and with sensible sharding of data into folders... we got a DataLake plus a fast and cheap analytical engine, for free. And it became very popular in the company, because it understands SQL and works orders of magnitude faster than starting, stopping and configuring clusters. "And if the result is the same, why pay more?"

A query to Athena is ordinary SQL. If you wish, of course, you can write a fairly complex, multi-page SQL query, but we will limit ourselves to simple grouping. As an example, take a query that shows which response codes a client had over the last few weeks in the web server logs, to make sure there are no errors.
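A minimal sketch of such a grouping query and of submitting it through boto3 is given below. The table name `weblogs`, the column `status` and the partition columns `year` and `month` are hypothetical, and the original query is not reproduced here; `start_query_execution` is the standard boto3 Athena call, but running it requires boto3, AWS credentials and an s3 location for results.

```python
def status_code_query(table, year, month):
    """Build a simple GROUP BY that filters on partition columns and reads one data column."""
    # Values are interpolated directly; this is only acceptable for trusted input.
    return (
        "SELECT status, COUNT(*) AS hits\n"
        f"FROM {table}\n"
        f"WHERE year = '{year}' AND month = '{month}'\n"
        "GROUP BY status\n"
        "ORDER BY hits DESC"
    )


def submit(sql, database, output_location):
    """Send the query to Athena and return the id of the started execution."""
    import boto3  # imported lazily so building the SQL needs no AWS setup

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


sql = status_code_query("weblogs", "2019", "01")
```

Because the `WHERE` clause uses only partition columns and the `SELECT` list names a single column, Athena would open only the matching folders and read only that column from the parquet files.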

Conclusions

Having travelled a path that was not exactly long but certainly painful, constantly and soberly assessing the risks, the level of complexity and the cost of support, we found a solution for a DataLake and analytics that never ceases to please us with both its speed and its cost of ownership.

It turned out that building an effective, fast and cheap-to-operate DataLake for the needs of completely different departments of the company is well within the capabilities of even those experienced developers who have never worked as architects, cannot draw squares on squares with arrows, and do not know 50 terms from the Hadoop ecosystem.

At the beginning of the journey, my head was splitting from the many wild zoos of open and closed software and from the burden of responsibility to our descendants. Just start building your DataLake from simple tools: nagios/munin -> elastic/kibana -> Hadoop/Spark/s3..., collecting feedback and deeply understanding the physics of the processes taking place. Everything complicated and murky: give it to your enemies and competitors.

If you do not want to go to the cloud and you like supporting, updating and patching open-source projects, you can build a scheme similar to ours locally, on inexpensive office machines with Hadoop and Presto on top. The main thing is not to stop: keep moving forward, count, look for simple and clear solutions, and everything will definitely work out! Good luck to everyone, and see you again!

Source: www.habr.com
