How we organized a highly efficient and inexpensive DataLake, and why

We live in an amazing time when you can quickly and easily connect several ready-made open-source tools, configure them with your "consciousness switched off" by following advice from stackoverflow, without digging into the "many letters" of the documentation, and launch them into production. And when you need to update or scale, or someone accidentally reboots a couple of machines, you realize that some kind of obsessive nightmare has begun: everything has become complicated beyond recognition, there is no way back, the future is vague, and it would be safer, instead of programming, to keep bees and make cheese.

It is not for nothing that more experienced colleagues, whose heads are strewn with bugs and already grey because of it, watch the incredibly fast deployment of packs of "containers" in "cubes" on dozens of servers in "fashionable languages" with built-in support for asynchronous non-blocking I/O, and smile modestly. And they silently keep re-reading "man ps", dig into the "nginx" source code until their eyes bleed, and write, write, write unit tests. These colleagues know that the most interesting part will come when "all this" one day collapses on New Year's Eve. And only a deep understanding of the nature of unix, the memorized TCP/IP state table, and the basic sorting and searching algorithms will help them bring the system back to life while the chimes are striking.

Oh yes, I got a little distracted, but I hope I managed to convey the state of anticipation.
Today I want to share our experience of deploying a convenient and inexpensive stack for a DataLake that solves most of the analytical tasks in the company for completely different structural divisions.

Some time ago we came to understand that companies increasingly need the fruits of both product and technical analytics (not to mention the icing on the cake in the form of machine learning), and that to understand trends and risks we need to collect and analyze more and more metrics.

Basic technical analytics in Bitrix24

Several years ago, simultaneously with the launch of the Bitrix24 service, we actively invested time and resources in creating a simple and reliable analytics platform that would help us quickly see problems in the infrastructure and plan the next step. Of course, it made sense to take ready-made tools that were as simple and understandable as possible. As a result, nagios was chosen for monitoring and munin for analytics and visualization. We now have thousands of checks in nagios and hundreds of charts in munin, and our colleagues use them successfully every day. The metrics are clear, the graphs are clear, the system has worked reliably for several years, and new checks and graphs are regularly added to it: when we put a new service into operation, we add several checks and graphs. Off it goes.

A finger on the pulse: advanced technical analytics

The desire to learn about problems "as quickly as possible" led us to active experiments with simple and understandable tools: pinba and xhprof.

Pinba sent us statistics in UDP packets about how fast individual parts of web pages run in PHP, and we could see online, in MySQL storage (Pinba comes with its own MySQL engine for fast event analytics), a short list of problems and respond to them. And xhprof let us automatically collect call graphs of the slowest PHP pages from clients and analyze what could have led to the slowdown, calmly, while pouring tea or something stronger.
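
To illustrate the "fire and forget" idea behind this kind of monitoring, here is a minimal Python sketch. It is not Pinba's actual wire format (Pinba has its own protocol and PHP extension); the JSON payload and the names `encode_timer`, `send_timer` and `timed` are assumptions made up for this example.

```python
import json
import socket
import time


def encode_timer(page: str, part: str, seconds: float) -> bytes:
    """Serialize one timing measurement (hypothetical JSON format, not Pinba's)."""
    payload = {"page": page, "part": part, "ms": round(seconds * 1000, 3)}
    return json.dumps(payload).encode("utf-8")


def send_timer(sock: socket.socket, addr: tuple, page: str, part: str, seconds: float) -> None:
    """Send the measurement as a single UDP datagram; no reply is awaited."""
    sock.sendto(encode_timer(page, part, seconds), addr)


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    started = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - started
```

A page would wrap an expensive part in `timed(...)` and pass the elapsed time to `send_timer(socket.socket(socket.AF_INET, socket.SOCK_DGRAM), ("127.0.0.1", 9999), "/index.php", "db", elapsed)`, where the address is a placeholder. Because UDP needs no handshake and no acknowledgement, a slow or dead collector does not slow the page down.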

Some time ago the toolkit was replenished with another fairly simple and understandable engine based on the inverted-index algorithm, perfectly implemented in the Lucene library: Elastic/Kibana. The simple idea of multi-threaded recording of documents into an inverted Lucene index built from events in the logs, with fast search over them using faceting, turned out to be really useful.
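
The inverted-index idea can be sketched in a few lines of Python. This is only a toy illustration of the principle, not how Lucene is implemented; the log lines and the function names are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split a log line into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_inverted_index(documents):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of the documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Invented log events, used only for the demonstration.
logs = [
    "PHP Fatal error: Allowed memory size exhausted on portal p1",
    "GET /index.php 200 portal p2",
    "PHP Warning: division by zero on portal p1",
]
index = build_inverted_index(logs)
print(sorted(search(index, "php portal p1")))  # ids of the matching log lines
```

The search never scans the log text itself: it only intersects the small sets of document ids stored under each token, which is why lookups stay fast as the log volume grows.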

Despite the rather technical look of the visualizations in Kibana, with low-level concepts such as a "bucket" "flowing upwards" and the reinvented language of a not yet entirely forgotten relational algebra, the tool began to help us well with the following tasks:

  • How many PHP errors did a Bitrix24 client have on the p1 portal in the last hour, and which ones? Understand, forgive, and fix quickly.
  • How many video calls were made on portals in Germany in the previous 24 hours, with what quality, and were there any difficulties with the channel/network?
  • How well does the system functionality (our C extension for PHP), compiled from source in the latest service update and rolled out to clients, work? Are there any errors?
  • Does client data fit into PHP memory? Are there any errors about exceeding the memory allocated to processes: "out of memory"? Find and neutralize.
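
As an illustration, the first question in the list above could be expressed as an Elasticsearch query body roughly like the one below. The field names (`portal`, `level`, `@timestamp`, `message.keyword`) are assumptions about how the logs might be indexed, not the actual Bitrix24 schema.

```python
import json


def errors_last_hour_query(portal: str) -> dict:
    """Build a query body: count the errors of one portal over the last hour."""
    return {
        "size": 0,  # only the aggregation is needed, not the individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"portal": portal}},
                    {"term": {"level": "error"}},
                    {"range": {"@timestamp": {"gte": "now-1h"}}},
                ]
            }
        },
        "aggs": {
            # bucket the matching events by error message
            "by_message": {"terms": {"field": "message.keyword", "size": 20}}
        },
    }


print(json.dumps(errors_last_hour_query("p1"), indent=2))
```

The `aggs` section is where the "buckets" mentioned above come from: each distinct error message becomes one bucket with its own document count.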

Here is a concrete example. Despite thorough, multi-level testing, a client with a very non-standard case and corrupted input data ran into an annoying and unexpected error; a siren sounded, and the process of fixing it quickly began.

In addition, kibana lets you set up notifications for specified events, and within a short time the tool was being used in the company by dozens of employees from different departments, from technical support and development to QA.

The activity of any department within the company has become convenient to track and measure: instead of manually analyzing logs on servers, you only need to set up log parsing once and send the logs to the elastic cluster to enjoy, for example, contemplating on a kibana dashboard the number of two-headed kittens sold that were printed on a 3-D printer over the last month.

Basic business analytics

Everyone knows that business analytics in companies often begins with extremely active use of, yes, Excel. But the main thing is that it does not end there. Cloud-based Google Analytics adds fuel to the fire: you quickly start getting used to good things.

In our harmoniously developing company, "prophets" of more intensive work with larger data began to appear here and there. The need for deeper and more multifaceted reports began to arise regularly, and through the efforts of people from different departments, some time ago a simple and practical solution was put together: a combination of ClickHouse and PowerBI.

For quite a long time this flexible solution helped a lot, but gradually the understanding began to come that ClickHouse is not made of rubber and cannot be abused like that.

Here it is important to understand well that ClickHouse, like Druid, like Vertica, like Amazon RedShift (which is based on postgres), are analytical engines optimized for fairly convenient analytics (sums, aggregations, minimum-maximum by column, and a few possible joins), because they are organized for efficient storage of the columns of relational tables, unlike MySQL and other (row-oriented) databases known to us.
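
The difference between row-oriented and column-oriented layouts can be shown with a small Python sketch. It is a deliberately simplified model with invented data, not how any of the engines named above store data on disk.

```python
# Row-oriented layout: each record is stored (and read) as a whole.
rows = [
    {"portal": "p1", "status": 200, "bytes": 512},
    {"portal": "p2", "status": 500, "bytes": 128},
    {"portal": "p1", "status": 200, "bytes": 2048},
]


def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: the values of each column are stored together.
columns = to_columns(rows)

# Summing over rows touches every field of every record...
total_from_rows = sum(record["bytes"] for record in rows)
# ...while the columnar layout reads only the one column that is needed.
total_from_columns = sum(columns["bytes"])

print(total_from_rows, total_from_columns)
```

Both sums are equal, but on real storage the columnar variant reads a fraction of the bytes, and values of the same type sitting next to each other also compress much better.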

In essence, ClickHouse is just a more capacious "database", not very convenient for point-by-point insertion (that is how it is intended, everything is fine), but with pleasant analytics and a set of interesting, powerful functions for working with data. Yes, you can even create a cluster, but you understand that hammering nails with a microscope is not entirely right, and so we began to look for other solutions.

The demand for python and analysts

Our company has many developers who have written code almost every day for 10-20 years in PHP, JavaScript, C#, C/C++, Java, Go, Rust, Python, Bash. There are also many experienced system administrators who have lived through more than one absolutely incredible disaster that does not fit into the laws of statistics (for example, when most of the disks in a raid-10 are destroyed by a strong lightning strike). In such circumstances, for a long time it was not clear what a "python analyst" was. Python is like PHP, only the name is a little longer and there are slightly fewer traces of mind-altering substances in the interpreter's source code. However, as more and more analytical reports were created, experienced developers began to understand better and better the importance of narrow specialization in tools such as numpy, pandas, matplotlib, seaborn.
The decisive role, most likely, was played by employees suddenly fainting at the combination of the words "logistic regression" and at a demonstration of effective reporting on large data using, yes, yes, pyspark.

Apache Spark, its functional paradigm onto which relational algebra fits perfectly, and its capabilities made such an impression on developers accustomed to MySQL that the need to strengthen the ranks with experienced analysts became clear as day.

Further attempts by Apache Spark/Hadoop to take off, and what did not go quite according to the script

However, it soon became clear that something was systemically not quite right with Spark, or that we simply needed to wash our hands better. If the Hadoop/MapReduce/Lucene stack was made by fairly experienced programmers, which is obvious if you look closely at the Java source code or at Doug Cutting's ideas in Lucene, then Spark, suddenly, is written in the exotic language Scala, which is very controversial from a practical point of view and, in our view, is currently not developing. And the regular failures of calculations on the Spark cluster due to illogical and not very transparent handling of memory allocation for reduce operations (many keys arrive at once) created around it the halo of something that still has room to grow. In addition, the situation was aggravated by a large number of strange open ports, temporary files growing in the most incomprehensible places, and a hell of jar dependencies, which gave system administrators one feeling well known since childhood: fierce hatred (or maybe they needed to wash their hands with soap).

As a result, we have "survived" several internal analytical projects that actively use Apache Spark (including Spark Streaming, Spark SQL) and the Hadoop ecosystem (and so on and so forth). Even though over time we learned to prepare and monitor "it" quite well, and "it" practically stopped crashing suddenly because of changes in the nature of the data and imbalance in uniform RDD hashing, the desire to take something ready-made, updated, and administered somewhere in the cloud grew stronger and stronger. It was at that time that we tried the ready-made cloud assembly from Amazon Web Services, EMR, and subsequently tried to solve problems with it. EMR is Apache Spark prepared by Amazon with additional software from the ecosystem, much like the Cloudera/Hortonworks builds.

Rubber file storage for analytics is an urgent need

The experience of "cooking" Hadoop/Spark, with burns to various parts of the body, was not in vain. The need to create a single, inexpensive, and reliable file storage that would be resistant to hardware failures, and in which it would be possible to store files in different formats from different systems and make efficient, timely selections for reports from this data, became increasingly clear.

I also wanted updating the software of this platform not to turn into a New Year's nightmare of reading 20-page Java traces and analyzing kilometer-long detailed cluster logs using the Spark History Server and a backlit magnifying glass. I wanted a simple and transparent tool that would not require regular diving under the hood when a developer's standard MapReduce request stopped executing because a reduce worker ran out of memory due to a poorly chosen partitioning algorithm for the source data.
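
Why a poorly chosen partitioning key can exhaust the memory of a single reduce worker is easy to see in a small simulation. This is a simplified stand-in for hash partitioning, not Spark's or Hadoop's actual partitioner; `zlib.crc32` is used only to keep the example deterministic, and the workload with one "hot" client is invented.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Assign a key to a partition by hashing it (simplified model)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many records end up in each partition."""
    sizes = Counter(partition_for(key, num_partitions) for key in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]


# Hypothetical workload: one hot client produces 90% of all records.
keys = ["client_hot"] * 9000 + [f"client_{i}" for i in range(1000)]
sizes = partition_sizes(keys, 8)
print(sizes, "largest share:", max(sizes) / len(keys))
```

All records with the same key must go to the same partition, so the worker that receives `client_hot` gets at least 90% of the data no matter how many partitions there are. Adding machines does not help; only a different key or a different data layout does.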

Is Amazon S3 a candidate for the DataLake?

Experience with Hadoop/MapReduce taught us that we need a scalable, reliable file system and scalable workers on top of it that "come" closer to the data, so as not to drive the data over the network. The workers should be able to read data in different formats, but preferably without reading unnecessary information, and it should be possible to store data in advance in formats that are convenient for the workers.

Once again, the basic idea. There is no desire to "pour" big data into a single cluster analytical engine, which sooner or later will choke and have to be sharded in an ugly way. I want to store files, just files, in an understandable format and run efficient analytical queries on them using different but understandable tools. And there will be more and more files in different formats. And it is better to shard not the engine but the source data. We need an extensible and universal DataLake, we decided...

What if you store the files in the familiar and well-known scalable cloud storage Amazon S3, without having to prepare your own chops from Hadoop?

It is clear that personal data is a no-go there, but what about other data, if we move it out there and "drive it efficiently"?

The cluster-bigdata-analytics ecosystem of Amazon Web Services, in very simple words

Judging by our experience with AWS, Apache Hadoop/MapReduce has been actively used there for a long time under various sauces, for example in the DataPipeline service (I envy my colleagues, who have learned how to cook it correctly). That is where we have backups of DynamoDB tables from various services set up.

And they have been running regularly on embedded Hadoop/MapReduce clusters, like clockwork, for several years now. "Set it and forget it."

You can also effectively engage in data satanism by setting up Jupyter notebooks in the cloud for analysts and using the AWS SageMaker service to train and deploy AI models into battle. We work this way ourselves.

And yes, you can pick up a notebook in the cloud for yourself or for an analyst, attach it to a Hadoop/Spark cluster, do the calculations, and then tear everything down.

It is really convenient for individual analytical projects, and for some of them we have successfully used the EMR service for large-scale calculations and analytics. But what about a system solution for the DataLake, will it work? At that moment we were on the verge of hope and despair, and we continued the search.

AWS Glue: neatly packaged Apache Spark on steroids

It turned out that AWS has its own version of the "Hive/Pig/Spark" stack. The role of Hive, i.e. the catalog of files and their types in the DataLake, is performed by the "Data catalog" service, which does not hide its compatibility with the Apache Hive format. You need to add information to this service about where your files are located and what format they are in. The data can be not only in s3 but also in a database, but that is not the subject of this post. The data catalog of our DataLake is organized in exactly this way.
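
To make "where the files are and what format they are in" more concrete, here is a hedged sketch of a table description for parquet files in the Hive-compatible shape that the Glue Data Catalog accepts. The table name, bucket, columns and partition keys are hypothetical, and in practice a crawler normally generates such entries for you.

```python
def parquet_table_input(name, location, columns, partition_keys):
    """Describe an external parquet table: location, columns, partition folders."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        # Partition keys correspond to the folder levels under the location.
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Location": location,
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


# Hypothetical web-server log table, for illustration only.
table = parquet_table_input(
    name="weblogs",
    location="s3://example-bucket/logs/",
    columns=[("client_id", "string"), ("status", "int"), ("bytes", "bigint")],
    partition_keys=[("year", "int"), ("month", "int"), ("day", "int")],
)
```

With boto3 installed and credentials configured, such a dictionary could be passed as `boto3.client("glue").create_table(DatabaseName="example_db", TableInput=table)`, where the database name is again a placeholder.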

The files are registered, great. If the files have been updated, we launch crawlers, either manually or on a schedule, which update the information about them from the lake and save it. Then the data from the lake can be processed and the results uploaded somewhere. In the simplest case, we upload them to s3 as well. Data processing can be done anywhere, but it is suggested that you set up the processing on an Apache Spark cluster using the advanced capabilities of the AWS Glue API. In fact, you can take good old familiar python code that uses the pyspark library and configure its execution on N nodes of a cluster of a given capacity, with monitoring, without digging into the guts of Hadoop, dragging docker-moker containers around, or eliminating dependency conflicts.

Once again, a simple idea. There is no need to configure Apache Spark; you just need to write python code for pyspark, test it locally on your desktop, and then run it on a large cluster in the cloud, specifying where the source data is and where to put the result. Sometimes this is necessary and useful, and we have set it up in exactly this way.

So, if you need to calculate something on a Spark cluster using data in s3, we write the code in python/pyspark, test it, and send it off to the cloud with good luck.
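
A job of this kind can be very small. The sketch below uses only the standard pyspark DataFrame API and takes the session as a parameter, so the same function can run locally or on a cluster. The `status` column and the paths are assumptions for illustration, and this is not the authors' actual job.

```python
def count_status_codes(spark, source_path, target_path):
    """Read parquet logs, count the rows per HTTP status, write the report as parquet.

    `spark` is an existing SparkSession; `source_path` and `target_path`
    may be local directories or s3 locations, depending on where it runs.
    """
    logs = spark.read.parquet(source_path)
    report = logs.groupBy("status").count()
    report.write.mode("overwrite").parquet(target_path)
    return report


# Local test run (requires pyspark to be installed):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.master("local[2]").appName("status-report").getOrCreate()
#   count_status_codes(spark, "/tmp/weblogs", "/tmp/status_report")
```

Keeping the session and the paths outside the function is the design choice that makes the "test on the desktop, then run in the cloud" workflow possible: only the arguments change between the two environments.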

What about orchestration? What if a task fell over and disappeared? Yes, it is proposed to make a beautiful pipeline in the Apache Pig style, and we even tried them, but for now we have decided to use our own deeply customized orchestration in PHP and JavaScript (I understand, there is cognitive dissonance, but it has worked for years and without errors).

The format of the files stored in the lake is the key to performance

It is very, very important to understand two more key points. For queries on file data in the lake to be executed as quickly as possible, and for performance not to degrade when new information is added, you need to:

  • Store the columns of the files separately (so that you do not have to read all the lines to understand what is in the columns). For this we took the parquet format with compression.
  • It is very important to shard the files into folders such as: language, year, month, day, week. Engines that understand this type of sharding will look only into the necessary folders, without sifting through all the data in a row.
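
The second point can be sketched as follows. The `key=value` folder naming is the Hive-style convention that engines of this kind understand; the exact folder names used in the authors' lake are not given in the text, so the layout and file names below are assumptions.

```python
from datetime import date


def partition_prefix(base: str, day: date) -> str:
    """Hive-style folder for one day: <base>/year=YYYY/month=MM/day=DD/"""
    return f"{base}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def prune(keys, base, days):
    """Keep only the objects that live in the folders of the requested days."""
    prefixes = tuple(partition_prefix(base, d) for d in days)
    return [key for key in keys if key.startswith(prefixes)]


# Hypothetical object keys in the lake.
keys = [
    "logs/year=2019/month=05/day=16/part-0000.parquet",
    "logs/year=2019/month=05/day=17/part-0000.parquet",
    "logs/year=2019/month=05/day=17/part-0001.parquet",
    "logs/year=2019/month=06/day=01/part-0000.parquet",
]
wanted = prune(keys, "logs", [date(2019, 5, 17)])
print(wanted)
```

A query restricted to one day touches two files out of four here; on a real lake with years of logs the same pruning skips almost everything, and inside the remaining parquet files only the requested columns are read.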

In essence, in this way you lay out the source data in the most efficient form for the analytical engines hung on top, which even within sharded folders can selectively enter and read only the necessary columns from the files. You do not need to "fill up" the data anywhere (the storage will simply burst): just put it wisely, right away, into the file system in the correct format. Of course, it should be clear here that storing a huge csv file in the DataLake, which the cluster must first read line by line in order to extract the columns, is not very advisable. Think about the two points above again if it is not yet clear why all this is happening.

AWS Athena: a jack-in-the-box

And then, while creating the lake, we somehow accidentally came across Amazon Athena. Suddenly it turned out that by carefully arranging our huge log files into folder shards in the correct (parquet) column format, you can very quickly make extremely informative selections from them and build reports WITHOUT, without an Apache Spark/Glue cluster.

The Athena engine, which works on data in s3, is based on the legendary Presto, a representative of the MPP (massive parallel processing) family of approaches to data processing, which takes data where it lies, from s3 and Hadoop to Cassandra and ordinary text files. You just need to ask Athena to execute an SQL query, and then everything "works quickly and automatically." It is important to note that Athena is "smart": it goes only to the necessary sharded folders and reads only the columns needed in the request.

The pricing of requests to Athena is also interesting. We pay for the volume of scanned data. That is, not for the number of machines in the cluster per minute, but... for the data actually scanned on 100-500 machines, only the data necessary to complete the request.

And by requesting only the necessary columns from correctly sharded folders, it turned out that the Athena service costs us tens of dollars a month. Well, great, almost free compared to analytics on clusters!
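
A back-of-the-envelope calculation shows why pruning matters so much under pay-per-scan pricing. The rate of 5 USD per terabyte scanned and the 10 MB minimum per query are assumptions to be checked against current AWS pricing for your region, and the data volumes are invented.

```python
TB = 1024 ** 4
GB = 1024 ** 3
MB = 1024 ** 2


def query_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0, min_bytes: int = 10 * MB) -> float:
    """Estimated cost of one query: scanned volume (with a minimum) times the rate."""
    billed = max(bytes_scanned, min_bytes)
    return billed / TB * usd_per_tb


# Hypothetical: a full scan of 2 TB of raw logs versus 2 GB that remain
# after partition pruning and reading only the needed parquet columns.
full_scan = query_cost_usd(2 * TB)
pruned = query_cost_usd(2 * GB)
print(f"full scan: ${full_scan:.2f}, pruned: ${pruned:.4f}")
```

Under these assumptions the same question costs about a thousand times less when the engine can skip folders and columns, which is consistent with a monthly bill measured in tens of dollars.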

By the way, this is exactly how we shard our own data in s3.

As a result, in a short time completely different departments in the company, from information security to analytics, began to actively make requests to Athena and quickly, in seconds, receive useful answers from "big" data over fairly long periods: months, half a year, etc.

But we went further and began to go to the cloud for answers through the ODBC driver: an analyst writes an SQL query in a familiar console, the query works through the data in s3 on 100-500 machines "for pennies" and returns an answer, usually within a few seconds. Convenient. And fast. I still can't believe it.

As a result, having decided to store data in s3, in an efficient columnar format and with reasonable sharding of the data into folders... we received a DataLake and a fast and cheap analytical engine, for free. And it became very popular in the company, because... it understands SQL and works orders of magnitude faster than starting/stopping/setting up clusters. "And if the result is the same, why pay more?"

A request to Athena looks something like this. If desired, of course, you can build a sufficiently complex and multi-page SQL query, but we will limit ourselves to simple grouping. Let's see what response codes a client had a few weeks ago in the web server logs and make sure that there are no errors.
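
A query of that kind might look roughly like the one built below. The table name `weblogs`, the columns `status` and `client_id`, and the integer partition columns `year`, `month`, `day` are hypothetical; the original query is not reproduced in the text.

```python
def status_codes_query(table, client_id, year, month, first_day, last_day):
    """Build an SQL query that counts response codes for one client and period."""
    client = client_id.replace("'", "''")  # escape single quotes in the literal
    return (
        f"SELECT status, count(*) AS hits\n"
        f"FROM {table}\n"
        # Filtering on partition columns lets the engine skip whole folders.
        f"WHERE year = {int(year)} AND month = {int(month)}\n"
        f"  AND day BETWEEN {int(first_day)} AND {int(last_day)}\n"
        f"  AND client_id = '{client}'\n"
        f"GROUP BY status\n"
        f"ORDER BY hits DESC"
    )


sql = status_codes_query("weblogs", "p1", 2019, 5, 1, 14)
print(sql)
```

The text can be pasted into the Athena console or sent through the ODBC driver. With boto3 it could be submitted as `boto3.client("athena").start_query_execution(QueryString=sql, QueryExecutionContext={"Database": "example_db"}, ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"})`, where the database and bucket names are placeholders.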

Conclusions

Having traveled a path that was, if not long, then certainly painful, constantly and soberly assessing the risks, the level of complexity, and the cost of support, we found a solution for the DataLake and analytics that never ceases to please us with both its speed and its cost of ownership.

It turned out that building an efficient, fast, and cheap-to-operate DataLake for the needs of completely different departments of the company is entirely within the capabilities of even experienced developers who have never worked as architects, do not know how to draw squares on squares with arrows, and do not know 50 terms from the Hadoop ecosystem.

At the beginning of the journey, my head was splitting from the many wild zoos of open and closed software and from the awareness of the burden of responsibility to descendants. Just start building your DataLake from simple tools: nagios/munin -> elastic/kibana -> Hadoop/Spark/s3..., collecting feedback and deeply understanding the physics of the processes taking place. Everything complex and murky, give it to your enemies and competitors.

If you do not want to go to the cloud and you like to support, update, and patch open-source projects, you can build a scheme similar to ours locally, on inexpensive office machines with Hadoop and Presto on top. The main thing is not to stop and to keep moving forward, to count, to look for simple and clear solutions, and everything will definitely work out! Good luck to everyone and see you again!

Source: www.habr.com
