How we organized a highly efficient and inexpensive DataLake, and why we did it this way

We live in an amazing time when you can quickly and easily wire together several ready-made open-source tools, configure them with your "consciousness switched off" by following stackoverflow advice, without wading through "all those words", and put them into commercial operation. Then, when you need to upgrade or scale, or someone accidentally reboots a couple of machines, you realize that a persistent bad dream has begun: everything has become complicated beyond recognition, there is no way back, the future is unclear, and it would be safer to give up programming, keep bees and make cheese.

It is not for nothing that more experienced colleagues, their heads strewn with bugs and therefore already gray, smile modestly as they watch packs of "containers" being deployed incredibly fast into "cubes" on dozens of servers, in "fashionable languages" with built-in support for asynchronous non-blocking I/O. They silently carry on re-reading "man ps", dig through the "nginx" source code until their eyes bleed, and write, write, write unit tests. These colleagues know that the most interesting part comes when "all this" falls over one night, on New Year's Eve. Then the only things that help are a deep understanding of the nature of unix, a memorized TCP/IP state table and the basic sorting and searching algorithms, so that the system is back to life by the time the chimes strike.

Oh yes, I got a little distracted, but I hope I managed to convey the mood.
Today I want to share our experience of deploying a convenient and inexpensive stack for a DataLake that covers most of the analytical tasks in the company for completely different departments.

Some time ago we came to understand that companies increasingly need the fruits of both product and technical analytics (not to mention the icing on the cake in the form of machine learning), and that to understand trends and risks we need to collect and analyze more and more metrics.

Basic technical analytics in Bitrix24

Several years ago, at the same time as the Bitrix24 service was launched, we actively invested time and resources in building a simple and reliable analytical platform that would help us quickly spot problems in the infrastructure and plan the next step. Naturally, it made sense to take ready-made tools that were as simple and understandable as possible. As a result, nagios was chosen for monitoring and munin for analytics and visualization. We now have thousands of checks in nagios and hundreds of charts in munin, and our colleagues use them successfully every day. The metrics are clear, the graphs are clear, the system has worked reliably for several years, and new checks and graphs are added regularly: whenever we put a new service into operation, we add several checks and graphs for it. So far, so good.

Finger on the pulse - advanced technical analytics

The desire to learn about problems "as quickly as possible" led us to experiment actively with two simple and understandable tools: pinba and xhprof.

Pinba sent us statistics in UDP packets about how fast parts of our PHP web pages were running, and we could see a short list of problems online in MySQL storage (Pinba ships with its own MySQL engine for fast event analytics) and respond to them. xhprof, in turn, let us automatically collect execution graphs of the slowest PHP pages from clients and analyze what could have caused the slowdown, calmly, over a cup of tea or something stronger.

Some time ago the toolkit gained another fairly simple and understandable engine built on the inverted indexing algorithm, superbly implemented in the legendary Lucene library: Elastic/Kibana. The simple idea of writing documents in multiple threads into an inverted Lucene index built from log events, and then searching them quickly with faceted breakdowns, turned out to be really useful.
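To make the idea of an inverted index with facets concrete, here is a minimal sketch in plain Python. It is not how Lucene or Elastic are implemented; the event fields (`portal`, `level`, `message`) and the function names are invented for illustration only.

```python
from collections import defaultdict

def build_inverted_index(events):
    """Map each (field, value) term to the set of event ids that contain it."""
    index = defaultdict(set)
    for event_id, event in enumerate(events):
        for field, value in event.items():
            if field == "message":
                # Tokenise free text so each word becomes a searchable term.
                for word in value.lower().split():
                    index[("message", word)].add(event_id)
            else:
                index[(field, value)].add(event_id)
    return index

def search(index, **terms):
    """Return ids of events that match ALL given field=value terms."""
    postings = [index.get(item, set()) for item in terms.items()]
    return set.intersection(*postings) if postings else set()

def facet(events, ids, field):
    """Count the matching events per value of `field` (a facet breakdown)."""
    counts = defaultdict(int)
    for event_id in ids:
        counts[events[event_id].get(field)] += 1
    return dict(counts)

# Hypothetical log events; field names and values are illustrative.
events = [
    {"portal": "p1", "level": "error", "message": "Out of memory"},
    {"portal": "p1", "level": "error", "message": "Segfault in extension"},
    {"portal": "p2", "level": "info", "message": "Call finished"},
    {"portal": "p1", "level": "info", "message": "Call finished"},
]
index = build_inverted_index(events)
errors_p1 = search(index, portal="p1", level="error")
by_level = facet(events, search(index, portal="p1"), "level")
```

The lookup never scans the events themselves: it intersects precomputed sets of ids, which is why this kind of search stays fast as the log volume grows.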

Despite the rather technical look of Kibana visualizations, with low-level concepts such as "buckets" leaking upward and a reinvented dialect of the not yet completely forgotten relational algebra, the tool began to help us a lot with questions like these:

  • How many PHP errors did a Bitrix24 client have on portal p1 in the last hour, and which ones? Understand, forgive and fix quickly.
  • How many video calls were made on portals in Germany in the previous 24 hours, with what quality, and were there any problems with the channel or network?
  • How well does the system functionality (our C extension for PHP), compiled from source in the latest service update and rolled out to clients, actually work? Are there any segfaults?
  • Does client data fit into PHP memory? Are there any "out of memory" errors about exceeding the memory allocated to processes? Find and neutralize.

Here is a concrete example. Despite thorough, multi-level testing, a client with a very non-standard case and damaged input data received an annoying and unexpected error; the siren went off and the process of fixing it quickly began.


In addition, kibana lets you set up notifications for specified events, and within a short time the tool was being used by dozens of employees from different departments, from technical support and development to QA.

The activity of any department in the company has become easy to track and measure. Instead of analyzing logs on servers by hand, you set up log parsing once and ship the logs to the elastic cluster, and then enjoy contemplating in a kibana dashboard, for example, the number of two-headed kittens printed on a 3-D printer and sold over the last lunar month.

Basic business analytics

Everyone knows that business analytics in companies often begins with extremely active use of, yes, Excel. The main thing is that it does not end there. Cloud-based Google Analytics adds fuel to the fire: you quickly get used to the good stuff.

In our harmoniously developing company, "prophets" of more intensive work with larger data began to appear here and there. The need for deeper and more multifaceted reports came up regularly, and some time ago, through the efforts of people from different departments, a simple and practical solution was put together: a combination of ClickHouse and PowerBI.

For quite a long time this flexible solution helped a lot, but gradually we came to understand that ClickHouse is not made of rubber and cannot be abused like that.

Here it is important to understand that ClickHouse, like Druid, Vertica and Amazon RedShift (which is based on postgres), is an analytical engine optimized for fairly convenient analytics (sums, aggregations, per-column minimum and maximum, and a handful of possible joins), because it is organized for efficient storage of the columns of relational tables, unlike MySQL and the other row-oriented databases we know.
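The difference between row-oriented and column-oriented storage can be sketched with two in-memory layouts of the same data. This is only an illustration of the access pattern, not of how ClickHouse or MySQL store data on disk; the table and its fields are hypothetical.

```python
# Row-oriented layout: every record is stored together, as in a row store.
rows = [
    {"portal": "p1", "country": "DE", "calls": 12, "errors": 0},
    {"portal": "p2", "country": "DE", "calls": 7, "errors": 1},
    {"portal": "p3", "country": "US", "calls": 30, "errors": 0},
]

def to_columns(rows):
    """Convert the same data to a column-oriented layout: one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def total_calls_row_store(rows):
    # Every whole record has to be visited even though one field is needed.
    return sum(row["calls"] for row in rows)

def total_calls_column_store(columns):
    # Only the "calls" column is read; the other columns are never touched.
    return sum(columns["calls"])

columns = to_columns(rows)
row_total = total_calls_row_store(rows)
column_total = total_calls_column_store(columns)
```

Both functions return the same sum, but on disk the column layout lets an engine skip the bytes of every column a query does not mention, which is what makes aggregations cheap.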

In essence, ClickHouse is just a more capacious "database" with not very convenient point inserts (that is by design, and that is fine), but with pleasant analytics and a set of interesting, powerful functions for working with data. Yes, you can even build a cluster, but hammering nails with a microscope is not quite the right approach, so we began looking for other solutions.

Demand for python and analysts

Our company has many developers who have been writing code almost every day for 10-20 years in PHP, JavaScript, C#, C/C++, Java, Go, Rust, Python and Bash. There are also many experienced system administrators who have lived through more than one absolutely incredible disaster that defies the laws of statistics (for example, when most of the disks in a raid-10 are destroyed by a strong lightning strike). Under these circumstances it was unclear for a long time what a "python analyst" was. Python is like PHP, only the name is a little longer and there are slightly fewer traces of mind-altering substances in the interpreter's source code. However, as more and more analytical reports were produced, experienced developers increasingly came to appreciate the value of narrow specialization in tools such as numpy, pandas, matplotlib and seaborn.
The decisive role was most likely played by employees suddenly fainting at the words "logistic regression", combined with a demonstration of effective reporting on large data using, yes, yes, pyspark.

Apache Spark, with its functional paradigm that fits relational algebra perfectly, and its capabilities made such an impression on developers used to MySQL that the need to strengthen our ranks with experienced analysts became clear as day.

Further attempts by Apache Spark/Hadoop to take off, and what did not go quite according to script

However, it soon became clear that something was systemically not quite right with Spark, or that we simply needed to wash our hands better. The Hadoop/MapReduce/Lucene stack was built by fairly experienced programmers, which is obvious if you look closely at the Java source code or at Doug Cutting's ideas in Lucene. Spark, by contrast, is suddenly written in the exotic language Scala, which in our view is very debatable from a practical standpoint and, as it seemed to us at the time, was not evolving. Regular failures of computations on the Spark cluster, caused by illogical and not very transparent memory allocation for reduce operations (many keys arriving at once), gave it the aura of something that still has room to grow. The situation was made worse by a large number of strange open ports, temporary files growing in the most incomprehensible places and a hell of jar dependencies, all of which gave the system administrators one feeling well known since childhood: fierce hatred (or maybe they just needed to wash their hands with soap).

As a result, we have "survived" several internal analytical projects that actively use Apache Spark (including Spark Streaming and Spark SQL) and the Hadoop ecosystem (and so on and so forth). Over time we learned to prepare and monitor "it" quite well, and "it" practically stopped crashing suddenly because of changes in the nature of the data or unbalanced RDD hashing. Even so, the desire to take something ready-made, kept up to date and administered somewhere in the cloud grew stronger and stronger. It was at this point that we tried the ready-made cloud build from Amazon Web Services, EMR, and subsequently tried to solve our problems with it. EMR is Apache Spark prepared by Amazon together with additional software from the ecosystem, much like the Cloudera/Hortonworks builds.

Rubber file storage for analytics is urgently needed

The experience of "cooking" Hadoop/Spark, with burns to various parts of the body, was not in vain. It became increasingly clear that we needed a single, inexpensive and reliable file storage that would withstand hardware failures, could hold files in different formats from different systems, and would allow efficient, fast selections from this data for reports.

I also wanted software updates for this platform not to turn into a New Year's nightmare of reading 20-page Java traces and analyzing kilometer-long detailed cluster logs with Spark History Server and a backlit magnifying glass. I wanted a simple and transparent tool that would not require regular diving under the hood whenever a developer's standard MapReduce job stopped running because a reduce worker ran out of memory due to a poorly chosen partitioning algorithm for the source data.
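The failure mode described here, one reduce worker receiving far more data than the rest, can be reproduced in a few lines. The sketch below is an assumption-laden illustration: it uses `zlib.crc32` as a stand-in for a partitioner's hash function, a made-up workload with one "hot" key, and key salting as one commonly used mitigation. It is not Spark's own partitioner.

```python
import zlib
from collections import Counter

def partition_for(key, num_partitions):
    """Hash partitioning: the same key always lands in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_sizes(keys, num_partitions):
    """Count how many records each partition (reduce worker) receives."""
    sizes = Counter(partition_for(key, num_partitions) for key in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]

def salted_keys(keys, salt_buckets):
    """Split each key into several sub-keys so that a hot key is spread out.
    Partial results per sub-key must be merged in a second aggregation pass."""
    return [f"{key}#{i % salt_buckets}" for i, key in enumerate(keys)]

# Hypothetical workload: one hot portal produces 90% of all records.
keys = ["portal-hot"] * 9000 + [f"portal-{i}" for i in range(1000)]

skewed = partition_sizes(keys, 8)
salted = partition_sizes(salted_keys(keys, 64), 8)

print("largest partition without salting:", max(skewed))
print("largest partition with salting:   ", max(salted))
```

Without salting, all 9000 records of the hot key go to a single worker no matter how many workers exist, which is exactly the situation in which that worker runs out of memory while the others sit idle.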

Is Amazon S3 a candidate for the DataLake?

Experience with Hadoop/MapReduce taught us that we need a scalable, reliable file system with scalable workers on top of it that "come" close to the data, so that the data is not driven across the network. Workers should be able to read data in different formats, preferably without reading unnecessary information, and it should be possible to store data ahead of time in formats that are convenient for the workers.

Once again, the main idea. We have no desire to "pour" big data into a single clustered analytical engine that will sooner or later choke, at which point it has to be sharded in some ugly way. I want to store files, just files, in an understandable format and run efficient analytical queries on them with different but understandable tools. There will be more and more files in different formats. And it is better to shard the source data, not the engine. We needed an extensible and universal DataLake, we decided...

What if you store the files in the familiar, well-known and scalable cloud storage Amazon S3, without having to cook your own chops out of Hadoop?

Obviously personal data is off limits, but what about the rest of the data, if we move it out there and "drive it effectively"?

The cluster-bigdata-analytics ecosystem of Amazon Web Services, in very simple words

Judging by our experience with AWS, Apache Hadoop/MapReduce has been actively used there for a long time under various sauces, for example in the DataPipeline service (I envy my colleagues, they learned how to cook it properly). There we set up backups of DynamoDB tables for various services.

They have been running regularly on the built-in Hadoop/MapReduce clusters, like clockwork, for several years now. "Set it and forget it."


You can also effectively practice data satanism by spinning up Jupyter notebooks in the cloud for analysts and using the AWS SageMaker service to train AI models and deploy them into battle. We use it this way ourselves.


And yes, you can spin up a notebook in the cloud for yourself or an analyst, attach it to a Hadoop/Spark cluster, run the calculations and then tear everything down.


This is really convenient for individual analytical projects, and for some of them we have successfully used the EMR service for large-scale calculations and analytics. But what about a system-wide solution for the DataLake, would it work? At this point we were hovering between hope and despair, and we continued the search.

AWS Glue - neatly packaged Apache Spark on steroids

It turned out that AWS has its own version of the "Hive/Pig/Spark" stack. The role of Hive, i.e. the catalog of files and their types in the DataLake, is played by the "Data catalog" service, which does not hide its compatibility with the Apache Hive format. You add information to this service about where your files are located and what format they are in. The data can live not only in s3 but also in a database, but that is not the subject of this post. Our DataLake data catalog is organized in exactly this way.


The files are registered, great. When the files change, we launch crawlers, either manually or on a schedule, which refresh the information about them from the lake and save it. The data from the lake can then be processed and the results uploaded somewhere. In the simplest case we upload them to s3 as well. Processing can be done anywhere, but the suggested approach is to configure it on an Apache Spark cluster, using the advanced capabilities available through the AWS Glue API. In effect, you take good old familiar python code that uses the pyspark library and configure it to run on N nodes of a cluster of some capacity, with monitoring, without digging into the guts of Hadoop, dragging docker-moker containers around or resolving dependency conflicts.

Once again, a simple idea. There is no need to configure Apache Spark: you just write python code for pyspark, test it locally on your desktop and then run it on a large cluster in the cloud, specifying where the source data is and where to put the result. Sometimes this is necessary and useful, and this is how we run such jobs.


So, when we need to compute something on a Spark cluster over data in s3, we write the code in python/pyspark, test it, and send it off to the cloud with our best wishes.
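A job of this kind can be quite small. The sketch below is an assumption, not the authors' code: the column name `status`, the application name and the bucket paths in the usage note are hypothetical, and the Glue-specific job wrapper is left out. It uses only the standard pyspark DataFrame calls `read.parquet`, `groupBy`, `count` and `write.mode(...).parquet`.

```python
def count_status_codes(spark, source_path, result_path):
    """Read Parquet logs, count rows per HTTP status, store the result as Parquet.

    `spark` is a SparkSession; the "status" column is a hypothetical field name.
    """
    logs = spark.read.parquet(source_path)
    counts = logs.groupBy("status").count()
    counts.write.mode("overwrite").parquet(result_path)
    return counts

def run(source_path, result_path):
    """Create a session, run the job and always release the cluster resources."""
    # Imported here so the module can be loaded where pyspark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("status-code-report").getOrCreate()
    try:
        count_status_codes(spark, source_path, result_path)
    finally:
        spark.stop()
```

Locally the same function can be pointed at a folder on disk, for example `run("/tmp/logs", "/tmp/report")`; in the cloud only the two paths change, for example to locations under a hypothetical `s3://example-datalake/` bucket.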

What about orchestration? What if a job fails and disappears? Yes, the suggested approach is to build a beautiful pipeline in the Apache Pig style, and we even tried that, but for now we have decided to use our own deeply customized orchestration in PHP and JavaScript (I understand, there is some cognitive dissonance here, but it has been working for years without errors).


The format of files stored in the lake is the key to performance

It is very, very important to understand two more key points. For queries over the file data in the lake to run as fast as possible, and for performance not to degrade as new information is added, you need to:

  • Store the columns of files separately, so that you do not have to read every row to find out what is in the columns. For this we chose the parquet format with compression.
  • Shard the files into folders such as language, year, month, day, week. Engines that understand this kind of sharding look only into the folders they need, without sifting through all the data in a row.

In essence, this way you lay out the source data in the most efficient form for the analytical engines placed on top, which can go selectively into the sharded folders and read only the necessary columns from the files. You do not need to "load" the data anywhere (the storage would simply burst); you just put it into the file system sensibly, in the right format, from the start. It should be clear that storing a huge csv file in the DataLake, which the cluster first has to read line by line in order to extract the columns, is not very advisable. If it is still unclear why all this matters, think about the two points above once more.
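The folder sharding described above is usually expressed as `key=value` path segments, the Hive-style convention that partition-aware engines recognize. The following sketch shows how such paths can be built and how a filter on partition values lets an engine skip whole folders. The bucket name and the choice of partition keys (`lang`, `year`, `month`, `day`) are illustrative assumptions, not the layout used in the article.

```python
from datetime import date

def partition_path(root, lang, day):
    """Build a Hive-style folder path: one key=value segment per partition column."""
    return (f"{root}/lang={lang}/year={day.year}"
            f"/month={day.month:02d}/day={day.day:02d}")

def parse_partitions(path):
    """Extract {column: value} from the key=value segments of a path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

def prune(paths, **wanted):
    """Partition pruning: keep only folders whose partition values match the filter."""
    return [path for path in paths
            if all(parse_partitions(path).get(key) == value
                   for key, value in wanted.items())]

# Hypothetical lake root; the files inside each folder would be compressed parquet.
root = "s3://example-datalake/logs"
paths = [partition_path(root, lang, day)
         for lang in ("ru", "de")
         for day in (date(2020, 1, 30), date(2020, 2, 1))]

february_de = prune(paths, lang="de", month="02")
```

Only one of the four folders survives the filter, so a query restricted to German portals in February never opens the other three. Column pruning inside the parquet files then cuts the amount of data read further.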

AWS Athena - the jack-in-the-box

And then, while building the lake, we somehow stumbled upon Amazon Athena. It suddenly turned out that by carefully arranging our huge log files into folder shards in the right (parquet) columnar format, we could very quickly make extremely informative selections from them and build reports WITHOUT an Apache Spark/Glue cluster at all.

The Athena engine, which works over data in s3, is based on the legendary Presto, a representative of the MPP (massively parallel processing) family of data processing approaches that takes data where it lies, from s3 and Hadoop to Cassandra and ordinary text files. You just ask Athena to execute an SQL query, and then everything "works quickly and automatically". It is important to note that Athena is "smart": it goes only into the necessary sharded folders and reads only the columns needed by the query.

The pricing of Athena queries is interesting too. We pay for the volume of data scanned, i.e. not for the number of machines in a cluster per minute, but for the data actually scanned on 100-500 machines, and only the data needed to complete the query.

By requesting only the necessary columns from correctly sharded folders, we found that the Athena service costs us tens of dollars a month. That is great, practically free, compared with analytics on clusters!
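A back-of-the-envelope calculation shows why sharding and columnar storage drive the bill down. The sketch below assumes a price of 5 USD per terabyte scanned, that data is spread evenly over partitions and columns, and a made-up dataset size; it also ignores any minimum charges or rounding. Check the current AWS price list before relying on the numbers.

```python
TB = 1024 ** 4  # bytes in one terabyte (binary)

def scanned_bytes(total_bytes, partitions_read, partitions_total,
                  columns_read, columns_total):
    """Estimate bytes scanned, assuming data is evenly spread over
    partitions and columns (a simplification)."""
    return (total_bytes
            * partitions_read / partitions_total
            * columns_read / columns_total)

def query_cost_usd(bytes_scanned, usd_per_tb=5.0):
    """Cost of one query when billing is per terabyte scanned.
    The default price is an assumption, not a quoted tariff."""
    return bytes_scanned / TB * usd_per_tb

# Hypothetical dataset: 10 TB of logs, 365 daily folders, 30 columns.
total = 10 * TB
full_scan = query_cost_usd(total)
pruned = query_cost_usd(scanned_bytes(total, 7, 365, 3, 30))

print(f"full scan:            ${full_scan:.2f}")
print(f"one week, 3 columns:  ${pruned:.4f}")
```

Under these assumptions a query over one week and three columns costs roughly a five-hundredth of a full scan, which is consistent with a monthly bill measured in tens of dollars.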

By the way, our own data in s3 is sharded into folders in exactly this way.


As a result, within a short time completely different departments in the company, from information security to analytics, began actively sending queries to Athena and quickly, within seconds, getting useful answers from "big" data over fairly long periods: months, half a year, and so on.

But we went further and started going to the cloud for answers through an ODBC driver: an analyst writes an SQL query in a familiar console, the query runs "for pennies" on 100-500 machines over the data in s3, and the answer usually comes back within a few seconds. Convenient. And fast. I still can't believe it.

As a result, having decided to store data in s3 in an efficient columnar format with sensible sharding into folders, we got a DataLake plus a fast and cheap analytical engine, practically for free. It became very popular in the company because it understands SQL and works orders of magnitude faster than starting, stopping and configuring clusters. "And if the result is the same, why pay more?"

A query to Athena can be as simple or as complex as you like. If you wish, you can of course write a fairly complex multi-page SQL query, but we will limit ourselves to a simple grouping: see which response codes a client had in the web server logs a few weeks ago and make sure there were no errors.
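The shape of such a grouping query can be tried locally. In the sketch below an in-memory sqlite3 database stands in for Athena, purely so that the example runs anywhere; the table `access_logs` and its columns are hypothetical, and in the lake `portal`, `year` and `month` would typically be folder partitions rather than ordinary columns.

```python
import sqlite3

# A simple GROUP BY: how many requests ended with each response code.
QUERY = """
SELECT status, COUNT(*) AS hits
FROM access_logs
WHERE portal = ? AND year = ? AND month = ?
GROUP BY status
ORDER BY hits DESC, status
"""

def status_breakdown(conn, portal, year, month):
    """Return (status, hits) rows for one portal and one month."""
    return conn.execute(QUERY, (portal, year, month)).fetchall()

def has_server_errors(breakdown):
    """True if any 5xx response code appears in the breakdown."""
    return any(status >= 500 for status, _hits in breakdown)

# Hypothetical sample data standing in for web server logs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access_logs "
    "(portal TEXT, year INTEGER, month INTEGER, status INTEGER)"
)
conn.executemany("INSERT INTO access_logs VALUES (?, ?, ?, ?)", [
    ("p1", 2020, 1, 200), ("p1", 2020, 1, 200), ("p1", 2020, 1, 302),
    ("p1", 2020, 2, 500), ("p2", 2020, 1, 404),
])

january = status_breakdown(conn, "p1", 2020, 1)
february = status_breakdown(conn, "p1", 2020, 2)
```

The `WHERE` clause on the partition columns is what lets a partition-aware engine open only the relevant folders, while selecting a single column keeps the scanned volume small.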


Conclusions

Having travelled a path that was not exactly long but certainly painful, constantly and soberly assessing the risks, the level of complexity and the cost of support, we found a solution for the DataLake and analytics that never ceases to please us with both its speed and its cost of ownership.

It turned out that building an efficient, fast and cheap-to-operate DataLake for the needs of completely different departments of the company is entirely within the reach of experienced developers who have never worked as architects, cannot draw squares upon squares with arrows, and do not know 50 terms from the Hadoop ecosystem.

At the beginning of the journey my head was splitting from the many wild zoos of open and closed software and from the burden of responsibility to our descendants. Just start building your DataLake from simple tools: nagios/munin -> elastic/kibana -> Hadoop/Spark/s3..., collecting feedback and developing a deep understanding of the physics of the processes involved. Everything complex and murky can be left to enemies and competitors.

If you do not want to go to the cloud and you enjoy supporting, updating and patching open-source projects, you can build a setup similar to ours locally, on inexpensive office machines with Hadoop and Presto on top. The main thing is not to stop: keep moving forward, keep counting, look for simple and clear solutions, and everything will definitely work out! Good luck to everyone, and see you again!

Source: www.habr.com
