Do we need a data lake? And what do we do with the data warehouse?

This article is a translation of my Medium article, Getting Started with Data Lake, which turned out to be quite popular, probably because of its simplicity. So I decided to write it in Russian and add a little to it, to explain to an ordinary person who is not a data specialist what a data warehouse (DW) is, what a data lake (Data Lake) is, and how the two get along.

Why did I want to write about the data lake? I have been working with data and analytics for more than 10 years, and I now work with big data at Amazon Alexa AI in Cambridge, in the Boston area, although I live in Victoria on Vancouver Island and often visit Boston, Seattle, and Vancouver, and sometimes even Moscow, where I speak at conferences. I also write from time to time, mostly in English, and I have already written several books. I feel a need to share analytics trends from North America as well, and I sometimes write on Telegram.

I have always worked with data warehouses, and since 2015 I have worked closely with Amazon Web Services and moved to cloud analytics in general (AWS, Azure, GCP). I have watched analytics solutions evolve since 2007, and I even worked for the data warehouse vendor Teradata and implemented it at Sberbank, which is when Big Data with Hadoop appeared. Everyone started saying that the era of warehouses was over and that everything now ran on Hadoop. Then they started talking about the Data Lake, and once again the end of the data warehouse had supposedly arrived for certain. But fortunately (perhaps unfortunately for some who made a lot of money setting up Hadoop), the data warehouse did not go away.

In this article we will look at what a data lake is. It is intended for people who have little or no experience with data warehouses.

[Image: Lake Bled]

The picture shows Lake Bled, one of my favorite lakes. I was there only once, but I have remembered it for the rest of my life. We will talk about a different kind of lake, though: a data lake. Many of you have probably heard this term more than once, but one more definition will not hurt anyone.

First, here are the most popular definitions of a Data Lake:

"a file storage of all kinds of raw data that is available for analysis by anyone in the organization" - Martin Fowler.

"If you think of a data mart as a bottle of water, cleansed, packaged, and structured for easy consumption, then a data lake is a large body of water in its natural state. Users can come to collect water for themselves, dive in, and explore." - James Dixon.

Now we know for sure that a data lake is about analytics: it lets us store large amounts of data in its original form, and it gives us the necessary and convenient access to that data.

I often like to simplify things. If I can explain a complex term in simple words, then I understand for myself how it works and what it is for. One day I was poking around in the iPhone photo gallery, and it dawned on me that this is a real data lake. I even made a slide for conferences:

[Image: conference slide presenting the iPhone photo gallery as a data lake]

It is all very simple. We take a photo on the phone; the photo is stored on the phone and can also be saved to iCloud (cloud file storage). The phone also collects the photo's metadata: what is shown, the geotag, and the time. As a result, we can use the iPhone's user-friendly interface to find our photo, and we even see indicators. For example, when I search for photos with the word "fire", I find 3 photos with an image of a fire. To me this is just like a Business Intelligence tool that works very quickly and accurately.

And of course, we must not forget about security (authorization and authentication), otherwise our data can easily end up in the public domain. There is plenty of news about large corporations and startups whose data became publicly available because of developer negligence and a failure to follow simple rules.

Even such a simple picture helps us imagine what a data lake is, how it differs from a traditional data warehouse, and what its main elements are:

  1. Data loading (Ingestion) is a key component of a data lake. Data can enter the storage in two ways: batch (loading at intervals) and streaming (a continuous flow of data).
  2. File storage (Storage) is the main component of the Data Lake. We need the storage to be easily scalable, extremely reliable, and low-cost. On AWS, for example, this is S3.
  3. Catalog and search (Catalog and Search): to avoid a Data Swamp (when we dump all the data in one pile and it then becomes impossible to work with), we need to create a metadata layer that classifies the data, so that users can easily find the data they need for analysis. In addition, you can use extra search solutions such as ElasticSearch. Search helps the user find the required data through a user-friendly interface.
  4. Processing (Process): this step is responsible for processing and transforming data. We can transform data, change its structure, clean it, and much more.
  5. Security (Security): it is important to spend time on the security design of the solution, for example data encryption during storage, processing, and loading. It is important to use authentication and authorization methods. Finally, an audit tool is needed.
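
To make the catalog and search element a little more concrete, here is a minimal sketch in Python. It does not show how any particular product works: the `Catalog` class, its fields, and the sample paths are all hypothetical, and a real lake would rely on a managed metadata service rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Metadata kept about one file in the lake (hypothetical fields)."""
    path: str
    fmt: str
    tags: set = field(default_factory=set)


class Catalog:
    """Toy in-memory metadata layer: register files, then search by tag."""

    def __init__(self):
        self._entries = {}

    def register(self, path, fmt, tags):
        self._entries[path] = Entry(path, fmt, set(tags))

    def search(self, tag):
        # Return the paths of all files carrying the tag, sorted for stable output.
        return sorted(e.path for e in self._entries.values() if tag in e.tags)


catalog = Catalog()
catalog.register("raw/weblogs/2018-07-02.log", "text", {"weblogs", "raw"})
catalog.register("clean/weblogs/2018-07-02.parquet", "parquet", {"weblogs", "clean"})
catalog.register("raw/photos/fire.jpg", "jpeg", {"photos", "fire", "raw"})

print(catalog.search("weblogs"))
print(catalog.search("fire"))
```

Without such a layer, the only way to find the fire photo would be to open every file, which is the data swamp situation described in the list.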

From a practical point of view, we can characterize a data lake by three attributes:

  1. Collect and store anything: a data lake contains all the data, both raw data for any period of time and processed/cleansed data.
  2. Deep analysis: a data lake lets users explore and analyze the data.
  3. Flexible access: a data lake provides flexible access for different data and different scenarios.

Now we can talk about the difference between a data warehouse and a data lake. People usually ask:

  • What about the data warehouse?
  • Are we replacing the data warehouse with a data lake, or extending it?
  • Is it still possible to do without a data lake?

In short, there is no clear answer. It all depends on the specific situation, the skills of the team, and the budget. One example is the migration of a data warehouse from Oracle to AWS and the creation of a data lake by Woot, an Amazon subsidiary, described in "Our data lake story: How Woot.com built a serverless data lake on AWS".

On the other hand, the vendor Snowflake says that you no longer need to think about a data lake, since their data platform (until 2020 it was a data warehouse) lets you combine both a data lake and a data warehouse. I have not worked much with Snowflake, but it really is a unique product that can do this. The price is a separate question.

In conclusion, my personal opinion is that we still need a data warehouse as the main source of data for our reporting, and whatever does not fit there we store in a data lake. The whole role of analytics is to give the business convenient access to data for making decisions. Whatever one may say, business users work more efficiently with a data warehouse than with a data lake. At Amazon, for example, there is Redshift (an analytical data warehouse) and there is Redshift Spectrum/Athena (an SQL interface to a data lake in S3, based on Hive/Presto). The same applies to other analytical data warehouses.

Let's look at a typical data warehouse architecture:

[Image: typical data warehouse architecture]

This is the classic solution. We have source systems; using ETL/ELT we copy data into an analytical data warehouse and connect it to a Business Intelligence solution (my favorite is Tableau, what about yours?).
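
As a rough illustration of that flow, the sketch below runs a tiny ETL step in Python. SQLite stands in for the analytical warehouse purely so that the example is self-contained; the table name, the columns, and the sample rows are made up for the illustration.

```python
import sqlite3

# Extract: rows as they might arrive from a source system (made-up sample data).
source_rows = [
    {"order_id": 1, "country": "ca", "amount": "19.90"},
    {"order_id": 2, "country": "US", "amount": "5.10"},
    {"order_id": 3, "country": "us", "amount": "4.90"},
]

# Transform: normalize country codes and convert amounts to integer cents.
clean_rows = [
    (r["order_id"], r["country"].upper(), round(float(r["amount"]) * 100))
    for r in source_rows
]

# Load: write the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER, country TEXT, amount_cents INTEGER)"
)
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean_rows)

# A BI tool would then run aggregate queries like this one.
report = warehouse.execute(
    "SELECT country, SUM(amount_cents) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(report)
```

Note that the warehouse only ever sees the cleaned rows; the original strings with mixed-case country codes are gone once the load finishes.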

This solution has the following drawbacks:

  • ETL/ELT operations require time and resources.
  • As a rule, storage in an analytical data warehouse is not cheap (for example, Redshift, BigQuery, Teradata), since we need to buy a whole cluster.
  • Business users have access to cleansed and often aggregated data, and they have no access to the raw data.

Of course, it all depends on your case. If you have no problems with your data warehouse, you do not need a data lake at all. But when problems arise with a lack of space or capacity, or when price plays a key role, you can consider a data lake. That is why the data lake is so popular. Here is an example of a data lake architecture:

[Image: data lake architecture]

With the data lake approach, we load raw data into our data lake (batch or streaming) and then process the data as needed. A data lake lets business users create their own data transformations (ETL/ELT) or analyze the data in Business Intelligence solutions (if the required driver is available).
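
One way to picture "load raw first, process later" is the sketch below. It uses a temporary local directory as a stand-in for the lake and JSON lines as the raw format; both choices, and the field names, are assumptions made only for the illustration.

```python
import json
import tempfile
from pathlib import Path

lake = Path(tempfile.mkdtemp())  # stand-in for object storage such as S3

# Ingest: land the events exactly as they arrive, without cleaning them.
raw_events = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/cart", "ms": "95"}',  # number sent as a string
    '{"user": "a", "page": "/home"}',  # latency field missing
]
raw_file = lake / "raw" / "events.jsonl"
raw_file.parent.mkdir(parents=True)
raw_file.write_text("\n".join(raw_events) + "\n")


def read_latency(path):
    """Apply structure at read time: keep only rows with a usable latency."""
    for line in path.read_text().splitlines():
        event = json.loads(line)
        if "ms" in event:
            yield event["page"], int(event["ms"])


latencies = list(read_latency(raw_file))
print(latencies)
```

Because the raw file is kept as it arrived, a different team could later read the same data with a different rule, for example counting the rows that have no latency at all.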

The goal of any analytics solution is to serve business users. Therefore, we must always work from the business requirements. (At Amazon this is one of the principles: working backwards.)

Having worked with both a data warehouse and a data lake, we can compare the two solutions:

[Image: comparison of the data warehouse and the data lake]

The main conclusion is that the data warehouse does not compete with the data lake but rather complements it. It is up to you to decide what suits your case. It is always interesting to try it yourself and draw your own conclusions.

I would also like to tell you about one of the cases where I started using the data lake approach. It is all rather trivial: I tried to use an ELT tool (we had Matillion ETL) together with Amazon Redshift. My solution worked, but it did not meet the requirements.

I needed to take web logs, transform them, and aggregate them to provide data for 2 cases:

  1. The marketing team wanted to analyze bot activity for SEO
  2. IT wanted to look at website performance metrics

The logs are very simple. Here is an example:

https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 
192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 
"GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 
arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067
"Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" "arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012"
1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-"

A single file weighed 1-4 megabytes.
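
The sample entry above can be split into named fields with a few lines of Python. This is only a sketch: the entry looks like an AWS Application Load Balancer access log, and the field names below are assumed from that layout, so they should be checked against the AWS documentation before being relied on. The entry is joined into one line here, since the line breaks in the sample are only for display.

```python
import shlex

# Field names in the order of the sample entry (assumed ALB access log layout).
FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time", "response_processing_time",
    "elb_status_code", "target_status_code", "received_bytes", "sent_bytes",
    "request", "user_agent", "ssl_cipher", "ssl_protocol", "target_group_arn",
    "trace_id", "domain_name", "chosen_cert_arn", "matched_rule_priority",
    "request_creation_time", "actions_executed", "redirect_url", "error_reason",
]

SAMPLE = (
    'https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 '
    '192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 '
    '"GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" '
    'ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 '
    'arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 '
    '"Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" '
    '"arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012" '
    '1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-"'
)


def parse_entry(line):
    """Split one log entry on spaces, keeping quoted values together."""
    values = shlex.split(line)
    return dict(zip(FIELDS, values))


entry = parse_entry(SAMPLE)
print(entry["user_agent"], entry["elb_status_code"], entry["target_processing_time"])
```

The two use cases map onto different fields: the user agent is what the marketing team would inspect for bot activity, while the processing times and status codes are what IT would aggregate for performance metrics.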

But there was one difficulty. We had 7 domains around the world, and about 7,000 files were created in a single day. That is not much volume, only 50 gigabytes. But our Redshift cluster was also small (4 nodes). Loading one file the traditional way took about a minute, so loading a day's files one after another would take about 7,000 minutes, far more than a day. In other words, the problem could not be solved head-on. That was when I decided to use the data lake approach. The solution looked something like this:

[Image: architecture of the data lake solution]

It is quite simple (I want to note that the advantage of working in the cloud is simplicity). I used:

  • AWS Elastic MapReduce (Hadoop) for compute power
  • AWS S3 as file storage, with the ability to encrypt data and restrict access
  • Spark as the in-memory compute engine, and PySpark for the logic and data transformations
  • Parquet as the output format of Spark
  • AWS Glue Crawler as a collector of metadata about new data and partitions
  • Redshift Spectrum as an SQL interface to the data lake for existing Redshift users

The smallest EMR+Spark cluster processed the whole stack of files in 30 minutes. There are other cases on AWS, many of them related to Alexa, where there is a lot of data.
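
The processing step in this stack can be pictured with the sketch below. It is not the PySpark job itself: plain Python and CSV are swapped in for Spark and Parquet so that the example runs anywhere, and the records, field names, and file names are made up. What it keeps is the idea of writing the output into one directory per partition.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# Already-parsed log records (made-up values, hypothetical field names).
records = [
    {"time": "2018-07-02T22:23:00Z", "domain": "www.example.com", "status": "200"},
    {"time": "2018-07-02T23:10:41Z", "domain": "www.example.com", "status": "404"},
    {"time": "2018-07-03T00:05:12Z", "domain": "www.example.org", "status": "200"},
]

# Group records by calendar date, the partition key in this sketch.
by_date = defaultdict(list)
for rec in records:
    by_date[rec["time"][:10]].append(rec)

# Write one file per partition using a key=value directory layout.
out_root = Path(tempfile.mkdtemp()) / "weblogs"
for date, rows in by_date.items():
    part_dir = out_root / f"date={date}"
    part_dir.mkdir(parents=True)
    with open(part_dir / "part-0000.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["time", "domain", "status"])
        writer.writeheader()
        writer.writerows(rows)

partitions = sorted(p.name for p in out_root.iterdir())
print(partitions)
```

A metadata crawler or query engine that understands `key=value` directory names can treat each directory as a partition, so a query for a single day only has to read that day's files.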

Just recently I learned that one of the drawbacks of a data lake is GDPR. The problem is that when a client asks to have their data deleted and that data sits in one of the files, we cannot use Data Manipulation Language and a DELETE operation as we would in a database.
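
To see why this is awkward, consider the sketch below: with plain files, removing one person's records means rewriting every file that contains them. The JSON lines format, the `user_id` field, and the `erase_user` helper are hypothetical and only illustrate the idea.

```python
import json
import os
import tempfile
from pathlib import Path


def erase_user(path, user_id):
    """Rewrite a JSON-lines file without the given user's rows.

    Returns the number of rows removed.
    """
    kept, removed = [], 0
    for line in path.read_text().splitlines():
        if json.loads(line).get("user_id") == user_id:
            removed += 1
        else:
            kept.append(line)
    # Write the surviving rows to a temporary file, then swap it into place.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text("".join(row + "\n" for row in kept))
    os.replace(tmp, path)
    return removed


data_file = Path(tempfile.mkdtemp()) / "events.jsonl"
data_file.write_text(
    '{"user_id": "u1", "page": "/home"}\n'
    '{"user_id": "u2", "page": "/cart"}\n'
    '{"user_id": "u1", "page": "/pay"}\n'
)

removed = erase_user(data_file, "u1")
print(removed, data_file.read_text())
```

In a database this would be a single DELETE statement; in a lake the same request turns into finding and rewriting every affected file, which is why a good catalog matters here too.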

I hope this article has clarified the difference between a data warehouse and a data lake. If you are interested, I can translate more of my articles or articles by professionals that I have read, and also tell you about the solutions I work with and their architectures.

Source: www.habr.com
