How Google's BigQuery democratized data analysis. Part 2

Hi Habr! Enrollment is now open for a new run of the Data Engineer course at OTUS. Ahead of the course start, we continue to share useful material with you.

Read part one


Data governance

Strong data governance is fundamental to Twitter Engineering. As we implemented BigQuery on our platform, we focused on data discovery, access control, security, and privacy.

To discover and manage data, we extended our Data Access Layer (DAL) to provide tooling for both on-premises and Google Cloud data, giving our users a single interface and API. As Google's Data Catalog moves toward general availability, we will incorporate it into our projects to give users features such as column search.

BigQuery makes it easy to share and access data, but we needed some control over this to prevent data exfiltration. Among other tools, we chose two features:

  • Domain restricted sharing: a beta feature that prevents users from sharing BigQuery datasets with users outside Twitter.
  • VPC service controls: a control that prevents data exfiltration and requires users to access BigQuery from known IP address ranges.
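
The intent of these two controls can be sketched in a few lines of Python. This is only an illustration of the rules being enforced, not how Google Cloud implements or configures them: the domain `example.com`, the IP ranges (taken from the reserved documentation blocks), and both function names are assumptions made up for the example.

```python
import ipaddress

# Hypothetical values: a placeholder company domain and documentation-only IP ranges.
ALLOWED_DOMAIN = "example.com"
KNOWN_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def may_share_with(email: str) -> bool:
    """Domain-restricted sharing: only addresses in the company domain qualify."""
    local, sep, domain = email.lower().rpartition("@")
    return bool(local) and bool(sep) and domain == ALLOWED_DOMAIN


def is_known_ip(address: str) -> bool:
    """Perimeter rule: the caller's IP must fall inside one of the known ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in KNOWN_RANGES)
```

In a real deployment both rules would be set as organization-level policy rather than checked in application code; the sketch only makes the two conditions explicit.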

We implemented the authentication, authorization, and auditing (AAA) requirements for security as follows:

  • Authentication: we used GCP user accounts for ad hoc queries and service accounts for production queries.
  • Authorization: we required each dataset to have an owner service account and a reader group.
  • Auditing: we exported the BigQuery Stackdriver logs, which contain detailed query execution information, into a BigQuery dataset for easy analysis.
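
The authorization rule above (one owner service account and one reader group per dataset) lends itself to an automated check. The sketch below is a minimal one, assuming access entries are plain dictionaries shaped like the `role` / `userByEmail` / `groupByEmail` fields of a BigQuery dataset's access list; the function name, the sample e-mail addresses, and the suffix test for service accounts are assumptions for illustration, not Twitter's actual tooling.

```python
from typing import Dict, List

AccessEntry = Dict[str, str]


def check_dataset_access(entries: List[AccessEntry]) -> List[str]:
    """Return a list of policy violations; an empty list means compliant."""
    problems = []
    owners = [e for e in entries if e.get("role") == "OWNER"]
    readers = [e for e in entries if e.get("role") == "READER"]

    # Assumption: service account e-mails end in ".gserviceaccount.com".
    if not any(
        e.get("userByEmail", "").endswith(".gserviceaccount.com") for e in owners
    ):
        problems.append("no owner service account")

    if not any("groupByEmail" in e for e in readers):
        problems.append("no reader group")

    return problems


# Hypothetical access lists for two datasets.
compliant = [
    {"role": "OWNER", "userByEmail": "etl@my-project.iam.gserviceaccount.com"},
    {"role": "READER", "groupByEmail": "dataset-readers@example.com"},
]
non_compliant = [
    {"role": "OWNER", "userByEmail": "alice@example.com"},
]
```

Calling `check_dataset_access(non_compliant)` reports both missing pieces, which makes the function usable as a periodic audit over every registered dataset.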

To ensure that the personal data of Twitter users is handled properly, we must register all BigQuery datasets, annotate personal data, maintain proper retention, and delete (scrub) data that users have deleted.

We looked at the Google Cloud Data Loss Prevention API, which uses machine learning to classify and redact sensitive data, but decided in favor of manual dataset annotation for the sake of accuracy. We plan to use the Data Loss Prevention API to augment the custom annotation.

At Twitter, we created four privacy categories for datasets in BigQuery, listed here in descending order of sensitivity:

  • Highly sensitive datasets are made available on an as-needed basis, following the principle of least privilege. Each dataset has a separate reader group, and we track usage by individual accounts.
  • Medium sensitivity datasets (one-way pseudonyms produced with salted hashing) contain no Personally Identifiable Information (PII) and are accessible to a larger group of employees. This is a good balance between privacy concerns and data utility. It lets employees perform analysis tasks, such as counting the users who used a feature, without knowing who the real users are.
  • Low sensitivity datasets have all user-identifying information removed. This is a good approach from a privacy perspective, but such data cannot be used for user-level analysis.
  • Public datasets (released outside Twitter) are available to all Twitter employees.
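
The medium sensitivity tier can be illustrated with a small sketch of one-way pseudonyms. The source only says "salted hashing"; the example below uses HMAC-SHA256 with a secret key standing in for the salt, which is one common way to do it and not necessarily what Twitter uses. The salt value, user names, feature names, and function names are all made up for the example.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would live in a secret store, not in code.
SALT = b"hypothetical-secret-salt"


def pseudonymize(user_id: str, salt: bytes = SALT) -> str:
    """Map a user id to a stable one-way pseudonym (keyed SHA-256 digest)."""
    return hmac.new(salt, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Raw events: (user id, feature used). Made-up sample data.
events = [
    ("alice", "bookmarks"),
    ("bob", "bookmarks"),
    ("alice", "bookmarks"),
    ("carol", "search"),
]

# The analyst-facing dataset contains pseudonyms only.
pseudonymous_events = [(pseudonymize(user), feature) for user, feature in events]


def count_feature_users(rows, feature: str) -> int:
    """Count distinct pseudonyms that used a feature."""
    return len({pid for pid, used in rows if used == feature})


bookmark_users = count_feature_users(pseudonymous_events, "bookmarks")
```

Because the same user always maps to the same pseudonym, distinct counts stay correct (two bookmark users here) while the analyst never sees the original identifiers.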

For registration, we used scheduled jobs to enumerate BigQuery datasets and register them with the Data Access Layer (DAL), Twitter's metadata store. Users annotate datasets with privacy information and specify a retention period. For scrubbing, we are evaluating the performance and cost of two options: (1) scrubbing datasets in GCS with tools such as Scalding and then loading them into BigQuery; (2) using BigQuery DML statements. We will probably use a combination of both methods to meet the requirements of different teams and data.
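
The second scrubbing option, a DML statement, follows a simple pattern: delete every row whose user appears in a list of deleted users. The sketch below runs that pattern against an in-memory SQLite database purely so that it is self-contained; SQLite stands in for BigQuery here, and the table names, column names, and sample rows are hypothetical. A `DELETE ... WHERE ... IN (SELECT ...)` statement of this shape should carry over to BigQuery standard SQL, but that is an assumption to verify against the real schema.

```python
import sqlite3

# In-memory stand-in for the warehouse; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (user_id TEXT, action TEXT);
    CREATE TABLE deleted_users (user_id TEXT);
    INSERT INTO events VALUES
        ('u1', 'like'), ('u2', 'post'), ('u1', 'post'), ('u3', 'like');
    INSERT INTO deleted_users VALUES ('u1');
    """
)

# The scrub itself: remove all rows that belong to deleted users.
SCRUB_SQL = """
DELETE FROM events
WHERE user_id IN (SELECT user_id FROM deleted_users)
"""

cursor = conn.execute(SCRUB_SQL)
conn.commit()

removed = cursor.rowcount  # number of rows the DELETE affected
remaining = conn.execute(
    "SELECT user_id, action FROM events ORDER BY user_id"
).fetchall()
```

The first option (rewriting files in GCS and reloading) trades this single statement for a batch job, which is why the two are compared on performance and cost.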

Operability

Because BigQuery is a managed service, there was no need to involve Twitter's SRE team in systems management or on-call duties. It was easy to provision more capacity for both storage and compute. We could change slot reservations by filing a ticket with Google support. We identified areas that could be improved, such as self-service slot allocation and better monitoring dashboards, and submitted those requests to Google.

Cost

Our preliminary analysis showed that query costs for BigQuery and Presto were at the same level. We purchased slots at a flat-rate price to get a stable monthly cost instead of on-demand pricing per TB of data processed. This decision was also based on feedback from users who did not want to think about cost before running each query.
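
The trade-off between a fixed monthly fee and per-TB billing comes down to a break-even volume. The sketch below shows the arithmetic only; the fee and the per-TB price are hypothetical placeholders, not figures from the article or from Google's price list, and the function names are made up for the example.

```python
def on_demand_cost(tb_processed: float, price_per_tb: float) -> float:
    """Monthly cost when paying for each TB of data processed."""
    return tb_processed * price_per_tb


def break_even_tb(flat_monthly_fee: float, price_per_tb: float) -> float:
    """Monthly volume at which both pricing models cost the same."""
    return flat_monthly_fee / price_per_tb


def cheaper_plan(tb_processed: float, flat_monthly_fee: float, price_per_tb: float) -> str:
    """Pick the cheaper model for a given monthly volume."""
    if on_demand_cost(tb_processed, price_per_tb) > flat_monthly_fee:
        return "flat-rate"
    return "on-demand"


# Hypothetical placeholder prices, for illustration only.
FLAT_FEE = 10_000.0
PRICE_PER_TB = 5.0

threshold = break_even_tb(FLAT_FEE, PRICE_PER_TB)
```

With these placeholder numbers the threshold is 2,000 TB per month: above it the flat fee wins on cost, and below it the flat fee still buys predictability, which is the benefit users asked for.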

Storing data in BigQuery added costs on top of the GCS costs. Tools such as Scalding require datasets in GCS, and to access them from BigQuery we had to load the same datasets into BigQuery's Capacitor format. We are working on a Scalding connector to BigQuery datasets that will remove the need to store datasets in both GCS and BigQuery.

For rare cases that required infrequent queries over tens of petabytes, we decided that storing the datasets in BigQuery was not cost-effective and used Presto to access the datasets in GCS directly. For this use case, we are looking into BigQuery External Data Sources.

Next steps

We have seen a lot of interest in BigQuery since the alpha release. We are adding more datasets and more teams to BigQuery. We are developing connectors so that data analytics tools such as Scalding can read from and write to BigQuery storage. We are looking at tools such as Looker and Apache Zeppelin for creating enterprise-quality reports and notebooks on top of BigQuery datasets.

Our collaboration with Google has been very productive, and we are pleased to continue and develop this partnership. We worked with Google to set up our own Partner Issue Tracker for sending requests directly to Google. Some of them, such as the BigQuery Parquet loader, have already been implemented by Google.

Here are some of our high-priority feature requests for Google:

  • Tools for convenient data ingestion and support for the LZO-Thrift format.
  • Hourly partitioning.
  • Access control improvements such as table-, row-, and column-level permissions.
  • BigQuery External Data Sources with Hive Metastore integration and support for the LZO-Thrift format.
  • Improved Data Catalog integration in the BigQuery user interface.
  • Self-service slot allocation and monitoring.

Conclusion

Democratizing data analysis, visualization, and machine learning in a secure way is a top priority for the Data Platform team. We identified Google BigQuery and Data Studio as tools that could help achieve this goal, and released BigQuery Alpha company-wide last year.

We found queries in BigQuery to be simple and efficient. We used Google tools to ingest and transform data for simple pipelines, but for complex pipelines we had to build our own Airflow framework. In the data governance space, BigQuery's services for authentication, authorization, and auditing meet our needs. To manage metadata and maintain privacy, we needed more flexibility and had to build our own systems. BigQuery, being a managed service, was easy to operate. Query costs were similar to those of our existing tools. Storing data in BigQuery incurs costs in addition to the GCS costs.

Overall, BigQuery works well for general SQL analysis. We are seeing a lot of interest in BigQuery, and we are working to migrate more datasets, bring in more teams, and build more pipelines with BigQuery. Twitter uses a variety of data that will require a combination of tools such as Scalding, Spark, Presto, and Druid. We intend to keep strengthening our data analytics tools and to give our users clear guidance on how to make the best use of our offerings.

Acknowledgments

I would like to thank my co-authors and teammates, Anju Jha and Will Pascucci, for their great collaboration and hard work on this project. I would also like to thank the engineers and managers from several teams at Twitter and Google who helped us, and the BigQuery users at Twitter who gave us valuable feedback.

If you are interested in working on these problems, check out our openings on the Data Platform team.


Source: www.habr.com
