Hello, Habr! Enrollment for a new stream of the course is open now at OTUS
Data governance
Strong data governance is a core tenet of Twitter Engineering. As we adopt BigQuery into our platform, we focus on data discovery, access control, security, and privacy.
To discover and manage data, we extended our Data Access Layer
BigQuery makes it easy to share and access data, but we needed some control over this to prevent data exfiltration. Among other tools, we chose two features:
- Domain restricted sharing: a Beta feature that prevents users from sharing BigQuery datasets with users outside Twitter.
- VPC service controls: a control that prevents data exfiltration and requires users to access BigQuery from known IP address ranges.
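The policy logic behind these two controls can be illustrated with a small sketch. This is not how GCP enforces either feature; it only shows the two checks in plain Python. The domain `example.com` and the IP ranges (reserved documentation ranges) are placeholder assumptions, and both function names are hypothetical.

```python
import ipaddress

# Hypothetical values for illustration only: a placeholder domain and
# documentation-only IP ranges, not real corporate settings.
ALLOWED_DOMAIN = "example.com"
KNOWN_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def may_share_with(member_email: str, allowed_domain: str = ALLOWED_DOMAIN) -> bool:
    """Domain-restricted sharing: allow only members of the allowed domain."""
    local, separator, domain = member_email.rpartition("@")
    return bool(local) and bool(separator) and domain.lower() == allowed_domain.lower()


def is_from_known_range(client_ip: str, ranges=KNOWN_RANGES) -> bool:
    """IP restriction: allow only requests that originate from a known network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ranges)
```

A request would need to pass both checks: the sharing target must belong to the company domain, and the caller must connect from a known network.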
We implemented the authentication, authorization, and auditing (AAA) requirements for security as follows:
- Authentication: We used GCP user accounts for ad hoc requests and service accounts for production requests.
- Authorization: We required each dataset to have an owner service account and a reader group.
- Auditing: We exported the BigQuery Stackdriver logs, which contain detailed query execution information, into a BigQuery dataset for easy analysis.
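The authorization rule above (every dataset has an owner service account and a reader group) lends itself to an automated check. The following is a minimal sketch under assumed names: `AccessEntry`, its fields, and the role and kind strings are hypothetical and do not mirror any specific BigQuery API object.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AccessEntry:
    """One access grant on a dataset (hypothetical shape for this sketch)."""
    role: str    # "OWNER" or "READER"
    kind: str    # "serviceAccount", "group", or "user"
    member: str  # identifier of the account or group


def policy_violations(entries: Iterable[AccessEntry]) -> List[str]:
    """Return the list of rule violations for one dataset's access entries."""
    entries = list(entries)
    problems = []
    # Rule 1: at least one owner must be a service account.
    if not any(e.role == "OWNER" and e.kind == "serviceAccount" for e in entries):
        problems.append("no owner service account")
    # Rule 2: at least one reader must be a group.
    if not any(e.role == "READER" and e.kind == "group" for e in entries):
        problems.append("no reader group")
    return problems
```

An empty result means the dataset satisfies both rules; anything else could be reported to the dataset owner.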
To ensure that the personal data of Twitter users is handled properly, we must register all BigQuery datasets, annotate personal data, maintain proper storage, and delete (scrub) data that users have deleted.
We evaluated Google
At Twitter, we created four privacy categories for datasets in BigQuery, listed here in descending order of sensitivity:
- Highly sensitive datasets are made available on an as-needed basis, following the principle of least privilege. Each dataset has a separate reader group, and we track usage by individual accounts.
- Medium sensitivity datasets (one-way pseudonyms created with salted hashing) do not contain Personally Identifiable Information (PII) and are accessible to a larger group of employees. This is a good balance between privacy concerns and data utility. It lets employees perform analysis tasks, such as counting the number of users who used a feature, without knowing who the actual users are.
- Low sensitivity datasets have all user-identifying information removed. This is a good approach from a privacy perspective, but such datasets cannot be used for user-level analysis.
- Public datasets (released outside Twitter) are available to all Twitter employees.
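To make the medium sensitivity category concrete, here is a minimal sketch of one-way pseudonymization with a salted hash. The article does not say which construction Twitter used; this sketch assumes HMAC-SHA256 keyed with a secret salt, and the salt value, user names, and feature names are made up for illustration.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, salt: bytes) -> str:
    """Map a user ID to a stable one-way pseudonym using a secret salt."""
    return hmac.new(salt, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def count_feature_users(events, feature: str) -> int:
    """Count distinct pseudonyms that used the given feature."""
    return len({pseudonym for pseudonym, used in events if used == feature})


# Hypothetical data: (user, feature) pairs before pseudonymization.
SALT = b"hypothetical-secret-salt"
raw_events = [
    ("alice", "bookmarks"),
    ("bob", "bookmarks"),
    ("alice", "bookmarks"),
    ("carol", "lists"),
]
# Analysts would only ever see this pseudonymized form.
pseudonymous_events = [(pseudonymize(user, SALT), feature) for user, feature in raw_events]
bookmark_users = count_feature_users(pseudonymous_events, "bookmarks")
```

Because the same user always maps to the same pseudonym, distinct-user counts still work, while the original identifiers cannot be read back from the dataset without the salt.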
As for registration, we used scheduled jobs to enumerate BigQuery datasets and register them with the Data Access Layer (
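A scheduled registration job of this kind could look roughly like the sketch below. Everything here is an assumption for illustration: `MetadataStore` is an in-memory stand-in rather than the real Data Access Layer, and the dataset lister is passed in as a function so the sketch runs without cloud credentials (in practice it might wrap a call such as the BigQuery Python client's `list_datasets()`).

```python
from typing import Callable, Dict, Iterable, List, Optional


class MetadataStore:
    """Hypothetical in-memory stand-in for a metadata store."""

    def __init__(self) -> None:
        # Maps dataset ID to its privacy annotation (None until a user annotates it).
        self.records: Dict[str, Optional[str]] = {}

    def register(self, dataset_id: str) -> bool:
        """Register a dataset; return False if it was already known."""
        if dataset_id in self.records:
            return False
        self.records[dataset_id] = None
        return True


def run_registration(
    list_dataset_ids: Callable[[], Iterable[str]], store: MetadataStore
) -> List[str]:
    """Enumerate datasets and return the IDs that were newly registered."""
    return [dataset_id for dataset_id in list_dataset_ids() if store.register(dataset_id)]
```

Running the job repeatedly is safe: datasets that are already registered are skipped, so each run only reports what is new and still needs a privacy annotation.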
Systems operations
Because BigQuery is a managed service, there was no need to involve Twitter's SRE team in systems management or on-call duties. It was easy to provision more capacity for both storage and compute. We could change slot reservations by creating a ticket with Google support. We identified areas that could be improved, such as self-service slot allocation and better monitoring dashboards, and submitted those requests to Google.
Cost
Our preliminary analysis showed that query costs for BigQuery and Presto were at the same level. We purchased slots for
Storing data in BigQuery incurred costs in addition to GCS costs. Tools like Scalding require datasets in GCS, and to access them from BigQuery we had to load the same datasets into BigQuery format.
In rare cases that required infrequent queries over tens of petabytes, we decided that storing the datasets in BigQuery was not cost-effective and used Presto to access the datasets directly in GCS. To do this, we are looking at BigQuery External Data Sources.
Next steps
We have seen a lot of interest in BigQuery since the alpha release. We are adding more datasets and more teams to BigQuery. We are building connectors so that data analytics tools like Scalding can read from and write to BigQuery storage. We are looking at tools like Looker and Apache Zeppelin for creating enterprise-quality reports and notebooks using BigQuery datasets.
Our collaboration with Google has been very productive, and we are pleased to continue developing this partnership. We worked with Google to set up our own
Here are some of our highest-priority feature requests for Google:
- Tools for easier data ingestion and support for the LZO-Thrift format.
- Hourly partitioning.
- Access control improvements such as table-, row-, and column-level permissions.
- BigQuery External Data Sources with Hive Metastore integration and support for the LZO-Thrift format.
- Improved data catalog integration in the BigQuery user interface.
- Self-service for slot allocation and monitoring.
Conclusion
Democratizing data analysis, visualization, and machine learning in a secure way is a top priority for the Data Platform team. We identified Google BigQuery and Data Studio as tools that could help achieve this goal, and released BigQuery Alpha company-wide last year.
We found queries in BigQuery to be simple and efficient. We used Google tools to ingest and transform data for simple pipelines, but for complex pipelines we had to build our own Airflow framework. In the data management space, BigQuery's services for authentication, authorization, and auditing meet our needs. To manage metadata and maintain privacy, we needed more flexibility and had to build our own systems. BigQuery, being a managed service, was easy to operate. Query costs were similar to those of existing tools. Storing data in BigQuery incurs costs in addition to GCS costs.
Overall, BigQuery works well for general SQL analysis. We are seeing a lot of interest in BigQuery, and we are working to migrate more datasets, bring on more teams, and build more pipelines with BigQuery. Twitter uses a variety of data that will require a combination of tools such as Scalding, Spark, Presto, and Druid. We intend to keep strengthening our data analytics tools and to give our users clear guidance on how best to use our offerings.
Acknowledgments
I would like to thank my co-authors and teammates, Anju Jha and Will Pascucci, for their great collaboration and hard work on this project. I would also like to thank the engineers and managers from several teams at Twitter and Google who helped us, and the BigQuery users at Twitter who provided valuable feedback.
If you are interested in working on these problems, check out our
Source: www.habr.com