How Google's BigQuery democratized data analysis. Part 2

Hello, Habr! Enrollment for a new cohort of the OTUS Data Engineer course is open right now. In anticipation of the start of the course, we continue to share useful material with you.

Read the first part

Data governance

Strong data governance is a core tenet of Twitter Engineering. As we roll out BigQuery on our platform, we are focusing on data discovery, access control, security, and privacy.

To discover and manage data, we extended our Data Access Layer (DAL) to provide tooling for both on-premises and Google Cloud data, giving our users a single interface and API. As Google's Data Catalog moves toward general availability, we will include it in our projects to give users features such as column search.

BigQuery makes it easy to share and access data, but we needed some control over this to prevent data exfiltration. Among other tools, we chose two features:

  • Domain restricted sharing: a beta feature that prevents users from sharing BigQuery datasets with users outside Twitter.
  • VPC Service Controls: a control that prevents data exfiltration and requires users to access BigQuery from known IP address ranges.
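
The intent of the two controls above can be illustrated with a minimal sketch. It is not how Google implements either feature; it only models the two rules in plain Python. The domain `example.com`, the network ranges, and all function names are hypothetical.

```python
import ipaddress

# Hypothetical values for illustration only; not a real configuration.
ALLOWED_DOMAIN = "example.com"
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.0.2.0/24"),
]


def is_internal_member(email: str) -> bool:
    """Domain-restricted sharing rule: the member must belong to the company domain."""
    local, sep, domain = email.rpartition("@")
    return bool(local) and bool(sep) and domain.lower() == ALLOWED_DOMAIN


def is_known_address(client_ip: str) -> bool:
    """Perimeter rule: the request must come from a known IP address range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


def may_access(email: str, client_ip: str) -> bool:
    """Both rules must hold before access is allowed."""
    return is_internal_member(email) and is_known_address(client_ip)
```

For example, `may_access("alice@example.com", "10.1.2.3")` passes both checks, while the same user connecting from an address outside the listed ranges is rejected.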

We implemented the authentication, authorization, and auditing (AAA) requirements for security as follows:

  • Authentication: we used GCP user accounts for ad hoc queries and service accounts for production queries.
  • Authorization: we required every dataset to have an owner service account and a reader group.
  • Auditing: we exported the BigQuery Stackdriver logs, which contain detailed query execution information, to a BigQuery dataset for easy analysis.
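
The authorization rule above (every dataset has an owner service account and a reader group) is easy to check mechanically. The sketch below is a simplified model, assuming access entries are available as plain records; the `AccessEntry` type and its field values are hypothetical and are not the BigQuery API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessEntry:
    """Hypothetical, simplified view of one dataset access grant."""
    role: str         # "OWNER" or "READER"
    entity_type: str  # "serviceAccount", "group", or "user"
    entity_id: str


def authorization_problems(entries: List[AccessEntry]) -> List[str]:
    """Return the list of violated rules; an empty list means the dataset complies."""
    problems = []
    owners = [e for e in entries if e.role == "OWNER"]
    readers = [e for e in entries if e.role == "READER"]
    if not any(e.entity_type == "serviceAccount" for e in owners):
        problems.append("no owner service account")
    if not any(e.entity_type == "group" for e in readers):
        problems.append("no reader group")
    return problems
```

A scheduled job could run such a check over every dataset and report the ones that return a non-empty list.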

To ensure that the personal data of Twitter users is handled properly, we must register all BigQuery datasets, annotate personal data, maintain proper retention, and delete (scrub) data that users have deleted.

We looked at the Google Cloud Data Loss Prevention API, which uses machine learning to classify and redact sensitive data, but decided to annotate datasets ourselves for the sake of accuracy. We plan to use the Data Loss Prevention API to augment our custom annotation.

At Twitter, we created four privacy categories for datasets in BigQuery, listed here in descending order of sensitivity:

  • Highly sensitive datasets are made available on an as-needed basis, following the principle of least privilege. Each dataset has a separate reader group, and we will track usage by individual accounts.
  • Medium sensitivity datasets (one-way pseudonyms produced with salted hashing) contain no Personally Identifiable Information (PII) and are accessible to a larger group of employees. This is a good balance between privacy concerns and data utility. It lets employees perform analysis tasks, such as counting the number of users who used a feature, without knowing who the real users are.
  • Low sensitivity datasets have all user-identifying information removed. This is a good approach from a privacy perspective, but it cannot be used for user-level analysis.
  • Public datasets (released outside Twitter) are available to all Twitter employees.
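
The medium sensitivity category relies on one-way pseudonyms. The article does not say which construction is used, so the sketch below swaps in a common one as an assumption: HMAC-SHA-256 with a secret salt as the key. The salt value, the event data, and the function names are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, salt: bytes) -> str:
    """Return a one-way pseudonym for user_id (HMAC-SHA-256, hex encoded)."""
    return hmac.new(salt, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def count_distinct_users(events, salt: bytes) -> int:
    """Count distinct users in (user_id, feature) events using pseudonyms only."""
    return len({pseudonymize(user_id, salt) for user_id, _feature in events})


# Hypothetical salt and events, for illustration only.
SALT = b"hypothetical-secret-salt"
EVENTS = [("user-1", "search"), ("user-2", "search"), ("user-1", "search")]
```

Because the same user always maps to the same pseudonym under a given salt, an analyst can still count how many distinct users used a feature without seeing any real identifier.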

For registration, we used scheduled jobs to enumerate BigQuery datasets and register them with the Data Access Layer (DAL), Twitter's metadata store. Users will annotate datasets with privacy information and specify a retention period. For scrubbing, we are evaluating the performance and cost of two options: 1. scrubbing datasets in GCS with tools such as Scalding and loading them into BigQuery; 2. using BigQuery DML statements. We will most likely use a combination of both methods to meet the requirements of different teams and data.
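
As a rough illustration of the second scrubbing option and of retention tracking, the sketch below builds a parameterized BigQuery DML `DELETE` statement and checks whether a dataset has outlived its retention period. It only produces the statement text and does not call any client library. The table name, the `user_id` column, and the `@deleted_user_ids` parameter name are hypothetical.

```python
import re
from datetime import date, timedelta

# Accepts "project.dataset.table" made of simple name characters only.
_TABLE_RE = re.compile(r"^[A-Za-z0-9_\-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


def build_scrub_statement(table: str, user_column: str = "user_id") -> str:
    """Build a DELETE statement that removes rows of deleted users.

    The list of user IDs is expected as an array query parameter
    named @deleted_user_ids (a hypothetical name).
    """
    if not _TABLE_RE.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not user_column.isidentifier():
        raise ValueError(f"unexpected column name: {user_column!r}")
    return f"DELETE FROM `{table}` WHERE {user_column} IN UNNEST(@deleted_user_ids)"


def is_past_retention(created: date, retention_days: int, today: date) -> bool:
    """True when the data is older than its annotated retention period."""
    return today > created + timedelta(days=retention_days)
```

Validating the table and column names before formatting keeps the generated statement safe to build from metadata, while the user IDs themselves travel as a query parameter rather than as inline text.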

System operations

Because BigQuery is a managed service, there was no need to involve Twitter's SRE team in systems management or on-call duties. It was easy to provision additional capacity for both storage and compute. We could change slot reservations by creating a ticket with Google support. We identified areas that could be improved, such as self-service slot allocation and better monitoring dashboards, and passed those requests on to Google.

Cost

Our preliminary analysis showed that query costs for BigQuery and Presto were at the same level. We purchased slots at a flat-rate price to get a stable monthly cost instead of paying on demand per TB of data processed. This decision was also based on feedback from users who did not want to think about cost before running each query.
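
The trade-off between flat-rate and on-demand pricing comes down to simple arithmetic. The sketch below shows the comparison; all prices and volumes used with it are hypothetical placeholders, not figures from the article or from Google's price list.

```python
def on_demand_cost(tb_processed: float, price_per_tb: float) -> float:
    """Monthly cost when paying per TB of data processed."""
    return tb_processed * price_per_tb


def break_even_tb(flat_rate_monthly: float, price_per_tb: float) -> float:
    """Monthly volume (in TB) above which flat-rate becomes cheaper."""
    if price_per_tb <= 0:
        raise ValueError("price_per_tb must be positive")
    return flat_rate_monthly / price_per_tb


def cheaper_plan(tb_processed: float, flat_rate_monthly: float, price_per_tb: float) -> str:
    """Name the cheaper plan for a given monthly volume."""
    if on_demand_cost(tb_processed, price_per_tb) > flat_rate_monthly:
        return "flat-rate"
    return "on-demand"
```

With a hypothetical flat rate of 10,000 per month and an on-demand price of 5 per TB, the break-even point is 2,000 TB per month; above that volume flat-rate wins, and users no longer need to estimate the cost of each query.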

Storing data in BigQuery incurred costs on top of the GCS costs. Tools such as Scalding require datasets in GCS, and to access them from BigQuery we had to load the same datasets into BigQuery's Capacitor format. We are working on a Scalding connector to BigQuery datasets that will remove the need to store datasets in both GCS and BigQuery.

For rare cases that required infrequent queries over tens of petabytes, we decided that storing the datasets in BigQuery was not cost-effective and used Presto to access the datasets directly in GCS. For this, we are looking at BigQuery External Data Sources.

Next steps

We have seen a lot of interest in BigQuery since the alpha release. We are adding more datasets and more teams to BigQuery. We are building connectors for data analysis tools such as Scalding to read from and write to BigQuery storage. We are looking at tools such as Looker and Apache Zeppelin for creating enterprise-quality reports and notebooks using BigQuery datasets.

Our collaboration with Google has been very productive, and we are happy to continue developing this partnership. We worked with Google to set up our own Partner Issue Tracker for sending requests directly to Google. Some of them, such as the BigQuery Parquet loader, have already been implemented by Google.

Here are some of our high-priority feature requests for Google:

  • Tools for convenient data ingestion and support for the LZO-Thrift format.
  • Hourly partitioning.
  • Access control improvements such as table-, row-, and column-level permissions.
  • BigQuery External Data Sources with Hive Metastore integration and support for the LZO-Thrift format.
  • Improved Data Catalog integration in the BigQuery user interface.
  • Self-service for slot allocation and monitoring.

Conclusion

Democratizing data analysis, visualization, and machine learning in a secure way is a top priority for the Data Platform team. We identified Google BigQuery and Data Studio as tools that can help achieve this goal, and released BigQuery Alpha to the whole company last year.

We found queries in BigQuery to be simple and efficient. We used Google tools to ingest and transform data for simple pipelines, but for complex pipelines we had to build our own Airflow framework. In the data management space, BigQuery's services for authentication, authorization, and auditing meet our needs. To manage metadata and maintain privacy, we needed more flexibility and had to build our own systems. BigQuery, as a managed service, was easy to use. Query costs were similar to those of existing tools. Storing data in BigQuery incurs costs on top of the GCS costs.

Overall, BigQuery works well for general SQL analysis. We are seeing a lot of interest in BigQuery, and we are working to migrate more datasets, bring on more teams, and build more pipelines with BigQuery. Twitter uses a variety of data that will require a combination of tools such as Scalding, Spark, Presto, and Druid. We intend to keep strengthening our data analysis tools and to give our users clear guidance on how best to use our offerings.

Acknowledgments

I would like to thank my co-authors and teammates, Anju Jha and Will Pascucci, for their great collaboration and hard work on this project. I would also like to thank the engineers and managers from several teams at Twitter and Google who helped us, and the BigQuery users at Twitter who provided valuable feedback.

If you are interested in working on these problems, check out our open positions on the Data Platform team.

Source: www.habr.com
