How Google's BigQuery democratized data analysis. Part 1

Hello, Habr! Enrollment for a new stream of the Data Engineer course at OTUS is open right now. Ahead of the start of the course, we have, as is our tradition, prepared a translation of some interesting material for you.

Every day, more than a hundred million people visit Twitter to find out what is happening in the world and to talk about it. Every tweet and every other user action generates an event that is available for Twitter's internal data analysis. Hundreds of employees analyze and visualize this data, and improving their experience is a top priority for the Twitter Data Platform team.

We believe that users with a wide range of technical skills should be able to discover data and have access to well-performing SQL-based analysis and visualization tools. This would allow a whole new group of less technical users, including data analysts and product managers, to extract insights from data, helping them better understand and use Twitter's capabilities. This is how we democratize data analysis at Twitter.

As our tools and capabilities for internal data analysis have improved, we have seen Twitter improve. However, there is still room for improvement. Current tools such as Scalding require programming experience. SQL-based analysis tools such as Presto and Vertica have performance problems at scale. We also have the problem of data being spread across multiple systems without uniform access to it.

Last year we announced a new collaboration with Google, under which we are moving parts of our data infrastructure to Google Cloud Platform (GCP). We concluded that Google Cloud's Big Data tools can help us with our efforts to democratize analysis, visualization, and machine learning at Twitter:

  • BigQuery: an enterprise data warehouse with a SQL engine based on Dremel, known for its speed, simplicity, and support for machine learning.
  • Data Studio: a big data visualization tool with Google Docs-like collaboration features.

In this article you will read about our experience with these tools: what we have done, what we have learned, and what we will do next. For now we will focus on batch and interactive analytics. We will discuss real-time analytics in the next article.

History of Data Warehousing at Twitter

Before diving into BigQuery, it is worth briefly recounting the history of data warehousing at Twitter. In 2011, data analysis at Twitter was done in Vertica and Hadoop. We used Pig to create MapReduce jobs for Hadoop. In 2012, we replaced Pig with Scalding, which had a Scala API with benefits such as the ability to build complex pipelines and ease of testing. However, for the many data analysts and product managers who were more comfortable working with SQL, it was a steep learning curve. Around 2016, we started using Presto as a SQL interface to Hadoop data. Spark offered a Python interface, which made it a good choice for ad hoc data science and machine learning.

Since 2018, we have used the following tools for data analysis and visualization:

  • Scalding for production pipelines
  • Scalding and Spark for ad hoc data analysis and machine learning
  • Vertica and Presto for ad hoc and interactive SQL analysis
  • Druid for low-latency interactive, exploratory access to time series metrics
  • Tableau, Zeppelin, and Pivot for data visualization

We found that although these tools offer very powerful capabilities, we had difficulty making those capabilities available to a wider audience at Twitter. By expanding our platform with Google Cloud, we are focusing on simplifying our analytics tools for all of Twitter.

Google's BigQuery Data Warehouse

Several teams at Twitter had already incorporated BigQuery into some of their production pipelines. Drawing on their expertise, we began evaluating BigQuery's capabilities for all Twitter use cases. Our goal was to offer BigQuery to the entire company and to standardize and support it within the Data Platform toolset. This was difficult for many reasons. We needed to develop infrastructure to reliably ingest large volumes of data, support company-wide data management, ensure proper access controls, and ensure customer privacy. We also had to create systems for resource allocation, monitoring, and chargeback so that teams could use BigQuery effectively.

In November 2018, we shipped a company-wide alpha release of BigQuery and Data Studio. We offered Twitter employees some of our most frequently used tables with personal data cleansed. BigQuery has been used by more than 250 users from a variety of teams, including engineering, finance, and marketing. Most recently, they were running about 8k queries and processing about 100 PB per month, not counting scheduled queries. After receiving very positive feedback, we decided to move forward and offer BigQuery as the primary resource for interacting with data at Twitter.

Here is a high-level architecture diagram of our Google BigQuery data warehouse.

[Figure: high-level architecture of the Google BigQuery data warehouse]
We copy data from on-premises Hadoop clusters to Google Cloud Storage (GCS) using the internal Cloud Replicator tool. We then use Apache Airflow to build pipelines that use "bq_load" to load data from GCS into BigQuery. We use Presto to query Parquet or Thrift-LZO datasets in GCS. BQ Blaster is an internal Scalding tool for loading HDFS Vertica and Thrift-LZO datasets into BigQuery.
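The article does not describe what the internal "bq_load" step does. As a rough illustration only, the sketch below assumes it behaves like the public `bq load` command-line tool and shows how a pipeline task might assemble that command for a Parquet dataset in GCS. The dataset, table, and bucket names are hypothetical and are not taken from Twitter's setup.

```python
import shlex


def build_bq_load_command(dataset, table, gcs_uri, source_format="PARQUET"):
    """Return the argument list for a `bq load` call (GCS -> BigQuery).

    Assumption: the internal "bq_load" step resembles the public `bq load` CLI.
    """
    return [
        "bq",
        "load",
        f"--source_format={source_format}",
        f"{dataset}.{table}",  # destination table as dataset.table
        gcs_uri,               # source files in Google Cloud Storage
    ]


# Hypothetical names, for illustration only.
cmd = build_bq_load_command(
    "analytics",
    "tweet_events",
    "gs://example-bucket/tweet_events/*.parquet",
)

# Print a shell-safe version of the command; a scheduler task could pass
# `cmd` to subprocess.run instead of printing it.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems with the wildcard in the GCS path.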

In the following sections, we discuss our approach and experience in the areas of ease of use, performance, data management, system health, and cost.

Ease of Use

We found that it was easy for users to get started with BigQuery because it required no software installation and users could access it through an intuitive web interface. However, users needed to become familiar with some GCP features and concepts, including resources such as projects, datasets, and tables. We developed educational materials and tutorials to help users get started. With that basic understanding, users found it easy to navigate datasets, view schema and table data, run simple queries, and visualize the results in Data Studio.

Our goal for data ingestion into BigQuery was to enable seamless loading of HDFS or GCS datasets with a single click. We considered Cloud Composer (managed Airflow) but could not use it because of our Domain Restricted Sharing security model (more on this in the Data Management section below). We experimented with using the Google Data Transfer Service (DTS) to orchestrate BigQuery load jobs. Although DTS was quick to set up, it was not flexible enough for building pipelines with dependencies. For our alpha release, we built our own Apache Airflow framework on GCE, and we are preparing it to run in production and to support additional data sources such as Vertica.

To transform data in BigQuery, users build simple SQL data pipelines using scheduled queries. For complex multi-stage pipelines with dependencies, we plan to use either our own Airflow framework or Cloud Composer together with Cloud Dataflow.

Performance

BigQuery is designed for general-purpose SQL queries that process large amounts of data. It is not intended for the low-latency, high-throughput queries required of a transactional database, or for the low-latency time series analysis that Apache Druid provides. For interactive analytics queries, our users expect response times of less than one minute. We had to design our use of BigQuery to meet these expectations. To provide predictable performance for our users, we used a BigQuery feature, available to customers on flat-rate pricing, that lets project owners reserve a minimum number of slots for their queries. A BigQuery slot is a unit of computing power required to execute SQL queries.

We analyzed more than 800 queries, each processing about 1 TB of data, and found that the average execution time was 30 seconds. We also learned that performance depends heavily on how our slots are used across different projects and jobs. We had to clearly separate our production and ad hoc slot reservations in order to maintain performance for both production use cases and interactive analysis. This strongly influenced our design for slot reservations and the project hierarchy.
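The kind of analysis described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the job records, project names, and timings are invented, and the 60-second target simply mirrors the "less than one minute" expectation mentioned earlier. It groups query runtimes by project, computes the average for each, and flags projects whose average exceeds the target.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job records: (project, bytes_processed, execution_seconds).
# None of these values come from Twitter's actual workload.
jobs = [
    ("prod-pipelines", 1.1e12, 22.0),
    ("prod-pipelines", 0.9e12, 28.0),
    ("adhoc-analysis", 1.0e12, 50.0),
    ("adhoc-analysis", 1.2e12, 90.0),
]


def mean_runtime_by_project(records):
    """Average execution time in seconds for each project."""
    grouped = defaultdict(list)
    for project, _bytes_processed, seconds in records:
        grouped[project].append(seconds)
    return {project: mean(times) for project, times in grouped.items()}


def projects_over_target(runtimes, target_seconds=60.0):
    """Projects whose average runtime exceeds the interactive target."""
    return sorted(p for p, t in runtimes.items() if t > target_seconds)


runtimes = mean_runtime_by_project(jobs)
over_target = projects_over_target(runtimes)
print(runtimes)
print(over_target)
```

A per-project breakdown like this is one way to see whether ad hoc and production workloads compete for the same slots, which is the concern that motivates separate reservations.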

We will cover data management, system health, and cost in the second part of this translation in the coming days. For now, we invite everyone to a free live webinar, where you can learn more about the course and put questions to our expert, Egor Mateshuk (Senior Data Engineer, MaximaTelecom).


Source: www.habr.com
