How Google's BigQuery Democratized Data Analysis. Part 1

Hello, Habr! Enrollment is now open at OTUS for a new run of the Data Engineer course. Ahead of the course start, we have, as usual, prepared a translation of some interesting material for you.

Every day, more than a hundred million people come to Twitter to find out what is happening in the world and to talk about it. Every tweet and every other user action generates an event that is available for internal data analysis at Twitter. Hundreds of employees analyze and visualize this data, and improving their experience is a top priority for the Twitter Data Platform team.

We believe that users with a wide range of technical skills should be able to discover data and have access to well-performing SQL-based analysis and visualization tools. This would allow a whole new group of less technical users, including data analysts and product managers, to extract insights from data, helping them better understand and use the power of Twitter. This is how we democratize data analysis at Twitter.

As our tools and capabilities for internal data analysis have improved, we have seen the Twitter service improve as well. However, there is still room for improvement. Current tools such as Scalding require programming experience. SQL-based analysis tools such as Presto and Vertica have performance problems at scale. We also have the problem of data being spread across multiple systems without consistent access to it.

Last year we announced a new collaboration with Google, under which we are moving parts of our data infrastructure to Google Cloud Platform (GCP). We concluded that Google Cloud's Big Data tools can help with our efforts to democratize analysis, visualization, and machine learning at Twitter:

  • BigQuery: an enterprise data warehouse with a SQL engine based on Dremel, known for its speed, simplicity, and support for machine learning.
  • Data Studio: a big data visualization tool with collaboration features similar to Google Docs.

In this article, you will learn about our experience with these tools: what we have done, what we have learned, and what we will do next. Here we focus on batch and interactive analytics. Real-time analytics will be discussed in the next article.

History of Data Warehouses at Twitter

Before diving into BigQuery, it is worth briefly recounting the history of data warehouses at Twitter. In 2011, Twitter data analysis was done in Vertica and Hadoop. To create Hadoop MapReduce jobs, we used Pig. In 2012, we replaced Pig with Scalding, which had a Scala API with benefits such as the ability to build complex pipelines and ease of testing. However, for the many data analysts and product managers who were more comfortable working with SQL, it had a steep learning curve. Around 2016, we started using Presto as our SQL front end for Hadoop data. Spark offered a Python interface, which made it a good choice for ad hoc data science and machine learning.

Since 2018, we have used the following tools for data analysis and visualization:

  • Scalding for production pipelines
  • Scalding and Spark for ad hoc data analysis and machine learning
  • Vertica and Presto for ad hoc and interactive SQL analysis
  • Druid for interactive, exploratory, low-latency access to time series metrics
  • Tableau, Zeppelin, and Pivot for data visualization

We found that while these tools offer very powerful capabilities, we had difficulty making those capabilities available to a wider audience at Twitter. By extending our platform with Google Cloud, we are focusing on simplifying our analytics tools for all of Twitter.

Google's BigQuery Data Warehouse

Several teams at Twitter had already incorporated BigQuery into some of their production pipelines. Drawing on their experience, we began to evaluate BigQuery's capabilities for all Twitter use cases. Our goal was to offer BigQuery to the entire company, and to standardize and support it within the Data Platform toolset. This was difficult for many reasons. We needed to develop infrastructure to reliably ingest large volumes of data, support company-wide data management, ensure proper access controls, and protect customer privacy. We also had to build systems for resource allocation, monitoring, and chargebacks so that teams could use BigQuery effectively.

In November 2018, we released an alpha version of BigQuery and Data Studio to the entire company. We offered Twitter employees some of our most frequently used tables, scrubbed of personal data. BigQuery has been used by more than 250 users from a variety of teams, including engineering, finance, and marketing. Most recently, they were running about 8k queries and processing about 100 PB per month, not counting scheduled queries. After receiving very positive feedback, we decided to move forward and offer BigQuery as the primary resource for interacting with data at Twitter.

Here is a high-level architecture diagram of our Google BigQuery data warehouse.

We copy data from on-premises Hadoop clusters to Google Cloud Storage (GCS) using the internal Cloud Replicator tool. We then use Apache Airflow to create pipelines that use "bq_load" to load data from GCS into BigQuery. We use Presto to query Parquet or Thrift-LZO datasets in GCS. BQ Blaster is an internal Scalding tool for loading HDFS Vertica and Thrift-LZO datasets into BigQuery.
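As a rough illustration of the load step from GCS into BigQuery, here is a minimal sketch in Python that assembles a `bq load` command line. It assumes the public `bq` command-line tool and its `--source_format` and `--replace` flags; the table name, bucket, and path are hypothetical, and this is not Twitter's internal "bq_load" implementation, which the article does not describe.

```python
import shlex
from typing import List


def build_bq_load_command(
    table: str,
    gcs_uri: str,
    source_format: str = "PARQUET",
    replace: bool = False,
) -> List[str]:
    """Return the argument list for a `bq load` call from GCS into a table."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError("expected a Google Cloud Storage URI starting with gs://")
    command = ["bq", "load", f"--source_format={source_format}"]
    if replace:
        # Overwrite the destination table instead of appending to it.
        command.append("--replace")
    command += [table, gcs_uri]
    return command


# Hypothetical dataset, table, and bucket names, for illustration only.
command = build_bq_load_command(
    "analytics.events",
    "gs://example-bucket/events/2018-11-01/*.parquet",
)
# shlex.join quotes the wildcard so the line is safe to paste into a shell.
print(shlex.join(command))
```

An orchestrator such as Airflow could run a command like this once per partition; passing the argument list to `subprocess.run` avoids shell quoting issues altogether.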

In the following sections, we discuss our approach and experience in the areas of ease of use, performance, data management, system health, and cost.

Ease of Use

We found that it was easy for users to get started with BigQuery, since it required no software installation and users could access it through an intuitive web interface. However, users did need to become familiar with some GCP features and concepts, including resources such as projects, datasets, and tables. We developed guides and tutorials to help users get started. With that basic understanding in place, it is easy for users to navigate datasets, view schemas and table data, run simple queries, and visualize the results in Data Studio.

Our goal for data ingestion into BigQuery was to provide seamless, one-click loading of HDFS or GCS datasets. We considered Cloud Composer (managed Airflow) but were unable to use it because of our "Domain Restricted Sharing" security model (more on this in the Data Management section). We experimented with Google Data Transfer Service (DTS) to orchestrate BigQuery load jobs. While DTS was quick to set up, it was not flexible enough for building pipelines with dependencies. For our alpha release, we built our own Apache Airflow environment on GCE, and we are preparing it for production and for supporting more data sources, such as Vertica.

To transform data in BigQuery, users create simple SQL data pipelines using scheduled queries. For complex multi-stage pipelines with dependencies, we plan to use either our own Airflow framework or Cloud Composer together with Cloud Dataflow.
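The difference between independent scheduled queries and a multi-stage pipeline with dependencies can be sketched as follows. This is only an illustration of dependency ordering using Python's standard `graphlib` module (Python 3.9+), not the Airflow or Cloud Composer setup itself; the step names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline: Dict[str, Set[str]] = {
    "load_raw_events": set(),
    "clean_events": {"load_raw_events"},
    "daily_active_users": {"clean_events"},
    "engagement_summary": {"clean_events"},
    "dashboard_table": {"daily_active_users", "engagement_summary"},
}


def execution_order(steps: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every step runs after all of its dependencies."""
    # TopologicalSorter raises graphlib.CycleError if the dependencies loop.
    return list(TopologicalSorter(steps).static_order())


order = execution_order(pipeline)
print(order)
```

A plain scheduled query only knows when to run, not what must finish first, which is why pipelines like the one above call for an orchestrator that tracks dependencies.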

Performance

BigQuery is designed for general-purpose SQL queries that process large amounts of data. It is not intended for the low-latency, high-throughput queries required of a transactional database, or for the low-latency time series analysis implemented by Apache Druid. For interactive analytical queries, our users expect response times of under one minute. We had to design our use of BigQuery to meet that expectation. To provide predictable performance for our users, we used a BigQuery feature, available to customers on a flat-rate pricing basis, that lets project owners reserve a minimum number of slots for their queries. A BigQuery slot is a unit of computing power required to execute SQL queries.

We analyzed more than 800 queries, each processing about 1 TB of data, and found that the average execution time was 30 seconds. We also learned that performance depends heavily on how our slots are used across different projects and jobs. We had to clearly separate our production and ad hoc slot reservations to maintain performance for both production use cases and interactive analysis. This strongly influenced our design for slot reservations and the project hierarchy.
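The idea of keeping production and ad hoc workloads in separate slot pools can be modeled in a few lines. The sketch below is a plain-Python toy model, not the BigQuery Reservations API; the reservation names, slot counts, and project names are all hypothetical and are not taken from the article.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Reservation:
    """A named pool of slots that is shared only by the projects assigned to it."""
    name: str
    slots: int


# Hypothetical slot counts; the article does not state the real figures.
RESERVATIONS: Dict[str, Reservation] = {
    "production": Reservation("production", 2000),
    "adhoc": Reservation("adhoc", 1000),
}

# Hypothetical projects, each assigned to exactly one reservation.
PROJECT_ASSIGNMENTS: Dict[str, str] = {
    "prod-pipelines": "production",
    "analyst-sandbox": "adhoc",
}


def reservation_for(project: str) -> Reservation:
    """Look up the slot pool that a project's queries draw from."""
    return RESERVATIONS[PROJECT_ASSIGNMENTS[project]]


for project in PROJECT_ASSIGNMENTS:
    pool = reservation_for(project)
    print(f"{project}: {pool.name} ({pool.slots} slots)")
```

Because each project maps to exactly one pool, a burst of ad hoc queries in this model cannot consume the slots set aside for production jobs.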

We will cover data management, system health, and cost in the coming days in the second part of this translation. In the meantime, we invite everyone to a free live webinar, where you can learn more about the course and put your questions to our expert, Egor Mateshuk (Senior Data Engineer, MaximaTelecom).


Source: www.habr.com
