Hi Habr! Enrollment for a new course is now open at OTUS
Every day, more than a hundred million people visit Twitter to find out what is happening in the world and to discuss it. Every tweet and every other user action generates an event that is available for internal data analysis at Twitter. Hundreds of employees analyze and visualize this data, and improving their experience is a top priority for the Twitter Data Platform team.
We believe that users with a wide range of technical skills should be able to discover data and have access to well-performing SQL-based analysis and visualization tools. This would allow a whole new group of less technical users, including data analysts and product managers, to extract insights from data, helping them better understand and use Twitter's capabilities. This is how we democratize data analytics at Twitter.
As our tools and internal data analytics capabilities have improved, we have seen the Twitter service improve. However, there is still room for improvement. Current tools such as Scalding require programming experience. SQL-based analysis tools such as Presto and Vertica have performance problems at large scale. We also have the problem of data being spread across multiple systems without consistent access to it.
Last year we announced
- BigQuery: an enterprise data warehouse with a SQL engine based on Dremel, known for its speed, simplicity, and support for machine learning.
- Data Studio: a big data visualization tool with collaboration features similar to Google Docs.
In this article, you will learn about our experience with these tools: what we have done, what we have learned, and what we will do next. For now we will focus on batch and interactive analytics. Real-time analytics will be discussed in the next article.
History of Data Warehouses at Twitter
Before diving into BigQuery, it is worth briefly recounting the history of data warehousing at Twitter. In 2011, Twitter data analysis was performed in Vertica and Hadoop. To create MapReduce jobs for Hadoop, we used Pig. In 2012, we replaced Pig with Scalding, which had a Scala API with benefits such as the ability to build complex pipelines and ease of testing. However, for many data analysts and product managers who were more comfortable working with SQL, it had a steep learning curve. Around 2016, we started using Presto as our SQL front end for Hadoop data. Spark offered a Python interface, which made it a good choice for ad hoc data science and machine learning.
Since 2018, we have used the following tools for data analysis and visualization:
- Scalding for production pipelines
- Scalding and Spark for ad hoc data analysis and machine learning
- Vertica and Presto for ad hoc and interactive SQL analysis
- Druid for interactive, exploratory, low-latency access to time-series metrics
- Tableau, Zeppelin, and Pivot for data visualization
We found that while these tools offer very powerful features, we had difficulty making those features available to a wider audience at Twitter. By extending our platform with Google Cloud, we are focusing on simplifying our analytics tools for all of Twitter.
Google's BigQuery Data Warehouse
Several teams at Twitter have already included BigQuery in some of their production pipelines. Drawing on their experience, we began to evaluate BigQuery's capabilities for all Twitter use cases. Our goal was to offer BigQuery to the entire company, and to standardize and support it within the Data Platform toolset. This was difficult for many reasons. We needed to develop infrastructure to reliably ingest large volumes of data, support company-wide data management, ensure proper access controls, and ensure customer privacy. We also had to create systems for resource allocation, monitoring, and chargebacks so that teams could use BigQuery effectively.
In November 2018, we released an alpha version of BigQuery and Data Studio to the entire company. We offered Twitter employees some of our most frequently used tables in scrubbed form. BigQuery has been used by more than 250 users from a variety of teams, including engineering, finance, and marketing. Most recently, they were running about 8 queries, processing about 100 PB per month, not counting scheduled queries. After receiving very positive feedback, we decided to move forward and offer BigQuery as the primary resource for interacting with data at Twitter.
Here is a high-level architecture diagram of our Google BigQuery data warehouse.
We copy data from on-premises Hadoop clusters to Google Cloud Storage (GCS) using the internal Cloud Replicator tool. We then use Apache Airflow to create pipelines that use "
In the following sections, we discuss our approach and experience in the areas of ease of use, performance, data management, system health, and cost.
Ease of use
We found that it was easy for users to get started with BigQuery, since it required no software installation and users could access it through an intuitive web interface. However, users needed to become familiar with some GCP features and concepts, including resources such as projects, datasets, and tables. We developed tutorials and guides to help users get started. Once they had a basic understanding, users found it easy to navigate datasets, view schemas and table data, run simple queries, and visualize results in Data Studio.
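To make the project, dataset, and table hierarchy concrete, here is a minimal Python sketch of how a table is addressed through those three levels and how a simple preview query might be built from it. The `TableRef` class, the `preview_query` helper, and all resource names are hypothetical and used only for illustration. The sketch assumes identifiers that contain no dots, so it does not cover domain-scoped project IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """A table addressed through the hierarchy: project -> dataset -> table."""
    project: str
    dataset: str
    table: str

    @classmethod
    def parse(cls, table_id: str) -> "TableRef":
        # Expect exactly three non-empty, dot-separated parts.
        parts = table_id.split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected 'project.dataset.table', got {table_id!r}")
        return cls(*parts)

    @property
    def full_id(self) -> str:
        return f"{self.project}.{self.dataset}.{self.table}"


def preview_query(ref: TableRef, limit: int = 10) -> str:
    """Build a simple standard-SQL query that previews a few rows."""
    return f"SELECT * FROM `{ref.full_id}` LIMIT {limit}"


# Hypothetical names, for illustration only.
ref = TableRef.parse("example-project.example_dataset.example_table")
print(preview_query(ref))
```

The backticks around the fully qualified name matter in standard SQL when the project ID contains a hyphen, as it does in this made-up example.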
Our goal for data ingestion into BigQuery was to provide seamless, one-click loading of HDFS or GCS datasets. We considered
To transform data in BigQuery, users create simple SQL data pipelines using scheduled queries. For complex multi-stage pipelines with dependencies, we plan to use either our own Airflow framework or Cloud Composer along with
Performance
BigQuery is designed for general-purpose SQL queries that process large amounts of data. It is not designed for the low-latency, high-throughput queries required by a transactional database, or for the low-latency time-series analysis implemented by
We profiled more than 800 queries, each processing about 1 TB of data, and found that the median execution time was 1 second. We also learned that performance depends heavily on how our slots are used across different projects and jobs. We had to clearly separate our production and ad hoc slot reservations to maintain performance for both production use cases and interactive analysis. This strongly influenced our design for slot reservations and the project hierarchy.
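As a rough illustration of this kind of profiling, the sketch below groups query execution times by the reservation they ran under and computes the median for each group. The reservation names and all timings are invented for the example; they are not Twitter's measurements, and the `median_by_reservation` helper is hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical profiling records: (reservation name, execution time in seconds).
samples = [
    ("production", 12.0),
    ("production", 20.0),
    ("production", 16.0),
    ("ad_hoc", 45.0),
    ("ad_hoc", 5.0),
]


def median_by_reservation(records):
    """Group execution times by reservation and return the median of each group."""
    grouped = defaultdict(list)
    for reservation, seconds in records:
        grouped[reservation].append(seconds)
    return {name: median(times) for name, times in grouped.items()}


print(median_by_reservation(samples))
```

Comparing the per-group medians is one simple way to see whether ad hoc work and production work are affecting each other.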
We will cover data management, functionality, and system cost in the coming days in the second part of this translation, and for now we invite everyone to
Source: www.habr.com