How Google's BigQuery Democratized Data Analysis. Part 1

Hello, Habr! Enrollment is now open for a new cohort of the OTUS Data Engineer course. Ahead of the course start, we have prepared, as usual, a translation of some interesting material for you.

Every day, more than a hundred million people come to Twitter to find out and talk about what is happening in the world. Every Tweet and every other user action generates an event that is available for internal data analysis at Twitter. Hundreds of employees analyze and visualize this data, and improving their experience is a top priority for the Twitter Data Platform team.

We believe that users with a wide range of technical skills should be able to discover data and have access to well-performing SQL-based analysis and visualization tools. This would allow a whole new group of less technical users, including data analysts and product managers, to draw insights from data and to better understand and use the capabilities of Twitter. This is how we are democratizing data analysis at Twitter.

As our tools and capabilities for internal data analysis have improved, we have seen Twitter improve as well. However, there is still room for improvement. Current tools such as Scalding require programming experience. SQL-based analysis tools such as Presto and Vertica have performance problems at scale. We also have the problem of data being spread across multiple systems without consistent access to it.

Last year we announced a new collaboration with Google, under which we are moving parts of our data infrastructure to Google Cloud Platform (GCP). We concluded that Google Cloud's Big Data tools can help us with our initiatives to democratize analysis, visualization, and machine learning at Twitter:

  • BigQuery: an enterprise data warehouse with a SQL engine based on Dremel, known for its speed, simplicity, and machine learning capabilities.
  • Data Studio: a big data visualization tool with Google Docs-like collaboration features.

In this article you will learn about our experience with these tools: what we did, what we learned, and what we will do next. For now we focus on batch and interactive analytics. We will cover real-time analytics in the next article.

History of Data Warehousing at Twitter

Before diving into BigQuery, it is worth briefly recounting the history of data warehousing at Twitter. In 2011, data analysis at Twitter was performed in Vertica and Hadoop. We used Pig to create MapReduce Hadoop jobs. In 2012, we replaced Pig with Scalding, which had a Scala API with benefits such as the ability to build complex pipelines and ease of testing. However, for many data analysts and product managers who were more comfortable working with SQL, it had a fairly steep learning curve. Around 2016, we started using Presto as a SQL interface to Hadoop data. Spark offered a Python interface, which made it a good choice for ad hoc data science and machine learning.

Since 2018, we have used the following tools for data analysis and visualization:

  • Scalding for production pipelines
  • Scalding and Spark for ad hoc data analysis and machine learning
  • Vertica and Presto for ad hoc and interactive SQL analysis
  • Druid for interactive, exploratory, low-latency access to time series metrics
  • Tableau, Zeppelin, and Pivot for data visualization

We found that while these tools offer very powerful capabilities, we had difficulty making those capabilities available to a wider audience at Twitter. By expanding our platform with Google Cloud, we are focusing on simplifying our analytics tools for all of Twitter.

Google's BigQuery Data Warehouse

Several teams at Twitter have already incorporated BigQuery into some of their production pipelines. Drawing on their expertise, we began to evaluate BigQuery's capabilities against all Twitter use cases. Our goal was to offer BigQuery to the entire company and to standardize and support it within the Data Platform toolset. This was difficult for many reasons. We needed to develop infrastructure to reliably ingest large volumes of data, support company-wide data management, ensure proper access controls, and protect customer privacy. We also had to create systems for resource allocation, monitoring, and chargebacks so that teams could use BigQuery effectively.

In November 2018, we released a company-wide alpha of BigQuery and Data Studio. We offered Twitter employees some of our most frequently used tables with personal data scrubbed. BigQuery has been used by more than 250 users from a variety of teams, including engineering, finance, and marketing. Most recently, they were running about 8k queries and processing about 100 PB per month, not counting scheduled queries. After receiving very positive feedback, we decided to move forward and offer BigQuery as the primary resource for interacting with data at Twitter.

Here is a high-level diagram of our Google BigQuery data warehouse architecture.

[Figure: high-level architecture diagram of the BigQuery data warehouse]

We copy data from on-premises Hadoop clusters to Google Cloud Storage (GCS) using the internal Cloud Replicator tool. We then use Apache Airflow to create pipelines that use "bq_load" to load data from GCS into BigQuery. We use Presto to query Parquet or Thrift-LZO datasets in GCS. BQ Blaster is an internal Scalding tool for loading HDFS Vertica and Thrift-LZO datasets into BigQuery.
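The GCS-to-BigQuery load step described above can be sketched roughly as follows. The article does not say what "bq_load" wraps, so this is only an illustration under the assumption that it resembles the public `bq load` command of the BigQuery command-line tool. The function name `build_bq_load_command`, the table name, and the bucket path are all hypothetical.

```python
import shlex
from typing import List


def build_bq_load_command(table: str, gcs_uri: str,
                          source_format: str = "PARQUET") -> List[str]:
    """Return the argv for a `bq load` call that loads GCS files into a table.

    Assumption: the real pipeline does something similar; the article
    does not describe its internals.
    """
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got {gcs_uri!r}")
    return ["bq", "load", f"--source_format={source_format}", table, gcs_uri]


# Hypothetical dataset, table, and bucket names, for illustration only.
command = build_bq_load_command(
    table="analytics.user_events",
    gcs_uri="gs://example-bucket/user_events/2019-01-01/*.parquet",
)

# Print a shell-quoted version that an orchestrator task could execute.
print(shlex.join(command))
```

Building the command as a list rather than a string keeps quoting concerns out of the pipeline code; an orchestrator task could pass the list straight to `subprocess.run`.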

In the following sections, we discuss our approach and experience in the areas of ease of use, performance, data management, system health, and cost.

Ease of Use

We found that it was easy for users to get started with BigQuery because it requires no software installation and users can access it through an intuitive web interface. However, users needed to become familiar with some GCP features and concepts, including resources such as projects, datasets, and tables. We developed educational materials and tutorials to help users get started. With this basic understanding, users found it easy to navigate datasets, view schema and table data, run simple queries, and visualize results in Data Studio.
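To make the project, dataset, and table hierarchy concrete, here is a minimal sketch that splits a fully qualified table name of the form `project.dataset.table` into its three parts. It assumes the simple case where none of the parts contains a dot; `TableRef`, `parse_table_ref`, and the example names are hypothetical and not part of any Google library.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """The three levels of the resource hierarchy that users need to know."""
    project: str
    dataset: str
    table: str


def parse_table_ref(ref: str) -> TableRef:
    """Split 'project.dataset.table' into its components.

    Assumption: no component contains a dot.
    """
    parts = ref.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'project.dataset.table', got {ref!r}")
    return TableRef(*parts)


# Hypothetical names, for illustration only.
ref = parse_table_ref("example-project.analytics.user_events")
print(ref.project, ref.dataset, ref.table)
```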

Our goal for loading data into BigQuery was to enable seamless, one-click loading of HDFS or GCS datasets. We considered Cloud Composer (managed Airflow) but were unable to use it because of our Domain Restricted Sharing security model (more on this in the Data Management section below). We experimented with using Google Data Transfer Service (DTS) to orchestrate BigQuery load jobs. While DTS was quick to set up, it was not flexible enough for building pipelines with dependencies. For our alpha release, we built our own Apache Airflow framework on GCE, and we are preparing it to run in production and to support more data sources, such as Vertica.

To transform data in BigQuery, users create simple SQL data pipelines using scheduled queries. For complex multi-stage pipelines with dependencies, we plan to use either our own Airflow framework or Cloud Composer together with Cloud Dataflow.
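The difference between independent scheduled queries and a multi-stage pipeline with dependencies comes down to ordering: a stage may only run after everything it depends on has finished. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9+). It is not Airflow code, and the stage names are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a stage, each value is the set of
# stages that must finish before it can start.
dependencies = {
    "load_raw_events": set(),
    "load_user_profiles": set(),
    "join_events_with_profiles": {"load_raw_events", "load_user_profiles"},
    "daily_summary": {"join_events_with_profiles"},
}

# static_order() yields every stage after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

An orchestrator such as Airflow performs this kind of ordering, plus scheduling and retries, which a set of unrelated scheduled queries does not give you.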

Performance

BigQuery is designed for general-purpose SQL queries that process large amounts of data. It is not intended for the low-latency, high-throughput queries required by a transactional database, or for the low-latency time series analysis provided by Apache Druid. For interactive analytics queries, our users expect response times of less than one minute. We had to design our use of BigQuery to meet these expectations. To provide predictable performance for our users, we took advantage of a BigQuery feature, available to customers on flat-rate pricing, that allows project owners to reserve a minimum number of slots for their queries. A BigQuery slot is a unit of computational capacity required to execute SQL queries.

We analyzed more than 800 queries, each processing approximately 1 TB of data, and found that the average execution time was 30 seconds. We also learned that performance depends heavily on how our slots are used across different projects and jobs. We had to clearly separate our production and ad hoc slot reserves to maintain performance for both production use cases and interactive analysis. This greatly influenced our design for slot reservations and the project hierarchy.
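An analysis of this kind can be sketched as a simple aggregation over per-query job records. The article does not describe how the measurement was actually done, so the record fields below are assumptions, and the sample values are made up and are not Twitter's data.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable


@dataclass(frozen=True)
class QueryJob:
    """One finished query (hypothetical record layout)."""
    duration_seconds: float
    bytes_processed: int


def summarize(jobs: Iterable[QueryJob]) -> Dict[str, float]:
    """Return the query count, mean duration, and mean terabytes processed."""
    jobs = list(jobs)
    if not jobs:
        raise ValueError("no jobs to summarize")
    return {
        "count": len(jobs),
        "mean_duration_seconds": mean(j.duration_seconds for j in jobs),
        "mean_tb_processed": mean(j.bytes_processed for j in jobs) / 1e12,
    }


# Made-up sample records, for illustration only.
sample_jobs = [
    QueryJob(duration_seconds=12.0, bytes_processed=500_000_000_000),
    QueryJob(duration_seconds=25.0, bytes_processed=1_000_000_000_000),
    QueryJob(duration_seconds=41.0, bytes_processed=1_500_000_000_000),
]
summary = summarize(sample_jobs)
print(summary)
```

Grouping the same records by project or reservation before summarizing would show how execution time varies with slot usage.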

We will cover data management, functionality, and system cost in the coming days in the second part of this translation. For now, we invite everyone to a free live webinar, where you can learn about the course in detail and put questions to our expert, Egor Mateshuk (Senior Data Engineer, MaximaTelecom).

Source: www.habr.com