How Google's BigQuery democratized data analysis. Part 1

Hi, Habr! Enrollment is now open for a new run of the "Data Engineer" course at OTUS. Ahead of the course start, we have, as usual, prepared a translation of an interesting article for you.

Every day, more than a hundred million people visit Twitter to find out what is happening in the world and to talk about it. Every tweet and every other user action generates an event that is made available for internal data analysis at Twitter. Hundreds of employees analyze and visualize this data, and improving their experience is a top priority for the Twitter Data Platform team.

We believe that users with a wide range of technical skills should be able to find data and have access to well-performing SQL-based analysis and visualization tools. This would allow a whole new group of less technical users, including data analysts and product managers, to extract insights from data and so better understand and use the power of Twitter. This is how we democratize data analytics at Twitter.

As our tools and capabilities for internal data analysis have improved, we have seen Twitter improve. However, there is still room for improvement. Current tools such as Scalding require programming experience. SQL-based analysis tools such as Presto and Vertica have performance problems at scale. We also have the problem of data being spread across multiple systems without consistent access to it.

Last year we announced a new collaboration with Google, under which we are migrating parts of our data infrastructure to Google Cloud Platform (GCP). We concluded that Google Cloud's Big Data tools can help us with our initiatives to democratize analytics, visualization, and machine learning at Twitter:

  • BigQuery: an enterprise data warehouse with a SQL engine based on Dremel, which is known for its speed, simplicity, and machine learning support.
  • Data Studio: a big data visualization tool with Google Docs-like collaboration features.

In this article, you will learn about our experience with these tools: what we did, what we learned, and what we will do next. Here we focus on batch and interactive analytics. We will discuss real-time analytics in the next article.

History of Twitter Data Stores

Before diving into BigQuery, it is worth briefly recounting the history of data warehousing at Twitter. In 2011, data analysis at Twitter was done in Vertica and Hadoop. We used Pig to create Hadoop MapReduce jobs. In 2012, we replaced Pig with Scalding, which had a Scala API with benefits such as the ability to build complex pipelines and ease of testing. However, for many data analysts and product managers who were more comfortable working with SQL, it had a rather steep learning curve. Around 2016, we started using Presto as a SQL interface to Hadoop data. Spark offered a Python interface, which makes it a good choice for ad hoc data science and machine learning.

Since 2018, we have used the following tools for data analysis and visualization:

  • Scalding for production pipelines
  • Scalding and Spark for ad hoc data analysis and machine learning
  • Vertica and Presto for ad hoc and interactive SQL analysis
  • Druid for interactive, exploratory, low-latency access to time series metrics
  • Tableau, Zeppelin, and Pivot for data visualization

We found that while these tools offer very powerful capabilities, we struggled to make those capabilities available to a wider audience at Twitter. By extending our platform with Google Cloud, we are focusing on simplifying our analytics tools for all of Twitter.

Google's BigQuery Data Warehouse

Several teams at Twitter had already incorporated BigQuery into some of their production pipelines. Drawing on their experience, we began evaluating BigQuery's capabilities for all Twitter use cases. Our goal was to offer BigQuery to the entire company, and to standardize and support it within the Data Platform toolset. This was difficult for many reasons. We needed to build infrastructure to reliably ingest large volumes of data, support company-wide data management, ensure proper access controls, and protect customer privacy. We also had to create systems for resource allocation, monitoring, and chargebacks so that teams could use BigQuery effectively.

In November 2018, we released a company-wide alpha of BigQuery and Data Studio. We offered Twitter employees some of our most frequently used tables with scrubbed data. BigQuery has been used by more than 250 users from a variety of teams, including engineering, finance, and marketing. Most recently, they were running about 250k queries and processing about 8 PB per month, not counting scheduled queries. After receiving very positive feedback, we decided to move forward and offer BigQuery as the primary resource for interacting with data at Twitter.

Here is a high-level diagram of our Google BigQuery data warehouse architecture.

[Diagram: high-level architecture of the Google BigQuery data warehouse]

We copy data from on-premises Hadoop clusters to Google Cloud Storage (GCS) using the internal Cloud Replicator tool. We then use Apache Airflow to build pipelines that use "bq_load" to load data from GCS into BigQuery. We use Presto to query Parquet or Thrift-LZO datasets in GCS. BQ Blaster is an internal Scalding tool for loading HDFS, Vertica, and Thrift-LZO datasets into BigQuery.
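
As a rough illustration of the load step, the sketch below assembles a command line for Google's public `bq load` CLI that would load Parquet files from GCS into a BigQuery table. It assumes that the "bq_load" step mentioned above behaves like that CLI, which the article does not confirm. The function name, bucket, dataset, and table are hypothetical, and the command is only built, never executed.

```python
from typing import List


def build_bq_load_command(table: str, gcs_uri: str, source_format: str = "PARQUET") -> List[str]:
    """Build (but do not run) a `bq load` command for files stored in GCS.

    Assumption: the internal "bq_load" step resembles the public `bq load` CLI.
    """
    if not gcs_uri.startswith("gs://"):
        raise ValueError("expected a GCS URI starting with gs://")
    return ["bq", "load", f"--source_format={source_format}", table, gcs_uri]


# Hypothetical dataset, table, and bucket names, for illustration only.
command = build_bq_load_command(
    table="analytics.user_events",
    gcs_uri="gs://example-replicated-bucket/user_events/*.parquet",
)
print(" ".join(command))
```

In an orchestrated pipeline, a command like this would typically be one task that runs only after the replication to GCS has finished.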

In the following sections, we discuss our approach and experience in the areas of ease of use, performance, data management, system health, and cost.

Ease of use

We found that it was easy for users to get started with BigQuery because it required no software installation and users could access it through an intuitive web interface. However, users did need to become familiar with some GCP features and concepts, including resources such as projects, datasets, and tables. We developed tutorials and training materials to help users get started. With a basic understanding in place, users found it easy to navigate datasets, view schema and table data, run simple queries, and visualize results in Data Studio.

Our goal for data ingestion into BigQuery was to enable seamless loading of HDFS or GCS datasets with a single click. We considered Cloud Composer (managed Airflow) but were unable to use it because of our Domain Restricted Sharing security model (more on this in the Data Management section). We experimented with using the Google Data Transfer Service (DTS) to orchestrate BigQuery workloads. While DTS was quick to set up, it was not flexible enough for building pipelines with dependencies. For our alpha release, we built our own Apache Airflow framework in GCE, and we are preparing it to run in production and to support more data sources such as Vertica.

To transform data in BigQuery, users create simple SQL data pipelines using scheduled queries. For complex multi-stage pipelines with dependencies, we plan to use either our own Airflow framework or Cloud Composer together with Cloud Dataflow.
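
To make "pipelines with dependencies" concrete, here is a minimal sketch of the ordering problem an orchestrator such as Airflow solves: every step must run only after the steps it depends on. It uses Python's standard-library `graphlib` rather than Airflow itself, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "load_raw_events": set(),
    "load_user_table": set(),
    "join_events_with_users": {"load_raw_events", "load_user_table"},
    "daily_aggregates": {"join_events_with_users"},
}


def execution_order(steps):
    """Return one valid order in which every step runs after its dependencies."""
    return list(TopologicalSorter(steps).static_order())


order = execution_order(pipeline)
print(order)
```

A plain scheduled query has no notion of such a graph, which is why it suits only simple, single-step transformations.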

Performance

BigQuery is designed for general-purpose SQL queries that process large amounts of data. It is not intended for the low-latency, high-throughput queries required of a transactional database, or for the low-latency time series analysis provided by Apache Druid. For interactive analytical queries, our users expect response times of less than one minute. We had to design our use of BigQuery to meet these expectations. To give our users predictable performance, we made use of a BigQuery feature available to customers on flat-rate pricing that lets project owners reserve a minimum number of slots for their queries. A BigQuery slot is a unit of computing power required to execute SQL queries.
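
The idea of reserving minimum slots out of a fixed purchased capacity can be sketched as a simple bookkeeping check. This is only a toy model under stated assumptions, not BigQuery's reservation API: the `Reservation` type, the project names, and all slot counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Reservation:
    project: str  # hypothetical project name
    slots: int    # minimum number of slots reserved for the project


def check_reservations(purchased_slots: int, reservations: List[Reservation]) -> Dict[str, int]:
    """Validate that reservations fit in the purchased capacity.

    Returns the reserved slots per project plus the unreserved remainder.
    """
    total = sum(r.slots for r in reservations)
    if total > purchased_slots:
        raise ValueError(f"reserved {total} slots but only {purchased_slots} purchased")
    summary = {r.project: r.slots for r in reservations}
    summary["unreserved"] = purchased_slots - total
    return summary


# Illustrative numbers only; they are not Twitter's actual capacity.
summary = check_reservations(
    purchased_slots=2000,
    reservations=[Reservation("project-a", 1200), Reservation("project-b", 500)],
)
print(summary)
```

The point of the model is that a reservation guarantees a floor for each project, so one team's heavy queries cannot consume capacity another team is counting on.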

We analyzed more than 800 queries, each processing approximately 1 TB of data, and found that the average execution time was 30 seconds. We also learned that performance depends heavily on how our slots are used across different projects and tasks. We had to clearly separate our production and ad hoc slot reserves to maintain performance for both production use cases and interactive analysis. This strongly influenced our design for slot reservations and project hierarchy.
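
The figures above imply a rough per-query scan rate, which the short calculation below works out. It assumes decimal units (1 TB = 1000 GB) and treats the reported averages as exact, so the result is only an order-of-magnitude estimate.

```python
DATA_PER_QUERY_GB = 1000      # about 1 TB per query, from the text (decimal units assumed)
AVG_EXECUTION_SECONDS = 30    # average execution time, from the text
TARGET_SECONDS = 60           # users expect interactive answers in under a minute

throughput_gb_per_s = DATA_PER_QUERY_GB / AVG_EXECUTION_SECONDS
headroom = TARGET_SECONDS / AVG_EXECUTION_SECONDS

print(f"average scan rate: about {throughput_gb_per_s:.1f} GB/s per query")
print(f"the average query uses 1/{headroom:.0f} of the one-minute budget")
```

Because this is an average, individual queries can still exceed the one-minute target when slots are contended, which is why the reserves had to be separated.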

We will cover data management, system health, and cost in the coming days in the second part of this translation. For now, we invite everyone to a free live webinar, where you can learn more about the course and put questions to our expert, Egor Mateshuk (Senior Data Engineer, MaximaTelecom).

Source: www.habr.com