How Google's BigQuery democratized data analysis. Part 1

Hello, Habr! Enrollment is now open at OTUS for a new run of the "Data Engineer" course. Ahead of the course start, we have prepared a translation of some interesting material for you.

Every day, more than one hundred million people visit Twitter to find out about and discuss what is happening in the world. Every tweet and every other user action generates an event that is available for Twitter's internal data analysis. Hundreds of employees analyze and visualize this data, and improving their experience is a top priority for the Twitter Data Platform team.

We believe that users with a wide range of technical skills should be able to find data and have access to well-performing SQL-based analysis and visualization tools. This would allow a whole new group of less technical users, including data analysts and product managers, to extract insights from data, helping them better understand and use Twitter's capabilities. This is how we democratize data analysis at Twitter.

As our tools and capabilities for internal data analysis have improved, we have seen Twitter's service improve as well. However, there is still room for improvement. Current tools such as Scalding require programming experience. SQL-based analysis tools such as Presto and Vertica have performance problems at scale. We also have the problem of data being spread across multiple systems without consistent access to it.

Last year we announced a new collaboration with Google, under which we are moving parts of our data infrastructure to Google Cloud Platform (GCP). We concluded that Google Cloud's Big Data tools can help us in our efforts to democratize analysis, visualization, and machine learning at Twitter:

  • BigQuery: an enterprise data warehouse with a SQL engine based on Dremel, known for its speed, simplicity, and support for machine learning.
  • Data Studio: a big data visualization tool with Google Docs-like collaboration features.

In this article, you will learn about our experience with these tools: what we did, what we learned, and what we will do next. For now, we will focus on batch and interactive analytics. We will discuss real-time analytics in the next article.

History of Twitter's Data Warehouses

Before diving into BigQuery, it is worth briefly recounting the history of data warehousing at Twitter. In 2011, Twitter's data analysis was performed in Vertica and Hadoop. We used Pig to create MapReduce jobs for Hadoop. In 2012, we replaced Pig with Scalding, which offered a Scala API with benefits such as the ability to build complex pipelines and ease of testing. However, for the many data analysts and product managers who were more comfortable with SQL, this was a steep learning curve. In 2016, we started using Presto as a SQL interface to Hadoop data. Spark offered a Python interface, which made it a good choice for ad hoc data exploration and machine learning.

Since 2018, we have used the following tools for data analysis and visualization:

  • Scalding for production pipelines
  • Scalding and Spark for ad hoc data analysis and machine learning
  • Vertica and Presto for ad hoc and interactive SQL analysis
  • Druid for low-latency, interactive, and exploratory access to time series metrics
  • Tableau, Zeppelin, and Pivot for data visualization

We found that while these tools offer powerful capabilities, we struggled to make them available to a wider audience at Twitter. By extending our platform with Google Cloud, we are focusing on simplifying our analytics tools for all of Twitter.

Google BigQuery Data Warehouse

Several teams at Twitter had already incorporated BigQuery into some of their production pipelines. Building on their experience, we began evaluating BigQuery's capabilities for all Twitter use cases. Our goal was to offer BigQuery to the entire company and to standardize and support it within the Data Platform toolset. This was difficult for many reasons. We needed to build infrastructure to reliably ingest large volumes of data, support company-wide data management, ensure proper access controls, and protect customer privacy. We also had to create systems for resource allocation, monitoring, and chargebacks so that teams could use BigQuery effectively.

In November 2018, we launched a company-wide alpha release of BigQuery and Data Studio. We offered Twitter employees some of our most frequently used tables with cleaned data. More than 250 users from a variety of teams, including engineering, finance, and marketing, used BigQuery. Most recently, they were running about 8,000 queries and processing roughly 100 PB per month, not counting scheduled queries. After receiving very positive feedback, we decided to move forward and offer BigQuery as the primary resource for interacting with data at Twitter.

Here is a high-level diagram of our Google BigQuery data warehouse.

We copy data from on-premises Hadoop clusters to Google Cloud Storage (GCS) using our internal Cloud Replicator tool. We then use Apache Airflow to create pipelines that use "bq_load" to load data from GCS into BigQuery. We use Presto to query Parquet or Thrift-LZO datasets in GCS. BQ Blaster is an internal Scalding tool for loading HDFS Vertica and Thrift-LZO datasets into BigQuery.
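
The GCS-to-BigQuery load step can be sketched roughly as follows. This is a minimal illustration, not Twitter's actual pipeline: it only assembles a command line for the Google Cloud `bq` CLI without running it, and the dataset, table, and bucket names are hypothetical. We assume that the "bq_load" step wraps something similar; the article does not say.

```python
from typing import List


def build_bq_load_command(dataset: str, table: str, gcs_uri: str,
                          source_format: str = "PARQUET") -> List[str]:
    """Assemble (but do not run) a `bq load` invocation for one table."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError("expected a Google Cloud Storage URI (gs://...)")
    return [
        "bq", "load",
        f"--source_format={source_format}",
        f"{dataset}.{table}",  # destination table (hypothetical name)
        gcs_uri,               # source files previously copied to GCS
    ]


# Hypothetical names, for illustration only.
command = build_bq_load_command(
    "analytics", "tweet_events", "gs://example-bucket/tweet_events/*.parquet"
)
print(" ".join(command))
```

An orchestrator task could hand such a list to `subprocess.run`; here it is only printed so that the sketch has no side effects.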

In the following sections, we discuss our approach and experience in the areas of ease of use, performance, data management, system health, and cost.

Ease of Use

We found that it was easy for users to get started with BigQuery because it required no software installation and was accessible through an intuitive web interface. However, users needed to become familiar with some GCP features and concepts, including resources such as projects, datasets, and tables. We developed training materials and tutorials to help users get started. Once they had a basic understanding, users could easily navigate datasets, view schemas and table data, run simple queries, and visualize the results in Data Studio.

Our goal for data ingestion into BigQuery was to enable seamless loading of HDFS or GCS datasets with one click. We considered Cloud Composer (managed Airflow) but could not use it because of our "Domain Restricted Sharing" security model (more on this in the "Data Management" section below). We experimented with using Google Data Transfer Service (DTS) to orchestrate BigQuery load jobs. While DTS was quick to set up, it was not flexible enough for building pipelines with dependencies. For our alpha release, we built our own Apache Airflow environment in GCE and are preparing it for production and for supporting more data sources, such as Vertica.

To transform data in BigQuery, users create simple SQL data pipelines using scheduled queries. For complex multi-stage pipelines with dependencies, we plan to use either our own Airflow infrastructure or Cloud Composer together with Cloud Dataflow.
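
To show why multi-stage pipelines with dependencies call for an orchestrator rather than independently scheduled queries, here is a minimal sketch that derives a valid run order from a dependency graph. The step names are hypothetical, and the standard-library `graphlib` module stands in for what a tool such as Airflow does with its task graph; this is not Airflow code.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "load_raw_events": set(),
    "clean_events": {"load_raw_events"},
    "daily_active_users": {"clean_events"},
    "engagement_summary": {"clean_events"},
    "dashboard_table": {"daily_active_users", "engagement_summary"},
}


def execution_order(steps):
    """Return one valid run order in which every step follows its dependencies."""
    return list(TopologicalSorter(steps).static_order())


order = execution_order(pipeline)
print(order)
```

Independently scheduled queries would have to guess when their upstream steps finish; an explicit graph makes the ordering a property of the pipeline itself.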

Performance

BigQuery is designed for general-purpose SQL queries that process large volumes of data. It is not intended for the low-latency, high-throughput queries required by a transactional database, or for the low-latency time series analysis provided by Apache Druid. For interactive analytical queries, our users expect response times of less than one minute. We had to design our use of BigQuery to meet these expectations. To give our users predictable performance, we took advantage of BigQuery functionality that is available to customers on a flat-rate basis, which lets project owners reserve a minimum number of slots for their queries. A BigQuery slot is a unit of the computing power needed to execute SQL queries.

We analyzed over 800 queries, each processing approximately 1 TB of data, and found that the average execution time was 30 seconds. We also learned that performance depends heavily on how our slots are used across different projects and tasks. We had to clearly separate our production and ad hoc slot reservations to maintain performance for both production use cases and interactive analysis. This strongly influenced our design for slot reservations and project hierarchy.
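
The idea of separating production and ad hoc reservations can be illustrated with a toy model. This is only a sketch under stated assumptions: the slot counts are invented, the `Reservation` class and `validate_reservations` function are hypothetical helpers, and nothing here calls the BigQuery reservations API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    name: str
    slots: int  # slots set aside for this workload


def validate_reservations(purchased_slots, reservations):
    """Check that the reservations fit within the purchased flat-rate capacity.

    Returns the number of slots left unreserved.
    """
    total = sum(r.slots for r in reservations)
    if total > purchased_slots:
        raise ValueError(
            f"reserved {total} slots, but only {purchased_slots} were purchased"
        )
    return purchased_slots - total


# Invented numbers: production pipelines and interactive analysis get
# separate pools so that neither can starve the other.
reservations = [Reservation("production", 1500), Reservation("ad_hoc", 500)]
unreserved = validate_reservations(2000, reservations)
print(unreserved)
```

Keeping the two pools separate means that a burst of ad hoc queries is bounded by its own reservation rather than by whatever production jobs happen to leave free.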

We will talk about data management, system health, and cost in the coming days in the second part of this translation. For now, we invite everyone to a free live webinar, where you can learn more about the course and ask questions of our expert, Egor Mateshuk (Senior Data Engineer, MaximaTelecom).

Source: www.habr.com