How Google's BigQuery democratized data analysis. Part 2

Hello, Habr! Enrollment is now open at OTUS for a new run of the "Data Engineer" course. Ahead of the course start, we continue to share useful material with you.

Read part one

Data management

Strong data governance is a core tenet of Twitter Engineering. As we bring BigQuery into our platform, we focus on data discovery, access control, security, and privacy.

To discover and manage data, we extended our Data Access Layer (DAL) to provide tooling for both on-premises and Google Cloud data, giving our users a single interface and API. As Google Data Catalog moves toward general availability, we will include it in our projects to give users features such as column search.

BigQuery makes it easy to share and access data, but we needed some control over this to prevent data exfiltration. Among other tools, we chose two features:

  • Domain restricted sharing: a beta feature that prevents users from sharing BigQuery datasets with users outside Twitter.
  • VPC service controls: a control that prevents data exfiltration and requires users to access BigQuery from known IP address ranges.

We implemented the authentication, authorization, and auditing (AAA) security requirements as follows:

  • Authentication: we used GCP user accounts for ad hoc queries and service accounts for production queries.
  • Authorization: we required every dataset to have an owner service account and a reader group.
  • Auditing: we exported BigQuery Stackdriver logs, which contain detailed query execution information, into a BigQuery dataset for easy analysis.
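
The authorization rule above (one owner service account and one reader group per dataset) lends itself to an automated check. The sketch below is only an illustration under assumptions: the `AccessEntry` model, its field values, and the `violations` function are hypothetical and simplified, and are not Twitter's tooling or the BigQuery API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessEntry:
    """One access grant on a dataset (hypothetical, simplified model)."""
    role: str         # assumed values: "OWNER" or "READER"
    entity_type: str  # assumed values: "serviceAccount", "group", "user"
    entity_id: str


def violations(entries: List[AccessEntry]) -> List[str]:
    """Return the rules a dataset breaks; an empty list means compliant."""
    problems = []
    owners = [e for e in entries if e.role == "OWNER"]
    readers = [e for e in entries if e.role == "READER"]
    # Rule 1: at least one owner must be a service account.
    if not any(e.entity_type == "serviceAccount" for e in owners):
        problems.append("no owner service account")
    # Rule 2: read access must be granted through at least one group.
    if not any(e.entity_type == "group" for e in readers):
        problems.append("no reader group")
    return problems
```

A scheduled job could run such a check over every registered dataset and report the ones that return a non-empty list.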

To ensure that the personal data of Twitter users is handled properly, we must register all BigQuery datasets, annotate personal data, store it appropriately, and delete (scrub) data that users have deleted.

We evaluated Google's Cloud Data Loss Prevention API, which uses machine learning to classify and redact sensitive data, but decided in favor of annotating datasets manually because of accuracy. We plan to use the Data Loss Prevention API to augment the custom annotation.

At Twitter, we created four privacy categories for datasets in BigQuery, listed here in descending order of sensitivity:

  • Highly sensitive datasets are made available on an as-needed basis, following the principle of least privilege. Each dataset has a separate reader group, and we will track usage by individual accounts.
  • Medium sensitivity datasets (one-way pseudonyms produced with salted hashing) contain no Personally Identifiable Information (PII) and are accessible to a larger group of employees. This is a good balance between privacy concerns and data utility. It lets employees perform analysis tasks, such as counting the users who used a feature, without knowing who the real users are.
  • Low sensitivity datasets with all user-identifying information removed. This is a good approach from a privacy perspective, but it cannot be used for user-level analysis.
  • Public datasets (released outside Twitter) are available to all Twitter employees.
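
To illustrate the medium sensitivity category, here is a minimal sketch of one-way pseudonyms built with salted hashing, followed by a per-feature user count that never sees a real identifier. The function names, the choice of SHA-256, and the sample data are assumptions made for this example; the article does not say which hash or salt handling Twitter uses.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pseudonymize(user_id: str, salt: bytes) -> str:
    """Return a one-way pseudonym: SHA-256 over salt + user id (assumed scheme)."""
    return hashlib.sha256(salt + user_id.encode("utf-8")).hexdigest()


def count_users_per_feature(events: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct pseudonyms per feature; events are (pseudonym, feature)."""
    users = defaultdict(set)
    for pseudonym, feature in events:
        users[feature].add(pseudonym)
    return {feature: len(ids) for feature, ids in users.items()}


# The salt stays with the team that produces the dataset (hypothetical value).
salt = b"keep-this-secret"
raw_events = [("alice", "dm"), ("bob", "dm"), ("alice", "dm"), ("alice", "search")]
# Analysts would only ever receive the pseudonymized rows.
pseudonymized = [(pseudonymize(uid, salt), feature) for uid, feature in raw_events]
feature_counts = count_users_per_feature(pseudonymized)
```

Because the same user always maps to the same pseudonym under a given salt, distinct counts stay correct, while recovering the original identifier would require knowing the salt and guessing the id.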

For registration, we used scheduled jobs to enumerate BigQuery datasets and register them with the Data Access Layer (DAL), Twitter's metadata repository. Users will annotate datasets with privacy information and also specify a retention period. For scrubbing, we are evaluating the performance and cost of two options: 1. scrubbing datasets in GCS with tools such as Scalding and loading them into BigQuery; 2. using BigQuery DML statements. We will probably use a combination of both methods to meet the requirements of different teams and data.
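
The second scrubbing option can be pictured as a parameterized DML `DELETE`. The sketch below only builds the statement text; the table name, the column name, and the `@deleted_user_ids` array parameter are hypothetical, and the article does not show the statements Twitter actually evaluated.

```python
import re

# Conservative patterns so that identifiers can be embedded in the SQL text safely.
_TABLE_RE = re.compile(r"^[A-Za-z0-9_\-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")
_COLUMN_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_scrub_statement(table: str, id_column: str) -> str:
    """Build a DELETE that removes rows of users listed in an array parameter.

    `table` is expected as project.dataset.table. The ids themselves are not
    interpolated; they would be bound to @deleted_user_ids at execution time.
    """
    if not _TABLE_RE.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not _COLUMN_RE.match(id_column):
        raise ValueError(f"unexpected column name: {id_column!r}")
    return (
        f"DELETE FROM `{table}` "
        f"WHERE {id_column} IN UNNEST(@deleted_user_ids)"
    )


# Hypothetical table and column, for illustration only.
statement = build_scrub_statement("my-project.analytics.events", "user_id")
```

Passing the ids as a query parameter rather than formatting them into the string keeps the statement text stable across runs and avoids injection through user-supplied values.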

System functionality

Because BigQuery is a managed service, there was no need to involve Twitter's SRE team in systems management or on-call duties. It was easy to provision more capacity for both storage and compute. We could change the slot reservation by opening a ticket with Google support. We identified areas that could be improved, such as self-service slot allocation and better monitoring dashboards, and submitted those requests to Google.

Cost

Our preliminary analysis showed that query costs for BigQuery and Presto were at the same level. We purchased flat-rate slots to get a stable monthly cost instead of paying on demand per TB of data processed. This decision was also based on feedback from users who did not want to think about cost before running each query.
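
The trade-off between flat-rate slots and on-demand billing comes down to a break-even volume of data processed per month. The sketch below works through that arithmetic; all prices in it are placeholder numbers chosen for the example, not Google's price list and not figures from Twitter.

```python
def on_demand_cost(tb_processed: float, price_per_tb: float) -> float:
    """Monthly cost when paying per TB of data processed."""
    return tb_processed * price_per_tb


def break_even_tb(flat_monthly_cost: float, price_per_tb: float) -> float:
    """Monthly TB processed at which both pricing models cost the same."""
    if price_per_tb <= 0:
        raise ValueError("price_per_tb must be positive")
    return flat_monthly_cost / price_per_tb


def cheaper_plan(tb_processed: float, flat_monthly_cost: float, price_per_tb: float) -> str:
    """Name the cheaper model for a given monthly volume."""
    if on_demand_cost(tb_processed, price_per_tb) > flat_monthly_cost:
        return "flat-rate"
    return "on-demand"


# Placeholder prices: 10,000 per month flat, 5 per TB on demand.
threshold = break_even_tb(10_000.0, 5.0)  # 2000.0 TB per month
```

Above the threshold the flat rate is cheaper; below it, on-demand is. A fixed rate also removes the per-query cost estimate that users said they did not want to make.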

Storing data in BigQuery incurred costs in addition to GCS costs. Tools such as Scalding require datasets in GCS, and to query them in BigQuery we had to load the same datasets into BigQuery's Capacitor format. We are working on a Scalding connector to BigQuery datasets that will remove the need to store datasets in both GCS and BigQuery.

For the rare cases that required infrequent queries over tens of petabytes, we decided that storing the datasets in BigQuery was not cost-effective and used Presto to access the datasets directly in GCS. For this purpose, we are looking at BigQuery External Data Sources.

Next steps

We have seen a lot of interest in BigQuery since the alpha release. We are adding more datasets and more teams to BigQuery. We are developing connectors so that data analytics tools such as Scalding can read from and write to BigQuery storage. We are looking at tools such as Looker and Apache Zeppelin for building enterprise-quality reports and notebooks on top of BigQuery datasets.

Our collaboration with Google has been very productive, and we are happy to continue and develop this partnership. We worked with Google to set up our own Partner Issue Tracker to send requests directly to Google. Some of them, such as the BigQuery Parquet loader, have already been implemented by Google.

Here are some of our high-priority feature requests for Google:

  • Tools for convenient data ingestion and support for the LZO-Thrift format.
  • Hourly partitioning.
  • Access control improvements such as table-, row-, and column-level permissions.
  • BigQuery External Data Sources with Hive Metastore integration and support for the LZO-Thrift format.
  • Improved data catalog integration in the BigQuery user interface.
  • Self-service slot allocation and monitoring.

Conclusion

Democratizing data analytics, visualization, and machine learning in a secure way is a top priority for the Data Platform team. We identified Google BigQuery and Data Studio as tools that could help achieve this goal, and we released BigQuery Alpha company-wide last year.

We found queries in BigQuery to be simple and efficient. We used Google tools to ingest and transform data for simple pipelines, but for complex pipelines we had to build our own Airflow framework. In the data governance space, BigQuery's services for authentication, authorization, and auditing meet our needs. To manage metadata and maintain privacy, we needed more flexibility and had to build our own systems. BigQuery, being a managed service, was easy to operate. Query costs were similar to those of existing tools. Storing data in BigQuery incurs costs in addition to GCS costs.

Overall, BigQuery works well for general SQL analysis. We are seeing a lot of interest in BigQuery, and we are working to migrate more datasets, bring on more teams, and build more pipelines with BigQuery. Twitter uses a variety of data that requires a combination of tools such as Scalding, Spark, Presto, and Druid. We intend to keep strengthening our data analytics tools and to give our users clear guidance on how best to use what we offer.

Acknowledgments

I would like to thank my co-authors and teammates, Anju Jha and Will Pascucci, for their great collaboration and hard work on this project. I would also like to thank the engineers and managers from several teams at Twitter and Google who helped us, and the BigQuery users at Twitter who provided valuable feedback.

If you are interested in working on these problems, check out our vacancies on the Data Platform team.

Source: www.habr.com
