How Google BigQuery democratized data analysis. Part 2

Hi, Habr! Enrollment is now open for a new cohort of the OTUS Data Engineer course. Ahead of the course start, we continue to share useful material with you.

Read part one

Data governance

Strong data governance is a core tenet of Twitter Engineering. As we bring BigQuery into our platform, we focus on data discovery, access control, security, and privacy.

To discover and manage data, we extended our Data Access Layer (DAL) to provide tooling for both on-premises and Google Cloud data, giving our users a single interface and API. As Google Data Catalog moves toward general availability, we will include it in our projects to give users features such as column search.

BigQuery makes it easy to share and access data, but we needed some control over this to prevent data exfiltration. Among other tools, we selected two features for this purpose.

We implemented the authentication, authorization, and auditing (AAA) requirements for security as follows:

  • Authentication: We used GCP user accounts for ad hoc queries and service accounts for production queries.
  • Authorization: We required each dataset to have an owner service account and a reader group.
  • Auditing: We exported the BigQuery Stackdriver logs, which contain detailed query execution information, into a BigQuery dataset for easy analysis.
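
Once audit logs sit in a queryable dataset, the kind of analysis this enables can be sketched in plain Python. This is only an illustration: the record shape (`principal`, `bytes_processed`), the sample accounts, and both function names are hypothetical stand-ins, not the actual exported log schema.

```python
from collections import defaultdict

# Hypothetical, simplified audit rows; real exported logs carry many more fields.
SAMPLE_LOGS = [
    {"principal": "analyst@example.com", "bytes_processed": 2_000_000_000},
    {"principal": "pipeline-sa@example.iam.gserviceaccount.com", "bytes_processed": 5_000_000_000},
    {"principal": "analyst@example.com", "bytes_processed": 1_000_000_000},
]


def bytes_by_principal(rows):
    """Sum the bytes processed per account across all audit rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["principal"]] += row["bytes_processed"]
    return dict(totals)


def top_principals(rows, n=1):
    """Return the n accounts that processed the most bytes, largest first."""
    totals = bytes_by_principal(rows)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
```

The same aggregation could equally be written as a SQL query over the audit dataset; the Python form is used here only to keep the sketch self-contained.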

To ensure that Twitter users' personal data is handled properly, we must register all BigQuery datasets, annotate personal data, maintain proper storage, and delete (scrub) data that users have deleted.

We looked at the Google Cloud Data Loss Prevention API, which uses machine learning to classify and redact sensitive data, but decided in favor of annotating datasets manually because of accuracy. We plan to use the Data Loss Prevention API to augment the custom annotations.

At Twitter, we created four privacy categories for datasets in BigQuery, listed here in descending order of sensitivity:

  • Highly sensitive datasets are made available on an as-needed basis, following the principle of least privilege. Each dataset has a separate reader group, and we track usage by individual accounts.
  • Medium-sensitivity datasets (one-way pseudonyms created with salted hashing) contain no Personally Identifiable Information (PII) and are accessible to a larger group of employees. This is a good balance between privacy concerns and data utility. It lets employees perform analysis tasks, such as computing the number of users who used a feature, without knowing who the real users are.
  • Low-sensitivity datasets have all user-identifying information removed. This is a good approach from a privacy perspective, but such datasets cannot be used for user-level analysis.
  • Public datasets (released outside Twitter) are available to all Twitter employees.
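
The medium-sensitivity category relies on one-way pseudonyms built with salted hashing. A minimal sketch of the idea follows, assuming SHA-256 as the hash function and a secret salt shared across the dataset; the source does not say which hash or salt scheme Twitter uses, and the function names and sample events are hypothetical.

```python
import hashlib


def pseudonymize(user_id: str, salt: bytes) -> str:
    """Return a one-way pseudonym: SHA-256 over the secret salt plus the user ID."""
    return hashlib.sha256(salt + user_id.encode("utf-8")).hexdigest()


def count_feature_users(events, salt: bytes) -> int:
    """Count distinct users in (user_id, feature) events using pseudonyms only."""
    pseudonyms = {pseudonymize(user_id, salt) for user_id, _feature in events}
    return len(pseudonyms)


# Hypothetical sample: two distinct users, one of whom appears twice.
SAMPLE_EVENTS = [("user-1", "search"), ("user-2", "search"), ("user-1", "search")]
```

Because the same salt maps the same user to the same pseudonym, distinct counts stay correct while the raw identifier never appears in the analysis dataset. Anyone who knows the salt can recompute pseudonyms for known IDs, so the salt has to be kept secret.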

As for registration, we used scheduled jobs to enumerate BigQuery datasets and register them with the Data Access Layer (DAL), Twitter's metadata store. Users will annotate datasets with privacy information and also specify a retention period. As for scrubbing, we are evaluating the performance and cost of two options: 1. Scrubbing datasets in GCS with tools like Scalding and loading them into BigQuery; 2. Using BigQuery DML statements. We will likely use a combination of both methods to meet the requirements of different teams and data.
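
The second scrubbing option, BigQuery DML, can be sketched as a helper that builds a parameterized DELETE statement. This is an illustration under assumptions: the table name, the `user_id` column, the `@deleted_user_ids` parameter name, and the helper itself are hypothetical, and actually running the statement would need a BigQuery client that supplies the array parameter, which is not shown here.

```python
import re

# Accept dataset.table or project.dataset.table made of plain identifier characters.
_TABLE_RE = re.compile(r"[A-Za-z0-9_\-]+(\.[A-Za-z0-9_\-]+){1,2}")
_COLUMN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_scrub_statement(table: str, id_column: str = "user_id") -> str:
    """Build a DELETE that removes all rows belonging to deleted users.

    Table and column names are interpolated into the SQL text, so they are
    validated first. The user IDs themselves travel separately as the
    @deleted_user_ids array query parameter.
    """
    if not _TABLE_RE.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not _COLUMN_RE.fullmatch(id_column):
        raise ValueError(f"unexpected column name: {id_column!r}")
    return (
        f"DELETE FROM `{table}` "
        f"WHERE {id_column} IN UNNEST(@deleted_user_ids)"
    )
```

Passing the IDs as a query parameter rather than formatting them into the string keeps the statement text fixed for a given table and avoids quoting problems.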

System functionality

Because BigQuery is a managed service, there was no need to involve Twitter's SRE team in systems management or on-call duties. It was easy to provision more capacity for both storage and compute. We could change the slot reservation by creating a ticket with Google support. We identified areas that could be improved, such as self-service slot allocation and better monitoring dashboards, and submitted those requests to Google.

Cost

Our preliminary analysis showed that query costs for BigQuery and Presto were on the same level. We purchased slots at a flat-rate price to have a stable monthly cost instead of paying on demand per TB of data processed. This decision was also based on feedback from users who did not want to think about cost before running each query.
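
The trade-off between flat-rate slots and on-demand pricing comes down to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and any prices passed in are placeholders chosen by the caller, not Google's actual rates or Twitter's actual spend.

```python
def on_demand_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Monthly on-demand cost: data processed times the per-TB price."""
    return tb_scanned * price_per_tb


def break_even_tb(flat_monthly_cost: float, price_per_tb: float) -> float:
    """Monthly TB processed at which flat-rate and on-demand cost the same."""
    if price_per_tb <= 0:
        raise ValueError("price_per_tb must be positive")
    return flat_monthly_cost / price_per_tb


def cheaper_plan(tb_scanned: float, flat_monthly_cost: float, price_per_tb: float) -> str:
    """Name the cheaper plan for a given monthly volume (ties go to on-demand)."""
    if on_demand_cost(tb_scanned, price_per_tb) > flat_monthly_cost:
        return "flat-rate"
    return "on-demand"
```

With a placeholder flat fee of 10,000 per month and a placeholder on-demand price of 5 per TB, the break-even point is 2,000 TB per month. Above that volume the flat rate is cheaper, and it also gives the predictable monthly bill that users asked for.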

Storing data in BigQuery incurred costs in addition to GCS costs. Tools like Scalding require datasets in GCS, and to access the data from BigQuery we had to load the same datasets into BigQuery's Capacitor format. We are working on a Scalding connector to BigQuery datasets that will eliminate the need to store datasets in both GCS and BigQuery.

For rare cases that required infrequent queries over tens of petabytes, we decided that storing the datasets in BigQuery was not cost-effective and used Presto to access the datasets in GCS directly. To do this, we are looking at BigQuery External Data Sources.

Next steps

We have seen a lot of interest in BigQuery since the alpha release. We are adding more datasets and more teams to BigQuery. We are developing connectors for data analytics tools such as Scalding to read from and write to BigQuery storage. We are looking at tools like Looker and Apache Zeppelin for creating enterprise-quality reports and notebooks from BigQuery datasets.

Our collaboration with Google has been very productive, and we are pleased to continue and develop this partnership. We worked with Google to set up our own Partner Issue Tracker to send requests directly to Google. Some of them, such as the BigQuery Parquet loader, have already been implemented by Google.

Here are some of our high-priority feature requests for Google:

  • Tools for convenient data ingestion and support for the LZO-Thrift format.
  • Hourly partitioning.
  • Access control improvements such as table-, row-, and column-level permissions.
  • BigQuery External Data Sources with Hive Metastore integration and support for the LZO-Thrift format.
  • Improved data catalog integration in the BigQuery user interface.
  • Self-service slot allocation and monitoring.

Conclusion

Democratizing data analytics, visualization, and machine learning in a secure way is a top priority for the Data Platform team. We identified Google BigQuery and Data Studio as tools that could help us reach this goal, and we released BigQuery Alpha company-wide last year.

We found queries in BigQuery to be simple and efficient. We used Google tools to ingest and transform data for simple pipelines, but for complex pipelines we had to build our own Airflow framework. In the data governance space, BigQuery's services for authentication, authorization, and auditing meet our needs. To manage metadata and maintain privacy, we needed more flexibility and had to build our own systems. BigQuery, being a managed service, was easy to use. Query costs were similar to those of existing tools. Storing data in BigQuery incurs costs in addition to GCS costs.

Overall, BigQuery works well for general SQL analysis. We are seeing a lot of interest in BigQuery, and we are working to migrate more datasets, bring on more teams, and build more pipelines with BigQuery. Twitter uses a variety of data that will require a combination of tools such as Scalding, Spark, Presto, and Druid. We intend to keep strengthening our data analytics tools and to give our users clear guidance on how best to use our offerings.

Acknowledgements

I would like to thank my co-authors, Anju Jha and Will Pascucci, for their great collaboration and hard work on this project. I would also like to thank the engineers and managers from several teams at Twitter and Google who helped us, and the BigQuery users at Twitter who provided valuable feedback.

If you are interested in working on these problems, check out our job openings on the Data Platform team.

Data Quality in DWH: Data Warehouse Consistency

Source: www.habr.com