How Google's BigQuery democratized data analysis. Part 2

Hello, Habr! Enrollment for a new cohort of the OTUS Data Engineer course is open right now. Ahead of the course start, we continue to share useful material with you.

Read the first part

Data governance

Strong data governance is a core tenet of Twitter Engineering. As we bring BigQuery into our platform, we are focusing on data discovery, access control, security, and privacy.

To discover and manage data, we extended our Data Access Layer (DAL) to provide tooling for both on-premises and Google Cloud data, giving our users a single interface and API. As Google Data Catalog moves toward general availability, we will integrate it into our projects to give users features such as column search.

BigQuery makes it easy to share and access data, but we needed some control over this to prevent data exfiltration. Among other tools, we chose two features:

  • Domain restricted sharing: a beta feature that prevents users from sharing BigQuery datasets with users outside Twitter.
  • VPC service controls: a control that prevents data exfiltration and requires users to access BigQuery from known IP address ranges.
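
Both controls are configured in Google Cloud rather than in application code, but their intent can be illustrated with a small Python sketch. The domain `example.com`, the CIDR ranges, and the function names are hypothetical stand-ins for an organization's real policy, and this is not how GCP implements either feature.

```python
import ipaddress

# Hypothetical policy values, for illustration only.
ALLOWED_DOMAIN = "example.com"
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.0.2.0/24"),
]


def may_share_with(member_email: str) -> bool:
    """Domain restricted sharing: only members of the allowed domain qualify."""
    local, sep, domain = member_email.rpartition("@")
    # Reject strings that are not of the form local@domain.
    if not sep or not local:
        return False
    return domain.lower() == ALLOWED_DOMAIN


def request_allowed(source_ip: str) -> bool:
    """Perimeter check: the caller must come from a known IP range."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in ALLOWED_NETWORKS)
```

With these assumed values, sharing with `analyst@example.com` would be permitted, while a request arriving from an address outside both ranges would be refused.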

We implemented the authentication, authorization, and auditing (AAA) requirements for security as follows:

  • Authentication: we used GCP user accounts for ad hoc queries and service accounts for production queries.
  • Authorization: we required every dataset to have an owner service account and a reader group.
  • Auditing: we exported the BigQuery Stackdriver logs, which contain detailed query execution information, to a BigQuery dataset for easy analysis.
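
The authorization rule above (an owner service account plus a reader group on every dataset) is easy to check mechanically. The following is a minimal sketch under assumptions: the `AccessEntry` type, the role names, and the entity type strings are hypothetical simplifications, not the BigQuery API's own data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessEntry:
    role: str         # assumed values: "OWNER" or "READER"
    entity_type: str  # assumed values: "serviceAccount", "group", "user"
    entity_id: str


def acl_violations(entries: List[AccessEntry]) -> List[str]:
    """Return the reasons a dataset's access list breaks the convention."""
    problems = []
    owners = [e for e in entries if e.role == "OWNER"]
    readers = [e for e in entries if e.role == "READER"]
    if not any(e.entity_type == "serviceAccount" for e in owners):
        problems.append("no owner service account")
    if not any(e.entity_type == "group" for e in readers):
        problems.append("no reader group")
    return problems
```

A dataset that passes returns an empty list, so a scheduled check could simply report every dataset for which the list is non-empty.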

To ensure that the personal data of Twitter users is handled properly, we must register all BigQuery datasets, annotate personal data, maintain proper retention, and delete (scrub) data that users have deleted.

We looked at the Google Cloud Data Loss Prevention API, which uses machine learning to classify and redact sensitive data, but decided in favor of annotating datasets manually for the sake of accuracy. We plan to use the Data Loss Prevention API to augment our own annotations.

At Twitter, we created four privacy categories for datasets in BigQuery, listed here in descending order of sensitivity:

  • Highly sensitive datasets are made available on an as-needed basis, following the principle of least privilege. Each dataset has a separate reader group, and we will track usage by individual accounts.
  • Medium sensitivity datasets (one-way pseudonyms produced with salted hashing) contain no Personally Identifiable Information (PII) and are accessible to a larger group of employees. This is a good balance between privacy concerns and data utility. It lets employees perform analysis tasks, such as counting the number of users who used a feature, without knowing who the actual users are.
  • Low sensitivity datasets have all user-identifying information removed. This is a good approach from a privacy perspective, but it cannot be used for user-level analysis.
  • Public datasets (released outside Twitter) are available to all Twitter employees.
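
The medium sensitivity tier relies on one-way pseudonyms. The sketch below uses HMAC-SHA-256 with a secret salt as one common way to do salted hashing; the source does not say which construction Twitter uses, and the function names are hypothetical. It shows how a per-feature user count can be computed from pseudonyms alone.

```python
import hashlib
import hmac
from typing import Iterable


def pseudonymize(user_id: str, salt: bytes) -> str:
    """One-way pseudonym: keyed SHA-256 digest of the user ID (assumed scheme)."""
    return hmac.new(salt, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def count_distinct_users(user_ids: Iterable[str], salt: bytes) -> int:
    """Count distinct users of a feature using pseudonyms instead of raw IDs."""
    return len({pseudonymize(user_id, salt) for user_id in user_ids})
```

Because the same ID always maps to the same pseudonym under a given salt, joins and distinct counts still work, while recovering the original ID would require the secret salt.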

As for registration, we used scheduled jobs to enumerate BigQuery datasets and register them with the Data Access Layer (DAL), Twitter's metadata store. Users annotate datasets with privacy information and also specify a retention period. As for scrubbing, we are evaluating the performance and cost of two options: (1) scrubbing datasets in GCS with tools such as Scalding and then loading them into BigQuery; (2) using BigQuery DML statements. We will use both methods to meet the requirements of different teams and datasets.
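
The DML option can be sketched with a single `DELETE` statement. The example below runs against an in-memory SQLite database purely as a runnable stand-in; a statement of the same shape is valid in BigQuery standard SQL, where the tables would be qualified with a dataset name. The table and column names (`events`, `deleted_users`, `user_id`) are hypothetical.

```python
import sqlite3

# Hypothetical table and column names; SQLite is only a local stand-in.
SCRUB_SQL = """
DELETE FROM events
WHERE user_id IN (SELECT user_id FROM deleted_users)
"""


def scrub(conn: sqlite3.Connection) -> int:
    """Delete rows of users who removed their data; return the rows removed."""
    cursor = conn.execute(SCRUB_SQL)
    conn.commit()
    return cursor.rowcount


def demo() -> list:
    """Build a tiny dataset, scrub it, and return the surviving rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (user_id TEXT, action TEXT);
        CREATE TABLE deleted_users (user_id TEXT);
        INSERT INTO events VALUES ('u1', 'click'), ('u2', 'view'), ('u1', 'view');
        INSERT INTO deleted_users VALUES ('u1');
        """
    )
    scrub(conn)
    return conn.execute("SELECT user_id, action FROM events").fetchall()
```

The trade-off being evaluated in the text is between rewriting files outside the warehouse and paying for statements like this one inside it; the sketch only illustrates the second path.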

System operations

Because BigQuery is a managed service, there was no need to involve Twitter's SRE team in systems management or on-call duties. It was easy to provision more capacity for both storage and compute. We could change slot reservations by filing a ticket with Google support. We identified areas that could be improved, such as self-service slot allocation and better monitoring dashboards, and submitted those requests to Google.

Cost

Our preliminary analysis showed that query costs for BigQuery and Presto were at the same level. We purchased slots at flat-rate pricing to have a stable monthly cost instead of paying on demand per TB of data processed. This decision was also based on feedback from users who did not want to think about cost before running each query.
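
The choice between flat-rate and on-demand pricing comes down to a break-even volume. The sketch below shows that arithmetic; the function names are hypothetical, and the fee and per-TB price in the usage note are placeholder inputs, not prices taken from the source or from Google's price list.

```python
def on_demand_cost(tb_processed: float, price_per_tb: float) -> float:
    """Monthly on-demand cost for a given volume of data processed."""
    return tb_processed * price_per_tb


def break_even_tb(flat_monthly_fee: float, price_per_tb: float) -> float:
    """Monthly volume (in TB) at which both pricing models cost the same."""
    if price_per_tb <= 0:
        raise ValueError("price_per_tb must be positive")
    return flat_monthly_fee / price_per_tb


def cheaper_plan(tb_processed: float, flat_monthly_fee: float, price_per_tb: float) -> str:
    """Name the cheaper model for the given monthly volume."""
    if on_demand_cost(tb_processed, price_per_tb) > flat_monthly_fee:
        return "flat-rate"
    return "on-demand"
```

For example, with a placeholder flat fee of 10000 and a placeholder price of 5 per TB, the break-even point is 2000 TB per month; above that volume the flat fee is cheaper, and it has the added benefit described above of a predictable bill.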

Storing data in BigQuery incurred costs on top of the GCS costs. Tools such as Scalding require datasets in GCS, and to access them from BigQuery we had to load the same datasets into BigQuery's own format, Capacitor. We are working on a Scalding connector to BigQuery datasets that will remove the need to store datasets in both GCS and BigQuery.

For the rare cases that required infrequent queries over tens of petabytes, we decided that storing the datasets in BigQuery was not cost effective and used Presto to access the datasets directly in GCS. For this purpose we are also looking at BigQuery External Data Sources.

Next steps

We have seen a lot of interest in BigQuery since the alpha release. We are adding more datasets and more teams to BigQuery. We are building connectors for data analysis tools such as Scalding to read from and write to BigQuery storage. We are looking at tools such as Looker and Apache Zeppelin for creating enterprise-quality reports and notebooks on top of BigQuery datasets.

Our collaboration with Google has been very productive, and we are pleased to continue and develop this partnership. We worked with Google to set up our own Partner Issue Tracker to send requests directly to Google. Some of them, such as the BigQuery Parquet loader, have already been implemented by Google.

Here are some of our high-priority feature requests for Google:

  • Tools for convenient data ingestion and support for the LZO-Thrift format.
  • Hourly partitioning.
  • Access control improvements such as table-, row-, and column-level permissions.
  • BigQuery External Data Sources with Hive Metastore integration and support for the LZO-Thrift format.
  • Better data catalog integration in the BigQuery user interface.
  • Self-service for slot allocation and monitoring.

Conclusion

Democratizing data analysis, visualization, and machine learning in a secure way is a top priority for the Data Platform team. We identified Google BigQuery and Data Studio as tools that could help achieve this goal, and we released BigQuery Alpha to the whole company last year.

We found queries in BigQuery to be simple and efficient. We used Google tools to ingest and transform data for simple pipelines, but for complex pipelines we had to build our own Airflow framework. In the data governance space, BigQuery's services for authentication, authorization, and auditing meet our needs. For metadata management and privacy compliance we needed more flexibility and had to build our own systems. BigQuery, being a managed service, was easy to operate. Query costs were similar to those of our existing tools. Storing data in BigQuery incurs costs on top of the GCS costs.

Overall, BigQuery works well for general SQL analysis. We are seeing a lot of interest in BigQuery, and we are working to migrate more datasets, onboard more teams, and build more pipelines with BigQuery. Twitter uses a variety of data that will require a mix of tools such as Scalding, Spark, Presto, and Druid. We intend to keep strengthening our data analysis tools and to give our users clear guidance on how best to use what we offer.

Acknowledgments

I would like to thank my co-authors and teammates, Anju Jha and Will Pascucci, for their great collaboration and hard work on this project. I would also like to thank the engineers and managers from several teams at Twitter and Google who helped us, and the BigQuery users at Twitter who provided valuable feedback.

If you are interested in working on these problems, check out our open positions on the Data Platform team.

Source: www.habr.com
