How Google's BigQuery democratized data analysis. Part 2

Hi Habr! Enrollment is now open for a new cohort of the Data Engineer course at OTUS. Ahead of the start of the course, we continue to share useful material with you.

Read part one


Data governance

Strong data governance is a core tenet of Twitter Engineering. As we implement BigQuery in our platform, we are focusing on data discovery, access control, security, and privacy.

For data discovery and management, we extended our Data Access Layer (DAL) to provide tooling for both on-premises and Google Cloud data, giving our users a single interface and API. As Google's Data Catalog moves toward general availability, we will include it in our projects to offer users features such as column search.

BigQuery makes it easy to share and access data, but we needed some control over this to prevent data exfiltration. Among other tools, we chose two features:

  • Domain restricted sharing: a Beta feature that prevents users from sharing BigQuery datasets with users outside Twitter.
  • VPC service controls: a control that prevents data exfiltration and requires users to access BigQuery from known IP address ranges.
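
The two controls above are enforced by Google Cloud itself, not by application code. Purely to illustrate the rules they express, here is a minimal Python sketch. The domain `example.com`, the network `203.0.113.0/24`, and both function names are hypothetical placeholders, not Twitter's configuration or any Google API.

```python
import ipaddress

# Hypothetical values for illustration only.
ALLOWED_DOMAIN = "example.com"
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_internal_member(email: str) -> bool:
    """Return True if the address belongs to the allowed domain."""
    local, sep, domain = email.rpartition("@")
    # Reject strings without a local part or without an "@" separator.
    return bool(local) and sep == "@" and domain.lower() == ALLOWED_DOMAIN


def is_known_address(ip: str) -> bool:
    """Return True if the client IP falls inside a known range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in ALLOWED_NETWORKS)
```

Under these assumptions, a sharing request would pass only if `is_internal_member` accepts the recipient, and a query only if `is_known_address` accepts the caller's IP.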

We implemented the authentication, authorization, and auditing (AAA) requirements for security as follows:

  • Authentication: We used GCP user accounts for ad hoc queries and service accounts for production queries.
  • Authorization: We required each dataset to have an owner service account and a reader group.
  • Auditing: We exported the BigQuery logs, which contain detailed query execution information, into a BigQuery dataset for easy analysis.
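
The authorization rule above (one owner service account and one reader group per dataset) lends itself to an automated check. The following is a minimal sketch under stated assumptions: the `AccessEntry` type, the role names `OWNER` and `READER`, and the kind names are invented for this example and are not the BigQuery API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessEntry:
    role: str    # "OWNER" or "READER" (names assumed for this sketch)
    kind: str    # "serviceAccount", "group", or "user"
    member: str  # e.g. an email address


def policy_violations(entries: List[AccessEntry]) -> List[str]:
    """List which parts of the owner/reader convention a dataset breaks."""
    problems = []
    owners = [e for e in entries if e.role == "OWNER"]
    readers = [e for e in entries if e.role == "READER"]
    if not any(e.kind == "serviceAccount" for e in owners):
        problems.append("no owner service account")
    if not any(e.kind == "group" for e in readers):
        problems.append("no reader group")
    return problems
```

A dataset whose entries produce an empty list satisfies the convention; anything else could be flagged for its owners.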

To ensure that Twitter users' data is handled properly, we must register all BigQuery datasets, annotate personal data, maintain proper storage, and delete (scrub) data that users have deleted.

We evaluated Google's Cloud Data Loss Prevention API, which uses machine learning to classify and redact sensitive data, but decided in favor of annotating datasets manually for reasons of accuracy. We plan to use the Data Loss Prevention API to augment the custom annotation.

At Twitter, we created four privacy categories for datasets in BigQuery, listed here in descending order of sensitivity:

  • Highly sensitive datasets are made available on an as-needed basis, following the principle of least privilege. Each dataset has a separate reader group, and we track usage by individual accounts.
  • Medium sensitivity datasets (one-way pseudonymized using salted hashing) do not contain Personally Identifiable Information (PII) and are accessible to a larger group of employees. This is a good balance between privacy concerns and data utility. It lets employees perform analysis tasks, such as counting the users who used a feature, without knowing who the actual users are.
  • Low sensitivity datasets have all user-identifying information removed. This is a good approach from a privacy perspective, but such datasets cannot be used for user-level analysis.
  • Public datasets (released outside Twitter) are available to all Twitter employees.
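
To make the medium sensitivity category concrete, here is a minimal sketch of one-way pseudonymization and of the kind of analysis it still allows. The source says only "salted hashing"; this example uses HMAC-SHA256 keyed with a secret salt as one possible realization, and the salt value, event data, and function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored outside the code.
SALT = b"example-secret-salt"


def pseudonymize(user_id: str, salt: bytes = SALT) -> str:
    """Map a user ID to a stable pseudonym that cannot be reversed without the salt."""
    return hmac.new(salt, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def count_feature_users(events, feature: str) -> int:
    """Count distinct pseudonyms that used a feature."""
    return len({pseudonym for pseudonym, used in events if used == feature})


# Made-up raw events: (user ID, feature used).
raw_events = [
    ("u1", "feature_a"),
    ("u2", "feature_a"),
    ("u1", "feature_a"),
    ("u3", "feature_b"),
]
pseudonymized_events = [(pseudonymize(u), f) for u, f in raw_events]
feature_a_users = count_feature_users(pseudonymized_events, "feature_a")
```

Because the same user ID always maps to the same pseudonym, distinct-user counts survive the transformation even though the analyst never sees the original IDs.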

As for registration, we used scheduled jobs to enumerate BigQuery datasets and register them with the Data Access Layer (DAL), Twitter's metadata store. Users annotate datasets with privacy information and also specify a retention period. As for scrubbing, we are evaluating the performance and cost of two options: 1. Scrubbing datasets in GCS using tools like Scalding and loading them into BigQuery; 2. Using BigQuery DML statements. We will likely use a combination of both methods to meet the requirements of different teams and data.
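
As a rough illustration of option 2 and of the retention period, the sketch below builds a parameterized DML `DELETE` and checks whether data has outlived its retention. This is not Twitter's implementation: the function names, the parameter name `@deleted_user_ids`, and the example table are assumptions, and the statement is only constructed here, not executed.

```python
import re
from datetime import date, timedelta

# Conservative patterns for identifiers interpolated into the SQL text.
_TABLE = re.compile(r"^[A-Za-z_][A-Za-z0-9_\-]*(\.[A-Za-z_][A-Za-z0-9_\-]*)*$")
_COLUMN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_scrub_statement(table: str, id_column: str) -> str:
    """Build a DELETE for rows that belong to deleted users.

    The user IDs are not interpolated; they would be bound at run time to
    the array query parameter @deleted_user_ids (a name chosen here).
    """
    if not _TABLE.match(table):
        raise ValueError(f"unsafe table name: {table!r}")
    if not _COLUMN.match(id_column):
        raise ValueError(f"unsafe column name: {id_column!r}")
    return (
        f"DELETE FROM `{table}` "
        f"WHERE {id_column} IN UNNEST(@deleted_user_ids)"
    )


def is_past_retention(created: date, retention_days: int, today: date) -> bool:
    """Return True if data created on `created` has outlived its retention."""
    return created + timedelta(days=retention_days) < today
```

Keeping the IDs in a query parameter, and validating the table and column names before interpolation, avoids assembling SQL from untrusted values.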

System operations

Because BigQuery is a managed service, there was no need to involve Twitter's SRE team in systems management or on-call duties. It was easy to provision more capacity for both storage and compute. We could change slot reservations by filing tickets with Google support. We identified areas for improvement, such as self-service slot allocation and better monitoring dashboards, and submitted these requests to Google.

Cost

Our preliminary analysis showed that query costs for BigQuery and Presto were at the same level. We purchased slots at flat-rate pricing to get a stable monthly cost instead of paying on demand per TB of data processed. This decision was also based on feedback from users who did not want to think about cost before running each query.
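
The trade-off between flat-rate and on-demand pricing comes down to simple arithmetic. The sketch below shows the break-even calculation; the figures in the usage note are made-up placeholders, not Google's prices or Twitter's spend.

```python
def on_demand_cost(tb_processed: float, price_per_tb: float) -> float:
    """Monthly cost when paying per TB of data processed."""
    return tb_processed * price_per_tb


def break_even_tb(flat_monthly_fee: float, price_per_tb: float) -> float:
    """Monthly volume (in TB) above which flat-rate becomes cheaper."""
    if price_per_tb <= 0:
        raise ValueError("price_per_tb must be positive")
    return flat_monthly_fee / price_per_tb


def cheaper_plan(tb_processed: float, flat_monthly_fee: float, price_per_tb: float) -> str:
    """Name the cheaper plan for a given monthly volume."""
    if on_demand_cost(tb_processed, price_per_tb) > flat_monthly_fee:
        return "flat-rate"
    return "on-demand"
```

With a hypothetical flat fee of 10000 per month and a hypothetical on-demand price of 5 per TB, `break_even_tb(10000.0, 5.0)` gives 2000 TB: above that monthly volume flat-rate is cheaper, and in either case the flat fee keeps the bill predictable.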

Storing data in BigQuery incurred costs in addition to GCS costs. Tools like Scalding require datasets in GCS, and to access them from BigQuery we had to load the same datasets into BigQuery's Capacitor format. We are working on a Scalding connector to BigQuery datasets that will eliminate the need to store datasets in both GCS and BigQuery.

For the rare cases that required infrequent queries over tens of petabytes, we decided that storing the datasets in BigQuery was not cost-effective and used Presto to access the datasets directly in GCS. To do this, we are looking at BigQuery External Data Sources.

Next steps

We have seen a lot of interest in BigQuery since the alpha release. We are adding more datasets and onboarding more teams to BigQuery. We are developing connectors so that data analysis tools like Scalding can read from and write to BigQuery storage. We are looking at tools like Looker and Apache Zeppelin for creating enterprise-quality reports and notebooks from BigQuery datasets.

Our collaboration with Google has been very productive, and we are pleased to continue and develop this partnership. We worked with Google to set up our own Partner Issue Tracker for sending requests directly to Google. Some of them, such as the BigQuery Parquet loader, have already been implemented by Google.

Here are some of our high-priority feature requests for Google:

  • Tools for convenient data ingestion and support for the LZO-Thrift format.
  • Hourly partitioning.
  • Access control improvements such as table-, row-, and column-level permissions.
  • BigQuery External Data Sources with Hive Metastore integration and support for the LZO-Thrift format.
  • Improved Data Catalog integration in the BigQuery user interface.
  • Self-service for slot allocation and monitoring.

Conclusion

Democratizing data analysis, visualization, and machine learning in a secure way is a top priority for the Data Platform team. We identified Google BigQuery and Data Studio as tools that could help achieve this goal, and released BigQuery Alpha company-wide last year.

We found querying in BigQuery to be simple and efficient. We used Google tools to ingest and transform data for simple pipelines, but for complex pipelines we had to build our own Airflow framework. In the data governance space, BigQuery's services for authentication, authorization, and auditing met our needs. For metadata management and privacy, we needed more flexibility and had to build our own systems. BigQuery, being a managed service, was easy to operate. Query costs were similar to those of existing tools. Storing data in BigQuery incurs costs in addition to GCS costs.

Overall, BigQuery works well for general-purpose SQL analysis. We are seeing a lot of interest in BigQuery, and we are working to migrate more datasets, onboard more teams, and build more pipelines with BigQuery. Twitter uses a variety of data that will require a combination of tools such as Scalding, Spark, Presto, and Druid. We intend to keep strengthening our data analysis tools and to give our users clear guidance on how best to use our offerings.

Acknowledgments

I would like to thank my co-authors and teammates, Anju Jha and Will Pascucci, for their great collaboration and hard work on this project. I would also like to thank the engineers and managers from several teams at Twitter and Google who helped us, and the BigQuery users at Twitter who provided valuable feedback.

If you are interested in working on these problems, check out our job openings on the Data Platform team.


source: www.habr.com
