How Google's BigQuery democratized data analysis. Part 1

Hi Habr! Enrollment is now open for a new cohort of the Data Engineer course at OTUS. Ahead of the start of the course, we have, as usual, prepared a translation of an interesting article for you.

Every day, more than a hundred million people visit Twitter to find out what is happening in the world and to discuss it. Every tweet and every other user action generates an event that is available for internal data analysis at Twitter. Hundreds of employees analyze and visualize this data, and improving their experience is a top priority for the Twitter Data Platform team.

We believe that users with a wide range of technical skills should be able to find data and have access to well-performing SQL-based analysis and visualization tools. This would allow a whole new group of less technical users, including data analysts and product managers, to extract insights from data, helping them better understand and use Twitter's capabilities. This is how we democratize data analysis at Twitter.

As our tools and capabilities for internal data analysis have improved, we have seen the Twitter service itself improve. However, there is still room for improvement. Current tools such as Scalding require programming experience. SQL-based analysis tools such as Presto and Vertica have performance issues at large scale. We also have the problem of data being spread across multiple systems without consistent access to it.

Last year we announced a new collaboration with Google, under which we are moving parts of our data infrastructure to Google Cloud Platform (GCP). We concluded that Google Cloud's Big Data tools can help us in our effort to democratize analysis, visualization, and machine learning at Twitter:

  • BigQuery: an enterprise data warehouse with a SQL engine based on Dremel, known for its speed, simplicity, and machine learning support.
  • Data Studio: a big data visualization tool with Google Docs-like collaboration features.

In this article, you will learn about our experience with these tools: what we did, what we learned, and what we will do next. Here we focus on batch and interactive analytics. Real-time analytics will be discussed in the next article.

History of Data Warehousing at Twitter

Before diving into BigQuery, it is worth briefly recounting the history of data warehousing at Twitter. In 2011, Twitter data analysis was done in Vertica and Hadoop. To create Hadoop MapReduce jobs, we used Pig. In 2012, we replaced Pig with Scalding, which has a Scala API with benefits such as the ability to build complex pipelines and ease of testing. However, for many data analysts and product managers who were more comfortable working with SQL, it had a steep learning curve. Around 2016, we started using Presto as our SQL front end to Hadoop data. Spark offered a Python interface, which made it a good choice for ad hoc data science and machine learning.

Since 2018, we have used the following tools for data analysis and visualization:

  • Scalding for production pipelines
  • Scalding and Spark for ad hoc data analysis and machine learning
  • Vertica and Presto for ad hoc and interactive SQL analysis
  • Druid for low-latency, interactive, exploratory access to time-series metrics
  • Tableau, Zeppelin, and Pivot for data visualization

We found that while these tools offer very powerful capabilities, we had difficulty making those capabilities available to a broader audience at Twitter. By expanding our platform with Google Cloud, we are focusing on simplifying our analytics tools for all of Twitter.

Google's BigQuery Data Warehouse

Several teams at Twitter had already incorporated BigQuery into some of their production pipelines. Drawing on their experience, we began evaluating BigQuery's capabilities for all Twitter use cases. Our goal was to offer BigQuery to the whole company and to standardize and support it within the Data Platform toolset. This was difficult for many reasons. We needed to develop infrastructure to reliably ingest large volumes of data, support company-wide data management, ensure proper access controls, and protect customer privacy. We also had to create systems for resource allocation, monitoring, and chargebacks so that teams could use BigQuery effectively.

In November 2018, we released an alpha version of BigQuery and Data Studio company-wide. We offered Twitter employees some of our most frequently used tables, with personal data scrubbed. BigQuery has been used by more than 250 users from a variety of teams, including engineering, finance, and marketing. Most recently, they were running about 8k queries and processing about 100 PB per month, not counting scheduled queries. After receiving very positive feedback, we decided to move forward and offer BigQuery as the primary resource for interacting with data at Twitter.

Here is a high-level diagram of the architecture of our Google BigQuery data warehouse.

We copy data from on-premises Hadoop clusters to Google Cloud Storage (GCS) using an internal Cloud Replicator tool. We then use Apache Airflow to build pipelines that use "bq_load" to load data from GCS into BigQuery. We use Presto to query Parquet or Thrift-LZO datasets in GCS. BQ Blaster is an internal Scalding tool for loading HDFS Vertica and Thrift-LZO datasets into BigQuery.
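The GCS-to-BigQuery load step described above can be illustrated with a small sketch. It assumes the load is done with the public `bq load` command-line tool, which may differ from the internal operator the pipelines actually use; the function name, dataset, table, and bucket path are hypothetical.

```python
import shlex


def build_bq_load_command(table, gcs_uri, source_format="PARQUET", replace=False):
    """Return the argv list for a `bq load` call that loads files from GCS.

    Assumption: the public `bq` CLI is used; the article's internal
    tooling is not documented here.
    """
    if not gcs_uri.startswith("gs://"):
        raise ValueError("expected a gs:// URI, got %r" % gcs_uri)
    argv = ["bq", "load", "--source_format=" + source_format]
    if replace:
        # --replace overwrites the destination table instead of appending.
        argv.append("--replace")
    argv += [table, gcs_uri]
    return argv


command = build_bq_load_command(
    "analytics.events_daily",                           # hypothetical dataset.table
    "gs://example-bucket/events/2019-07-01/*.parquet",  # hypothetical GCS path
    replace=True,
)
# shlex.join quotes the wildcard so a shell passes it to bq unexpanded.
print(shlex.join(command))
```

Building the argument list separately from running it keeps the step easy to test; a scheduler task could pass the list to `subprocess.run` once credentials are in place.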

In the following sections, we discuss our approach and experience in terms of ease of use, performance, data management, system health, and cost.

Ease of use

We found that it was easy for users to get started with BigQuery because it requires no software installation and users can access it through an intuitive web interface. However, users needed to become familiar with some GCP features and concepts, including resources such as projects, datasets, and tables. We developed tutorials and guides to help users get started. With that basic understanding in place, users found it easy to navigate datasets, view schema and table data, run simple queries, and visualize results in Data Studio.
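To make the project, dataset, and table hierarchy concrete, here is a minimal sketch that splits a table reference of the form project.dataset.table into its parts. The helper name and the example identifiers are hypothetical, and the sketch deliberately ignores legacy domain-scoped project IDs that contain dots or colons.

```python
from typing import NamedTuple, Optional


class TableRef(NamedTuple):
    project: str
    dataset: str
    table: str


def parse_table_ref(ref: str, default_project: Optional[str] = None) -> TableRef:
    """Split 'project.dataset.table' (or 'dataset.table') into its parts."""
    parts = ref.split(".")
    if any(not part for part in parts):
        raise ValueError("empty component in table reference: %r" % ref)
    if len(parts) == 3:
        return TableRef(*parts)
    if len(parts) == 2 and default_project:
        # A two-part reference is resolved against the caller's project.
        return TableRef(default_project, *parts)
    raise ValueError("cannot resolve table reference: %r" % ref)


# Hypothetical identifiers, for illustration only.
print(parse_table_ref("example-project.analytics.events_daily"))
print(parse_table_ref("analytics.events_daily", default_project="example-project"))
```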

Our goal for data ingestion into BigQuery was to enable seamless loading of HDFS or GCS datasets with a single click. We considered Cloud Composer (managed Airflow) but were unable to use it because of our "Domain Restricted Sharing" security model (more on this in the Data Management section below). We experimented with using the Google Data Transfer Service (DTS) to orchestrate BigQuery load jobs. While DTS was quick to set up, it was not flexible enough for building pipelines with dependencies. For our alpha release, we built our own Apache Airflow framework in GCE, and we are preparing it for production use and to support more data sources such as Vertica.

To transform data in BigQuery, users create simple SQL data pipelines using scheduled queries. For complex multi-stage pipelines with dependencies, we plan to use either our own Airflow framework or Cloud Composer together with Cloud Dataflow.
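The difference between independent scheduled queries and a multi-stage pipeline with dependencies comes down to ordering: a stage may run only after everything it depends on has finished. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later). It is not Airflow code, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "copy_to_gcs": set(),
    "load_raw_events": {"copy_to_gcs"},
    "load_users": {"copy_to_gcs"},
    "build_daily_summary": {"load_raw_events", "load_users"},
}


def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


for step, task in enumerate(execution_order(pipeline), start=1):
    print(step, task)
```

A workflow scheduler adds retries, scheduling, and parallel execution on top of this ordering, which is the part that plain scheduled queries do not cover.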

Performance

BigQuery is designed for general-purpose SQL queries that process large amounts of data. It is not intended for the low-latency, high-throughput queries required by a transactional database, or for the low-latency time-series analytics implemented by Apache Druid. For interactive analytical queries, our users expect response times of less than one minute. We had to design our use of BigQuery to meet these expectations. To provide predictable performance for our users, we used BigQuery's flat-rate pricing option, which lets project owners reserve a minimum number of slots for their queries. A BigQuery slot is a unit of the computational capacity required to execute SQL queries.
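As a rough illustration of why reserving a minimum number of slots makes performance more predictable, the toy model below caps each reservation's usage at its own reserved slots. It is not BigQuery's scheduler (which, for example, can also lend out idle slots), and the reservation names and all numbers are made up.

```python
def guaranteed_slots(reservations, demand):
    """Toy model of per-reservation slot guarantees.

    reservations: name -> reserved slots; demand: name -> slots wanted.
    Each reservation is granted at most what it reserved.
    """
    report = {}
    for name, reserved in reservations.items():
        wanted = demand.get(name, 0)
        granted = min(wanted, reserved)
        report[name] = {
            "granted": granted,
            "unmet": wanted - granted,   # demand beyond the reservation
            "idle": reserved - granted,  # reserved but currently unused
        }
    return report


# Made-up numbers: a production spike alongside modest ad hoc demand.
reservations = {"production": 1500, "adhoc": 500}
demand = {"production": 2400, "adhoc": 300}
for name, row in guaranteed_slots(reservations, demand).items():
    print(name, row)
```

In this made-up scenario the production spike is capped at its own reservation, so the ad hoc reservation still covers all of its demand.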

We analyzed more than 800 queries, each processing roughly a terabyte of data, and found that the average execution time was 1 second. We also learned that performance depends heavily on how our slots are used across different projects and tasks. We had to clearly separate our production and ad hoc slot reservations in order to maintain performance for both production use cases and interactive analysis. This strongly influenced our design for slot reservations and the project hierarchy.
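The kind of query-time analysis described above can be sketched in a few lines: take the recorded execution times, compute the average, and check what share falls within the interactive target of under one minute. The function name is hypothetical and the sample values are made up; they are not the measurements from the article.

```python
from statistics import mean


def summarize_runtimes(runtimes_sec, target_sec=60.0):
    """Summarize query execution times against a response-time target."""
    if not runtimes_sec:
        raise ValueError("need at least one runtime")
    within = sum(1 for r in runtimes_sec if r < target_sec)
    return {
        "count": len(runtimes_sec),
        "mean_sec": mean(runtimes_sec),
        "share_within_target": within / len(runtimes_sec),
    }


# Made-up execution times in seconds, for illustration only.
sample = [4.0, 12.5, 31.0, 58.0, 95.0]
print(summarize_runtimes(sample))
```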

We will cover data management, system health, and cost in the coming days in the second part of the translation. For now, we invite everyone to a free live webinar, where you can learn more about the course and put your questions to our expert, Egor Mateshuk (Senior Data Engineer, MaximaTelecom).


source: www.habr.com
