How Google BigQuery democratized data analysis. Part 1

Hello, Habr! Enrollment is now open for a new cohort of the OTUS Data Engineer course. Ahead of the course start, we have, as usual, prepared a translation of some interesting material for you.

Every day, more than a hundred million people visit Twitter to find out what is happening in the world and to discuss it. Every tweet and every other user action generates an event that is available for Twitter's internal data analysis. Hundreds of employees analyze and visualize this data, and improving their experience is a top priority for the Twitter Data Platform team.

We believe that users with a wide range of technical skills should be able to discover data and have access to well-performing SQL-based analysis and visualization tools. This would allow a whole new group of less technical users, including data analysts and product managers, to extract insights from data, helping them better understand and use Twitter's capabilities. This is how we democratize data analysis at Twitter.

As our tools and internal data analysis capabilities have improved, we have seen Twitter improve as well. However, there is still room for improvement. Current tools such as Scalding require programming experience. SQL-based analysis tools such as Presto and Vertica have performance problems at scale. We also have the problem of data being spread across multiple systems without constant access to it.

Last year we announced a new collaboration with Google, under which we are moving parts of our data infrastructure to Google Cloud Platform (GCP). We concluded that Google Cloud Big Data tools can help us with our initiatives to democratize analysis, visualization, and machine learning at Twitter.

In this article, you will learn about our experience with these tools: what we did, what we learned, and what we will do next. For now we focus on batch and interactive analytics. We will discuss real-time analytics in the next article.

History of data warehouses at Twitter

Before diving into BigQuery, it is worth briefly recounting the history of data warehousing at Twitter. In 2011, Twitter's data analysis was performed in Vertica and Hadoop. We used Pig to create MapReduce Hadoop jobs. In 2012, we replaced Pig with Scalding, which had a Scala API with benefits such as the ability to build complex pipelines and ease of testing. However, for many data analysts and product managers who were more comfortable working with SQL, it was a fairly steep learning curve. Around 2016, we started using Presto as a SQL interface to Hadoop data. Spark offered a Python interface, which made it a good choice for ad hoc data science and machine learning.

Since 2018, we have used the following tools for data analysis and visualization:

  • Scalding for production pipelines
  • Scalding and Spark for ad hoc data analysis and machine learning
  • Vertica and Presto for ad hoc and interactive SQL analysis
  • Druid for low-latency, interactive, exploratory access to time-series metrics
  • Tableau, Zeppelin, and Pivot for data visualization

We found that while these tools offer very powerful capabilities, we had difficulty making those capabilities available to a wider audience at Twitter. By extending our platform with Google Cloud, we are focusing on simplifying our analytics tools for all of Twitter.

Google's BigQuery data warehouse

Several teams at Twitter had already incorporated BigQuery into some of their production pipelines. Drawing on their experience, we began to evaluate BigQuery's capabilities for all Twitter use cases. Our goal was to offer BigQuery to the entire company and to standardize and support it within the Data Platform toolset. This was difficult for many reasons. We needed to develop infrastructure to reliably ingest large volumes of data, support company-wide data management, ensure proper access controls, and ensure customer privacy. We also had to build systems for resource allocation, monitoring, and chargebacks so that teams could use BigQuery effectively.

In November 2018, we released a company-wide alpha of BigQuery and Data Studio. We offered Twitter employees some of our most frequently used tables with personal data cleaned out. BigQuery has been used by more than 250 users from a variety of teams, including engineering, finance, and marketing. Most recently, they were running about 8k queries and processing about 100 PB per month, not counting scheduled queries. After receiving very positive feedback, we decided to move forward and offer BigQuery as the primary resource for interacting with data at Twitter.

Here is a high-level diagram of our Google BigQuery data warehouse architecture.

We copy data from on-premises Hadoop clusters to Google Cloud Storage (GCS) using the internal Cloud Replicator tool. We then use Apache Airflow to create pipelines that use "bq_load" to load data from GCS into BigQuery. We use Presto to query Parquet or Thrift-LZO datasets in GCS. BQ Blaster is an internal Scalding tool for loading HDFS Vertica and Thrift-LZO datasets into BigQuery.
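
The article does not say what "bq_load" looks like internally. As a rough illustration only, the sketch below assumes it does something equivalent to the `bq load` command of the BigQuery command-line tool, and shows how a pipeline step might assemble that command for Parquet files in GCS. The function name, dataset, table, and bucket are hypothetical and are not taken from Twitter's pipelines.

```python
from typing import List


def build_bq_load_command(
    destination_table: str,
    source_uri: str,
    source_format: str = "PARQUET",
    replace: bool = False,
) -> List[str]:
    """Build the argument list for a `bq load` invocation (illustrative helper)."""
    command = ["bq", "load", f"--source_format={source_format}"]
    if replace:
        # --replace overwrites the destination table instead of appending to it.
        command.append("--replace")
    # Positional arguments: destination table first, then the GCS source URI.
    command += [destination_table, source_uri]
    return command


if __name__ == "__main__":
    # Hypothetical dataset, table, and bucket names.
    print(" ".join(build_bq_load_command(
        "analytics.daily_events",
        "gs://example-bucket/daily_events/*.parquet",
    )))
```

A scheduler task could pass the returned list to `subprocess.run`. Building the command as a list rather than a single string avoids shell quoting problems with the wildcard in the URI.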

In the following sections, we discuss our approach and experience in the areas of ease of use, performance, data management, system health, and cost.

Ease of use

We found that it was easy for users to get started with BigQuery because it does not require installing software and users can access it through an intuitive web interface. However, users needed to become familiar with some GCP features and concepts, including resources such as projects, datasets, and tables. We developed educational materials and tutorials to help users get started. With a basic understanding in place, users found it easy to navigate datasets, view schema and table data, run simple queries, and visualize results in Data Studio.
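
To make the project, dataset, and table hierarchy concrete, here is a small sketch that splits a fully qualified table name of the form `project.dataset.table` into its three parts. The helper and the example names are hypothetical, and the sketch assumes that none of the components contains a dot itself.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """The three levels of the hierarchy: project, dataset, table."""
    project: str
    dataset: str
    table: str


def parse_table_id(table_id: str) -> TableRef:
    """Split 'project.dataset.table' into its components."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'project.dataset.table', got {table_id!r}")
    return TableRef(*parts)


if __name__ == "__main__":
    # Hypothetical names, used only to show the shape of an identifier.
    ref = parse_table_id("example-project.analytics.daily_events")
    print(ref.project, ref.dataset, ref.table)
```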

Our goal for data ingestion into BigQuery was to enable seamless loading of HDFS or GCS datasets with one click. We considered Cloud Composer (managed Airflow) but were unable to use it because of our Domain Restricted Sharing model (more on this in the Data Management section below). We experimented with using Google Data Transfer Service (DTS) to orchestrate BigQuery load jobs. While DTS was quick to set up, it was not flexible enough for building pipelines with dependencies. For our alpha release, we built our own Apache Airflow framework in GCE, and we are preparing it to run in production and to support more data sources, such as Vertica.

To transform data in BigQuery, users create simple SQL data pipelines using scheduled queries. For complex multi-stage pipelines with dependencies, we plan to use either our own Airflow framework or Cloud Composer together with Cloud Dataflow.
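
What "pipelines with dependencies" means in practice is that each step may only run after the steps it depends on have finished. The sketch below is not Airflow code; it uses Python's standard `graphlib` module (Python 3.9+) to derive a valid run order from a dependency map, which is the core idea an orchestrator builds on. The task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the tasks in an order where every task follows its prerequisites.

    `dependencies` maps each task to the set of tasks it depends on.
    graphlib raises CycleError if the dependencies form a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical three-step pipeline: copy, then load, then transform.
    pipeline = {
        "load_to_bigquery": {"copy_to_gcs"},
        "build_report": {"load_to_bigquery"},
    }
    print(execution_order(pipeline))
```

A scheduled query on its own has no notion of "run after that other job succeeded", which is why multi-stage pipelines call for an orchestrator.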

Performance

BigQuery is designed for general-purpose SQL queries that process large amounts of data. It is not intended for the low-latency, high-throughput queries required by a transactional database, or for the low-latency time-series analysis implemented by Apache Druid. For interactive analytical queries, our users expect response times of less than one minute. We had to design our use of BigQuery to meet these expectations. To provide predictable performance for our users, we used a BigQuery feature, available to customers on flat-rate pricing, that allows project owners to reserve a minimum number of slots for their queries. A BigQuery slot is a unit of computational power required to execute SQL queries.

We analyzed more than 800 queries, each processing approximately 1 TB of data, and found that the average execution time was 30 seconds. We also learned that performance depends heavily on how our slots are used across different projects and tasks. We had to clearly separate our production and ad hoc slot reservations to maintain performance for production use cases and interactive analysis. This strongly influenced our design for slot reservations and project hierarchy.
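
The idea of dividing a fixed pool of slots between production and ad hoc work can be sketched as a simple bookkeeping check. The code below does not call any BigQuery API, and the slot counts are made-up numbers rather than Twitter's actual configuration. It only verifies that the minimum reservations fit within the purchased total and reports how many slots remain unreserved.

```python
from typing import Dict


def unreserved_slots(total_slots: int, reservations: Dict[str, int]) -> int:
    """Return the number of slots left after all minimum reservations.

    Raises ValueError if a reservation is negative or if the reservations
    together exceed the purchased total.
    """
    if any(slots < 0 for slots in reservations.values()):
        raise ValueError("reservations must be non-negative")
    reserved = sum(reservations.values())
    if reserved > total_slots:
        raise ValueError(
            f"reserved {reserved} slots but only {total_slots} are available"
        )
    return total_slots - reserved


if __name__ == "__main__":
    # Hypothetical split between production and ad hoc workloads.
    plan = {"production": 1200, "adhoc": 500}
    print(unreserved_slots(2000, plan))
```

Keeping the two reservations separate means a burst of exploratory queries cannot eat into the slots that production jobs rely on.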

We will cover data management, system health, and cost in the second part of the translation, which will follow in the coming days. In the meantime, we invite everyone to a free live webinar, where you can learn more about the course and ask questions of our expert, Egor Mateshuk (Senior Data Engineer, MaximaTelecom).

Source: www.habr.com