Do we need a data lake? And what do we do with the data warehouse?

This article is a translation of my Medium article, Getting Started with Data Lake, which turned out to be quite popular, probably because of its simplicity. So I decided to write it in Russian and expand it a little, to explain clearly to a lay reader who is not a data specialist what a data warehouse (DW) is, what a data lake is, and how the two get along together.

Why did I want to write about the data lake? I have been working with data and analytics for more than 10 years, and I now work with big data at Amazon Alexa AI in Cambridge (the one in the Boston area), although I live in Victoria on Vancouver Island. I often visit Boston, Seattle, and Vancouver, and sometimes even Moscow, where I speak at conferences. I also write from time to time, but mostly in English, and I have already written several books. I also feel the need to share what is happening in North America, and I sometimes write on Telegram.

I have always worked with data warehouses. Since 2015 I have worked closely with Amazon Web Services, and I have largely switched to cloud analytics (AWS, Azure, GCP). I have watched analytics solutions evolve since 2007, and I even worked for the data warehouse vendor Teradata and implemented it at Sberbank. That was when Big Data and Hadoop appeared. Everyone started saying that the era of the warehouse was over and that everything would now run on Hadoop. Then they started talking about the Data Lake, again claiming that the end of the data warehouse had now truly come. But fortunately (or perhaps unfortunately for those who made a lot of money setting up Hadoop), the data warehouse did not go away.

In this article we will look at what a data lake is. It is intended for people who have little or no experience with data warehouses.

The picture shows Lake Bled. It is one of my favorite lakes: although I was there only once, I have remembered it for the rest of my life. But we are going to talk about a different kind of lake, the data lake. Many of you have probably heard this term more than once, but one more definition will not hurt anyone.

First, here are the most popular definitions of a Data Lake:

"a file store for all types of raw data that is available for analysis by anyone in the organization" - Martin Fowler.

"If you think of a data mart as a bottle of water, purified, packaged, and ready for easy consumption, then a data lake is a huge body of water in its natural state. Users can draw water for themselves, dive deep, explore." - James Dixon.

Now we know for sure that a data lake is about analytics: it lets us store large volumes of data in their original form, and it gives us the convenient access to the data that we need.

I often like to simplify things: if I can explain a complicated term in simple words, then I understand for myself how it works and what it is for. One day I was scrolling through the photo gallery on my iPhone and it dawned on me that this is a real data lake. I even made a slide about it for conferences.

It is all very simple. We take a photo on the phone; the photo is saved on the phone and can be backed up to iCloud (cloud file storage). The phone also collects metadata about the photo: what it shows, the geotag, the time. As a result, we can use the iPhone's simple interface to find a photo, and we even see indicators. For example, when I search for photos with the word "fire", I get 3 photos that show a fire. To me this is just like a Business Intelligence tool that works quickly and clearly.

And of course, we must not forget about security (authorization and authentication), otherwise our data can easily end up in public view. There are plenty of news stories about large corporations and startups whose data became publicly accessible because of developer negligence and failure to follow simple rules.

Even such a simple picture helps us imagine what a data lake is, how it differs from a traditional data warehouse, and what its main elements are:

  1. Data loading (Ingestion) is a key component of a data lake. Data can enter the storage in two ways: batch (loading at intervals) and streaming (a continuous flow of data).
  2. File storage (Storage) is the main component of a Data Lake. We need the storage to be easy to scale, extremely reliable, and inexpensive. In AWS, for example, this is S3.
  3. Catalog and Search - to avoid a Data Swamp (when we dump all the data into one pile and it then becomes impossible to work with), we need a metadata layer that classifies the data, so that users can easily find the data they need for analysis. In addition, you can use extra search solutions such as ElasticSearch. Search helps the user find the required data through a convenient interface.
  4. Processing - this step is responsible for processing and transforming the data. We can transform the data, change its structure, clean it, and much more.
  5. Security - it is important to spend time on the security design of the solution: for example, encrypting data during storage, processing, and loading. It is important to use authentication and authorization methods. Finally, an audit tool is needed.
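
To make the catalog and search element more concrete, here is a minimal sketch of a metadata layer in Python. It is only an illustration: the `DatasetEntry` fields, the `Catalog` class, the dataset names, and the `s3://example-lake/...` prefixes are all hypothetical, and a real data lake would rely on a dedicated catalog service rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one dataset in the lake (hypothetical schema)."""
    name: str
    location: str      # where the files live, e.g. an object-store prefix
    data_format: str   # e.g. "text", "json", "parquet"
    tags: set = field(default_factory=set)


class Catalog:
    """Toy in-memory metadata layer, for illustration only."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Add or overwrite the metadata for a dataset."""
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return sorted names of datasets whose name or tags contain the keyword."""
        kw = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if kw in e.name.lower() or any(kw in t.lower() for t in e.tags)
        )


catalog = Catalog()
catalog.register(DatasetEntry("web_logs_raw", "s3://example-lake/raw/web_logs/", "text", {"logs", "web"}))
catalog.register(DatasetEntry("web_logs_clean", "s3://example-lake/clean/web_logs/", "parquet", {"logs", "seo"}))
catalog.register(DatasetEntry("orders", "s3://example-lake/raw/orders/", "json", {"sales"}))

print(catalog.search("logs"))
```

Without such a layer, the only way to find the log datasets would be to browse the storage by hand, which is exactly how a lake turns into a swamp.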

From a practical point of view, we can characterize a data lake by three attributes:

  1. Collect and store everything - the data lake contains all the data, both raw, unprocessed data for any period of time and processed/cleaned data.
  2. Deep analysis - the data lake lets users explore and analyze the data.
  3. Flexible access - the data lake provides flexible access for different data and different scenarios.

Now we can talk about the difference between a data warehouse and a data lake. People usually ask:

  • What about the data warehouse?
  • Are we replacing the data warehouse with a data lake, or extending it?
  • Is it still possible to do without a data lake?

In short, there is no clear answer. It all depends on the specific situation, the skills of the team, and the budget. One example is Woot, an Amazon company, which migrated its data warehouse from Oracle to AWS and built a data lake: Our data lake story: How Woot.com built a serverless data lake on AWS.

On the other hand, the vendor Snowflake says that you no longer need to think about a data lake, since their data platform (until 2020 it was a data warehouse) lets you combine a data lake and a data warehouse. I have not worked much with Snowflake, and it is truly a unique product that can do this. What it costs is another matter.

In conclusion, my personal opinion is that we still need a data warehouse as the main source of data for our reporting, and whatever does not fit there we store in a data lake. The whole role of analytics is to give the business easy access to data for making decisions. Whatever anyone says, business users work more effectively with a data warehouse than with a data lake. At Amazon, for example, there is Redshift (an analytical data warehouse) and there is Redshift Spectrum/Athena (a SQL interface to a data lake in S3, based on Hive/Presto). The same applies to other modern analytical data warehouses.

Let's look at a typical data warehouse architecture:

This is the classic solution. We have source systems; using ETL/ELT we copy the data into an analytical data warehouse and connect it to a Business Intelligence solution (my favorite is Tableau, what about yours?).

This solution has the following drawbacks:

  • ETL/ELT operations take time and resources.
  • As a rule, storage in an analytical data warehouse is not cheap (for example, Redshift, BigQuery, Teradata), since we have to buy a whole cluster.
  • Business users have access to cleaned and often aggregated data, and have no access to the raw data.

Of course, it all depends on your case. If you have no problems with your data warehouse, then you do not need a data lake at all. But when problems arise with a lack of space or capacity, or when price plays a key role, you can consider the data lake option. This is why the data lake is so popular. Here is an example of a data lake architecture:

With the data lake approach, we load raw data into our data lake (batch or streaming) and then process it as needed. The data lake lets business users create their own data transformations (ETL/ELT) or analyze the data in Business Intelligence solutions (if the necessary driver is available).

The goal of any analytics solution is to serve business users. Therefore, we must always work from the business requirements. (At Amazon this is one of the principles: working backwards.)

Having worked with both a data warehouse and a data lake, we can compare the two solutions.

The main conclusion to draw is that the data warehouse does not compete with the data lake but rather complements it. But it is up to you to decide what is right for your case. It is always interesting to try things yourself and draw the right conclusions.

I would also like to tell you about one of the cases where I started using the data lake approach. It is all fairly trivial: I tried to use an ELT tool (we had Matillion ETL) and Amazon Redshift. My solution worked, but it did not meet the requirements.

I needed to take web logs, transform them, and aggregate them to provide data for 2 cases:

  1. The marketing team wanted to analyze bot activity for SEO
  2. IT wanted to look at website performance metrics

Very simple, very plain logs. Here is an example:

https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 
192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 
"GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 
arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067
"Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" "arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012"
1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-"

A single file weighed 1-4 megabytes.
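
To show what "transforming" such a log involves, here is a small Python sketch that splits the sample entry above into named fields. The field names in `FIELDS` are my own labels, chosen to follow the Application Load Balancer access log layout as I understand it, so treat them as assumptions. The sample is wrapped across several lines in this article, while the sketch assumes each entry arrives as a single line.

```python
import shlex

# Assumed field names, in order, for one load balancer log entry.
FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time", "response_processing_time",
    "elb_status_code", "target_status_code", "received_bytes", "sent_bytes",
    "request", "user_agent", "ssl_cipher", "ssl_protocol", "target_group_arn",
    "trace_id", "domain_name", "chosen_cert_arn", "matched_rule_priority",
    "request_creation_time", "actions_executed", "redirect_url", "error_reason",
]


def parse_log_entry(entry):
    """Split one space-separated entry, keeping double-quoted values intact."""
    tokens = shlex.split(entry)
    return dict(zip(FIELDS, tokens))


# The sample entry from the article, joined into one line.
sample = (
    'https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 '
    '192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 '
    '"GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" '
    'ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 '
    'arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 '
    '"Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" '
    '"arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012" '
    '1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-"'
)

record = parse_log_entry(sample)
print(record["user_agent"], record["elb_status_code"], record["target_processing_time"])
```

The `user_agent` field is what a bot analysis for SEO would look at, and the timing and status fields are what a performance report would aggregate.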

But there was one difficulty. We had 7 domains around the world, and 7,000 files were created in a single day. That is not a large volume, only 50 gigabytes. But our Redshift cluster was also small (4 nodes). Loading one file the traditional way took about a minute. In other words, the problem could not be solved head-on. And this was the case where I decided to use the data lake approach. The solution looked roughly like this:

It is fairly simple (I want to note that the advantage of working in the cloud is simplicity). I used:

  • AWS Elastic MapReduce (Hadoop) for compute power
  • AWS S3 as file storage, with the ability to encrypt data and restrict access
  • Spark as in-memory compute power, and PySpark for the logic and data transformation
  • Parquet as the output of Spark
  • AWS Glue Crawler as a metadata collector for new data and partitions
  • Redshift Spectrum as a SQL interface to the data lake for existing Redshift users
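
The Spark and Parquet part of this stack could look roughly like the sketch below. This is not the job I actually ran: the two `s3://example-lake/...` paths, the regular expressions, and the `convert_logs_to_parquet` function are hypothetical, and the sketch assumes one log entry per line. PySpark is imported inside the function, so the regular expressions can be checked in plain Python without a cluster.

```python
import re

# Hypothetical locations; replace with real bucket prefixes.
RAW_PATH = "s3://example-lake/raw/web_logs/"
CLEAN_PATH = "s3://example-lake/clean/web_logs/"

# Date part of the timestamp, which is the second field of an entry.
LOG_DATE = r'^\S+ (\d{4}-\d{2}-\d{2})T'
# Second double-quoted value in an entry, which is the user agent.
USER_AGENT = r'^[^"]*"[^"]*" "([^"]*)"'


def convert_logs_to_parquet(spark, raw_path=RAW_PATH, clean_path=CLEAN_PATH):
    """Read raw log lines, extract two columns, write date-partitioned Parquet."""
    from pyspark.sql import functions as F  # requires pyspark to be installed

    lines = spark.read.text(raw_path)  # a DataFrame with one column, "value"
    parsed = lines.select(
        F.regexp_extract("value", LOG_DATE, 1).alias("dt"),
        F.regexp_extract("value", USER_AGENT, 1).alias("user_agent"),
        F.col("value").alias("raw_line"),
    )
    # partitionBy creates one dt=YYYY-MM-DD directory per day under clean_path.
    parsed.write.mode("append").partitionBy("dt").parquet(clean_path)


# Quick local check of the two patterns on a shortened, made-up entry.
line = (
    'https 2018-07-02T22:23:00.186641Z app/x 1.2.3.4:1 5.6.7.8:80 0 0 0 200 200 0 57 '
    '"GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" -'
)
log_date = re.match(LOG_DATE, line).group(1)
user_agent = re.match(USER_AGENT, line).group(1)
print(log_date, user_agent)
```

On a cluster, the function would be called with an existing `SparkSession`. The `dt=...` directory layout is the kind of partitioning that a metadata crawler can pick up, so that SQL users can filter by date without scanning every file.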

The smallest EMR+Spark cluster processed the whole pile of files in 30 minutes. There are other cases at AWS, especially many related to Alexa, where there is a lot of data.

Just recently I learned that one of the drawbacks of a data lake is GDPR. The problem is that when a customer asks for their data to be deleted and that data sits inside one of the files, we cannot use Data Manipulation Language and a DELETE operation as we would in a database.
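
One generic workaround, shown here only as a sketch and not as the approach used in any particular system, is to rewrite the affected file without the customer's records and swap it in place of the original. The example uses a local JSON-lines file as a stand-in for an object in the lake; the `erase_customer` function and the `customer_id` key are hypothetical.

```python
import json
import os
import tempfile


def erase_customer(path, customer_id, key="customer_id"):
    """Rewrite a JSON-lines file without one customer's records; return how many were removed."""
    removed = 0
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False, encoding="utf-8"
    ) as tmp, open(path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            if json.loads(line).get(key) == customer_id:
                removed += 1
            else:
                tmp.write(line if line.endswith("\n") else line + "\n")
    os.replace(tmp.name, path)  # swap the rewritten file in place of the original
    return removed


# Small demonstration on a temporary file.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "events.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write('{"customer_id": 1, "page": "/"}\n')
    f.write('{"customer_id": 2, "page": "/cart"}\n')
    f.write('{"customer_id": 1, "page": "/pay"}\n')

removed_count = erase_customer(demo_path, 1)
with open(demo_path, encoding="utf-8") as f:
    remaining = [json.loads(row) for row in f]
print(removed_count, remaining)
```

The cost is the point: a single deletion request means reading and rewriting every file that may contain the customer, which is much heavier than one DELETE statement in a database.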

I hope this article has clarified the difference between a data warehouse and a data lake. If you found it interesting, I can translate more of my articles or articles by professionals I read, and also tell you about the solutions I work with and their architecture.

Source: www.habr.com
