Do we need a data lake? What should we do with the data warehouse?

This article is a translation of my Medium article, Getting Started with Data Lake, which turned out to be quite popular, probably because of its simplicity. So I decided to write it in Russian and add a little to explain to a lay reader who is not a data specialist what a data warehouse (DW) is, what a data lake is, and how the two get along together.

Why did I want to write about the data lake? I have worked with data and analytics for over 10 years, and now I definitely work with big data at Amazon Alexa AI in Cambridge, in the Boston area, although I live in Victoria on Vancouver Island and often visit Boston, Seattle, and Vancouver, and sometimes even Moscow, where I speak at conferences. I also write from time to time, but mostly in English, and I have already written several books. I also feel the need to share analytics trends from North America, and I sometimes write on Telegram.

I have always worked with data warehouses, and since 2015 I have worked closely with Amazon Web Services and switched to cloud analytics in general (AWS, Azure, GCP). I have watched analytics solutions evolve since 2007, and I even worked for the data warehouse vendor Teradata and implemented it at Sberbank, which is when Big Data with Hadoop appeared. Everyone started saying that the era of the warehouse was over and that everything would now run on Hadoop. Then they started talking about the Data Lake and said, again, that the end of the data warehouse had come. But fortunately (or perhaps unfortunately for some who made a lot of money setting up Hadoop), the data warehouse has not gone away.

In this article we will look at what a data lake is. The article is intended for people who have little or no experience with data warehouses.

[Image: Lake Bled]

The picture shows Lake Bled. It is one of my favorite lakes; although I have been there only once, I will remember it for the rest of my life. But we are going to talk about a different kind of lake: the data lake. Many of you have probably heard this term more than once, but one more definition will not hurt anyone.

First, here are the most popular definitions of a data lake:

"A file storage of all kinds of raw data that is available for analysis by anyone in the organization" - Martin Fowler.

"If you think of a data mart as a bottle of water, purified, packaged, and ready for convenient consumption, then a data lake is a huge body of water in its natural state. Users can collect water for themselves, dive deep, and explore." - James Dixon.

Now we know for sure that a data lake is about analytics. It lets us store large volumes of data in their original form while giving us the convenient access to that data that we need.

I often like to simplify things. If I can explain a complex term in simple words, then I understand for myself how it works and what it is for. One day I was scrolling through the photo gallery on my iPhone and it dawned on me: this is a real data lake. I even made a slide for conferences:

[Image: conference slide presenting the iPhone photo gallery as a data lake]

It is all very simple. We take a photo on the phone; the photo is saved on the phone and can also be saved to iCloud (cloud file storage). The phone also collects metadata about the photo: what it shows, the geotag, and the time. As a result, we can use the iPhone's user-friendly interface to find our photo and even see summary figures. For example, when I search for photos with the word "fire", I find 3 photos that show a fire. To me this is like a Business Intelligence tool that works very quickly and accurately.

And of course, we must not forget about security (authorization and authentication), otherwise our data can easily end up in the public domain. There is plenty of news about large corporations and startups whose data became publicly available because developers were negligent and did not follow simple rules.

Even such a simple picture helps us imagine what a data lake is, how it differs from a traditional data warehouse, and what its main elements are:

  1. Data loading (Ingestion) is a key component of the data lake. Data can enter the lake in two ways: batch (loading at intervals) and streaming (a continuous data flow).
  2. File storage (Storage) is the main component of the data lake. We need the storage to be easily scalable, highly reliable, and low cost. In AWS, for example, this is S3.
  3. Catalog and Search. To avoid a Data Swamp (which is what we get when we dump all the data in one pile and then cannot work with it), we need to create a metadata layer that classifies the data, so that users can easily find the data they need for analysis. In addition, you can use search solutions such as ElasticSearch. Search helps the user find the required data through a user-friendly interface.
  4. Processing. This step is responsible for processing and transforming data. We can transform data, change its structure, clean it, and much more.
  5. Security. It is important to spend time on the security design of the solution: for example, encrypting data at rest, during processing, and in transit. It is important to use authentication and authorization methods. Finally, an audit tool is needed.
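
To make the catalog and search component more concrete, here is a minimal sketch in Python of a metadata layer that registers files with tags and then looks them up, much like searching the photo gallery for "fire". The `Catalog` and `CatalogEntry` classes, the file paths, and the tags are all hypothetical and exist only for illustration; a real lake would use a managed catalog service instead.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one raw file in the lake (all fields are illustrative)."""
    path: str                                  # where the raw file lives in storage
    source: str                                # system that produced the file
    tags: set = field(default_factory=set)     # searchable labels


class Catalog:
    """Toy in-memory metadata layer: register files, then search by tag."""

    def __init__(self):
        self._entries = []

    def register(self, path, source, tags):
        self._entries.append(CatalogEntry(path, source, set(tags)))

    def search(self, tag):
        # Return the paths of every file carrying the requested tag.
        return [entry.path for entry in self._entries if tag in entry.tags]


catalog = Catalog()
catalog.register("photos/2019/img_001.jpg", "iphone", {"fire", "camping"})
catalog.register("photos/2019/img_002.jpg", "iphone", {"lake"})
catalog.register("logs/2018/07/02/elb.log", "load-balancer", {"web", "seo"})

fire_photos = catalog.search("fire")
print(fire_photos)
```

The point of the sketch is that the files themselves are never opened during the search: only the metadata layer is consulted, which is what keeps a lake from turning into a swamp.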

From a practical point of view, we can characterize a data lake by three attributes:

  1. Collect and store everything. The data lake contains all the data: raw, unprocessed data for any period of time as well as processed and cleaned data.
  2. Deep dive. The data lake lets users explore and analyze data.
  3. Flexible access. The data lake provides flexible access for different data and different scenarios.

Now we can talk about the difference between a data warehouse and a data lake. People usually ask:

  • What about the data warehouse?
  • Are we replacing the data warehouse with a data lake, or extending it?
  • Is it still possible to do without a data lake?

In short, there is no clear answer. It all depends on the specific situation, the skills of the team, and the budget. One example is the migration of an Oracle data warehouse to AWS and the creation of a data lake by an Amazon subsidiary, Woot: Our data lake story: How Woot.com built a serverless data lake on AWS.

On the other hand, the vendor Snowflake says that you no longer need to think about a data lake, since their data platform (until 2020 it was a data warehouse) lets you combine a data lake and a data warehouse. I have not worked much with Snowflake, but it really is a unique product that can do this. The price is another matter.

To conclude, my personal opinion is that we still need a data warehouse as the main source of data for our reporting, and whatever does not fit there we store in a data lake. The whole job of analytics is to give the business easy access to data for decision making. Whatever one may say, business users work more effectively with a data warehouse than with a data lake. At Amazon, for example, there is Redshift (an analytical data warehouse) and there is Redshift Spectrum / Athena (a SQL interface to a data lake in S3, based on Hive/Presto). The same applies to other modern analytical data warehouses.

Let's look at a typical data warehouse architecture:

[Image: typical data warehouse architecture]

This is the classic solution. We have source systems; using ETL/ELT we copy the data into an analytical data warehouse and connect it to a Business Intelligence solution (my favorite is Tableau, what about yours?).

This solution has the following disadvantages:

  • ETL/ELT operations require time and resources.
  • As a rule, storage in an analytical data warehouse is not cheap (for example, Redshift, BigQuery, Teradata), since we need to buy a whole cluster.
  • Business users have access to cleaned and often aggregated data, and do not have access to the raw data.

Of course, it all depends on your case. If you have no problems with your data warehouse, then you do not need a data lake at all. But when you run short of space or compute power, or when price plays a key role, you can consider the data lake option. This is why the data lake is so popular. Here is an example of a data lake architecture:

[Image: data lake architecture]

With the data lake approach, we load raw data into our data lake (batch or streaming) and then process the data as needed. The data lake lets business users create their own data transformations (ETL/ELT) or analyze data in Business Intelligence solutions (if the required driver is available).

The goal of any analytics solution is to serve business users. So we must always start from the business requirements. (At Amazon this is one of the principles: working backwards.)

Working with both a data warehouse and a data lake, we can compare the two solutions:

[Image: comparison of the data warehouse and the data lake]

The main conclusion is that the data warehouse does not compete with the data lake but complements it. It is up to you to decide what fits your case. It is always interesting to try it yourself and draw the right conclusions.

I would also like to tell you about one of the cases in which I started using the data lake approach. It was all fairly trivial: I tried to use an ELT tool (we had Matillion ETL) and Amazon Redshift. My solution worked, but it did not meet the requirements.

I needed to take web logs, transform them, and aggregate them to provide data for 2 cases:

  1. The marketing team wanted to analyze bot activity for SEO.
  2. IT wanted to look at website performance metrics.

The logs were very, very simple. Here is an example:

https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 
192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 
"GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 
arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067
"Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" "arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012"
1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-"

A single file weighed 1-4 megabytes.
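
The log entry above follows the AWS Application Load Balancer access log layout: space-separated fields, some of them wrapped in double quotes. As a rough sketch of the transform step (not the code used in the project), the standard-library `shlex` module can split such a line while keeping quoted fields intact. The `parse_line` helper is hypothetical, and the names `client` and `target` are shortened from the documented `client:port` and `target:port`. Real user agents may contain characters that need more careful handling than this.

```python
import shlex

# Field names for the 25 fields in the sample entry, in order.
FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol", "target_group_arn", "trace_id",
    "domain_name", "chosen_cert_arn", "matched_rule_priority",
    "request_creation_time", "actions_executed", "redirect_url",
    "error_reason",
]


def parse_line(line):
    """Split one log entry on whitespace, honoring double-quoted fields."""
    values = shlex.split(line)
    return dict(zip(FIELDS, values))


sample = (
    'https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 '
    '192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 '
    '"GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" '
    'ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 '
    'arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 '
    '"Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" '
    '"arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012" '
    '1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-"'
)

record = parse_line(sample)
# The request field itself holds three parts: method, URL, and protocol.
method, url, protocol = record["request"].split(" ")
print(record["elb_status_code"], record["user_agent"], url)
```

The user agent is the field the marketing team would care about for bot analysis, while the three processing-time fields and the status codes cover the performance metrics IT asked for.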

But there was one difficulty. We had 7 regions around the world, and about 7,000 files were created in a single day. That is not a large volume, only 50 gigabytes. But our Redshift cluster was also small (4 nodes). Loading one file the traditional way took about a minute. In other words, the problem could not be solved head-on. This was the case in which I decided to use the data lake approach. The solution looked roughly like this:
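
A quick back-of-the-envelope calculation shows why loading file by file was not an option. The sketch below assumes roughly 7,000 files per day (a reading of the file count that is consistent with the 50 GB total and the 1-4 MB file size) and strictly sequential loading at one minute per file; both are simplifying assumptions.

```python
files_per_day = 7_000      # assumed daily file count
minutes_per_file = 1       # "about a minute" per file

total_minutes = files_per_day * minutes_per_file
total_hours = total_minutes / 60
total_days = total_hours / 24

print(f"{total_hours:.1f} hours, or {total_days:.1f} days, to load one day of logs")
```

Under these assumptions, one day of logs takes several days to load, so the backlog would grow faster than it could be cleared.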

[Image: architecture of the data lake solution]

It is quite simple (I want to note that the advantage of working in the cloud is simplicity). I used:

  • AWS Elastic MapReduce (EMR, Hadoop) for compute power
  • AWS S3 as file storage, with the ability to encrypt data and restrict access
  • Spark for in-memory processing and PySpark for the logic and data transformations
  • Parquet as the output format of Spark
  • AWS Glue Crawler as a collector of metadata about new data and partitions
  • Redshift Spectrum as a SQL interface to the data lake for existing Redshift users

The smallest EMR+Spark cluster processed the whole batch of files in 30 minutes. There are other cases on AWS, especially many related to Alexa, where there is a lot of data.
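
Partitions are what connect the Spark output to the crawler and to SQL queries. Glue crawlers can infer partition columns from folder names written in the Hive `key=value` style, which is also the layout Spark produces when writing partitioned Parquet. The article does not say how the output was actually partitioned, so the layout below (by domain and date) and the `partition_prefix` helper are purely illustrative assumptions.

```python
from datetime import datetime


def partition_prefix(domain, timestamp):
    """Build a Hive-style key=value storage prefix for one log record."""
    ts = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return f"domain={domain}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


prefix = partition_prefix("www.example.com", "2018-07-02T22:23:00.186641Z")
print(prefix)
```

With a layout like this, a query that filters on one day or one domain only has to read the files under the matching prefix rather than the whole lake.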

I recently learned that one of the drawbacks of a data lake is GDPR. The problem is that when a customer asks for their data to be deleted and that data sits inside one of the files, we cannot use Data Manipulation Language and a DELETE operation as we would in a database.
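
In practice, erasing one customer from a file-based lake means rewriting every file that contains their records, because objects in storage such as S3 cannot be edited in place. The sketch below shows the idea on an in-memory list of JSON lines; the `erase_customer` helper, the `customer_id` field, and the sample records are hypothetical, and a real job would also have to locate the affected files and replace them in storage.

```python
import json


def erase_customer(lines, customer_id):
    """Return the file's records minus those belonging to one customer."""
    kept = []
    for line in lines:
        record = json.loads(line)
        if record.get("customer_id") != customer_id:
            kept.append(line)
    return kept


# Stand-in for the contents of one file in the lake.
original_file = [
    '{"customer_id": "c-1", "url": "/home"}',
    '{"customer_id": "c-2", "url": "/cart"}',
    '{"customer_id": "c-1", "url": "/checkout"}',
]

# The rewritten contents would replace the original file.
rewritten_file = erase_customer(original_file, "c-1")
print(rewritten_file)
```

This is a full read and rewrite for what would be a single DELETE statement in a database, which is why the catalog matters here too: without it, finding the affected files means scanning everything.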

I hope this article has clarified the difference between a data warehouse and a data lake. If you are interested, I can translate more of my own articles or articles by professionals that I read. I can also write about the solutions I work with and their architecture.

source: www.habr.com
