Do we need a data lake? What should we do with the data warehouse?

This article is a translation of my Medium article, Getting Started with Data Lake, which turned out to be quite popular, probably because of its simplicity. So I decided to write it in Russian and add a little, so that an ordinary person who is not a data specialist can understand what a data warehouse (DW) is, what a data lake (Data Lake) is, and how the two get along.

Why did I want to write about the data lake? I have been working with data and analytics for more than ten years, and right now I work with big data at Amazon Alexa AI in Cambridge, which is in Boston, although I live in Victoria on Vancouver Island and often visit Boston, Seattle, and Vancouver, and sometimes even Moscow, where I speak at conferences. I also write from time to time, mostly in English, and I have already written several books. I also like to share analytics trends from North America, and I sometimes write on Telegram.

I have always worked with data warehouses, and since 2015 I have worked closely with Amazon Web Services and generally switched to cloud analytics (AWS, Azure, GCP). I have watched analytics solutions evolve since 2007; I even worked for the data warehouse vendor Teradata and implemented it at Sberbank, and that is when Big Data with Hadoop appeared. Everyone started saying that the era of warehouses was over and that everything now ran on Hadoop. Then they started talking about the Data Lake, again saying that now the end of the data warehouse had definitely come. But fortunately (perhaps unfortunately for some who made a lot of money setting up Hadoop), the data warehouse did not go away.

In this article we will look at what a data lake is. It is intended for people who have little or no experience with data warehouses.

The picture shows Lake Bled. It is one of my favorite lakes; although I was there only once, I have remembered it for the rest of my life. But we will talk about another kind of lake, the data lake. Many of you have probably heard this term more than once, but one more definition will not hurt anyone.

First of all, here are the most popular definitions of a Data Lake:

"a file store for all kinds of raw data that is available for analysis by anyone in the organization" - Martin Fowler.

"If you think of a data mart as a bottle of water, cleansed, packaged, and structured for easy consumption, then a data lake is a large body of water in its natural state. Users can collect water for themselves, dive in, and explore" - James Dixon.

Now we know for sure that a data lake is about analytics: it lets us store large amounts of data in its original form, and it gives us the necessary, convenient access to that data.

I often like to simplify things: if I can explain a complex term in simple words, then I understand for myself how it works and what it is for. One day I was scrolling through the photo gallery on my iPhone, and it dawned on me that this is a real data lake. I even made a slide for conferences:

It is all very simple. We take a photo on the phone; the photo is saved on the phone and can also be saved to iCloud (cloud file storage). The phone also collects photo metadata: what is shown, the geotag, the time. As a result, we can use the iPhone's user-friendly interface to find our photo, and we even see indicators. For example, when I search for photos with the word "fire", I find 3 photos with an image of a fire. To me, this is just like a Business Intelligence tool that works very quickly and accurately.

And of course, we must not forget about security (authorization and authentication), otherwise our data can easily end up in the public domain. There are many news stories about large corporations and startups whose data became publicly available because of developers' negligence and failure to follow simple rules.

Even such a simple picture helps us imagine what a data lake is, how it differs from a traditional data warehouse, and what its main elements are:

  1. Loading data (Ingestion) is a key component of the data lake. Data can enter the data lake in two ways: batch (loading at intervals) and streaming (a continuous flow of data).
  2. File storage (Storage) is the main component of the Data Lake. We need the storage to be easily scalable, extremely reliable, and low cost. For example, in AWS this is S3.
  3. Catalog and search (Catalog and Search): to avoid a Data Swamp (this is when we dump all the data into one pile and then find it impossible to work with), we need to create a metadata layer that classifies the data so that users can easily find the data they need for analysis. In addition, you can use search solutions such as ElasticSearch. Search helps the user find the required data through a user-friendly interface.
  4. Processing (Process): this step is responsible for processing and transforming data. We can transform data, change its structure, clean it, and much more.
  5. Security (Security): it is important to spend time on the security design of the solution, for example data encryption during storage, processing, and loading. It is important to use authentication and authorization methods. Finally, an audit tool is needed.
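
The catalog and search element can be illustrated with a minimal sketch. It assumes a tiny in-memory catalog; the `CatalogEntry` class, the `search` helper, and the `s3://example-lake/...` paths are hypothetical names chosen for illustration and do not come from any real catalog service.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one dataset in the lake (hypothetical shape)."""
    name: str
    location: str  # where the files live, e.g. an object-store prefix
    fmt: str       # file format of the stored data
    tags: set = field(default_factory=set)


def search(catalog, keyword):
    """Return names of entries whose name or tags contain the keyword."""
    keyword = keyword.lower()
    return [
        entry.name
        for entry in catalog
        if keyword in entry.name.lower()
        or any(keyword in tag.lower() for tag in entry.tags)
    ]


# Illustrative entries only; the paths are made up.
catalog = [
    CatalogEntry("web_logs_raw", "s3://example-lake/raw/web_logs/", "text", {"logs", "raw"}),
    CatalogEntry("web_logs_clean", "s3://example-lake/clean/web_logs/", "parquet", {"logs", "seo"}),
    CatalogEntry("orders", "s3://example-lake/raw/orders/", "csv", {"sales"}),
]

print(search(catalog, "logs"))
print(search(catalog, "seo"))
```

Without such a metadata layer, a user would have to know the exact path of every dataset in advance, which is how a lake turns into a swamp.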

From a practical point of view, we can characterize a data lake by three attributes:

  1. Collect and store anything: the data lake contains all the data, both raw unprocessed data for any period of time and processed/cleansed data.
  2. Deep analysis: the data lake allows users to explore and analyze data.
  3. Flexible access: the data lake provides flexible access to different data and for different scenarios.

Now we can talk about the difference between a data warehouse and a data lake. People usually ask:

  • What about the data warehouse?
  • Are we replacing the data warehouse with a data lake, or are we extending it?
  • Is it still possible to do without a data lake?

In short, there is no clear answer. It all depends on the specific situation, the skills of the team, and the budget. One example is the migration of a data warehouse from Oracle to AWS and the creation of a data lake by an Amazon subsidiary, Woot: "Our data lake story: How Woot.com built a serverless data lake on AWS".

On the other hand, the vendor Snowflake says that you no longer need to think about a data lake, since their data platform (until 2020 it was a data warehouse) lets you combine both a data lake and a data warehouse. I have not worked much with Snowflake, but it is truly a unique product that can do this. The price is another matter.

In conclusion, my personal opinion is that we still need a data warehouse as the main source of data for our reporting, and whatever does not fit there we store in the data lake. The whole role of analytics is to give the business easy access to data for making decisions. Whatever one may say, business users work more efficiently with a data warehouse than with a data lake. At Amazon, for example, there is Redshift (an analytical data warehouse) and there is Redshift Spectrum/Athena (a SQL interface to the data lake in S3, based on Hive/Presto). The same applies to other modern analytical data warehouses.

Let's look at a typical data warehouse architecture:

This is the classic solution. We have source systems; using ETL/ELT we copy data into an analytical data warehouse and connect it to a Business Intelligence solution (my favorite is Tableau, what about yours?).

This solution has the following disadvantages:

  • ETL/ELT operations take time and resources.
  • As a rule, storage in an analytical data warehouse is not cheap (for example, Redshift, BigQuery, Teradata), since we need to buy a whole cluster.
  • Business users have access to cleansed and often aggregated data, and do not have access to raw data.

Of course, it all depends on your case. If you have no problems with your data warehouse, then you do not need a data lake at all. But when problems arise with a lack of space or capacity, or when price plays a key role, you can consider the data lake option. This is why the data lake is so popular. Here is an example of a data lake architecture:

With the data lake approach, we load raw data into our data lake (batch or streaming), then process the data as needed. The data lake allows business users to create their own data transformations (ETL/ELT) or analyze data in Business Intelligence solutions (if the required driver is available).

The goal of any analytics solution is to serve business users. Therefore, we must always work from business requirements. (At Amazon this is one of the principles: working backwards.)

Working with both a data warehouse and a data lake, we can compare the two solutions:

The main conclusion is that the data warehouse does not compete with the data lake but rather complements it. It is up to you to decide what suits your case. It is always interesting to try it yourself and draw the right conclusions.

I would also like to tell you about one of the cases when I started using the data lake approach. It is all rather trivial: I tried to use an ELT tool (we had Matillion ETL) with Amazon Redshift. My solution worked, but it did not meet the requirements.

I needed to take web logs, transform them, and aggregate them to provide data for two cases:

  1. The marketing team wanted to analyze bot activity for SEO
  2. IT wanted to look at website performance metrics

Very simple logs. Here is an example:

https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 
192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 
"GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 
arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067
"Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" "arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012"
1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-"

One file weighed 1-4 megabytes.
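
To make the two use cases concrete, here is a minimal sketch that parses the sample entry above with the Python standard library. It assumes the entry is a single line in the AWS Application Load Balancer access log layout (the field names below follow that layout); the `is_bot` marker list and the helper names are illustrative assumptions, not the author's actual logic.

```python
import shlex

# Field names follow the ALB access log layout; treating the sample as an
# ALB entry is an assumption based on its shape.
ALB_FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol", "target_group_arn", "trace_id",
    "domain_name", "chosen_cert_arn", "matched_rule_priority",
    "request_creation_time", "actions_executed", "redirect_url",
    "error_reason",
]

# The sample entry from the text, joined back into one line.
SAMPLE = (
    'https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 '
    '192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 '
    '"GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" '
    'ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 '
    'arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 '
    '"Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" '
    '"arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012" '
    '1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-"'
)


def parse_line(line):
    """Split one log line on spaces, keeping quoted fields together."""
    values = shlex.split(line)
    if len(values) != len(ALB_FIELDS):
        raise ValueError(f"expected {len(ALB_FIELDS)} fields, got {len(values)}")
    return dict(zip(ALB_FIELDS, values))


def is_bot(record, markers=("bot", "crawler", "spider", "curl")):
    """Marketing case: crude user-agent check (marker list is an assumption)."""
    agent = record["user_agent"].lower()
    return any(marker in agent for marker in markers)


def total_seconds(record):
    """IT case: sum of the three processing-time fields, in seconds."""
    keys = ("request_processing_time", "target_processing_time",
            "response_processing_time")
    return sum(float(record[key]) for key in keys)


record = parse_line(SAMPLE)
print(record["domain_name"], record["elb_status_code"], is_bot(record))
print(round(total_seconds(record), 3))
```

The same parsing and flagging logic is what a distributed job would apply to every line of every file; only the execution engine changes.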

But there was one difficulty. We had 7 domains around the world, and 7,000 files were created per day. This is not a large volume, only 50 gigabytes. But our Redshift cluster was also small (50 nodes). Loading one file the traditional way took about a minute. In other words, the problem could not be solved head-on. And this was the case when I decided to use the data lake approach. The solution looked like this:
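
A quick back-of-the-envelope calculation shows why loading file by file could not keep up. This is only a sketch: it reads the daily file count as 7,000 (an assumption about the figure in the text) and takes "about a minute" as exactly one minute per file, loaded strictly one after another.

```python
FILES_PER_DAY = 7_000    # assumption: the daily file count read as 7,000
MINUTES_PER_FILE = 1.0   # "about a minute" per file, taken as exactly one


def sequential_load_hours(n_files, minutes_each):
    """Hours needed to load n_files one after another."""
    return n_files * minutes_each / 60


hours = sequential_load_hours(FILES_PER_DAY, MINUTES_PER_FILE)
print(f"{hours:.1f} hours of loading for each day of logs")
print("keeps up with incoming data:", hours <= 24)
```

Under these assumptions a single day of logs needs well over 24 hours of loading, so the backlog grows every day.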

It is quite simple (I want to note that the advantage of working in the cloud is simplicity). I used:

  • AWS Elastic MapReduce (Hadoop) for compute power
  • AWS S3 as file storage, with the ability to encrypt data and restrict access
  • Spark as the in-memory compute engine and PySpark for the logic and data transformation
  • Parquet as the output of Spark
  • AWS Glue Crawler as the collector of metadata about new data and partitions
  • Redshift Spectrum as the SQL interface to the data lake for existing Redshift users
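
The source does not show the actual PySpark job, so the following is only a plain-Python sketch of one idea behind this stack: writing output under Hive-style `key=value` folders, which is a layout that crawlers and SQL engines can commonly interpret as partitions. The bucket name, the choice of partition keys, and the helper names are all hypothetical.

```python
from collections import Counter

BUCKET = "example-lake"  # hypothetical bucket name


def partition_prefix(record):
    """Build a Hive-style key=value prefix from a parsed log record."""
    day = record["time"][:10]  # ISO timestamp -> YYYY-MM-DD
    return f"s3://{BUCKET}/web_logs/domain={record['domain_name']}/date={day}/"


def requests_per_partition(records):
    """Count how many records would land in each partition."""
    return Counter(partition_prefix(record) for record in records)


# Illustrative records; a real job would read them from the raw log files.
records = [
    {"time": "2018-07-02T22:23:00.186641Z", "domain_name": "www.example.com"},
    {"time": "2018-07-02T23:01:10.000000Z", "domain_name": "www.example.com"},
    {"time": "2018-07-03T00:00:05.000000Z", "domain_name": "www.example.org"},
]

counts = requests_per_partition(records)
for prefix, n in sorted(counts.items()):
    print(prefix, n)
```

Partitioning by domain and date means a query for one domain on one day only has to read the files under a single prefix.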

The smallest EMR+Spark cluster processed the entire stack of files in 30 minutes. There are other AWS cases, especially many related to Alexa, where there is a lot of data.

Recently I learned that one of the disadvantages of a data lake is GDPR. The problem is that when a client asks to have their data deleted and the data sits in one of the files, we cannot use Data Manipulation Language and a DELETE operation as we would in a database.
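
A minimal sketch of what that means in practice, assuming plain JSON-lines files and a hypothetical `customer_id` field: with no DELETE statement available, removing one customer's rows means reading the whole file and writing it back without them.

```python
import json
import os
import tempfile


def rewrite_without_customer(path, customer_id):
    """Rewrite a JSON-lines file, dropping every row for one customer.

    Returns the number of rows removed. The whole file is read and written
    again even if only a single row matches.
    """
    removed = 0
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as out, open(path) as src:
        for line in src:
            if json.loads(line)["customer_id"] == customer_id:
                removed += 1
            else:
                out.write(line)
    os.replace(tmp_path, path)  # swap the rewritten file into place
    return removed


# Demo on a throwaway file with made-up rows.
workdir = tempfile.mkdtemp()
data_file = os.path.join(workdir, "events.jsonl")
with open(data_file, "w") as f:
    for cid in ("a1", "b2", "a1", "c3"):
        f.write(json.dumps({"customer_id": cid, "event": "click"}) + "\n")

removed = rewrite_without_customer(data_file, "a1")
print("rows removed:", removed)
```

In a real lake the same rewrite has to be repeated for every file that might contain the customer, which is what makes such requests costly.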

I hope this article has clarified the difference between a data warehouse and a data lake. If you found it interesting, I can translate more of my articles or articles by professionals that I read, and also talk about the solutions I work with and their architecture.

Source: www.habr.com
