Big and Small Data Tester: Trends, Theory, My Story

Hello everyone, my name is Alexander, and I am a Data Quality engineer: I check data for its quality. This article is about how I got here and why, in 2020, this area of testing found itself on the crest of a wave.


Global trend

Today's world is going through another technological revolution, one aspect of which is the use of accumulated data by all kinds of companies to spin up their own flywheel of sales, profit, and PR. It seems that having good (quality) data, together with skilled minds that can make money from it (process it correctly, visualize it, build machine learning models, and so on), has become the key to success for many today. If 15-20 years ago it was mostly large companies that did intensive work on accumulating and monetizing data, today it is the lot of almost everyone with any sense.

In this regard, several years ago job portals around the world began to fill up with Data Scientist vacancies, because everyone was sure that, having hired such a specialist, they could build a machine learning supermodel, predict the future, and make a "quantum leap" for the company. Over time, people realized that this approach almost never works anywhere, because not all the data that falls into the hands of such specialists is suitable for training models.

And the requests from Data Scientists began: "Let's buy more data from this and that vendor...", "We don't have enough data...", "We need more data, preferably of high quality...". Based on these requests, numerous interactions began to form between companies that own one data set or another. Naturally, this required technical organization of the process: connecting to the data source, downloading the data, checking that it was loaded in full, and so on. The number of such processes began to grow, and today we have a huge need for another kind of specialist, Data Quality engineers, who monitor the flow of data in the system (data pipelines) and the quality of data at the input and output, and draw conclusions about its sufficiency, integrity, and other characteristics.

The trend for Data Quality engineers came to us from the USA, where, in the raging era of capitalism, no one is ready to lose the battle for data. Below are screenshots from two of the most popular job search sites in the US, www.monster.com and www.dice.com, showing data as of March 17, 2020 on the number of vacancies found using the keywords Data Quality and Data Scientist.

www.monster.com

Data Scientists - 21416 vacancies
Data Quality - 41104 vacancies

[Screenshots: www.monster.com search results]

www.dice.com

Data Scientists - 404 vacancies
Data Quality - 2020 vacancies

[Screenshots: www.dice.com search results]

Obviously, these professions do not compete with each other in any way. With the screenshots, I only wanted to illustrate the current state of the labor market in terms of demand for Data Quality engineers, who are now needed in much greater numbers than Data Scientists.

In June 2019, EPAM, responding to the needs of the modern IT market, separated Data Quality into its own practice. In the course of their daily work, Data Quality engineers manage data, check its behavior under new conditions and in new systems, and monitor its relevance, sufficiency, and validity. With all this, in practical terms, Data Quality engineers devote relatively little time to classic functional testing, but this depends heavily on the project (I will give an example below).

The responsibilities of a Data Quality engineer are not limited to routine manual or automated checks for "nulls, counts and sums" in database tables; they require a deep understanding of the customer's business needs and, accordingly, the ability to turn the available data into useful business information.

Data Quality Theory


To get a fuller picture of the role of such an engineer, let's look at what Data Quality is in theory.

Data Quality is one of the stages of Data Management (a whole world that we will leave for you to study on your own) and is responsible for analyzing data according to the following criteria:

[Figure: Data Quality criteria (data dimensions)]

I don't think there is any need to decipher each of the points (in theory they are called "data dimensions"); they are described quite well in the figure. But the testing process itself does not mean strictly copying these characteristics into test cases and checking them. In Data Quality, as in any other type of testing, it is necessary first of all to build on the data quality requirements agreed with the project stakeholders who make business decisions.

Depending on the Data Quality project, an engineer can perform different functions: from an ordinary automation tester with a superficial assessment of data quality, to a person who carries out deep profiling of the data according to the criteria above.

Data Management, Data Quality, and related processes are described in great detail in the book "DAMA-DMBOK: Data Management Body of Knowledge: 2nd Edition". I highly recommend this book as an introduction to the topic (you will find a link to it at the end of the article).

My story

In the IT industry, I worked my way up from a Junior tester in product companies to a Lead Data Quality Engineer at EPAM. After about two years of working as a tester, I was firmly convinced that I had done absolutely every type of testing: regression, functional, stress, stability, security, UI, and so on. I had also tried a large number of testing tools, having worked in three programming languages along the way: Java, Scala, Python.

Looking back, I understand why my skill set was so diverse: I was involved in data-driven projects, big and small. That is what brought me into a world of many tools and opportunities for growth.

To appreciate the variety of tools and opportunities to gain new knowledge and skills, just look at the picture below, which shows the most popular of them in the "Data & AI" world.

[Figure: the "Data & AI" landscape]

This kind of illustration is compiled every year by Matt Turck, a well-known venture capitalist who comes from software development. Here is a link to his blog and to the venture capital firm where he works as a partner.

I grew professionally especially fast when I was the only tester on a project, or at least at the start of a project. It is at such a moment that you have to be responsible for the entire testing process, and you have no way to step back, only forward. At first it was scary, but now all the advantages of such a trial are obvious to me:

  • You start communicating with the whole team like never before, since there is no proxy for communication: neither a test manager nor fellow testers.
  • Immersion in the project becomes incredibly deep, and you have information about all the components, both in general and in detail.
  • Developers do not look at you as "that guy from testing who doesn't know what he's doing," but as an equal who brings incredible benefit to the team with his automated tests and his anticipation of bugs appearing in a particular part of the product.
  • As a result, you are more effective, more qualified, and more in demand.

As the project grew, in 100% of cases I became a mentor for new testers, teaching them and passing on the knowledge I had acquired myself. At the same time, depending on the project, I did not always get the highest level of test automation specialists from management, so there was a need either to train them in automation (for those interested) or to build tools for them to use in their everyday work (tools for generating data and loading it into the system, a tool for running load or stability testing "quickly", and so on).

Example of a specific project

Unfortunately, because of non-disclosure obligations, I cannot talk in detail about the projects I worked on, but I will give examples of typical Data Quality Engineer tasks from one of them.

The essence of the project was to implement a platform for preparing data on which machine learning models would then be trained. The customer was a large pharmaceutical company from the USA. Technically, it was a Kubernetes cluster running on AWS EC2 instances, with several microservices and EPAM's underlying Open Source project, Legion, adapted to the needs of the specific customer (the project has since been reborn as odahu). ETL processes were organized using Apache Airflow and moved data from the customer's Salesforce systems into AWS S3 buckets. Next, a Docker image of a machine learning model was deployed to the platform; the model was trained on fresh data and, through a REST API interface, produced predictions that were of interest to the business and solved specific problems.

Visually, it all looked something like this:

[Figure: architecture of the platform]

There was plenty of functional testing on this project, and given the speed of feature development and the need to maintain the pace of the release cycle (two-week sprints), it was necessary to think right away about automating the testing of the most critical parts of the system. Most of the Kubernetes-based platform itself was covered by autotests implemented with Robot Framework + Python, but they also needed to be maintained and extended. In addition, for the customer's convenience, a GUI was created to manage the machine learning models deployed to the cluster, as well as to specify where and from where data should be transferred for training the models. This large addition entailed an expansion of automated functional testing, which was mostly done through REST API calls and a small number of end-to-end UI tests. Around the midpoint of all this activity, we were joined by a manual tester who did an excellent job with acceptance testing of product versions and with communicating with the customer about acceptance of the next release. In addition, thanks to the arrival of a new specialist, we were able to document our work and add several very important manual checks that were difficult to automate right away.

And finally, once we had achieved stability from the platform and the GUI add-on on top of it, we started building ETL pipelines using Apache Airflow DAGs. Automated data quality checks were carried out by writing special Airflow DAGs that checked the data based on the results of the ETL process. On this project we were lucky: the customer gave us access to anonymized data sets on which we tested. We checked the data row by row for compliance with types, the presence of broken data, the total number of records before and after, the correctness of the transformations made by the ETL process for aggregation, the renaming of columns, and other things. In addition, these checks were scaled to different data sources, for example, MySQL in addition to Salesforce.
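The kinds of checks listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's code: the column names, the rename map, the expected types, and the sample rows are all hypothetical, and on the project such logic lived inside Airflow DAGs rather than in standalone functions.

```python
from collections import defaultdict

# Hypothetical rename map applied by the ETL step (source column -> target column).
RENAMES = {"Id": "account_id", "Amount": "amount", "Region": "region"}
# Hypothetical expected Python types for the target columns.
EXPECTED_TYPES = {"account_id": str, "amount": float, "region": str}


def check_types(rows):
    """Return (row_index, column) pairs whose value is missing or has the wrong type."""
    problems = []
    for i, row in enumerate(rows):
        for column, expected in EXPECTED_TYPES.items():
            if not isinstance(row.get(column), expected):
                problems.append((i, column))
    return problems


def check_row_count(source_rows, target_rows):
    """The record count must be identical before and after the ETL step."""
    return len(source_rows) == len(target_rows)


def check_renames(target_rows):
    """Every target row must expose exactly the renamed column set."""
    expected = set(RENAMES.values())
    return all(set(row) == expected for row in target_rows)


def sum_by(rows, key, value):
    """Aggregate `value` per `key`, like a GROUP BY with SUM."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def check_aggregation(source_rows, target_rows):
    """Per-region totals must survive the transformation unchanged."""
    return sum_by(source_rows, "Region", "Amount") == sum_by(target_rows, "region", "amount")


# Hypothetical sample: source rows and the rows an ETL step would produce from them.
SOURCE = [
    {"Id": "a1", "Amount": 10.0, "Region": "EU"},
    {"Id": "a2", "Amount": 5.5, "Region": "US"},
    {"Id": "a3", "Amount": 4.5, "Region": "EU"},
]
TARGET = [{RENAMES[k]: v for k, v in row.items()} for row in SOURCE]
```

Keeping each check as a small function that returns either a boolean or a list of offending rows makes it easy to wrap the same logic in whatever scheduler or test runner the project uses.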

The final data quality checks were carried out at the S3 level, where the data was stored ready for use in training machine learning models. To get the data from the final CSV file located in the S3 bucket and validate it, code was written using boto3 clients.
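A minimal sketch of such a check might look like the following. The column list and the validation rules are assumptions made for illustration; the only boto3 call used is the S3 client's `get_object`, and it is imported lazily so that the validation function can be exercised without AWS access.

```python
import csv
import io

REQUIRED_COLUMNS = ["account_id", "amount", "region"]  # hypothetical schema


def fetch_csv_from_s3(bucket, key):
    """Download an object body as bytes. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so validate_csv works without boto3 installed

    s3 = boto3.client("s3")
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()


def validate_csv(body):
    """Return a list of human-readable problems found in the CSV bytes."""
    reader = csv.DictReader(io.StringIO(body.decode("utf-8")))
    if reader.fieldnames != REQUIRED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    for row in reader:
        line_no = reader.line_num  # physical line of the row just read
        # DictReader stores extra fields under the key None and fills
        # missing fields with the value None.
        if None in row or None in row.values():
            problems.append(f"line {line_no}: wrong number of fields")
            continue
        if not row["account_id"]:
            problems.append(f"line {line_no}: empty account_id")
        try:
            float(row["amount"])
        except ValueError:
            problems.append(f"line {line_no}: amount is not numeric")
    return problems
```

In a real run the two functions would be chained, for example `validate_csv(fetch_csv_from_s3(bucket, key))`, with the bucket and key supplied by the pipeline configuration.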

There was also a requirement from the customer to store part of the data in one S3 bucket and part in another. This also required writing additional checks to verify the reliability of such sorting.
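The article does not say by what rule the data was divided, so the sketch below assumes a hypothetical one (routing by region) purely to show the shape of such a check: every source record must appear in exactly one bucket, and in the bucket the rule assigns.

```python
def route(record):
    """Hypothetical routing rule: which bucket a record belongs to."""
    return "bucket-a" if record["region"] == "EU" else "bucket-b"


def check_split(source, buckets):
    """Check that every source record sits in exactly the bucket the rule assigns.

    `buckets` maps a bucket name to the list of records found there.
    Returns a list of problems; an empty list means the split is consistent.
    """
    problems = []
    seen = {}
    for name, records in buckets.items():
        for record in records:
            rid = record["account_id"]
            if rid in seen:
                problems.append(f"{rid}: duplicated in {seen[rid]} and {name}")
            seen[rid] = name
            if route(record) != name:
                problems.append(f"{rid}: found in {name}, expected {route(record)}")
    for record in source:
        if record["account_id"] not in seen:
            problems.append(f"{record['account_id']}: missing from all buckets")
    return problems
```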

Generalized experience from other projects

An example of the most general list of activities of a Data Quality engineer:

  • Prepare test data (valid, invalid, large, small) with an automated tool.
  • Load the prepared data set into the original source and check that it is ready for use.
  • Launch the ETL processes that move the data set from the source storage to the final or intermediate storage using a certain set of settings (if possible, set configurable parameters for the ETL task).
  • Verify the data processed by the ETL process for quality and compliance with business requirements.

At the same time, the main focus of the checks should be not only on whether the data flow in the system has, in principle, worked and run to completion (that is part of functional testing), but mostly on checking and validating the data for compliance with the expected requirements, identifying anomalies, and so on.
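The workflow above (generate data, run it through ETL, validate the result) can be illustrated with a small sketch. Everything here is hypothetical: the record layout, the single kind of invalid record, and `toy_etl`, which merely stands in for a real ETL job.

```python
import random


def generate_records(n, invalid_ratio=0.2, seed=42):
    """Generate n test records; roughly invalid_ratio of them are deliberately broken."""
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible
    records = []
    for i in range(n):
        record = {"id": i, "amount": round(rng.uniform(1, 1000), 2), "valid": True}
        if rng.random() < invalid_ratio:
            record["amount"] = None  # one hypothetical kind of invalid record
            record["valid"] = False
        records.append(record)
    return records


def toy_etl(records):
    """Stand-in for the real ETL job: drops records without an amount."""
    return [
        {"id": r["id"], "amount": r["amount"]}
        for r in records
        if r["amount"] is not None
    ]


def validate_output(generated, output):
    """The output must contain exactly the records that were generated as valid."""
    expected_ids = {r["id"] for r in generated if r["valid"]}
    actual_ids = {r["id"] for r in output}
    return {
        "missing": sorted(expected_ids - actual_ids),
        "unexpected": sorted(actual_ids - expected_ids),
    }
```

Because the generator labels each record as valid or invalid, the validation step checks the data itself against expectations rather than only confirming that the job finished.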

Tools

One technique for such data control is to organize chained checks at each stage of data processing, the so-called "data chain" in the literature: control of data from the source to the point of final use. Checks of this kind are most often implemented by writing verifying SQL queries. Clearly, such queries should be as lightweight as possible and check individual aspects of data quality (table metadata, blank rows, NULLs, syntax errors, and other attributes required for the check).
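As a rough illustration of lightweight verifying SQL queries, the sketch below runs a few of them against an in-memory SQLite database. The `orders` table and its columns are invented for the example; on a real project the same idea would be applied to the project's own database at each stage of the chain.

```python
import sqlite3

# Each check is a simple SQL query that must return 0 when the data is clean.
CHECKS = {
    "null_amount": "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
    "empty_customer": "SELECT COUNT(*) FROM orders WHERE TRIM(customer) = ''",
    "duplicate_id": (
        "SELECT COUNT(*) FROM "
        "(SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1)"
    ),
}


def run_checks(conn):
    """Run every check and return {check_name: number_of_offending_rows}."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}


# Hypothetical table with one problem of each kind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 10.0), (2, "", 5.0), (3, "globex", None), (3, "globex", 7.5)],
)

failed = {name: count for name, count in run_checks(conn).items() if count}
print(failed)
```

Each query targets one aspect of quality, so a failure points directly at the kind of defect rather than at the data set as a whole.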

In the case of regression testing, which uses ready-made (unchanging or slightly changing) data sets, the autotest code can store ready-made templates for checking the data for compliance with quality requirements (descriptions of the expected table metadata; sample row objects that can be selected at random during the test, and so on).
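One possible shape for such templates is sketched below using SQLite: an expected schema stored alongside the tests, and a set of reference rows from which a random sample is compared with the table. The table, columns, and reference rows are hypothetical.

```python
import random
import sqlite3

# Template stored with the autotests: expected (column name, declared type) pairs.
EXPECTED_SCHEMA = [("id", "INTEGER"), ("customer", "TEXT"), ("amount", "REAL")]


def actual_schema(conn, table):
    """Read column names and declared types from SQLite table metadata."""
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
    # The table name comes from test code, not user input, so it is interpolated.
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]


def sample_mismatches(conn, table, reference, sample_size, seed=None):
    """Compare a random sample of reference rows (keyed by id) with the table."""
    rng = random.Random(seed)
    ids = rng.sample(sorted(reference), min(sample_size, len(reference)))
    mismatches = []
    for row_id in ids:
        row = conn.execute(
            f"SELECT id, customer, amount FROM {table} WHERE id = ?", (row_id,)
        ).fetchone()
        if row != reference[row_id]:
            mismatches.append(row_id)
    return sorted(mismatches)


# Hypothetical regression data set and the table loaded from it.
REFERENCE = {1: (1, "acme", 10.0), 2: (2, "globex", 5.0), 3: (3, "initech", 7.5)}
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", REFERENCE.values())
```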

Also, during testing, you have to write ETL test processes using frameworks such as Apache Airflow, Apache Spark, or even black-box cloud tools such as GCP Dataprep, GCP Dataflow, and so on. This circumstance forces the test engineer to dive into the principles of operation of these tools and, even more effectively, both carry out functional testing (for example, of the existing ETL processes on the project) and use them to check data. In particular, Apache Airflow has ready-made operators for working with popular analytical databases, for example GCP BigQuery. The most basic example of its use has already been described here, so I will not repeat it.

Apart from ready-made solutions, no one forbids you to implement your own techniques and tools. This benefits not only the project but also the Data Quality Engineer himself, who thereby broadens his technical horizons and improves his coding skills.

How it works on a real project

A good illustration of the last paragraphs about the "data chain", ETL, and ubiquitous checks is the following process from one of the real projects:

[Figure: data flow with checks at each stage]

Here, various data (naturally, prepared by us) enters the input "funnel" of our system: valid, invalid, mixed, and so on. It is then filtered and ends up in intermediate storage, after which it again goes through a series of transformations and is placed in the final storage, from which, in turn, analytics, the building of data marts, and the search for business insights will be carried out. In such a system, without functionally checking the operation of the ETL processes, we focus on the quality of the data before and after the transformations, as well as at the output to analytics.
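The idea of measuring data quality at each point of the funnel, rather than testing the ETL code itself, can be sketched as follows. The two stages and the record layout are hypothetical stand-ins for the real filtering and transformation steps.

```python
def quality_report(records):
    """Summarise a data set: row count, NULL amounts, and total amount."""
    amounts = [r["amount"] for r in records]
    return {
        "rows": len(records),
        "null_amounts": sum(a is None for a in amounts),
        "total": sum(a for a in amounts if a is not None),
    }


def filter_stage(records):
    """Input funnel: only records with a non-negative amount reach intermediate storage."""
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]


def transform_stage(records):
    """Hypothetical transformation: convert amounts from cents to whole units."""
    return [{**r, "amount": r["amount"] / 100} for r in records]


incoming = [
    {"id": 1, "amount": 1000},
    {"id": 2, "amount": None},  # invalid: missing value
    {"id": 3, "amount": -500},  # invalid: negative amount
    {"id": 4, "amount": 250},
]
intermediate = filter_stage(incoming)
final = transform_stage(intermediate)

# One report per checkpoint: before filtering, in intermediate storage, in final storage.
before, middle, after = (quality_report(x) for x in (incoming, intermediate, final))
```

Comparing the reports between checkpoints (row counts, NULL counts, totals adjusted for the known transformation) shows where in the chain a discrepancy was introduced.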

To summarize the above: regardless of where I worked, everywhere I was involved in Data projects that shared the following features:

  • Only through automation can you test certain cases and achieve a release cycle acceptable to the business.
  • A tester on such a project is one of the most respected members of the team, since he brings great benefit to each of the participants (faster testing, good data for the Data Scientist, identification of defects at early stages).
  • It does not matter whether you work on your own hardware or in the clouds: all resources are abstracted into a cluster such as Hortonworks, Cloudera, Mesos, Kubernetes, and so on.
  • Projects are built on a microservice approach, and distributed and parallel computing predominates.

I would like to note that when testing in the Data Quality field, a testing specialist shifts his professional focus to the code of the product and the tools used.

Distinctive features of Data Quality testing

In addition, for myself, I have identified the following distinctive features of testing in Data (Big Data) projects (systems) compared with other areas (I will say right away that they are VERY generalized and entirely subjective):

[Figure: distinctive features of testing in Data (Big Data) projects compared with other areas]

Useful links

  1. Theory: DAMA-DMBOK: Data Management Body of Knowledge: 2nd Edition.
  2. EPAM training center
  3. Recommended materials for a beginner Data Quality engineer:
    1. Free course on Stepik: Introduction to Databases
    2. Course on LinkedIn Learning: Data Science Foundations: Data Engineering.
    3. Articles:
    4. Videos:

Conclusion

Data Quality is a very young and promising field, and being part of it means being part of a startup of sorts. Once in Data Quality, you will be immersed in a large number of modern, in-demand technologies, but most importantly, huge opportunities will open up for you to generate and implement your own ideas. You will be able to apply the continuous improvement approach not only to the project but also to yourself, continuously developing as a specialist.

Source: www.habr.com
