Testing big and small data: trends, theory, my story

Hello everyone, my name is Alexander, and I am a Data Quality engineer: I check data for its quality. This article describes how I came to this work and why, in 2020, this area of testing was on the crest of a wave.

The global trend

Today's world is going through another technological revolution, one aspect of which is that companies of every kind use the data they have accumulated to boost their sales, profits and PR. It seems that having good (quality) data, along with skilled minds who can make money from it (process it correctly, visualize it, build machine learning models, etc.), has become the key to success for many today. Whereas 15-20 years ago it was mainly large companies that worked intensively on accumulating and monetizing data, today this is the domain of almost every sensible business.

For this reason, several years ago job search portals all over the world began to fill up with Data Scientist vacancies, since everyone was sure that, having hired such a specialist, they could build a machine learning supermodel, predict the future and make a "quantum leap" for the company. Over time, people realized that this approach almost never works, since not all the data that falls into the hands of such specialists is suitable for training models.

And so the requests from Data Scientists began: "Let's buy more data from these and those...", "We don't have enough data...", "We need more data, preferably of high quality...". Based on these requests, numerous interactions began to form between companies that own one data set or another. Naturally, this required the technical organization of the process: connecting to the data source, downloading the data, checking that it was loaded in full, etc. The number of such processes began to grow, and today there is a huge need for another kind of specialist, Data Quality engineers, who monitor the flow of data in the system (data pipelines) and the quality of data at input and output, and who draw conclusions about its sufficiency, integrity and other characteristics.

The trend for Data Quality engineers came to us from the USA, where, in the midst of a raging era of capitalism, no one is ready to lose the battle for data. Below I have provided screenshots from two of the most popular job search sites in the US, www.monster.com and www.dice.com, which show data as of March 17, 2020 on the number of vacancies posted for the keywords Data Quality and Data Scientist.

www.monster.com

Data Scientists - 21416 vacancies
Data Quality - 41104 vacancies

www.dice.com

Data Scientists - 404 vacancies
Data Quality - 2020 vacancies

Obviously, these professions do not compete with each other in any way. With the screenshots, I just wanted to illustrate the current situation in the labor market in terms of demand for Data Quality engineers, who are now needed in much greater numbers than Data Scientists.

In June 2019, EPAM, responding to the needs of the modern IT market, separated Data Quality into a practice of its own. In their day-to-day work, Data Quality engineers manage data, check its behavior in new conditions and systems, and monitor its relevance, sufficiency and timeliness. With all this, in practical terms, Data Quality engineers devote relatively little time to classic functional testing, BUT this depends heavily on the project (I will give an example below).

The duties of a Data Quality engineer are not limited to routine checks for "nulls, counts and sums" in database tables. They require a deep understanding of the customer's business needs and, accordingly, the ability to turn the available data into useful business information.

Data Quality theory

To get a fuller picture of the role of such an engineer, let's look at what Data Quality is in theory.

Data Quality is one of the stages of Data Management (a whole world that we will leave for you to study on your own) and is responsible for analyzing data according to the following criteria:

I think there is no need to spell out each of the points (in theory they are called "data dimensions"); they are described quite well in the picture. But the testing process itself does not mean copying these attributes into test cases and checking them one by one. In Data Quality, as in any other kind of testing, you must first of all build on the data quality requirements agreed with the project participants who make business decisions.

Depending on the Data Quality project, an engineer can perform different functions: from an ordinary automation tester with a superficial assessment of data quality, to a person who carries out deep profiling of the data according to the criteria above.

A very detailed description of Data Management, Data Quality and related processes is given in the book "DAMA-DMBOK: Data Management Body of Knowledge: 2nd Edition". I highly recommend this book as an introduction to the topic (you will find a link to it at the end of the article).

My story

In the IT industry, I worked my way up from Junior tester in product companies to Lead Data Quality Engineer at EPAM. After about two years of working as a tester, I was firmly convinced that I had done absolutely every type of testing: regression, functional, stress, stability, security, UI, etc. I had also tried a large number of testing tools and worked in three programming languages at once: Java, Scala and Python.

Looking back, I understand why my skill set was so diverse: I was involved in data-driven projects, big and small. This is what brought me into a world of many tools and opportunities for growth.

To appreciate the variety of tools and the opportunities to gain new knowledge and skills, just look at the picture below, which shows the most popular of them in the "Data & AI" world.

This kind of illustration is compiled every year by Matt Turck, a well-known venture capitalist who comes from software development. Here is a link to his blog and to the venture capital firm where he works as a partner.

I grew professionally especially quickly when I was the only tester on a project, or at least at the start of a project. That is the moment when you have to be responsible for the entire testing process and there is no way to step back, only forward. At first it was scary, but now all the advantages of such a trial are obvious to me:

  • You begin to communicate with the whole team more than ever, since there is no proxy for communication: neither a test manager nor fellow testers.
  • Your immersion in the project becomes incredibly deep, and you know about all the components, both in general and in detail.
  • Developers do not look at you as "that guy from testing who doesn't know what he's doing", but as an equal who brings great value to the team with his automated tests and his ability to anticipate bugs in a particular component of the product.
  • As a result, you are more effective, better qualified, and more in demand.

As the project grew, in 100% of cases I became a mentor for new testers, teaching them and passing on the knowledge I had gained myself. At the same time, depending on the project, management did not always give me test automation specialists of the highest level, so I needed either to train them in automation (those who were interested) or to create tools for their everyday work (tools for generating data and loading it into the system, a tool for running load or stability tests "quickly", etc.).

Example of a specific project

Unfortunately, due to non-disclosure obligations, I cannot talk in detail about the projects I have worked on, but I will give examples of typical tasks of a Data Quality Engineer on one of them.

The essence of the project was to implement a platform for preparing data and training machine learning models on it. The customer was a large pharmaceutical company from the USA. Technically, it was a Kubernetes cluster running on AWS EC2 instances, with several microservices and EPAM's underlying Open Source project, Legion, adapted to the needs of the specific customer (the project has since been reborn as odahu). ETL processes were organized with Apache Airflow and moved data from the customer's SalesForce systems to AWS S3 Buckets. Next, a Docker image of a machine learning model was deployed to the platform; the model was trained on fresh data and, through a REST API interface, produced predictions that were of interest to the business and solved specific problems.

Visually, it all looked something like this:

There was plenty of functional testing on this project, and given the speed of feature development and the need to keep up with the pace of the release cycle, we had to think about automating the testing of the system from the start. Most of the Kubernetes-based platform itself was covered by autotests implemented in Robot Framework + Python, but these also had to be maintained and extended.

In addition, for the customer's convenience, a GUI was created for managing the machine learning models deployed to the cluster, along with the ability to specify from where and to where data should be transferred for training the models. This large addition led to an expansion of automated functional checks, most of which were done through REST API calls plus a small number of end-2-end UI tests.

Around the midpoint of all this activity, we were joined by a manual tester who did an excellent job with acceptance testing of product versions and with communicating with the customer about accepting the next release. In addition, thanks to the arrival of a new specialist, we were able to document our work and add several very important manual checks that were difficult to automate right away.

And finally, once the platform and the GUI add-on on top of it had become stable, we began building ETL pipelines using Apache Airflow DAGs. Automated data quality checks were implemented by writing special Airflow DAGs that checked the data based on the results of the ETL process. On this project we were lucky: the customer gave us access to anonymized data sets, which we tested against. We checked the data row by row for type conformity, the presence of broken data, the total number of records before and after, the transformations made by the ETL process for aggregation, changes to column names, and so on. In addition, these checks were scaled to different data sources, for example to MySQL as well as SalesForce.
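As a rough illustration of the row-by-row checks just described, here is a minimal Python sketch. The column names, the rename mapping and the expected types are hypothetical and not taken from the project; in practice a function like this could be called from an Airflow task.

```python
# Hypothetical expectations, for illustration only.
EXPECTED_TYPES = {"account_id": int, "amount": float, "region": str}
RENAMED = {"AccountId": "account_id", "Amount": "amount", "Region": "region"}


def check_rows(source_rows, target_rows):
    """Return a list of human-readable problems found in the ETL output."""
    problems = []

    # 1. Record count before and after the ETL step.
    if len(source_rows) != len(target_rows):
        problems.append(f"row count changed: {len(source_rows)} -> {len(target_rows)}")

    for i, row in enumerate(target_rows):
        # 2. Column names must match the expected renamed set.
        if set(row) != set(RENAMED.values()):
            problems.append(f"row {i}: unexpected columns {sorted(row)}")
            continue
        # 3. Every value must be present and convertible to its expected type.
        for column, caster in EXPECTED_TYPES.items():
            value = row[column]
            if value is None or value == "":
                problems.append(f"row {i}: {column} is empty")
                continue
            try:
                caster(value)
            except ValueError:
                problems.append(f"row {i}: {column}={value!r} is not {caster.__name__}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every defect in a data set in a single run.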

The final data quality checks were carried out at the S3 level, where the data was stored and ready to be used for training machine learning models. To fetch the data from the final CSV file in the S3 Bucket and validate it, code was written using boto3 clients.
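A check of this kind might look roughly like the sketch below. It is not the project's code: the function name, the expected header and the validation rules are assumptions made for illustration. The only boto3 call it relies on is `get_object`, whose response contains a readable `Body` stream.

```python
import csv
import io


def validate_s3_csv(s3_client, bucket, key, expected_columns):
    """Download a CSV object and return a list of problems found in it."""
    # get_object returns a dict whose "Body" is a readable byte stream.
    response = s3_client.get_object(Bucket=bucket, Key=key)
    text = response["Body"].read().decode("utf-8")

    reader = csv.DictReader(io.StringIO(text))
    problems = []
    if reader.fieldnames != expected_columns:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems

    row_count = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        row_count += 1
        # DictReader stores None for missing cells and uses a None key for extra cells.
        if None in row or any(v is None or v.strip() == "" for v in row.values()):
            problems.append(f"line {line_no}: missing or extra values")
    if row_count == 0:
        problems.append("file contains no data rows")
    return problems
```

With real AWS credentials, the client would be created with `boto3.client("s3")` and passed in as `s3_client`. Passing the client as a parameter also makes it possible to exercise the check against a stub in unit tests.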

The customer also required that part of the data be stored in one S3 Bucket and part in another. This, too, required writing additional checks to verify that the sorting was reliable.

General experience from other projects

An example of a generalized list of a Data Quality engineer's activities:
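A check on such a split can be sketched as follows. The routing rule (`belongs_to_a`), the `id` field and the in-memory row lists are hypothetical; the point is only that every row must land in the right bucket, exactly once.

```python
def check_split(source_ids, bucket_a_rows, bucket_b_rows, belongs_to_a):
    """Verify that rows were routed to the right bucket and none were lost."""
    problems = []
    # Every row must sit in the bucket that the routing rule assigns it to.
    for row in bucket_a_rows:
        if not belongs_to_a(row):
            problems.append(f"row {row['id']} should be in bucket B")
    for row in bucket_b_rows:
        if belongs_to_a(row):
            problems.append(f"row {row['id']} should be in bucket A")

    # No row may be duplicated across or within buckets, and none may be missing.
    seen = [row["id"] for row in bucket_a_rows] + [row["id"] for row in bucket_b_rows]
    if len(seen) != len(set(seen)):
        problems.append("some rows are stored more than once")
    missing = set(source_ids) - set(seen)
    if missing:
        problems.append(f"rows missing from both buckets: {sorted(missing)}")
    return problems
```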

  • Prepare test data (valid/invalid, large/small) with an automated tool.
  • Load the prepared data set into the original source and check that it is ready for use.
  • Launch the ETL processes that move the data set from the source storage to the final or intermediate storage using a certain set of settings (if possible, set configurable parameters for the ETL task).
  • Verify the data processed by the ETL process for its quality and its compliance with business requirements.

At the same time, the main focus of the checks should be not only on the fact that the data flow in the system has, in principle, worked and run to completion (which is part of functional testing), but mostly on checking and validating the data for compliance with the expected requirements, identifying anomalies, and so on.

Tools
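The first step in the list above, preparing test data with a known share of invalid records, can be sketched like this. The field names and the three defect types are made up for illustration; a real generator would mirror the project's actual schema and business rules.

```python
import random


def make_test_rows(count, invalid_ratio, seed=0):
    """Generate `count` rows, roughly `invalid_ratio` of them deliberately broken."""
    rng = random.Random(seed)  # a fixed seed keeps test runs reproducible
    rows = []
    for i in range(count):
        row = {"id": i, "email": f"user{i}@example.com", "age": rng.randint(18, 90)}
        if rng.random() < invalid_ratio:
            # Break exactly one field so each invalid row has a known defect.
            defect = rng.choice(["null_email", "negative_age", "bad_email"])
            if defect == "null_email":
                row["email"] = None
            elif defect == "negative_age":
                row["age"] = -row["age"]
            else:
                row["email"] = "not-an-email"
            row["expected_defect"] = defect
        else:
            row["expected_defect"] = None
        rows.append(row)
    return rows
```

Recording the injected defect next to each row lets the later verification step assert that the ETL process rejected exactly the rows it was supposed to reject.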

One technique for this kind of data control is to organize chained checks at every stage of data processing, the so-called "data chain" in the literature: control of the data from the source to the point of final use. Checks of this kind are most often implemented as SQL queries. Clearly, such queries should be as lightweight as possible and should each check an individual aspect of data quality (table metadata, blank rows, NULLs, syntax errors and other attributes that need checking).
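To make the idea of lightweight SQL checks concrete, here is a small sketch that uses an in-memory SQLite database. The `orders` table, its columns and the three checks are invented for the example; on a real project the same queries would run against the actual storage at each stage of the chain.

```python
import sqlite3

# Each check is a cheap query that counts offending rows (or duplicated keys).
CHECKS = {
    "null_customer_id": "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
    "empty_status": "SELECT COUNT(*) FROM orders WHERE TRIM(status) = ''",
    "duplicate_order_id": (
        "SELECT COUNT(*) FROM ("
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1)"
    ),
}


def run_checks(connection):
    """Run every check and return {check_name: count} for the ones that fail."""
    failures = {}
    for name, query in CHECKS.items():
        (count,) = connection.execute(query).fetchone()
        if count:
            failures[name] = count
    return failures


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT)")
connection.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, "paid"), (2, None, "new"), (2, 11, " "), (3, 12, "paid")],
)
print(run_checks(connection))
```

Keeping each check as a separate one-line query makes it cheap to run them after every stage and easy to see which aspect of quality failed.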

In the case of regression testing, which uses ready-made (unchanging or only slightly changing) data sets, the autotest code can store ready-made templates for checking that the data meets quality requirements (descriptions of the expected table metadata, sample row objects that can be selected at random during the test, and so on).
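A stored metadata template of this kind could be compared with the actual table as in the sketch below, which again uses SQLite for the sake of a self-contained example. The template contents and the `schema_diff` helper are hypothetical.

```python
import sqlite3

# Template kept with the autotest code: expected column names and declared types.
EXPECTED_SCHEMA = [("order_id", "INTEGER"), ("customer_id", "INTEGER"), ("status", "TEXT")]


def schema_diff(connection, table, expected):
    """Compare the actual table metadata with the stored template."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    # The table name is interpolated directly, so it must come from trusted test code.
    actual = [(row[1], row[2]) for row in connection.execute(f"PRAGMA table_info({table})")]
    missing = [column for column in expected if column not in actual]
    unexpected = [column for column in actual if column not in expected]
    return {"missing": missing, "unexpected": unexpected}
```

An empty `missing` and an empty `unexpected` list mean that the table still matches the template; anything else points at a schema change that the regression suite should flag.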

Also, during testing, you have to write test ETL processes using frameworks such as Apache Airflow or Apache Spark, or even a black-box cloud tool such as GCP Dataprep or GCP Dataflow. This forces the test engineer to immerse himself in the principles of how these tools work, which in turn lets him both carry out functional testing more effectively (for example, of the ETL processes that already exist on the project) and use the tools to check data. In particular, Apache Airflow has ready-made operators for working with popular analytical databases, for example GCP BigQuery. The most basic example of its use has already been described here, so I will not repeat myself.

Apart from ready-made solutions, nothing stops you from implementing your own techniques and tools. This benefits not only the project but also the Data Quality Engineer himself, who thereby broadens his technical horizons and improves his coding skills.

How it works on a real project

A good illustration of the last paragraphs about the "data chain", ETL and ubiquitous checks is the following process from one of the real projects:

Here, various data (prepared by us, naturally) arrive at the input of our system: valid, invalid, mixed, etc. They are then filtered and end up in intermediate storage, after which they undergo another series of transformations and are placed in the final storage, which in turn is used for analytics, building data marts and searching for business insights. In such a system, rather than functionally checking the operation of the ETL processes, we focus on the quality of the data before and after the transformations, as well as on what goes out to analytics.
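One simple check that fits this kind of flow is a count reconciliation between stages: whatever enters a stage must leave it either as an accepted or as a rejected record. The sketch below assumes that each stage reports both numbers and that the stages do not aggregate rows; the stage names and figures in the usage example are invented.

```python
def reconcile(stage_counts):
    """Check that no records silently disappear between pipeline stages.

    `stage_counts` is an ordered list of (stage_name, accepted, rejected) tuples.
    """
    problems = []
    # Compare every stage with the one directly before it.
    for (prev_name, prev_accepted, _), (name, accepted, rejected) in zip(stage_counts, stage_counts[1:]):
        if accepted + rejected != prev_accepted:
            problems.append(
                f"{prev_name} -> {name}: {prev_accepted} in, "
                f"{accepted} accepted + {rejected} rejected out"
            )
    return problems


stages = [("input", 1000, 0), ("intermediate", 950, 50), ("final", 940, 5)]
print(reconcile(stages))
```

In this example five records vanish between the intermediate and the final storage, which is exactly the kind of anomaly that a purely functional "the pipeline finished" check would miss.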

To summarize the above: regardless of where I worked, everywhere I was involved in Data projects that shared the following features:

  • Only through automation can you test certain cases and achieve a release cycle acceptable to the business.
  • A tester on such a project is one of the most respected members of the team, because he brings great benefit to each of the participants (faster testing, good data for the Data Scientist, defects identified at early stages).
  • It doesn't matter whether you work on your own hardware or in the cloud: all resources are abstracted into a cluster such as Hortonworks, Cloudera, Mesos, Kubernetes, etc.
  • Projects are built on a microservice approach, and distributed and parallel computing predominates.

I would like to note that when testing in the field of Data Quality, a testing specialist shifts his professional focus to the code of the product and the tools used.

Distinctive features of Data Quality testing

In addition, for myself, I have identified the following distinctive features (I will say right away that they are VERY generalized and purely subjective) of testing in Data (Big Data) projects (systems) compared with other areas:

Useful links

  1. Theory: DAMA-DMBOK: Data Management Body of Knowledge: 2nd Edition.
  2. EPAM Training Center
  3. Recommended materials for Data Quality engineers:
    1. Free course on Stepik: Introduction to databases
    2. Course on LinkedIn Learning: Data Science Foundations: Data Engineering.
    3. Articles:
    4. Video:

Conclusion

Data Quality is a very young, promising field; being part of it means being part of a startup. Once you are in Data Quality, you will be immersed in a large number of modern, in-demand technologies, but most importantly, huge opportunities will open up for you to generate and implement your own ideas. You will be able to apply the continuous improvement approach not only to the project but also to yourself, continuously developing as a specialist.

Source: www.habr.com
