Who are data engineers, and how do you become one?

Hello again! The title of the article speaks for itself. Ahead of the start of the "Data Engineer" course, we suggest you find out who data engineers are. The article contains many useful links. Happy reading.

A simple guide to catching the Data Engineering wave without letting it drag you into the abyss.

It seems like everyone wants to become a Data Scientist these days. But what about Data Engineering? In essence, it is a hybrid of a data analyst and a data scientist; a data engineer is usually responsible for managing workflows, processing pipelines, and ETL processes. Because these functions matter so much, it is currently another popular professional buzzword that is actively gaining momentum.

High salaries and huge demand are only a small part of what makes this job so attractive! If you want to join the ranks of the heroes, it is never too late to start learning. In this post, I have gathered all the information you need to take your first steps.

So, let's get started!

What is Data Engineering?

Honestly, there is no better explanation than this:

"A scientist can discover a new star, but he cannot make one. He would have to ask an engineer to do it for him."

- Gordon Lindsay Glegg

So, the role of the data engineer is an important one.

As the name suggests, data engineering deals with data, namely its delivery, storage, and processing. Accordingly, the main task of engineers is to provide reliable infrastructure for data. If we look at the AI hierarchy of needs, data engineering occupies the first 2-3 stages: collection, movement and storage, and data preparation.

What does a data engineer do?

With the advent of big data, the scope of responsibility has changed dramatically. Previously, these specialists wrote large SQL queries and moved data with tools such as Informatica ETL, Pentaho ETL, and Talend; today the requirements for data engineers have grown.

Most companies with open data engineer positions list the following requirements:

  • Excellent knowledge of SQL and Python.
  • Experience with cloud platforms, in particular Amazon Web Services.
  • Knowledge of Java / Scala is preferred.
  • Good understanding of SQL and NoSQL databases (data modelling, data warehousing).

Keep in mind, these are only the essentials. From this list, one can assume that data engineers are specialists in software engineering and backend development.
For example, if a company starts generating a large amount of data from different sources, your task as a data engineer is to organize the collection of that information, its processing, and its storage.

The list of tools used in this case may vary; it all depends on the volume of the data, the speed at which it arrives, and its heterogeneity. Most companies do not deal with big data at all, so as a central repository, the so-called data warehouse, you can use an SQL database (PostgreSQL, MySQL, etc.) with a small set of scripts that feed data into the warehouse.
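
To give a feel for what "a small set of scripts that feed data into the warehouse" can look like, here is a minimal sketch. It is only an illustration under stated assumptions: the `orders` table, the sample CSV, and the `load_orders` helper are all made up, and SQLite from the Python standard library stands in for PostgreSQL or MySQL.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (in practice, a file or an API response).
RAW_CSV = """order_id,customer,amount
1,alice,19.90
2,bob,5.00
3,alice,42.50
"""

def load_orders(conn, raw_csv):
    """Parse CSV text and insert its rows into the `orders` table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rows = [
        (int(r["order_id"]), r["customer"], float(r["amount"]))
        for r in csv.DictReader(io.StringIO(raw_csv))
    ]
    # INSERT OR REPLACE (SQLite syntax) keeps the load idempotent if the script is re-run.
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection
loaded = load_orders(conn, RAW_CSV)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(loaded, round(total, 2))
```

Making the load idempotent is a design choice worth copying: a script that can be safely re-run after a failure is much easier to schedule and operate.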

IT giants such as Google, Amazon, Facebook, or Dropbox have higher requirements:

  • Knowledge of Python, Java, or Scala.
  • Experience with big data: Hadoop, Spark, Kafka.
  • Knowledge of algorithms and data structures.
  • Understanding of the fundamentals of distributed systems.
  • Experience with data visualization tools such as Tableau or ElasticSearch is a plus.

In other words, there is a clear shift toward big data, namely toward processing it under high load. These companies place higher demands on system fault tolerance.

Data engineers vs. data scientists

[Image: a humorous comparison of data engineers and data scientists]
Okay, that was a simple and funny comparison (nothing personal), but in reality things are much more complicated.

First, you should know that there is a lot of ambiguity in how the roles and skills of a data scientist and a data engineer are delineated. That is, you can easily get confused about which skills are needed to be a successful data engineer. Of course, some skills overlap between the two roles. But there are also a number of skills that are diametrically opposed.

Data science is a serious business, but we are moving toward a world of functional data science in which practitioners can do their own analytics. To enable data pipelines and integrated data structures, you need data engineers, not data scientists.

Is a data engineer more in demand than a data scientist?

- Yes, because before you can make a carrot cake, you first need to harvest, peel, and stock up on carrots!

A data engineer understands programming better than any data scientist, but when it comes to statistics, the opposite is true.

But here is the advantage of a data engineer:

Without a data engineer, the value of a prototype model, which most often consists of a piece of terrible-quality code in a Python file, obtained from a data scientist and somehow producing a result, tends to zero.

Without a data engineer, this code will never become a project, and no business problem will be solved effectively. The data engineer is the one trying to turn all of this into a product.

Basic knowledge a data engineer should have

So, if this job lights a spark in you and you are enthusiastic, you can learn it, master all the necessary skills, and become a real rock star in the field of data engineering. And yes, you can pull this off even without programming skills or other technical knowledge. It is hard, but possible!

What are the first steps?

You should have a general idea of what is what.

First of all, Data Engineering belongs to computer science. More specifically, you must understand efficient algorithms and data structures. Second, since data engineers work with data, you need to understand the principles of databases and the structures that underlie them.

For example, conventional SQL databases are built on the B-Tree data structure, while modern distributed repositories rely on the LSM-Tree and other modifications of hash tables.
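
To make the LSM-Tree idea a little more tangible, here is a toy sketch. It is my own simplification, not how any particular database implements it: writes go to an in-memory table, full tables are flushed as immutable sorted segments, reads check the newest data first, and compaction merges segments. The class name `TinyLSM` and the `memtable_limit` parameter are hypothetical.

```python
import bisect

class TinyLSM:
    """Toy LSM-tree: an in-memory memtable plus immutable sorted segments."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # recent writes, mutable
        self.segments = []   # flushed segments, oldest first; each a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as one sorted, immutable segment.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search the newest segment first so that later writes win.
        for segment in reversed(self.segments):
            # The key list is rebuilt per lookup for brevity; a real store keeps an index.
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None

    def compact(self):
        # Merge all segments into one, keeping only the newest value per key.
        merged = {}
        for segment in self.segments:  # oldest to newest, so newer values overwrite
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(key, value)
print(db.get("a"), db.get("b"), len(db.segments))
db.compact()
print(db.segments)
```

The point of the design is that writes are always cheap appends, while the cost of keeping data sorted is paid later during flushes and compaction.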

*These steps are based on a great article by Adil Khashtamov. So, if you know Russian, support this author and read the original post.

1. Algorithms and data structures

Using the right data structure can drastically improve the performance of an algorithm. Ideally, we would all learn about data structures and algorithms at school, but this is rarely covered. In any case, it is never too late to get acquainted.
So here are my favorite free courses for learning data structures and algorithms:

Also, don't forget Thomas Cormen's classic work on algorithms, Introduction to Algorithms. It is the perfect reference when you need to refresh your memory.

  • To improve your skills, use Leetcode.
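
To make the earlier point about choosing the right data structure concrete, here is a minimal sketch that computes the same result twice: once with list membership checks and once with a set. The function names and sample data are made up for illustration, and the timings will differ from machine to machine.

```python
import timeit

def common_ids_list(a, b):
    """O(len(a) * len(b)): `x in b` scans the list on every check."""
    return [x for x in a if x in b]

def common_ids_set(a, b):
    """O(len(a) + len(b)) on average: set membership is a hash lookup."""
    b_set = set(b)
    return [x for x in a if x in b_set]

a = list(range(0, 2000, 2))  # even numbers
b = list(range(0, 2000, 3))  # multiples of three

# Both versions return the same answer; only the cost differs.
assert common_ids_list(a, b) == common_ids_set(a, b)

slow = timeit.timeit(lambda: common_ids_list(a, b), number=5)
fast = timeit.timeit(lambda: common_ids_set(a, b), number=5)
print(f"list: {slow:.4f}s  set: {fast:.4f}s")
```

On inputs of this size the set version is usually much faster, and the gap widens as the inputs grow.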

You can also dive into the world of databases with the excellent videos from Carnegie Mellon University on YouTube.

2. Learn SQL

Our whole life is data. And to extract this data from a database, you need to "speak" the same language it does.

SQL (Structured Query Language) is the lingua franca of data. No matter what anyone says, SQL has lived, is alive, and will live on for a very long time.

If you have been in development for a long time, you have probably noticed that rumors of the imminent death of SQL surface periodically. The language was developed in the early 1970s and is still hugely popular among analysts, developers, and enthusiasts.
Without knowledge of SQL there is nothing to do in data engineering, since you will inevitably have to write queries to retrieve data. All modern big data warehouses support SQL:

  • Amazon Redshift
  • HP Vertica
  • Oracle
  • SQL Server

... and many others.

To analyze large volumes of data stored in distributed systems such as HDFS, SQL engines were invented: Apache Hive, Impala, etc. See, SQL is not going anywhere.

How do you learn SQL? Just practice it.

For this, I recommend an excellent tutorial from Mode Analytics, which, by the way, is free.

  1. Intermediate SQL
  2. Joining Data in SQL

What makes these courses special is that they come with an interactive environment where you can write and run SQL queries right in your browser. The Modern SQL resource is also worth a look. And you can apply this knowledge to the Leetcode problems in the Databases section.
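
If you want something to practice on locally, here is a minimal sketch of a join with aggregation. The `customers` and `orders` tables and their contents are invented for the example, and SQLite (via Python's standard library) is used only because it needs no setup; other databases may differ in small syntax details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 30.0), (3, 2, 5.0);
""")

# LEFT JOIN keeps customers without orders; COALESCE turns their NULL total into 0.
query = """
    SELECT c.name, COUNT(o.id) AS n_orders, COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC, c.name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Try switching `LEFT JOIN` to `JOIN` and watch the customer without orders disappear from the result; small experiments like this are the fastest way to build intuition.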

3. Programming in Python and Java / Scala

I have already written about why you should learn the Python programming language in the article Python vs R. Choosing the Best Tool for AI, ML and Data Science. As for Java and Scala, most of the tools for storing and processing huge amounts of data are written in these languages. For example:

  • Apache Kafka (Scala)
  • Hadoop, HDFS (Java)
  • Apache Spark (Scala)
  • Apache Cassandra (Java)
  • HBase (Java)
  • Apache Hive (Java)

To understand how these tools work, you need to know the languages they are written in. Scala's functional approach lets you solve parallel data processing problems efficiently. Python, unfortunately, cannot boast of speed or parallel processing. In general, knowing several languages and programming paradigms broadens the range of approaches you can bring to a problem.
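
Here is a rough sketch of why the functional style suits parallel data processing: if each step is a pure function, the data can be split into chunks, processed independently, and the partial results merged. The example is in Python to keep one language throughout this post; the sample lines and the function names are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_words(chunk):
    """Map step: a pure function from a chunk of lines to word counts."""
    return Counter(word.lower() for line in chunk for word in line.split())

def merge(left, right):
    """Reduce step: combine two partial results without mutating either."""
    return left + right

lines = [
    "Spark and Kafka",
    "Kafka and Hadoop",
    "spark spark",
    "hadoop",
]
chunks = [lines[i:i + 2] for i in range(0, len(lines), 2)]

# Because count_words has no side effects, chunks can be processed in any order
# or concurrently. Threads are used here only to show the shape of the code;
# CPU-bound work in CPython usually needs processes to truly run in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(count_words, chunks))

totals = reduce(merge, partials, Counter())
print(totals.most_common(2))
```

The same map-then-merge shape appears in distributed engines, where the chunks live on different machines instead of in one list.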

To dive into Scala, you can read Programming in Scala by the author of the language. Twitter has also published a good introductory guide, Scala School.

As for Python, I consider Fluent Python the best intermediate-level book.

4. Tools for working with big data

Here is a list of the most popular tools in the big data world:

  • Apache Spark
  • Apache Kafka
  • Apache Hadoop (HDFS, HBase, Hive)
  • Apache Cassandra

You can find more information about the building blocks of big data in this amazing interactive environment. The most popular tools are Spark and Kafka. They are definitely worth studying, and it is advisable to understand how they work from the inside. Jay Kreps (co-author of Kafka) published a monumental piece in 2013, The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction. By the way, the main ideas from this tome were used to create Apache Kafka.
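
To illustrate the log abstraction in a few lines, here is a toy sketch of an append-only log with independent readers. This is not Kafka's API, and the `Log` and `Consumer` classes are hypothetical; it only shows the core idea that records are appended in order, addressed by offset, and each reader tracks its own position.

```python
class Log:
    """Toy append-only log: records are never modified and are addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own position, so readers do not affect each other."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
for event in ["signup:alice", "login:alice", "signup:bob"]:
    log.append(event)

fast, slow = Consumer(log), Consumer(log)
fast_batch = fast.poll()               # reads everything appended so far
slow_batch = slow.poll(max_records=1)  # reads only the first record
log.append("login:bob")
fast_next = fast.poll()                # picks up only the new record
slow_next = slow.poll()                # catches up from where it left off
print(fast_next, slow_next)
```

Because reading never removes anything, a slow consumer can fall behind and catch up later without losing data or blocking anyone else.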

5. Cloud platforms

Knowledge of at least one cloud platform is on the list of basic requirements for applicants to a data engineer position. Employers prefer Amazon Web Services, with Google Cloud Platform in second place and Microsoft Azure rounding out the top three.

You should be well versed in Amazon EC2, AWS Lambda, Amazon S3, and DynamoDB.

6. Distributed systems

Working with big data implies clusters of independently operating computers that communicate with each other over a network. The larger the cluster, the greater the likelihood that some of its nodes will fail. To become a great data expert, you need to understand the problems of distributed systems and the existing solutions to them. This area is old and complex.

Andrew Tanenbaum is considered a pioneer in this field. For those who are not afraid of theory, I recommend his book "Distributed Systems". It may seem daunting to beginners, but it will really help you sharpen your skills.

I consider Designing Data-Intensive Applications by Martin Kleppmann the best introductory book. By the way, Martin has a wonderful blog. His work will help you systematize your knowledge about building a modern infrastructure for storing and processing big data.
For those who prefer watching videos, there is a course on YouTube, Distributed Computer Systems.

7. Data pipelines

Data pipelines are something you cannot live without as a data engineer.

Much of the time, a data engineer builds a so-called data pipeline, that is, creates the process of delivering data from one place to another. These can be custom scripts that call an external service API or run an SQL query, enrich the data, and put it into centralized storage (a data warehouse) or unstructured data storage (a data lake).
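
A custom pipeline script of this kind can be sketched as three small stages chained together. Everything here is a stand-in: `extract` fakes the response of an external API with hard-coded records, the enrichment rules are invented, and an in-memory buffer plays the role of a file in a data lake.

```python
import io
import json

def extract():
    """Stand-in for a call to an external service API (hypothetical payload)."""
    yield {"user_id": 1, "country": "de", "amount_cents": 1990}
    yield {"user_id": 2, "country": "us", "amount_cents": 500}
    yield {"user_id": 3, "country": None, "amount_cents": 4250}

def enrich(records):
    """Transform step: normalize fields and add derived ones."""
    for r in records:
        yield {
            **r,
            "country": (r["country"] or "unknown").upper(),
            "amount": r["amount_cents"] / 100,
        }

def load(records, sink):
    """Load step: write JSON Lines to a file-like sink; return the number written."""
    n = 0
    for r in records:
        sink.write(json.dumps(r) + "\n")
        n += 1
    return n

sink = io.StringIO()  # stand-in for a file in a data lake or a staging area
written = load(enrich(extract()), sink)
print(written)
print(sink.getvalue(), end="")
```

Writing the stages as generators means records stream through one at a time, so the same structure keeps working when the input no longer fits in memory.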

Summary: the basic checklist for a data engineer

To summarize, a good understanding of the following is required:

  • Information systems;
  • Software development (Agile, DevOps, design techniques, SOA);
  • Distributed systems and parallel programming;
  • Database fundamentals: planning, design, operation, and troubleshooting;
  • Design of experiments: A/B tests to prove concepts, determine reliability and system performance, and develop reliable ways to deliver good solutions quickly.

These are just a few of the requirements for becoming a data engineer, so learn and understand data systems, information systems, continuous delivery / deployment / integration, programming languages, and other computer science topics (though not every subject area).

And finally, the last but very important thing I want to say.

The path to becoming a Data Engineer is not as simple as it might seem. It is unforgiving and frustrating, and you have to be ready for that. Some moments on this journey may push you to give up. But this is real work and a learning process.

Just don't sugarcoat it from the start. The whole point of the journey is to learn as much as possible and to be ready for new challenges.
Here is a great picture I found that illustrates this point well:

[Image]

And yes, remember to avoid burnout and to rest. This is also very important. Good luck!

What do you think of the article, friends? We invite you to a free webinar, which will take place today at 20:00. During the webinar, we will discuss how to build an efficient and scalable data processing system for a small company or startup at minimal cost. For practice, we will get acquainted with the Google Cloud data processing tools. See you there!

Source: www.habr.com
