Who are data engineers, and how do you become one?

Hello again! The title of the article speaks for itself. Ahead of the start of the Data Engineer course, we suggest you figure out who data engineers are. The article contains many useful links. Happy reading.

A simple guide on how to catch the data engineering wave and not let it drag you under.

It seems like everyone wants to become a Data Scientist these days. But what about Data Engineering? In essence, the role is a hybrid of a data analyst and a data scientist: a data engineer is typically responsible for managing workflows, processing pipelines, and ETL processes. Because these functions matter so much, data engineering is currently another popular professional buzzword that is gaining momentum.

High salaries and huge demand are only a small part of what makes this job so attractive! If you want to join the ranks of heroes, it is never too late to start learning. In this post, I have gathered all the information you need to take your first steps.

So, let's get started!

What is Data Engineering?

Frankly speaking, there is no better explanation than this:

"A scientist can discover a new star, but he cannot make one. He would have to ask an engineer to do it for him."

- Gordon Lindsay Glegg

So, the role of a data engineer is really significant.

As the name suggests, data engineering deals with data, namely its delivery, storage, and processing. Accordingly, the main task of a data engineer is to provide a reliable infrastructure for data. If we look at the AI hierarchy of needs, data engineering covers the first 2-3 stages: collection, movement and storage, and data preparation.

What does a data engineer do?

With the advent of big data, the scope of responsibility has changed dramatically. Previously these specialists wrote large SQL queries and moved data with tools such as Informatica ETL, Pentaho ETL, and Talend; now the requirements for data engineers have grown.

Most companies with open data engineer positions list the following requirements:

  • Excellent knowledge of SQL and Python.
  • Experience with cloud platforms, particularly Amazon Web Services.
  • Knowledge of Java/Scala is preferred.
  • A good understanding of SQL and NoSQL databases (data modeling, data warehousing).

Keep in mind that these are only the essentials. Judging by this list, data engineers can be seen as specialists in software development and backend work.
For example, if a company starts generating a large amount of data from various sources, your task as a data engineer is to organize the collection of that information, its processing, and its storage.

The list of tools used in this case may vary; it all depends on the volume of the data, the rate at which it arrives, and its heterogeneity. Most companies do not deal with big data at all, so as a central repository, the so-called data warehouse, you can use an SQL database (PostgreSQL, MySQL, etc.) with a small set of scripts that load data into the warehouse.
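
To make the "small set of scripts" idea concrete, here is a minimal sketch of one such loading script. It is only an illustration: the `orders` table, the `RAW_CSV` sample, and the `load_orders` helper are made up for this example, and an in-memory SQLite database stands in for PostgreSQL or MySQL.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or an export
# from another system.
RAW_CSV = """order_id,customer,amount
1,alice,19.90
2,bob,5.00
3,alice,12.50
"""


def load_orders(conn, csv_text):
    """Parse CSV text and load the rows into the `orders` table.

    Returns the number of rows read from the input.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    rows = [
        (int(r["order_id"]), r["customer"], float(r["amount"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    # INSERT OR REPLACE keeps the script safe to re-run on the same input.
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection
loaded = load_orders(conn, RAW_CSV)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(loaded, round(total, 2))
```

Making the load idempotent (here via the primary key plus `INSERT OR REPLACE`) is a common design choice, because such scripts tend to be re-run after failures.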

IT giants such as Google, Amazon, Facebook, or Dropbox have higher requirements:

  • Knowledge of Python, Java, or Scala.
  • Experience with big data: Hadoop, Spark, Kafka.
  • Knowledge of algorithms and data structures.
  • An understanding of the fundamentals of distributed systems.
  • Experience with data visualization tools such as Tableau or ElasticSearch is a plus.

In other words, there is a clear shift toward big data, namely toward processing it under high load. These companies also have stricter requirements for system fault tolerance.

Data Engineers vs. Data Scientists

Okay, that was a simple and funny comparison (nothing personal), but in reality it is much more complicated.

First, you should know that there is a lot of ambiguity in how the roles and skills of a data scientist and a data engineer are defined. As a result, you can easily get confused about which skills you need to be a successful data engineer. Of course, certain skills overlap between the two roles. But there are also a number of diametrically opposed skills.

Data science is serious business, but we are moving toward a world of functional data science in which practitioners can do their own analytics. To enable data pipelines and integrated data structures, you need data engineers, not data scientists.

Is a data engineer more in demand than a data scientist?

- Yes, because before you can make a carrot cake, you first need to harvest, peel, and stock up on carrots!

A data engineer understands programming better than any data scientist, but when it comes to statistics, the opposite is true.

But here is the advantage of a data engineer:

Without a data engineer, the value of a prototype model tends to zero. Such a prototype is most often a piece of poor-quality code in a Python file, received from a data scientist, that somehow produces a result.

Without a data engineer, this code will never become a project, and no business problem will be solved effectively. The data engineer is the one who tries to turn all of this into a product.

Basic knowledge a data engineer should have

So, if this job lights a spark in you and you are enthusiastic about it, you can learn it, master all the necessary skills, and become a real rock star in the data engineering field. And yes, you can pull this off even without programming skills or other technical knowledge. It is hard, but possible!

What are the first steps?

You should have a general idea of what is what.

First of all, Data Engineering belongs to computer science. Specifically, you should understand efficient algorithms and data structures. Second, since data engineers work with data, you need to understand how databases work and which structures underlie them.

For example, conventional SQL databases are built on the B-tree data structure, while modern distributed stores rely on the LSM-tree and various hash table modifications.

*These steps are based on an excellent article by Adil Khashtamov. So, if you know Russian, support the author and read his post.

1. Algorithms and data structures

Using the right data structure can significantly improve the performance of an algorithm. Ideally, we would all learn about data structures and algorithms at school, but they are rarely covered there. In any case, it is never too late to get acquainted with them.
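
As a small illustration of that point, the sketch below compares membership checks on a list and on a set holding the same values. The sizes and the `count_hits` helper are arbitrary choices for this example; the exact timings depend on the machine, so none are quoted here.

```python
import timeit

ids = list(range(50_000))
id_list = ids          # membership test scans elements one by one: O(n)
id_set = set(ids)      # membership test uses hashing: O(1) on average

# Values near the end of the list, which is the slow case for a linear scan.
lookups = range(49_000, 49_100)


def count_hits(container):
    """Count how many of the lookup values are present in the container."""
    return sum(1 for x in lookups if x in container)


list_time = timeit.timeit(lambda: count_hits(id_list), number=1)
set_time = timeit.timeit(lambda: count_hits(id_set), number=1)
print(f"list: {list_time:.4f}s, set: {set_time:.6f}s")
```

Both containers give the same answer; only the cost of getting it differs, and the gap widens as the data grows.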
So here are my favorite free courses for learning data structures and algorithms:

And don't forget the classic work on algorithms by Thomas Cormen - Introduction to Algorithms. It is the perfect reference when you need to refresh your memory.

You can also dive into the world of databases with the excellent videos from Carnegie Mellon University on YouTube:

2. Learn SQL

Our whole life is data. And to extract that data from a database, you need to "speak" the same language as it does.

SQL (Structured Query Language) is the language for communicating with a database. No matter what anyone says, SQL has lived, is alive, and will live for a very long time.

If you have been in development for a long time, you have probably noticed that rumors about the imminent death of SQL appear periodically. The language was developed in the early 70s and is still extremely popular among analysts, developers, and enthusiasts.
Without SQL knowledge you cannot get anything done in data engineering, since you will inevitably have to build queries to retrieve data. All modern big data warehouses support SQL:

  • Amazon Redshift
  • HP Vertica
  • Oracle
  • SQL Server

... and many others.

To analyze large volumes of data stored in distributed systems such as HDFS, SQL engines were invented: Apache Hive, Impala, etc. See, it is not going anywhere.

How do you learn SQL? Just do it in practice.
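
For a quick way to practice locally, here is a minimal sketch that runs a typical join-and-aggregate query against an in-memory SQLite database from Python. The `customers` and `orders` tables and their contents are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 15.0), (3, 2, 5.0);
""")

# LEFT JOIN keeps customers without orders; COALESCE turns their NULL sum into 0.
query = """
SELECT c.name, COUNT(o.id) AS n_orders, COALESCE(SUM(o.amount), 0) AS total
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY total DESC, c.name
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Swapping `LEFT JOIN` for `JOIN` and watching the customer without orders disappear is a good first exercise.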

For this, I would recommend checking out an excellent tutorial, which, by the way, is free, from Mode Analytics.

  1. Intermediate SQL
  2. Joining Data in SQL

What makes these courses special is that they have an interactive environment where you can write and run SQL queries right in your browser. The Modern SQL resource will not hurt either. And you can apply this knowledge to the Leetcode tasks in the Databases section.

3. Programming in Python and Java/Scala

I have already written about why you should learn Python in the article Python vs R. Choosing the Best Tool for AI, ML & Data Science. As for Java and Scala, most of the tools for storing and processing huge amounts of data are written in these languages. For example:

  • Apache Kafka (Scala)
  • Hadoop, HDFS (Java)
  • Apache Spark (Scala)
  • Apache Cassandra (Java)
  • HBase (Java)
  • Apache Hive (Java)

To understand how these tools work, you need to know the languages they are written in. Scala's functional approach lets you solve parallel data processing problems effectively. Python, unfortunately, cannot boast of speed or parallel processing. In general, knowing several languages and programming paradigms broadens the range of approaches you can take to a problem.

To dive into the Scala language, you can read Programming in Scala by the creator of the language. Twitter has also published a good introductory guide - Scala School.

As for Python, I believe Fluent Python is the best intermediate-level book.

4. Tools for working with big data

Here is a list of the most popular tools in the big data world:

  • Apache Spark
  • Apache Kafka
  • Apache Hadoop (HDFS, HBase, Hive)
  • Apache Cassandra

You can find more information about the building blocks of big data in this awesome interactive environment. The most popular tools are Spark and Kafka. They are definitely worth studying, and it is good to understand how they work on the inside. In 2013 Jay Kreps (co-author of Kafka) published a monumental work, The Log: What every software engineer should know about real-time data's unifying abstraction. By the way, the core ideas from this tome were used to create Apache Kafka.

5. Cloud platforms

Knowledge of at least one cloud platform is on the list of basic requirements for data engineer applicants. Employers prefer Amazon Web Services, with Google Cloud Platform in second place and Microsoft Azure rounding out the top three.

You should have a good knowledge of Amazon EC2, AWS Lambda, Amazon S3, and DynamoDB.

6. Distributed systems

Working with big data implies clusters of independently operating computers that communicate with each other over a network. The larger the cluster, the greater the likelihood that some of its member nodes will fail. To become a great data engineer, you need to understand the problems of distributed systems and the existing solutions to them. This field is old and complex.
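
The link between cluster size and failures can be sketched with a simple model. Assuming, purely for illustration, that each node fails independently with the same probability p during some time window, the probability that at least one of n nodes fails is 1 - (1 - p)^n. The function name and the value p = 0.001 below are hypothetical; real failures are often correlated, so treat this as intuition rather than a reliability estimate.

```python
def p_any_failure(p_node, n_nodes):
    """Probability that at least one of n independent nodes fails,
    given the per-node failure probability p_node."""
    return 1.0 - (1.0 - p_node) ** n_nodes


# Hypothetical per-node failure probability of 0.1% per time window.
for n in (1, 10, 100, 1000):
    print(n, round(p_any_failure(0.001, n), 4))
```

Even with a very reliable single node, the chance of seeing some failure grows quickly with the number of nodes, which is why fault tolerance has to be designed in from the start.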

Andrew Tanenbaum is considered a pioneer in this field. For those who are not afraid of theory, I recommend his book "Distributed Systems". It may seem difficult for beginners, but it will really help you sharpen your skills.

I consider Designing Data-Intensive Applications by Martin Kleppmann to be the best introductory book. By the way, Martin has a wonderful blog. His work will help you organize your knowledge about building a modern infrastructure for storing and processing big data.
For those who prefer watching videos, there is a course on YouTube, Distributed Computer Systems.

7. Data pipelines

Data pipelines are something you cannot live without as a data engineer.

Most of the time, a data engineer builds a so-called data pipeline, that is, a process for delivering data from one place to another. This may be a set of custom scripts that call an external service's API or run an SQL query, enrich the data, and put it into a centralized store (a data warehouse) or an unstructured data store (a data lake).
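
The shape of such a pipeline can be sketched in a few lines. Everything below is an assumption made for illustration: `extract` returns a hard-coded payload in place of a real API call, the `payments` table is invented, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3


def extract():
    """Stand-in for a call to an external service API (hypothetical payload)."""
    return [
        {"user_id": 1, "country": "de", "amount_cents": 1990},
        {"user_id": 2, "country": "us", "amount_cents": 500},
        {"user_id": 3, "country": None, "amount_cents": 1250},
    ]


def transform(records):
    """Drop incomplete rows, upper-case country codes, convert cents to units."""
    for rec in records:
        if rec["country"] is None:
            continue  # a real pipeline might send such rows to a quarantine table
        yield (rec["user_id"], rec["country"].upper(), rec["amount_cents"] / 100)


def load(conn, rows):
    """Write the transformed rows into the (hypothetical) payments table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the central warehouse
load(conn, transform(extract()))
stored = conn.execute(
    "SELECT user_id, country, amount FROM payments ORDER BY user_id"
).fetchall()
print(stored)
```

Keeping extract, transform, and load as separate functions makes each step easy to test and replace, for example when the source switches from an API to a database query.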

To sum up: the basic data engineer checklist

To sum up, a good understanding of the following is required:

  • Information systems;
  • Software development (Agile, DevOps, Design Techniques, SOA);
  • Distributed systems and parallel programming;
  • Database fundamentals - planning, design, operation, and troubleshooting;
  • Experiment design - A/B tests to prove concepts, determine reliability and system performance, and develop reliable ways to deliver good solutions quickly.

These are just a few of the requirements for becoming a data engineer, so learn and understand data systems, information systems, continuous delivery/deployment/integration, programming languages, and other computer science topics (though not every subject area).

And finally, the last but very important thing I want to say.

The path to becoming a Data Engineer is not as simple as it may seem. It is unforgiving and frustrating, and you have to be prepared for that. Some moments on this journey may push you to give up. But this is real work and a real learning process.

Don't sugarcoat it from the start. The whole point of the journey is to learn as much as possible and be ready for new challenges.
Here is a great picture I came across that illustrates this point well:

And yes, remember to avoid burnout and to rest. This is also very important. Good luck!

What do you think of the article, friends? We invite you to a free webinar, which will take place today at 20.00. During the webinar, we will discuss how to build an efficient and scalable data processing system for a small company or startup at minimal cost. As practice, we will get acquainted with Google Cloud data processing tools. See you there!

Source: www.habr.com
