Data Scientist's Notes: A Personal Review of Data Query Languages

I am speaking from personal experience about what proved useful, where and when. This is an overview in summary form, so that it is clear what you might want to dig into further - but it reflects my own subjective experience, and things may be entirely different for you.

Why is it important to know and be able to use query languages? At its core, Data Science has several important stages of work, and the first and most important one (without it, nothing will work at all!) is obtaining or extracting data. Most often the data sits somewhere in some form and has to be "retrieved" from there.

Query languages are what let you extract that data! Today I will tell you about the query languages that have been useful to me, and show where and how exactly they helped - in other words, why they are worth learning.

There will be three main blocks of query language types, which we will cover in this article:

  • "Standard" query languages are what is usually meant when people talk about a query language, such as relational algebra or SQL.
  • Scripting query languages: for example, Python tools such as pandas and numpy, or shell scripting.
  • Query languages for knowledge graphs and graph databases.

Everything written here is simply personal experience: what came in handy, with a description of the situations and of "why it was needed". You can judge how likely similar situations are to come up for you and prepare in advance by getting to grips with these languages, before you have to apply them (urgently) on a project, or in order to get onto a project where they are required.

"Standard" query languages

Standard query languages are exactly what we usually have in mind when we talk about queries.

Relational algebra

Why is relational algebra still needed today? To understand why query languages are designed the way they are, and to use them deliberately, you need to understand the core that underlies them.

What is relational algebra?

The formal definition is as follows: relational algebra is a closed system of operations on relations in the relational data model. Put in more human terms, it is a system of operations on tables such that the result is always a table.
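The closure property can be sketched in a few lines of Python. This is only an illustration, not a real engine: a table is modelled as a list of dicts, and the helper names `select`, `project` and `join`, as well as the sample data, are assumptions made up for this example.

```python
# Illustrative sketch: a "table" is a list of dicts (rows).
# select/project/join are hypothetical helpers, not a library API.

def select(table, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]

def project(table, columns):
    """Projection: keep only the named columns, dropping duplicate rows."""
    seen, result = set(), []
    for row in table:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result

def join(left, right, on):
    """Natural join on a single shared column."""
    return [{**l, **r} for l in left for r in right if l[on] == r[on]]

# Made-up sample data
trips = [
    {"trip_id": 1, "station_id": 10, "duration": 300},
    {"trip_id": 2, "station_id": 20, "duration": 900},
    {"trip_id": 3, "station_id": 10, "duration": 1200},
]
stations = [
    {"station_id": 10, "name": "Central"},
    {"station_id": 20, "name": "Harbour"},
]

# Every operation takes tables and returns a table, so they compose freely.
long_trip_stations = project(
    select(join(trips, stations, on="station_id"), lambda r: r["duration"] > 600),
    ["name"],
)
print(long_trip_stations)
```

Because each helper returns the same kind of object it accepts, the calls can be nested in any order that makes sense, which is the practical meaning of "closed".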

For all the relational operations, see this article on Habr - here we only describe why you need to know them and where they come in handy.

Why?

Once you begin to understand what query languages are built on and which operations stand behind their expressions, you often gain a much deeper understanding of what works in query languages and how.

Figure taken from this article. Example of an operation: join, which combines tables.

Study materials:

A good introductory course from Stanford. In general, there is plenty of material on relational algebra and theory - Coursera, Udacity. There is also a huge amount of material online, including good academic courses. My personal advice: you need to understand relational algebra really well - it is the foundation of foundations.

SQL

Figure taken from this article.

SQL is essentially an implementation of relational algebra - with one important caveat: SQL is declarative! That is, when you write a query in the language of relational algebra, you are actually saying how to compute the result, whereas with SQL you specify what you want to extract, and the DBMS then generates (efficient) expressions in the language of relational algebra (their equivalence is known to us as Codd's theorem).
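The "what, not how" point can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch on made-up tables (`trips` and `stations` are assumptions for the example): the query only states the desired result, and `EXPLAIN QUERY PLAN` shows the execution strategy that SQLite picked on its own.

```python
import sqlite3

# In-memory database with made-up sample tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trips (trip_id INTEGER PRIMARY KEY, station_id INTEGER, duration INTEGER);
    CREATE TABLE stations (station_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO trips VALUES (1, 10, 300), (2, 20, 900), (3, 10, 1200);
    INSERT INTO stations VALUES (10, 'Central'), (20, 'Harbour');
""")

query = """
    SELECT DISTINCT s.name
    FROM trips AS t
    JOIN stations AS s ON s.station_id = t.station_id
    WHERE t.duration > 600
    ORDER BY s.name
"""

# We state WHAT we want ...
names = [row[0] for row in conn.execute(query)]
print(names)

# ... and the engine decides HOW: print the plan SQLite chose for the query.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step)
```

The exact plan text depends on the SQLite version, so it is best read as a hint about join order and index use rather than as stable output.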

Figure taken from this article.

Why?

Relational DBMSs such as Oracle, Postgres and SQL Server are still practically everywhere, and there is a very high chance that you will have to interact with them. That means you will either have to read SQL (very likely) or write it (also not unlikely).

What to read and study

Following the same links as above (about relational algebra), there is an incredible amount of material, for example, this one.

By the way, what is NoSQL?

"It is worth emphasizing once again that the term "NoSQL" has an entirely spontaneous origin and has no generally accepted definition or scientific institution behind it." From the corresponding article on Habr.

In practice, people realized that a full relational model is not needed for many problems, especially those where performance is critical and simple queries with aggregation dominate. When it is critical to compute metrics quickly and write them to the database, most relational features turned out to be not just unnecessary but harmful: why normalize something if it will ruin the thing that matters most to us (for a specific task), namely performance?

Also, flexible schemas are often needed instead of the fixed mathematical schemas of the classical relational model. This greatly simplifies application development when it is critical to deploy the system and start working quickly, processing results as you go - or when the schema and the types of the stored data are not that important.

For example, suppose we are building an expert system and want to store information on a specific domain together with some meta information. We may not know all the fields in advance and simply store a JSON document for each record. This gives us a very flexible environment for extending the data model and iterating quickly, so in this case NoSQL will be preferable and also more readable. An example record (from one of my projects where NoSQL was exactly what was needed):

{"en_wikipedia_url":"https://en.wikipedia.org/wiki/Johnny_Cash",
"ru_wikipedia_url":"https://ru.wikipedia.org/wiki/?curid=301643",
"ru_wiki_pagecount":149616,
"entity":[42775,"Джонни Кэш","ru"],
"en_wiki_pagecount":2338861}

You can read more about NoSQL here.
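To make the flexible-schema point concrete, here is a minimal Python sketch. The field names are taken from the example record above; the second record and its URL are invented for illustration and deliberately lack the Russian-wiki fields.

```python
import json

# Two records stored as JSON text. The second one is hypothetical and
# intentionally has fewer fields than the first.
raw_records = [
    '{"en_wikipedia_url": "https://en.wikipedia.org/wiki/Johnny_Cash", '
    '"ru_wiki_pagecount": 149616, "en_wiki_pagecount": 2338861}',
    '{"en_wikipedia_url": "https://en.wikipedia.org/wiki/Example", '
    '"en_wiki_pagecount": 1000}',
]
records = [json.loads(line) for line in raw_records]

# No schema migration is needed when a field is missing:
# absent keys are simply handled at read time.
total_pagecount = [
    rec.get("en_wiki_pagecount", 0) + rec.get("ru_wiki_pagecount", 0)
    for rec in records
]
print(total_pagecount)
```

The trade-off is that the checks a relational schema would enforce at write time move into the reading code.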

What to study?

Here, rather, you need to analyze your task thoroughly: what properties it has and which existing NoSQL systems fit that description. Then start studying that particular system.

Scripting Query Languages

At first it may seem unclear what Python has to do with any of this - it is a programming language, not a query language at all.


  • Pandas is literally the Swiss Army knife of Data Science; a huge amount of data transformation, aggregation and so on happens in it.
  • Numpy - vector computations, matrices and linear algebra live there.
  • Scipy - there is a lot of mathematics in this package, especially statistics.
  • Jupyter lab - a lot of exploratory data analysis fits well into notebooks - useful to know.
  • Requests - working with the network.
  • Pyspark is very popular among data engineers; most likely you will have to interact with it or with Spark, simply because of their popularity.
  • *Selenium - very useful for collecting data from websites and other resources; sometimes there is simply no other way to get the data.

My main advice: learn Python!

Pandas

Let's take the following code as an example:

import pandas as pd
df = pd.read_csv("data/dataset.csv")
# Keep return trips, then calculate and rename aggregations per station pair
all_together = (df[df["trip_type"] == "return"]
    .groupby(["start_station_name", "end_station_name"])
    .agg({"trip_duration_seconds": ["size", "mean", "min", "max"]})
    .rename(columns={"size": "num_trips",
                     "mean": "avg_duration_seconds",
                     "min": "min_duration_seconds",
                     "max": "max_duration_seconds"}))

In essence, we see that the code follows the classic SQL pattern.

SELECT start_station_name, end_station_name, count(trip_duration_seconds) AS size, …
FROM dataset
WHERE trip_type = 'return'
GROUP BY start_station_name, end_station_name
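The correspondence between the two forms can be checked on a toy dataset. The sketch below is only an illustration: the rows are made up, the column names follow the snippets above, and SQLite (through Python's standard `sqlite3` module) stands in for whatever DBMS would hold the data in practice.

```python
import sqlite3
import pandas as pd

# Made-up sample data using the column names from the example above
df = pd.DataFrame({
    "trip_type": ["return", "return", "one_way", "return"],
    "start_station_name": ["A", "A", "A", "B"],
    "end_station_name": ["B", "B", "B", "A"],
    "trip_duration_seconds": [100, 300, 50, 200],
})

# pandas version of the query
pandas_result = (
    df[df["trip_type"] == "return"]
    .groupby(["start_station_name", "end_station_name"])
    .agg(
        num_trips=("trip_duration_seconds", "size"),
        avg_duration_seconds=("trip_duration_seconds", "mean"),
    )
    .reset_index()
)

# SQL version of the same query, run on an in-memory SQLite database
conn = sqlite3.connect(":memory:")
df.to_sql("dataset", conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT start_station_name, end_station_name,
           COUNT(trip_duration_seconds) AS num_trips,
           AVG(trip_duration_seconds) AS avg_duration_seconds
    FROM dataset
    WHERE trip_type = 'return'
    GROUP BY start_station_name, end_station_name
    ORDER BY start_station_name, end_station_name
    """,
    conn,
)

print(pandas_result)
print(sql_result)
```

Filtering maps to WHERE, `groupby` to GROUP BY, and the named aggregations to the aggregate functions in the SELECT list.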

But the important part is that this code is a piece of a script and of a pipeline; in effect, we are embedding queries into a Python pipeline. In this situation, the query language comes to us from libraries such as Pandas or pySpark.

In general, in pySpark we see a similar kind of data transformation through a query language, in the spirit of:

# Assumes df is an existing Spark DataFrame with trip_type, day and duration columns
(df.filter(df.trip_type == "return")
   .groupby("day")
   .agg({"duration": "mean"})
   .sort("day"))

Where and what to read

For Python itself, finding study materials is generally not a problem. There is a huge number of tutorials online on pandas and pySpark, as well as courses on Spark (and on DS itself). Overall, the material here is easy to find by googling, and if I had to pick one package to focus on, it would of course be pandas. There is also plenty of material on the DS + Python combination.

Shell as a query language

Quite a few of the data processing and analysis projects I have worked on are, in fact, shell scripts that call code in Python and Java, plus the shell commands themselves. Therefore, in general, you can view pipelines in bash/zsh/etc as a kind of high-level query (you can, of course, do more than that there, but it is not typical for DS code in shell languages). Let's take a simple example: I needed to build a mapping from wikidata QIDs to full links to the Russian and English wikis. For this I wrote a simple query out of bash commands, and for the output I wrote a simple script in Python, which I put together like this:

pv "data/latest-all.json.gz" |
unpigz -c |
jq --stream "$JQ_QUERY" |
python3 scripts/post_process.py "output.csv"

where

JQ_QUERY='select((.[0][1] == "sitelinks" and (.[0][2]=="enwiki" or .[0][2] =="ruwiki") and .[0][3] =="title") or .[0][1] == "id")'

This was, in fact, the entire pipeline that created the required mapping; as we can see, everything worked in streaming mode:

  • pv filepath - shows a progress bar based on the file size and passes the file contents on
  • unpigz -c read a part of the archive and passed it to jq
  • jq with the --stream flag produced results immediately and passed them to the post-processor (just like in the very first example) in Python
  • internally, the post-processor was a simple state machine that formatted the output

In total, we get a complex pipeline that works in streaming mode on big data (0.5TB) without significant resources, built from a simple pipe and a couple of tools.
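The post-processing script itself is not shown, so the following is only a hypothetical sketch of what such a state machine could look like, not the original code. It assumes one compact JSON event per line (as `jq -c --stream` would print them), that each entity's "id" event arrives before its "sitelinks" events, and that a full link is built from the page title; the function name `post_process` and the output layout are invented for the example.

```python
import csv
import io
import json
from urllib.parse import quote

# Hypothetical sketch, NOT the original script. Input: one compact JSON
# event per line in the [path, value] form produced by jq --stream.
WIKI_HOSTS = {"enwiki": "en.wikipedia.org", "ruwiki": "ru.wikipedia.org"}

def post_process(lines, out):
    """Tiny state machine: an "id" event starts a new entity and flushes the old one."""
    writer = csv.writer(out)
    qid, links = None, {}

    def flush():
        # Write the entity collected so far as: QID, English link, Russian link
        if qid is not None:
            writer.writerow([qid, links.get("enwiki", ""), links.get("ruwiki", "")])

    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if len(event) != 2:
            continue  # closing events carry only a path, no value
        path, value = event
        if len(path) < 2:
            continue
        if path[1] == "id":
            flush()  # the previous entity is complete
            qid, links = value, {}
        elif path[1] == "sitelinks" and len(path) > 2 and path[2] in WIKI_HOSTS:
            title = quote(value.replace(" ", "_"))
            links[path[2]] = f"https://{WIKI_HOSTS[path[2]]}/wiki/{title}"
    flush()  # do not forget the last entity

# Made-up sample events
sample = [
    '[[0,"id"],"Q42"]',
    '[[0,"sitelinks","enwiki","title"],"Douglas Adams"]',
    '[[0,"sitelinks","ruwiki","title"],"Адамс, Дуглас"]',
    '[[1,"id"],"Q1"]',
]
buffer = io.StringIO()
post_process(sample, buffer)
print(buffer.getvalue())
```

Since only the current entity is held in memory, memory use stays constant regardless of the size of the dump, which is what makes the streaming approach work on a modest machine.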

Another important tip: learn to work well and efficiently in the terminal and to write bash/zsh/etc.

Where will it be useful? Almost everywhere - and again, there is a LOT of study material online. In particular, here is my previous article.

R scripting

Again, the reader may exclaim: but this is a whole programming language! And, of course, they will be right. However, I usually encountered R in a context where it was, in fact, very similar to a query language.

R is a statistical computing environment and a language for statistical computing and visualization (according to this source).

Figure taken from here. By the way, I recommend it, good material.

Why does a data scientist need to know R? At the very least, because there is a huge layer of non-IT people who do data analysis in R. I have come across it in the following places:

  • The pharmaceutical sector.
  • Biologists.
  • The financial sector.
  • People with a purely mathematical education who work with statistics.
  • Specialized statistical and machine learning models (which can often be found only in the author's version, as an R package).

Why is it actually a query language? In the form in which it is often encountered, it is really a request to build a model, including reading the data and fixing the parameters of the query (the model), as well as visualizing data with packages such as ggplot2 - which is also a form of writing queries.

Example of a visualization query

ggplot(data = beav, 
       aes(x = id, y = temp, 
           group = activ, color = activ)) +
  geom_line() + 
  geom_point() +
  scale_color_manual(values = c("red", "blue"))

In general, many ideas from R have migrated into python packages such as pandas, numpy or scipy, for example dataframes and data vectorization - so many things in R will seem familiar and convenient to you.

There are many sources to study, for example, this one.

Knowledge graphs

Here I have a slightly unusual experience, because I quite often have to work with knowledge graphs and query languages for graphs. So let's just briefly go over the basics, since this part is a bit more exotic.

In classical relational databases we have a fixed schema, whereas here the schema is flexible: every predicate is effectively a "column", and even more than that.

Imagine that you were modeling a person and wanted to describe the key things about them. As an example, let's take a specific person, Douglas Adams, and use this description as a basis.

www.wikidata.org/wiki/Q42

If we used a relational database, we would have to create a huge table or set of tables with a huge number of columns, most of which would be NULL or filled with some default False value. For example, it is unlikely that many of us have an entry in the Korean national library. Of course, we could move such fields into separate tables, but in the end that would be an attempt to model a flexible logical schema with predicates by means of a fixed relational one.

So imagine that all the data is stored as a graph, or as binary and unary Boolean expressions.

Where can you run into this at all? First of all, when working with wikidata, and with any graph databases or linked data.

Below are the main query languages that I have used and worked with.

SPARQL

Wiki:
SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) is a query language for data represented in the RDF model, as well as a protocol for transmitting these queries and responding to them. SPARQL is a W3C recommendation and one of the technologies of the semantic web.

But in reality it is a query language over logical unary and binary predicates. You simply specify what is fixed in a Boolean expression and what is left open (to put it very simply).

The RDF (Resource Description Framework) store itself, over which SPARQL queries are executed, is a set of triples (subject, predicate, object), and a query selects the required triples according to the given constraints, in the spirit of: find an X such that p_55(X, q_33) is true. Here, of course, p_55 is some relation with ID 55 and q_33 is an object with ID 33 (and that is the whole story, again leaving out all sorts of details).
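The "find X such that p_55(X, q_33)" idea can be sketched with an in-memory set of triples. This is only an illustration of pattern matching over triples, not how a real triple store works internally; the IDs are made up in the p_55 / q_33 style used above, and `match` is a hypothetical helper.

```python
# Illustrative only: a tiny in-memory triple store with made-up IDs.
# Each triple is (subject, predicate, object).
triples = {
    ("q_1", "p_55", "q_33"),
    ("q_2", "p_55", "q_33"),
    ("q_3", "p_55", "q_44"),
    ("q_1", "p_77", "q_44"),
}

def match(pattern, store):
    """Return the triples that fit the pattern; None plays the role of a variable."""
    return sorted(
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    )

# "Find X such that p_55(X, q_33) is true"
xs = [subject for subject, _, _ in match((None, "p_55", "q_33"), triples)]
print(xs)
```

Fixing different positions of the pattern gives the other basic query shapes, for example `match(("q_1", None, None), triples)` lists everything known about `q_1`.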

Example of data representation:

The figures and the example with countries are taken from here.

Basic query example


In effect, we want to find the value of the ?country variable such that, for the predicate member_of, it is true that member_of(?country, q458), where q458 is the ID of the European Union.

An example of a real SPARQL query inside a python engine:
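As a rough sketch of what sending the member_of query from Python might look like, the code below uses only the standard library. It assumes Wikidata's public SPARQL endpoint and the Wikidata identifiers P463 ("member of") and Q458 (European Union); the helper names `build_request` and `run` and the User-Agent string are invented for the example, and `run` needs network access, so it is defined but not called here.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Wikidata's public SPARQL endpoint and its identifiers
# P463 ("member of") and Q458 (European Union).
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?country WHERE {
  ?country wdt:P463 wd:Q458 .   # member_of(?country, q458)
}
"""

def build_request(query):
    """Build (but do not send) an HTTP GET request for the query."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(url, headers={"User-Agent": "sparql-sketch/0.1"})

def run(query):
    """Send the query and return the bound ?country URIs (needs network access)."""
    with urllib.request.urlopen(build_request(query)) as response:
        data = json.load(response)
    return [b["country"]["value"] for b in data["results"]["bindings"]]

request = build_request(QUERY)
print(request.full_url)
```

The triple pattern inside the braces is the whole query: the variable `?country` is the open position, while the predicate and the object are fixed.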


As a rule, I have had to read SPARQL rather than write it. In that situation it is a useful skill to understand the language at least at a basic level, so that you can see exactly how the data is being retrieved.

There is a lot of study material online: for example, this one and this one. I usually just google specific constructs and examples, and so far that has been enough.

Logical query languages

You can read more on this topic in my article here. Here we will only briefly look at why logic languages are well suited for writing queries. Essentially, RDF is just a set of logical statements of the form p(X) and h(X,Y), and a logical query looks like this:

output(X) :- country(X), member_of(X,"EU").

Here we are creating a new predicate output/1 (/1 means unary), which holds for X provided that country(X) is true, i.e. X is a country, and member_of(X,"EU") is true as well.
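The rule can be read as a set comprehension over facts. The Python below is only a sketch with a handful of hand-written facts; a real system would use a Datalog or Answer Set Programming engine rather than plain Python sets.

```python
# Illustrative sketch: facts are stored as plain Python sets.
country = {"France", "Germany", "Norway"}        # country(X)
member_of = {                                    # member_of(X, Y)
    ("France", "EU"),
    ("Germany", "EU"),
    ("Norway", "EFTA"),
}

# output(X) :- country(X), member_of(X, "EU").
output = sorted(x for x in country if (x, "EU") in member_of)
print(output)
```

The comma in the rule body corresponds to the logical "and" in the comprehension's condition, and the derived `output` is itself just another set of facts that further rules could build on.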

That is, in this case both the data and the rules are represented in the same way, which makes it possible to model problems very easily and well.

Where I have encountered this in industry: a whole large project with a company that writes queries in such a language, and also my current project, where it sits at the core of the system. It may seem a rather exotic thing, but it does come up sometimes.

An example of a code fragment in a logic language that processes wikidata:


Materials: here I will give a couple of links to the modern logic programming language Answer Set Programming - I recommend studying it:


Source: www.habr.com
