Data Scientist's Notes: A Personal Review of Data Query Languages
I'll tell you, from personal experience, what turned out to be useful, where and when. This is an overview in thesis form, so that it is clear what you might want to dig into further - but it is my subjective personal experience, and things may well be different for you.
Why is it important to know and be able to use query languages? At its core, Data Science has several important stages of work, and the first and most important one (without it, of course, nothing will work!) is obtaining or extracting data. Most often the data sits somewhere in some form and has to be "retrieved" from there.
Query languages let you extract exactly this data! Today I'll tell you about the query languages that have been useful to me, and show where and how exactly - and therefore why they are worth learning.
The formal definition is as follows: relational algebra is a closed system of operations on relations in the relational data model. To put it in more human terms, it is a system of operations on tables such that the result is always a table.
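To make the "the result is always a table" property concrete, here is a minimal sketch in plain Python. It is only an illustration: the `select` and `project` helpers, the `trips` sample and its column names are assumptions made up for this example, not part of any library.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]  # a "relation" is modelled here as a list of rows


def select(table: Table, predicate: Callable[[Row], bool]) -> Table:
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]


def project(table: Table, columns: List[str]) -> Table:
    """Projection: keep only the named columns and drop duplicate rows."""
    seen = set()
    result = []
    for row in table:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result


# Made-up sample data, for illustration only.
trips = [
    {"start": "A", "end": "B", "trip_type": "return"},
    {"start": "A", "end": "B", "trip_type": "one_way"},
    {"start": "C", "end": "A", "trip_type": "return"},
]

# Each operation takes a table and returns a table, so they compose freely.
returns = select(trips, lambda r: r["trip_type"] == "return")
stations = project(returns, ["start", "end"])
print(stations)
```

Because every helper returns a table, the output of one can be fed straight into the next, which is the closure property the definition describes.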
You can find all the relational operations in this article on Habr; here we only explain why you need to know them and where they come in handy.
In practice, the code you end up writing matches classic SQL:
SELECT start_station_name, end_station_name, count(trip_duration_seconds) AS size, …..
FROM dataset
WHERE trip_type = 'return'
GROUP BY start_station_name, end_station_name
But the important point is that this code is part of a script and a pipeline; in effect, we are embedding queries into a Python pipeline. In that case the query language comes to us from libraries such as Pandas or pySpark.
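As a rough illustration of a query embedded in a Python pipeline, the sketch below runs the same kind of query through the standard-library `sqlite3` module rather than Pandas or pySpark. The table and column names come from the query above; the sample rows, the `count_return_trips` helper and the added `ORDER BY` (used only to make the output deterministic) are assumptions for this example, and the columns elided in the original query are simply left out.

```python
import sqlite3

QUERY = """
SELECT start_station_name, end_station_name,
       count(trip_duration_seconds) AS size
FROM dataset
WHERE trip_type = 'return'
GROUP BY start_station_name, end_station_name
ORDER BY start_station_name, end_station_name
"""


def count_return_trips(rows):
    """Load rows into an in-memory table and run the aggregation query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE dataset ("
            "start_station_name TEXT, end_station_name TEXT, "
            "trip_duration_seconds INTEGER, trip_type TEXT)"
        )
        conn.executemany("INSERT INTO dataset VALUES (?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# Made-up sample rows, for illustration only.
sample = [
    ("Central", "Harbour", 600, "return"),
    ("Central", "Harbour", 720, "return"),
    ("Harbour", "Central", 500, "return"),
    ("Central", "Park", 300, "one_way"),
]

result = count_return_trips(sample)
print(result)
```

The query is an ordinary string inside the script, so the surrounding Python code can prepare its input and post-process its result like any other pipeline step.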
Most often, in pySpark we see a similar kind of data transformation through a query language, in the spirit of:
For Python itself, finding learning materials is not a problem. There is a huge number of tutorials online for pandas and pySpark, and courses on Spark (as well as on DS itself). Overall, the content here is easy to find by googling, and if I had to pick one package to focus on, it would be pandas, of course. There is also a great deal of material on the DS + Python combination.
Shell as a query language
Quite a few of the data processing and analysis projects I have worked on are, in fact, shell scripts that call code in Python and Java, plus the shell commands themselves. So, in general, you can think of pipelines in bash/zsh/etc. as a kind of high-level query (you can, of course, write more than that in a shell, but this is not typical of DS code in shell languages). Let's take a simple example: I needed to build a mapping from wikidata QIDs to full links to the Russian and English wikis. For this I wrote a simple query out of bash commands, and for the output I wrote a simple Python script, and I combined them into a single pipeline that worked as follows:
unpigz -c read part of the archive and passed it to jq
jq with the --stream key output the result immediately and passed it to a postprocessor in Python (just as in the first example)
internally, the postprocessor was a simple state machine that formatted the output
In total, this is a complex pipeline that works in streaming mode on big data (0.5TB), without significant resources, and it is built from a simple pipe and a couple of tools.
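The postprocessor itself is not shown here, so the following is only a guess at its general shape: a small state machine that consumes one JSON event per line and emits one row per entity. It assumes events of the form `[[index, "id"], value]` and `[[index, "sitelinks", "enwiki", "title"], value]`, which is what `jq --stream` produces for a top-level array of entities; the actual jq query, field layout and output format used in the project may have been different, and the sample titles are made up.

```python
import json
from typing import Iterable, Iterator, Tuple


def postprocess(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Assemble (qid, enwiki title, ruwiki title) rows from streamed events."""
    current = None            # index of the entity being assembled
    qid, en, ru = "", "", ""  # state collected for the current entity
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if len(event) != 2:
            continue          # closing events carry only a path, no value
        path, value = event
        if not path:
            continue          # top-level scalar, nothing to assemble
        if path[0] != current:
            # A new entity started: flush the previous one, reset the state.
            if current is not None and qid:
                yield qid, en, ru
            current, qid, en, ru = path[0], "", "", ""
        if path[1:] == ["id"]:
            qid = value
        elif path[1:] == ["sitelinks", "enwiki", "title"]:
            en = value
        elif path[1:] == ["sitelinks", "ruwiki", "title"]:
            ru = value
    if current is not None and qid:
        yield qid, en, ru     # flush the last entity


# Made-up sample events, for illustration only.
events = [
    '[[0,"id"],"Q42"]',
    '[[0,"sitelinks","enwiki","title"],"Douglas Adams"]',
    '[[0,"sitelinks","ruwiki","title"],"Adams, Douglas"]',
    '[[0,"sitelinks","ruwiki","title"]]',
    '[[1,"id"],"Q1"]',
    '[[1,"sitelinks","enwiki","title"],"Universe"]',
    '[[1]]',
]
rows = list(postprocess(events))
print(rows)
```

In a real pipeline the function would be fed `sys.stdin` instead of a list, so it holds only one entity in memory at a time, which is what makes the streaming approach workable on a large dump.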
Another important tip: learn to work well and efficiently in the terminal and to write in bash/zsh/etc.
Where will this be useful? Almost everywhere - and again, there are plenty of materials to study online. In particular, there is my previous article.
R scripting
Again, the reader may exclaim: but this is a whole programming language! And, of course, they would be right. However, I usually encountered R in a form where it was, in fact, very similar to a query language.
R is a statistical computing environment and a language for statistical computing and visualization (according to this source).
Taken from here. By the way, I recommend it, it is good material.
Why does a data scientist need to know R? At the very least because there is a huge layer of non-IT people who analyze data in R. I have come across it in the following places:
The pharmaceutical sector.
Biologists.
The financial sector.
People with a purely mathematical education who work with statistics.
Specialized statistical models and machine learning models (which can often be found only in the author's version, as an R package).
Why is it actually a query language? In the form in which it is most often encountered, it is really a request to build a model, including reading the data and fixing the parameters of the query (the model), as well as visualizing the data with packages such as ggplot2 - which is also a form of writing queries.
Example of a query for visualization
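# Assumes `beav` is a data frame with columns id, temp and activ (a two-level factor)
library(ggplot2)

ggplot(data = beav,
       aes(x = id, y = temp,
           group = activ, color = activ)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = c("red", "blue"))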
In general, many ideas from R have migrated to Python packages such as pandas, numpy or scipy, for example dataframes and data vectorization - so a lot of things in R will look familiar and convenient to you.
Here my experience is a little unusual, because I quite often have to work with knowledge graphs and query languages for graphs. So let's just briefly go over the basics, since this part is somewhat more exotic.
In classical relational databases we have a fixed schema, but here the schema is flexible: each predicate is effectively a "column", and even more than that.
But in essence it is a query language for logical unary and binary predicates. You simply state what is fixed in a Boolean expression and what is not (very much simplified).
RDF (Resource Description Framework) itself, over which SPARQL queries are executed, is a set of triples: subject, predicate, object. A query selects the required triples according to the specified constraints, in the spirit of: find X such that p_55(X, q_33) is true, where p_55 is some relation with ID 55 and q_33 is an object with ID 33 (that is the whole story, again leaving out all sorts of details).
In effect, we want to find the value of the variable ?country such that, for the predicate member_of, it is true that member_of(?country, q458), where q458 is the ID of the European Union.
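The pattern-matching idea behind such a query can be sketched in a few lines of plain Python. This is a toy illustration, not SPARQL and not a real triple store: the `match` helper and all identifiers in the sample data except `q458` are made up for the example.

```python
from typing import Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def match(
    triples: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that agree with every fixed (non-None) position."""
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Made-up sample graph; only q458 (the European Union) comes from the text.
triples = [
    ("france", "member_of", "q458"),
    ("germany", "member_of", "q458"),
    ("norway", "member_of", "some_other_org"),
    ("france", "capital", "paris"),
]

# "Find ?country such that member_of(?country, q458) is true":
# the predicate and object are fixed, the subject is the free variable.
countries = sorted(s for s, _, _ in match(triples, predicate="member_of", obj="q458"))
print(countries)
```

The positions left as `None` play the role of query variables, and the fixed positions are the constraints.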
An example of a real SPARQL query inside a Python engine:
There are plenty of materials to study online: for example, this one and this one. I usually just google specific constructs and examples, and for now that is enough.
Logical query languages
You can read more on this topic in my article here. Here we will only briefly look at why logic languages are well suited to writing queries. In essence, RDF is just a set of logical statements of the form p(X) and h(X,Y), and a logical query has the following form:
output(X) :- country(X), member_of(X,"EU").
Here we are talking about creating a new predicate output/1 (/1 means unary), which holds for X provided that country(X) is true - that is, X is a country - and member_of(X,"EU") is true as well.
That is, in this case both the data and the rules are represented in the same way, which lets us model problems easily and well.
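To show how such a rule can be evaluated, here is a minimal sketch in plain Python rather than in a logic programming language. Facts are stored as `(predicate_name, arguments)` pairs, and the rule body is evaluated as a conjunction; the `apply_output_rule` helper and the sample facts are assumptions made up for this illustration.

```python
from typing import Set, Tuple

Fact = Tuple[str, Tuple[str, ...]]  # (predicate name, arguments)


def apply_output_rule(facts: Set[Fact]) -> Set[Fact]:
    """Evaluate: output(X) :- country(X), member_of(X, "EU")."""
    countries = {args[0] for name, args in facts if name == "country"}
    eu_members = {
        args[0]
        for name, args in facts
        if name == "member_of" and args[1] == "EU"
    }
    # The rule body is a conjunction, so X must satisfy both conditions.
    return {("output", (x,)) for x in countries & eu_members}


# Made-up sample facts, for illustration only.
facts = {
    ("country", ("france",)),
    ("country", ("japan",)),
    ("member_of", ("france", "EU")),
    ("member_of", ("some_agency", "EU")),  # an EU member that is not a country
}

derived = apply_output_rule(facts)
print(derived)
```

The derived facts have the same shape as the input facts, so they could be added back to the fact base and used by further rules.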
Where have I met this in industry? In a whole large project with a company that writes queries in such a language, and also in my current project, at the core of the system. It may seem a rather exotic thing, but it does happen sometimes.
An example of a code fragment in a logic programming language that processes wikidata:
Materials: here are a few links for the modern logic programming language Answer Set Programming - I recommend studying it: