Data Scientist's Notes: A Personalized Review of Data Query Languages

I am writing from personal experience about what proved useful, where, and when. This is an overview in thesis form, so that it is clear what exists and where you can dig further. Keep in mind that it reflects only my own subjective experience; things may be completely different for you.

Why is it important to know query languages and be able to use them? At its core, Data Science consists of several important stages of work, and the first and most important one (without it, certainly nothing will work!) is obtaining or extracting data. Most often, the data sits somewhere in some form and needs to be "retrieved" from there.

Query languages let you extract exactly this data! Today I will tell you about the query languages that have been useful to me, show where and how exactly they are used, and explain why they are worth studying.

There will be three main blocks of data query types, which we will discuss in this article:

  • "Standard" query languages: what is usually meant when people talk about a query language, such as relational algebra or SQL.
  • Scripting query languages: for example, the Python tools pandas and numpy, or shell scripting.
  • Query languages for knowledge graphs and graph databases.

Everything written here is just personal experience of what was useful, with a description of the situations and of "why it was needed". Everyone can consider how similar situations might come their way and try to prepare for them in advance by learning these languages before having to apply them (urgently) on a project, or before even joining a project where they are needed.

"Standard" query languages

Standard query languages are standard in the sense that they are what we usually think of when we talk about queries.

Relational algebra

Why is relational algebra needed today? To understand well why query languages are built the way they are, and to use them consciously, you need to understand the core that underlies them.

What is relational algebra?

The formal definition is as follows: relational algebra is a closed system of operations on relations in the relational data model. Put a little more humanly, it is a system of operations on tables such that the result is always a table.
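To make the "closed system" idea concrete, here is a minimal sketch in plain Python. It assumes relations are modelled as lists of dicts; the function names (select, project, natural_join) and the sample data are illustrative only and do not come from any particular library. Because every operation takes relations and returns a relation, the calls can be nested freely.

```python
# Relations are modelled as lists of dicts (rows); names and data are illustrative.
def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, columns):
    """Projection: keep only the given columns, dropping duplicate rows."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result


def natural_join(left, right):
    """Natural join: combine rows that agree on all shared column names."""
    result = []
    for l_row in left:
        for r_row in right:
            shared = set(l_row) & set(r_row)
            if all(l_row[c] == r_row[c] for c in shared):
                result.append({**l_row, **r_row})
    return result


people = [
    {"name": "Ann", "city_id": 1},
    {"name": "Bob", "city_id": 2},
    {"name": "Eve", "city_id": 1},
]
cities = [
    {"city_id": 1, "city": "Oslo"},
    {"city_id": 2, "city": "Rome"},
]

# Each operation returns a relation, so the calls nest freely.
oslo_names = project(
    select(natural_join(people, cities), lambda row: row["city"] == "Oslo"),
    ["name"],
)
print(oslo_names)  # [{'name': 'Ann'}, {'name': 'Eve'}]
```

The nested expression reads inside out as join, then selection, then projection, which is roughly the shape a DBMS works with internally.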

See all the relational operations in this article from Habr; here we describe why you need to know them and where they come in handy.

Why?

Once you begin to understand what query languages are about and which operations stand behind the expressions of particular query languages, you usually gain a deeper understanding of what works in query languages and how.

Figure taken from this article. Example of an operation: join, which combines tables.

Study materials:

There is a good introductory course from Stanford. In general, there is a lot of material on relational algebra and theory: Coursera, Udacity. There is also a huge amount of material online, including good academic courses. My personal advice: you need to understand relational algebra very well; it is the foundation of foundations.

SQL

Figure taken from this article.

SQL is essentially an implementation of relational algebra, with an important caveat: SQL is declarative! That is, when you write a query in the language of relational algebra, you are actually saying how to compute the result, whereas with SQL you specify what you want to extract, and the DBMS then generates (efficient) expressions in the language of relational algebra (their equivalence is known to us as Codd's theorem).
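The difference between "what" and "how" can be sketched with Python's built-in sqlite3 module. The table name trips, its columns, and the rows are assumptions made up for this illustration. The same answer is obtained once declaratively and once by spelling out the steps by hand, and EXPLAIN QUERY PLAN shows the strategy the engine picked on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trips (id INTEGER PRIMARY KEY, trip_type TEXT, duration INTEGER);
    CREATE INDEX idx_trips_type ON trips (trip_type);
    INSERT INTO trips (trip_type, duration) VALUES
        ('return', 600), ('one_way', 300), ('return', 900);
""")

query = "SELECT AVG(duration) FROM trips WHERE trip_type = 'return'"

# Declarative: we state what we want...
avg_declarative = conn.execute(query).fetchone()[0]

# ...and the engine decides how. The last column of each plan row is a
# human-readable description of the chosen strategy (text varies by version).
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
print(plan)

# Procedural equivalent: we spell out the scan, the filter and the aggregation.
durations = [
    duration
    for trip_type, duration in conn.execute("SELECT trip_type, duration FROM trips")
    if trip_type == "return"
]
avg_procedural = sum(durations) / len(durations)

print(avg_declarative, avg_procedural)
```

The plan text depends on the SQLite version, so it is printed rather than compared against a fixed string.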

Figure taken from this article.

Why?

Relational DBMSs such as Oracle, Postgres, SQL Server, and so on are still practically everywhere, and there is a very high chance that you will have to interact with them, which means you will either have to read SQL (very likely) or write it (also not unlikely).

What to read and study

Following the same links above (on relational algebra), there is an incredible amount of material, for example, this.

But what about NoSQL?

"It is worth emphasizing once again that the term "NoSQL" has an absolutely spontaneous origin and has no generally accepted definition or scientific institution behind it." (From the corresponding article on Habr.)

In fact, people realized that a full relational model is not needed to solve many problems, especially those where, for example, performance is critical and certain simple queries with aggregation dominate: where it is critical to compute metrics quickly and write them to the database, and most relational features turn out to be not only unnecessary but harmful. Why normalize something if it will ruin the thing that matters most to us (for a particular task), namely performance?

Also, flexible schemas are often needed instead of the fixed mathematical schemas of the classical relational model. This greatly simplifies application development when it is critical to deploy the system and start working quickly, processing the results, or when the schema and the types of the stored data are not that important.

For example, we are building an expert system and want to store information on a specific domain together with some meta-information. We may not know all the fields and may simply store a JSON document for each record. This gives us a very flexible environment for extending the data model and iterating quickly, so in this case NoSQL will be even preferable and more readable. An example entry (from one of my projects where NoSQL was right where it was needed):

{"en_wikipedia_url":"https://en.wikipedia.org/wiki/Johnny_Cash",
"ru_wikipedia_url":"https://ru.wikipedia.org/wiki/?curid=301643",
"ru_wiki_pagecount":149616,
"entity":[42775,"Джонни Кэш","ru"],
"en_wiki_pagecount":2338861}

You can read more about NoSQL here.
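As a small sketch of what "not knowing all the fields" means in practice, the snippet below parses two JSON records with different sets of keys using only the standard library. The first record is the entry shown above; the second is a made-up record with fewer fields. Missing fields are handled when reading, not by a schema.

```python
import json

# Two records with different sets of fields; the second one is invented for illustration.
raw_records = [
    '{"en_wikipedia_url":"https://en.wikipedia.org/wiki/Johnny_Cash",'
    '"ru_wikipedia_url":"https://ru.wikipedia.org/wiki/?curid=301643",'
    '"ru_wiki_pagecount":149616,'
    '"entity":[42775,"Джонни Кэш","ru"],'
    '"en_wiki_pagecount":2338861}',
    '{"en_wikipedia_url":"https://en.wikipedia.org/wiki/Example",'
    '"en_wiki_pagecount":10}',
]

records = [json.loads(line) for line in raw_records]

# No fixed schema: absent fields fall back to a default at read time.
totals = [
    record.get("en_wiki_pagecount", 0) + record.get("ru_wiki_pagecount", 0)
    for record in records
]

for record, total in zip(records, totals):
    print(record["en_wikipedia_url"], total)
```

The price of this flexibility is that every reader of the data has to decide what a missing field means, which is exactly the trade-off to weigh when picking a NoSQL system.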

What to study?

Here, rather, you just need to analyze your task thoroughly: what properties it has and which available NoSQL systems would fit this description. Then start studying that system.

Scripting query languages

At first it seems: what does Python have to do with this at all? It is a programming language, and not about queries at all.

  • Pandas is the Swiss Army knife of Data Science; a huge amount of data transformation, aggregation, and so on happens in it.
  • Numpy: vector computations, matrices, and linear algebra.
  • Scipy: there is a lot of mathematics in this package, especially statistics.
  • Jupyter lab: a lot of exploratory data analysis fits well into notebooks; useful to know.
  • Requests: working with the network.
  • Pyspark is very popular among data engineers; most likely you will have to interact with it or with Spark, simply because of their popularity.
  • Selenium: very useful for collecting data from sites and resources; sometimes there is simply no other way to get the data.

My main advice: learn Python!

Pandas

Let's take the following code as an example:

import pandas as pd

df = pd.read_csv("data/dataset.csv")

# Filter return trips, group by station pair, then compute named aggregations
all_together = (
    df[df["trip_type"] == "return"]
    .groupby(["start_station_name", "end_station_name"])
    .agg(
        num_trips=("trip_duration_seconds", "size"),
        avg_duration_seconds=("trip_duration_seconds", "mean"),
        min_duration_seconds=("trip_duration_seconds", "min"),
        max_duration_seconds=("trip_duration_seconds", "max"),
    )
)

Essentially, we see that the code fits the classic SQL pattern.

SELECT start_station_name, end_station_name, count(trip_duration_seconds) AS size, ...
FROM dataset
WHERE trip_type = 'return'
GROUP BY start_station_name, end_station_name

But the important part is that this code is part of a script and a pipeline; in effect, we are embedding queries into a Python pipeline. In this situation, the query language comes to us from libraries such as Pandas or pySpark.
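A minimal sketch of what "a query embedded in a pipeline" can look like, assuming pandas is installed. The step names (load, query_return_trips, report) and the toy data are hypothetical and stand in for whatever a real pipeline would do before and after the query.

```python
import pandas as pd

# Toy data standing in for data/dataset.csv; the values are invented.
df = pd.DataFrame({
    "trip_type": ["return", "return", "one_way", "return"],
    "start_station_name": ["A", "A", "A", "B"],
    "end_station_name": ["B", "B", "B", "A"],
    "trip_duration_seconds": [600, 900, 300, 1200],
})


def load(frame):
    """Stand-in for an extraction step (for example, reading a CSV)."""
    return frame.copy()


def query_return_trips(frame):
    """The 'query' step: WHERE + GROUP BY + aggregates, expressed in pandas."""
    return (
        frame[frame["trip_type"] == "return"]
        .groupby(["start_station_name", "end_station_name"])
        .agg(
            num_trips=("trip_duration_seconds", "size"),
            avg_duration_seconds=("trip_duration_seconds", "mean"),
        )
        .reset_index()
    )


def report(frame):
    """A downstream step that consumes the query result."""
    return frame.sort_values("num_trips", ascending=False).to_dict("records")


result = report(query_return_trips(load(df)))
print(result)
```

The query is just one function among others, so it can be tested, reused, and rearranged like any other pipeline step.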

In general, in pySpark we see a similar type of data transformation through a query language, in the spirit of:

(df.filter(df.trip_type == "return")
   .groupby("day")
   .agg({"duration": "mean"})
   .sort("day"))  # sorting by day is an assumption

Where and what to study

For Python itself, finding study materials is generally not a problem. There is a huge number of online tutorials on pandas, pySpark, and Spark (and on DS itself, too). Overall, the content here is easy to find by googling, and if I had to pick one package to focus on, it would be pandas, of course. There is also a great deal of material on the combination of DS and Python.

Shell as a query language

Quite a few data processing and analysis projects I have worked with are, in fact, shell scripts that call code in Python and Java, plus the shell commands themselves. So, in general, you can view pipelines in bash/zsh/etc. as a kind of high-level query (you can, of course, stuff loops in there, but that is not typical for DS code in shell languages). Let's give a simple example: I needed to build a mapping between Wikidata QIDs and the full links to the Russian and English wikis. For this I wrote a simple query out of bash commands, and for the output I wrote a simple Python script, which I put together like this:

pv "data/latest-all.json.gz" |
unpigz -c |
jq --stream "$JQ_QUERY" |
python3 scripts/post_process.py "output.csv"

where

JQ_QUERY='select((.[0][1] == "sitelinks" and (.[0][2]=="enwiki" or .[0][2] =="ruwiki") and .[0][3] =="title") or .[0][1] == "id")'

This was, in fact, the entire pipeline that built the required mapping; as we can see, everything worked in streaming mode:

  • pv filepath: shows a progress bar based on the file size and passes the file contents onward
  • unpigz -c: read part of the archive and passed it to jq
  • jq with the --stream flag: produced results immediately and passed them to the post-processor in Python (the same as in the first example)
  • internally, the post-processor was a simple state machine that formatted the output

All in all, a complex pipeline working in streaming mode on big data (0.5TB), without significant resources, built from a simple pipe and a couple of tools.
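The post-processor itself is not shown in the article, so the following is only a sketch of how such a state machine might look, under stated assumptions: jq emits one compact event per line (for example with the -c flag, which the command above does not show), each event has the form [path, value], and closing events of the form [path] carry no value. The function name rows_from_events and the sample events are hypothetical.

```python
import json


def rows_from_events(lines):
    """Turn jq --stream events into (qid, enwiki_title, ruwiki_title) rows.

    State: the index of the entity being read and the fields seen so far.
    A change of entity index flushes the finished row.
    """
    current_index, row = None, {}
    for line in lines:
        event = json.loads(line)
        if len(event) != 2:  # closing events carry no value; skip them
            continue
        path, value = event
        if path[0] != current_index:  # a new entity has started
            if row.get("id"):
                yield (row["id"], row.get("enwiki", ""), row.get("ruwiki", ""))
            current_index, row = path[0], {}
        if path[1] == "id":
            row["id"] = value
        elif path[1] == "sitelinks":
            row[path[2]] = value  # path is [index, "sitelinks", wiki, "title"]
    if row.get("id"):  # flush the last entity
        yield (row["id"], row.get("enwiki", ""), row.get("ruwiki", ""))


# Sample events in the assumed format; in a real run the lines would come from sys.stdin.
sample = [
    '[[0,"id"],"Q42"]',
    '[[0,"sitelinks","enwiki","title"],"Douglas Adams"]',
    '[[0,"sitelinks","ruwiki","title"],"Адамс, Дуглас"]',
    '[[1,"id"],"Q1"]',
    '[[1,"sitelinks","enwiki","title"],"Universe"]',
    '[[1,"sitelinks"]]',
]
rows = list(rows_from_events(sample))
for row in rows:
    print(row)
```

Because the function is a generator that keeps only one entity in memory at a time, it fits the streaming style of the pipe: nothing is accumulated regardless of input size.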

Another important tip: learn to work well and efficiently in the terminal and to write bash/zsh/etc.

Where will it be useful? Practically everywhere, and again, there is plenty of study material online. In particular, see this previous article of mine.

R scripting

Again, the reader may exclaim: but this is a whole programming language! And, of course, they will be right. However, I usually encountered R in contexts where it was, in fact, very similar to a query language.

R is a statistical computing environment and a language for statistical computing and visualization (according to this source).

Figure taken from here. By the way, I recommend it; it is good material.

Why does a data scientist need to know R? At the very least, because there is a huge layer of non-IT people who analyze data in R. I have encountered it in the following places:

  • The medical sector.
  • Biologists.
  • The financial sector.
  • People with a purely mathematical education who deal with statistics.
  • Specialized statistical models and machine learning models (which can often be found only in the author's version, as an R package).

Why is it actually a query language? In the form in which it is often encountered, it is essentially a request to build a model, including reading the data and fixing the parameters of the query (the model), as well as visualizing the data in packages such as ggplot2, which is also a form of writing queries.

Example of a visualization query

ggplot(data = beav, 
       aes(x = id, y = temp, 
           group = activ, color = activ)) +
  geom_line() + 
  geom_point() +
  scale_color_manual(values = c("red", "blue"))

In general, many ideas from R have migrated into Python packages such as pandas, numpy, or scipy, for example dataframes and data vectorization, so many things in R will seem familiar and convenient to you.

There are many sources for study, for example, this.

Knowledge graphs

Here I have somewhat unusual experience, because I quite often have to work with knowledge graphs and graph query languages. So let's go over the basics only briefly, since this part is a bit more exotic.

In classical relational databases we have a fixed schema, but here the schema is flexible: each predicate is effectively a "column", and even more than that.

Imagine that you are modeling a person and want to describe the key things. For example, let's take a specific person, Douglas Adams, and use this description as the basis.

www.wikidata.org/wiki/Q42

If we used a relational database, we would have to create a huge table or tables with a huge number of columns, most of which would be NULL or filled with some default false value. For example, it is unlikely that many of us have an entry in the Korean national library. Of course, we could put these into separate tables, but that would ultimately be an attempt to model a flexible logical schema with predicates using a fixed relational one.

So imagine that all the data is stored as a graph, or as binary and unary boolean expressions.

Where can you run into this? First of all, when working with Wikidata, and with any graph database or linked data.

Below are the main query languages that I have used and worked with.

SPARQL

Wiki:
SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) is a query language for data represented in the RDF model, as well as a protocol for transmitting these queries and responding to them. SPARQL is a recommendation of the W3C consortium and one of the technologies of the semantic web.

In reality, though, it is a query language for logical unary and binary predicates. You simply specify, conditionally, what is fixed in a boolean expression and what is not (greatly simplified).

The RDF (Resource Description Framework) base itself, over which SPARQL queries are executed, is a set of triples (subject, predicate, object), and a query selects the required triples according to the specified constraints, in the spirit of: find X such that p_55(X, q_33) is true, where, of course, p_55 is some kind of relation with ID 55 and q_33 is an object with ID 33 (that is the whole story, again leaving out all sorts of details).

Example of data representation:

The pictures and the example with countries are taken from here.

Basic query example

In effect, we want to find the values of the ?country variable such that, for the predicate member_of, it is true that member_of(?country, q458), where q458 is the ID of the European Union.
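The matching idea behind such a query can be sketched in a few lines of plain Python. This is not SPARQL and not a real RDF engine, just a toy triple store with a single-pattern matcher; apart from q458, all identifiers (q_fr, q_de, q_no, q_efta) are invented for the illustration.

```python
# A toy triple store: (subject, predicate, object). IDs other than q458 are invented.
triples = [
    ("q_fr", "member_of", "q458"),
    ("q_de", "member_of", "q458"),
    ("q_no", "member_of", "q_efta"),
    ("q_fr", "label", "France"),
    ("q_de", "label", "Germany"),
    ("q_no", "label", "Norway"),
]


def match(pattern, store):
    """Yield variable bindings for one triple pattern.

    Terms starting with '?' are variables; every other term must match exactly.
    """
    for triple in store:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings


countries = sorted(
    binding["?country"]
    for binding in match(("?country", "member_of", "q458"), triples)
)
print(countries)  # ['q_de', 'q_fr']
```

A real engine combines many such patterns and joins their bindings, but each pattern works on the same principle: fix some positions, leave others as variables.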

[Figure: an example of a real SPARQL query inside a Python engine.]

In general, I have had to read SPARQL more than write it. In that situation, it is probably a useful skill to understand the language at least at a basic level, in order to understand exactly how the data is retrieved.

There is a lot of material to study online: for example, this and this. I usually google specific constructs and examples, and that has been enough so far.

Logical query languages

You can read more on the topic in my article here. Here we will only briefly examine why logical languages are well suited for writing queries. In essence, RDF is just a set of logical statements of the form p(X) and h(X,Y), and a logical query has the following form:

output(X) :- country(X), member_of(X,"EU").

Here we are talking about creating a new predicate output/1 (/1 means unary), provided that for X it is true that country(X), that is, X is a country, and also member_of(X,"EU").

That is, in this case both the data and the rules are represented in the same way, which lets us model problems very easily and well.
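A rough Python rendering of the rule above, assuming each predicate is stored as a set of tuples. The facts are illustrative. The point of the sketch is that the derived predicate has exactly the same shape as the stored facts, so it could in turn feed further rules.

```python
# Facts, written as sets of tuples: one set per predicate. The data is illustrative.
country = {("France",), ("Norway",), ("Germany",)}
member_of = {("France", "EU"), ("Germany", "EU"), ("Norway", "EFTA")}

# Rule: output(X) :- country(X), member_of(X, "EU").
# The derived predicate is again a set of tuples, just like the facts.
output = {(x,) for (x,) in country if (x, "EU") in member_of}

print(sorted(output))  # [('France',), ('Germany',)]
```

A logic programming system evaluates such rules for you, including recursive ones; the set comprehension only mimics a single non-recursive rule.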

Where have I met this in industry? A whole large project with a company that writes queries in such a language, and also the current project, at the core of the system. It may seem a rather exotic thing, but sometimes it does occur.

[Figure: an example of a code fragment in a logic language that processes Wikidata.]

Materials: I recommend studying the modern logic programming language Answer Set Programming.

Source: www.habr.com
