Data Science Notes: A Personalized Review of Data Query Languages

I am writing from personal experience about what proved useful, where and when. This is an overview in thesis form, so that it is clear what is what and where you can dig deeper. Keep in mind that this is purely my own experience; things may look completely different for you.

Why is it important to know query languages and be able to use them? At its core, Data Science has several important stages of work, and the very first and most important one (without it, nothing will work at all!) is obtaining or extracting the data. Most often the data is sitting somewhere in some form and has to be "retrieved" from there.

Query languages let you extract exactly this data. Today I will talk about the query languages that have been useful to me, and show where and how they came in handy, and why they are worth learning.

This article covers three main groups of data query languages:

  • "Standard" query languages: what people usually mean when they talk about a query language, such as relational algebra or SQL.
  • Scripting query languages: for example, Python tools such as pandas and numpy, or shell scripting.
  • Query languages for knowledge graphs and graph databases.

Everything written here is just personal experience: what was useful, with a description of the situations and of "why it was needed". You can judge how likely similar situations are to come your way and prepare for them in advance by getting to know these languages, before you have to apply them (urgently) on a project or join a project where they are required.

"Standard" query languages

Standard query languages are exactly what we usually have in mind when we talk about queries.

Relational algebra

Why is relational algebra still needed today? To understand why query languages are designed the way they are, and to use them deliberately, you need to understand the core they are built on.

What is relational algebra?

The formal definition is as follows: relational algebra is a closed system of operations on relations in the relational data model. Put a little more humanly, it is a system of operations on tables such that the result is always a table.

See all the relational operations in this article on Habr. Here we only describe why you need to know them and where they come in handy.

Why?

Once you start to understand what query languages are about and which operations stand behind the expressions of a particular query language, you usually gain a much deeper understanding of what works in query languages and how.

[Figure: an example of an operation: join, which combines tables. Taken from this article.]
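
To make the "closed system of operations on tables" idea concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: relations are modelled as lists of dicts, the function names `select`, `project` and `natural_join` are made up for this example, and the data is hypothetical.

```python
# Relations are modelled as lists of dicts (one dict per row).
# All names and data here are hypothetical and only illustrate the idea.

def select(relation, predicate):
    """Selection: keep only the rows for which predicate(row) is true."""
    return [row for row in relation if predicate(row)]

def project(relation, columns):
    """Projection: keep only the given columns, dropping duplicate rows."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result

def natural_join(left, right):
    """Natural join: combine rows that agree on all shared column names."""
    result = []
    for l_row in left:
        for r_row in right:
            shared = set(l_row) & set(r_row)
            if all(l_row[c] == r_row[c] for c in shared):
                result.append({**l_row, **r_row})
    return result

employees = [
    {"name": "Ann", "dept_id": 1},
    {"name": "Bob", "dept_id": 2},
    {"name": "Cid", "dept_id": 1},
]
departments = [
    {"dept_id": 1, "dept": "Research"},
    {"dept_id": 2, "dept": "Sales"},
]

# Every operation takes tables and returns a table, so they compose freely.
research_names = project(
    select(natural_join(employees, departments),
           lambda row: row["dept"] == "Research"),
    ["name"],
)
print(research_names)  # [{'name': 'Ann'}, {'name': 'Cid'}]
```

Because each function returns the same kind of object it accepts, any result can be fed into the next operation, which is the closure property from the definition.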

Learning materials:

A good introductory course from Stanford. In general there is a lot of material on relational algebra and its theory: Coursera, Udacity. There is also a huge amount of material online, including good academic courses. My personal advice: you need to understand relational algebra very well, it is the foundation of foundations.

SQL

[Figure: taken from this article.]

SQL is essentially an implementation of relational algebra, with one important caveat: SQL is declarative! That is, when you write a query in the language of relational algebra, you are actually saying how to compute the result. With SQL you specify what you want to extract, and the DBMS then generates (efficient) relational algebra expressions on its own (their equivalence is known to us as Codd's theorem).
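
The declarative side can be seen with Python's built-in `sqlite3` module. This is a small sketch under stated assumptions: the tables and rows are hypothetical, and SQLite merely stands in for whatever DBMS you actually use. The query only says what is wanted; `EXPLAIN QUERY PLAN` then shows the steps the engine chose by itself (the exact plan text depends on the SQLite version).

```python
import sqlite3

# Hypothetical toy tables in an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept TEXT);
    INSERT INTO employees VALUES ('Ann', 1), ('Bob', 2), ('Cid', 1);
    INSERT INTO departments VALUES (1, 'Research'), (2, 'Sales');
""")

# We describe WHAT we want, not HOW to compute it
query = """
    SELECT e.name
    FROM employees AS e
    JOIN departments AS d ON e.dept_id = d.dept_id
    WHERE d.dept = 'Research'
    ORDER BY e.name
"""
names = [row[0] for row in conn.execute(query)]
print(names)  # ['Ann', 'Cid']

# Ask the engine which execution steps it picked for the same query
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step)

conn.close()
```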

[Figure: taken from this article.]

Why?

Relational DBMSs: Oracle, Postgres, SQL Server, and so on. You will very likely have to interact with them, which means reading SQL, and having to write it yourself is not unlikely either.

What to read and study

The same links as above (about relational algebra) lead to an incredible amount of material, for example, this.

By the way, what is NoSQL?

It is worth stressing once more that the term "NoSQL" arose completely spontaneously and has no generally accepted definition or scientific institution behind it. See the corresponding article on Habr.

In practice, people realized that the full relational model is not needed for many problems, especially those where performance is critical and a few simple queries with aggregation dominate, where the important thing is to compute metrics quickly and write them to the database. There, most relational features turned out to be not just unnecessary but harmful: why normalize something if that ruins the thing that matters most to us (for this particular task), namely performance?

Flexible schemas are also often needed instead of the fixed mathematical schema of the classical relational model. This greatly simplifies application development when it is critical to deploy the system and start working and processing results quickly, or when the schema and the types of the stored data are not that important.

For example, suppose we are building an expert system and want to store information about a specific domain together with some meta information. We may not know all the fields in advance and simply store a JSON document for each record. This gives us a very flexible environment for extending the data model and iterating quickly, so in this case NoSQL is preferable and even more readable. Here is an example record (from one of my projects where NoSQL was exactly where it was needed).

{"en_wikipedia_url":"https://en.wikipedia.org/wiki/Johnny_Cash",
"ru_wikipedia_url":"https://ru.wikipedia.org/wiki/?curid=301643",
"ru_wiki_pagecount":149616,
"entity":[42775,"Джонни Кэш","ru"],
"en_wiki_pagecount":2338861}

You can read more about NoSQL here.
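
As a rough illustration of how such schema-free records are consumed, the sketch below parses two JSON records with different sets of fields. The first is adapted from the record above; the second record and the helper `total_pagecount` are hypothetical and exist only for this example.

```python
import json

# First record is adapted from the example above; the second is hypothetical
raw_records = [
    '{"en_wikipedia_url": "https://en.wikipedia.org/wiki/Johnny_Cash", '
    '"ru_wiki_pagecount": 149616, "en_wiki_pagecount": 2338861, '
    '"entity": [42775, "Джонни Кэш", "ru"]}',
    '{"entity": [1, "Hypothetical Entity", "en"], "en_wiki_pagecount": 10}',
]

def total_pagecount(record):
    """Sum the page counts that are present, treating missing fields as 0."""
    return record.get("en_wiki_pagecount", 0) + record.get("ru_wiki_pagecount", 0)

records = [json.loads(line) for line in raw_records]
totals = {rec["entity"][1]: total_pagecount(rec) for rec in records}
print(totals)
```

No schema had to change to accept the second record: code that reads the data simply has to tolerate missing fields, which is the price of this flexibility.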

What to study?

Here you should rather analyze your task thoroughly: what properties it has and which existing NoSQL systems would fit that description. Then start studying that system.

Scripting Query Languages

At first glance it seems odd: what does Python have to do with any of this? It is a programming language and has nothing to do with queries.


  • Pandas is truly the Swiss Army knife of Data Science; a huge share of data transformation, aggregation and so on happens in it.
  • Numpy: vector calculations, matrices and linear algebra.
  • Scipy: a lot of mathematics lives in this package, especially statistics.
  • Jupyter lab: a lot of exploratory data analysis fits well into notebooks, so it is useful to know.
  • Requests: working with the network.
  • Pyspark is very popular among data engineers; you will most likely have to interact with it or with Spark simply because of their popularity.
  • Selenium: very useful for collecting data from sites and resources; sometimes there is simply no other way to get the data.

My main advice: learn Python!

Pandas

Let's take the following code as an example:

import pandas as pd

df = pd.read_csv("data/dataset.csv")

# Calculate and rename aggregations for return trips, per station pair.
# Named aggregation produces the final column names directly.
all_together = (
    df[df["trip_type"] == "return"]
    .groupby(["start_station_name", "end_station_name"])
    .agg(
        num_trips=("trip_duration_seconds", "size"),
        avg_duration_seconds=("trip_duration_seconds", "mean"),
        min_duration_seconds=("trip_duration_seconds", "min"),
        max_duration_seconds=("trip_duration_seconds", "max"),
    )
)

Essentially, we can see that the code fits the classic SQL pattern.

SELECT start_station_name, end_station_name, count(trip_duration_seconds) AS size, …..
FROM dataset
WHERE trip_type = 'return'
GROUP BY start_station_name, end_station_name

But the important part is that this code is a piece of a script and a pipeline; in effect, we are embedding queries into a Python pipeline. In this situation the query language comes to us from libraries such as Pandas or pySpark.
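
To show the SQL side of the same pattern running inside a Python script, here is a small sketch that uses the standard `sqlite3` module. The rows in `trips` are hypothetical, the column names follow the example above, and the aggregate aliases mirror the renamed pandas columns rather than anything from the original dataset.

```python
import sqlite3

# Hypothetical trips:
# (trip_type, start_station_name, end_station_name, trip_duration_seconds)
trips = [
    ("return", "A", "B", 600),
    ("return", "A", "B", 900),
    ("return", "B", "C", 300),
    ("one_way", "A", "B", 1200),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset (trip_type TEXT, start_station_name TEXT, "
    "end_station_name TEXT, trip_duration_seconds INTEGER)"
)
conn.executemany("INSERT INTO dataset VALUES (?, ?, ?, ?)", trips)

# The same filter / group / aggregate steps as in the pandas chain
rows = conn.execute("""
    SELECT start_station_name, end_station_name,
           COUNT(trip_duration_seconds) AS num_trips,
           AVG(trip_duration_seconds)   AS avg_duration_seconds,
           MIN(trip_duration_seconds)   AS min_duration_seconds,
           MAX(trip_duration_seconds)   AS max_duration_seconds
    FROM dataset
    WHERE trip_type = 'return'
    GROUP BY start_station_name, end_station_name
    ORDER BY start_station_name, end_station_name
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

The filter maps to `WHERE`, `groupby` to `GROUP BY`, and the aggregation functions to `COUNT`, `AVG`, `MIN` and `MAX`.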

In pySpark we see much the same kind of data transformation through a query language, in the spirit of:

(df.filter(df.trip_type == "return")
   .groupby("day")
   .agg({"duration": "mean"})
   .sort("day"))

Where and what to learn

Finding learning materials for Python itself is generally not a problem. There is a huge number of tutorials online on pandas and pySpark, as well as courses on Spark (and on DS itself). This material is easy to find by googling, and if I had to choose one package to focus on, it would of course be pandas. There is also a lot of material on the DS + Python combination.

Shell as a query language

Quite a few of the data processing and analysis projects I have worked on were, in fact, shell scripts that call code in Python and Java, plus the shell commands themselves. So in general you can treat pipelines in bash/zsh/etc as a kind of high-level query (you can, of course, stuff loops in there, but that is not typical for DS code in shell languages). Here is a simple example: I needed to map wikidata QIDs to full links to the Russian and English wikis. For that I wrote a simple query out of bash commands, and for the output I wrote a simple Python script, combining them like this:

pv "data/latest-all.json.gz" |
unpigz -c |
jq --stream "$JQ_QUERY" |
python3 scripts/post_process.py "output.csv"

where

JQ_QUERY='select((.[0][1] == "sitelinks" and (.[0][2]=="enwiki" or .[0][2] =="ruwiki") and .[0][3] =="title") or .[0][1] == "id")'

This was, in fact, the entire pipeline that produced the required mapping; as we can see, everything worked in streaming mode:

  • pv filepath: shows a progress bar based on the file size and passes the file contents on
  • unpigz -c: read a part of the archive and handed it to jq
  • jq with the --stream flag: produced results immediately and passed them to the postprocessor (same as in the first example) in Python
  • internally, the postprocessor was a simple state machine that formatted the output

All in all: a complex pipeline working in streaming mode on big data (0.5TB), without significant resources, built from a simple pipe and a couple of tools.
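
The original post_process.py is not shown, so the following is only a guess at what such a state machine could look like. It assumes that jq emits one compact event per line (for example with `-c`), that every entity's `id` event arrives before its `sitelinks` events, and that the output is a CSV of QID, English title and Russian title. The function name, the output layout and the sample events are all hypothetical.

```python
import csv
import io
import json

def post_process(lines, out):
    """Turn jq --stream events into CSV rows: qid, enwiki title, ruwiki title.

    State: the entity currently being read (current_id) and its titles so far.
    A new "id" event closes the previous entity and opens the next one.
    """
    writer = csv.writer(out)
    current_id, titles = None, {}

    def flush():
        # Emit the entity collected so far, if there is one
        if current_id is not None:
            writer.writerow(
                [current_id, titles.get("enwiki", ""), titles.get("ruwiki", "")]
            )

    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if len(event) != 2:  # closing events carry only a path, no value
            continue
        path, value = event
        if path[1] == "id":
            flush()
            current_id, titles = value, {}
        elif path[1] == "sitelinks":
            titles[path[2]] = value
    flush()

# Illustrative events in the shape [[index, key, ...], value]
sample = [
    '[[0,"id"],"Q42"]',
    '[[0,"sitelinks","enwiki","title"],"Douglas Adams"]',
    '[[0,"sitelinks","ruwiki","title"],"Адамс, Дуглас"]',
    '[[1,"id"],"Q1"]',
    '[[1,"sitelinks","enwiki","title"],"Universe"]',
]
buffer = io.StringIO()
post_process(sample, buffer)
print(buffer.getvalue())
```

In a real pipeline the function would read from `sys.stdin` and write to the file named on the command line; only one entity is held in memory at a time, which is what keeps the whole thing streaming.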

Another important tip: learn to work well and efficiently in the terminal and to write bash/zsh/etc.

Where will it be useful? Well, almost everywhere. Again, there is a LOT of material to study on the Internet. In particular, here is this previous article of mine.

R scripting

Again, the reader may exclaim: but this is a complete programming language! And of course they will be right. However, I usually ran into R in contexts where it was, in fact, very much like a query language.

R is a statistical computing environment and a language for statistical computing and visualization (according to this).

[Figure: taken from here. By the way, I recommend it, good material.]

Why does a data scientist need to know R? At the very least, because there is a huge layer of non-IT people who analyze data in R. I have come across it in the following places:

  • The pharmaceutical sector.
  • Biologists.
  • The financial sector.
  • People with a purely mathematical education who deal with statistics.
  • Specialized statistical models and machine learning models (which can often be found only in the author's version, as an R package).

Why is it actually a query language? In the form in which it is usually encountered, it really is a request to build a model, including reading the data and fixing the parameters of the query (the model), as well as visualizing the data in packages such as ggplot2, which is also a form of writing queries.

Example of a visualization query

ggplot(data = beav, 
       aes(x = id, y = temp, 
           group = activ, color = activ)) +
  geom_line() + 
  geom_point() +
  scale_color_manual(values = c("red", "blue"))

In general, many ideas from R have migrated into Python packages such as pandas, numpy or scipy, for example dataframes and data vectorization, so a lot of things in R will look familiar and convenient to you.

There are many sources to study, for example, this.

Knowledge graphs

Here my experience is a bit unusual, because I quite often work with knowledge graphs and query languages for graphs. So let's just briefly go over the basics, since this part is a little more exotic.

In classical relational databases we have a fixed schema, but here the schema is flexible: each predicate is effectively a "column", and even more than that.

Imagine that you are modeling a person and want to describe the key facts about them. As an example, let's take a specific person, Douglas Adams, and use this description as a basis.

www.wikidata.org/wiki/Q42

If we used a relational database, we would have to create a huge table, or several tables, with an enormous number of columns, most of which would be NULL or filled with some default value. For example, it is unlikely that many of us have an entry in the Korean national library. Of course, we could spread them over separate tables, but in the end this would be an attempt to model a flexible logical schema with predicates on top of a fixed relational one.

So imagine that all the data is stored as a graph, or as binary and unary boolean expressions.

Where can you run into this? First of all, when working with wikidata, and with any graph databases or linked data.

Below are the main query languages that I have used and worked with.

SPARQL

Wiki:
SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) is a query language for data represented in the RDF model, as well as a protocol for transmitting these queries and the responses to them. SPARQL is a W3C Consortium recommendation and one of the semantic web technologies.

But in reality it is a query language over logical unary and binary predicates. You simply specify what is fixed in a Boolean expression and what is not (very simplified).

The RDF (Resource Description Framework) base itself, over which SPARQL queries are executed, is a set of triples (subject, predicate, object), and a query selects the required triples according to the given constraints, in the spirit of: find X such that p_55(X, q_33) is true. Here, of course, p_55 is some relation with ID 55 and q_33 is an object with ID 33 (that is the whole story, again leaving out all sorts of details).
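
The "find X such that p_55(X, q_33) is true" idea can be sketched in a few lines of Python. This is not how a real triple store works internally, only an illustration: the triples are hypothetical, the function name `match` is made up, and terms that start with `?` are treated as variables.

```python
def match(triples, pattern):
    """Yield variable bindings for every triple that matches the pattern.

    Pattern terms starting with '?' are variables; all others are constants.
    """
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable must bind to the same value everywhere it occurs
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings

# Hypothetical (subject, predicate, object) triples using the IDs from the text
triples = [
    ("q_1", "p_55", "q_33"),
    ("q_2", "p_55", "q_33"),
    ("q_3", "p_55", "q_44"),
    ("q_1", "p_77", "q_33"),
]

# Find X such that p_55(X, q_33) is true
result = [b["?X"] for b in match(triples, ("?X", "p_55", "q_33"))]
print(result)  # ['q_1', 'q_2']
```

The query fixes the predicate and the object and leaves the subject open, which is exactly the "what is fixed and what is not" view of a query.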

Example of data representation:

[Figure: example of data representation. The pictures and the example with countries are taken from here.]

Basic query example

[Figure: a basic query.]

In effect, we want to find the values of the ?country variable for which the predicate member_of(?country, q458) is true, where q458 is the ID of the European Union.

[Figure: an example of a real SPARQL query inside a Python engine.]

Usually I have had to read SPARQL rather than write it. In that situation it is a useful skill to understand the language at least at a basic level, so that you can see exactly how the data is being retrieved.
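
For a feel of what reading such a query involves, here is a sketch of the European Union membership query against the public Wikidata endpoint. It is an illustration, not the query from the original project: `wdt:P463` ("member of") and `wd:Q458` (European Union) are Wikidata identifiers, while the helper names `build_url` and `run_query` and the User-Agent string are made up for this example. Only the URL is built when the snippet runs; `run_query` needs network access and is left uncalled.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# ?country is the free variable; predicate and object are fixed:
# wdt:P463 is "member of" and wd:Q458 is the European Union.
QUERY = """
SELECT ?country ?countryLabel WHERE {
  ?country wdt:P463 wd:Q458 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

def build_url(query, endpoint=ENDPOINT):
    """Encode the query as GET parameters and ask for a JSON response."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return endpoint + "?" + params

def run_query(query):
    """Send the query and return the labels of the matching countries."""
    request = urllib.request.Request(
        build_url(query), headers={"User-Agent": "sparql-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [row["countryLabel"]["value"] for row in data["results"]["bindings"]]

url = build_url(QUERY)
print(url)
```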

There is a lot of material to study online: for example, this and this. I usually just google specific constructs and examples, and so far that has been enough.

Logical query languages

You can read more on the topic in my article here. Here we will only briefly look at why logic languages are well suited for writing queries. Essentially, RDF is just a set of logical statements of the form p(X) and h(X,Y), and a logical query has the following form:

output(X) :- country(X), member_of(X,"EU").

Here we are creating a new predicate output/1 (/1 means unary), which holds for X provided that country(X) is true, i.e. X is a country, and member_of(X,"EU") is true as well.

That is, in this case both the data and the rules are represented in the same way, which makes it very easy and convenient to model problems.
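
The rule can be mimicked in Python with sets of tuples, one set per predicate. This is a toy sketch, not a logic engine: the facts are illustrative, and a single set comprehension plays the role of the rule body.

```python
# Facts: one set of tuples per predicate (illustrative toy data)
country = {("France",), ("Germany",), ("Norway",)}
member_of = {("France", "EU"), ("Germany", "EU"), ("Norway", "EFTA")}

# output(X) :- country(X), member_of(X, "EU").
output = {(x,) for (x,) in country if (x, "EU") in member_of}

print(sorted(output))  # [('France',), ('Germany',)]
```

The derived predicate `output` has the same shape as the facts it was built from, so it can be used in the body of further rules just like stored data.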

Where have I come across this in industry? A whole large project with a company that writes its queries in such a language, and also my current project, at the core of the system. It may seem a rather exotic thing, but sometimes it does happen.

[Figure: an example of a code fragment in a logic language processing wikidata.]

Materials: here are a couple of links on the modern logic programming language Answer Set Programming, which I recommend studying:

Source: www.habr.com
