I'll share, from my own experience, what proved useful, where, and when. This is an overview in thesis form, so that it is clear what is what and where you can dig deeper. It is purely subjective personal experience, though, and things may be completely different for you.
Why is it important to know and be able to use query languages? At its core, Data Science has several important stages of work, and the very first and most important one (without it, nothing will work at all!) is obtaining or extracting data. Most often the data sits somewhere in some form and has to be "retrieved" from there.
Query languages let you extract this data! Today I'll talk about the query languages that were useful to me, and show where and how exactly they were useful, that is, why they are worth learning.
There will be three main blocks of data query types, which we'll cover in this article:
- "Standard" query languages: what is usually meant when people talk about a query language, such as relational algebra or SQL.
- Scripting query languages: for example, the Python tools pandas and numpy, or shell scripting.
- Query languages for knowledge graphs and graph databases.
Everything written here is just personal experience: what proved useful, with a description of the situations and "why it was needed". Everyone can try on how similar situations might come up for them and prepare in advance by getting to know these languages, before having to apply them (urgently) on a project or landing on a project where they are required.
"Standard" query languages
Standard query languages are standard precisely in the sense that they are what we usually think of when we talk about queries.
Relational algebra
Why is relational algebra needed today? To understand why query languages are structured the way they are, and to use them consciously, you need to understand the core that underlies them.
What is relational algebra?
The formal definition is as follows: relational algebra is a closed system of operations on relations in the relational data model. Put more humanly, it is a system of operations on tables such that the result is always a table.
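To make the "closed system" idea concrete, here is a minimal sketch in plain Python. It is not taken from any library: relations are modeled as lists of dict rows, and the helper names `select`, `project`, and `natural_join`, as well as the station and trip data, are made up for illustration. The point is only that every operation takes tables and returns a table, so operations can be nested freely.

```python
# Relations are modeled as lists of dict rows; names and data are illustrative only.

def select(relation, predicate):
    """Selection: keep the rows for which predicate(row) is true."""
    return [row for row in relation if predicate(row)]


def project(relation, columns):
    """Projection: keep only the given columns and drop duplicate rows."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result


def natural_join(left, right):
    """Natural join: combine rows that agree on all shared column names."""
    result = []
    for left_row in left:
        for right_row in right:
            shared = set(left_row) & set(right_row)
            if all(left_row[c] == right_row[c] for c in shared):
                result.append({**left_row, **right_row})
    return result


stations = [
    {"station_id": 1, "city": "Toronto"},
    {"station_id": 2, "city": "Ottawa"},
]
trips = [
    {"trip_id": 10, "station_id": 1, "trip_type": "return"},
    {"trip_id": 11, "station_id": 2, "trip_type": "one_way"},
    {"trip_id": 12, "station_id": 1, "trip_type": "return"},
]

# Each operation returns a relation, so the result of one feeds the next.
return_cities = project(
    select(natural_join(trips, stations), lambda row: row["trip_type"] == "return"),
    ["city"],
)
print(return_cities)
```

Because the result of each step is again a table, the same three helpers are enough to express a join, a filter, and a column choice in any order.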
See all relational operations in
Why?
Once you start to understand what query languages are about and which operations stand behind the expressions of a particular query language, you often gain a deeper understanding of what works in query languages and how.
Study materials:
SQL
SQL is essentially an implementation of relational algebra, with one important caveat: SQL is declarative! That is, when you write a query in the language of relational algebra, you are actually saying how to compute the result, whereas with SQL you specify what you want to extract, and the DBMS then generates (efficient) relational algebra expressions itself. The equivalence between the two is a known result.
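The declarative side can be seen with a small sketch that uses Python's built-in `sqlite3` module. The `trips` table and its rows are made up for illustration. The query only states what is wanted; `EXPLAIN QUERY PLAN` then shows how SQLite itself decided to compute it (the exact wording of the plan depends on the SQLite version).

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (id INTEGER PRIMARY KEY, trip_type TEXT, duration INTEGER)"
)
conn.executemany(
    "INSERT INTO trips (trip_type, duration) VALUES (?, ?)",
    [("return", 600), ("one_way", 300), ("return", 900)],
)

# The query says WHAT we want, not HOW to compute it.
query = "SELECT COUNT(*), AVG(duration) FROM trips WHERE trip_type = 'return'"
result = conn.execute(query).fetchone()
print(result)

# The engine chooses the execution strategy; this prints its chosen plan.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step)
```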
Why?
Relational DBMSs: Oracle, Postgres, SQL Server, etc.
What to read and study
Following the same links as above (on relational algebra), there is an incredible amount of material.
By the way, what is NoSQL?
It is worth emphasizing once more that the term "NoSQL" has an entirely spontaneous origin and has no generally accepted definition or scientific institution behind it.
In fact, people realized that the full relational model is not needed for many problems, especially those where performance is critical and a few simple queries with aggregation dominate: where what matters is computing metrics quickly and writing them to the database. Most relational features turned out to be not just unnecessary but harmful. Why normalize something if that spoils the thing that matters most to us (for a particular task), namely performance?
Flexible schemas are also often needed instead of the fixed mathematical schemas of the classical relational model. This greatly simplifies application development when it is critical to deploy the system and start working and processing results quickly, or when the schema and the types of the stored data are not that important.
For example, say we are building an expert system and want to store information about a specific domain together with some meta information. We may not know all the fields in advance, so we simply store a JSON document for each record. This gives us a very flexible environment for extending the data model and iterating quickly, so in this case NoSQL is even preferable and more readable. An example record (from one of my projects where NoSQL was exactly what was needed):
{"en_wikipedia_url":"https://en.wikipedia.org/wiki/Johnny_Cash",
"ru_wikipedia_url":"https://ru.wikipedia.org/wiki/?curid=301643",
"ru_wiki_pagecount":149616,
"entity":[42775,"Джонни Кэш","ru"],
"en_wiki_pagecount":2338861}
You can read more
What to study?
Here, rather, you need to analyze your task thoroughly: what properties it has and which NoSQL systems are available that would fit that description. Then start studying that system.
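As a rough illustration of why a schemaless JSON record is convenient, here is a minimal sketch using only the standard `json` module. The first record is the one shown earlier in this section; the second record and the helper `total_pagecount` are hypothetical, added only to show a record with missing fields being handled without any schema change.

```python
import json

raw_records = [
    # The record from this section.
    '{"en_wikipedia_url":"https://en.wikipedia.org/wiki/Johnny_Cash",'
    ' "ru_wikipedia_url":"https://ru.wikipedia.org/wiki/?curid=301643",'
    ' "ru_wiki_pagecount":149616,'
    ' "entity":[42775,"Джонни Кэш","ru"],'
    ' "en_wiki_pagecount":2338861}',
    # Hypothetical record that has no Russian fields at all.
    '{"en_wikipedia_url":"https://en.wikipedia.org/wiki/Example",'
    ' "en_wiki_pagecount":10}',
]


def total_pagecount(record):
    """Sum the page counts that are present; missing fields count as 0."""
    return record.get("en_wiki_pagecount", 0) + record.get("ru_wiki_pagecount", 0)


records = [json.loads(line) for line in raw_records]
totals = [total_pagecount(record) for record in records]
print(totals)
```

New fields can be added to later records without touching the earlier ones, which is the flexibility described above. The price is that every reader has to cope with absent keys, as `dict.get` does here.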
Scripting query languages
At first glance it seems: what does Python have to do with this at all? It is a programming language, not something about queries.
- Pandas is truly the Swiss Army knife of Data Science; a huge amount of data transformation, aggregation, and so on happens in it.
- Numpy - vector calculations, matrices, and linear algebra.
- Scipy - there is a lot of mathematics in this package, especially statistics.
- Jupyter lab - a lot of exploratory data analysis fits well into notebooks, so it is useful to know.
- Requests - working with the network.
- Pyspark is very popular among data engineers; most likely you will have to interact with it or with Spark, simply because of their popularity.
- Selenium - very useful for collecting data from sites and resources; sometimes there is simply no other way to get the data.
My main advice: learn Python!
Pandas
Let's take the following code as an example:
import pandas as pd

df = pd.read_csv("data/dataset.csv")
# Filter to return trips, then calculate and name the aggregations.
# Named aggregation gives flat, readable column names directly,
# so no separate rename step (and no numpy import) is needed.
all_together = (df[df['trip_type'] == "return"]
    .groupby(['start_station_name', 'end_station_name'])
    .agg(num_trips=('trip_duration_seconds', 'size'),
         avg_duration_seconds=('trip_duration_seconds', 'mean'),
         min_duration_seconds=('trip_duration_seconds', 'min'),
         max_duration_seconds=('trip_duration_seconds', 'max')))
Basically, we see that the code fits the classic SQL pattern.
SELECT start_station_name, end_station_name, count(trip_duration_seconds) as size, …..
FROM dataset
WHERE trip_type = 'return'
GROUP BY start_station_name, end_station_name
But the important part is that this code is part of a script and a pipeline; in effect, we embed queries into a Python pipeline. In this situation the query language comes to us from libraries such as Pandas or pySpark.
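To check the correspondence on something runnable, here is a small sketch that embeds the SQL pattern above into a Python script via the built-in `sqlite3` module. The four rows are made up, and the column aliases are my own choice to mirror the pandas names; the elided part of the original `SELECT` list is filled with the same four aggregates purely as an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE dataset (
           start_station_name TEXT,
           end_station_name TEXT,
           trip_type TEXT,
           trip_duration_seconds INTEGER)"""
)
# Hypothetical sample rows, for illustration only.
rows = [
    ("A", "B", "return", 100),
    ("A", "B", "return", 300),
    ("A", "C", "return", 50),
    ("A", "B", "one_way", 999),
]
conn.executemany("INSERT INTO dataset VALUES (?, ?, ?, ?)", rows)

# Same shape as the pattern above: filter, group, aggregate.
query = """
SELECT start_station_name, end_station_name,
       COUNT(trip_duration_seconds) AS num_trips,
       AVG(trip_duration_seconds)   AS avg_duration_seconds,
       MIN(trip_duration_seconds)   AS min_duration_seconds,
       MAX(trip_duration_seconds)   AS max_duration_seconds
FROM dataset
WHERE trip_type = 'return'
GROUP BY start_station_name, end_station_name
ORDER BY start_station_name, end_station_name
"""
aggregated = conn.execute(query).fetchall()
for row in aggregated:
    print(row)
```

The `WHERE` clause plays the role of the boolean mask, `GROUP BY` the role of `groupby`, and the aggregate functions the role of `agg`.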
In general, in pySpark we see a similar kind of data transformation through a query language, in the spirit of:
(df.filter(df.trip_type == "return")
   .groupby("day")
   .agg({"duration": "mean"})
   .sort("day"))
Where and what to read
On Python itself in general
Shell as a query language
Quite a few of the data processing and analysis projects I have worked on were, in fact, shell scripts that call code in Python and Java, plus shell commands themselves. So in general you can treat pipelines in bash/zsh/etc as a kind of high-level query (you can, of course, stuff loops in there, but that is not typical for DS code in shell languages). Here is a simple example. I needed to map wikidata QIDs to full links to the Russian and English wikis. For this I wrote a simple query out of bash commands, and for the output I wrote a simple Python script, and put them together like this:
pv "data/latest-all.json.gz" |
unpigz -c |
jq --stream "$JQ_QUERY" |
python3 scripts/post_process.py "output.csv"
where
JQ_QUERY='select((.[0][1] == "sitelinks" and (.[0][2]=="enwiki" or .[0][2] =="ruwiki") and .[0][3] =="title") or .[0][1] == "id")'
This was, in fact, the entire pipeline that created the required mapping. As we can see, everything worked in streaming mode:
- pv filepath - shows a progress bar based on the file size and passes the file contents onward
- unpigz -c - read part of the archive and handed it to jq
- jq with the --stream flag - produced results immediately and passed them to the postprocessor in Python (the same idea as in the very first example)
- internally, the postprocessor was a simple state machine that formatted the output
All in all, a complex pipeline that works in streaming mode on big data (0.5TB) without significant resources, built from a simple pipe and a couple of tools.
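The postprocessor itself is not shown in this article, so the following is only a guess at what such a state machine might look like, not the author's script. It assumes jq emits one compact event per line (which would need jq's `-c` flag, not shown in the command above) in the usual `--stream` form `[path, value]`, and the function name `iter_mappings` and the sample events are hypothetical.

```python
import json


def iter_mappings(lines):
    """Yield (qid, enwiki_title, ruwiki_title) from a stream of jq events."""
    qid, titles = None, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if len(event) != 2:
            # Closing events carry only a path and no value; skip them.
            continue
        path, value = event
        if len(path) == 2 and path[1] == "id":
            # A new entity starts: flush the previous one first.
            if qid is not None:
                yield qid, titles.get("enwiki"), titles.get("ruwiki")
            qid, titles = value, {}
        elif len(path) == 4 and path[1] == "sitelinks" and path[3] == "title":
            # Remember the title for this wiki within the current entity.
            titles[path[2]] = value
    if qid is not None:
        # Flush the last entity once the stream ends.
        yield qid, titles.get("enwiki"), titles.get("ruwiki")


# Hand-written sample events, assuming one compact JSON event per line.
sample = [
    '[[0,"id"],"Q42"]',
    '[[0,"sitelinks","enwiki","title"],"Douglas Adams"]',
    '[[0,"sitelinks","ruwiki","title"],"Адамс, Дуглас"]',
    '[[1,"id"],"Q1"]',
    '[[1,"sitelinks","enwiki","title"],"Universe"]',
]
mappings = list(iter_mappings(sample))
for mapping in mappings:
    print(mapping)
```

The only state kept is the current entity ID and its titles, so memory use stays constant no matter how large the dump is, which is what makes the streaming approach workable.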
Another important tip: learn to work well and efficiently in the terminal and to write bash/zsh/etc.
Where will it be useful? Almost everywhere. Again, there is a LOT of material to study on the Internet. In particular, here
R scripting
Again, the reader may exclaim: but this is a full programming language! And of course they will be right. However, I usually ran into R in contexts where it was, in fact, very similar to a query language.
R is an environment and a language for statistical computing and visualization.
Why does a data scientist need to know R? At the very least because there is a huge layer of non-IT people who analyze data in R. I have come across it in the following places:
- The pharmaceutical sector.
- Biologists.
- The financial sector.
- People with a purely mathematical education who deal with statistics.
- Specialized statistical models and machine learning models (which can often be found only in the author's version, as an R package).
Why is it actually a query language? In the form in which it is usually encountered, it is really a request to build a model, including reading the data and fixing the query (model) parameters. Visualizing data with packages such as ggplot2 is also a form of writing queries.
Example of a visualization query
ggplot(data = beav,
aes(x = id, y = temp,
group = activ, color = activ)) +
geom_line() +
geom_point() +
scale_color_manual(values = c("red", "blue"))
In general, many ideas from R have migrated into python packages such as pandas, numpy, or scipy, for example dataframes and data vectorization. So many things in R will seem familiar and convenient to you.
There are many sources to study, for example,
Knowledge graphs
Here my experience is a little unusual, because I quite often have to work with knowledge graphs and query languages for graphs. So let's just briefly go over the basics, since this part is somewhat more exotic.
In classical relational databases we have a fixed schema, but here the schema is flexible: each predicate is effectively a "column", and even more than that.
Imagine that you are modeling a person and want to describe the key things about them. For example, let's take a specific person, Douglas Adams, and use this description as a basis.
If we used a relational database, we would have to create a huge table, or several tables, with a huge number of columns, most of which would be NULL or filled with some default value. For example, it is unlikely that many of us have an entry in the Korean national library. Of course, we could put such things into separate tables, but in the end that would be an attempt to model a flexible logical schema with predicates by means of a fixed relational one.
So imagine that all the data is stored as a graph, or as binary and unary boolean expressions.
Where can you run into this at all? First, working with
The following are the main query languages that I have used and worked with.
SPARQL
Wiki:
SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) is a query language for data represented in the RDF model, as well as a protocol for transmitting these queries and responding to them. SPARQL is a W3C Consortium recommendation and one of the technologies of the semantic web.
But in reality it is a query language over logical unary and binary predicates. You simply state what is fixed in a Boolean expression and what is not (very simplified).
The RDF (Resource Description Framework) base itself, over which SPARQL queries are executed, is a set of triples (subject, predicate, object), and the query selects the required triples according to the given constraints, in the spirit of: find X such that p_55(X, q_33) is true. Here, of course, p_55 is some relation with ID 55, and q_33 is an object with ID 33 (and that is the whole story, again leaving out all sorts of details).
Example of data representation:
Pictures and the example with countries are from here
Basic query example
In effect, we want to find the value of the ?country variable such that, for the predicate member_of, it is true that member_of(?country, q458), where q458 is the ID of the European Union.
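The matching idea behind such a query can be sketched in a few lines of plain Python. This is not SPARQL and not any RDF library: the triple store is just a set of tuples, the `match` helper is made up for illustration, and all IDs except q458 are hypothetical placeholders.

```python
# A toy triple store; every identifier except q458 is a made-up placeholder.
triples = {
    ("q_germany", "member_of", "q458"),
    ("q_france", "member_of", "q458"),
    ("q_japan", "member_of", "q_g7"),
    ("q_germany", "instance_of", "q_country"),
}


def match(pattern, store):
    """Return one variable binding per triple that matches the pattern.

    Terms starting with "?" are variables; all other terms must match exactly.
    """
    results = []
    for triple in store:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# Find ?country such that member_of(?country, q458) holds.
bindings = match(("?country", "member_of", "q458"), triples)
countries = sorted(binding["?country"] for binding in bindings)
print(countries)
```

Fixing two positions of the triple and leaving one as a variable is exactly the "what is fixed and what is not" idea described above. A real SPARQL engine additionally joins several such patterns on their shared variables.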
An example of a real SPARQL query inside a python engine:
Usually I have had to read SPARQL rather than write it. In that situation, understanding the language at least at a basic level is a useful skill for seeing exactly how the data is retrieved.
There is a lot of material to study online: for example, here
Logical query languages
You can read more on the topic in my article
output(X) :- country(X), member_of(X,"EU").
Here we are talking about creating a new predicate output/1 (/1 means unary), which holds for X provided that country(X) is true, i.e. X is a country, and also member_of(X,"EU").
That is, in this case both the data and the rules are represented in the same way, which lets us model problems very easily and well.
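A minimal sketch of how such a rule could be evaluated, in plain Python rather than in a real logic programming system. Facts are stored as tuples whose first element is the predicate name; the facts themselves and the function name `apply_output_rule` are made up for illustration.

```python
# Facts as tuples: (predicate, arguments...). Toy data, for illustration only.
facts = {
    ("country", "Germany"),
    ("country", "France"),
    ("country", "Japan"),
    ("member_of", "Germany", "EU"),
    ("member_of", "France", "EU"),
    ("member_of", "Japan", "G7"),
    ("member_of", "European_Parliament", "EU"),
}


def apply_output_rule(known):
    """Evaluate: output(X) :- country(X), member_of(X, "EU")."""
    derived = {
        ("output", fact[1])
        for fact in known
        if fact[0] == "country" and ("member_of", fact[1], "EU") in known
    }
    # Derived facts live in the same set as the original ones.
    return known | derived


all_facts = apply_output_rule(facts)
outputs = sorted(fact[1] for fact in all_facts if fact[0] == "output")
print(outputs)
```

The derived `output` facts have the same shape as the input facts, which is the sense in which data and rules share one representation. A real engine would apply rules repeatedly until no new facts appear.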
Where have I encountered this in industry? A whole large project with a company that writes queries in such a language, and also the current project, at the core of the system. It may seem like a rather exotic thing, but it does come up sometimes.
An example of a code fragment in a logical language processing wikidata:
Materials: here are a few links on the modern logic programming language Answer Set Programming, which I recommend studying:
http://peace.eas.asu.edu/aaai12tutorial/asp-tutorial-aaai.pdf http://ceur-ws.org/Vol-1145/tutorial1.pdf https://www.youtube.com/watch?v=gVQ0bP8zyHw https://www.youtube.com/watch?v=kdcd7Je2glc https://potassco.org/book/ http://potassco.sourceforge.net/teaching.html https://www.cs.uni-potsdam.de/~torsten/Potassco/Tutorials/fmcad12.pdf
Source: www.habr.com