I'm speaking from personal experience about what proved useful, where, and when. This is an overview in thesis form, so that it is clear what is what and where you can dig deeper. Keep in mind that it reflects only my own experience; things may be quite different for you.
Why is it important to know and be able to use query languages? At its core, Data Science consists of several important stages, and the very first and most important one (without it, nothing will work at all!) is obtaining or extracting data. Most often the data already sits somewhere in some form and needs to be "pulled out" of there.
Query languages let you extract exactly that data! Today I will describe the query languages that have been useful to me, and show where and how exactly they were used, that is, why they are worth learning.
There will be three main blocks of data query languages discussed in this article:
- "Standard" query languages: what people usually mean when they talk about a query language, such as relational algebra or SQL.
- Scripting query languages: for example, the Python tools pandas and numpy, or shell scripting.
- Query languages for knowledge graphs and graph databases.
Everything written here is just personal experience: what was useful, with a description of the situations and of "why it was needed". Everyone can judge how likely similar situations are to come up for them and try to prepare in advance by getting to know these languages before having to apply them (urgently) on a project, or before even landing on a project where they are needed.
"Standard" query languages
Standard query languages are "standard" precisely in the sense that they are what we usually think of when we talk about queries.
Relational algebra
Why is relational algebra needed today? To understand why query languages are structured the way they are, and to use them consciously, you need to understand the core that underlies them.
What is relational algebra?
The formal definition is: relational algebra is a closed system of operations on relations in the relational data model. To put it a bit more humanly, it is a system of operations on tables such that the result is always a table.
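The "closed" property is easiest to see in code. Below is a minimal sketch, assuming a table is modelled as a list of dictionaries; the helper names `select`, `project`, and `natural_join` and the sample data are made up for illustration and are not part of any library. Each operation takes tables and returns a table, so the operations can be chained freely.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]


def select(table: Table, predicate: Callable[[Row], bool]) -> Table:
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]


def project(table: Table, columns: List[str]) -> Table:
    """Projection: keep only the named columns and drop duplicate rows."""
    seen = set()
    result = []
    for row in table:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append({c: row[c] for c in columns})
    return result


def natural_join(left: Table, right: Table) -> Table:
    """Natural join: combine rows that agree on every shared column name."""
    result = []
    for l_row in left:
        for r_row in right:
            shared = set(l_row) & set(r_row)
            if all(l_row[c] == r_row[c] for c in shared):
                result.append({**l_row, **r_row})
    return result


# Hypothetical sample tables, invented for this sketch.
trips = [
    {"trip_id": 1, "station_id": 10, "duration": 300},
    {"trip_id": 2, "station_id": 11, "duration": 900},
    {"trip_id": 3, "station_id": 10, "duration": 1200},
]
stations = [
    {"station_id": 10, "name": "Central"},
    {"station_id": 11, "name": "Harbour"},
]

# Every step consumes tables and produces a table, so steps compose.
long_trips = select(trips, lambda row: row["duration"] > 600)
joined = natural_join(long_trips, stations)
names = project(joined, ["name"])
print(names)
```

Because every intermediate value is itself a table, any of the three steps could be reordered or nested without changing the types involved.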
See all the relational operations in
Why?
Once you understand what query languages are and which operations stand behind the expressions in them, you usually gain a much deeper understanding of what works in a query language and how.
Study materials:
SQL
SQL is essentially an implementation of relational algebra, with one important caveat: SQL is declarative! That is, when you write a query in the language of relational algebra, you are actually saying how to compute the result. With SQL you specify what you want to extract, and the DBMS then generates (efficient) relational algebra expressions on its own (the two are known to be equivalent in expressive power).
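The difference between "what" and "how" can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the `trips` table and its rows are invented, and SQLite stands in for whatever DBMS you actually use. The SQL query states the desired result, while the loop spells out one possible way to compute it.

```python
import sqlite3

# In-memory database with a small, made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (station TEXT, duration INTEGER)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Central", 300), ("Harbour", 900), ("Central", 1200)],
)

# Declarative: describe the result; the engine decides how to compute it.
sql = (
    "SELECT station, COUNT(*) FROM trips "
    "WHERE duration > 600 GROUP BY station ORDER BY station"
)
declarative = conn.execute(sql).fetchall()

# Procedural: spell out the steps (filter, then group and count) by hand.
counts = {}
for station, duration in conn.execute("SELECT station, duration FROM trips"):
    if duration > 600:
        counts[station] = counts.get(station, 0) + 1
procedural = sorted(counts.items())

# The engine can also report the plan it chose for the declarative query.
plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()

print(declarative, procedural)
```

Both paths give the same rows; only the declarative one leaves the choice of execution strategy to the database.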
Why?
Relational DBMSs (Oracle, Postgres, SQL Server, etc.) are still practically everywhere, and the chances are very high that you will have to interact with them. That means you will have to either read SQL (very likely) or write it (also not unlikely).
What to read and study
As with the links above (on relational algebra), there is an incredible amount of material available.
By the way, what is NoSQL?
"It is worth emphasizing once again that the term 'NoSQL' arose entirely spontaneously and has no generally accepted definition or scientific institution behind it."
In practice, people realized that the full relational model is not needed for many problems, especially those where performance is critical and a few simple aggregation queries dominate. There it is crucial to compute metrics quickly and write them to the database, and most relational features turn out to be not only unnecessary but harmful: why normalize something if doing so ruins the thing that matters most for the task at hand, namely performance?
Flexible schemas are also often needed instead of the fixed mathematical schemas of the classical relational model. This greatly simplifies application development when it is critical to deploy the system, start working quickly, and process results, or when the schema and the types of the stored data are not that important.
For example, suppose we are building an expert system and want to store information about a specific domain together with some meta information. We may not know all the fields in advance, so we simply store a JSON document for each record. This gives a very flexible environment for extending the data model and iterating quickly, so in this case NoSQL is actually preferable and more readable. Here is an example record (from one of my projects where NoSQL was exactly what was needed).
{"en_wikipedia_url":"https://en.wikipedia.org/wiki/Johnny_Cash",
"ru_wikipedia_url":"https://ru.wikipedia.org/wiki/?curid=301643",
"ru_wiki_pagecount":149616,
"entity":[42775,"Джонни Кэш","ru"],
"en_wiki_pagecount":2338861}
You can read more on this.
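To show what "we may not know all the fields" means for the code that reads such records, here is a minimal sketch using only the standard `json` module. The record is the one shown above; the field name `de_wiki_pagecount` is a hypothetical field that this record does not contain, used only to show how a missing key can be handled.

```python
import json

# The example record from above, stored as one JSON document.
raw_record = """
{"en_wikipedia_url": "https://en.wikipedia.org/wiki/Johnny_Cash",
 "ru_wikipedia_url": "https://ru.wikipedia.org/wiki/?curid=301643",
 "ru_wiki_pagecount": 149616,
 "entity": [42775, "Джонни Кэш", "ru"],
 "en_wiki_pagecount": 2338861}
"""

record = json.loads(raw_record)

# Fields that are present are read directly.
entity_id, label, language = record["entity"]
en_pagecount = record["en_wiki_pagecount"]

# A field that some records may lack is read with a default instead of
# forcing every record to share one fixed schema.
de_pagecount = record.get("de_wiki_pagecount")  # hypothetical field, absent here

print(entity_id, label, language, en_pagecount, de_pagecount)
```

New fields can be added to later records without migrating the earlier ones; the reader only needs to tolerate absent keys.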
What to study?
Here, rather, you simply need to analyze your task thoroughly: which properties it has and which available NoSQL systems fit that description. Then start studying that system.
Scripting query languages
At first glance it seems that Python has nothing to do with this at all: it is a programming language, not something about queries. But look at the libraries around it:
- Pandas is the Swiss Army knife of Data Science; a huge amount of data transformation, aggregation, and so on happens in it.
- Numpy: vector computations, matrices, and linear algebra live there.
- Scipy: there is a lot of mathematics in this package, especially statistics.
- Jupyter lab: a lot of exploratory data analysis fits naturally into notebooks, so it is useful to know.
- Requests: working with the network.
- Pyspark is very popular among data engineers; most likely you will have to interact with it or with Spark simply because of their popularity.
- Selenium: very useful for collecting data from websites and other resources; sometimes there is simply no other way to get the data.
My main advice: learn Python!
pandas
Let's take the following code as an example:
import pandas as pd

df = pd.read_csv("data/dataset.csv")

# Filter to return trips, group by station pair, then compute the
# aggregations and name the resulting columns in a single step
all_together = (
    df[df["trip_type"] == "return"]
    .groupby(["start_station_name", "end_station_name"])
    .agg(
        num_trips=("trip_duration_seconds", "size"),
        avg_duration_seconds=("trip_duration_seconds", "mean"),
        min_duration_seconds=("trip_duration_seconds", "min"),
        max_duration_seconds=("trip_duration_seconds", "max"),
    )
)
In fact, we can see that this code fits the classic SQL pattern:
SELECT start_station_name, end_station_name, count(trip_duration_seconds) AS size, …
FROM dataset
WHERE trip_type = 'return'
GROUP BY start_station_name, end_station_name
The important part, though, is that this code is part of a script and a pipeline; in effect, we are embedding queries into a Python pipeline. In this situation the query language comes to us from libraries such as Pandas or pySpark.
In general, in pySpark we see the same kind of data transformation through a query language, in the spirit of:
(
    df.filter(df.trip_type == "return")  # comparison needs ==, not =
    .groupby("day")
    .agg({"duration": "mean"})
    .sort("day")  # sort() needs at least one column
)
Where and what to read
On Python itself in general
Shell as a query language
Quite a few of the data processing and analysis projects I have worked on were, in fact, shell scripts that call code in Python, Java, and the shell commands themselves. So, in general, you can think of pipelines in bash/zsh/etc. as a kind of high-level query (you can, of course, stuff loops in there, but that is not typical for DS code in shell languages). Here is a simple example. I needed to build a mapping from wikidata QIDs to the full links to the Russian and English wikis. For that I wrote a simple query out of bash commands, and for the output I wrote a simple Python script. I put them together like this:
pv "data/latest-all.json.gz" |
  unpigz -c |
  jq --stream "$JQ_QUERY" |
  python3 scripts/post_process.py "output.csv"
where
JQ_QUERY='select((.[0][1] == "sitelinks" and (.[0][2]=="enwiki" or .[0][2] =="ruwiki") and .[0][3] =="title") or .[0][1] == "id")'
This was, in fact, the entire pipeline that produced the required mapping. As we can see, everything worked in streaming mode:
- pv filepath shows a progress bar based on the file size and passes the contents onward
- unpigz -c reads a part of the compressed file and hands it to jq
- jq with the --stream key immediately produces results and passes them to the postprocessor written in Python (just as in the very first example)
- internally, the postprocessor was a simple state machine that formatted the output
Overall, this is a complex pipeline working in streaming mode on big data (0.5TB), without significant resources, built from a simple pipe and a couple of tools.
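The post-processing step described in the list above can be sketched as a small state machine. This is not the original `post_process.py`, which is not shown in the article; it is a minimal sketch that assumes jq emits one compact JSON event per line (which would need jq's `-c` option) in the `[path, value]` shape produced by `--stream`, and that each entity's `id` event arrives before its `sitelinks` events. The function name `map_events` and the sample events are made up.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[str, Optional[str], Optional[str]]


def map_events(lines: Iterable[str]) -> Iterator[Row]:
    """Turn a stream of [path, value] events into (qid, en_title, ru_title) rows."""
    current_id = None  # state: the entity being assembled
    titles = {}        # state: wiki name -> page title for that entity
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if len(event) != 2:
            continue  # closing events carry only a path, no value
        path, value = event
        if len(path) == 2 and path[1] == "id":
            # A new entity starts: flush the previous one first.
            if current_id is not None:
                yield (current_id, titles.get("enwiki"), titles.get("ruwiki"))
            current_id, titles = value, {}
        elif len(path) == 4 and path[1] == "sitelinks" and path[3] == "title":
            titles[path[2]] = value
    if current_id is not None:
        yield (current_id, titles.get("enwiki"), titles.get("ruwiki"))


# Illustrative events only; real input would come from sys.stdin.
sample = [
    '[[0,"id"],"Q42"]',
    '[[0,"sitelinks","enwiki","title"],"Douglas Adams"]',
    '[[0,"sitelinks","ruwiki","title"],"Адамс, Дуглас"]',
    '[[1,"id"],"Q64"]',
    '[[1,"sitelinks","enwiki","title"],"Berlin"]',
]
rows = list(map_events(sample))
print(rows)
```

Because the function is a generator over lines, it keeps only one entity in memory at a time, which is what lets the whole pipeline stay in streaming mode.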
One more important tip: learn to work well and efficiently in the terminal and to write bash/zsh/etc.
Where will it be useful? Almost everywhere. Again, there is a LOT of learning material on the Internet.
R scripting
Again, the reader may exclaim: but this is a whole programming language! And of course they will be right. However, I usually ran into R in contexts where it was, in effect, very similar to a query language.
R is an environment and a language for statistical computing and visualization.
Why does a data scientist need to know R? At the very least because there is a huge layer of non-IT people who analyze data in R. I have come across it in the following places:
- The pharmaceutical sector.
- Biologists.
- The financial sector.
- People with a purely mathematical education who work with statistics.
- Specialized statistical models and machine learning models (which can often be found only in the author's version, as an R package).
Why is it actually a query language? In the form in which it is often encountered, it is really a request to build a model, including reading the data and fixing the query (model) parameters, as well as visualizing data with packages such as ggplot2. That is also a form of writing queries.
Example of a visualization query
ggplot(data = beav,
aes(x = id, y = temp,
group = activ, color = activ)) +
geom_line() +
geom_point() +
scale_color_manual(values = c("red", "blue"))
In general, many ideas from R have migrated into Python packages such as pandas, numpy, or scipy, for example dataframes and data vectorization. So many things in R will feel familiar and convenient to you.
There are many sources to learn from.
Knowledge graphs
Here my experience is somewhat unusual, because I quite often have to work with knowledge graphs and graph query languages. So let's just go over the basics briefly, since this part is a little more exotic.
In classical relational databases we have a fixed schema, but here the schema is flexible: each predicate is effectively a "column", and even more than that.
Imagine that you are modelling a person and want to describe the key facts. For example, let's take a specific person, Douglas Adams, and use his description as a basis.
If we used a relational database, we would have to create a huge table, or several tables, with a huge number of columns, most of which would be NULL or filled with some default False value. For example, it is unlikely that many of us have an entry in the Korean national library. Of course, we could move such things into separate tables, but in the end that would be an attempt to model a flexible logical schema of predicates with a fixed relational one.
So imagine that all the data is stored as a graph, or as binary and unary boolean expressions.
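A minimal sketch of that idea, assuming facts are kept as plain (subject, predicate, object) tuples: the identifiers and the helper `describe` below are invented for illustration and do not follow any particular graph database's API. Each subject carries only the predicates that actually apply to it, so there are no NULL-filled columns.

```python
from collections import defaultdict

# Illustrative facts; the identifiers are made up for this sketch.
facts = {
    ("douglas_adams", "instance_of", "human"),
    ("douglas_adams", "occupation", "writer"),
    ("douglas_adams", "notable_work", "the_hitchhikers_guide_to_the_galaxy"),
    ("johnny_cash", "instance_of", "human"),
    ("johnny_cash", "occupation", "singer"),
    ("johnny_cash", "instrument", "guitar"),
}


def describe(subject):
    """Collect every predicate and its values stored for one subject."""
    description = defaultdict(set)
    for s, p, o in facts:
        if s == subject:
            description[p].add(o)
    return dict(description)


adams = describe("douglas_adams")
cash = describe("johnny_cash")

# Each subject has its own set of "columns"; nothing is padded with NULL.
print(sorted(adams))
print(sorted(cash))
```

Adding a new kind of fact is just adding a tuple with a new predicate; no table needs to be altered.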
Where might you even run into this? First of all, when working with
Below are the main query languages that I have used and worked with.
SPARQL
Wiki:
SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) is a query language for data represented in the RDF model, as well as a protocol for sending these queries and responding to them. SPARQL is a W3C Consortium recommendation and one of the semantic web technologies.
In reality, though, it is a query language over logical unary and binary predicates. You simply state which parts of a Boolean expression are fixed and which are not (very much simplified).
The RDF (Resource Description Framework) base itself, over which SPARQL queries are executed, is a set of triples: subject, predicate, object.
A query selects the required triples according to the stated constraints, in the spirit of: find X such that p_55(X, q_33) is true, where p_55 is some relation with ID 55 and q_33 is an object with ID 33 (here and throughout, again leaving out all kinds of details).
Example of data representation: the original figures, including an example with countries, are not preserved here.
Basic query example
In effect, we want to find the values of the ?country variable such that, for the predicate member_of, it is true that member_of(?country, q458), where q458 is the ID of the European Union.
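The selection described above can be sketched in a few lines of plain Python. This is a toy matcher, not a SPARQL engine: the function `match` is invented for this sketch, terms starting with `?` are treated as variables, and every identifier except q458 (taken from the text) is a placeholder.

```python
def match(triples, pattern):
    """Return one variable binding per triple that fits the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable binds to the value; a repeated variable must agree.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break  # a fixed term must match exactly
        else:
            results.append(binding)
    return results


# Placeholder identifiers, except q458, which the text uses for the EU.
triples = [
    ("q_de", "member_of", "q458"),
    ("q_fr", "member_of", "q458"),
    ("q_us", "member_of", "q_other_org"),
]

answers = match(triples, ("?country", "member_of", "q458"))
countries = sorted(binding["?country"] for binding in answers)
print(countries)
```

The fixed parts of the pattern play the role of the constraints, and the variable collects whatever values make the statement true.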
An example of a real SPARQL query inside a Python engine:
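As a minimal sketch of what such a query can look like (not the author's original query): it assumes the public Wikidata endpoint, the Wikidata property P463 ("member of"), and the item Q458 for the European Union. The function `run_query` and the User-Agent string are made up for this sketch, and only the standard library is used; nothing is sent over the network until `run_query` is called.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

# Find every ?country for which "member of" (P463) the EU (Q458) holds.
QUERY = """
SELECT ?country ?countryLabel WHERE {
  ?country wdt:P463 wd:Q458 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def run_query(query, endpoint=ENDPOINT):
    """Send the query over HTTP and return the matched entity URIs."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={"User-Agent": "query-language-notes-example/0.1"},  # made-up name
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [b["country"]["value"] for b in payload["results"]["bindings"]]


if __name__ == "__main__":
    for uri in run_query(QUERY):
        print(uri)
```

The triple pattern in the `WHERE` block is the same member_of(?country, q458) statement from the basic example, written in SPARQL syntax.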
In general, I have had to read SPARQL rather than write it. In that situation it is probably a useful skill to understand the language at least at a basic level, so that you can see exactly how the data is being retrieved.
There is a lot of learning material online.
Logical query languages
You can read more about this topic in my article
output(X) :- country(X), member_of(X,"EU").
Here we are creating a new predicate output/1 (/1 means unary) that holds for any X for which country(X) is true, that is, X is a country, and member_of(X,"EU") is true as well.
That is, in this case both the data and the rules are expressed in the same way, which lets us model problems very easily and cleanly.
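To make the reading of the rule concrete, here is a minimal sketch in Python, assuming facts are stored as sets: `country` holds the unary facts and `member_of` the binary ones. The sample facts are illustrative, and a real logic programming system would of course do far more than this one comprehension.

```python
# Facts: unary predicate country/1 and binary predicate member_of/2.
country = {"germany", "france", "norway"}
member_of = {
    ("germany", "EU"),
    ("france", "EU"),
    ("norway", "EFTA"),
}

# Rule: output(X) :- country(X), member_of(X, "EU").
# output holds for every X that satisfies both conditions in the rule body.
output = {x for x in country if (x, "EU") in member_of}

print(sorted(output))
```

The derived set `output` has the same shape as the stored facts, which is the sense in which data and rules are represented uniformly.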
Where have I encountered this in industry? On a whole large project with a company that writes its queries in such a language, and also on my current project, in the core of the system. It may seem like a rather exotic thing, but sometimes it does come up.
An example of a code fragment in a logical language that processes wikidata:
Materials: here are a few links on the modern logic programming language Answer Set Programming, which I recommend studying:
- http://peace.eas.asu.edu/aaai12tutorial/asp-tutorial-aaai.pdf
- http://ceur-ws.org/Vol-1145/tutorial1.pdf
- https://www.youtube.com/watch?v=gVQ0bP8zyHw
- https://www.youtube.com/watch?v=kdcd7Je2glc
- https://potassco.org/book/
- http://potassco.sourceforge.net/teaching.html
- https://www.cs.uni-potsdam.de/~torsten/Potassco/Tutorials/fmcad12.pdf
Source: www.habr.com