Data Scientist's Notes: A Personalized Review of Data Query Languages

I am telling you from personal experience what was useful, where, and when. The text is brief and thesis-like, so that it is clear what is what and where you can dig further. Keep in mind that this is purely my own subjective experience; things may be quite different for you.

Why is it important to know and be able to use query languages? At its core, Data Science has several important stages of work, and the very first and most important one (without it, certainly nothing will work!) is obtaining or extracting data. Most often, the data sits somewhere in some form and needs to be "retrieved" from there.

Query languages let you extract exactly this data! Today I will tell you about the query languages that have been useful to me, show where and how exactly they helped, and explain why they are worth learning.

There will be three main blocks of data query language types, which we will discuss in this article:

  • "Standard" query languages are what is usually meant when people talk about a query language, such as relational algebra or SQL.
  • Scripting query languages: for example, the Python tools pandas and numpy, or shell scripting.
  • Query languages for knowledge graphs and graph databases.

Everything written here is just personal experience: what was useful, with a description of the situation and "why it was needed". Everyone can judge how similar situations might come up for them and try to prepare in advance by getting to know these languages, before having to apply them (urgently) on a project or even to get onto a project where they are required.

"Standard" query languages

Standard query languages are standard precisely in the sense that they are what we usually think of when we talk about queries.

Relational algebra

Why is relational algebra needed today? To understand well why query languages are structured the way they are, and to use them consciously, you need to understand the core that underlies them.

What is relational algebra?

The formal definition is as follows: relational algebra is a closed system of operations on relations in the relational data model. To put it a little more humanly, it is a system of operations on tables such that the result is always a table.

See all the relational operations in this article from Habr; here we describe why you need to know it and where it comes in handy.

Why?

Once you start to understand what query languages are about and which operations stand behind the expressions of a particular query language, you often get a deeper understanding of what works in query languages and how.

[Figure: an example of an operation, join, which combines tables. Taken from this article.]
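
To make the "result is always a table" idea concrete, here is a minimal Python sketch of three relational operations (selection, projection and natural join) over tables represented as lists of dictionaries. The table names, columns and values are invented for illustration; this is not how a real DBMS implements these operations.

```python
# Tables are lists of rows; each row is a dict mapping column name to value.
# All data below is hypothetical and exists only to illustrate the operations.
stations = [
    {"station_id": 10, "name": "Central"},
    {"station_id": 11, "name": "Harbor"},
]
trips = [
    {"trip_id": 1, "station_id": 10, "duration": 600},
    {"trip_id": 2, "station_id": 11, "duration": 120},
    {"trip_id": 3, "station_id": 10, "duration": 900},
]


def select(relation, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attrs):
    """Projection: keep only the given columns and drop duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def natural_join(left, right):
    """Natural join: combine rows that agree on all shared column names."""
    out = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[k] == r_row[k] for k in shared):
                out.append({**l_row, **r_row})
    return out


# Every operation takes tables and returns a table, so they compose freely.
long_trip_stations = project(
    select(natural_join(trips, stations), lambda r: r["duration"] > 300),
    ["name"],
)
print(long_trip_stations)
```

Because each function returns the same kind of structure it accepts, the calls can be nested in any order, which is exactly the closure property from the definition above.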

Materials for study:

A good introductory course from Stanford. In general, there is a lot of material on relational algebra and theory: Coursera, Udacity. There is also a huge amount of material online, including good academic courses. My personal advice: you need to understand relational algebra very well, because it is the foundation of foundations.

SQL

[Figure taken from this article.]

SQL is essentially an implementation of relational algebra, with one important caveat: SQL is declarative! That is, when you write a query in the language of relational algebra, you actually state how to compute the result, whereas with SQL you specify what you want to extract, and the DBMS itself then generates (efficient) expressions in the language of relational algebra (their equivalence is known to us as Codd's theorem).
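
The difference between "how" and "what" can be sketched with Python's built-in sqlite3 module. The tables and values below are invented for illustration. The first part spells out the computation step by step; the second only states the desired result and leaves the plan to the database engine, which can be inspected with SQLite's EXPLAIN QUERY PLAN.

```python
import sqlite3

# Hypothetical toy data, created in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stations (station_id INTEGER, name TEXT);
    CREATE TABLE trips (trip_id INTEGER, station_id INTEGER, duration INTEGER);
    INSERT INTO stations VALUES (10, 'Central'), (11, 'Harbor');
    INSERT INTO trips VALUES (1, 10, 600), (2, 11, 120), (3, 10, 900);
    """
)

# "How": we decide the order of operations ourselves (scan, filter, look up).
stations_by_id = dict(conn.execute("SELECT station_id, name FROM stations"))
imperative = sorted(
    {
        (stations_by_id[station_id],)
        for station_id, duration in conn.execute(
            "SELECT station_id, duration FROM trips"
        )
        if duration > 300
    }
)

# "What": we describe the result; the engine chooses how to compute it.
query = """
    SELECT DISTINCT s.name
    FROM trips AS t
    JOIN stations AS s ON s.station_id = t.station_id
    WHERE t.duration > 300
    ORDER BY s.name
"""
declarative = conn.execute(query).fetchall()

# The plan the engine picked; its exact text depends on the SQLite version.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(imperative, declarative)
for step in plan:
    print(step)
```

Both approaches give the same rows here, but only in the declarative version is the database free to reorder the join and the filter.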

[Figure taken from this article.]

Why?

Relational DBMSs such as Oracle, Postgres, SQL Server, etc. are still practically everywhere, and there is an incredibly high chance that you will have to interact with them. This means that you will have to either read SQL (which is very likely) or write it (which is not unlikely either).

What to read and study

Following the same links as above (about relational algebra), there is an incredible amount of material, for example, this.

By the way, what is NoSQL?

"It is worth emphasizing once again that the term "NoSQL" has an entirely spontaneous origin and has no generally accepted definition or scientific institution behind it." According to the corresponding article on Habr.

In fact, people realized that a full relational model is not needed to solve many problems, especially those where, for example, performance is critical and certain simple queries with aggregation dominate. In such cases it is critical to compute metrics quickly and write them to the database, and most relational features turn out to be not only unnecessary but even harmful. Why normalize something if doing so spoils the thing that matters most to us (for a specific task), namely performance?

Also, flexible schemas are often needed instead of the fixed mathematical schemas of the classical relational model. This greatly simplifies application development when it is critical to deploy the system and start working quickly, processing the results, or when the schema and the types of the stored data are not that important.

For example, we are building an expert system and want to store information on a specific domain along with some meta information. We may not know all the fields in advance, so we simply store a JSON document for each record. This gives us a very flexible environment for extending the data model and iterating quickly, so in this case NoSQL would even be preferable and more readable. An example record (from one of my projects where NoSQL was exactly what was needed):

{"en_wikipedia_url":"https://en.wikipedia.org/wiki/Johnny_Cash",
"ru_wikipedia_url":"https://ru.wikipedia.org/wiki/?curid=301643",
"ru_wiki_pagecount":149616,
"entity":[42775,"Джонни Кэш","ru"],
"en_wiki_pagecount":2338861}

You can read more about NoSQL here.
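
As a rough illustration of why schema-less JSON records are convenient, here is a small Python sketch that queries such records without declaring any columns up front. The first record is the one shown above; the second record and the helper function name are hypothetical, and a real project would use an actual document store rather than an in-memory list.

```python
import json

# The first record comes from the article; the second is a hypothetical record
# with a different set of fields, to show that no fixed schema is required.
raw_records = [
    '{"en_wikipedia_url":"https://en.wikipedia.org/wiki/Johnny_Cash",'
    '"ru_wikipedia_url":"https://ru.wikipedia.org/wiki/?curid=301643",'
    '"ru_wiki_pagecount":149616,'
    '"entity":[42775,"Джонни Кэш","ru"],'
    '"en_wiki_pagecount":2338861}',
    '{"entity":[1,"Hypothetical Entity","en"],"en_wiki_pagecount":12}',
]

records = [json.loads(line) for line in raw_records]


def with_min_en_pagecount(records, minimum):
    """Return (entity id, en page count, ru page count or None) per match."""
    return [
        (r["entity"][0], r["en_wiki_pagecount"], r.get("ru_wiki_pagecount"))
        for r in records
        # .get() tolerates records that lack the field entirely.
        if r.get("en_wiki_pagecount", 0) >= minimum
    ]


print(with_min_en_pagecount(records, 1000))
```

Missing fields simply come back as None instead of forcing a NULL-filled column on every record.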

What to study?

Here, rather, you just need to analyze your task thoroughly: what properties it has and which available NoSQL systems fit that description. Then start studying that system.

Scripting Query Languages

At first it seems fair to ask what Python has to do with this at all: it is a programming language, and not about queries at all.


  • Pandas is the Swiss Army knife of Data Science; a huge amount of data transformation, aggregation, etc. happens in it.
  • Numpy: vector computations, matrices and linear algebra.
  • Scipy: there is a lot of math in this package, especially statistics.
  • Jupyter lab: a lot of exploratory data analysis fits well into notebooks, so it is useful to know.
  • Requests: working with the network.
  • Pyspark is very popular among data engineers; most likely you will have to interact with it or with Spark, simply because of their popularity.
  • *Selenium: very useful for collecting data from sites and resources; sometimes there is simply no other way to get the data.

My main advice: learn Python!

pandas

Let's take the following code as an example:

import pandas as pd

df = pd.read_csv("data/dataset.csv")
# Keep only return trips, then calculate and name the aggregations
# for each (start station, end station) pair
all_together = (
    df[df["trip_type"] == "return"]
    .groupby(["start_station_name", "end_station_name"])["trip_duration_seconds"]
    .agg(
        num_trips="size",
        avg_duration_seconds="mean",
        min_duration_seconds="min",
        max_duration_seconds="max",
    )
)

In essence, we see that the code fits the classic SQL pattern.

SELECT start_station_name, end_station_name, count(trip_duration_seconds) AS size, …
FROM dataset
WHERE trip_type = 'return'
GROUP BY start_station_name, end_station_name
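
To check that the filter, group and aggregate steps really map onto SELECT / WHERE / GROUP BY, here is a small sketch using Python's built-in sqlite3 module. The rows are invented for illustration, and the extra aggregate columns (average, minimum, maximum) are an assumption that simply mirrors the aggregations in the pandas snippet, since the SQL above elides them.

```python
import sqlite3

# Hypothetical trips: (start station, end station, trip type, duration in s).
rows = [
    ("A", "B", "return", 100),
    ("A", "B", "return", 300),
    ("A", "B", "one_way", 50),
    ("C", "D", "return", 200),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset ("
    "start_station_name TEXT, end_station_name TEXT, "
    "trip_type TEXT, trip_duration_seconds INTEGER)"
)
conn.executemany("INSERT INTO dataset VALUES (?, ?, ?, ?)", rows)

# WHERE plays the role of the boolean mask, GROUP BY of .groupby(),
# and the aggregate functions of .agg().
summary = conn.execute(
    """
    SELECT start_station_name,
           end_station_name,
           count(trip_duration_seconds) AS num_trips,
           avg(trip_duration_seconds)   AS avg_duration_seconds,
           min(trip_duration_seconds)   AS min_duration_seconds,
           max(trip_duration_seconds)   AS max_duration_seconds
    FROM dataset
    WHERE trip_type = 'return'
    GROUP BY start_station_name, end_station_name
    ORDER BY start_station_name, end_station_name
    """
).fetchall()

for line in summary:
    print(line)
```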

But the important part is that this code is part of a script and a pipeline; in effect, we are embedding queries into a Python pipeline. In this situation, the query language comes to us from libraries such as Pandas or pySpark.

In general, in pySpark we see a similar kind of data transformation expressed through a query language, in the spirit of:

(df.filter(df.trip_type == "return")
   .groupby("day")
   .agg({"duration": "mean"})
   .sort("day"))  # sort() needs at least one column; "day" is assumed here

Where and what to read

For Python itself, finding study material is generally not a problem. There is a huge number of tutorials online on pandas and pySpark, as well as courses on Spark (and on DS itself). Overall, the material here is great for googling, and if I had to pick one package to focus on, it would be pandas, of course. There is also a great deal of material on the combination of DS + Python.

Shell as a query language

Quite a few of the data processing and analysis projects I have worked on are, in fact, shell scripts that call code in Python and Java, plus shell commands themselves. Therefore, in general, you can think of pipelines in bash/zsh/etc. as a kind of high-level query (you can, of course, stuff loops in there, but this is not typical for DS code in shell languages). Here is a simple example: I needed to map Wikidata QIDs to full links to the Russian and English wikis. For this I wrote a simple query out of bash commands, and for the output I wrote a simple Python script, which I put together like this:

pv "data/latest-all.json.gz" |
unpigz -c |
jq --stream "$JQ_QUERY" |
python3 scripts/post_process.py "output.csv"

where

JQ_QUERY='select((.[0][1] == "sitelinks" and (.[0][2]=="enwiki" or .[0][2] =="ruwiki") and .[0][3] =="title") or .[0][1] == "id")'

This was, in fact, the entire pipeline that created the required mapping; as we can see, everything worked in streaming mode:

  • pv filepath: shows a progress bar based on the file size and passes the file contents onward
  • unpigz -c: reads part of the archive and passes it to jq
  • jq with the --stream key: immediately produces results and passes them to the postprocessor in Python (just as in the very first example)
  • internally, the postprocessor was a simple state machine that formatted the output

Overall, this is a complex pipeline that works in streaming mode on big data (0.5TB) without significant resources, and it is built from a simple pipe and a couple of tools.
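
The postprocessor script itself is not shown in the article, so the following is only a hypothetical sketch of what such a streaming state machine could look like in Python. It assumes that the filtered jq --stream output arrives as one JSON event per line of the form [path, value], and that each entity's "id" event arrives before its sitelink events. The sample events are invented for illustration.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[str, Optional[str], Optional[str]]  # (qid, enwiki title, ruwiki title)


def parse_events(lines: Iterable[str]) -> Iterator[list]:
    """Lazily decode one JSON event per non-empty input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_rows(events: Iterable[list]) -> Iterator[Row]:
    """State machine: remember the current entity, emit it when the next starts."""
    qid, titles = None, {}
    for event in events:
        if len(event) != 2:  # jq closing events carry only a path, no value
            continue
        path, value = event
        if len(path) == 2 and path[1] == "id":
            if qid is not None:
                yield (qid, titles.get("enwiki"), titles.get("ruwiki"))
            qid, titles = value, {}
        elif len(path) == 4 and path[1] == "sitelinks" and path[3] == "title":
            titles[path[2]] = value
    if qid is not None:  # flush the last entity
        yield (qid, titles.get("enwiki"), titles.get("ruwiki"))


# Invented sample events, shaped like filtered `jq --stream` output.
sample = [
    '[[0,"id"],"Q42"]',
    '[[0,"sitelinks","enwiki","title"],"Douglas Adams"]',
    '[[0,"sitelinks","ruwiki","title"],"Адамс, Дуглас"]',
    '[[1,"id"],"Q64"]',
    '[[1,"sitelinks","enwiki","title"],"Berlin"]',
]
rows = list(to_rows(parse_events(sample)))
for row in rows:
    print(row)
```

In a real pipeline the lines would come from sys.stdin rather than a list, and because both functions are generators, only one entity is held in memory at a time, which is what makes the approach workable on a very large dump.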

Another important tip: learn to work well and efficiently in the terminal and to write bash/zsh/etc.

Where will it be useful? Almost everywhere; and again, there is a LOT of material to study on the Internet. In particular, here is my previous article.

R scripting

Again, the reader may exclaim: but this is a whole programming language! And of course, they will be right. However, I usually encountered R in contexts where it was, in fact, very similar to a query language.

R is a statistical computing environment and a language for statistical computing and visualization (according to this).

[Figure taken from here.] By the way, I recommend it; it is good material.

Why does a data scientist need to know R? At the very least, because there is a huge layer of non-IT people who analyze data in R. I have come across it in the following places:

  • Pharmaceutical sector.
  • Biologists.
  • Financial sector.
  • People with a purely mathematical education who deal with statistics.
  • Specialized statistical models and machine learning models (which can often be found only in the author's version as an R package).

Why is it actually a query language? In the form in which it is often encountered, it is really a request to build a model, including reading the data and fixing the query (model) parameters, as well as visualizing data in packages such as ggplot2, which is also a form of writing queries.

Example queries for visualization

ggplot(data = beav, 
       aes(x = id, y = temp, 
           group = activ, color = activ)) +
  geom_line() + 
  geom_point() +
  scale_color_manual(values = c("red", "blue"))

In general, many ideas from R have migrated into Python packages such as pandas, numpy or scipy, for example dataframes and data vectorization, so a lot of things in R will seem familiar and convenient to you.

There are many sources to study, for example, this.

Knowledge graphs

Here I have slightly unusual experience, because I quite often have to work with knowledge graphs and query languages for graphs. Therefore, let's just briefly go over the basics, since this part is a bit more exotic.

In classical relational databases we have a fixed schema, but here the schema is flexible: each predicate is effectively a "column", and even more than that.

Imagine that you were modeling a person and wanted to describe the key things about them. For example, let's take a specific person, Douglas Adams, and use this description as a basis.

[Figure: the Wikidata entry for Douglas Adams, www.wikidata.org/wiki/Q42]

If we used a relational database, we would have to create a huge table or tables with a huge number of columns, most of which would be NULL or filled with some default False value. For example, it is unlikely that many of us have an entry in the Korean national library. Of course, we could put such attributes in separate tables, but in the end this would be an attempt to model a flexible logical schema with predicates using a fixed relational one.

So imagine that all data is stored as a graph, or as binary and unary boolean expressions.

Where can you even encounter this? First of all, when working with Wikidata, and with any graph databases or linked data.

The following are the main query languages that I have used and worked with.

SPARQL

Wiki:
SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) is a query language for data represented in the RDF model, as well as a protocol for transmitting these queries and responding to them. SPARQL is a recommendation of the W3C Consortium and one of the technologies of the semantic web.

But in reality it is a query language for logical unary and binary predicates. You simply state, as conditions, what is fixed in a Boolean expression and what is not (very simplified).

The RDF (Resource Description Framework) base itself, on top of which SPARQL queries are executed, is a set of triples (subject, predicate, object), and the query selects the required triples according to the specified constraints, in the spirit of: find X such that p_55(X, q_33) is true, where, of course, p_55 is some kind of relation with ID 55, and q_33 is an object with ID 33 (and that is the whole story, again omitting all sorts of details).
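
The "find X such that p_55(X, q_33) is true" idea can be sketched in a few lines of Python. This is only a toy illustration of triple pattern matching, not how a real RDF store or SPARQL engine works; the triples and the function name are invented, and None is used here to play the role of a query variable.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Invented triples using the p_<id> / q_<id> naming from the text.
triples = {
    ("q_1", "p_55", "q_33"),
    ("q_2", "p_55", "q_33"),
    ("q_3", "p_55", "q_44"),
    ("q_1", "p_77", "q_33"),
}


def match(
    store: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that agree with every fixed position; None is a wildcard."""
    for triple in sorted(store):  # sorted only to make the output deterministic
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


# Find X such that p_55(X, q_33) is true.
xs = [subject for subject, _, _ in match(triples, p="p_55", o="q_33")]
print(xs)
```

Fixing different positions gives different queries over the same data: leaving the predicate open, for instance, lists every relation between two given nodes.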

Example of data representation:

[Figure: example data representation. The pictures and the example with countries are taken from here.]

Basic Query Example

[Figure: a basic SPARQL query]

In fact, we want to find the value of the ?country variable such that, for the predicate
member_of, it is true that member_of(?country, q458), where q458 is the ID of the European Union.

An example of a real SPARQL query run from within Python:

[Figure: a real SPARQL query run from within Python]

As a rule, I have had to read SPARQL rather than write it. In that situation, it is a useful skill to understand the language at least at a basic level, in order to understand exactly how the data is retrieved.

There is a lot of material to study online: for example, this and this. I usually google specific constructs and examples, and that is enough for now.

Logical query languages

You can read more on the topic in my article here. Here we will only briefly look at why logical languages are well suited for writing queries. Essentially, RDF is just a set of logical statements of the form p(X) and h(X,Y), and a logical query has the following form:

output(X) :- country(X), member_of(X,"EU").

Here we are talking about creating a new predicate output/1 (/1 means unary), which holds for any X for which it is true that country(X), i.e. X is a country, and also member_of(X,"EU").

That is, in this case both the data and the rules are represented in the same way, which allows us to model problems very easily and well.
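
To show how such a rule is evaluated, here is a minimal Python sketch in which facts are sets of tuples and the rule is a set comprehension over them. It is a toy illustration only; a real logic programming system does far more (recursion, negation, optimization), and the facts below are a small sample chosen for illustration.

```python
# Facts: each predicate is a set of tuples, one tuple per true statement.
country = {("France",), ("Germany",), ("Japan",)}
member_of = {
    ("France", "EU"),
    ("Germany", "EU"),
    ("France", "G7"),
    ("Japan", "G7"),
}

# Rule: output(X) :- country(X), member_of(X, "EU").
# The derived predicate has the same shape as the facts: a set of tuples.
output = {(x,) for (x,) in country if (x, "EU") in member_of}

print(sorted(output))
```

Since output is just another set of tuples, it could in turn be used in the body of a further rule, which is what representing data and rules in the same way buys us.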

Where have I encountered it in industry? On a whole large project with a company that writes queries in such a language, as well as on my current project, in the core of the system. It may seem like a rather exotic thing, but sometimes it does come up.

An example of a code fragment in a logical language that processes wikidata:

[Figure: a code fragment in a logical language that processes wikidata]

Materials: here are a couple of links on the modern logic programming language Answer Set Programming. I recommend studying it:

[Figure: Answer Set Programming materials]

Source: www.habr.com
