Hello, Habr readers. Ahead of the launch of the course, we have prepared a translation of an interesting article.

In many of the use cases we see at our clients, the relevant information is hidden in the relationships between entities, for example when analyzing relationships between users, dependencies between items, or connections between sensors. Such use cases are usually modeled as a graph. Earlier this year, Amazon released its new graph database, Neptune. In this post, we want to share our first impressions, best practices, and what could be improved over time.
Why we needed Amazon Neptune
Graph databases promise to handle highly connected datasets better than their relational counterparts. In such datasets, the relevant information is usually stored in the relationships between objects. To test Neptune, we used an amazing open data project, MusicBrainz. MusicBrainz collects every imaginable piece of music metadata, such as information about artists, songs, album releases, or concerts, as well as whom an artist collaborated with or when an album was released in which country. MusicBrainz can be thought of as a huge network of entities that are connected in some way to the music industry.
The MusicBrainz dataset is provided as a CSV dump of a relational database. In total, the dump contains about 93 million rows in 157 tables. While some of these tables contain core data such as artists, events, recordings, releases, or tracks, others are link tables that store relationships between artists and recordings, other artists or releases, and so on. They reflect the graph structure of the dataset. When we converted the dataset into RDF triples, we obtained about 500 million triples.
Based on experience and feedback from the project partners we work with, we envision a setup in which this knowledge base is used to derive new information. In addition, we expect it to be updated regularly, for example by adding new releases or updating band members.
Setup
As expected, setting up Amazon Neptune is simple, and the process is described in detail. You can launch a graph database in just a few clicks. However, when it comes to more detailed configuration, the relevant information is hard to find. We therefore want to point out one configuration parameter.

Screenshot: parameter group configuration
Amazon states that Neptune focuses on low-latency transactional workloads, which is why the default request timeout is 120 seconds. However, we tested many analytical use cases in which we regularly hit this limit. The timeout can be changed by creating a new parameter group for Neptune and setting neptune_query_timeout to the appropriate limit.
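The parameter change could also be scripted. The following is a minimal sketch, not the authors' procedure: it assumes boto3 is installed, AWS credentials are configured, the parameter group is a cluster parameter group that already exists, and neptune_query_timeout is expressed in milliseconds. The function names and the "pending-reboot" apply method are illustrative choices; for an instance-level parameter group the analogous boto3 call would be modify_db_parameter_group.

```python
from typing import Dict, List


def query_timeout_parameter(seconds: int) -> List[Dict[str, str]]:
    """Build the Parameters payload that sets neptune_query_timeout."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return [
        {
            "ParameterName": "neptune_query_timeout",
            # Assumption: Neptune expects this value in milliseconds.
            "ParameterValue": str(seconds * 1000),
            # Assumption: apply on the next reboot; adjust as needed.
            "ApplyMethod": "pending-reboot",
        }
    ]


def apply_query_timeout(group_name: str, seconds: int) -> None:
    """Apply the timeout to an existing cluster parameter group."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("neptune")
    client.modify_db_cluster_parameter_group(
        DBClusterParameterGroupName=group_name,
        Parameters=query_timeout_parameter(seconds),
    )
```

For example, apply_query_timeout("my-neptune-params", 1800) would request a 30-minute limit, where "my-neptune-params" is a hypothetical group name.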
Loading data
Below we describe in detail how we loaded the MusicBrainz data into Neptune.
Relations as triples
First, we converted the MusicBrainz data into RDF triples. For each table, we defined a template that determines how each column is represented in a triple. In this example, each row of the artist table is mapped to twelve RDF triples.
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gid> "${gid}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> "${name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/sort-name> "${sort_name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/begin-date> "${begin_date_year}-${begin_date_month}-${begin_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/end-date> "${end_date_year}-${end_date_month}-${end_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/type> <http://musicbrainz.foo/artist-type/${type}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> <http://musicbrainz.foo/area/${area}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gender> <http://musicbrainz.foo/gender/${gender}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/comment> "${comment}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/edits-pending> "${edits_pending}"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/last-updated> "${last_updated}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/ended> "${ended}"^^<http://www.w3.org/2001/XMLSchema#boolean> .
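To make the template idea concrete, here is a minimal sketch of how one row might be rendered into N-Triples lines. It is not the authors' converter: the function names, the reduced set of columns, and the sample row are made up for illustration, and only backslashes, quotes, and line breaks are escaped.

```python
from typing import Dict, List

BASE = "http://musicbrainz.foo"
XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(value: str) -> str:
    """Escape characters that must not appear raw in an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def literal(value: str, datatype: str) -> str:
    """Serialize a typed literal such as "..."^^<xsd:string IRI>."""
    return f'"{escape_literal(value)}"^^<{XSD}{datatype}>'


def artist_to_ntriples(row: Dict[str, str]) -> List[str]:
    """Render a subset of the artist template for one row."""
    subject = f"<{BASE}/artist/{row['id']}>"
    return [
        f"{subject} <{BASE}/gid> {literal(row['gid'], 'string')} .",
        f"{subject} <{BASE}/name> {literal(row['name'], 'string')} .",
        f"{subject} <{BASE}/type> <{BASE}/artist-type/{row['type']}> .",
    ]


# Made-up row, only to show the escaping of quotes in the name column.
sample = {"id": "1", "gid": "abc", "name": 'Say "hi"', "type": "2"}
for line in artist_to_ntriples(sample):
    print(line)
```

Escaping in one place keeps a stray quote in a name or comment column from breaking an otherwise valid line.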
Bulk loading
The recommended way to load large amounts of data into Neptune is the bulk load process via S3. After uploading your triple files to S3, you start the load with a POST request. In our case, it took about 24 hours for the 500 million triples. We had expected it to be faster.
curl -X POST -H 'Content-Type: application/json' http://your-neptune-cluster:8182/loader -d '{
"source" : "s3://your-s3-bucket",
"format" : "ntriples",
"iamRoleArn" : "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
"region" : "eu-west-1",
"failOnError" : "FALSE"
}'
To avoid this lengthy process every time we launch Neptune, we decided to restore the instance from a snapshot in which these triples are already loaded. Starting from a snapshot is considerably faster, but it still takes about an hour until Neptune is available for requests.
When we first loaded the triples into Neptune, we ran into various errors.
{
"errorCode" : "PARSING_ERROR",
"errorMessage" : "Content after '.' is not allowed",
"fileName" : [...],
"recordNum" : 25
}
Some of them were parsing errors, as shown above. So far, we have not figured out exactly what went wrong at this point. A few more details in the error message would certainly help. This error occurred for about 1% of the inserted triples. For the purpose of testing Neptune, however, we accepted that we work with only 99% of the MusicBrainz data.
Although this is nothing new for people familiar with SPARQL, keep in mind that RDF triples must be annotated with explicit datatypes, which can again introduce errors.
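One place where explicit datatypes can go wrong is a date assembled from separate year, month, and day columns, as in the begin-date template. The sketch below is an illustration under assumptions, not a diagnosis of the parsing errors reported above: it assumes missing components arrive as None or an empty string, and the function name is hypothetical. It emits an xsd:date literal only when the parts form a real calendar date.

```python
from datetime import date
from typing import Optional

XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"


def date_literal(
    year: Optional[str], month: Optional[str], day: Optional[str]
) -> Optional[str]:
    """Return an xsd:date literal, or None if the parts are not a valid date."""
    try:
        parsed = date(int(year), int(month), int(day))
    except (TypeError, ValueError):
        # Missing or non-numeric parts, or an impossible date such as Feb 30.
        return None
    # isoformat() zero-pads to YYYY-MM-DD, the lexical form xsd:date expects.
    return f'"{parsed.isoformat()}"^^<{XSD_DATE}>'
```

A converter could skip the triple whenever the function returns None, instead of writing a literal such as "1947--" that does not match its declared datatype.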
Streaming load
As mentioned above, we do not want to use Neptune as a static data store, but as a flexible and evolving knowledge base. We therefore needed ways to add new triples when the knowledge base changes, for example when a new album is published or when we want to materialize derived knowledge.
Neptune supports insert operations via SPARQL queries, both with raw data and based on selections. We discuss both approaches below.
One of our goals was to insert data in a streaming manner. Consider the release of an album in a new country. From the MusicBrainz perspective, this means that for a release, which covers albums, singles, EPs, and so on, a new record is added to the release_country table. In RDF, we map this information to two new triples.
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759> };
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int> };
Another goal was to derive new knowledge from the graph. Suppose we want to obtain the number of releases each artist has published over their career. Such a query is fairly complex and takes more than 20 minutes in Neptune, so we need to materialize the result in order to reuse this new knowledge in other queries. We therefore add triples carrying this information back to the graph by inserting the result of a subquery.
INSERT {
?artist_credit <http://musicbrainz.foo/number-of-releases> ?number_of_releases
} WHERE {
SELECT ?artist_credit (COUNT(*) as ?number_of_releases)
WHERE {
?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
?release_group <http://musicbrainz.foo/name> ?release_group_name .
}
GROUP BY ?artist_credit
}
Adding single triples to the graph takes a few milliseconds, whereas the execution time for inserting the result of a subquery depends on the execution time of the subquery itself.
Although we did not use it often, Neptune also lets you delete triples based on patterns or explicit data, which can be used to update information.
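Streaming inserts like the ones in this section could be issued from code. The following is a minimal sketch using only the Python standard library; it assumes the endpoint accepts a form-encoded update parameter as defined by the SPARQL 1.1 Protocol, reuses the placeholder host from the curl examples, and uses hypothetical function names. The terms passed in must already be serialized (IRIs in angle brackets, literals quoted and typed).

```python
import urllib.parse
import urllib.request

# Placeholder host taken from the curl examples in this article.
SPARQL_ENDPOINT = "http://your-neptune-cluster:8182/sparql"


def insert_data_update(subject: str, predicate: str, obj: str) -> str:
    """Build an INSERT DATA update for one already-serialized triple."""
    return f"INSERT DATA {{ {subject} {predicate} {obj} }}"


def post_update(update: str, endpoint: str = SPARQL_ENDPOINT) -> bytes:
    """POST the update as a form-encoded 'update' parameter."""
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    request = urllib.request.Request(endpoint, data=body, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read()


update = insert_data_update(
    "<http://musicbrainz.foo/release-country/737041>",
    "<http://musicbrainz.foo/release>",
    "<http://musicbrainz.foo/release/435759>",
)
# post_update(update)  # requires network access to a running cluster
```

Keeping the string building separate from the network call makes the update text easy to log or check before it is sent.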
SPARQL queries
By introducing the previous subquery, which returns the number of releases for each artist, we have already presented the first type of query we want to answer with Neptune. Running a query against Neptune is easy: send a POST request to the SPARQL endpoint, as shown below:
curl -X POST --data-binary 'query=SELECT ?artist ?p ?o where {?artist <http://musicbrainz.foo/name> "Elton John" . ?artist ?p ?o . }' http://your-neptune-cluster:8182/sparql
We also implemented a query that returns artist profiles containing information about their name, age, and country of origin. Keep in mind that artists can be individuals, bands, or orchestras. We additionally enrich this data with information about the number of releases the artist published per year. For solo artists, we also add information about the bands the artist was part of in each year.
SELECT
?artist_name ?year
?releases_in_year ?releases_up_year
?artist_type_name ?releases
?artist_gender ?artist_country_name
?artist_begin_date ?bands
?bands_in_year
WHERE {
# Bands for each artist
{
SELECT
?year
?first_artist
(group_concat(DISTINCT ?second_artist_name;separator=",") as ?bands)
(COUNT(DISTINCT ?second_artist_name) AS ?bands_in_year)
WHERE {
VALUES ?year {
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018
}
?first_artist <http://musicbrainz.foo/name> "Elton John" .
?first_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
?first_artist <http://musicbrainz.foo/type> ?first_artist_type .
?first_artist <http://musicbrainz.foo/name> ?first_artist_name .
?second_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
?second_artist <http://musicbrainz.foo/type> ?second_artist_type .
?second_artist <http://musicbrainz.foo/name> ?second_artist_name .
optional { ?second_artist <http://musicbrainz.foo/begin-date-year> ?second_artist_begin_date_year . }
optional { ?second_artist <http://musicbrainz.foo/end-date-year> ?second_artist_end_date_year . }
?l_artist_artist <http://musicbrainz.foo/entity0> ?first_artist .
?l_artist_artist <http://musicbrainz.foo/entity1> ?second_artist .
?l_artist_artist <http://musicbrainz.foo/link> ?link .
optional { ?link <http://musicbrainz.foo/begin-date-year> ?link_begin_date_year . }
optional { ?link <http://musicbrainz.foo/end-date-year> ?link_end_date_year . }
FILTER (!bound(?link_begin_date_year) || ?link_begin_date_year <= ?year)
FILTER (!bound(?link_end_date_year) || ?link_end_date_year >= ?year)
FILTER (!bound(?second_artist_begin_date_year) || ?second_artist_begin_date_year <= ?year)
FILTER (!bound(?second_artist_end_date_year) || ?second_artist_end_date_year >= ?year)
FILTER (?first_artist_type NOT IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
FILTER (?second_artist_type IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
}
GROUP BY ?first_artist ?year
}
# Releases up to a year
{
SELECT
?artist
?year
(group_concat(DISTINCT ?release_name;separator=",") as ?releases)
(COUNT(*) as ?releases_up_year)
WHERE {
VALUES ?year {
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018
}
?artist <http://musicbrainz.foo/name> "Elton John" .
?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
?release_group <http://musicbrainz.foo/name> ?release_group_name .
?release <http://musicbrainz.foo/release-group> ?release_group .
?release <http://musicbrainz.foo/name> ?release_name .
?release_country <http://musicbrainz.foo/release> ?release .
?release_country <http://musicbrainz.foo/date-year> ?release_country_year .
FILTER (?release_country_year <= ?year)
}
GROUP BY ?artist ?year
}
# Releases in a year
{
SELECT ?artist ?year (COUNT(*) as ?releases_in_year)
WHERE {
VALUES ?year {
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018
}
?artist <http://musicbrainz.foo/name> "Elton John" .
?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
?release_group <http://musicbrainz.foo/name> ?release_group_name .
?release <http://musicbrainz.foo/release-group> ?release_group .
?release_country <http://musicbrainz.foo/release> ?release .
?release_country <http://musicbrainz.foo/date-year> ?release_country_year .
FILTER (?release_country_year = ?year)
}
GROUP BY ?artist ?year
}
# Master data
{
SELECT DISTINCT ?artist ?artist_name ?artist_gender ?artist_begin_date ?artist_country_name
WHERE {
?artist <http://musicbrainz.foo/name> ?artist_name .
?artist <http://musicbrainz.foo/name> "Elton John" .
?artist <http://musicbrainz.foo/gender> ?artist_gender_id .
?artist_gender_id <http://musicbrainz.foo/name> ?artist_gender .
?artist <http://musicbrainz.foo/area> ?birth_area .
?artist <http://musicbrainz.foo/begin-date-year> ?artist_begin_date.
?birth_area <http://musicbrainz.foo/name> ?artist_country_name .
FILTER(datatype(?artist_begin_date) = <http://www.w3.org/2001/XMLSchema#int>)
}
}
}
Due to the complexity of this query, we could only run it for a specific artist, such as Elton John, but not for all artists. Neptune does not seem to optimize such a query by pushing the filters down into the subqueries. Therefore, each subquery has to be filtered manually by the artist's name.
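Since the artist filter and the list of years have to be repeated in every subquery, one option is to generate the subqueries from code. The sketch below is one possible way to do that, not how the query above was produced; the helper names are hypothetical, and the pattern list is a shortened version of the "releases in a year" subquery.

```python
from typing import List

NAME = "<http://musicbrainz.foo/name>"
MB = "http://musicbrainz.foo"


def year_values(first: int, last: int) -> str:
    """Render the VALUES block that each subquery repeats."""
    years = " ".join(str(year) for year in range(first, last + 1))
    return f"VALUES ?year {{ {years} }}"


def artist_filter(variable: str, artist_name: str) -> str:
    """Render the triple pattern that pins a subquery to one artist."""
    escaped = artist_name.replace("\\", "\\\\").replace('"', '\\"')
    return f'?{variable} {NAME} "{escaped}" .'


def filtered_subquery(select: str, patterns: List[str], group_by: str,
                      variable: str, artist_name: str,
                      first: int, last: int) -> str:
    """Assemble a subquery with the year list and artist filter placed first."""
    lines = [select, "WHERE {",
             "  " + year_values(first, last),
             "  " + artist_filter(variable, artist_name)]
    lines += ["  " + pattern for pattern in patterns]
    lines += ["}", group_by]
    return "\n".join(lines)


query = filtered_subquery(
    "SELECT ?artist ?year (COUNT(*) AS ?releases_in_year)",
    [
        f"?artist_credit_name <{MB}/artist> ?artist .",
        f"?artist_credit_name <{MB}/artist-credit> ?artist_credit .",
        f"?release_group <{MB}/artist-credit> ?artist_credit .",
        f"?release <{MB}/release-group> ?release_group .",
        f"?release_country <{MB}/release> ?release .",
        f"?release_country <{MB}/date-year> ?release_country_year .",
        "FILTER (?release_country_year = ?year)",
    ],
    "GROUP BY ?artist ?year",
    "artist", "Elton John", 1960, 2018,
)
print(query)
```

Changing the artist or the year range then touches a single call instead of every subquery by hand.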
Neptune pricing has two components: an hourly rate and a charge per I/O operation. For our tests, we used the smallest Neptune instance, which costs $0.384/hour. For the query above, which computes the profile of a single artist, Amazon charged us for tens of thousands of I/O operations, which amounts to a cost of $0.02.
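The two price components can be combined in a small calculation. This is a worked sketch under assumptions: the hourly rate is the one quoted above, while the I/O price of $0.20 per million requests and the request count are placeholders chosen for illustration, not the figures behind the bill mentioned in the text.

```python
from decimal import Decimal

# From the article: smallest instance, USD per hour.
HOURLY_RATE = Decimal("0.384")
# Assumption: USD per one million I/O requests; check current regional pricing.
IO_PRICE_PER_MILLION = Decimal("0.20")


def instance_cost(minutes: int) -> Decimal:
    """Cost of keeping the instance running for the given number of minutes."""
    return HOURLY_RATE * Decimal(minutes) / Decimal(60)


def io_cost(requests: int) -> Decimal:
    """Cost of the given number of I/O requests at the assumed price."""
    return IO_PRICE_PER_MILLION * Decimal(requests) / Decimal(1_000_000)


# A 20-minute query on the smallest instance.
print(instance_cost(20))
# A hypothetical 90,000 I/O requests at the assumed price.
print(io_cost(90_000))
```

Under these assumptions, the 20 minutes of instance time cost more than the I/O of a single profile query, although the I/O share is the part that is hard to predict in advance.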
Conclusion
First of all, Amazon Neptune delivers on most of its promises. As a managed service, it is a graph database that is remarkably easy to set up and can be up and running without much configuration. Here are our five key takeaways:
- Bulk loading is easy but slow. It can be complicated by error messages that are not very helpful.
- Streaming load supported everything we expected and was quite fast.
- Queries are simple, but not interactive enough to run analytical queries.
- SPARQL queries have to be optimized manually.
- Amazon's costs are hard to estimate, because it is difficult to estimate the volume of data scanned by a SPARQL query.
That's all for now.
Source: www.habr.com
