Greetings, Habr residents. Ahead of the start of the course, we have prepared a translation of an interesting article.

In many of the use cases we see at our customers' sites, the relevant information is hidden in the relationships between entities, for example when analyzing relationships between users, dependencies between components, or connections between sensors. Such use cases are usually modeled as a graph. Earlier this year, Amazon released its new graph database, Neptune. In this post, we want to share our first impressions, our best practices, and what could be improved over time.
Why we need Amazon Neptune
Graph databases promise to handle highly connected datasets better than their relational counterparts. In such datasets, the relevant information is usually stored in the relationships between objects. To test Neptune, we used an amazing open data project, MusicBrainz. MusicBrainz collects every conceivable piece of metadata about music, such as information about artists, songs, album releases, or concerts, as well as who an artist collaborated with, or when an album was released and in which country. MusicBrainz can be thought of as a huge network of entities that are connected to the music industry in some way.
The MusicBrainz dataset is provided as a CSV dump of a relational database. In total, the dump contains about 93 million rows in 157 tables. While some of these tables contain master data such as artists, events, recordings, releases, or tracks, others, the link tables, store relationships between artists and recordings, artists and other artists or releases, and so on. They reflect the graph structure of the dataset. When we converted the dataset into RDF triples, we obtained about 500 million instances.
Based on the experience and feedback of the project partners we work with, we envision a setting in which this knowledge base is used to retrieve new information. In addition, we expect it to be updated regularly, for example by adding new releases or updating band members.
Setup
As expected, setting up Amazon Neptune is simple. The process is documented in detail, and you can launch a graph database with just a few clicks. However, when it comes to more detailed configuration, the necessary information is hard to find. We therefore want to point out one configuration parameter.

Screenshot of the parameter group configuration
Amazon says that Neptune targets low-latency transactional workloads, which is why the default timeout for queries is 120 seconds. However, we tested several analytical use cases in which we regularly hit this limit. The timeout can be changed by creating a new parameter group for Neptune and setting neptune_query_timeout to a suitable limit.
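As a rough illustration of the setting described above, the sketch below builds the parameter entry one might pass to a parameter-group update. It assumes that neptune_query_timeout is expressed in milliseconds (so the 120-second default would correspond to 120000) and that the entry follows the ParameterName / ParameterValue / ApplyMethod shape used by the AWS SDK parameter-group calls. The helper name timeout_parameter is hypothetical, and nothing is sent to AWS.

```python
def timeout_parameter(seconds: int) -> dict:
    """Build a parameter entry for neptune_query_timeout.

    Assumption: Neptune expects the value in milliseconds, as a string.
    """
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return {
        "ParameterName": "neptune_query_timeout",
        "ParameterValue": str(seconds * 1000),
        # Assumption: applying on the next reboot is acceptable here.
        "ApplyMethod": "pending-reboot",
    }


# Example: raise the limit to 30 minutes for analytical queries.
print(timeout_parameter(30 * 60))
```

The resulting dictionary would go into the parameter list of whichever parameter-group update call your SDK version provides.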
Loading data
Below we discuss in detail how we loaded the MusicBrainz data into Neptune.
Relations to triples
First, we converted the MusicBrainz data into RDF triples. For each table, we defined a template that specifies how each column is represented in a triple. In this example, each row of the artist table is mapped to twelve RDF triples.
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gid> "${gid}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> "${name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/sort-name> "${sort_name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/begin-date> "${begin_date_year}-${begin_date_month}-${begin_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/end-date> "${end_date_year}-${end_date_month}-${end_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/type> <http://musicbrainz.foo/artist-type/${type}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> <http://musicbrainz.foo/area/${area}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gender> <http://musicbrainz.foo/gender/${gender}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/comment> "${comment}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/edits-pending> "${edits_pending}"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/last-updated> "${last_updated}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/ended> "${ended}"^^<http://www.w3.org/2001/XMLSchema#boolean> .
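A minimal sketch of how such a template could be filled from one row, using Python's string.Template, which understands the ${column} placeholders used above. The escaping step, the helper names, and the sample row are assumptions added for illustration; the article does not describe how the authors rendered their templates.

```python
from string import Template

# One of the twelve artist templates shown above.
ARTIST_NAME_TEMPLATE = Template(
    '<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> '
    '"${name}"^^<http://www.w3.org/2001/XMLSchema#string> .'
)


def escape_literal(value: str) -> str:
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def render_triple(template: Template, row: dict) -> str:
    """Fill one template with the escaped column values of one row."""
    escaped = {column: escape_literal(str(value)) for column, value in row.items()}
    return template.substitute(escaped)


row = {"id": 42, "name": 'Some "Quoted" Artist'}  # made-up sample row
print(render_triple(ARTIST_NAME_TEMPLATE, row))
```

Escaping quotes and backslashes before substitution keeps a stray character in a column value from producing a malformed line.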
Bulk upload
The recommended way to load large amounts of data into Neptune is the bulk upload process via S3. After uploading your triple files to S3, you start the load with a POST request. In our case, this took about 24 hours for 500 million triples. We expected it to be faster.
curl -X POST -H 'Content-Type: application/json' http://your-neptune-cluster:8182/loader -d '{
"source" : "s3://your-s3-bucket",
"format" : "ntriples",
"iamRoleArn" : "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
"region" : "eu-west-1",
"failOnError" : "FALSE"
}'
To avoid this lengthy process every time we launch Neptune, we decided to restore the instance from a snapshot in which these triples are already loaded. Launching from a snapshot is considerably faster, but it still takes about an hour until Neptune is available for requests.
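The loader request shown earlier can also be assembled from a script. The sketch below only builds the request with the standard library and does not send it; the function name build_load_request is hypothetical, and the endpoint, bucket, and role values are the same placeholders used in the curl call.

```python
import json
import urllib.request


def build_load_request(endpoint: str, source: str, role_arn: str,
                       region: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Neptune bulk loader."""
    payload = {
        "source": source,
        "format": "ntriples",
        "iamRoleArn": role_arn,
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=f"{endpoint}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_load_request(
    "http://your-neptune-cluster:8182",
    "s3://your-s3-bucket",
    "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
    "eu-west-1",
)
print(request.get_method(), request.full_url)
# To actually start the load: urllib.request.urlopen(request)
```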
When we first loaded triples into Neptune, we ran into several errors.
{
"errorCode" : "PARSING_ERROR",
"errorMessage" : "Content after '.' is not allowed",
"fileName" : [...],
"recordNum" : 25
}
Some of them were parsing errors, as shown above. To this day, we have not figured out what exactly went wrong at this point. A little more detail in the error message would certainly have helped. This error occurred for about 1% of the inserted triples. For the purpose of testing Neptune, however, we accepted that we are working with only 99% of the MusicBrainz data.
Although this is no problem for those familiar with SPARQL, keep in mind that RDF triples must be annotated with explicit data types, which can again cause errors.
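One way to catch malformed lines before a 24-hour load is a local pre-check. The sketch below is a deliberately simplified N-Triples line check, not Neptune's parser: it ignores blank nodes and does not validate escape sequences, and it is not a diagnosis of the errors described above. All names are hypothetical.

```python
import re

# Simplified building blocks of an N-Triples line (assumption: no blank nodes).
IRI = r'<[^<>"{}|^`\\\s]*>'
LITERAL = (
    r'"(?:[^"\\\n\r]|\\.)*"'            # quoted string with backslash escapes
    r'(?:\^\^' + IRI + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?'  # datatype or language tag
)
TRIPLE = re.compile(rf'^\s*{IRI}\s+{IRI}\s+(?:{IRI}|{LITERAL})\s*\.\s*$')


def find_invalid_lines(lines):
    """Return (line number, line) pairs that do not look like a valid triple."""
    invalid = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comment lines
        if not TRIPLE.match(stripped):
            invalid.append((number, stripped))
    return invalid


sample = [
    '<http://musicbrainz.foo/artist/1> <http://musicbrainz.foo/name> '
    '"Elton John"^^<http://www.w3.org/2001/XMLSchema#string> .',
    # Stray prefix in front of the datatype IRI:
    '<http://musicbrainz.foo/artist/1> <http://musicbrainz.foo/begin-date> '
    '"1947-3-25"^^xsd:<http://www.w3.org/2001/XMLSchema#date> .',
    # Content after the terminating dot:
    '<http://musicbrainz.foo/artist/1> <http://musicbrainz.foo/gid> "abc" . trailing',
]
for number, line in find_invalid_lines(sample):
    print(number, line)
```

Running such a check over the generated files is cheap compared with restarting a bulk load, even if it only catches the most obvious problems.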
Streaming upload
As mentioned above, we do not want to use Neptune as a static data store, but as a flexible, evolving knowledge base. We therefore needed ways to insert new triples as the knowledge base changes, for example when a new album is released or when we want to materialize derived knowledge.
Neptune supports insert operations through SPARQL queries, both with raw data and based on selections. We discuss both approaches below.
One of our goals was to insert data in a streaming fashion. Consider the release of an album in a new country. From MusicBrainz's point of view, this means that for a release, which covers albums, singles, EPs, and so on, a new entry is added to the release-country table. In RDF, we map this information to two new triples.
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759> };
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int> };
Another goal was to derive new knowledge from the graph. Suppose we want to obtain the number of releases each artist has published over their career. This query is quite complex and takes more than 20 minutes in Neptune, so we need to materialize the result in order to reuse this new knowledge in other queries. We therefore add the triples containing this information back to the graph by inserting the result of a subquery.
INSERT {
?artist_credit <http://musicbrainz.foo/number-of-releases> ?number_of_releases
} WHERE {
SELECT ?artist_credit (COUNT(*) as ?number_of_releases)
WHERE {
?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
?release_group <http://musicbrainz.foo/name> ?release_group_name .
}
GROUP BY ?artist_credit
}
Adding single triples to the graph takes a few milliseconds, whereas the execution time for inserting the result of a subquery depends on the execution time of the subquery itself.
Although we did not use it often, Neptune also lets you delete triples based on patterns or explicit data, which can be used to update information.
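The single-triple inserts described above could be sent from a script as SPARQL updates. The sketch below follows the SPARQL 1.1 Protocol convention of posting a URL-encoded update parameter; it builds the request without sending it, and the helper names are hypothetical.

```python
import urllib.parse
import urllib.request


def insert_data_update(subject: str, predicate: str, obj: str) -> str:
    """Return an INSERT DATA statement.

    `obj` must already be serialized, e.g. '<http://...>' or '"2018"^^<...#int>'.
    """
    return f"INSERT DATA {{ <{subject}> <{predicate}> {obj} }}"


def build_update_request(endpoint: str, update: str) -> urllib.request.Request:
    """Build (but do not send) a SPARQL update request."""
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(
        url=f"{endpoint}/sparql",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


update = insert_data_update(
    "http://musicbrainz.foo/release-country/737041",
    "http://musicbrainz.foo/date-year",
    '"2018"^^<http://www.w3.org/2001/XMLSchema#int>',
)
request = build_update_request("http://your-neptune-cluster:8182", update)
print(update)
# To execute the update: urllib.request.urlopen(request)
```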
SPARQL queries
With the previous subquery, which returns the number of releases for each artist, we have already introduced the first type of query we want to answer using Neptune. Building a query in Neptune is simple: send a POST request to the SPARQL endpoint, as shown below:
curl -X POST --data-binary 'query=SELECT ?artist ?p ?o where {?artist <http://musicbrainz.foo/name> "Elton John" . ?artist ?p ?o . }' http://your-neptune-cluster:8182/sparql
We have also implemented a query that returns artist profiles containing information about their name, age, and country of origin. Keep in mind that artists can be individuals, bands, or orchestras. We also enrich this data with information about the number of releases each artist published per year. For solo artists, we additionally include information about the bands they were part of in each year.
SELECT
?artist_name ?year
?releases_in_year ?releases_up_year
?artist_type_name ?releases
?artist_gender ?artist_country_name
?artist_begin_date ?bands
?bands_in_year
WHERE {
# Bands for each artist
{
SELECT
?year
?first_artist
(group_concat(DISTINCT ?second_artist_name;separator=",") as ?bands)
(COUNT(DISTINCT ?second_artist_name) AS ?bands_in_year)
WHERE {
VALUES ?year {
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018
}
?first_artist <http://musicbrainz.foo/name> "Elton John" .
?first_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
?first_artist <http://musicbrainz.foo/type> ?first_artist_type .
?first_artist <http://musicbrainz.foo/name> ?first_artist_name .
?second_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
?second_artist <http://musicbrainz.foo/type> ?second_artist_type .
?second_artist <http://musicbrainz.foo/name> ?second_artist_name .
optional { ?second_artist <http://musicbrainz.foo/begin-date-year> ?second_artist_begin_date_year . }
optional { ?second_artist <http://musicbrainz.foo/end-date-year> ?second_artist_end_date_year . }
?l_artist_artist <http://musicbrainz.foo/entity0> ?first_artist .
?l_artist_artist <http://musicbrainz.foo/entity1> ?second_artist .
?l_artist_artist <http://musicbrainz.foo/link> ?link .
optional { ?link <http://musicbrainz.foo/begin-date-year> ?link_begin_date_year . }
optional { ?link <http://musicbrainz.foo/end-date-year> ?link_end_date_year . }
FILTER (!bound(?link_begin_date_year) || ?link_begin_date_year <= ?year)
FILTER (!bound(?link_end_date_year) || ?link_end_date_year >= ?year)
FILTER (!bound(?second_artist_begin_date_year) || ?second_artist_begin_date_year <= ?year)
FILTER (!bound(?second_artist_end_date_year) || ?second_artist_end_date_year >= ?year)
FILTER (?first_artist_type NOT IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
FILTER (?second_artist_type IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
}
GROUP BY ?first_artist ?year
}
# Releases up to a year
{
SELECT
?artist
?year
(group_concat(DISTINCT ?release_name;separator=",") as ?releases)
(COUNT(*) as ?releases_up_year)
WHERE {
VALUES ?year {
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018
}
?artist <http://musicbrainz.foo/name> "Elton John" .
?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
?release_group <http://musicbrainz.foo/name> ?release_group_name .
?release <http://musicbrainz.foo/release-group> ?release_group .
?release <http://musicbrainz.foo/name> ?release_name .
?release_country <http://musicbrainz.foo/release> ?release .
?release_country <http://musicbrainz.foo/date-year> ?release_country_year .
FILTER (?release_country_year <= ?year)
}
GROUP BY ?artist ?year
}
# Releases in a year
{
SELECT ?artist ?year (COUNT(*) as ?releases_in_year)
WHERE {
VALUES ?year {
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018
}
?artist <http://musicbrainz.foo/name> "Elton John" .
?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
?release_group <http://musicbrainz.foo/name> ?release_group_name .
?release <http://musicbrainz.foo/release-group> ?release_group .
?release_country <http://musicbrainz.foo/release> ?release .
?release_country <http://musicbrainz.foo/date-year> ?release_country_year .
FILTER (?release_country_year = ?year)
}
GROUP BY ?artist ?year
}
# Master data
{
SELECT DISTINCT ?artist ?artist_name ?artist_gender ?artist_begin_date ?artist_country_name
WHERE {
?artist <http://musicbrainz.foo/name> ?artist_name .
?artist <http://musicbrainz.foo/name> "Elton John" .
?artist <http://musicbrainz.foo/gender> ?artist_gender_id .
?artist_gender_id <http://musicbrainz.foo/name> ?artist_gender .
?artist <http://musicbrainz.foo/area> ?birth_area .
?artist <http://musicbrainz.foo/begin-date-year> ?artist_begin_date.
?birth_area <http://musicbrainz.foo/name> ?artist_country_name .
FILTER(datatype(?artist_begin_date) = xsd:int)
}
}
}
Due to the complexity of this query, we could only run point queries for a specific artist, such as Elton John, but not for all artists. Neptune does not seem to optimize such a query by pushing filters down into subqueries. Each subquery therefore has to be filtered manually by artist name.
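Filtering every subquery by hand means repeating the same name pattern with the variable each subquery uses. The sketch below generates those patterns for one artist so they can be pasted or templated into the query. The subquery labels and helper names are hypothetical; the variable names mirror the query above.

```python
# Which variable holds the artist in each subquery of the profile query above.
ARTIST_VARIABLE_BY_SUBQUERY = {
    "bands": "first_artist",
    "releases_up_to_year": "artist",
    "releases_in_year": "artist",
    "master_data": "artist",
}


def sparql_string(value: str) -> str:
    """Serialize a Python string as a double-quoted SPARQL literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def artist_patterns(artist_name: str) -> dict:
    """Return, per subquery, the triple pattern that pins it to one artist."""
    literal = sparql_string(artist_name)
    return {
        label: f"?{variable} <http://musicbrainz.foo/name> {literal} ."
        for label, variable in ARTIST_VARIABLE_BY_SUBQUERY.items()
    }


for label, pattern in artist_patterns("Elton John").items():
    print(f"{label}: {pattern}")
```

Generating the patterns from one place keeps the four subqueries in sync when the query is run for a different artist.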
Neptune is priced per hour and per I/O. For our tests, we used the smallest Neptune instance, which costs $0.384/hour. For the query above, which computes the profile for a single artist, Amazon charges us for tens of thousands of I/O operations, implying a cost of $0.02.
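The two price components can be combined in a small estimate. The hourly price below is the one quoted above; the I/O rate is deliberately left as a required argument because the article does not state it, and the value used in the example is a hypothetical placeholder, not an actual AWS price.

```python
from decimal import Decimal

HOURLY_PRICE = Decimal("0.384")  # smallest instance, as quoted in the article


def estimate_cost(hours, io_operations, price_per_million_io) -> Decimal:
    """Instance hours plus I/O charges, in dollars.

    `price_per_million_io` is an assumption supplied by the caller.
    """
    instance_cost = HOURLY_PRICE * Decimal(str(hours))
    io_cost = (
        Decimal(str(io_operations)) / Decimal(1_000_000)
        * Decimal(str(price_per_million_io))
    )
    return instance_cost + io_cost


# Instance cost of the 24-hour bulk load alone (no I/O counted):
print(estimate_cost(24, 0, "0.20"))  # 9.216
```

The hard part in practice is the second argument: the number of I/O operations a SPARQL query will trigger is not known up front.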
Conclusion
First of all, Amazon Neptune delivers on most of its promises. As a managed service, it is a graph database that is extremely easy to set up and can be put into operation without much configuration. Here are our five key takeaways:
- Bulk upload is simple but slow. It can be complicated by error messages that are not very helpful.
- Streaming upload supported everything we expected and was quite fast.
- Queries are simple to issue, but not interactive enough for running analytical queries.
- SPARQL queries have to be optimized manually.
- Amazon's charges are hard to estimate because it is hard to estimate the amount of data scanned by a SPARQL query.
That is all for now.
Source: www.habr.com
