Hello, Habr readers. Ahead of the start of the course
In many of the use cases that we
Why we needed Amazon Neptune
Graph databases promise to handle highly connected datasets better than their relational equivalents. In such datasets, the relevant information is usually stored in the relationships between objects. We used an amazing open data project to test Neptune.
The MusicBrainz data is provided as a CSV dump of a relational database. In total, the dump contains about 93 million rows in 157 tables. Some of these tables hold master data such as artists, events, recordings, releases, or tracks. Others are link tables, which store the relationships between artists and recordings, other artists, releases, and so on. They show the graph structure of the dataset. When we converted the data to RDF triples, we obtained about 500 million instances.
Based on the experience and impressions of the project partners we work with, we envisioned a setting in which this knowledge base is used to derive new information. In addition, we expect it to be updated regularly, for example when new releases are added or band members change.
Setup
As expected, setting up Amazon Neptune is simple. The process is documented in detail.
Configuration screen for parameter groups
Amazon says that Neptune focuses on low-latency transactional workloads, which is why the default query timeout is 120 seconds. However, we tested many analytical use cases in which we regularly hit this limit. The timeout can be changed by creating a new parameter group for Neptune and setting neptune_query_timeout to the appropriate limit.
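As a rough sketch of that configuration step, the helper below builds the parameter entry one might hand to the AWS SDK when modifying a Neptune cluster parameter group. It assumes that neptune_query_timeout is expressed in milliseconds (so the 120-second default corresponds to 120000). The group name, the 10-minute value, and the ApplyMethod setting are illustrative assumptions, not settings taken from our tests.

```python
def query_timeout_parameter(seconds: int) -> dict:
    """Build a parameter entry for neptune_query_timeout (value in milliseconds)."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return {
        "ParameterName": "neptune_query_timeout",
        "ParameterValue": str(seconds * 1000),
        # Assumption: apply on the next reboot; adjust if immediate apply is wanted.
        "ApplyMethod": "pending-reboot",
    }


if __name__ == "__main__":
    params = [query_timeout_parameter(600)]  # 10 minutes, an arbitrary example value
    print(params)
    # With boto3 installed and credentials configured, the list could be passed as:
    #   boto3.client("neptune").modify_db_cluster_parameter_group(
    #       DBClusterParameterGroupName="my-neptune-params",  # hypothetical name
    #       Parameters=params,
    #   )
```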
Loading data
Below we discuss in detail how we loaded the MusicBrainz data into Neptune.
Relations as triples
First, we converted the MusicBrainz data into RDF triples. To do so, for each table we defined a template that specifies how each column is represented in a triple. In this example, each row of the artist table is mapped to twelve RDF triples.
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gid> "${gid}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> "${name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/sort-name> "${sort_name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/begin-date> "${begin_date_year}-${begin_date_month}-${begin_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/end-date> "${end_date_year}-${end_date_month}-${end_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/type> <http://musicbrainz.foo/artist-type/${type}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> <http://musicbrainz.foo/area/${area}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gender> <http://musicbrainz.foo/gender/${gender}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/comment> "${comment}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/edits-pending> "${edits_pending}"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/last-updated> "${last_updated}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/ended> "${ended}"^^<http://www.w3.org/2001/XMLSchema#boolean> .
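How the templates were applied is not shown here, so the following is only a minimal sketch of one way to do it. It assumes each CSV row is available as a dictionary keyed by column name and uses Python's string.Template, whose ${...} placeholder syntax happens to match the templates above. The names ARTIST_NAME, escape_literal, and render are hypothetical.

```python
from string import Template

# One of the twelve artist templates from above; ${...} placeholders are column names.
ARTIST_NAME = Template(
    '<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> '
    '"${name}"^^<http://www.w3.org/2001/XMLSchema#string> .'
)


def escape_literal(value: str) -> str:
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def render(template: Template, row: dict) -> str:
    """Fill a triple template with the escaped column values of one row."""
    escaped = {column: escape_literal(str(value)) for column, value in row.items()}
    return template.substitute(escaped)


if __name__ == "__main__":
    # Made-up row, used only to show the substitution.
    print(render(ARTIST_NAME, {"id": 42, "name": 'The "Example" Band'}))
```

Escaping quotes and line breaks before substitution matters because a raw quote inside a value would end the literal early and leave the rest of the line as invalid content.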
Bulk upload
The suggested way to load large amounts of data into Neptune is the bulk loader via S3. After uploading your triple files to S3, you start the load with a POST request. In our case, it took about 24 hours for 500 million triples. We had expected it to be faster.
curl -X POST -H 'Content-Type: application/json' http://your-neptune-cluster:8182/loader -d '{
"source" : "s3://your-s3-bucket",
"format" : "ntriples",
"iamRoleArn" : "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
"region" : "eu-west-1",
"failOnError" : "FALSE"
}'
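The same request can be assembled from a script. The sketch below only mirrors the curl call above using Python's standard library; it builds the request without sending it, and the function name build_load_request is hypothetical. The endpoint, bucket, and role values are the same placeholders as in the curl example.

```python
import json
import urllib.request


def build_load_request(endpoint: str, source: str, role_arn: str, region: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts a bulk load."""
    body = {
        "source": source,
        "format": "ntriples",
        "iamRoleArn": role_arn,
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=f"{endpoint}/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_load_request(
        "http://your-neptune-cluster:8182",
        "s3://your-s3-bucket",
        "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
        "eu-west-1",
    )
    print(request.get_method(), request.full_url)
    # Sending it requires network access to the cluster:
    #   with urllib.request.urlopen(request) as response:
    #       print(json.load(response))
```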
To avoid this lengthy process every time we start Neptune, we decided to restore the instance from a snapshot in which these triples were already loaded. Starting from a snapshot is significantly faster, but it still takes about an hour until Neptune is available for queries.
When we first loaded the triples into Neptune, we ran into several errors.
{
"errorCode" : "PARSING_ERROR",
"errorMessage" : "Content after '.' is not allowed",
"fileName" : [...],
"recordNum" : 25
}
Some of them were parsing errors, as shown above. To this day, we do not know exactly what went wrong at that point. A bit more detail would definitely help here. This error occurred for about 1% of the inserted triples. But for the purpose of testing Neptune, we accepted that we are working with only 99% of the information from MusicBrainz.
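When many such entries come back, grouping them can at least show which message dominates. The sketch below assumes that the loader status response nests entries shaped like the one above under payload.errors.errorLogs; that layout is an assumption to be checked against the actual response, and the sample data is hand-written rather than real loader output.

```python
from collections import Counter


def summarize_load_errors(status: dict) -> Counter:
    """Count error log entries per (errorCode, errorMessage) pair."""
    logs = status.get("payload", {}).get("errors", {}).get("errorLogs", [])
    return Counter((entry.get("errorCode"), entry.get("errorMessage")) for entry in logs)


if __name__ == "__main__":
    # Hand-written sample in the assumed response shape, not real loader output.
    sample = {
        "payload": {
            "errors": {
                "errorLogs": [
                    {"errorCode": "PARSING_ERROR",
                     "errorMessage": "Content after '.' is not allowed",
                     "recordNum": 25},
                    {"errorCode": "PARSING_ERROR",
                     "errorMessage": "Content after '.' is not allowed",
                     "recordNum": 31},
                ]
            }
        }
    }
    for (code, message), count in summarize_load_errors(sample).most_common():
        print(count, code, message)
```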
Although this should be obvious to anyone familiar with SPARQL, be aware that RDF triples must be annotated with explicit data types, which in turn can cause errors.
Streaming upload
As mentioned above, we do not want to use Neptune as a static data store, but rather as a flexible and evolving knowledge base. So we needed a way to add new triples when the knowledge base changes, for example when a new album is published or when we want to materialize derived knowledge.
Neptune supports insert operations through SPARQL update requests, both with explicit data and based on patterns. We discuss both approaches below.
One of our goals was to ingest data in a streaming fashion. Consider the release of an album in a new country. From the MusicBrainz perspective, this means that for a release, which covers albums, singles, EPs, and so on, a new entry is added to the release-country table. In RDF, we map this information to two new triples.
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759> };
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int> };
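Such an update can be sent to the SPARQL endpoint as a form-encoded POST with an update parameter, as defined by the SPARQL 1.1 protocol. The following sketch builds that request with Python's standard library without sending it; the helper names insert_data and build_update_request are hypothetical, and the endpoint is the same placeholder used elsewhere in this article.

```python
import urllib.parse
import urllib.request


def insert_data(subject: str, predicate: str, obj: str) -> str:
    """Return an INSERT DATA statement for a single, already serialized triple."""
    return f"INSERT DATA {{ {subject} {predicate} {obj} }}"


def build_update_request(endpoint: str, update: str) -> urllib.request.Request:
    """Build (but do not send) a form-encoded SPARQL update request."""
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(
        url=f"{endpoint}/sparql",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    statement = insert_data(
        "<http://musicbrainz.foo/release-country/737041>",
        "<http://musicbrainz.foo/release>",
        "<http://musicbrainz.foo/release/435759>",
    )
    request = build_update_request("http://your-neptune-cluster:8182", statement)
    print(request.full_url)
    print(request.data.decode("utf-8"))
```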
Another goal was to derive new knowledge from the graph. Say we want to obtain the number of releases each artist has published over their career. Such a query is quite complex and takes more than 20 minutes in Neptune, so we need to materialize the result in order to reuse this new knowledge in other queries. We therefore add triples with this information back to the graph by inserting the result of a subquery.
INSERT {
?artist_credit <http://musicbrainz.foo/number-of-releases> ?number_of_releases
} WHERE {
SELECT ?artist_credit (COUNT(*) as ?number_of_releases)
WHERE {
?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
?release_group <http://musicbrainz.foo/name> ?release_group_name .
}
GROUP BY ?artist_credit
}
Adding a single triple to the graph takes a few milliseconds, while the time needed to insert the result of a subquery depends on the execution time of the subquery itself.
Although we did not use it often, Neptune also lets you delete triples based on patterns or explicit data, which can be used to update information.
SPARQL queries
With the previous subquery, which returns the number of releases for each artist, we have already introduced the first type of query we want to answer with Neptune. Running a query against Neptune is easy: send a POST request to the SPARQL endpoint, as shown below:
curl -X POST --data-binary 'query=SELECT ?artist ?p ?o where {?artist <http://musicbrainz.foo/name> "Elton John" . ?artist ?p ?o . }' http://your-neptune-cluster:8182/sparql
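To work with the answer in a script, the bindings can be flattened into plain rows. The sketch below assumes the endpoint returns the standard SPARQL 1.1 JSON results format (for instance when requested with an Accept: application/sparql-results+json header); the function name rows_from_sparql_json is hypothetical and the sample response is hand-written, not actual query output.

```python
def rows_from_sparql_json(result: dict) -> list:
    """Flatten SPARQL JSON results into one {variable: value} dict per solution."""
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        # Unbound variables are simply absent from a binding, so skip them.
        rows.append({name: binding[name]["value"] for name in variables if name in binding})
    return rows


if __name__ == "__main__":
    # Hand-written sample in the standard format, not real query output.
    sample = {
        "head": {"vars": ["artist", "p", "o"]},
        "results": {
            "bindings": [
                {
                    "artist": {"type": "uri", "value": "http://musicbrainz.foo/artist/1"},
                    "p": {"type": "uri", "value": "http://musicbrainz.foo/name"},
                    "o": {"type": "literal", "value": "Elton John"},
                }
            ]
        },
    }
    for row in rows_from_sparql_json(sample):
        print(row)
```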
In addition, we implemented a query that returns an artist profile with information such as name, age, or country of origin. Keep in mind that artists can be individuals, bands, or orchestras. We also enrich this data with the number of releases the artist published per year. For solo artists, we additionally include the bands the artist was part of in each year.
SELECT
?artist_name ?year
?releases_in_year ?releases_up_year
?artist_type_name ?releases
?artist_gender ?artist_country_name
?artist_begin_date ?bands
?bands_in_year
WHERE {
# Bands for each artist
{
SELECT
?year
?first_artist
(group_concat(DISTINCT ?second_artist_name;separator=",") as ?bands)
(COUNT(DISTINCT ?second_artist_name) AS ?bands_in_year)
WHERE {
VALUES ?year {
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018
}
?first_artist <http://musicbrainz.foo/name> "Elton John" .
?first_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
?first_artist <http://musicbrainz.foo/type> ?first_artist_type .
?first_artist <http://musicbrainz.foo/name> ?first_artist_name .
?second_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
?second_artist <http://musicbrainz.foo/type> ?second_artist_type .
?second_artist <http://musicbrainz.foo/name> ?second_artist_name .
optional { ?second_artist <http://musicbrainz.foo/begin-date-year> ?second_artist_begin_date_year . }
optional { ?second_artist <http://musicbrainz.foo/end-date-year> ?second_artist_end_date_year . }
?l_artist_artist <http://musicbrainz.foo/entity0> ?first_artist .
?l_artist_artist <http://musicbrainz.foo/entity1> ?second_artist .
?l_artist_artist <http://musicbrainz.foo/link> ?link .
optional { ?link <http://musicbrainz.foo/begin-date-year> ?link_begin_date_year . }
optional { ?link <http://musicbrainz.foo/end-date-year> ?link_end_date_year . }
FILTER (!bound(?link_begin_date_year) || ?link_begin_date_year <= ?year)
FILTER (!bound(?link_end_date_year) || ?link_end_date_year >= ?year)
FILTER (!bound(?second_artist_begin_date_year) || ?second_artist_begin_date_year <= ?year)
FILTER (!bound(?second_artist_end_date_year) || ?second_artist_end_date_year >= ?year)
FILTER (?first_artist_type NOT IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
FILTER (?second_artist_type IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
}
GROUP BY ?first_artist ?year
}
# Releases up to a year
{
SELECT
?artist
?year
(group_concat(DISTINCT ?release_name;separator=",") as ?releases)
(COUNT(*) as ?releases_up_year)
WHERE {
VALUES ?year {
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018
}
?artist <http://musicbrainz.foo/name> "Elton John" .
?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
?release_group <http://musicbrainz.foo/name> ?release_group_name .
?release <http://musicbrainz.foo/release-group> ?release_group .
?release <http://musicbrainz.foo/name> ?release_name .
?release_country <http://musicbrainz.foo/release> ?release .
?release_country <http://musicbrainz.foo/date-year> ?release_country_year .
FILTER (?release_country_year <= ?year)
}
GROUP BY ?artist ?year
}
# Releases in a year
{
SELECT ?artist ?year (COUNT(*) as ?releases_in_year)
WHERE {
VALUES ?year {
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018
}
?artist <http://musicbrainz.foo/name> "Elton John" .
?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
?release_group <http://musicbrainz.foo/name> ?release_group_name .
?release <http://musicbrainz.foo/release-group> ?release_group .
?release_country <http://musicbrainz.foo/release> ?release .
?release_country <http://musicbrainz.foo/date-year> ?release_country_year .
FILTER (?release_country_year = ?year)
}
GROUP BY ?artist ?year
}
# Master data
{
SELECT DISTINCT ?artist ?artist_name ?artist_gender ?artist_begin_date ?artist_country_name
WHERE {
?artist <http://musicbrainz.foo/name> ?artist_name .
?artist <http://musicbrainz.foo/name> "Elton John" .
?artist <http://musicbrainz.foo/gender> ?artist_gender_id .
?artist_gender_id <http://musicbrainz.foo/name> ?artist_gender .
?artist <http://musicbrainz.foo/area> ?birth_area .
?artist <http://musicbrainz.foo/begin-date-year> ?artist_begin_date.
?birth_area <http://musicbrainz.foo/name> ?artist_country_name .
FILTER(datatype(?artist_begin_date) = <http://www.w3.org/2001/XMLSchema#int>)
}
}
}
Because of the complexity of such a query, we could only run point queries for a specific artist, such as Elton John, but not for all artists. Neptune does not seem to optimize such a query by pushing filters down into the subselects. Therefore, each subselect has to be filtered manually by artist name.
Neptune charges both per hour and per I/O operation. For our test, we used the smallest Neptune instance, which costs $0.384/hour. For the query above, which computes the profile of a single artist, Amazon charged us for tens of thousands of I/O operations, which amounts to a cost of $0.02.
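The two cost components can be combined in a back-of-the-envelope estimate. In the sketch below, only the hourly rate of $0.384 comes from the text above; the I/O price per million requests and the request count are placeholder values chosen purely for illustration and should be replaced with the current price for your region and the figures from your bill.

```python
def estimate_cost(hours: float, io_requests: int,
                  hourly_rate: float, io_rate_per_million: float) -> float:
    """Instance time plus I/O charges, both in US dollars."""
    return hours * hourly_rate + io_requests / 1_000_000 * io_rate_per_million


if __name__ == "__main__":
    HOURLY_RATE = 0.384          # from the text above
    IO_RATE_PER_MILLION = 0.20   # assumption: illustrative price, check current pricing
    IO_REQUESTS = 50_000         # assumption: illustrative request count for one query

    io_only = estimate_cost(0, IO_REQUESTS, HOURLY_RATE, IO_RATE_PER_MILLION)
    one_day = estimate_cost(24, 0, HOURLY_RATE, IO_RATE_PER_MILLION)
    print(f"I/O share of one query: ${io_only:.4f}")
    print(f"Instance cost for 24 hours: ${one_day:.3f}")
```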
Conclusion
First of all, Amazon Neptune keeps most of its promises. As a managed service, it is a graph database that is very easy to set up and can be up and running without much configuration. Here are our five key findings:
- Bulk upload is easy but slow. It can also get complicated because the error messages are not very helpful.
- Streaming upload supported everything we expected and was quite fast.
- Queries are simple, but not interactive enough to run analytical queries.
- SPARQL queries have to be optimized manually.
- Amazon's charges are hard to estimate because it is hard to predict how much data a SPARQL query scans.
Source: www.habr.com