First impressions of Amazon Neptune

Hello, Habr readers. Ahead of the start of the course "AWS for Developers", we have prepared a translation of an interesting article.

In many of the use cases that we at bakdata see at our customers' sites, the important information is hidden in the connections between entities, for example when analyzing relationships between users, dependencies between elements, or connections between sensors. Such use cases are usually modeled as a graph. Earlier this year, Amazon released its new graph database, Neptune. In this post we want to share our first thoughts, good practices, and what could be improved over time.

Why we needed Amazon Neptune

Graph databases promise to handle highly connected datasets better than their relational equivalents. In such datasets, the relevant information is usually stored in the relationships between objects. To test Neptune we used an amazing open data project, MusicBrainz. MusicBrainz collects every imaginable kind of music metadata, such as information about artists, songs, album releases, or concerts, as well as who the artist behind a song collaborated with and when an album was released in which country. MusicBrainz can be seen as a huge network of entities that are in some way connected to the music industry.

The MusicBrainz dataset is provided as a CSV dump of a relational database. In total, the dump contains about 93 million rows in 157 tables. Some of these tables contain master data such as artists, events, recordings, releases, or tracks. Others, the link tables, store relationships between artists and recordings, other artists, releases, and so on. These link tables expose the graph structure of the dataset. When converting the data into RDF triples, we obtained about 500 million instances.

Based on the experience and impressions of the project partners we work with, we envision a setting in which this knowledge base is used to obtain new information. In addition, we expect it to be updated regularly, for example when new releases are added or the members of a group change.

Setup

As expected, setting up Amazon Neptune is simple, and the process is documented in detail. You can launch a graph database in just a few clicks. However, when it comes to more detailed configuration, the necessary information is hard to find. We therefore want to point out one configuration parameter.

Configuration screen for parameter groups

Amazon states that Neptune focuses on low-latency transactional workloads, which is why the default query timeout is 120 seconds. However, we tested many analytical use cases in which we regularly hit this limit. The timeout can be changed by creating a new parameter group for Neptune and setting neptune_query_timeout to the desired limit.
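As a rough illustration of that setting, the sketch below builds the parameter entry in Python. It assumes a cluster-level parameter group (an instance-level group would go through modify_db_parameter_group instead), assumes the value is given in milliseconds (which matches the 120-second default being 120000), and uses a made-up 20-minute timeout. The helper names and the "pending-reboot" apply method are our own assumptions, not something the article prescribes.

```python
# Sketch: raise Neptune's query timeout through a custom parameter group.
# The timeout value and helper names are hypothetical.

def timeout_parameter(timeout_ms: int) -> dict:
    """Build one entry for the Parameters list of a parameter-group update."""
    if timeout_ms <= 0:
        raise ValueError("timeout must be positive")
    return {
        "ParameterName": "neptune_query_timeout",
        "ParameterValue": str(timeout_ms),  # assumed to be milliseconds
        "ApplyMethod": "pending-reboot",    # assumption: applied on next reboot
    }


def apply_timeout(group_name: str, timeout_ms: int) -> None:
    """Send the update with boto3 (needs AWS credentials; not called here)."""
    import boto3  # third-party, imported lazily so the sketch runs without it

    client = boto3.client("neptune")
    client.modify_db_cluster_parameter_group(
        DBClusterParameterGroupName=group_name,
        Parameters=[timeout_parameter(timeout_ms)],
    )


params = timeout_parameter(20 * 60 * 1000)  # 20 minutes
print(params)
```

The cluster has to be associated with the custom group for the new value to take effect.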

Loading data

Below we discuss in detail how we loaded the MusicBrainz data into Neptune.

Triples

First, we converted the MusicBrainz data into RDF triples. To do so, for each table we defined a template that specifies how each column is represented in a triple. In this example, each row of the artist table is mapped to twelve RDF triples.

<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gid> "${gid}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> "${name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/sort-name> "${sort_name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/begin-date> "${begin_date_year}-${begin_date_month}-${begin_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/end-date> "${end_date_year}-${end_date_month}-${end_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/type> <http://musicbrainz.foo/artist-type/${type}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> <http://musicbrainz.foo/area/${area}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gender> <http://musicbrainz.foo/gender/${gender}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/comment> "${comment}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/edits-pending> "${edits_pending}"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/last-updated> "${last_updated}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/ended> "${ended}"^^<http://www.w3.org/2001/XMLSchema#boolean> .
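To make the template idea concrete, here is a minimal Python sketch that renders one table row into N-Triples lines using two of the templates above. The article does not show the actual conversion code, so everything here is illustrative: the function names, the "\N" NULL marker, the decision to skip triples for missing values, and the literal escaping are all assumptions.

```python
from string import Template

# Two of the artist templates from the listing, paired with the columns they use.
ARTIST_TEMPLATES = [
    (("id", "name"), Template(
        '<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> '
        '"${name}"^^<http://www.w3.org/2001/XMLSchema#string> .')),
    (("id", "area"), Template(
        '<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> '
        '<http://musicbrainz.foo/area/${area}> .')),
]

NULL = r"\N"  # assumption: NULL marker used in the CSV dump


def escape_literal(value: str) -> str:
    """Escape characters that must not appear raw inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def row_to_triples(row: dict, templates=ARTIST_TEMPLATES) -> list:
    """Render one row; skip any template that refers to a missing column."""
    triples = []
    for columns, template in templates:
        if any(row.get(column) in (None, "", NULL) for column in columns):
            continue  # emit no triple rather than a malformed one
        values = {column: escape_literal(row[column]) for column in columns}
        triples.append(template.substitute(values))
    return triples


# Hypothetical row: the area column is NULL, so only the name triple is emitted.
row = {"id": "1", "name": 'Some "Quoted" Artist', "area": NULL}
for line in row_to_triples(row):
    print(line)
```

Escaping quotes and line breaks before writing the file keeps each statement on a single well-formed line, which is what a line-based format such as N-Triples requires.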

Bulk upload

The recommended way to load large amounts of data into Neptune is the bulk upload process via S3. After uploading your triple files to S3, you start the load with a POST request. In our case, this took about 24 hours for 500 million triples. We expected it to be faster.

curl -X POST -H 'Content-Type: application/json' http://your-neptune-cluster:8182/loader -d '{
 "source" : "s3://your-s3-bucket",
 "format" : "ntriples",
 "iamRoleArn" : "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
 "region" : "eu-west-1",
 "failOnError" : "FALSE"
}'
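The same request can be prepared from Python with only the standard library. The sketch below mirrors the curl call and adds a helper for the loader's status URL. It assumes that the loader answers with a JSON body containing payload.loadId and that the status resource accepts the details and errors query parameters; both reflect our understanding of the loader API and should be checked against the current documentation. The endpoint and bucket are the placeholders from the listing above.

```python
import json
import urllib.request

NEPTUNE = "http://your-neptune-cluster:8182"  # placeholder endpoint


def build_load_request(source, iam_role_arn, region, fmt="ntriples"):
    """Prepare the POST /loader request shown in the curl example."""
    body = {
        "source": source,
        "format": fmt,
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        f"{NEPTUNE}/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_url(load_id: str) -> str:
    """URL for polling a running load (query parameters are assumptions)."""
    return f"{NEPTUNE}/loader/{load_id}?details=true&errors=true"


def start_load(request) -> str:
    """Send the request and return the load id (needs a reachable cluster)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["payload"]["loadId"]


request = build_load_request(
    "s3://your-s3-bucket",
    "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
    "eu-west-1",
)
print(request.get_method(), request.full_url)
```

Nothing is sent until start_load is called, so the request can be inspected or logged first.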

To avoid this lengthy process every time we launch Neptune, we decided to restore the instance from a snapshot in which these triples were already loaded. Starting from a snapshot is significantly faster, but it still takes about an hour until Neptune is available for queries.

When initially loading the triples into Neptune, we ran into several errors.

{
 "errorCode" : "PARSING_ERROR",
 "errorMessage" : "Content after '.' is not allowed",
 "fileName" : [...],
 "recordNum" : 25
}

Some of them were parsing errors, as shown above. To this day, we do not know exactly what went wrong at this point. A little more detail would definitely help here. This error occurred for about 1% of the inserted triples. But for the purpose of testing Neptune, we accept the fact that we only work with 99% of the information from MusicBrainz.

Although this is obvious to anyone familiar with SPARQL, be aware that RDF triples must be annotated with explicit data types, which again can cause errors.

Streaming upload

As mentioned above, we do not want to use Neptune as a static data store, but rather as a flexible and evolving knowledge base. So we needed to find ways to insert new triples when the knowledge base changes, for example when a new album is published or when we want to materialize derived knowledge.

Neptune supports insert statements via SPARQL queries, both with raw data and based on patterns. We discuss both approaches below.

One of our goals was to ingest data in a streaming fashion. Consider the release of an album in a new country. From the MusicBrainz perspective, this means that for a release, which covers albums, singles, EPs, and so on, a new entry is added to the release-country table. In RDF, we map this information to two new triples.

INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759> };
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int> };
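A streaming producer would typically generate such statements from incoming records. The following Python sketch builds the two triples for a release-country entry and wraps them in a form-encoded POST with an update parameter, which is how SPARQL 1.1 updates are usually submitted over HTTP. The function names are hypothetical and the endpoint is the placeholder used elsewhere in this post.

```python
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "http://your-neptune-cluster:8182/sparql"  # placeholder
MB = "http://musicbrainz.foo"
XSD_INT = "http://www.w3.org/2001/XMLSchema#int"


def release_country_update(rc_id: int, release_id: int, year: int) -> str:
    """Build one INSERT DATA statement carrying both triples."""
    subject = f"<{MB}/release-country/{rc_id}>"
    return (
        "INSERT DATA {\n"
        f"  {subject} <{MB}/release> <{MB}/release/{release_id}> .\n"
        f'  {subject} <{MB}/date-year> "{year}"^^<{XSD_INT}> .\n'
        "}"
    )


def build_update_request(update: str) -> urllib.request.Request:
    """Wrap the update in a form-encoded POST (not sent here)."""
    data = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(SPARQL_ENDPOINT, data=data, method="POST")


update = release_country_update(737041, 435759, 2018)
request = build_update_request(update)
print(update)
# urllib.request.urlopen(request)  # uncomment to send to a real cluster
```

Putting both triples into a single INSERT DATA block means they are submitted in one request instead of two.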

Another goal was to derive new knowledge from the graph. Say we want to obtain the number of releases each artist has published in their career. Such a query is quite complex and takes more than 20 minutes in Neptune, so we need to materialize the result in order to reuse this new knowledge in other queries. We therefore add triples with this information back to the graph by inserting the result of a subquery.

INSERT {
  ?artist_credit <http://musicbrainz.foo/number-of-releases> ?number_of_releases
} WHERE {
  SELECT ?artist_credit (COUNT(*) as ?number_of_releases)
  WHERE {
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
  }
  GROUP BY ?artist_credit
}

Adding a single triple to the graph takes a few milliseconds, while the execution time for inserting the result of a subquery depends on the execution time of the subquery itself.

Although we did not use it often, Neptune also allows removing triples based on patterns or explicit data, which can be used to update information.
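The article does not show a delete statement, so the following is only a sketch of how such an update could look using the standard SPARQL 1.1 DELETE/INSERT form. It replaces the year of a release-country entry. The helper name and the new year are made up for illustration. The OPTIONAL in the WHERE clause lets the insert happen even when no old value exists.

```python
MB = "http://musicbrainz.foo"
XSD_INT = "http://www.w3.org/2001/XMLSchema#int"


def replace_year_update(release_country_id: int, new_year: int) -> str:
    """Build a DELETE/INSERT update that swaps the date-year value."""
    subject = f"<{MB}/release-country/{release_country_id}>"
    predicate = f"<{MB}/date-year>"
    return (
        f"DELETE {{ {subject} {predicate} ?old }}\n"
        f'INSERT {{ {subject} {predicate} "{new_year}"^^<{XSD_INT}> }}\n'
        f"WHERE  {{ OPTIONAL {{ {subject} {predicate} ?old }} }}"
    )


statement = replace_year_update(737041, 2019)  # hypothetical correction
print(statement)
```

The resulting string can be sent to the SPARQL endpoint in the same way as any other update.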

SPARQL queries

With the previous subquery, which returns the number of releases per artist, we have already introduced the first kind of query we want to answer with Neptune. Issuing a query against Neptune is easy: send a POST request to the SPARQL endpoint, as shown below:

curl -X POST --data-binary 'query=SELECT ?artist ?p ?o where {?artist <http://musicbrainz.foo/name> "Elton John" . ?artist ?p ?o . }' http://your-neptune-cluster:8182/sparql
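For use inside an application, the same call can be made from Python. The sketch below URL-encodes the query, asks for the standard SPARQL JSON result format, and flattens the bindings into plain dictionaries. The helper names are ours, and the sample result is invented purely to show the shape of the JSON (the artist IRI in it is not real data).

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "http://your-neptune-cluster:8182/sparql"  # placeholder


def build_query_request(query: str, endpoint: str = SPARQL_ENDPOINT):
    """Prepare a form-encoded POST that requests JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
        method="POST",
    )


def rows(result: dict) -> list:
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    return [
        {variable: cell["value"] for variable, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


def run_query(query: str) -> list:
    """Execute the query (needs a reachable cluster; not called here)."""
    with urllib.request.urlopen(build_query_request(query)) as response:
        return rows(json.load(response))


# Invented sample in the SPARQL 1.1 JSON results format.
sample = {
    "head": {"vars": ["artist", "p", "o"]},
    "results": {"bindings": [{
        "artist": {"type": "uri", "value": "http://musicbrainz.foo/artist/1"},
        "p": {"type": "uri", "value": "http://musicbrainz.foo/name"},
        "o": {"type": "literal", "value": "Elton John"},
    }]},
}
print(rows(sample))
```

Encoding the query with urlencode avoids problems with characters such as quotes, angle brackets, or ampersands in the query text.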

In addition, we implemented a query that returns an artist profile containing information about their name, age, or country of origin. Keep in mind that artists can be individuals, groups, or orchestras. We also enrich this data with the number of releases an artist published per year. For solo artists, we additionally add information about the groups the artist was part of in each year.

SELECT
 ?artist_name ?year
 ?releases_in_year ?releases_up_year
 ?artist_type_name ?releases
 ?artist_gender ?artist_country_name
 ?artist_begin_date ?bands
 ?bands_in_year
WHERE {
 
 # Bands for each artist
 {
   SELECT
     ?year
     ?first_artist
     (group_concat(DISTINCT ?second_artist_name;separator=",") as ?bands)
     (COUNT(DISTINCT ?second_artist_name) AS ?bands_in_year)
   WHERE {
     VALUES ?year {
       1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
       1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
       1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
       2010 2011 2012 2013 2014 2015 2016 2017 2018
     }
     ?first_artist <http://musicbrainz.foo/name> "Elton John" .
     ?first_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
     ?first_artist <http://musicbrainz.foo/type> ?first_artist_type .
     ?first_artist <http://musicbrainz.foo/name> ?first_artist_name .

     ?second_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
     ?second_artist <http://musicbrainz.foo/type> ?second_artist_type .
     ?second_artist <http://musicbrainz.foo/name> ?second_artist_name .
     optional { ?second_artist <http://musicbrainz.foo/begin-date-year> ?second_artist_begin_date_year . }
     optional { ?second_artist <http://musicbrainz.foo/end-date-year> ?second_artist_end_date_year . }

     ?l_artist_artist <http://musicbrainz.foo/entity0> ?first_artist .
     ?l_artist_artist <http://musicbrainz.foo/entity1> ?second_artist .
     ?l_artist_artist <http://musicbrainz.foo/link> ?link .

     optional { ?link <http://musicbrainz.foo/begin-date-year> ?link_begin_date_year . }
     optional { ?link <http://musicbrainz.foo/end-date-year> ?link_end_date_year . }

     FILTER (!bound(?link_begin_date_year) || ?link_begin_date_year <= ?year)
     FILTER (!bound(?link_end_date_year) || ?link_end_date_year >= ?year)
     FILTER (!bound(?second_artist_begin_date_year) || ?second_artist_begin_date_year <= ?year)
     FILTER (!bound(?second_artist_end_date_year) || ?second_artist_end_date_year >= ?year)
     FILTER (?first_artist_type NOT IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
     FILTER (?second_artist_type IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
   }
   GROUP BY ?first_artist ?year
 }
 
 # Releases up to a year
 {
   SELECT
     ?artist
     ?year
     (group_concat(DISTINCT ?release_name;separator=",") as ?releases)
     (COUNT(*) as ?releases_up_year)
   WHERE {
     VALUES ?year {
       1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
       1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
       1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
       2010 2011 2012 2013 2014 2015 2016 2017 2018
     }

     ?artist <http://musicbrainz.foo/name> "Elton John" .

     ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
     ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
     ?release <http://musicbrainz.foo/release-group> ?release_group .
     ?release <http://musicbrainz.foo/name> ?release_name .
     ?release_country <http://musicbrainz.foo/release> ?release .
     ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

     FILTER (?release_country_year <= ?year)
   }
   GROUP BY ?artist ?year
 }
 
 # Releases in a year
 {
   SELECT ?artist ?year (COUNT(*) as ?releases_in_year)
   WHERE {
     VALUES ?year {
       1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
       1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
       1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
       2010 2011 2012 2013 2014 2015 2016 2017 2018
     }

     ?artist <http://musicbrainz.foo/name> "Elton John" .

     ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
     ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
     ?release <http://musicbrainz.foo/release-group> ?release_group .
     ?release_country <http://musicbrainz.foo/release> ?release .
     ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

     FILTER (?release_country_year = ?year)
   }
   GROUP BY ?artist ?year
 }
 
 # Master data
 {
   SELECT DISTINCT ?artist ?artist_name ?artist_gender ?artist_begin_date ?artist_country_name
   WHERE {
     ?artist <http://musicbrainz.foo/name> ?artist_name .
     ?artist <http://musicbrainz.foo/name> "Elton John" .
     ?artist <http://musicbrainz.foo/gender> ?artist_gender_id .
     ?artist_gender_id <http://musicbrainz.foo/name> ?artist_gender .
     ?artist <http://musicbrainz.foo/area> ?birth_area .
     ?artist <http://musicbrainz.foo/begin-date-year> ?artist_begin_date .
     ?birth_area <http://musicbrainz.foo/name> ?artist_country_name .

     FILTER(datatype(?artist_begin_date) = <http://www.w3.org/2001/XMLSchema#int>)
   }
 }
}

Due to the complexity of such a query, we could only run point queries for a specific artist, such as Elton John, but not for all artists. Neptune does not seem to optimize such a query by pushing filters down into the subselects. Therefore, each subselect has to be filtered manually by artist name.
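Because the artist filter and the list of years have to be repeated in every subselect, it can help to generate those fragments instead of typing them by hand. The Python helpers below are a hypothetical sketch of that idea (the article does not say how the query text was produced). They emit the VALUES block and the name pattern used in the query above.

```python
def year_values(first: int, last: int, per_line: int = 10) -> str:
    """Build the VALUES ?year { ... } block for an inclusive range of years."""
    years = [str(year) for year in range(first, last + 1)]
    lines = [
        " ".join(years[i:i + per_line])
        for i in range(0, len(years), per_line)
    ]
    return "VALUES ?year {\n" + "\n".join("  " + line for line in lines) + "\n}"


def artist_filter(variable: str, name: str) -> str:
    """Build the triple pattern that pins a subselect to one artist name."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'?{variable} <http://musicbrainz.foo/name> "{escaped}" .'


print(year_values(1960, 2018))
print(artist_filter("artist", "Elton John"))
print(artist_filter("first_artist", "Elton John"))
```

Generating the fragments from one place keeps the subselects consistent when the artist or the year range changes.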

Neptune charges both per hour and per I/O operation. For our tests we used the smallest Neptune instance, which costs $0.384/hour. For the query above, which computes the profile of a single artist, Amazon charged us for tens of thousands of I/O operations, which amounts to a cost of $0.02.

Conclusion

First of all, Amazon Neptune keeps most of its promises. As a managed service, it is a graph database that is very easy to set up and can be up and running without much configuration. Here are our five key findings:

  • Bulk upload is easy but slow. It can also get complicated because of error messages that are not very helpful.
  • Streaming upload supported everything we expected and was quite fast.
  • Queries are simple, but not interactive enough to run analytical queries.
  • SPARQL queries must be optimized manually.
  • Amazon charges are hard to estimate because it is difficult to gauge how much data a SPARQL query scans.

That's all. Sign up for the free webinar on "Load balancing".


Source: www.habr.com