First impressions of Amazon Neptune

Greetings, Habr readers. Ahead of the start of the "AWS for Developers" course, we have prepared a translation of an interesting article.


In many of the use cases that we at bakdata see at our customers' sites, the relevant information is hidden in the connections between entities, for example when analyzing relationships between users, dependencies between elements, or connections between sensors. Such use cases are usually modeled as a graph. Earlier this year, Amazon released its new graph database, Neptune. In this post we want to share our first ideas, good practices, and what could be improved over time.

Why we needed Amazon Neptune

Graph databases promise to handle highly connected datasets better than their relational equivalents. In such datasets, the relevant information is usually stored in the relationships between objects. To test Neptune, we used an amazing open data project, MusicBrainz. MusicBrainz collects every imaginable kind of music metadata, such as information about artists, songs, album releases or concerts, as well as who the artist behind a song collaborated with, or when an album was released in which country. MusicBrainz can be seen as a huge network of entities that are in some way connected to the music industry.

The MusicBrainz dataset is provided as a CSV dump of a relational database. In total, the dump contains about 93 million rows in 157 tables. Some of these tables hold master data such as artists, events, recordings, releases or tracks. Others are link tables that store relationships between artists and recordings, other artists, releases and so on. These link tables demonstrate the graph structure of the dataset. When converting the dataset into RDF triples, we ended up with roughly 500 million triples.

Based on the experience and impressions of the project partners we work with, we present a setting in which this knowledge base is used to retrieve new information. In addition, we expect it to be updated regularly, for example by adding new releases or updating band members.

Setup

As expected, setting up Amazon Neptune is simple, and the process is documented in reasonable detail. You can launch a graph database with just a few clicks. However, when it comes to more detailed configuration, the information you need is hard to find. We therefore want to point out one configuration parameter.

Figure: configuration screen for the parameter group

Amazon states that Neptune focuses on low-latency transactional workloads, which is why the default query timeout is 120 seconds. However, we tested many analytical use cases that regularly hit this limit. The timeout can be changed by creating a new parameter group for Neptune and setting neptune_query_timeout to a suitable limit.
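As a rough illustration, the parameter entry for a longer timeout could be assembled as below. This is a minimal sketch, not the authors' setup: the helper name `query_timeout_parameter` is hypothetical, the dictionary keys follow the shape that the AWS SDK for Python (boto3) uses for parameter group updates, and the value is given in milliseconds, which matches the 120-second default being stored as 120000. The chosen apply method is an assumption to verify against the Neptune documentation.

```python
def query_timeout_parameter(seconds, apply_method="pending-reboot"):
    """Build one parameter entry for a Neptune parameter group.

    neptune_query_timeout is expressed in milliseconds, so the value is
    converted from seconds. apply_method is an assumption; check which
    method your parameter group accepts.
    """
    return {
        "ParameterName": "neptune_query_timeout",
        "ParameterValue": str(int(seconds * 1000)),
        "ApplyMethod": apply_method,
    }


# Example: allow analytical queries to run for up to 30 minutes.
# The entry would then be passed to the SDK call that modifies the
# parameter group (not executed here, since it needs AWS credentials).
thirty_minutes = query_timeout_parameter(30 * 60)
print(thirty_minutes["ParameterValue"])
```

Keeping the conversion in one place avoids the easy mistake of entering seconds where milliseconds are expected.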

Loading Data

Below we describe in detail how we loaded the MusicBrainz data into Neptune.

Relations to triples

First, we converted the MusicBrainz data into RDF triples. For each table, we defined a template that determines how each column is represented in a triple. In this example, each row of the artist table is mapped to twelve RDF triples.

<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gid> "${gid}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> "${name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/sort-name> "${sort_name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/begin-date> "${begin_date_year}-${begin_date_month}-${begin_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/end-date> "${end_date_year}-${end_date_month}-${end_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/type> <http://musicbrainz.foo/artist-type/${type}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> <http://musicbrainz.foo/area/${area}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gender> <http://musicbrainz.foo/gender/${gender}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/comment> "${comment}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/edits-pending> "${edits_pending}"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/last-updated> "${last_updated}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/ended> "${ended}"^^<http://www.w3.org/2001/XMLSchema#boolean> .
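The article does not show the tool that fills in these templates. As a minimal sketch, assuming each CSV row is available as a dictionary, Python's `string.Template` happens to use the same `${name}` placeholder syntax. The names `ARTIST_TEMPLATES`, `escape_literal` and `row_to_triples` are hypothetical, and only two of the twelve template lines are reproduced.

```python
from string import Template

# Two of the twelve artist templates, copied from the listing above.
ARTIST_TEMPLATES = [
    Template('<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gid> '
             '"${gid}"^^<http://www.w3.org/2001/XMLSchema#string> .'),
    Template('<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> '
             '"${name}"^^<http://www.w3.org/2001/XMLSchema#string> .'),
]


def escape_literal(value):
    """Escape characters that would otherwise break an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def row_to_triples(row):
    """Render one table row (a dict of column name to value) into triples."""
    safe = {key: escape_literal(str(value)) for key, value in row.items()}
    return [template.substitute(safe) for template in ARTIST_TEMPLATES]


for line in row_to_triples({"id": 1, "gid": "abc-123", "name": 'The "Example" Band'}):
    print(line)
```

Escaping quotes, backslashes and line breaks before substitution matters because a raw value containing any of them produces a line that an N-Triples parser rejects.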

Bulk upload

The recommended way to load large amounts of data into Neptune is the bulk load process via S3. After uploading the triple files to S3, you start the load with a POST request. In our case, this took about 24 hours for 500 million triples. We had expected it to be faster.

curl -X POST -H 'Content-Type: application/json' http://your-neptune-cluster:8182/loader -d '{
  "source" : "s3://your-s3-bucket",
  "format" : "ntriples",
  "iamRoleArn" : "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
  "region" : "eu-west-1",
  "failOnError" : "FALSE"
}'
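The same request can be prepared from a script. The sketch below only builds the request object with the standard library and does not send it. The function names are hypothetical, and the endpoint, bucket and role values are the placeholders from the curl call. Polling the load status at `/loader/<loadId>` reflects the loader's status endpoint, but the exact response fields should be checked against the Neptune documentation.

```python
import json
from urllib import request


def build_load_request(endpoint, source, iam_role_arn, region, fmt="ntriples"):
    """Prepare the POST request that starts a bulk load from S3."""
    payload = {
        "source": source,
        "format": fmt,
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "FALSE",
    }
    return request.Request(
        f"{endpoint}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load_status_url(endpoint, load_id):
    """URL to poll for the status of a running load job."""
    return f"{endpoint}/loader/{load_id}"


req = build_load_request(
    "http://your-neptune-cluster:8182",
    "s3://your-s3-bucket",
    "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
    "eu-west-1",
)
print(req.get_method(), req.full_url)
# Sending it would be: request.urlopen(req), from a host inside the cluster's VPC.
```

Since a load of this size runs for many hours, polling the status URL from a script is more practical than waiting on a terminal.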

To avoid this lengthy process every time we start Neptune, we decided to restore the instance from a snapshot in which the triples were already loaded. Starting from a snapshot is considerably faster, but it still takes about an hour until Neptune is available for queries.

When we first loaded the triples into Neptune, we ran into various errors.

{
  "errorCode" : "PARSING_ERROR",
  "errorMessage" : "Content after '.' is not allowed",
  "fileName" : [...],
  "recordNum" : 25
}

Some of them were parsing errors, as shown above. To this day we have not figured out what exactly went wrong here. A little more detail in the error message would definitely help. This error occurred for about 1% of the inserted triples. For the purpose of testing Neptune, however, we accepted that we were working with only 99% of the information from MusicBrainz.

Although this is straightforward for anyone familiar with SPARQL, keep in mind that RDF triples must be annotated with explicit data types, which can again lead to errors.
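Because the loader reports only a record number, it can help to screen the generated files locally before uploading them. The following is a deliberately simplified sketch: the regular expression covers only IRI subjects and predicates with IRI or literal objects (no blank nodes), it is not Neptune's parser, and `find_invalid_lines` is a hypothetical helper name.

```python
import re

# Simplified N-Triples shapes: <iri> <iri> (<iri> | "literal"[^^<iri> | @lang]) .
IRI = r'<[^<>"{}|^`\\\s]*>'
LITERAL = (r'"(?:[^"\\\n\r]|\\.)*"'
           r'(?:\^\^' + IRI + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?')
TRIPLE = re.compile(
    r'^\s*' + IRI + r'\s+' + IRI + r'\s+(?:' + IRI + '|' + LITERAL + r')\s*\.\s*$'
)


def find_invalid_lines(lines):
    """Return (line_number, text) for every line that is not a plain triple."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # blank lines and comments are allowed
        if not TRIPLE.match(text):
            problems.append((number, text))
    return problems


sample = [
    '<http://musicbrainz.foo/artist/1> <http://musicbrainz.foo/name> '
    '"Elton John"^^<http://www.w3.org/2001/XMLSchema#string> .',
    # A prefixed name glued in front of the datatype IRI is not valid N-Triples.
    '<http://musicbrainz.foo/artist/1> <http://musicbrainz.foo/begin-date> '
    '"1947-03-25"^^xsd:<http://www.w3.org/2001/XMLSchema#date> .',
]
for number, text in find_invalid_lines(sample):
    print(f"line {number}: {text}")
```

A check like this does not replace the loader's validation, but it narrows down which template or column produces malformed lines.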

Streaming load

As mentioned above, we do not want to use Neptune as a static data store, but as a flexible and evolving knowledge base. We therefore needed ways to add new triples when the knowledge base changes, for example when a new album is published or when we want to materialize derived knowledge.

Neptune supports insert operations through SPARQL queries, both with explicit data (INSERT DATA) and based on patterns (INSERT ... WHERE). We discuss both approaches below.

One of our goals was to ingest data in a streaming fashion. Consider the release of an album in a new country. From the MusicBrainz perspective, this means that for a release, which covers albums, singles, EPs and so on, a new entry is added to the release_country table. In RDF, we map this information to two new triples.

INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759> };
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int> };
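In a streaming setting these statements would be generated per incoming row. A minimal sketch, assuming the row delivers numeric ids and a year: both triples go into a single INSERT DATA operation, and the result is form-encoded under the `update` key, which is how the SPARQL 1.1 protocol carries updates in a POST body. The function names are hypothetical.

```python
from urllib import parse

BASE = "http://musicbrainz.foo"
XSD_INT = "http://www.w3.org/2001/XMLSchema#int"


def release_country_update(release_country_id, release_id, year):
    """Build one INSERT DATA operation holding both triples of a new entry."""
    subject = f"<{BASE}/release-country/{int(release_country_id)}>"
    return (
        f"INSERT DATA {{ {subject} <{BASE}/release> <{BASE}/release/{int(release_id)}> . "
        f'{subject} <{BASE}/date-year> "{int(year)}"^^<{XSD_INT}> }}'
    )


def update_body(update):
    """Form-encode a SPARQL update for a POST to the /sparql endpoint."""
    return parse.urlencode({"update": update}).encode("utf-8")


statement = release_country_update(737041, 435759, 2018)
print(statement)
print(update_body(statement)[:40])
```

Casting the inputs with `int()` keeps untrusted values from being spliced into the update text, and sending both triples in one operation means they are applied together.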

Another goal was to derive new knowledge from the graph. For example, we want to obtain the number of releases each artist has published over their career. Such a query is fairly complex and takes more than 20 minutes in Neptune, so we need to materialize the result in order to reuse this new knowledge in other queries. We therefore add triples with this information back into the graph by inserting the result of a subquery.

INSERT {
  ?artist_credit <http://musicbrainz.foo/number-of-releases> ?number_of_releases
} WHERE {
  SELECT ?artist_credit (COUNT(*) as ?number_of_releases)
  WHERE {
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
  }
  GROUP BY ?artist_credit
}

Adding single triples to the graph takes a few milliseconds, whereas the time needed to insert the result of a subquery depends on the execution time of the subquery itself.

Although we did not use it often, Neptune also lets you delete triples based on patterns or explicit data, which can be used to update information.
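One way such an update could look is a delete followed by an insert in a single request. This is a sketch under assumptions: the artist-credit IRI passed in is a hypothetical example (the article does not show that IRI scheme), the helper name is made up, and `DELETE WHERE` and `INSERT DATA` are standard SPARQL 1.1 Update forms rather than anything Neptune-specific.

```python
BASE = "http://musicbrainz.foo"
XSD_INT = "http://www.w3.org/2001/XMLSchema#int"


def replace_release_count(artist_credit_iri, count):
    """Build a two-step update: drop any old count, then insert the new one."""
    subject = f"<{artist_credit_iri}>"
    predicate = f"<{BASE}/number-of-releases>"
    return (
        f"DELETE WHERE {{ {subject} {predicate} ?old }};\n"
        f'INSERT DATA {{ {subject} {predicate} "{int(count)}"^^<{XSD_INT}> }}'
    )


# Hypothetical IRI, used only for illustration.
print(replace_release_count("http://musicbrainz.foo/artist-credit/42", 57))
```

Sending both operations in one request keeps a stale count and its replacement from coexisting in the graph between two separate calls.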

SPARQL queries

With the subselect above, which returns the number of releases for each artist, we have already introduced the first type of query we want to answer with Neptune. Running a query against Neptune is easy: send a POST request to the SPARQL endpoint, as shown below:

curl -X POST --data-binary 'query=SELECT ?artist ?p ?o where {?artist <http://musicbrainz.foo/name> "Elton John" . ?artist ?p ?o . }' http://your-neptune-cluster:8182/sparql
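For scripted use, the request and the response handling could be sketched as follows. The request is only constructed, not sent. The `Accept` header asking for the standard SPARQL JSON results format is an assumption about what the endpoint honors, and both function names are hypothetical. The parsing follows the W3C SPARQL 1.1 Query Results JSON Format, in which rows live under `results.bindings`.

```python
import json
from urllib import parse, request


def build_query_request(endpoint, sparql):
    """Prepare a form-encoded POST carrying the query under the 'query' key."""
    body = parse.urlencode({"query": sparql}).encode("utf-8")
    return request.Request(
        endpoint,
        data=body,
        headers={"Accept": "application/sparql-results+json"},
        method="POST",
    )


def rows_from_results(payload):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    document = json.loads(payload)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in document["results"]["bindings"]
    ]


req = build_query_request(
    "http://your-neptune-cluster:8182/sparql",
    'SELECT ?artist WHERE { ?artist <http://musicbrainz.foo/name> "Elton John" . }',
)

# A hand-written response in the standard format, used here instead of a live call.
example_response = json.dumps({
    "head": {"vars": ["artist"]},
    "results": {"bindings": [
        {"artist": {"type": "uri", "value": "http://musicbrainz.foo/artist/1"}}
    ]},
})
print(rows_from_results(example_response))
```

Letting `urlencode` handle the escaping avoids the quoting problems that appear when a query containing quotes or ampersands is pasted into a shell command.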

In addition, we implemented a query that returns an artist profile containing information such as name, age or country of origin. Keep in mind that an artist can be an individual, a band or an orchestra. We further enrich this data with the number of releases the artist published per year. For solo artists, we also add the bands the artist was a member of in each year.

SELECT
  ?artist_name ?year
  ?releases_in_year ?releases_up_year
  ?artist_type_name ?releases
  ?artist_gender ?artist_country_name
  ?artist_begin_date ?bands
  ?bands_in_year
WHERE {
  # Bands for each artist
  {
    SELECT
      ?year
      ?first_artist
      (group_concat(DISTINCT ?second_artist_name;separator=",") as ?bands)
      (COUNT(DISTINCT ?second_artist_name) AS ?bands_in_year)
    WHERE {
      VALUES ?year {
        1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
        1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
        1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
        1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
        2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
        2010 2011 2012 2013 2014 2015 2016 2017 2018
      }
      ?first_artist <http://musicbrainz.foo/name> "Elton John" .
      ?first_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
      ?first_artist <http://musicbrainz.foo/type> ?first_artist_type .
      ?first_artist <http://musicbrainz.foo/name> ?first_artist_name .

      ?second_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
      ?second_artist <http://musicbrainz.foo/type> ?second_artist_type .
      ?second_artist <http://musicbrainz.foo/name> ?second_artist_name .
      optional { ?second_artist <http://musicbrainz.foo/begin-date-year> ?second_artist_begin_date_year . }
      optional { ?second_artist <http://musicbrainz.foo/end-date-year> ?second_artist_end_date_year . }

      ?l_artist_artist <http://musicbrainz.foo/entity0> ?first_artist .
      ?l_artist_artist <http://musicbrainz.foo/entity1> ?second_artist .
      ?l_artist_artist <http://musicbrainz.foo/link> ?link .

      optional { ?link <http://musicbrainz.foo/begin-date-year> ?link_begin_date_year . }
      optional { ?link <http://musicbrainz.foo/end-date-year> ?link_end_date_year . }

      FILTER (!bound(?link_begin_date_year) || ?link_begin_date_year <= ?year)
      FILTER (!bound(?link_end_date_year) || ?link_end_date_year >= ?year)
      FILTER (!bound(?second_artist_begin_date_year) || ?second_artist_begin_date_year <= ?year)
      FILTER (!bound(?second_artist_end_date_year) || ?second_artist_end_date_year >= ?year)
      FILTER (?first_artist_type NOT IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
      FILTER (?second_artist_type IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
    }
    GROUP BY ?first_artist ?year
  }
  # Releases up to a year
  {
    SELECT
      ?artist
      ?year
      (group_concat(DISTINCT ?release_name;separator=",") as ?releases)
      (COUNT(*) as ?releases_up_year)
    WHERE {
      VALUES ?year {
        1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
        1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
        1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
        1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
        2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
        2010 2011 2012 2013 2014 2015 2016 2017 2018
      }

      ?artist <http://musicbrainz.foo/name> "Elton John" .

      ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
      ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
      ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
      ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

      ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
      ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
      ?release_group <http://musicbrainz.foo/name> ?release_group_name .
      ?release <http://musicbrainz.foo/release-group> ?release_group .
      ?release <http://musicbrainz.foo/name> ?release_name .
      ?release_country <http://musicbrainz.foo/release> ?release .
      ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

      FILTER (?release_country_year <= ?year)
    }
    GROUP BY ?artist ?year
  }
  # Releases in a year
  {
    SELECT ?artist ?year (COUNT(*) as ?releases_in_year)
    WHERE {
      VALUES ?year {
        1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
        1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
        1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
        1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
        2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
        2010 2011 2012 2013 2014 2015 2016 2017 2018
      }

      ?artist <http://musicbrainz.foo/name> "Elton John" .

      ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
      ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
      ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
      ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

      ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
      ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
      ?release_group <http://musicbrainz.foo/name> ?release_group_name .
      ?release <http://musicbrainz.foo/release-group> ?release_group .
      ?release_country <http://musicbrainz.foo/release> ?release .
      ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

      FILTER (?release_country_year = ?year)
    }
    GROUP BY ?artist ?year
  }
  # Master data
  {
    SELECT DISTINCT ?artist ?artist_name ?artist_gender ?artist_begin_date ?artist_country_name
    WHERE {
      ?artist <http://musicbrainz.foo/name> ?artist_name .
      ?artist <http://musicbrainz.foo/name> "Elton John" .
      ?artist <http://musicbrainz.foo/gender> ?artist_gender_id .
      ?artist_gender_id <http://musicbrainz.foo/name> ?artist_gender .
      ?artist <http://musicbrainz.foo/area> ?birth_area .
      ?artist <http://musicbrainz.foo/begin-date-year> ?artist_begin_date .
      ?birth_area <http://musicbrainz.foo/name> ?artist_country_name .

      FILTER(datatype(?artist_begin_date) = <http://www.w3.org/2001/XMLSchema#int>)
    }
  }
}

Because of the complexity of this query, we could only run point queries for a specific artist, such as Elton John, but not for all artists. Neptune does not seem to optimize such queries by pushing filters down into the subselects. Each subselect therefore has to be filtered manually by the artist's name.
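Since the year list and the artist filter have to be repeated in every subselect, generating those fragments avoids copy-and-paste drift. A minimal sketch with hypothetical helper names, which produces only the repeated pieces and leaves the rest of the query untouched:

```python
def values_block(variable, first, last):
    """Render a VALUES clause that enumerates the years first..last inclusive."""
    years = " ".join(str(year) for year in range(first, last + 1))
    return f"VALUES ?{variable} {{ {years} }}"


def name_pattern(variable, name):
    """Render the triple pattern that pins a subselect to one artist name."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'?{variable} <http://musicbrainz.foo/name> "{escaped}" .'


# The same two fragments would be placed into each subselect of the profile query.
print(values_block("year", 1960, 1962))
print(name_pattern("artist", "Elton John"))
```

With the fragments generated in one place, switching the profile to another artist or year range means changing two arguments instead of editing the query in several places.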

Neptune is billed both per hour and per I/O operation. For our tests we used the smallest Neptune instance, which costs $0.384 per hour. For the query above, which computes the profile of a single artist, Amazon charged us for several tens of thousands of I/O operations, which corresponds to a cost of $0.02.
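The two cost components can be combined in a small estimate. The hourly rate is the one quoted above. The price per million I/O operations is an assumed placeholder that is not taken from the article and should be replaced with the current rate for your region. The function name is hypothetical.

```python
def estimate_cost(hours, io_operations, hourly_rate=0.384, io_price_per_million=0.20):
    """Estimate Neptune cost in USD from instance hours and I/O operations.

    io_price_per_million is an assumed placeholder, not a quoted price.
    """
    instance_cost = hours * hourly_rate
    io_cost = io_operations / 1_000_000 * io_price_per_million
    return instance_cost + io_cost


# One hour of the smallest instance plus 50,000 I/O operations.
print(round(estimate_cost(1, 50_000), 4))
```

The instance hours are easy to predict. The I/O count is the uncertain input, because it depends on how much data each query touches.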

Conclusion

At first glance, Amazon Neptune keeps its promises. As a managed service, it is a graph database that is very easy to set up and can be used without much configuration. Here are our five key findings:

  • Bulk upload is simple but slow, and unhelpful error messages can make it complicated.
  • Streaming load supports everything we expected and is reasonably fast.
  • Querying is simple, but not interactive enough for running analytical queries.
  • SPARQL queries have to be optimized manually.
  • Amazon's charges are hard to estimate, because it is difficult to predict how much data a SPARQL query scans.

That's all. Sign up for the free webinar on "Load Balancing".


Source: www.habr.com
