Hello, Habr readers.
In many of the use cases we are interested in ...
Why we needed Amazon Neptune
Graph databases promise to handle highly connected datasets better than their relational counterparts. In such datasets, the relevant information is usually stored in the relationships between objects. We used an excellent open data project, MusicBrainz, to test Neptune.
The MusicBrainz dataset is provided as a CSV dump of a relational database. In total, the dump contains about 93 million rows in 157 tables. Some of these tables contain master data such as artists, events, recordings, releases or tracks. Others are link tables that store relationships between artists and recordings, other artists, releases and so on. These link tables show the graph structure of the dataset. Converting the dataset to RDF triples gave us approximately 500 million instances.
Based on the experience and impressions of the project partners we work with, we present a setting in which this knowledge base is used to derive new information. We also expect it to be updated regularly, for example by adding new releases or updating band members.
Setup
As expected, setting up Amazon Neptune is simple. The process is documented in quite some detail.
Screenshot: configuration of the parameter group
Amazon states that Neptune focuses on low-latency transactional workloads, which is why the default query timeout is 120 seconds. We tested many analytical use cases in which we regularly hit this limit. The timeout can be changed by creating a new parameter group for Neptune and setting neptune_query_timeout to the desired limit.
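The parameter group change can also be scripted. The following is a minimal sketch in Python, assuming boto3 is installed and AWS credentials are configured. The group name, the parameter group family and the choice of a cluster-level parameter group are assumptions for illustration and depend on the engine version. neptune_query_timeout is given in milliseconds, so 120 seconds corresponds to 120000.

```python
from typing import Any, Dict, List


def timeout_parameters(timeout_seconds: int) -> List[Dict[str, Any]]:
    """Build the parameter list that sets neptune_query_timeout."""
    if timeout_seconds <= 0:
        raise ValueError("timeout must be positive")
    return [{
        "ParameterName": "neptune_query_timeout",
        # Neptune expects the timeout in milliseconds.
        "ParameterValue": str(timeout_seconds * 1000),
        # "pending-reboot" means the value is applied at the next restart.
        "ApplyMethod": "pending-reboot",
    }]


def apply_timeout(group_name: str, family: str, timeout_seconds: int) -> None:
    """Create a cluster parameter group and set the timeout in it.

    group_name and family are hypothetical inputs; the family string
    (for example "neptune1") must match the engine version in use.
    """
    import boto3  # imported lazily so timeout_parameters works without AWS

    client = boto3.client("neptune")
    client.create_db_cluster_parameter_group(
        DBClusterParameterGroupName=group_name,
        DBParameterGroupFamily=family,
        Description="Longer query timeout for analytical queries",
    )
    client.modify_db_cluster_parameter_group(
        DBClusterParameterGroupName=group_name,
        Parameters=timeout_parameters(timeout_seconds),
    )
```

The new group still has to be attached to the cluster before the changed timeout takes effect.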
Loading data
Below we discuss in detail how we loaded the MusicBrainz data into Neptune.
Relations to triples
First, we converted the MusicBrainz data into RDF triples. For each table, we defined a template that specifies how each column is represented in a triple. In this example, each row of the artist table is mapped to twelve RDF triples.
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gid> "${gid}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> "${name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/sort-name> "${sort_name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/begin-date> "${begin_date_year}-${begin_date_month}-${begin_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/end-date> "${end_date_year}-${end_date_month}-${end_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/type> <http://musicbrainz.foo/artist-type/${type}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> <http://musicbrainz.foo/area/${area}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gender> <http://musicbrainz.foo/gender/${gender}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/comment> "${comment}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/edits-pending> "${edits_pending}"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/last-updated> "${last_updated}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/ended> "${ended}"^^<http://www.w3.org/2001/XMLSchema#boolean> .
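The templates use the same `${column}` placeholder syntax as Python's `string.Template`, so the conversion step could look roughly like the sketch below. This is not the converter used for the article. It shows only two of the templates, and the literal escaping and the rule of skipping triples with empty columns are assumptions.

```python
import csv
import io
from string import Template
from typing import Dict, List

# Two of the artist templates, written as string.Template instances.
ARTIST_TEMPLATES = [
    Template('<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> '
             '"${name}"^^<http://www.w3.org/2001/XMLSchema#string> .'),
    Template('<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> '
             '<http://musicbrainz.foo/area/${area}> .'),
]


def escape_literal(value: str) -> str:
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def row_to_triples(row: Dict[str, str]) -> List[str]:
    """Render one table row; templates with an empty column are skipped."""
    present = {key: escape_literal(val) for key, val in row.items() if val}
    triples = []
    for template in ARTIST_TEMPLATES:
        try:
            triples.append(template.substitute(present))
        except KeyError:
            continue  # a referenced column is empty, so emit no triple
    return triples


def csv_to_triples(text: str) -> List[str]:
    """Convert CSV text with a header row into a list of N-Triples lines."""
    triples: List[str] = []
    for row in csv.DictReader(io.StringIO(text)):
        triples.extend(row_to_triples(row))
    return triples
```

Escaping quotes and backslashes before substitution matters because an unescaped quote inside a name ends the literal early and leaves stray content on the line.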
Bulk upload
The recommended way to load large amounts of data into Neptune is the bulk upload process via S3. After uploading your triple files to S3, you start the load with a POST request. In our case, it took about 24 hours for 500 million triples. We had expected it to be faster.
curl -X POST -H 'Content-Type: application/json' http://your-neptune-cluster:8182/loader -d '{
"source" : "s3://your-s3-bucket",
"format" : "ntriples",
"iamRoleArn" : "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
"region" : "eu-west-1",
"failOnError" : "FALSE"
}'
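The same request can be sent from a script. Below is a minimal sketch using only the Python standard library. The endpoint, bucket and role values are the placeholders from the curl call above. The response shape read in `start_load` (a `payload` object with a `loadId`) reflects our understanding of the loader API and should be checked against the Neptune documentation.

```python
import json
import urllib.request


def build_load_request(endpoint: str, source: str, iam_role_arn: str,
                       region: str) -> urllib.request.Request:
    """Build the POST request that starts a bulk load from S3."""
    payload = {
        "source": source,
        "format": "ntriples",
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=f"{endpoint}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_load(request: urllib.request.Request) -> str:
    """Send the request and return the load id reported by the loader."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumed response shape: {"status": ..., "payload": {"loadId": ...}}
    return body["payload"]["loadId"]
```

With the load id, progress can then be polled with a GET request to `/loader/<loadId>` on the same endpoint.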
To avoid this lengthy process every time we launch Neptune, we decided to restore the instance from a snapshot in which these triples were already loaded. Starting from a snapshot is significantly faster, but it still takes about an hour until Neptune is available for requests.
When we first loaded the triples into Neptune, we ran into various errors.
{
"errorCode" : "PARSING_ERROR",
"errorMessage" : "Content after '.' is not allowed",
"fileName" : [...],
"recordNum" : 25
}
Some of them were parsing errors, as shown above. To this day, we still do not know what exactly went wrong here. A little more detail in the error message would definitely help. These errors occurred for approximately 1% of the inserted triples. For the purpose of testing Neptune, however, we accepted that we were working with only 99% of the information from MusicBrainz.
Although this is obvious to people familiar with SPARQL, be aware that RDF triples must be annotated with explicit data types, which again can cause errors.
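One way to narrow down such parsing errors might be to check the generated files locally before uploading them. The sketch below uses a deliberately simplified regular expression for one N-Triples line. It is an approximation of the grammar, not a full parser, and it is not known whether it would have caught the errors reported above.

```python
import re
from typing import Iterable, List, Tuple

# Simplified building blocks of the N-Triples grammar (an approximation).
IRI = r'<[^<>"{}|^`\\\s]*>'
BNODE = r'_:[A-Za-z0-9_][A-Za-z0-9_.-]*'
LITERAL = (r'"(?:[^"\\\n\r]|\\.)*"'
           r'(?:\^\^' + IRI + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?')
TRIPLE = re.compile(
    r'^\s*(?:' + IRI + '|' + BNODE + r')\s+' + IRI +
    r'\s+(?:' + IRI + '|' + BNODE + '|' + LITERAL + r')\s*\.\s*(?:#.*)?$'
)


def find_invalid_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that do not look like a triple."""
    invalid = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are allowed
        if not TRIPLE.match(stripped):
            invalid.append((number, line))
    return invalid
```

Typical candidates that this check flags are unescaped quotes inside a literal and a datatype that is not written as a plain `^^<IRI>` suffix.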
Streaming load
As mentioned above, we do not want to use Neptune as a static data store, but as a flexible and evolving knowledge base. We therefore needed a way to add new triples when the knowledge base changes, for example when a new album is released or when we want to materialize derived knowledge.
Neptune supports insert operations through SPARQL queries, both with raw data and based on patterns. We discuss both approaches below.
One of our goals was to insert data in a streaming fashion. Consider the release of an album in a new country. From the MusicBrainz point of view, this means that for a release, which covers albums, singles, EPs and so on, a new entry is added to the release-country table. In RDF, we map this information to two new triples.
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759> };
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int> };
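For a streaming setup, such updates would typically be generated from incoming events. The following Python sketch builds the same two triples as a single INSERT DATA statement and wraps it in a form-encoded POST with an `update` parameter, as defined by the SPARQL 1.1 protocol. The function names and the endpoint are hypothetical.

```python
import urllib.parse
import urllib.request

MB = "http://musicbrainz.foo/"
XSD_INT = "http://www.w3.org/2001/XMLSchema#int"


def release_country_update(release_country_id: int, release_id: int,
                           year: int) -> str:
    """Build an INSERT DATA statement for one new release-country row."""
    # int() guards against anything but plain numbers ending up in the IRIs.
    subject = f"<{MB}release-country/{int(release_country_id)}>"
    return (
        "INSERT DATA {\n"
        f"  {subject} <{MB}release> <{MB}release/{int(release_id)}> .\n"
        f'  {subject} <{MB}date-year> "{int(year)}"^^<{XSD_INT}> .\n'
        "}"
    )


def build_update_request(endpoint: str, update: str) -> urllib.request.Request:
    """Wrap a SPARQL update in a form-encoded POST to the /sparql endpoint."""
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(f"{endpoint}/sparql", data=body, method="POST")
```

Passing the resulting request to `urllib.request.urlopen` would send it. Putting both triples into one statement keeps the two facts about the new row in a single update.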
Another goal was to derive new knowledge from the graph. Say we want to obtain the number of releases that each artist has published in their career. Such a query is fairly complex and takes more than 20 minutes in Neptune, so we need to materialize the result in order to reuse this new knowledge in other queries. We therefore add triples with this information back to the graph by inserting the result of a subquery.
INSERT {
?artist_credit <http://musicbrainz.foo/number-of-releases> ?number_of_releases
} WHERE {
SELECT ?artist_credit (COUNT(*) as ?number_of_releases)
WHERE {
?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
?release_group <http://musicbrainz.foo/name> ?release_group_name .
}
GROUP BY ?artist_credit
}
Adding a single triple to the graph takes a few milliseconds, whereas the execution time for inserting the result of a subquery depends on the execution time of the subquery itself.
Although we did not use it often, Neptune also allows you to delete triples based on patterns or explicit data, which can be used to update information.
SPARQL queries
With the subselect above, which returns the number of releases for each artist, we have already introduced the first type of query we want to answer with Neptune. Issuing a query in Neptune is easy: send a POST request to the SPARQL endpoint, as shown below:
curl -X POST --data-binary 'query=SELECT ?artist ?p ?o where {?artist <http://musicbrainz.foo/name> "Elton John" . ?artist ?p ?o . }' http://your-neptune-cluster:8182/sparql
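For use from application code, the same call can be made with the Python standard library. The sketch below assumes the endpoint returns the standard SPARQL 1.1 JSON results format when asked for `application/sparql-results+json`. The helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List


def build_query_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a form-encoded POST carrying the query to the /sparql endpoint."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{endpoint}/sparql",
        data=body,
        headers={"Accept": "application/sparql-results+json"},
        method="POST",
    )


def bindings_to_rows(result: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten SPARQL JSON results into one plain dict per solution."""
    variables = result["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in result["results"]["bindings"]
    ]


def run_query(endpoint: str, query: str) -> List[Dict[str, str]]:
    """Send the query and return the flattened result rows."""
    with urllib.request.urlopen(build_query_request(endpoint, query)) as response:
        return bindings_to_rows(json.load(response))
```

Unbound variables, for example from OPTIONAL patterns, are simply missing from the corresponding row.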
In addition, we implemented a query that returns an artist profile with information about the artist's name, age or country of origin. Keep in mind that an artist can be an individual, a band or an orchestra. We also enrich this data with the number of releases the artist put out per year. For solo artists, we additionally include the bands the artist was a member of in each year.
SELECT
?artist_name ?year
?releases_in_year ?releases_up_year
?artist_type_name ?releases
?artist_gender ?artist_country_name
?artist_begin_date ?bands
?bands_in_year
WHERE {
# Bands for each artist
{
SELECT
?year
?first_artist
(group_concat(DISTINCT ?second_artist_name;separator=",") as ?bands)
(COUNT(DISTINCT ?second_artist_name) AS ?bands_in_year)
WHERE {
VALUES ?year {
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018
}
?first_artist <http://musicbrainz.foo/name> "Elton John" .
?first_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
?first_artist <http://musicbrainz.foo/type> ?first_artist_type .
?first_artist <http://musicbrainz.foo/name> ?first_artist_name .
?second_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
?second_artist <http://musicbrainz.foo/type> ?second_artist_type .
?second_artist <http://musicbrainz.foo/name> ?second_artist_name .
optional { ?second_artist <http://musicbrainz.foo/begin-date-year> ?second_artist_begin_date_year . }
optional { ?second_artist <http://musicbrainz.foo/end-date-year> ?second_artist_end_date_year . }
?l_artist_artist <http://musicbrainz.foo/entity0> ?first_artist .
?l_artist_artist <http://musicbrainz.foo/entity1> ?second_artist .
?l_artist_artist <http://musicbrainz.foo/link> ?link .
optional { ?link <http://musicbrainz.foo/begin-date-year> ?link_begin_date_year . }
optional { ?link <http://musicbrainz.foo/end-date-year> ?link_end_date_year . }
FILTER (!bound(?link_begin_date_year) || ?link_begin_date_year <= ?year)
FILTER (!bound(?link_end_date_year) || ?link_end_date_year >= ?year)
FILTER (!bound(?second_artist_begin_date_year) || ?second_artist_begin_date_year <= ?year)
FILTER (!bound(?second_artist_end_date_year) || ?second_artist_end_date_year >= ?year)
FILTER (?first_artist_type NOT IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
FILTER (?second_artist_type IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
}
GROUP BY ?first_artist ?year
}
# Releases up to a year
{
SELECT
?artist
?year
(group_concat(DISTINCT ?release_name;separator=",") as ?releases)
(COUNT(*) as ?releases_up_year)
WHERE {
VALUES ?year {
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018
}
?artist <http://musicbrainz.foo/name> "Elton John" .
?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
?release_group <http://musicbrainz.foo/name> ?release_group_name .
?release <http://musicbrainz.foo/release-group> ?release_group .
?release <http://musicbrainz.foo/name> ?release_name .
?release_country <http://musicbrainz.foo/release> ?release .
?release_country <http://musicbrainz.foo/date-year> ?release_country_year .
FILTER (?release_country_year <= ?year)
}
GROUP BY ?artist ?year
}
# Releases in a year
{
SELECT ?artist ?year (COUNT(*) as ?releases_in_year)
WHERE {
VALUES ?year {
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018
}
?artist <http://musicbrainz.foo/name> "Elton John" .
?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
?release_group <http://musicbrainz.foo/name> ?release_group_name .
?release <http://musicbrainz.foo/release-group> ?release_group .
?release_country <http://musicbrainz.foo/release> ?release .
?release_country <http://musicbrainz.foo/date-year> ?release_country_year .
FILTER (?release_country_year = ?year)
}
GROUP BY ?artist ?year
}
# Master data
{
SELECT DISTINCT ?artist ?artist_name ?artist_gender ?artist_begin_date ?artist_country_name
WHERE {
?artist <http://musicbrainz.foo/name> ?artist_name .
?artist <http://musicbrainz.foo/name> "Elton John" .
?artist <http://musicbrainz.foo/gender> ?artist_gender_id .
?artist_gender_id <http://musicbrainz.foo/name> ?artist_gender .
?artist <http://musicbrainz.foo/area> ?birth_area .
?artist <http://musicbrainz.foo/begin-date-year> ?artist_begin_date.
?birth_area <http://musicbrainz.foo/name> ?artist_country_name .
FILTER(datatype(?artist_begin_date) = <http://www.w3.org/2001/XMLSchema#int>)
}
}
}
Because of the complexity of such queries, we could only run point queries for a specific artist, such as Elton John, but not for all artists. Neptune does not seem to optimize such queries by pushing filters down into the subselects. Therefore, every subselect has to be filtered manually by the artist name.
Neptune has both per-hour and per-I/O charges. For our tests, we used the smallest Neptune instance, which costs $0.384/hour. For the query above, which computes the profile of a single artist, Amazon charged us for tens of thousands of I/O operations, which implies a cost of $0.02.
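Since the artist filter and the list of years have to be repeated in every subselect, it can help to generate these fragments instead of copying them. The Python sketch below is one possible way to do that. The helper names are hypothetical, and the generated subselect is an abbreviated version of the "releases in a year" block that omits the rdftype patterns.

```python
MB = "http://musicbrainz.foo/"


def year_values(first_year: int, last_year: int, per_line: int = 10) -> str:
    """Build the VALUES ?year { ... } block for an inclusive year range."""
    years = [str(year) for year in range(first_year, last_year + 1)]
    lines = [" ".join(years[i:i + per_line])
             for i in range(0, len(years), per_line)]
    return "VALUES ?year {\n" + "\n".join(lines) + "\n}"


def artist_name_pattern(variable: str, artist_name: str) -> str:
    """Build the triple pattern that pins a variable to one artist name."""
    escaped = artist_name.replace("\\", "\\\\").replace('"', '\\"')
    return f'?{variable} <{MB}name> "{escaped}" .'


def releases_in_year_subselect(artist_name: str, first_year: int,
                               last_year: int) -> str:
    """Build an abbreviated 'releases in a year' subselect for one artist."""
    return "\n".join([
        "SELECT ?artist ?year (COUNT(*) AS ?releases_in_year)",
        "WHERE {",
        year_values(first_year, last_year),
        # The artist filter is repeated inside the subselect on purpose.
        artist_name_pattern("artist", artist_name),
        f"?artist_credit_name <{MB}artist> ?artist .",
        f"?artist_credit_name <{MB}artist-credit> ?artist_credit .",
        f"?release_group <{MB}artist-credit> ?artist_credit .",
        f"?release <{MB}release-group> ?release_group .",
        f"?release_country <{MB}release> ?release .",
        f"?release_country <{MB}date-year> ?release_country_year .",
        "FILTER (?release_country_year = ?year)",
        "}",
        "GROUP BY ?artist ?year",
    ])
```

Calling `releases_in_year_subselect("Elton John", 1960, 2018)` yields a block that can be wrapped in braces and placed into the outer query. Generating every subselect from the same name keeps the manual filters consistent.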
Conclusion
First of all, Amazon Neptune keeps most of its promises. As a managed service, it is a graph database that is very easy to set up and can be up and running without much configuration. Here are our five key findings:
- Bulk upload is easy but slow. It can also become complicated because the error messages are not very helpful.
- Streaming load supports everything we expected and is quite fast.
- Queries are simple to issue, but not interactive enough for running analytical queries.
- SPARQL queries have to be optimized manually.
- Amazon charges are hard to estimate, because it is hard to estimate the amount of data scanned by a SPARQL query.
sumber: www.habr.com