First impressions of Amazon Neptune

Hello, Habr readers. Ahead of the start of the course "AWS for Developers", we have prepared a translation of an interesting article.

In many of the use cases that we at bakdata see at our customers' sites, the relevant information is hidden in the connections between entities, for example when analyzing relationships between users, dependencies between components, or connections between sensors. Such use cases are usually modeled as a graph. Earlier this year, Amazon released its new graph database, Neptune. In this post we want to share our first impressions, good practices, and what could be improved over time.

Why we needed Amazon Neptune

Graph databases promise to handle highly connected datasets better than their relational counterparts. In such datasets, the relevant information is usually stored in the relationships between objects. To test Neptune, we used the amazing open data project MusicBrainz. MusicBrainz collects every kind of music metadata imaginable, such as information about artists, songs, album releases, or concerts, as well as with whom the performer of a song collaborated or in which country an album was released. MusicBrainz can be seen as a huge network of entities that are somehow connected to the music industry.

The MusicBrainz dataset is provided as a CSV dump of a relational database. In total, the dump contains about 93 million rows in 157 tables. While some of these tables contain master data such as artists, events, recordings, releases, or tracks, others are link tables that store relationships between artists and recordings, other artists, or releases, and so on. They reveal the graph structure of the dataset. When we converted the dataset to RDF triples, we ended up with about 500 million triples.

Based on the experience and impressions of the project partners we work with, we present a setting in which this knowledge base is used to derive new information. In addition, we expect it to be updated regularly, for example by adding new releases or updating band members.

Setup

As expected, setting up Amazon Neptune is straightforward, and the process is documented in detail. You can launch a graph database with just a few clicks. However, when it comes to more detailed configuration, the necessary information is hard to find. We therefore want to point out one configuration parameter.

Screenshot: configuration of parameter groups

Amazon states that Neptune focuses on low-latency transactional workloads, which is why the default query timeout is 120 seconds. However, we tested many analytical use cases in which we regularly hit this limit. The timeout can be changed by creating a new parameter group for Neptune and setting neptune_query_timeout to an appropriate limit.

Loading Data

Below we discuss in detail how we loaded the MusicBrainz data into Neptune.

Relational to triples

First, we converted the MusicBrainz data to RDF triples. To do so, for each table we defined a template that describes how each column is represented as a triple. In this example, each row of the artist table is mapped to twelve RDF triples.

<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gid> "${gid}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> "${name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/sort-name> "${sort_name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/begin-date> "${begin_date_year}-${begin_date_month}-${begin_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/end-date> "${end_date_year}-${end_date_month}-${end_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/type> <http://musicbrainz.foo/artist-type/${type}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> <http://musicbrainz.foo/area/${area}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gender> <http://musicbrainz.foo/gender/${gender}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/comment> "${comment}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/edits-pending> "${edits_pending}"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/last-updated> "${last_updated}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/ended> "${ended}"^^<http://www.w3.org/2001/XMLSchema#boolean> .
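
How such a template might be filled from one row of the dump can be sketched in Python. This is only an illustration, not the conversion code used for this post: the function names are hypothetical, only three of the twelve templates are rendered, and the column names (id, gid, name, type) are taken from the placeholders in the templates.

```python
BASE = "http://musicbrainz.foo"
XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(value: str) -> str:
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def artist_triples(row: dict) -> list:
    """Render a subset of the artist templates for one row (hypothetical helper)."""
    subject = f"<{BASE}/artist/{row['id']}>"
    gid = escape_literal(row["gid"])
    name = escape_literal(row["name"])
    triples = [
        f'{subject} <{BASE}/gid> "{gid}"^^<{XSD}string> .',
        f'{subject} <{BASE}/name> "{name}"^^<{XSD}string> .',
    ]
    # Assumption: an empty column means "no value", so no triple is emitted
    # instead of writing an IRI that ends in an empty identifier.
    if row.get("type"):
        triples.append(
            f"{subject} <{BASE}/type> <{BASE}/artist-type/{row['type']}> ."
        )
    return triples


if __name__ == "__main__":
    example = {"id": "1", "gid": "abc", "name": "Elton John", "type": "1"}
    print("\n".join(artist_triples(example)))
```

Escaping the literal values and skipping empty columns are design choices of this sketch; they are meant to keep every emitted line a complete triple.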

Bulk upload

The recommended way to load large amounts of data into Neptune is the bulk upload process via S3. After uploading your triple files to S3, you start the load with a POST request. In our case, it took about 24 hours for 500 million triples. We had expected it to be faster.

curl -X POST -H 'Content-Type: application/json' http://your-neptune-cluster:8182/loader -d '{
 "source" : "s3://your-s3-bucket",
 "format" : "ntriples",
 "iamRoleArn" : "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
 "region" : "eu-west-1",
 "failOnError" : "FALSE"
}'
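
The same request can also be assembled from a script, which is convenient when loads are repeated. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and the endpoint, bucket, and role values are the placeholders from the curl call above. As far as we know, the loader answers with a load id that can be polled with a GET request on /loader/<loadId>; treat that detail as an assumption to check against the Neptune documentation.

```python
import json
import urllib.request


def build_load_request(endpoint: str, source: str, role_arn: str, region: str):
    """Build (but do not send) the POST request for the Neptune bulk loader."""
    payload = {
        "source": source,
        "format": "ntriples",
        "iamRoleArn": role_arn,
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=f"{endpoint}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_url(endpoint: str, load_id: str) -> str:
    """URL for polling a running load (assumed loader status path)."""
    return f"{endpoint}/loader/{load_id}"


def send(request):
    """Send a prepared request and decode the JSON answer."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    req = build_load_request(
        "http://your-neptune-cluster:8182",
        "s3://your-s3-bucket",
        "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
        "eu-west-1",
    )
    print(req.full_url)
    print(req.data.decode("utf-8"))
```

Building the request separately from sending it makes it possible to inspect the payload without access to a cluster.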

To avoid this lengthy process every time we start Neptune, we decided to restore the instance from a snapshot in which these triples were already loaded. Starting from a snapshot is much faster, but it still takes about an hour until Neptune is available for requests.

When we first loaded the triples into Neptune, we ran into various errors.

{
 "errorCode" : "PARSING_ERROR",
 "errorMessage" : "Content after '.' is not allowed",
 "fileName" : [...],
 "recordNum" : 25
}

Some of them were parsing errors, as shown above. So far, we have not found out what exactly went wrong at this point. A few more details in the error message would definitely help here. This error occurred for roughly 1% of the inserted triples. For the purpose of testing Neptune, however, we accepted that we work with only 99% of the information from MusicBrainz.

Although this is simple for people familiar with SPARQL, note that RDF triples have to be annotated with explicit data types, which can again cause errors.

Streaming upload

As mentioned above, we do not want to use Neptune as a static data store, but as a flexible and evolving knowledge base. So we needed to find ways to insert new triples when the knowledge base changes, for example when a new album is published or when we want to materialize derived knowledge.

Neptune supports insert operations via SPARQL queries, both with raw data and pattern-based. We discuss both approaches below.

One of our goals was to insert data in a streaming manner. Consider the release of an album in a new country. From the MusicBrainz perspective, this means that for a release, which covers albums, singles, EPs, and so on, a new entry is added to the release-country table. In RDF, we map this information to two new triples.
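
One way to narrow such errors down before uploading is to check the triple files locally. The sketch below is not what Neptune's parser does; it is a deliberately simplified regular expression for one N-Triples statement per line (it does not validate escape sequences or blank node labels strictly), and the function name is hypothetical. It does flag lines that carry extra content after the closing '.', as well as malformed datatype suffixes.

```python
import re

# Simplified building blocks of an N-Triples statement (assumption: not the full grammar).
IRI = r'<[^\x00-\x20<>"{}|^`\\]*>'
LITERAL = (
    r'"(?:[^"\\\n\r]|\\.)*"(?:\^\^' + IRI + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?'
)
TRIPLE = re.compile(
    r'^\s*(?:' + IRI + r'|_:\S+)\s+' + IRI
    + r'\s+(?:' + IRI + r'|_:\S+|' + LITERAL + r')\s*\.\s*(?:#.*)?$'
)


def find_invalid_lines(lines):
    """Return (line_number, line) pairs that do not look like a single triple."""
    invalid = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comment lines are allowed
        if not TRIPLE.match(stripped):
            invalid.append((number, line))
    return invalid


if __name__ == "__main__":
    sample = [
        '<http://musicbrainz.foo/artist/1> <http://musicbrainz.foo/name> '
        '"Elton John"^^<http://www.w3.org/2001/XMLSchema#string> .',
        '<http://musicbrainz.foo/artist/1> <http://musicbrainz.foo/gid> "x" . extra',
    ]
    for number, line in find_invalid_lines(sample):
        print(f"line {number}: {line}")
```

Running a check like this on the files before the upload gives the line number of each suspicious record, which the loader's error message only hints at through recordNum.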

INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759> };
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int> };
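
In a streaming setting, such statements would typically be generated from incoming records. Here is a minimal Python sketch of that idea; the function names are hypothetical, and it assumes the SPARQL endpoint accepts updates as a form-encoded update parameter on the same /sparql path used for queries.

```python
import urllib.parse
import urllib.request

BASE = "http://musicbrainz.foo"
XSD_INT = "http://www.w3.org/2001/XMLSchema#int"


def release_country_update(release_country_id: int, release_id: int, year: int) -> str:
    """Build one INSERT DATA statement holding both triples for a new entry."""
    subject = f"<{BASE}/release-country/{release_country_id}>"
    triples = [
        f"{subject} <{BASE}/release> <{BASE}/release/{release_id}> .",
        f'{subject} <{BASE}/date-year> "{year}"^^<{XSD_INT}> .',
    ]
    return "INSERT DATA {\n  " + "\n  ".join(triples) + "\n}"


def build_update_request(endpoint: str, update: str):
    """Prepare the POST request; urllib sends it form-encoded by default."""
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(f"{endpoint}/sparql", data=body, method="POST")


if __name__ == "__main__":
    statement = release_country_update(737041, 435759, 2018)
    print(statement)
    request = build_update_request("http://your-neptune-cluster:8182", statement)
    print(request.full_url)
```

Putting both triples into one INSERT DATA block is a choice of this sketch; it sends one request per new table entry instead of two.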

Another goal was to derive new knowledge from the graph. Suppose we want to obtain the number of releases each artist has published over their career. Such a query is quite complex and takes more than 20 minutes in Neptune, so we need to materialize the result in order to reuse this new knowledge in other queries. We therefore add triples with this information back to the graph by inserting the result of a subquery.

INSERT {
  ?artist_credit <http://musicbrainz.foo/number-of-releases> ?number_of_releases
} WHERE {
  SELECT ?artist_credit (COUNT(*) as ?number_of_releases)
  WHERE {
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
  }
  GROUP BY ?artist_credit
}

Adding individual triples to the graph takes a few milliseconds, whereas the execution time for inserting the result of a subquery depends on the execution time of the subquery itself.

Although we did not use it as often, Neptune also allows you to delete triples based on patterns or explicit data, which can be used to update information.

SPARQL queries

With the previous example, which returns the number of releases for each artist, we have already introduced the first type of query we want to answer with Neptune. Issuing a query in Neptune is easy: send a POST request to the SPARQL endpoint, as shown below:

curl -X POST --data-binary 'query=SELECT ?artist ?p ?o where {?artist <http://musicbrainz.foo/name> "Elton John" . ?artist ?p ?o . }' http://your-neptune-cluster:8182/sparql
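
From a program, the answer usually has to be turned into rows. The sketch below prepares the same request with the Python standard library and flattens a response in the standard SPARQL 1.1 JSON results format; the function names are hypothetical, and requesting JSON through the Accept header is an assumption of this example.

```python
import json
import urllib.parse
import urllib.request


def build_query_request(endpoint: str, query: str):
    """Prepare a form-encoded POST with the query, asking for JSON results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{endpoint}/sparql",
        data=body,
        headers={"Accept": "application/sparql-results+json"},
        method="POST",
    )


def rows_from_results(document: dict) -> list:
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    variables = document["head"]["vars"]
    return [
        # Unbound variables are simply missing from a binding, so skip them.
        {name: binding[name]["value"] for name in variables if name in binding}
        for binding in document["results"]["bindings"]
    ]


def run_query(endpoint: str, query: str) -> list:
    """Send the query and return the flattened rows."""
    with urllib.request.urlopen(build_query_request(endpoint, query)) as response:
        return rows_from_results(json.load(response))


if __name__ == "__main__":
    sample = {
        "head": {"vars": ["artist", "p", "o"]},
        "results": {"bindings": [{
            "artist": {"type": "uri", "value": "http://musicbrainz.foo/artist/1"},
            "p": {"type": "uri", "value": "http://musicbrainz.foo/name"},
            "o": {"type": "literal", "value": "Elton John"},
        }]},
    }
    print(rows_from_results(sample))
```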

In addition, we implemented a query that returns an artist profile with information about their name, age, or country of origin. Keep in mind that artists can be individuals, bands, or orchestras. We also enrich this data with information about the number of releases the artist published per year. For solo artists, we additionally add information about the bands the artist was part of in each year.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT
 ?artist_name ?year
 ?releases_in_year ?releases_up_year
 ?artist_type_name ?releases
 ?artist_gender ?artist_country_name
 ?artist_begin_date ?bands
 ?bands_in_year
WHERE {
 # Bands for each artist
 {
   SELECT
     ?year
     ?first_artist
     (group_concat(DISTINCT ?second_artist_name;separator=",") as ?bands)
     (COUNT(DISTINCT ?second_artist_name) AS ?bands_in_year)
   WHERE {
     VALUES ?year {
       1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
       1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
       1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
       2010 2011 2012 2013 2014 2015 2016 2017 2018
     }
     ?first_artist <http://musicbrainz.foo/name> "Elton John" .
     ?first_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
     ?first_artist <http://musicbrainz.foo/type> ?first_artist_type .
     ?first_artist <http://musicbrainz.foo/name> ?first_artist_name .

     ?second_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
     ?second_artist <http://musicbrainz.foo/type> ?second_artist_type .
     ?second_artist <http://musicbrainz.foo/name> ?second_artist_name .
     optional { ?second_artist <http://musicbrainz.foo/begin-date-year> ?second_artist_begin_date_year . }
     optional { ?second_artist <http://musicbrainz.foo/end-date-year> ?second_artist_end_date_year . }

     ?l_artist_artist <http://musicbrainz.foo/entity0> ?first_artist .
     ?l_artist_artist <http://musicbrainz.foo/entity1> ?second_artist .
     ?l_artist_artist <http://musicbrainz.foo/link> ?link .

     optional { ?link <http://musicbrainz.foo/begin-date-year> ?link_begin_date_year . }
     optional { ?link <http://musicbrainz.foo/end-date-year> ?link_end_date_year . }

     FILTER (!bound(?link_begin_date_year) || ?link_begin_date_year <= ?year)
     FILTER (!bound(?link_end_date_year) || ?link_end_date_year >= ?year)
     FILTER (!bound(?second_artist_begin_date_year) || ?second_artist_begin_date_year <= ?year)
     FILTER (!bound(?second_artist_end_date_year) || ?second_artist_end_date_year >= ?year)
     FILTER (?first_artist_type NOT IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
     FILTER (?second_artist_type IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
   }
   GROUP BY ?first_artist ?year
 }
 # Releases up to a year
 {
   SELECT
     ?artist
     ?year
     (group_concat(DISTINCT ?release_name;separator=",") as ?releases)
     (COUNT(*) as ?releases_up_year)
   WHERE {
     VALUES ?year {
       1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
       1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
       1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
       2010 2011 2012 2013 2014 2015 2016 2017 2018
     }

     ?artist <http://musicbrainz.foo/name> "Elton John" .

     ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
     ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
     ?release <http://musicbrainz.foo/release-group> ?release_group .
     ?release <http://musicbrainz.foo/name> ?release_name .
     ?release_country <http://musicbrainz.foo/release> ?release .
     ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

     FILTER (?release_country_year <= ?year)
   }
   GROUP BY ?artist ?year
 }
 # Releases in a year
 {
   SELECT ?artist ?year (COUNT(*) as ?releases_in_year)
   WHERE {
     VALUES ?year {
       1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
       1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
       1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
       2010 2011 2012 2013 2014 2015 2016 2017 2018
     }

     ?artist <http://musicbrainz.foo/name> "Elton John" .

     ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
     ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
     ?release <http://musicbrainz.foo/release-group> ?release_group .
     ?release_country <http://musicbrainz.foo/release> ?release .
     ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

     FILTER (?release_country_year = ?year)
   }
   GROUP BY ?artist ?year
 }
 # Master data
 {
   SELECT DISTINCT ?artist ?artist_name ?artist_gender ?artist_begin_date ?artist_country_name
   WHERE {
     ?artist <http://musicbrainz.foo/name> ?artist_name .
     ?artist <http://musicbrainz.foo/name> "Elton John" .
     ?artist <http://musicbrainz.foo/gender> ?artist_gender_id .
     ?artist_gender_id <http://musicbrainz.foo/name> ?artist_gender .
     ?artist <http://musicbrainz.foo/area> ?birth_area .
     ?artist <http://musicbrainz.foo/begin-date-year> ?artist_begin_date .
     ?birth_area <http://musicbrainz.foo/name> ?artist_country_name .

     FILTER(datatype(?artist_begin_date) = xsd:int)
   }
 }
}

Due to the complexity of such a query, we could only run it for a specific artist, such as Elton John, but not for all artists. Neptune does not seem to optimize such a query by pushing filters down into the subselects. Therefore, each subselect has to be filtered manually by the artist's name.
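
Since the artist filter and the list of years have to be repeated in every subselect, it can help to generate those fragments instead of copying them by hand. The following Python sketch shows the idea for the "releases in a year" subselect; the helper names are hypothetical and the triple patterns are copied from the query above.

```python
NAME = "<http://musicbrainz.foo/name>"


def year_values(first: int, last: int) -> str:
    """Build the VALUES clause that enumerates the years of interest."""
    years = " ".join(str(year) for year in range(first, last + 1))
    return f"VALUES ?year {{ {years} }}"


def artist_filter(variable: str, name: str) -> str:
    """Build the triple pattern that pins a subselect to one artist name."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'?{variable} {NAME} "{escaped}" .'


def releases_in_year_subselect(name: str, first: int, last: int) -> str:
    """Assemble one subselect with the manual filter pushed inside it."""
    lines = [
        "SELECT ?artist ?year (COUNT(*) AS ?releases_in_year)",
        "WHERE {",
        "  " + year_values(first, last),
        "  " + artist_filter("artist", name),
        "  ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .",
        "  ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .",
        "  ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .",
        "  ?release <http://musicbrainz.foo/release-group> ?release_group .",
        "  ?release_country <http://musicbrainz.foo/release> ?release .",
        "  ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .",
        "  FILTER (?release_country_year = ?year)",
        "}",
        "GROUP BY ?artist ?year",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(releases_in_year_subselect("Elton John", 1960, 2018))
```

Generating the fragments keeps the artist name and the year range consistent across all subselects, which is easy to get wrong when editing the long query manually.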

Neptune charges both per hour and per I/O operation. For our tests, we used the smallest Neptune instance, which costs $0.384/hour. For the query above, which computes the profile of a single artist, Amazon charges us for tens of thousands of I/O operations, which amounts to a cost of $0.02.

Conclusion

First of all, Amazon Neptune keeps most of its promises. As a managed service, it is a graph database that is very easy to set up and can be up and running without much configuration. Here are our five key findings:

  • Bulk upload is simple but slow, and it can be complicated by error messages that are not very helpful.
  • Streaming upload supports everything we expected and was quite fast.
  • Queries are simple, but not interactive enough for analytical workloads.
  • SPARQL queries have to be optimized manually.
  • Amazon's charges are hard to estimate because it is hard to predict the amount of data scanned by a SPARQL query.

That's all. Sign up for the free webinar on the topic "Load Balancing".


Source: www.habr.com
