First Impressions of Amazon Neptune

Greetings, Habr readers. Ahead of the start of the course "AWS for Developers", we have prepared a translation of an interesting article.

In many of the use cases that we at bakdata see at our customers' sites, the relevant information is hidden in the connections between entities, for example when analyzing relationships between users, dependencies between items, or connections between sensors. Such use cases are usually modeled as a graph. Earlier this year, Amazon released its new graph database, Neptune. In this post we want to share our first ideas, good practices, and what could be improved over time.

Why we needed Amazon Neptune

Graph databases promise to handle highly connected datasets better than their relational counterparts. In such datasets, the relevant information is usually stored in the relationships between objects. To test Neptune we used an amazing open data project: MusicBrainz. MusicBrainz collects every kind of music metadata you can think of, such as information about artists, songs, album releases or concerts, as well as who the artist behind a song collaborated with or when an album was released in which country. MusicBrainz can be seen as a huge network of entities that are connected in some way to the music industry.

The MusicBrainz dataset is provided as a CSV dump of a relational database. In total, the dump contains about 93 million rows in 157 tables. While some of these tables contain master data such as artists, events, recordings, releases or tracks, others are link tables that store the relationships between artists and recordings, other artists or releases, and so on. They reveal the graph structure of the dataset. When we converted the dataset into RDF triples, we obtained about 500 million instances.

Based on the experience and impressions of the project partners we work with, we envision a setting in which this knowledge base is used to retrieve new information. In addition, we expect it to be updated regularly, for example by adding new releases or updating band members.

Setup

As expected, setting up Amazon Neptune is easy. The process is documented in considerable detail. You can launch a graph database with just a few clicks. However, when it comes to more detailed configuration, the necessary information is hard to find. Therefore, we want to point out one configuration parameter.

Screenshot: configuration of parameter groups

Amazon states that Neptune focuses on low-latency transactional workloads, which is why the default query timeout is 120 seconds. However, we tested many analytical use cases in which we regularly hit this limit. The timeout can be changed by creating a new Neptune parameter group and setting neptune_query_timeout to an appropriate limit.
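
As a rough illustration of the parameter change described above, the following Python sketch builds the parameter entry and hands it to a boto3-style client. It is a sketch under assumptions, not the setup we used: the helper names are hypothetical, the parameter group is assumed to exist already, the "pending-reboot" apply method is a conservative guess, and whether the parameter lives in an instance or a cluster parameter group may depend on the engine version. The value is expressed in milliseconds, so 120 seconds corresponds to 120000.

```python
def build_timeout_parameter(timeout_seconds, apply_method="pending-reboot"):
    """Build one parameter entry for neptune_query_timeout (hypothetical helper)."""
    if timeout_seconds <= 0:
        raise ValueError("timeout must be positive")
    return {
        "ParameterName": "neptune_query_timeout",
        # The timeout is given in milliseconds and passed as a string.
        "ParameterValue": str(int(timeout_seconds * 1000)),
        # Assumption: a conservative apply method; adjust to your needs.
        "ApplyMethod": apply_method,
    }


def set_query_timeout(client, group_name, timeout_seconds):
    """Apply the timeout to an existing parameter group.

    `client` is expected to behave like boto3.client("neptune");
    `group_name` is a placeholder for your own parameter group.
    """
    return client.modify_db_parameter_group(
        DBParameterGroupName=group_name,
        Parameters=[build_timeout_parameter(timeout_seconds)],
    )
```

With boto3 installed and credentials configured, a call could look like set_query_timeout(boto3.client("neptune"), "my-neptune-params", 600), where the group name is a placeholder.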

Loading data

Below we discuss in detail how we loaded the MusicBrainz data into Neptune.

Relations to triples

First, we converted the MusicBrainz data into RDF triples. To do so, we defined for each table a template that describes how each column is represented in a triple. In this example, each row of the artist table is mapped to twelve RDF triples.

<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gid> "${gid}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> "${name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/sort-name> "${sort_name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/begin-date> "${begin_date_year}-${begin_date_month}-${begin_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/end-date> "${end_date_year}-${end_date_month}-${end_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/type> <http://musicbrainz.foo/artist-type/${type}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> <http://musicbrainz.foo/area/${area}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gender> <http://musicbrainz.foo/gender/${gender}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/comment> "${comment}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/edits-pending> "${edits_pending}"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/last-updated> "${last_updated}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/ended> "${ended}"^^<http://www.w3.org/2001/XMLSchema#boolean> .
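
To make the template idea more concrete, here is a minimal Python sketch of how one row could be turned into triples. It is not the conversion tool we used: it relies on string.Template because the placeholder syntax happens to match, it covers only two of the artist templates, and the function names and the sample row are made up for illustration. Literal values are escaped according to the N-Triples rules for backslashes, quotes and line breaks; empty (NULL) columns are not handled here.

```python
from string import Template

# Two of the artist templates, used here as string.Template patterns.
ARTIST_TEMPLATES = [
    Template('<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> '
             '"${name}"^^<http://www.w3.org/2001/XMLSchema#string> .'),
    Template('<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> '
             '<http://musicbrainz.foo/area/${area}> .'),
]


def escape_literal(value):
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r")
                 .replace("\t", "\\t"))


def row_to_triples(row, templates=ARTIST_TEMPLATES):
    """Fill every template with the (escaped) column values of one row."""
    escaped = {column: escape_literal(value) for column, value in row.items()}
    return [template.substitute(escaped) for template in templates]


# Hypothetical sample row; real rows would come from the CSV dump.
sample_row = {"id": "42", "name": 'AC/DC "live"', "area": "221"}
for triple in row_to_triples(sample_row):
    print(triple)
```

Each returned string is one line of an N-Triples file that can then be uploaded to S3.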

Bulk upload

The recommended way to load large amounts of data into Neptune is the bulk loading process via S3. After uploading your triple files to S3, you start the load with a POST request. In our case, loading 500 million triples took about 24 hours. We had expected it to be faster.

curl -X POST -H 'Content-Type: application/json' http://your-neptune-cluster:8182/loader -d '{
  "source" : "s3://your-s3-bucket",
  "format" : "ntriples",
  "iamRoleArn" : "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
  "region" : "eu-west-1",
  "failOnError" : "FALSE"
}'
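
For readers who prefer to script the load, the same request can be sketched in Python with only the standard library. This mirrors the curl call and nothing more; the function name is hypothetical and the endpoint, bucket and role values are the same placeholders as in the request shown earlier. The sketch only builds the request object, so it can be inspected before anything is sent.

```python
import json
import urllib.request


def build_load_request(endpoint, source, iam_role_arn, region, fmt="ntriples"):
    """Build (but do not send) a POST request for the Neptune bulk loader."""
    payload = {
        "source": source,
        "format": fmt,
        "iamRoleArn": iam_role_arn,
        "region": region,
        # Keep loading when individual records fail, as in the curl example.
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_load_request(
    "http://your-neptune-cluster:8182",
    "s3://your-s3-bucket",
    "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
    "eu-west-1",
)
print(request.get_method(), request.full_url)
```

Sending it is then a matter of calling urllib.request.urlopen(request) from a machine that can reach the cluster.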

To avoid this lengthy process every time we start Neptune, we decided to restore the instance from a snapshot in which these triples were already loaded. Starting from a snapshot is considerably faster, but it still takes about an hour until Neptune is available for requests.

When we initially loaded the triples into Neptune, we ran into various errors.

{
  "errorCode" : "PARSING_ERROR",
  "errorMessage" : "Content after '.' is not allowed",
  "fileName" : [...],
  "recordNum" : 25
}

Some of them were parsing errors, as shown above. To date, we have not figured out what exactly went wrong at this point. A little more detail in the error message would definitely help here. This error occurred for about 1% of the inserted triples. However, for the purpose of testing Neptune, we accepted the fact that we only work with 99% of the information from MusicBrainz.

Although this is straightforward for people familiar with SPARQL, note that RDF triples must be annotated with explicit datatypes, which can again cause errors.
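
One way to catch missing datatype annotations before uploading is a small pre-check over the N-Triples lines. The sketch below is a heuristic written for this article, not a full N-Triples parser and not something Neptune provides: it flags every line whose object is a literal that is not followed by an explicit ^^<datatype> IRI. Language-tagged literals would also be flagged, and it makes no claim about the cause of the parsing error shown above.

```python
import re

# A literal object followed by an explicit datatype IRI and the final dot.
TYPED_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"\^\^<[^<>\s]+>\s*\.\s*$')
# Any line whose third term starts with a double quote, i.e. a literal object.
ANY_LITERAL = re.compile(r'^\S+\s+\S+\s+"')


def lines_missing_datatype(lines):
    """Return the 1-based numbers of lines with a literal but no datatype IRI."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        if ANY_LITERAL.match(line) and not TYPED_LITERAL.search(line):
            flagged.append(number)
    return flagged


sample = [
    '<http://musicbrainz.foo/artist/1> <http://musicbrainz.foo/name> "A"^^<http://www.w3.org/2001/XMLSchema#string> .',
    '<http://musicbrainz.foo/artist/1> <http://musicbrainz.foo/name> "A" .',
    '<http://musicbrainz.foo/artist/1> <http://musicbrainz.foo/area> <http://musicbrainz.foo/area/2> .',
    # A prefix glued in front of the datatype IRI is not valid N-Triples.
    '<http://musicbrainz.foo/artist/1> <http://musicbrainz.foo/begin-date> "1947-03-25"^^xsd:<http://www.w3.org/2001/XMLSchema#date> .',
]
print(lines_missing_datatype(sample))
```

Running such a check on a sample of each file is cheap compared with waiting for a multi-hour load to report errors.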

Streaming upload

As mentioned above, we do not want to use Neptune as a static data store, but rather as a flexible and evolving knowledge base. So we needed to find ways to introduce new triples when the knowledge base changes, for example when a new album is published or when we want to materialize derived knowledge.

Neptune supports insert operators via SPARQL queries, both with raw data and based on patterns. We discuss both approaches below.

One of our goals was to insert data in a streaming fashion. Consider the release of an album in a new country. From the MusicBrainz perspective, this means that for a release, which covers albums, singles, EPs and so on, a new entry is added to the release-country table. In RDF, we map this information to two new triples.

INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759> };
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int> };
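
Such an update could be sent from application code roughly as follows. This is a sketch using only the Python standard library; the function name is hypothetical and the endpoint is a placeholder. It assumes the SPARQL endpoint accepts updates as a form-encoded `update` parameter, the counterpart of the `query` parameter used for reads, and it only builds the request without sending it.

```python
import urllib.parse
import urllib.request


def build_insert_request(endpoint, triples):
    """Wrap N-Triples style statements in one INSERT DATA update request."""
    update = "INSERT DATA {\n" + "\n".join("  " + t for t in triples) + "\n}"
    data = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/sparql",
        data=data,
        method="POST",
    )


new_triples = [
    "<http://musicbrainz.foo/release-country/737041> "
    "<http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759> .",
    "<http://musicbrainz.foo/release-country/737041> "
    '<http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int> .',
]
request = build_insert_request("http://your-neptune-cluster:8182", new_triples)
print(request.full_url)
```

Batching both triples into a single INSERT DATA block saves one round trip compared with two separate updates.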

Another goal was to derive new knowledge from the graph. Suppose we want to obtain the number of releases each artist has published over their career. Such a query is fairly complex and takes more than 20 minutes in Neptune, so we need to materialize the result in order to reuse this new knowledge in other queries. We therefore add triples with this information back into the graph by inserting the result of a subquery.

INSERT {
  ?artist_credit <http://musicbrainz.foo/number-of-releases> ?number_of_releases
} WHERE {
  SELECT ?artist_credit (COUNT(*) as ?number_of_releases)
  WHERE {
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
  }
  GROUP BY ?artist_credit
}

Adding single triples to the graph takes a few milliseconds, whereas the execution time for inserting the result of a subquery depends on the execution time of the subquery itself.

Although we did not use it often, Neptune also allows you to delete triples based on patterns or explicit data, which can be used to update information.

SPARQL queries

With the previous pattern, which returns the number of releases for each artist, we have already introduced the first type of query we want to answer using Neptune. Running a query in Neptune is easy: send a POST request to the SPARQL endpoint, as shown below:

curl -X POST --data-binary 'query=SELECT ?artist ?p ?o where {?artist <http://musicbrainz.foo/name> "Elton John" . ?artist ?p ?o . }' http://your-neptune-cluster:8182/sparql
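
When the endpoint is called from code rather than curl, the response usually has to be unpacked. The following Python sketch assumes the response body is in the standard SPARQL 1.1 JSON results format (a "head" with variable names and "results" with "bindings"); the function name and the sample payload are made up for illustration and are not actual Neptune output.

```python
import json


def bindings_to_rows(response_text):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    document = json.loads(response_text)
    variables = document["head"]["vars"]
    rows = []
    for binding in document["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append({
            name: (binding[name]["value"] if name in binding else None)
            for name in variables
        })
    return rows


# Hypothetical payload in SPARQL JSON results format.
sample_response = json.dumps({
    "head": {"vars": ["artist", "p", "o"]},
    "results": {"bindings": [
        {
            "artist": {"type": "uri", "value": "http://musicbrainz.foo/artist/1"},
            "p": {"type": "uri", "value": "http://musicbrainz.foo/name"},
            "o": {"type": "literal", "value": "Elton John"},
        },
        {
            "artist": {"type": "uri", "value": "http://musicbrainz.foo/artist/1"},
            "p": {"type": "uri", "value": "http://musicbrainz.foo/ended"},
        },
    ]},
})
for row in bindings_to_rows(sample_response):
    print(row)
```

The flattening drops datatype and language information, which is acceptable for quick inspection but not for round-tripping values back into the graph.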

In addition, we implemented a query that returns an artist profile with information about the artist's name, age, or country of origin. Keep in mind that artists can be persons, bands, or orchestras. We also enrich this data with the number of releases the artist published in each year. For solo artists, we additionally include the bands the artist was a member of in each year.

SELECT
  ?artist_name ?year
  ?releases_in_year ?releases_up_year
  ?artist_type_name ?releases
  ?artist_gender ?artist_country_name
  ?artist_begin_date ?bands
  ?bands_in_year
WHERE {
  # Bands for each artist
  {
    SELECT
      ?year
      ?first_artist
      (group_concat(DISTINCT ?second_artist_name;separator=",") as ?bands)
      (COUNT(DISTINCT ?second_artist_name) AS ?bands_in_year)
    WHERE {
      VALUES ?year {
        1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
        1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
        1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
        1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
        2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
        2010 2011 2012 2013 2014 2015 2016 2017 2018
      }
      ?first_artist <http://musicbrainz.foo/name> "Elton John" .
      ?first_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
      ?first_artist <http://musicbrainz.foo/type> ?first_artist_type .
      ?first_artist <http://musicbrainz.foo/name> ?first_artist_name .

      ?second_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
      ?second_artist <http://musicbrainz.foo/type> ?second_artist_type .
      ?second_artist <http://musicbrainz.foo/name> ?second_artist_name .
      optional { ?second_artist <http://musicbrainz.foo/begin-date-year> ?second_artist_begin_date_year . }
      optional { ?second_artist <http://musicbrainz.foo/end-date-year> ?second_artist_end_date_year . }

      ?l_artist_artist <http://musicbrainz.foo/entity0> ?first_artist .
      ?l_artist_artist <http://musicbrainz.foo/entity1> ?second_artist .
      ?l_artist_artist <http://musicbrainz.foo/link> ?link .

      optional { ?link <http://musicbrainz.foo/begin-date-year> ?link_begin_date_year . }
      optional { ?link <http://musicbrainz.foo/end-date-year> ?link_end_date_year . }

      FILTER (!bound(?link_begin_date_year) || ?link_begin_date_year <= ?year)
      FILTER (!bound(?link_end_date_year) || ?link_end_date_year >= ?year)
      FILTER (!bound(?second_artist_begin_date_year) || ?second_artist_begin_date_year <= ?year)
      FILTER (!bound(?second_artist_end_date_year) || ?second_artist_end_date_year >= ?year)
      FILTER (?first_artist_type NOT IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
      FILTER (?second_artist_type IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
    }
    GROUP BY ?first_artist ?year
  }
  # Releases up to a year
  {
    SELECT
      ?artist
      ?year
      (group_concat(DISTINCT ?release_name;separator=",") as ?releases)
      (COUNT(*) as ?releases_up_year)
    WHERE {
      VALUES ?year {
        1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
        1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
        1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
        1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
        2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
        2010 2011 2012 2013 2014 2015 2016 2017 2018
      }

      ?artist <http://musicbrainz.foo/name> "Elton John" .

      ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
      ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
      ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
      ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

      ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
      ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
      ?release_group <http://musicbrainz.foo/name> ?release_group_name .
      ?release <http://musicbrainz.foo/release-group> ?release_group .
      ?release <http://musicbrainz.foo/name> ?release_name .
      ?release_country <http://musicbrainz.foo/release> ?release .
      ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

      FILTER (?release_country_year <= ?year)
    }
    GROUP BY ?artist ?year
  }
  # Releases in a year
  {
    SELECT ?artist ?year (COUNT(*) as ?releases_in_year)
    WHERE {
      VALUES ?year {
        1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
        1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
        1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
        1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
        2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
        2010 2011 2012 2013 2014 2015 2016 2017 2018
      }

      ?artist <http://musicbrainz.foo/name> "Elton John" .

      ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
      ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
      ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
      ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

      ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
      ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
      ?release_group <http://musicbrainz.foo/name> ?release_group_name .
      ?release <http://musicbrainz.foo/release-group> ?release_group .
      ?release_country <http://musicbrainz.foo/release> ?release .
      ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

      FILTER (?release_country_year = ?year)
    }
    GROUP BY ?artist ?year
  }
  # Master data
  {
    SELECT DISTINCT ?artist ?artist_name ?artist_gender ?artist_begin_date ?artist_country_name
    WHERE {
      ?artist <http://musicbrainz.foo/name> ?artist_name .
      ?artist <http://musicbrainz.foo/name> "Elton John" .
      ?artist <http://musicbrainz.foo/gender> ?artist_gender_id .
      ?artist_gender_id <http://musicbrainz.foo/name> ?artist_gender .
      ?artist <http://musicbrainz.foo/area> ?birth_area .
      ?artist <http://musicbrainz.foo/begin-date-year> ?artist_begin_date .
      ?birth_area <http://musicbrainz.foo/name> ?artist_country_name .

      FILTER(datatype(?artist_begin_date) = <http://www.w3.org/2001/XMLSchema#int>)
    }
  }
}

Due to the complexity of such a query, we could only run point queries for a specific artist, such as Elton John, but not for all artists. Neptune does not seem to optimize such a query by pushing filters down into the subselects. Therefore, each subselect has to be filtered manually by the artist name.

Neptune charges both per hour and per I/O operation. For our testing, we used the smallest Neptune instance, which costs $0.384 per hour. For the query above, which computes the profile of a single artist, Amazon charges for tens of thousands of I/O operations, which amounts to a cost of about $0.02.
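
The two cost components can be combined in a small back-of-the-envelope calculation. In the Python sketch below only the hourly rate of $0.384 comes from our test; the price per million I/O operations and the I/O count are illustrative placeholders, not quoted prices or measurements, so check the current price list before relying on the numbers.

```python
from decimal import Decimal


def estimate_cost(hours, io_requests, hourly_rate, price_per_million_io):
    """Return (instance cost, I/O cost) in dollars as exact decimals."""
    instance_cost = Decimal(str(hours)) * Decimal(str(hourly_rate))
    io_cost = (Decimal(io_requests) / Decimal(1_000_000)
               * Decimal(str(price_per_million_io)))
    return instance_cost, io_cost


# Assumptions: 24 instance hours, 100,000 I/O operations,
# and a placeholder price of $0.20 per million I/O operations.
instance_cost, io_cost = estimate_cost(24, 100_000, "0.384", "0.20")
print(f"instance: ${instance_cost}, I/O: ${io_cost}")
```

The instance hours are easy to predict, while the I/O term depends on how much data each query touches, which is hard to know in advance.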

Conclusion

First of all, Amazon Neptune keeps most of its promises. As a managed service, it is a graph database that is very easy to set up and can be up and running without much configuration. Here are our five key findings:

  • Bulk upload is easy but slow, and it can get complicated because the error messages are not very helpful.
  • Streaming upload supports everything we expected and was quite fast.
  • Queries are easy to write, but not interactive enough to run analytical queries.
  • SPARQL queries have to be optimized manually.
  • Amazon's charges are hard to estimate, because it is difficult to predict the amount of data scanned by a SPARQL query.

That's all. Sign up for the free webinar on "Load Balancing".


Source: www.habr.com
