First impressions of Amazon Neptune

Hello, Habr readers. Ahead of the start of the "AWS for Developers" course, we have prepared a translation of an interesting article.

In many of the use cases that we at bakdata see at our customers' sites, the relevant information is hidden in the relationships between entities, for example when analyzing relationships between users, dependencies between elements, or connections between sensors. Such use cases are usually modeled as a graph. Earlier this year, Amazon released its new graph database, Neptune. In this article we want to share our first impressions, good practices, and what could be improved over time.

Why we needed Amazon Neptune

Graph databases promise to handle highly connected datasets better than their relational equivalents. In such datasets, the relevant information is usually stored in the relationships between objects. To test Neptune we used an amazing open data project, MusicBrainz. MusicBrainz collects every kind of music metadata imaginable, such as information about artists, songs, album releases or concerts, as well as which artist collaborated on a song, or when an album was released in which country. MusicBrainz can be seen as a huge network of entities that are in some way connected to the music industry.

The MusicBrainz dataset is provided as a CSV dump of a relational database. In total, the dump contains about 93 million rows in 157 tables. Some of these tables contain master data such as artists, events, recordings, releases or tracks. Others are link tables, which store the relationships between artists and recordings, other artists or releases, and so on. They reveal the graph structure of the dataset. When we converted the dataset to RDF triples, we obtained approximately 500 million instances.

Based on the experience and impressions of the project partners we work with, we consider a setting in which this knowledge base is used to derive new information. In addition, we expect it to be updated regularly, for example by adding new releases or updating band members.

Setup

As expected, setting up Amazon Neptune is simple, and the process is documented in detail. You can launch a graph database in just a few clicks. However, when it comes to more detailed configuration, the necessary information is hard to find. We therefore want to point out one configuration parameter.

Figure: configuration screen for parameter groups

Amazon states that Neptune focuses on low-latency transactional workloads, which is why the default query timeout is 120 seconds. However, we tested many analytical use cases in which we regularly hit this limit. The timeout can be changed by creating a new parameter group for Neptune and setting neptune_query_timeout (specified in milliseconds) to a suitable limit.

Loading data

Below we discuss in detail how we loaded the MusicBrainz data into Neptune.

Relations to triples

First, we converted the MusicBrainz data into RDF triples. For each table, we defined a template that specifies how each column is represented as a triple. In this example, each row of the artist table is mapped to twelve RDF triples.

<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gid> "${gid}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> "${name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/sort-name> "${sort_name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/begin-date> "${begin_date_year}-${begin_date_month}-${begin_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/end-date> "${end_date_year}-${end_date_month}-${end_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/type> <http://musicbrainz.foo/artist-type/${type}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> <http://musicbrainz.foo/area/${area}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gender> <http://musicbrainz.foo/gender/${gender}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/comment> "${comment}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/edits-pending> "${edits_pending}"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/last-updated> "${last_updated}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/ended> "${ended}"^^<http://www.w3.org/2001/XMLSchema#boolean> .
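
To make the template idea concrete, here is a minimal Python sketch of how one row could be rendered into N-Triples lines. It is not the tooling the authors used: the helper name row_to_triples, the two-entry template list, and the sample row are illustrative assumptions, and the sketch does no escaping of literal values.

```python
from string import Template

XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "http://musicbrainz.foo"

# Hypothetical subset of the artist template: one Template per output triple.
ARTIST_TEMPLATES = [
    Template("<" + BASE + "/artist/${id}> <" + BASE + '/name> "${name}"^^<' + XSD + "string> ."),
    Template("<" + BASE + "/artist/${id}> <" + BASE + "/type> <" + BASE + "/artist-type/${type}> ."),
]


def row_to_triples(row, templates=ARTIST_TEMPLATES):
    """Render one table row (dict of column name -> value) into N-Triples lines."""
    lines = []
    for template in templates:
        try:
            lines.append(template.substitute(row))
        except KeyError:
            # A column required by this template is missing, so skip the triple.
            continue
    return lines


# Made-up sample row, for illustration only.
for line in row_to_triples({"id": "42", "name": "Example Artist", "type": "1"}):
    print(line)
```

Skipping a triple when its column is missing is one possible design choice; a row with an empty area or gender then simply produces fewer triples instead of a dangling reference.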

Bulk load

The recommended way to load large amounts of data into Neptune is the bulk load process via S3. After uploading your triple files to S3, you start the load with a POST request. In our case, loading 500 million triples took about 24 hours. We had expected it to be faster.

curl -X POST -H 'Content-Type: application/json' http://your-neptune-cluster:8182/loader -d '{
  "source" : "s3://your-s3-bucket",
  "format" : "ntriples",
  "iamRoleArn" : "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
  "region" : "eu-west-1",
  "failOnError" : "FALSE"
}'
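
The same loader call can be sketched in Python with only the standard library. The function name build_loader_request is an assumption made for illustration, the endpoint, bucket and role values are the placeholders from the curl command, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_loader_request(endpoint, source, iam_role_arn, region, fmt="ntriples"):
    """Build (but do not send) the POST request that starts a bulk load."""
    payload = {
        "source": source,
        "format": fmt,
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=endpoint + "/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_loader_request(
    "http://your-neptune-cluster:8182",
    "s3://your-s3-bucket",
    "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
    "eu-west-1",
)
print(request.get_method(), request.full_url)

# Sending it requires network access to the cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```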

To avoid this lengthy process every time we start Neptune, we decided to restore the instance from a snapshot in which these triples were already loaded. Starting from a snapshot is much faster, but it still takes about an hour before Neptune is available for queries.

When we first loaded the triples into Neptune, we ran into various errors.

{
  "errorCode" : "PARSING_ERROR",
  "errorMessage" : "Content after '.' is not allowed",
  "fileName" : [...],
  "recordNum" : 25
}

Some of them were parsing errors, as shown above. To this day we have not figured out what exactly went wrong here. A little more detail in the message would definitely help. This error occurred for about 1% of the inserted triples. For the purpose of testing Neptune, however, we accepted that we are working with only 99% of the MusicBrainz information.

Although this is obvious to people familiar with SPARQL, note that RDF triples must be annotated with explicit data types, which again can cause errors.
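
As an illustration of explicit typing, the following sketch serializes a value as a typed N-Triples literal and escapes the characters that the N-Triples grammar does not allow unescaped inside a string. The helper typed_literal is hypothetical, and this is not offered as a diagnosis of the parsing error above, only as one precaution that can be taken before uploading files.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Characters that must be escaped inside an N-Triples string literal.
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r"}


def typed_literal(value, xsd_type):
    """Serialize a value as an N-Triples literal with an explicit XSD datatype."""
    escaped = "".join(_ESCAPES.get(ch, ch) for ch in str(value))
    return '"' + escaped + '"^^<' + XSD + xsd_type + ">"


print(typed_literal(2018, "int"))
print(typed_literal('A "quoted" comment', "string"))
```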

Streaming load

As mentioned above, we do not want to use Neptune as a static data store, but rather as a flexible and evolving knowledge base. We therefore needed ways to insert new triples when the knowledge base changes, for example when a new album is published or when we want to materialize derived knowledge.

Neptune supports insert operations through SPARQL queries, both with raw data and based on patterns. We discuss both approaches below.

One of our goals was to insert data in a streaming fashion. Consider the release of an album in a new country. From the MusicBrainz point of view, this means that for a release (which covers albums, singles, EPs, and so on) a new entry is added to the release-country table. In RDF, we map this information to two new triples.

INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759> };
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int> };
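
A streaming writer could wrap these statements in an HTTP request along the following lines. This is a sketch under assumptions: the helper names insert_data and build_update_request are invented for illustration, the endpoint is the placeholder used elsewhere in this article, and the statements are passed as a form-encoded update parameter, as the SPARQL 1.1 Protocol describes for updates.

```python
import urllib.parse
import urllib.request

BASE = "http://musicbrainz.foo"
XSD_INT = "<http://www.w3.org/2001/XMLSchema#int>"


def insert_data(subject, predicate, obj):
    """Return an INSERT DATA statement for one triple (terms already serialized)."""
    return "INSERT DATA { " + subject + " " + predicate + " " + obj + " }"


def build_update_request(endpoint, statements):
    """Build (but do not send) a SPARQL update request with ';'-separated statements."""
    update = " ;\n".join(statements)
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(
        endpoint + "/sparql",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


release_country = "<" + BASE + "/release-country/737041>"
request = build_update_request(
    "http://your-neptune-cluster:8182",
    [
        insert_data(release_country, "<" + BASE + "/release>", "<" + BASE + "/release/435759>"),
        insert_data(release_country, "<" + BASE + "/date-year>", '"2018"^^' + XSD_INT),
    ],
)
print(urllib.parse.unquote_plus(request.data.decode("utf-8")))
```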

Another goal was to derive new knowledge from the graph. Say we want to obtain the number of releases that each artist has published over their career. Such a query is quite complex and takes more than 20 minutes in Neptune, so we need to materialize the result in order to reuse this new knowledge in other queries. We therefore add triples with this information back to the graph by inserting the result of a subquery.

INSERT {
  ?artist_credit <http://musicbrainz.foo/number-of-releases> ?number_of_releases
} WHERE {
  SELECT ?artist_credit (COUNT(*) as ?number_of_releases)
  WHERE {
    ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
    ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
    ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
    ?release_group <http://musicbrainz.foo/name> ?release_group_name .
  }
  GROUP BY ?artist_credit
}

Adding single triples to the graph takes a few milliseconds, whereas the execution time for inserting the result of a subquery depends on the execution time of the subquery itself.

Although we did not use it often, Neptune also lets you delete triples based on patterns or explicit data, which can be used to update information.
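
An update can be expressed as a delete combined with an insert. The sketch below builds a standard SPARQL 1.1 DELETE/INSERT statement as a string. The helper replace_value, the artist-credit identifier 1 and the value 12 are made up for illustration, and the article does not show the authors' own update statements.

```python
def replace_value(subject, predicate, new_object):
    """Build a SPARQL 1.1 update that replaces all values of one property.

    The OPTIONAL keeps the WHERE clause satisfiable when no old value exists,
    so the INSERT still runs in that case.
    """
    pattern = subject + " " + predicate
    return (
        "DELETE { " + pattern + " ?old }\n"
        "INSERT { " + pattern + " " + new_object + " }\n"
        "WHERE { OPTIONAL { " + pattern + " ?old } }"
    )


# Made-up example: overwrite a materialized release count.
print(
    replace_value(
        "<http://musicbrainz.foo/artist-credit/1>",
        "<http://musicbrainz.foo/number-of-releases>",
        '"12"^^<http://www.w3.org/2001/XMLSchema#int>',
    )
)
```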

SPARQL queries

With the previous example, which returns the number of releases for each artist, we have already introduced the first type of query we want to answer using Neptune. Running a query in Neptune is easy: send a POST request to the SPARQL endpoint, as shown below:

curl -X POST --data-binary 'query=SELECT ?artist ?p ?o where {?artist <http://musicbrainz.foo/name> "Elton John" . ?artist ?p ?o . }' http://your-neptune-cluster:8182/sparql
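
On the client side, the response then has to be turned into something usable. The sketch below assumes the endpoint answers in the standard W3C SPARQL JSON results format; the helper bindings_to_rows and the sample response (including the artist identifier 1) are made up for illustration.

```python
import json


def bindings_to_rows(response_text):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    doc = json.loads(response_text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append({var: binding[var]["value"] for var in variables if var in binding})
    return rows


# Made-up response in the SPARQL JSON results format.
sample = json.dumps({
    "head": {"vars": ["artist", "p", "o"]},
    "results": {"bindings": [{
        "artist": {"type": "uri", "value": "http://musicbrainz.foo/artist/1"},
        "p": {"type": "uri", "value": "http://musicbrainz.foo/name"},
        "o": {"type": "literal", "value": "Elton John"},
    }]},
})

for row in bindings_to_rows(sample):
    print(row["artist"], row["p"], row["o"])
```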

In addition, we implemented a query that returns an artist profile containing information such as name, age, or country of origin. Keep in mind that artists can be individuals, bands, or orchestras. We also enrich this data with the number of releases the artist published in each year. For solo artists, we additionally add information about the bands in which the artist took part in each year.

SELECT
  ?artist_name ?year
  ?releases_in_year ?releases_up_year
  ?artist_type_name ?releases
  ?artist_gender ?artist_country_name
  ?artist_begin_date ?bands
  ?bands_in_year
WHERE {
  # Bands for each artist
  {
    SELECT
      ?year
      ?first_artist
      (group_concat(DISTINCT ?second_artist_name;separator=",") as ?bands)
      (COUNT(DISTINCT ?second_artist_name) AS ?bands_in_year)
    WHERE {
      VALUES ?year {
        1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
        1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
        1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
        1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
        2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
        2010 2011 2012 2013 2014 2015 2016 2017 2018
      }
      ?first_artist <http://musicbrainz.foo/name> "Elton John" .
      ?first_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
      ?first_artist <http://musicbrainz.foo/type> ?first_artist_type .
      ?first_artist <http://musicbrainz.foo/name> ?first_artist_name .

      ?second_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
      ?second_artist <http://musicbrainz.foo/type> ?second_artist_type .
      ?second_artist <http://musicbrainz.foo/name> ?second_artist_name .
      optional { ?second_artist <http://musicbrainz.foo/begin-date-year> ?second_artist_begin_date_year . }
      optional { ?second_artist <http://musicbrainz.foo/end-date-year> ?second_artist_end_date_year . }

      ?l_artist_artist <http://musicbrainz.foo/entity0> ?first_artist .
      ?l_artist_artist <http://musicbrainz.foo/entity1> ?second_artist .
      ?l_artist_artist <http://musicbrainz.foo/link> ?link .

      optional { ?link <http://musicbrainz.foo/begin-date-year> ?link_begin_date_year . }
      optional { ?link <http://musicbrainz.foo/end-date-year> ?link_end_date_year . }

      FILTER (!bound(?link_begin_date_year) || ?link_begin_date_year <= ?year)
      FILTER (!bound(?link_end_date_year) || ?link_end_date_year >= ?year)
      FILTER (!bound(?second_artist_begin_date_year) || ?second_artist_begin_date_year <= ?year)
      FILTER (!bound(?second_artist_end_date_year) || ?second_artist_end_date_year >= ?year)
      FILTER (?first_artist_type NOT IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
      FILTER (?second_artist_type IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
    }
    GROUP BY ?first_artist ?year
  }
  # Releases up to a year
  {
    SELECT
      ?artist
      ?year
      (group_concat(DISTINCT ?release_name;separator=",") as ?releases)
      (COUNT(*) as ?releases_up_year)
    WHERE {
      VALUES ?year {
        1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
        1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
        1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
        1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
        2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
        2010 2011 2012 2013 2014 2015 2016 2017 2018
      }

      ?artist <http://musicbrainz.foo/name> "Elton John" .

      ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
      ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
      ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
      ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

      ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
      ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
      ?release_group <http://musicbrainz.foo/name> ?release_group_name .
      ?release <http://musicbrainz.foo/release-group> ?release_group .
      ?release <http://musicbrainz.foo/name> ?release_name .
      ?release_country <http://musicbrainz.foo/release> ?release .
      ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

      FILTER (?release_country_year <= ?year)
    }
    GROUP BY ?artist ?year
  }
  # Releases in a year
  {
    SELECT ?artist ?year (COUNT(*) as ?releases_in_year)
    WHERE {
      VALUES ?year {
        1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
        1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
        1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
        1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
        2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
        2010 2011 2012 2013 2014 2015 2016 2017 2018
      }

      ?artist <http://musicbrainz.foo/name> "Elton John" .

      ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
      ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
      ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
      ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

      ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
      ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
      ?release_group <http://musicbrainz.foo/name> ?release_group_name .
      ?release <http://musicbrainz.foo/release-group> ?release_group .
      ?release_country <http://musicbrainz.foo/release> ?release .
      ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

      FILTER (?release_country_year = ?year)
    }
    GROUP BY ?artist ?year
  }
  # Master data
  {
    SELECT DISTINCT ?artist ?artist_name ?artist_gender ?artist_begin_date ?artist_country_name
    WHERE {
      ?artist <http://musicbrainz.foo/name> ?artist_name .
      ?artist <http://musicbrainz.foo/name> "Elton John" .
      ?artist <http://musicbrainz.foo/gender> ?artist_gender_id .
      ?artist_gender_id <http://musicbrainz.foo/name> ?artist_gender .
      ?artist <http://musicbrainz.foo/area> ?birth_area .
      ?artist <http://musicbrainz.foo/begin-date-year> ?artist_begin_date .
      ?birth_area <http://musicbrainz.foo/name> ?artist_country_name .

      FILTER(datatype(?artist_begin_date) = <http://www.w3.org/2001/XMLSchema#int>)
    }
  }
}

Because of the complexity of such a query, we could only run point queries for a specific artist, such as Elton John, but not for all artists. Neptune does not seem to optimize such a query by pushing filters down into the subselects. Each subselect therefore has to be filtered manually by the artist's name.

Neptune is billed both per hour and per I/O operation. For our tests we used the smallest Neptune instance, which costs $0.384 per hour. For the query above, which computes the profile of a single artist, Amazon charges us for several tens of thousands of I/O operations, which corresponds to a cost of $0.02.

Conclusion

First of all, Amazon Neptune keeps most of its promises. As a managed service, it is a graph database that is very easy to set up and can run without much configuration. Here are our five key findings:

  • Bulk loading is easy but slow, and it can be complicated by unhelpful error messages.
  • Streaming load supports everything we expected and is quite fast.
  • Queries are easy to run, but not interactive enough for analytical workloads.
  • SPARQL queries have to be optimized manually.
  • Amazon's charges are hard to estimate, because it is hard to estimate how much data a SPARQL query scans.

That's all. Sign up for the free webinar on the topic "Load Balancing".


Source: www.habr.com
