First impressions of Amazon Neptune

Hello, Habr readers. Ahead of the start of the course "AWS for Developers", we have prepared a translation of an interesting article.

In many of the use cases that we at bakdata see at our customers' sites, the relevant information is hidden in the relationships between entities, for example when analyzing relationships between users, dependencies between items, or connections between sensors. Such use cases are usually modeled as a graph. Earlier this year, Amazon released its new graph database, Neptune. In this post we want to share our first ideas, good practices, and the things that could improve over time.

Why we needed Amazon Neptune

Graph databases promise to handle highly connected datasets better than their relational equivalents. In such datasets, the relevant information is usually stored in the relationships between entities. To try out Neptune, we used the excellent open data project MusicBrainz. MusicBrainz collects all kinds of music metadata, including artists, songs, album releases, and concerts, as well as information about who the artist behind a song collaborated with, or when an album was released in which country. MusicBrainz can be seen as a huge network of entities that are somehow connected to the music industry.

The MusicBrainz dataset is provided as a CSV dump of a relational database. In total, the dump contains about 157 million rows in 93 tables. Some of these tables hold master data such as artists, events, recordings, releases, or tracks. Others are link tables that store the relationships between artists and recordings, other artists, releases, and so on. They expose the graph structure of the dataset. Converting the dataset to RDF triples gave us about 500 million triples.

Based on the experience and impressions of the project partners we work with, we assume an environment in which this knowledge base is used to derive new information. In addition, we expect it to be updated regularly, for example by adding new releases or updating band members.

Setup

As expected, setting up Amazon Neptune is simple, and the process is documented in detail. You can launch a graph database with a few clicks. However, when it comes to more detailed configuration, the necessary information is hard to find. We therefore want to point out one configuration parameter.

Screenshot of the configuration for parameter groups

Amazon states that Neptune focuses on low-latency transactional workloads, which is why the default query timeout is 120 seconds. However, we tested many analytical use cases in which we regularly hit this limit. The timeout can be changed by creating a new parameter group for Neptune and setting neptune_query_timeout to the desired limit.
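As a rough illustration of that setting, the sketch below builds a single parameter entry for neptune_query_timeout. It assumes that the value is expressed in milliseconds (so the 120-second default corresponds to 120000) and that the entry follows the ParameterName / ParameterValue / ApplyMethod shape used by RDS-style parameter-group APIs. The helper name and the "pending-reboot" apply method are assumptions for illustration, not something the article specifies.

```python
def query_timeout_parameter(seconds: float) -> dict:
    """Build one parameter-group entry for neptune_query_timeout.

    Assumption: the parameter value is given in milliseconds.
    """
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return {
        "ParameterName": "neptune_query_timeout",
        "ParameterValue": str(int(seconds * 1000)),  # seconds -> milliseconds
        "ApplyMethod": "pending-reboot",  # assumed; check what your cluster expects
    }


# Example: raise the limit from the 120-second default to one hour.
one_hour = query_timeout_parameter(3600)
```

A dictionary like this could then be placed in the parameter list of whatever tool you use to modify the parameter group (console, CLI, or SDK).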

Loading data

Below we discuss in detail how we loaded the MusicBrainz data into Neptune.

From relations to triples

First, we converted the MusicBrainz data to RDF triples. For each table, we defined a template that determines how each column is represented as a triple. In this example, each row of the artist table is mapped to twelve RDF triples.

<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gid> "${gid}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> "${name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/sort-name> "${sort_name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/begin-date> "${begin_date_year}-${begin_date_month}-${begin_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/end-date> "${end_date_year}-${end_date_month}-${end_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/type> <http://musicbrainz.foo/artist-type/${type}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> <http://musicbrainz.foo/area/${area}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gender> <http://musicbrainz.foo/gender/${gender}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/comment> "${comment}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/edits-pending> "${edits_pending}"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/last-updated> "${last_updated}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/ended> "${ended}"^^<http://www.w3.org/2001/XMLSchema#boolean> .
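The template idea can be sketched in a few lines of Python. The `${...}` placeholders happen to match the syntax of the standard library's `string.Template`, so this minimal sketch uses it directly. The function names and the rule of skipping a triple whenever one of its columns is empty are assumptions made for illustration; the article does not describe how the authors handled missing values, and only two of the twelve templates are shown.

```python
import re
from string import Template

# Two of the twelve artist templates, copied from the listing above.
ARTIST_TEMPLATES = [
    Template('<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> '
             '"${name}"^^<http://www.w3.org/2001/XMLSchema#string> .'),
    Template('<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> '
             '<http://musicbrainz.foo/area/${area}> .'),
]


def placeholder_names(template: Template) -> list:
    """Return the column names referenced as ${name} in a template."""
    return re.findall(r"\$\{(\w+)\}", template.template)


def render_row(row: dict, templates=ARTIST_TEMPLATES) -> list:
    """Render one table row into triples.

    Assumption: a triple is skipped if any column it needs is missing or empty.
    """
    triples = []
    for template in templates:
        if any(row.get(name) in (None, "") for name in placeholder_names(template)):
            continue
        triples.append(template.substitute(row))
    return triples


# A row with no area yields only the name triple.
example = render_row({"id": "42", "name": "Elton John", "area": ""})
```

Note that this sketch inserts values verbatim; literal values containing quotes or line breaks would need escaping before they are valid N-Triples.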

Bulk loading

The recommended way to load large amounts of data into Neptune is the bulk load process via S3. After uploading the triple files to S3, you start the load with a POST request. In our case, it took about 24 hours for 500 million triples. We had expected it to be faster.

curl -X POST -H 'Content-Type: application/json' http://your-neptune-cluster:8182/loader -d '{
 "source" : "s3://your-s3-bucket",
 "format" : "ntriples",
 "iamRoleArn" : "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
 "region" : "eu-west-1",
 "failOnError" : "FALSE"
}'
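The same request can be prepared from a script. The sketch below only builds the request object with the Python standard library and does not send it; the function name is hypothetical, and the endpoint, bucket, and role values are the same placeholders used in the curl call above.

```python
import json
import urllib.request


def build_load_request(endpoint, source, iam_role_arn, region, fmt="ntriples"):
    """Prepare (but do not send) a bulk-load request for the /loader endpoint."""
    payload = {
        "source": source,
        "format": fmt,
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=f"{endpoint}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_load_request(
    "http://your-neptune-cluster:8182",
    "s3://your-s3-bucket",
    "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
    "eu-west-1",
)
# urllib.request.urlopen(request) would submit the job from a host
# that can reach the cluster.
```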

To avoid this lengthy process every time we launch Neptune, we decided to restore the instance from a snapshot in which these triples were already loaded. Starting from a snapshot is considerably faster, but it still takes about an hour until Neptune is available for queries.

When we first loaded the triples into Neptune, we ran into various errors.

{
 "errorCode" : "PARSING_ERROR",
 "errorMessage" : "Content after '.' is not allowed",
 "fileName" : [...],
 "recordNum" : 25
}

Some of them were parsing errors, as shown above. To this day, we have not figured out what went wrong at this point. More detail in the error message would certainly help here. This error occurred for about 1% of the inserted triples. For the purpose of trying out Neptune, however, we accepted that we are working with only 99% of the MusicBrainz data.

Although this is obvious to people familiar with SPARQL, keep in mind that RDF triples must be annotated with explicit data types, which again can lead to errors.
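The authors never identified the cause of their parsing errors, so the following is not a diagnosis. It is only a sketch of one common precaution when generating N-Triples: escaping special characters in literal values and always attaching an explicit XSD datatype. The helper name and its default datatype are assumptions for illustration.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Characters that must be escaped inside an N-Triples string literal.
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def typed_literal(value: str, xsd_type: str = "string") -> str:
    """Return an N-Triples literal with escaping and an explicit XSD datatype."""
    escaped = "".join(_ESCAPES.get(ch, ch) for ch in value)
    return f'"{escaped}"^^<{XSD}{xsd_type}>'


# A name containing quotes stays on one line and keeps its closing quote intact.
name_literal = typed_literal('AC/DC "live"')
year_literal = typed_literal("2018", "int")
```

An unescaped quote or line break inside a value would end the literal early and leave stray content on the line, which is the kind of input a strict parser rejects.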

Streaming load

As mentioned above, we want to use Neptune not as a static data store but as a flexible, evolving knowledge base. We therefore had to find ways to add new triples whenever the knowledge base changes, for example when a new album is released, or when we want to materialize derived knowledge.

Neptune supports insert operations via SPARQL queries, both with raw data and based on patterns. We discuss both approaches below.

One of our goals was to insert data in a streaming fashion. Consider the release of an album in a new country. From the MusicBrainz point of view, this means that for a release, which covers albums, singles, EPs, and so on, a new entry is added to the release-country table. In RDF, we map this information to two new triples.

INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759> };
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int> };
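For a streaming setup, such updates would typically be generated by code rather than typed by hand. The sketch below assembles an INSERT DATA statement from a list of triples and wraps it in a POST request that carries the statement in an `update` form field, as SPARQL endpoints expect for updates. The function names are hypothetical, and the request is only built, not sent.

```python
import urllib.parse
import urllib.request


def insert_data_update(triples):
    """Build one INSERT DATA statement from 'subject predicate object' strings."""
    if not triples:
        raise ValueError("at least one triple is required")
    body = " .\n  ".join(triples)
    return "INSERT DATA {\n  " + body + " .\n}"


def build_update_request(endpoint, update):
    """Prepare (but do not send) a SPARQL update as a form-encoded POST."""
    data = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(f"{endpoint}/sparql", data=data, method="POST")


release_country = "<http://musicbrainz.foo/release-country/737041>"
update = insert_data_update([
    f"{release_country} <http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759>",
    f'{release_country} <http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int>',
])
update_request = build_update_request("http://your-neptune-cluster:8182", update)
```

Sending both triples in one statement, as done here, is a design choice of the sketch; the article issues two separate INSERT DATA statements.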

Another goal was to derive new knowledge from the graph. Say we want the number of releases that each artist has published over their career. Such a query is fairly complex and takes more than 20 minutes on Neptune, so we need to materialize the result in order to reuse this new knowledge in other queries. We therefore add triples with this information back to the graph by inserting the result of a subquery.

INSERT {
  ?artist_credit <http://musicbrainz.foo/number-of-releases> ?number_of_releases
} WHERE {
  SELECT ?artist_credit (COUNT(*) as ?number_of_releases)
  WHERE {
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
  }
  GROUP BY ?artist_credit
}

Adding a single triple to the graph takes a few milliseconds, whereas the execution time for inserting the result of a subquery depends on the execution time of the subquery itself.

Although we did not use it often, Neptune also allows removing triples based on patterns or explicit data, which can be used to update information.

SPARQL queries

By introducing the previous subquery, which returns the number of releases for each artist, we have already presented the first type of query we want to answer with Neptune. Creating a query in Neptune is easy: send a POST request to the SPARQL endpoint, as shown below:

curl -X POST --data-binary 'query=SELECT ?artist ?p ?o where {?artist <http://musicbrainz.foo/name> "Elton John" . ?artist ?p ?o . }' http://your-neptune-cluster:8182/sparql
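A script-based equivalent of this call might look as follows. The sketch builds the POST request with the standard library and flattens a response in the W3C SPARQL JSON results format into plain dictionaries. The function names are hypothetical, and the sample response is hand-written for illustration rather than actual Neptune output.

```python
import urllib.parse
import urllib.request


def build_query_request(endpoint, query):
    """Prepare (but do not send) a SPARQL query as a form-encoded POST."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{endpoint}/sparql",
        data=data,
        headers={"Accept": "application/sparql-results+json"},
        method="POST",
    )


def rows_from_results(results: dict) -> list:
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


query = ('SELECT ?artist ?p ?o WHERE { ?artist <http://musicbrainz.foo/name> '
         '"Elton John" . ?artist ?p ?o . }')
query_request = build_query_request("http://your-neptune-cluster:8182", query)

# Hand-written sample in the SPARQL JSON results format (not real output).
sample = {
    "head": {"vars": ["artist", "p", "o"]},
    "results": {"bindings": [{
        "artist": {"type": "uri", "value": "http://musicbrainz.foo/artist/1"},
        "p": {"type": "uri", "value": "http://musicbrainz.foo/name"},
        "o": {"type": "literal", "value": "Elton John"},
    }]},
}
rows = rows_from_results(sample)
```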

In addition, we implemented a query that returns an artist profile containing information such as name, age, or country of origin. Keep in mind that artists can be persons, bands, or orchestras. We also enrich this data with the number of releases published by the artist per year. For solo artists, we also add information about the bands the artist was part of in each year.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT
 ?artist_name ?year
 ?releases_in_year ?releases_up_year
 ?artist_type_name ?releases
 ?artist_gender ?artist_country_name
 ?artist_begin_date ?bands
 ?bands_in_year
WHERE {
 # Bands for each artist
 {
   SELECT
     ?year
     ?first_artist
     (group_concat(DISTINCT ?second_artist_name;separator=",") as ?bands)
     (COUNT(DISTINCT ?second_artist_name) AS ?bands_in_year)
   WHERE {
     VALUES ?year {
       1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
       1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
       1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
       2010 2011 2012 2013 2014 2015 2016 2017 2018
     }
     ?first_artist <http://musicbrainz.foo/name> "Elton John" .
     ?first_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
     ?first_artist <http://musicbrainz.foo/type> ?first_artist_type .
     ?first_artist <http://musicbrainz.foo/name> ?first_artist_name .

     ?second_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
     ?second_artist <http://musicbrainz.foo/type> ?second_artist_type .
     ?second_artist <http://musicbrainz.foo/name> ?second_artist_name .
     optional { ?second_artist <http://musicbrainz.foo/begin-date-year> ?second_artist_begin_date_year . }
     optional { ?second_artist <http://musicbrainz.foo/end-date-year> ?second_artist_end_date_year . }

     ?l_artist_artist <http://musicbrainz.foo/entity0> ?first_artist .
     ?l_artist_artist <http://musicbrainz.foo/entity1> ?second_artist .
     ?l_artist_artist <http://musicbrainz.foo/link> ?link .

     optional { ?link <http://musicbrainz.foo/begin-date-year> ?link_begin_date_year . }
     optional { ?link <http://musicbrainz.foo/end-date-year> ?link_end_date_year . }

     FILTER (!bound(?link_begin_date_year) || ?link_begin_date_year <= ?year)
     FILTER (!bound(?link_end_date_year) || ?link_end_date_year >= ?year)
     FILTER (!bound(?second_artist_begin_date_year) || ?second_artist_begin_date_year <= ?year)
     FILTER (!bound(?second_artist_end_date_year) || ?second_artist_end_date_year >= ?year)
     FILTER (?first_artist_type NOT IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
     FILTER (?second_artist_type IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
   }
   GROUP BY ?first_artist ?year
 }
 # Releases up to a year
 {
   SELECT
     ?artist
     ?year
     (group_concat(DISTINCT ?release_name;separator=",") as ?releases)
     (COUNT(*) as ?releases_up_year)
   WHERE {
     VALUES ?year {
       1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
       1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
       1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
       2010 2011 2012 2013 2014 2015 2016 2017 2018
     }

     ?artist <http://musicbrainz.foo/name> "Elton John" .

     ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
     ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
     ?release <http://musicbrainz.foo/release-group> ?release_group .
     ?release <http://musicbrainz.foo/name> ?release_name .
     ?release_country <http://musicbrainz.foo/release> ?release .
     ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

     FILTER (?release_country_year <= ?year)
   }
   GROUP BY ?artist ?year
 }
 # Releases in a year
 {
   SELECT ?artist ?year (COUNT(*) as ?releases_in_year)
   WHERE {
     VALUES ?year {
       1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
       1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
       1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
       2010 2011 2012 2013 2014 2015 2016 2017 2018
     }

     ?artist <http://musicbrainz.foo/name> "Elton John" .

     ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
     ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
     ?release <http://musicbrainz.foo/release-group> ?release_group .
     ?release_country <http://musicbrainz.foo/release> ?release .
     ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

     FILTER (?release_country_year = ?year)
   }
   GROUP BY ?artist ?year
 }
 # Master data
 {
   SELECT DISTINCT ?artist ?artist_name ?artist_gender ?artist_begin_date ?artist_country_name
   WHERE {
     ?artist <http://musicbrainz.foo/name> ?artist_name .
     ?artist <http://musicbrainz.foo/name> "Elton John" .
     ?artist <http://musicbrainz.foo/gender> ?artist_gender_id .
     ?artist_gender_id <http://musicbrainz.foo/name> ?artist_gender .
     ?artist <http://musicbrainz.foo/area> ?birth_area .
     ?artist <http://musicbrainz.foo/begin-date-year> ?artist_begin_date.
     ?birth_area <http://musicbrainz.foo/name> ?artist_country_name .

     FILTER(datatype(?artist_begin_date) = xsd:int)
   }
 }
}

Because of the complexity of such a query, we could only run point queries for a specific artist such as Elton John, but not for all artists. Neptune does not seem to optimize such a query by pushing filters down into the subselects. Therefore, each subselect has to be filtered manually by artist name.
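Since the artist filter and the list of years have to be repeated in every subselect, one way to keep the query manageable is to generate those fragments. The following is a minimal sketch of that idea, not the authors' tooling; all function names are hypothetical.

```python
def year_values(first: int, last: int) -> str:
    """Build the VALUES ?year { ... } block for an inclusive range of years."""
    years = " ".join(str(year) for year in range(first, last + 1))
    return f"VALUES ?year {{ {years} }}"


def artist_filter(variable: str, name: str) -> str:
    """Build the triple pattern that pins a variable to one artist name."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'?{variable} <http://musicbrainz.foo/name> "{escaped}" .'


def subselect(projection, patterns, artist_var, artist_name, group_by):
    """Wrap patterns in a subselect that starts with the manual artist filter."""
    body = "\n    ".join([artist_filter(artist_var, artist_name)] + patterns)
    return (f"{{\n  SELECT {projection}\n  WHERE {{\n    {body}\n  }}\n"
            f"  GROUP BY {group_by}\n}}")


releases_per_artist = subselect(
    "?artist (COUNT(*) AS ?n)",
    ["?artist_credit_name <http://musicbrainz.foo/artist> ?artist ."],
    "artist",
    "Elton John",
    "?artist",
)
```

Generating the fragments from one place means a different artist or year range only has to be changed once, instead of in each of the four subselects.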

Neptune charges both per hour and per I/O operation. For our test, we used the smallest Neptune instance, which costs $0.384 per hour. For the query above, which computes the profile of a single artist, Amazon charges us for some tens of thousands of I/O operations, which implies a cost of $0.02.

Conclusion

First of all, Amazon Neptune keeps most of its promises. As a managed service, it is a graph database that is very easy to set up and can be up and running without much configuration. Here are our five key findings:

  • Bulk loading is easy but slow. It can also be complicated by error messages that are not very helpful.
  • Streaming load supports everything we expected and was quite fast.
  • Queries are simple, but not interactive enough for running analytical queries.
  • SPARQL queries have to be optimized manually.
  • Amazon's charges are hard to estimate, because it is hard to estimate the amount of data scanned by a SPARQL query.

That's all. Sign up for the free webinar on "Load balancing".


Source: www.habr.com
