Tegishli yo'llarni topish muammosini hal qilish uchun JanusGraph DBMS grafigining qo'llanilishini sinovdan o'tkazuvchi eksperiment

Tegishli yo'llarni topish muammosini hal qilish uchun JanusGraph DBMS grafigining qo'llanilishini sinovdan o'tkazuvchi eksperiment

Hammaga salom. Biz oflayn trafikni tahlil qilish uchun mahsulotni ishlab chiqmoqdamiz. Loyihada mintaqalar bo'ylab tashrif buyuruvchilar yo'nalishlarini statistik tahlil qilish bilan bog'liq vazifa bor.

Ushbu vazifaning bir qismi sifatida foydalanuvchilar quyidagi turdagi tizim so'rovlarini so'rashlari mumkin:

  • "A" maydonidan "B" maydoniga qancha tashrif buyuruvchilar o'tgan;
  • β€œA” maydonidan β€œB” maydoniga β€œC” hududi orqali, keyin esa β€œD” hududi orqali qancha tashrif buyuruvchilar oβ€˜tgan;
  • ma'lum turdagi tashrif buyuruvchilarning "A" maydonidan "B" hududiga sayohat qilishlari uchun qancha vaqt kerak bo'ldi.

va shunga o'xshash bir qator analitik so'rovlar.

Mehmonning hududlar bo'ylab harakati yo'naltirilgan grafikdir. Internetni o'qib chiqqanimdan so'ng, men grafik DBMSlar analitik hisobotlar uchun ham qo'llanilishini aniqladim. Menda DBMS grafigi bunday so'rovlar bilan qanday kurashishini ko'rishni xohlardim (Dl; yomon).

Men DBMS dan foydalanishni tanladim JanusGraph, (mening fikrimcha) uni munosib operatsion xususiyatlar bilan ta'minlashi kerak bo'lgan etuk texnologiyalar to'plamiga tayanadigan ochiq kodli DBMS grafigining taniqli vakili sifatida:

  • BerkeleyDB xotira serveri, Apache Cassandra, Scylla;
  • murakkab indekslar Lucene, Elasticsearch, Solr-da saqlanishi mumkin.

JanusGraph mualliflari u OLTP va OLAP uchun ham mos ekanligini yozadilar.

Men BerkeleyDB, Apache Cassandra, Scylla va ES bilan ishlaganman va bu mahsulotlar ko'pincha bizning tizimlarimizda qo'llaniladi, shuning uchun men ushbu grafik DBMSni sinab ko'rishga optimistik munosabatda bo'ldim. Menga RocksDB o'rniga BerkeleyDB ni tanlash g'alati tuyuldi, lekin bu tranzaksiya talablari bilan bog'liq bo'lishi mumkin. Har qanday holatda, kengaytiriladigan, mahsulotdan foydalanish uchun Cassandra yoki Scylla-da backenddan foydalanish tavsiya etiladi.

Men Neo4jni hisobga olmadim, chunki klasterlash tijorat versiyasini talab qiladi, ya'ni mahsulot ochiq emas.

Grafik DBMSlar shunday deydi: "Agar u grafikga o'xshasa, uni grafik kabi ko'ring!" - go'zallik!

Birinchidan, men DBMS grafigining qonunlariga muvofiq tuzilgan grafikni chizdim:

Tegishli yo'llarni topish muammosini hal qilish uchun JanusGraph DBMS grafigining qo'llanilishini sinovdan o'tkazuvchi eksperiment

Mohiyat bor Zone, hudud uchun javobgar. Agar ZoneStep bunga tegishli Zone, keyin u unga ishora qiladi. Mohiyat bo'yicha Area, ZoneTrack, Person E'tibor bermang, ular domenga tegishli va testning bir qismi hisoblanmaydi. Umuman olganda, bunday grafik tuzilishi uchun zanjirli qidiruv so'rovi quyidagicha ko'rinadi:

g.V().hasLabel('Zone').has('id',0).in_()
       .repeat(__.out()).until(__.out().hasLabel('Zone').has('id',19)).count().next()

Rus tilida nima bo'ladi: ID=0 bo'lgan zonani toping, unga chekka bo'lgan barcha cho'qqilarni oling (ZoneStep), ortga qaytmasdan turib, zonaga chekka bo'lgan ZoneSteplarni topmaguningizcha, qadam qo'ying. ID=19, bunday zanjirlar sonini hisoblang.

Men grafiklarni qidirishning barcha nozik tomonlarini bilgandek ko'rsatmayman, lekin bu so'rov ushbu kitob asosida yaratilgan (https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html).

Men BerkeleyDB backend yordamida JanusGraph grafik ma'lumotlar bazasiga uzunligi 50 dan 3 ballgacha bo'lgan 20 ming trekni yukladim, shunga ko'ra indekslarni yaratdim. etakchilik.

Python yuklab olish skripti:


from random import random
from time import time

from init import g, graph

if __name__ == '__main__':

    points = []
    max_zones = 19
    zcache = dict()
    for i in range(0, max_zones + 1):
        zcache[i] = g.addV('Zone').property('id', i).next()

    startZ = zcache[0]
    endZ = zcache[max_zones]

    for i in range(0, 10000):

        if not i % 100:
            print(i)

        start = g.addV('ZoneStep').property('time', int(time())).next()
        g.V(start).addE('belongs').to(startZ).iterate()

        while True:
            pt = g.addV('ZoneStep').property('time', int(time())).next()
            end_chain = random()
            if end_chain < 0.3:
                g.V(pt).addE('belongs').to(endZ).iterate()
                g.V(start).addE('goes').to(pt).iterate()
                break
            else:
                zone_id = int(random() * max_zones)
                g.V(pt).addE('belongs').to(zcache[zone_id]).iterate()
                g.V(start).addE('goes').to(pt).iterate()

            start = pt

    count = g.V().count().next()
    print(count)

Biz SSD-da 4 yadroli va 16 GB operativ xotiraga ega VM-dan foydalandik. JanusGraph ushbu buyruq yordamida o'rnatildi:

docker run --name janusgraph -p8182:8182 janusgraph/janusgraph:latest

Bunday holda, aniq moslik qidirish uchun ishlatiladigan ma'lumotlar va indekslar BerkeleyDB da saqlanadi. Oldin berilgan so'rovni bajarib, men bir necha o'n soniyalarga teng vaqt oldim.

Yuqoridagi 4 ta skriptni parallel ravishda ishga tushirish orqali men DBMSni Docker jurnallarida Java stacktracelarining quvnoq oqimiga ega qovoqqa aylantirishga muvaffaq bo'ldim (va barchamiz Java steklarini o'qishni yaxshi ko'ramiz).

Biroz o'ylab, men grafik diagrammani quyidagicha soddalashtirishga qaror qildim:

Tegishli yo'llarni topish muammosini hal qilish uchun JanusGraph DBMS grafigining qo'llanilishini sinovdan o'tkazuvchi eksperiment

Ob'ekt atributlari bo'yicha qidirish qirralar bo'yicha qidirishdan tezroq bo'lishini hal qilish. Natijada mening so'rovim quyidagilarga aylandi:

g.V().hasLabel('ZoneStep').has('id',0).repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',19)).count().next()

Rus tilida nima bo'ladi: ID=0 bilan ZoneStep-ni toping, ID=19 bilan ZoneStep-ni topmaguningizcha orqaga qaytmasdan oyoq osti qiling, bunday zanjirlar sonini hisoblang.

Atributlar bilan cheklanib, keraksiz ulanishlarni yaratmaslik uchun yuqorida keltirilgan yuklash skriptini ham soddalashtirdim.

So'rovni bajarish uchun hali ham bir necha soniya kerak bo'ldi, bu bizning vazifamiz uchun mutlaqo qabul qilinishi mumkin emas edi, chunki u har qanday turdagi AdHoc so'rovlari maqsadlari uchun umuman mos emas edi.

Men JanusGraph-ni Cassandra-ning eng tezkor ilovasi sifatida Scylla-dan foydalanib o'rnatishga harakat qildim, lekin bu ham unumdorlikdagi sezilarli o'zgarishlarga olib kelmadi.

Shunday qilib, "bu grafikga o'xshaydi" bo'lishiga qaramay, men DBMS grafigini tezda qayta ishlashga erisha olmadim. Men hech narsani bilmayman deb o'ylayman va JanusGraph bu qidiruvni bir soniya ichida amalga oshirishi mumkin, ammo men buni qila olmadim.

Muammo hali ham hal qilinishi kerak bo'lganligi sababli, men JOIN va jadvallarning Pivotlari haqida o'ylay boshladim, bu nafislik nuqtai nazaridan optimizmni ilhomlantirmadi, lekin amalda to'liq ishlaydigan variant bo'lishi mumkin.

Bizning loyihamiz allaqachon Apache ClickHouse-dan foydalanadi, shuning uchun men tadqiqotimni ushbu analitik DBMSda sinab ko'rishga qaror qildim.

Oddiy retsept yordamida ClickHouse o'rnatilgan:

sudo docker run -d --name clickhouse_1 
     --ulimit nofile=262144:262144 
     -v /opt/clickhouse/log:/var/log/clickhouse-server 
     -v /opt/clickhouse/data:/var/lib/clickhouse 
     yandex/clickhouse-server

Men ma'lumotlar bazasi va unda shunday jadval yaratdim:

CREATE TABLE 
db.steps (`area` Int64, `when` DateTime64(1, 'Europe/Moscow') DEFAULT now64(), `zone` Int64, `person` Int64) 
ENGINE = MergeTree() ORDER BY (area, zone, person) SETTINGS index_granularity = 8192

Men uni quyidagi skript yordamida ma'lumotlar bilan to'ldirdim:

from time import time

from clickhouse_driver import Client
from random import random

client = Client('vm-12c2c34c-df68-4a98-b1e5-a4d1cef1acff.domain',
                database='db',
                password='secret')

max = 20

for r in range(0, 100000):

    if r % 1000 == 0:
        print("CNT: {}, TS: {}".format(r, time()))

    data = [{
            'area': 0,
            'zone': 0,
            'person': r
        }]

    while True:
        if random() < 0.3:
            break

        data.append({
                'area': 0,
                'zone': int(random() * (max - 2)) + 1,
                'person': r
            })

    data.append({
            'area': 0,
            'zone': max - 1,
            'person': r
        })

    client.execute(
        'INSERT INTO steps (area, zone, person) VALUES',
        data
    )

Qo'shimchalar to'plamlarda kelganligi sababli, to'ldirish JanusGraphga qaraganda ancha tez edi.

JOIN yordamida ikkita so'rov tuzildi. A nuqtadan B nuqtaga o'tish uchun:

SELECT s1.person AS person,
       s1.zone,
       s1.when,
       s2.zone,
       s2.when
FROM
  (SELECT *
   FROM steps
   WHERE (area = 0)
     AND (zone = 0)) AS s1 ANY INNER JOIN
  (SELECT *
   FROM steps AS s2
   WHERE (area = 0)
     AND (zone = 19)) AS s2 USING person
WHERE s1.when <= s2.when

3 nuqtadan o'tish uchun:

SELECT s3.person,
       s1z,
       s1w,
       s2z,
       s2w,
       s3.zone,
       s3.when
FROM
  (SELECT s1.person AS person,
          s1.zone AS s1z,
          s1.when AS s1w,
          s2.zone AS s2z,
          s2.when AS s2w
   FROM
     (SELECT *
      FROM steps
      WHERE (area = 0)
        AND (zone = 0)) AS s1 ANY INNER JOIN
     (SELECT *
      FROM steps AS s2
      WHERE (area = 0)
        AND (zone = 3)) AS s2 USING person
   WHERE s1.when <= s2.when) p ANY INNER JOIN
  (SELECT *
   FROM steps
   WHERE (area = 0)
     AND (zone = 19)) AS s3 USING person
WHERE p.s2w <= s3.when

So'rovlar, albatta, juda qo'rqinchli ko'rinadi; haqiqiy foydalanish uchun siz dasturiy ta'minot generatorini yaratishingiz kerak. Biroq, ular ishlaydi va ular tez ishlaydi. Birinchi va ikkinchi so'rovlar 0.1 soniyadan kamroq vaqt ichida bajariladi. 3 nuqtadan o'tgan count(*) uchun so'rovni bajarish vaqtiga misol:

SELECT count(*)
FROM 
(
    SELECT 
        s1.person AS person, 
        s1.zone AS s1z, 
        s1.when AS s1w, 
        s2.zone AS s2z, 
        s2.when AS s2w
    FROM 
    (
        SELECT *
        FROM steps
        WHERE (area = 0) AND (zone = 0)
    ) AS s1
    ANY INNER JOIN 
    (
        SELECT *
        FROM steps AS s2
        WHERE (area = 0) AND (zone = 3)
    ) AS s2 USING (person)
    WHERE s1.when <= s2.when
) AS p
ANY INNER JOIN 
(
    SELECT *
    FROM steps
    WHERE (area = 0) AND (zone = 19)
) AS s3 USING (person)
WHERE p.s2w <= s3.when

β”Œβ”€count()─┐
β”‚   11592 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1 rows in set. Elapsed: 0.068 sec. Processed 250.03 thousand rows, 8.00 MB (3.69 million rows/s., 117.98 MB/s.)

IOPS haqida eslatma. Ma'lumotlarni to'ldirishda JanusGraph juda ko'p miqdordagi IOPS (ma'lumotlarning to'rtta populyatsiyasi uchun 1000-1300) sonini yaratdi va IOWAIT juda yuqori edi. Shu bilan birga, ClickHouse disk quyi tizimida minimal yuk hosil qildi.

xulosa

Ushbu turdagi so'rovlarga xizmat ko'rsatish uchun ClickHouse-dan foydalanishga qaror qildik. Biz har doim so'rovlarni ClickHouse-ga yuklashdan oldin Apache Flink yordamida voqealar oqimini oldindan qayta ishlash orqali moddiylashtirilgan ko'rinishlar va parallelizatsiya yordamida yanada optimallashtirishimiz mumkin.

Ishlash shunchalik yaxshiki, biz jadvallarni dasturiy ravishda aylantirish haqida o'ylamasligimiz mumkin. Ilgari biz Vertica-dan olingan ma'lumotlarni Apache Parket-ga yuklash orqali aylantirishimiz kerak edi.

Afsuski, DBMS grafigidan foydalanishga yana bir urinish muvaffaqiyatsiz tugadi. Men JanusGraph-da mahsulotdan foydalanishni tezlashtirishni osonlashtiradigan do'stona ekotizimga ega ekanligini topmadim. Shu bilan birga, serverni sozlash uchun an'anaviy Java usuli qo'llaniladi, bu Java bilan tanish bo'lmagan odamlarni ko'z yoshlarini yig'lashga majbur qiladi:

host: 0.0.0.0
port: 8182
threadPoolWorker: 1
gremlinPool: 8
scriptEvaluationTimeout: 30000
channelizer: org.janusgraph.channelizers.JanusGraphWsAndHttpChannelizer

graphManager: org.janusgraph.graphdb.management.JanusGraphManager
graphs: {
  ConfigurationManagementGraph: conf/janusgraph-cql-configurationgraph.properties,
  airlines: conf/airlines.properties
}

scriptEngines: {
  gremlin-groovy: {
    plugins: { org.janusgraph.graphdb.tinkerpop.plugin.JanusGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.server.jsr223.GremlinServerGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.tinkergraph.jsr223.TinkerGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.jsr223.ImportGremlinPlugin: {classImports: [java.lang.Math], methodImports: [java.lang.Math#*]},
               org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: [scripts/airline-sample.groovy]}}}}

serializers:
# GraphBinary is here to replace Gryo and Graphson
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { serializeResultToString: true }}
  # Gryo and Graphson, latest versions
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  # Older serialization versions for backwards compatibility:
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoLiteMessageSerializerV1d0, config: {ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV2d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}

processors:
  - { className: org.apache.tinkerpop.gremlin.server.op.session.SessionOpProcessor, config: { sessionTimeout: 28800000 }}
  - { className: org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor, config: { cacheExpirationTime: 600000, cacheMaxSize: 1000 }}

metrics: {
  consoleReporter: {enabled: false, interval: 180000},
  csvReporter: {enabled: false, interval: 180000, fileName: /tmp/gremlin-server-metrics.csv},
  jmxReporter: {enabled: false},
  slf4jReporter: {enabled: true, interval: 180000},
  gangliaReporter: {enabled: false, interval: 180000, addressingMode: MULTICAST},
  graphiteReporter: {enabled: false, interval: 180000}}
threadPoolBoss: 1
maxInitialLineLength: 4096
maxHeaderSize: 8192
maxChunkSize: 8192
maxContentLength: 65536
maxAccumulationBufferComponents: 1024
resultIterationBatchSize: 64
writeBufferHighWaterMark: 32768
writeBufferHighWaterMark: 65536
ssl: {
  enabled: false}

Men tasodifan JanusGraph ning BerkeleyDB versiyasini "qo'yishga" muvaffaq bo'ldim.

Hujjatlar indekslar bo'yicha ancha egri, chunki indekslarni boshqarish Groovy-da juda g'alati shamanizmni bajarishni talab qiladi. Misol uchun, indeks yaratish Gremlin konsolida kod yozish orqali amalga oshirilishi kerak (bu, aytmoqchi, qutidan tashqarida ishlamaydi). Rasmiy JanusGraph hujjatlaridan:

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('byNameComposite', Vertex.class).addKey(name).buildCompositeIndex()
mgmt.buildIndex('byNameAndAgeComposite', Vertex.class).addKey(name).addKey(age).buildCompositeIndex()
mgmt.commit()

//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameComposite').call()
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameAndAgeComposite').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REINDEX).get()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndAgeComposite"), SchemaAction.REINDEX).get()
mgmt.commit()

So'zdan keyin

Qaysidir ma'noda, yuqoridagi tajriba issiq va yumshoq o'rtasidagi taqqoslashdir. Agar siz bu haqda o'ylab ko'rsangiz, DBMS grafigi bir xil natijalarni olish uchun boshqa operatsiyalarni bajaradi. Biroq, testlarning bir qismi sifatida men quyidagi so'rov bilan tajriba o'tkazdim:

g.V().hasLabel('ZoneStep').has('id',0)
    .repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',1)).count().next()

bu yurish masofasini aks ettiradi. Biroq, bunday ma'lumotlar bo'yicha ham, DBMS grafigi bir necha soniyadan oshib ketgan natijalarni ko'rsatdi... Bu, albatta, shunga o'xshash yo'llar mavjudligi bilan bog'liq. 0 -> X -> Y ... -> 1, buni grafik dvigatel ham tekshirdi.

Hatto quyidagi kabi so'rov uchun:

g.V().hasLabel('ZoneStep').has('id',0).out().has('id',1)).count().next()

Bir soniyadan kamroq ishlov berish vaqti bilan samarali javob ololmadim.

Hikoyaning axloqiy tomoni shundaki, chiroyli g'oya va paradigmatik modellashtirish istalgan natijaga olib kelmaydi, bu ClickHouse misolida ancha yuqori samaradorlik bilan namoyon bo'ladi. Ushbu maqolada keltirilgan foydalanish misoli ma'lumotlar bazasining grafik diagrammasi uchun aniq anti-naqshdir, garchi bu ularning paradigmasida modellashtirish uchun mos ko'rinadi.

Manba: www.habr.com

a Izoh qo'shish