An experiment testing the applicability of the JanusGraph graph DBMS to the problem of finding suitable paths

Hello everyone. We are building a product for analyzing offline traffic. One of the project's tasks involves statistical analysis of visitor routes across areas.

As part of this task, users can ask the system queries of the following kind:

  • how many visitors went from area "A" to area "B";
  • how many visitors went from area "A" to area "B" via area "C" and then via area "D";
  • how long it took a visitor of a certain type to get from area "A" to area "B".

and a number of similar analytical queries.
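To make these query types concrete, here is a minimal in-memory sketch in Python. It assumes each visitor's route is already available as a time-ordered list of (zone, timestamp) pairs. The names `Track`, `passes_through`, `count_visitors`, and `travel_time`, as well as the sample data, are illustrative and not part of the product.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical representation: a time-ordered list of (zone, unix_time) pairs.
Track = List[Tuple[str, float]]


def passes_through(track: Track, zones: Sequence[str]) -> bool:
    """True if the track visits the given zones in order (other zones may lie between)."""
    visited = iter(zone for zone, _ in track)
    # The shared iterator makes each zone search resume where the previous one stopped.
    return all(any(z == wanted for z in visited) for wanted in zones)


def count_visitors(tracks: Dict[int, Track], zones: Sequence[str]) -> int:
    """Number of visitors whose track passes through all zones in order."""
    return sum(passes_through(track, zones) for track in tracks.values())


def travel_time(track: Track, src: str, dst: str) -> Optional[float]:
    """Seconds between the first visit to src and the first later visit to dst."""
    start = None
    for zone, ts in track:
        if start is None and zone == src:
            start = ts
        elif start is not None and zone == dst:
            return ts - start
    return None


tracks = {
    1: [("A", 0.0), ("C", 40.0), ("B", 90.0)],
    2: [("B", 5.0), ("A", 30.0)],
    3: [("A", 10.0), ("C", 25.0), ("D", 60.0), ("B", 80.0)],
}

print(count_visitors(tracks, ["A", "B"]))            # visitors 1 and 3
print(count_visitors(tracks, ["A", "C", "D", "B"]))  # visitor 3 only
print(travel_time(tracks[1], "A", "B"))
```

This only pins down what the queries mean; it says nothing about how to answer them quickly on real data volumes, which is the subject of the rest of the article.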

Visitor movement across areas is a directed graph. After reading around on the Internet, I found that graph DBMSs are also used for analytical reports. I wanted to see how graph DBMSs would cope with such queries (TL;DR: poorly).

I chose JanusGraph as a prominent representative of open-source graph DBMSs. It relies on a stack of mature technologies that (in my opinion) should give it decent operational characteristics:

  • BerkeleyDB, Apache Cassandra, or Scylla as the storage backend;
  • complex indexes can be stored in Lucene, Elasticsearch, or Solr.

The authors of JanusGraph write that it is suitable for both OLTP and OLAP.

I have worked with BerkeleyDB, Apache Cassandra, Scylla, and ES, and these products are used widely in our systems, so I was optimistic about testing this graph DBMS. I found the choice of BerkeleyDB over RocksDB odd, but that is probably due to transaction requirements. In any case, for scalable production use, a Cassandra or Scylla backend is recommended.

I did not consider Neo4j because clustering requires the commercial edition, that is, the product is not open source.

Graph DBMSs proclaim: "If it looks like a graph, treat it like a graph!" Beautiful!

First, I drew a graph schema that follows the canons of graph DBMSs exactly:

[Figure: the initial graph schema]

There is an entity Zone, which represents an area. If a ZoneStep belongs to this Zone, it refers to it. The entities Area, ZoneTrack, and Person can be ignored: they belong to the domain and are not considered in this test. All in all, a chain search query for such a graph structure would look like this:

g.V().hasLabel('Zone').has('id',0).in_()
       .repeat(__.out()).until(__.out().hasLabel('Zone').has('id',19)).count().next()

In plain language, this reads roughly as follows: find the Zone with ID=0, take all vertices that have an edge going to it (ZoneStep), walk forward without going back until you find those ZoneSteps that have an edge to the Zone with ID=19, and count the number of such chains.
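The semantics of this traversal can be sketched in plain Python. This is a simplified model, not how JanusGraph executes the query: the 'belongs' edges are collapsed into a hypothetical `zone_of` mapping, the 'goes' edges into a `goes` adjacency dict, and the graph is assumed to be acyclic (tracks are chains). The function name and the sample graph are made up for illustration.

```python
from typing import Dict, List


def count_chains(goes: Dict[int, List[int]], zone_of: Dict[int, int],
                 start_zone: int, end_zone: int) -> int:
    """Count chains that start at a step in start_zone and stop at the first step in end_zone."""
    count = 0
    # Roughly g.V().hasLabel('Zone').has('id', start_zone).in_():
    # every ZoneStep that belongs to the start zone.
    frontier = [step for step, zone in zone_of.items() if zone == start_zone]
    while frontier:
        step = frontier.pop()
        for nxt in goes.get(step, []):        # repeat(out())
            if zone_of[nxt] == end_zone:      # until(out()...has('id', end_zone))
                count += 1                    # this traverser stops here and is counted
            else:
                frontier.append(nxt)          # keep walking along the chain
    return count


# Two hypothetical tracks: 1 -> 2 -> 3 and 4 -> 5 (step ids), with their zones.
goes = {1: [2], 2: [3], 4: [5]}
zone_of = {1: 0, 2: 7, 3: 19, 4: 0, 5: 19}
print(count_chains(goes, zone_of, 0, 19))
```

Even in this toy form it is visible that the work grows with the total length of all chains that start in the source zone, not with the number of matching chains.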

I do not pretend to know all the intricacies of searching in graphs, but this query was built on the basis of this book (https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html).

I loaded 50 thousand tracks, from 3 to 20 points long, into a JanusGraph graph database using the BerkeleyDB backend, and created indexes according to the manual.

Python loading script:


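from random import random
from time import time

# g (the graph traversal source) and graph are expected to be defined in a local init module
from init import g, graph

if __name__ == '__main__':

    max_zones = 19
    # Create Zone vertices with ids 0..max_zones and cache them by id
    zcache = dict()
    for i in range(0, max_zones + 1):
        zcache[i] = g.addV('Zone').property('id', i).next()

    startZ = zcache[0]
    endZ = zcache[max_zones]

    # Each iteration generates one track from the start zone to the end zone
    for i in range(0, 10000):

        if not i % 100:
            print(i)  # progress indicator

        # The first step of every track belongs to the start zone
        start = g.addV('ZoneStep').property('time', int(time())).next()
        g.V(start).addE('belongs').to(startZ).iterate()

        while True:
            pt = g.addV('ZoneStep').property('time', int(time())).next()
            end_chain = random()
            if end_chain < 0.3:
                # With probability 0.3, finish the track in the end zone
                g.V(pt).addE('belongs').to(endZ).iterate()
                g.V(start).addE('goes').to(pt).iterate()
                break
            else:
                # Otherwise attach the step to a random zone with id 0..max_zones - 1
                zone_id = int(random() * max_zones)
                g.V(pt).addE('belongs').to(zcache[zone_id]).iterate()
                g.V(start).addE('goes').to(pt).iterate()

            start = pt

    count = g.V().count().next()
    print(count)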

We used a VM with 4 cores and 16 GB of RAM on an SSD. JanusGraph was deployed with this command:

docker run --name janusgraph -p8182:8182 janusgraph/janusgraph:latest

In this case, the data and the indexes used for exact-match searches are stored in BerkeleyDB. After running the query given earlier, I got an execution time of several tens of seconds.

By running four copies of the script above in parallel, I managed to turn the DBMS into a pumpkin, with a cheerful stream of Java stack traces (and we all love reading Java stack traces) in the Docker logs.

After some thought, I decided to simplify the graph schema to the following:

[Figure: the simplified graph schema]

I decided that searching by entity attributes would be faster than searching by edges. As a result, my query turned into the following:

g.V().hasLabel('ZoneStep').has('id',0).repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',19)).count().next()

In plain language, this reads roughly as follows: find the ZoneStep with ID=0, walk forward without going back until you find a ZoneStep with ID=19, and count the number of such chains.

I also simplified the loading script given above so that it does not create unnecessary edges and limits itself to attributes.

The query still took several seconds to complete, which was completely unacceptable for our task, since it made the approach unsuitable for ad hoc queries of any kind.

I tried deploying JanusGraph with Scylla, as the fastest Cassandra implementation, but this did not lead to any significant change in performance either.

So, despite the fact that "it looks like a graph", I could not get the graph DBMS to process it quickly. I fully accept that there is something I do not know and that JanusGraph can be made to perform this search in a fraction of a second; however, I was unable to do it.

Since the problem still needed to be solved, I started thinking about JOINs and pivot tables. They did not inspire optimism in terms of elegance, but could be a perfectly workable option in practice.

Our project already uses ClickHouse, so I decided to test my ideas on this analytical DBMS.

ClickHouse was deployed using a simple recipe:

sudo docker run -d --name clickhouse_1 \
     --ulimit nofile=262144:262144 \
     -v /opt/clickhouse/log:/var/log/clickhouse-server \
     -v /opt/clickhouse/data:/var/lib/clickhouse \
     yandex/clickhouse-server

I created a database and a table in it like this:

CREATE TABLE 
db.steps (`area` Int64, `when` DateTime64(1, 'Europe/Moscow') DEFAULT now64(), `zone` Int64, `person` Int64) 
ENGINE = MergeTree() ORDER BY (area, zone, person) SETTINGS index_granularity = 8192

I filled it with data using the following script:

from random import random
from time import time

from clickhouse_driver import Client

client = Client('vm-12c2c34c-df68-4a98-b1e5-a4d1cef1acff.domain',
                database='db',
                password='secret')

max_zones = 20  # zone ids run from 0 to max_zones - 1

for r in range(0, 100000):

    if r % 1000 == 0:
        print("CNT: {}, TS: {}".format(r, time()))

    # Every track starts in zone 0
    data = [{
            'area': 0,
            'zone': 0,
            'person': r
        }]

    # Append a random number of intermediate zones (ids 1..max_zones - 2)
    while True:
        if random() < 0.3:
            break

        data.append({
                'area': 0,
                'zone': int(random() * (max_zones - 2)) + 1,
                'person': r
            })

    # Every track ends in the last zone
    data.append({
            'area': 0,
            'zone': max_zones - 1,
            'person': r
        })

    # One batched INSERT per person; `when` is filled in by the column default
    client.execute(
        'INSERT INTO steps (area, zone, person) VALUES',
        data
    )

Since the inserts go in batches, loading was much faster than with JanusGraph.
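The script above sends one small INSERT per visitor. ClickHouse generally favors fewer, larger inserts, so a possible refinement is to group rows from many visitors into bigger batches first. The helper below is a minimal sketch of that idea; the name `batched_rows` and the batch size are assumptions, not something measured in this experiment.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, int]


def batched_rows(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield lists of at most batch_size rows, preserving the input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter batch
        yield batch


# Hypothetical rows in the same shape the loading script produces.
rows = ({"area": 0, "zone": p % 20, "person": p} for p in range(25))
sizes = [len(batch) for batch in batched_rows(rows, 10)]
print(sizes)
```

Each yielded batch could then be passed to `client.execute('INSERT INTO steps (area, zone, person) VALUES', batch)` exactly as in the script. Note that `when` defaults to the insert time, so all rows in a batch receive nearly the same timestamp, which is already true for the rows of a single visitor in the original script.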

I built two queries using JOIN. To go from point A to point B:

SELECT s1.person AS person,
       s1.zone,
       s1.when,
       s2.zone,
       s2.when
FROM
  (SELECT *
   FROM steps
   WHERE (area = 0)
     AND (zone = 0)) AS s1 ANY INNER JOIN
  (SELECT *
   FROM steps AS s2
   WHERE (area = 0)
     AND (zone = 19)) AS s2 USING person
WHERE s1.when <= s2.when

To pass through 3 points:

SELECT s3.person,
       s1z,
       s1w,
       s2z,
       s2w,
       s3.zone,
       s3.when
FROM
  (SELECT s1.person AS person,
          s1.zone AS s1z,
          s1.when AS s1w,
          s2.zone AS s2z,
          s2.when AS s2w
   FROM
     (SELECT *
      FROM steps
      WHERE (area = 0)
        AND (zone = 0)) AS s1 ANY INNER JOIN
     (SELECT *
      FROM steps AS s2
      WHERE (area = 0)
        AND (zone = 3)) AS s2 USING person
   WHERE s1.when <= s2.when) p ANY INNER JOIN
  (SELECT *
   FROM steps
   WHERE (area = 0)
     AND (zone = 19)) AS s3 USING person
WHERE p.s2w <= s3.when

The queries, of course, look rather scary; for real use, you need to build a software harness that generates them. However, they work, and they work fast. Both the first and the second query complete in under 0.1 seconds. Here is an example of the execution time of a count(*) query passing through 3 points:

SELECT count(*)
FROM 
(
    SELECT 
        s1.person AS person, 
        s1.zone AS s1z, 
        s1.when AS s1w, 
        s2.zone AS s2z, 
        s2.when AS s2w
    FROM 
    (
        SELECT *
        FROM steps
        WHERE (area = 0) AND (zone = 0)
    ) AS s1
    ANY INNER JOIN 
    (
        SELECT *
        FROM steps AS s2
        WHERE (area = 0) AND (zone = 3)
    ) AS s2 USING (person)
    WHERE s1.when <= s2.when
) AS p
ANY INNER JOIN 
(
    SELECT *
    FROM steps
    WHERE (area = 0) AND (zone = 19)
) AS s3 USING (person)
WHERE p.s2w <= s3.when

┌─count()─┐
│   11592 │
└─────────┘

1 rows in set. Elapsed: 0.068 sec. Processed 250.03 thousand rows, 8.00 MB (3.69 million rows/s., 117.98 MB/s.)
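The generator harness mentioned above could start out as something like the following Python sketch. It follows the same nesting pattern as the hand-written queries (one ANY INNER JOIN per additional zone, joined on person, with a timestamp ordering check), but it keeps only `person` and the latest timestamp at each level instead of all columns. The function names and the `w0`, `w1`, ... aliases are made up for illustration, and the generated SQL has not been run against a ClickHouse server.

```python
from typing import Sequence


def zone_rows(area: int, zone: int, when_alias: str) -> str:
    """Subquery with every visit to one zone, exposing its timestamp under an alias."""
    return (f"(SELECT person, `when` AS {when_alias} FROM steps "
            f"WHERE (area = {int(area)}) AND (zone = {int(zone)}))")


def build_path_count_query(area: int, zones: Sequence[int]) -> str:
    """Build a count(*) query for visitors passing through the zones in order."""
    if len(zones) < 2:
        raise ValueError("a path needs at least two zones")
    chain = zone_rows(area, zones[0], "w0")
    for i, zone in enumerate(zones[1:], start=1):
        # Wrap the chain built so far and join the next zone onto it.
        chain = (f"(SELECT person, w{i} FROM {chain} AS p{i} "
                 f"ANY INNER JOIN {zone_rows(area, zone, f'w{i}')} AS s{i} USING person "
                 f"WHERE w{i - 1} <= w{i})")
    return f"SELECT count(*) FROM {chain}"


query = build_path_count_query(0, [0, 3, 19])
print(query)
```

Casting the inputs with `int()` keeps arbitrary strings out of the SQL text; a production harness would more likely use the driver's parameter substitution.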

A note on IOPS. While loading data, JanusGraph generated a rather high number of IOPS (1000-1300 for four data-loading threads), and IOWAIT was rather high. At the same time, ClickHouse generated minimal load on the disk subsystem.

Conclusion

We decided to use ClickHouse to serve this type of query. We can always optimize the queries further using materialized views and parallelization, by pre-processing the event stream with Apache Flink before loading it into ClickHouse.

Performance is so good that we will probably not even have to think about pivoting tables programmatically. Previously, we had to pivot data retrieved from Vertica by exporting it to Apache Parquet.
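For context, the kind of programmatic pivot that can now be avoided amounts to roughly the following. This is a minimal in-memory sketch with a hypothetical `pivot_steps` helper and made-up rows; it is not the code of the previous Vertica and Parquet pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pivot_steps(rows: Iterable[Tuple[int, float, int]]) -> Dict[int, List[int]]:
    """Turn (person, when, zone) rows into person -> zones ordered by time."""
    by_person = defaultdict(list)
    for person, when, zone in rows:
        by_person[person].append((when, zone))
    # Sorting the (when, zone) pairs orders each visitor's steps chronologically.
    return {person: [zone for _, zone in sorted(steps)]
            for person, steps in by_person.items()}


rows = [(1, 3.0, 19), (1, 1.0, 0), (2, 2.0, 0), (1, 2.0, 3)]
print(pivot_steps(rows))
```

Once every visitor has an ordered list of zones, path questions become simple list checks, but the whole data set has to be pulled out of the database and regrouped first.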

Unfortunately, yet another attempt to use a graph DBMS was unsuccessful. I did not find JanusGraph to have a friendly ecosystem that makes it easy to get up to speed with the product. Meanwhile, the server is configured in the traditional Java way, which will make people unfamiliar with Java cry tears of blood:

host: 0.0.0.0
port: 8182
threadPoolWorker: 1
gremlinPool: 8
scriptEvaluationTimeout: 30000
channelizer: org.janusgraph.channelizers.JanusGraphWsAndHttpChannelizer

graphManager: org.janusgraph.graphdb.management.JanusGraphManager
graphs: {
  ConfigurationManagementGraph: conf/janusgraph-cql-configurationgraph.properties,
  airlines: conf/airlines.properties
}

scriptEngines: {
  gremlin-groovy: {
    plugins: { org.janusgraph.graphdb.tinkerpop.plugin.JanusGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.server.jsr223.GremlinServerGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.tinkergraph.jsr223.TinkerGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.jsr223.ImportGremlinPlugin: {classImports: [java.lang.Math], methodImports: [java.lang.Math#*]},
               org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: [scripts/airline-sample.groovy]}}}}

serializers:
# GraphBinary is here to replace Gryo and Graphson
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { serializeResultToString: true }}
  # Gryo and Graphson, latest versions
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  # Older serialization versions for backwards compatibility:
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoLiteMessageSerializerV1d0, config: {ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV2d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}

processors:
  - { className: org.apache.tinkerpop.gremlin.server.op.session.SessionOpProcessor, config: { sessionTimeout: 28800000 }}
  - { className: org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor, config: { cacheExpirationTime: 600000, cacheMaxSize: 1000 }}

metrics: {
  consoleReporter: {enabled: false, interval: 180000},
  csvReporter: {enabled: false, interval: 180000, fileName: /tmp/gremlin-server-metrics.csv},
  jmxReporter: {enabled: false},
  slf4jReporter: {enabled: true, interval: 180000},
  gangliaReporter: {enabled: false, interval: 180000, addressingMode: MULTICAST},
  graphiteReporter: {enabled: false, interval: 180000}}
threadPoolBoss: 1
maxInitialLineLength: 4096
maxHeaderSize: 8192
maxChunkSize: 8192
maxContentLength: 65536
maxAccumulationBufferComponents: 1024
resultIterationBatchSize: 64
writeBufferLowWaterMark: 32768
writeBufferHighWaterMark: 65536
ssl: {
  enabled: false}

I managed to accidentally take down the BerkeleyDB version of JanusGraph.

The documentation on indexes is rather convoluted, since managing indexes requires you to perform some rather strange shamanism in Groovy. For example, an index has to be created by writing code in the Gremlin console (which, by the way, does not work out of the box). From the official JanusGraph documentation:

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('byNameComposite', Vertex.class).addKey(name).buildCompositeIndex()
mgmt.buildIndex('byNameAndAgeComposite', Vertex.class).addKey(name).addKey(age).buildCompositeIndex()
mgmt.commit()

//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameComposite').call()
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameAndAgeComposite').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REINDEX).get()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndAgeComposite"), SchemaAction.REINDEX).get()
mgmt.commit()

Afterword

In a sense, the experiment above compares warm with soft, that is, apples with oranges. If you think about it, a graph DBMS performs different operations to obtain the same results. However, as part of the tests, I also ran an experiment with a query like this:

g.V().hasLabel('ZoneStep').has('id',0)
    .repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',1)).count().next()

which reflects walking distance. However, even on such data, the graph DBMS showed results that went beyond a few seconds... This, of course, is because there were also paths such as 0 -> X -> Y ... -> 1, which the graph engine checked as well.
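A rough way to see the extra work is to count how many hops a walk-until-found traversal takes compared with the number of direct transitions it is really after. The sketch below is a toy cost model over tracks given as lists of zone ids. It is not a description of how JanusGraph executes the query, and the function names and sample tracks are hypothetical.

```python
from typing import List, Tuple


def hops_walked(tracks: List[List[int]], src: int, dst: int) -> Tuple[int, int]:
    """Return (chains found, hops taken) for a walk that starts at src and stops at dst."""
    found = hops = 0
    for track in tracks:
        for i, zone in enumerate(track):
            if zone != src:
                continue
            # Walk forward from this src step until dst is met or the track ends.
            for nxt in track[i + 1:]:
                hops += 1
                if nxt == dst:
                    found += 1
                    break
    return found, hops


def direct_transitions(tracks: List[List[int]], src: int, dst: int) -> int:
    """Number of places where dst immediately follows src."""
    return sum(1 for track in tracks
               for a, b in zip(track, track[1:]) if a == src and b == dst)


tracks = [[0, 1], [0, 5, 7, 1], [0, 4, 9, 12, 19]]
print(hops_walked(tracks, 0, 1))
print(direct_transitions(tracks, 0, 1))
```

In this toy data only one track contains a direct 0 -> 1 transition, yet the walk visits every step of every track that starts in zone 0, including tracks that never reach zone 1.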

Even for a query like:

g.V().hasLabel('ZoneStep').has('id',0).out().has('id',1).count().next()

I could not get a usable response with a processing time of under a second.

The moral of the story is that a beautiful idea and paradigmatic modeling do not necessarily lead to the desired result, which the ClickHouse example delivers with far greater efficiency. The use case presented in this article is a clear anti-pattern for graph DBMSs, even though it looks well suited to modeling in their paradigm.

Source: www.habr.com
