Ib qho kev sim ntsuas qhov siv tau ntawm JanusGraph graph DBMS los daws qhov teeb meem ntawm kev nrhiav txoj hauv kev tsim nyog

Ib qho kev sim ntsuas qhov siv tau ntawm JanusGraph graph DBMS los daws qhov teeb meem ntawm kev nrhiav txoj hauv kev tsim nyog

Nyob zoo sawv daws. Peb tab tom tsim cov khoom lag luam rau kev txheeb xyuas kev tsheb khiav offline. Qhov project muaj ib txoj haujlwm ntsig txog kev txheeb xyuas kev txheeb xyuas ntawm cov qhua tuaj xyuas thoob plaws cheeb tsam.

Raws li ib feem ntawm txoj haujlwm no, cov neeg siv tuaj yeem nug cov lus nug ntawm cov lus nug hauv qab no:

  • pes tsawg tus neeg tuaj xyuas dhau ntawm cheeb tsam "A" mus rau thaj tsam "B";
  • pes tsawg tus neeg tuaj xyuas dhau ntawm cheeb tsam "A" mus rau cheeb tsam "B" dhau ntawm cheeb tsam "C" thiab tom qab ntawd dhau ntawm cheeb tsam "D";
  • Nws siv sijhawm ntev npaum li cas rau qee tus neeg tuaj xyuas los ntawm cheeb tsam "A" mus rau thaj tsam "B".

thiab ntau cov lus nug zoo sib xws.

Tus neeg tuaj saib kev txav mus los ntawm thaj chaw yog ib daim duab qhia. Tom qab nyeem hauv Is Taws Nem, kuv pom tias daim duab DBMSs kuj tau siv rau cov ntaub ntawv txheeb xyuas. Kuv muaj lub siab xav pom cov duab DBMSs yuav tiv nrog cov lus nug li cas (TL; DR; tsis zoo).

Kuv xaiv siv DBMS Janus Graph, raws li tus neeg sawv cev zoo tshaj ntawm cov duab qhib-qhov DBMS, uas tso siab rau pawg ntawm cov thev naus laus zis paub tab, uas (hauv kuv lub tswv yim) yuav tsum muab nws nrog cov yam ntxwv ua haujlwm zoo:

  • BerkeleyDB cia backend, Apache Cassandra, Scylla;
  • complex indexes tuaj yeem khaws cia hauv Lucene, Elasticsearch, Solr.

Cov kws sau ntawv ntawm JanusGraph sau tias nws tsim nyog rau OLTP thiab OLAP.

Kuv tau ua haujlwm nrog BerkeleyDB, Apache Cassandra, Scylla thiab ES, thiab cov khoom no feem ntau siv hauv peb lub tshuab, yog li kuv tau zoo siab txog kev sim cov duab no DBMS. Kuv pom tias nws khib xaiv BerkeleyDB dhau RocksDB, tab sis qhov ntawd yog vim qhov yuav tsum tau ua. Nyob rau hauv txhua rooj plaub, rau scalable, khoom siv, nws yog pom zoo kom siv ib tug backend ntawm Cassandra los yog Scylla.

Kuv tsis tau xav txog Neo4j vim tias kev sib koom ua ke yuav tsum muaj kev lag luam version, uas yog, cov khoom tsis yog qhib qhov chaw.

Graph DBMSs hais tias: "Yog tias nws zoo li daim duab, kho nws zoo li daim duab!" - kev zoo nkauj!

Ua ntej, kuv kos ib daim duab, uas tau ua raws nraim raws li cov canons ntawm daim duab DBMSs:

Ib qho kev sim ntsuas qhov siv tau ntawm JanusGraph graph DBMS los daws qhov teeb meem ntawm kev nrhiav txoj hauv kev tsim nyog

Muaj ib qho essence Zone, lub luag haujlwm rau thaj chaw. Yog ZoneStep belongs rau qhov no Zone, ces nws hais rau nws. Ntawm essence Area, ZoneTrack, Person Tsis txhob xyuam xim, lawv koom nrog lub npe thiab tsis suav tias yog ib feem ntawm qhov kev xeem. Nyob rau hauv tag nrho, ib tug chain search query rau xws li ib tug graph qauv yuav zoo li:

g.V().hasLabel('Zone').has('id',0).in_()
       .repeat(__.out()).until(__.out().hasLabel('Zone').has('id',19)).count().next()

Yuav ua li cas nyob rau hauv Lavxias teb sab yog ib yam dab tsi zoo li no: nrhiav ib tug Zone nrog ID = 0, coj tag nrho cov vertices los ntawm uas ib tug ntug mus rau nws (ZoneStep), stomp tsis rov qab mus txog rau thaum koj pom cov ZoneSteps uas muaj ib tug ntug mus rau lub Zone nrog ID=19, suav tus lej xws li chains.

Kuv tsis ua txuj paub tag nrho cov intricacies ntawm kev tshawb nrhiav ntawm cov duab, tab sis cov lus nug no tau tsim los ntawm phau ntawv no (https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html).

Kuv thauj 50 txhiab lem los ntawm 3 mus rau 20 cov ntsiab lus ntev mus rau hauv JanusGraph daim duab database siv BerkeleyDB backend, tsim indexes raws li kev coj noj coj ua.

Python download tsab ntawv:


from random import random
from time import time

from init import g, graph

if __name__ == '__main__':

    points = []
    max_zones = 19
    zcache = dict()
    for i in range(0, max_zones + 1):
        zcache[i] = g.addV('Zone').property('id', i).next()

    startZ = zcache[0]
    endZ = zcache[max_zones]

    for i in range(0, 10000):

        if not i % 100:
            print(i)

        start = g.addV('ZoneStep').property('time', int(time())).next()
        g.V(start).addE('belongs').to(startZ).iterate()

        while True:
            pt = g.addV('ZoneStep').property('time', int(time())).next()
            end_chain = random()
            if end_chain < 0.3:
                g.V(pt).addE('belongs').to(endZ).iterate()
                g.V(start).addE('goes').to(pt).iterate()
                break
            else:
                zone_id = int(random() * max_zones)
                g.V(pt).addE('belongs').to(zcache[zone_id]).iterate()
                g.V(start).addE('goes').to(pt).iterate()

            start = pt

    count = g.V().count().next()
    print(count)

Peb siv VM nrog 4 cores thiab 16 GB RAM ntawm SSD. JanusGraph tau xa mus siv cov lus txib no:

docker run --name janusgraph -p8182:8182 janusgraph/janusgraph:latest

Hauv qhov no, cov ntaub ntawv thiab kev ntsuas uas siv rau kev tshawb nrhiav qhov tseeb yog khaws cia hauv BerkeleyDB. Tom qab ua tiav qhov kev thov ua ntej, kuv tau txais ib lub sijhawm sib npaug li ob peb kaum vib nas this.

Los ntawm kev khiav 4 cov ntawv sau saum toj no ua ke, kuv tau tswj hwm tig DBMS rau hauv taub taub nrog cov kwj deg zoo siab ntawm Java stacktraces (thiab peb txhua tus nyiam nyeem Java stacktraces) hauv Docker cav.

Tom qab qee qhov kev xav, kuv txiav txim siab ua kom yooj yim daim duab kos rau hauv qab no:

Ib qho kev sim ntsuas qhov siv tau ntawm JanusGraph graph DBMS los daws qhov teeb meem ntawm kev nrhiav txoj hauv kev tsim nyog

Kev txiav txim siab tias kev tshawb nrhiav los ntawm cov khoom ntiag tug yuav nrawm dua li kev tshawb nrhiav los ntawm ntug. Yog li ntawd, kuv qhov kev thov tau hloov mus rau hauv cov hauv qab no:

g.V().hasLabel('ZoneStep').has('id',0).repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',19)).count().next()

Dab tsi hauv Lavxias yog qee yam zoo li no: nrhiav ZoneStep nrog ID = 0, stomp yam tsis rov qab mus txog thaum koj pom ZoneStep nrog ID = 19, suav cov lej ntawm cov chains zoo li no.

Kuv kuj tau ua kom yooj yim rau cov ntawv thauj khoom uas tau muab los saum toj no kom tsis txhob tsim kev sib txuas tsis tsim nyog, txwv kuv tus kheej rau cov cwj pwm.

Qhov kev thov tseem siv sijhawm ob peb lub vib nas this kom tiav, uas yog qhov tsis txaus siab rau peb txoj haujlwm, vim nws tsis haum rau lub hom phiaj ntawm AdHoc thov txhua yam.

Kuv sim siv JanusGraph siv Scylla raws li kev siv Cassandra sai tshaj plaws, tab sis qhov no kuj tsis ua rau muaj kev hloov pauv tseem ceeb.

Yog li txawm tias qhov tseeb tias "nws zoo li daim duab", Kuv tsis tuaj yeem tau txais daim duab DBMS los ua nws sai. Kuv xav tias muaj qee yam uas kuv tsis paub thiab tias JanusGraph tuaj yeem ua los ua qhov kev tshawb fawb no hauv ib feem ntawm ib pliag, txawm li cas los xij, kuv tsis muaj peev xwm ua tau.

Txij li thaum qhov teeb meem tseem xav tau los daws, Kuv pib xav txog JOINs thiab Pivots ntawm cov ntxhuav, uas tsis txhawb kev cia siab ntawm kev zoo nkauj, tab sis tuaj yeem yog qhov kev xaiv ua haujlwm tau zoo hauv kev xyaum.

Peb qhov project twb siv Apache ClickHouse, yog li kuv txiav txim siab los sim kuv cov kev tshawb fawb ntawm qhov kev tshuaj ntsuam DBMS no.

Deployed ClickHouse siv ib daim ntawv qhia yooj yim:

sudo docker run -d --name clickhouse_1 
     --ulimit nofile=262144:262144 
     -v /opt/clickhouse/log:/var/log/clickhouse-server 
     -v /opt/clickhouse/data:/var/lib/clickhouse 
     yandex/clickhouse-server

Kuv tsim ib lub database thiab ib lub rooj hauv nws zoo li no:

CREATE TABLE 
db.steps (`area` Int64, `when` DateTime64(1, 'Europe/Moscow') DEFAULT now64(), `zone` Int64, `person` Int64) 
ENGINE = MergeTree() ORDER BY (area, zone, person) SETTINGS index_granularity = 8192

Kuv sau nws nrog cov ntaub ntawv siv cov ntawv hauv qab no:

from time import time

from clickhouse_driver import Client
from random import random

client = Client('vm-12c2c34c-df68-4a98-b1e5-a4d1cef1acff.domain',
                database='db',
                password='secret')

max = 20

for r in range(0, 100000):

    if r % 1000 == 0:
        print("CNT: {}, TS: {}".format(r, time()))

    data = [{
            'area': 0,
            'zone': 0,
            'person': r
        }]

    while True:
        if random() < 0.3:
            break

        data.append({
                'area': 0,
                'zone': int(random() * (max - 2)) + 1,
                'person': r
            })

    data.append({
            'area': 0,
            'zone': max - 1,
            'person': r
        })

    client.execute(
        'INSERT INTO steps (area, zone, person) VALUES',
        data
    )

Txij li thaum inserts tuaj nyob rau hauv batch, sau tau sai npaum li cas rau JanusGraph.

Tsim ob cov lus nug uas siv JOIN. Txhawm rau txav ntawm point A mus rau point B:

SELECT s1.person AS person,
       s1.zone,
       s1.when,
       s2.zone,
       s2.when
FROM
  (SELECT *
   FROM steps
   WHERE (area = 0)
     AND (zone = 0)) AS s1 ANY INNER JOIN
  (SELECT *
   FROM steps AS s2
   WHERE (area = 0)
     AND (zone = 19)) AS s2 USING person
WHERE s1.when <= s2.when

Mus txog 3 ntsiab lus:

SELECT s3.person,
       s1z,
       s1w,
       s2z,
       s2w,
       s3.zone,
       s3.when
FROM
  (SELECT s1.person AS person,
          s1.zone AS s1z,
          s1.when AS s1w,
          s2.zone AS s2z,
          s2.when AS s2w
   FROM
     (SELECT *
      FROM steps
      WHERE (area = 0)
        AND (zone = 0)) AS s1 ANY INNER JOIN
     (SELECT *
      FROM steps AS s2
      WHERE (area = 0)
        AND (zone = 3)) AS s2 USING person
   WHERE s1.when <= s2.when) p ANY INNER JOIN
  (SELECT *
   FROM steps
   WHERE (area = 0)
     AND (zone = 19)) AS s3 USING person
WHERE p.s2w <= s3.when

Cov kev thov, ntawm chav kawm, zoo heev txaus ntshai; rau kev siv tiag tiag, koj yuav tsum tsim ib lub tshuab hluav taws xob software harness. Txawm li cas los xij, lawv ua haujlwm thiab lawv ua haujlwm sai. Ob qhov kev thov thawj thiab thib ob tau ua tiav hauv tsawg dua 0.1 vib nas this. Nov yog ib qho piv txwv ntawm cov lus nug ua tiav lub sijhawm suav (*) dhau los ntawm 3 cov ntsiab lus:

SELECT count(*)
FROM 
(
    SELECT 
        s1.person AS person, 
        s1.zone AS s1z, 
        s1.when AS s1w, 
        s2.zone AS s2z, 
        s2.when AS s2w
    FROM 
    (
        SELECT *
        FROM steps
        WHERE (area = 0) AND (zone = 0)
    ) AS s1
    ANY INNER JOIN 
    (
        SELECT *
        FROM steps AS s2
        WHERE (area = 0) AND (zone = 3)
    ) AS s2 USING (person)
    WHERE s1.when <= s2.when
) AS p
ANY INNER JOIN 
(
    SELECT *
    FROM steps
    WHERE (area = 0) AND (zone = 19)
) AS s3 USING (person)
WHERE p.s2w <= s3.when

β”Œβ”€count()─┐
β”‚   11592 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1 rows in set. Elapsed: 0.068 sec. Processed 250.03 thousand rows, 8.00 MB (3.69 million rows/s., 117.98 MB/s.)

Ib daim ntawv qhia txog IOPS. Thaum cov ntaub ntawv populating, JanusGraph tau tsim kom muaj tus lej ntawm IOPS (1000-1300 rau plaub cov ntaub ntawv pej xeem xov) thiab IOWAIT tau siab heev. Nyob rau tib lub sijhawm, ClickHouse tsim cov khoom thauj tsawg kawg ntawm lub disk subsystem.

xaus

Peb txiav txim siab los siv ClickHouse los pab hom kev thov no. Peb ib txwm tuaj yeem ua kom zoo dua cov lus nug uas siv cov ntsiab lus pom thiab sib piv los ntawm kev ua ntej ua ntej qhov kev tshwm sim kwj siv Apache Flink ua ntej thauj lawv mus rau ClickHouse.

Qhov kev ua tau zoo heev uas peb yuav tsis txawm xav txog pivoting rooj programmatically. Yav dhau los, peb yuav tsum ua pivots ntawm cov ntaub ntawv retrieve los ntawm Vertica ntawm upload rau Apache Parquet.

Hmoov tsis zoo, lwm qhov kev sim siv daim duab DBMS ua tsis tiav. Kuv tsis pom JanusGraph kom muaj kev sib raug zoo ecosystem uas ua rau nws yooj yim kom nce mus nrawm nrog cov khoom. Nyob rau tib lub sijhawm, txhawm rau txhim kho lub server, txoj kev siv Java ib txwm siv, uas yuav ua rau cov neeg tsis paub Java quaj kua muag ntshav:

host: 0.0.0.0
port: 8182
threadPoolWorker: 1
gremlinPool: 8
scriptEvaluationTimeout: 30000
channelizer: org.janusgraph.channelizers.JanusGraphWsAndHttpChannelizer

graphManager: org.janusgraph.graphdb.management.JanusGraphManager
graphs: {
  ConfigurationManagementGraph: conf/janusgraph-cql-configurationgraph.properties,
  airlines: conf/airlines.properties
}

scriptEngines: {
  gremlin-groovy: {
    plugins: { org.janusgraph.graphdb.tinkerpop.plugin.JanusGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.server.jsr223.GremlinServerGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.tinkergraph.jsr223.TinkerGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.jsr223.ImportGremlinPlugin: {classImports: [java.lang.Math], methodImports: [java.lang.Math#*]},
               org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: [scripts/airline-sample.groovy]}}}}

serializers:
# GraphBinary is here to replace Gryo and Graphson
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { serializeResultToString: true }}
  # Gryo and Graphson, latest versions
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  # Older serialization versions for backwards compatibility:
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoLiteMessageSerializerV1d0, config: {ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV2d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}

processors:
  - { className: org.apache.tinkerpop.gremlin.server.op.session.SessionOpProcessor, config: { sessionTimeout: 28800000 }}
  - { className: org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor, config: { cacheExpirationTime: 600000, cacheMaxSize: 1000 }}

metrics: {
  consoleReporter: {enabled: false, interval: 180000},
  csvReporter: {enabled: false, interval: 180000, fileName: /tmp/gremlin-server-metrics.csv},
  jmxReporter: {enabled: false},
  slf4jReporter: {enabled: true, interval: 180000},
  gangliaReporter: {enabled: false, interval: 180000, addressingMode: MULTICAST},
  graphiteReporter: {enabled: false, interval: 180000}}
threadPoolBoss: 1
maxInitialLineLength: 4096
maxHeaderSize: 8192
maxChunkSize: 8192
maxContentLength: 65536
maxAccumulationBufferComponents: 1024
resultIterationBatchSize: 64
writeBufferHighWaterMark: 32768
writeBufferHighWaterMark: 65536
ssl: {
  enabled: false}

Kuv tau tswj kom txhob txwm "muab" lub BerkeleyDB version ntawm JanusGraph.

Cov ntaub ntawv yog qhov tsis ncaj ncees rau cov ntsiab lus ntawm cov indexes, txij li kev tswj hwm indexes xav kom koj ua qee yam txawv txawv shamanism hauv Groovy. Piv txwv li, tsim qhov ntsuas yuav tsum tau ua los ntawm kev sau cov lej hauv Gremlin console (uas, los ntawm txoj kev, tsis ua haujlwm tawm ntawm lub thawv). Los ntawm cov ntaub ntawv JanusGraph:

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('byNameComposite', Vertex.class).addKey(name).buildCompositeIndex()
mgmt.buildIndex('byNameAndAgeComposite', Vertex.class).addKey(name).addKey(age).buildCompositeIndex()
mgmt.commit()

//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameComposite').call()
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameAndAgeComposite').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REINDEX).get()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndAgeComposite"), SchemaAction.REINDEX).get()
mgmt.commit()

Tom qab ntawd

Hauv kev nkag siab, qhov kev sim saum toj no yog kev sib piv ntawm sov thiab mos. Yog tias koj xav txog nws, daim duab DBMS ua lwm yam haujlwm kom tau txais cov txiaj ntsig zoo ib yam. Txawm li cas los xij, ua ib feem ntawm qhov kev xeem, kuv kuj tau ua qhov kev sim nrog kev thov xws li:

g.V().hasLabel('ZoneStep').has('id',0)
    .repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',1)).count().next()

uas qhia txog kev taug kev deb. Txawm li cas los xij, txawm tias ntawm cov ntaub ntawv no, daim duab DBMS tau pom cov txiaj ntsig tau dhau ob peb feeb ... Qhov no, tau kawg, yog vim muaj qhov tseeb tias muaj txoj hauv kev zoo li. 0 -> X -> Y ... -> 1, uas lub graph engine kuj kuaj.

Tsuas yog rau cov lus nug xws li:

g.V().hasLabel('ZoneStep').has('id',0).out().has('id',1)).count().next()

Kuv tsis tuaj yeem tau txais cov lus teb tsim nyog nrog lub sijhawm ua haujlwm tsawg dua li ob.

Kev coj ncaj ncees ntawm zaj dab neeg yog tias lub tswv yim zoo nkauj thiab kev ua qauv zoo nkauj tsis coj mus rau qhov xav tau, uas tau pom tias muaj txiaj ntsig zoo dua siv cov piv txwv ntawm ClickHouse. Cov ntaub ntawv siv tau nthuav tawm hauv kab lus no yog qhov tseeb los tiv thaiv tus qauv rau daim duab DBMSs, txawm hais tias nws zoo li tsim nyog rau kev ua qauv hauv lawv cov qauv.

Tau qhov twg los: www.hab.com

Ntxiv ib saib