An experiment testing the applicability of the JanusGraph graph DBMS to the problem of finding suitable paths

Hello everyone. We are developing a product for offline traffic analysis. One of the project's tasks involves statistical analysis of visitor routes across zones.

As part of this task, users can put queries of the following kinds to the system:

  • how many visitors passed from zone "A" to zone "B";
  • how many visitors passed from zone "A" to zone "B" via zone "C" and then via zone "D";
  • how long it took a certain type of visitor to travel from zone "A" to zone "B".

and a number of similar analytical queries.

Visitor movement across zones is a directed graph. After reading around the Internet, I found that graph DBMSs are also used for analytical reports. I wanted to see how a graph DBMS would cope with such queries (TL;DR: poorly).

I chose JanusGraph as a prominent representative of open-source graph DBMSs. It relies on a stack of mature technologies, which (in my opinion) should give it decent operational characteristics:

  • storage backends: BerkeleyDB, Apache Cassandra, Scylla;
  • complex indexes can be stored in Lucene, Elasticsearch, Solr.

The authors of JanusGraph write that it is suitable for both OLTP and OLAP.

I have worked with BerkeleyDB, Apache Cassandra, Scylla and ES, and these products are used quite often in our systems, so I was optimistic about testing this graph DBMS. I found it odd that BerkeleyDB was chosen over RocksDB, but that is probably due to transaction requirements. In any case, for scalable production use, a Cassandra or Scylla backend is recommended.

I did not consider Neo4j because clustering requires the commercial edition, that is, the product is not fully open source.

Graph DBMSs say: "If it looks like a graph, treat it like a graph!" Beautiful!

First, I drew a graph schema, made exactly according to the canons of graph DBMSs:

[Figure: graph schema]

There is a Zone entity, which represents a zone. If a ZoneStep belongs to a Zone, it refers to it. Ignore the Area, ZoneTrack and Person entities; they belong to the domain and are not considered part of the test. Altogether, a chain search query for this graph structure would look like:

g.V().hasLabel('Zone').has('id',0).in_()
       .repeat(__.out()).until(__.out().hasLabel('Zone').has('id',19)).count().next()

In plain language this reads: find the Zone with ID=0, take all the vertices that have an edge pointing to it (ZoneStep), walk forward without going back until you find ZoneSteps that have an edge to the Zone with ID=19, and count the number of such chains.
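
To make the intent of that traversal concrete, here is a minimal in-memory sketch in plain Python. It is not JanusGraph code. It assumes each track is simply a list of zone IDs in visiting order; the `tracks` sample data and the function name `count_chains` are hypothetical.

```python
from typing import Iterable, Sequence


def count_chains(tracks: Iterable[Sequence[int]], start_zone: int, end_zone: int) -> int:
    """Count tracks that begin in start_zone and reach end_zone afterwards."""
    total = 0
    for track in tracks:
        # Only tracks whose first step belongs to the start zone qualify.
        if not track or track[0] != start_zone:
            continue
        # Walk forward along the chain; one hit on the end zone is enough.
        if end_zone in track[1:]:
            total += 1
    return total


# Hypothetical sample data: three tracks, all starting in zone 0.
tracks = [
    [0, 5, 7, 19],
    [0, 3, 4],
    [0, 19],
]
print(count_chains(tracks, start_zone=0, end_zone=19))
```

The graph query has to do the same forward walk, but vertex by vertex and edge by edge inside the database, which is where the cost discussed later comes from.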

I do not claim to know all the intricacies of graph search, but this query was built based on this book (https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html).

I loaded 50 thousand tracks, 3 to 20 points long, into a JanusGraph graph database using the BerkeleyDB backend, and created indexes according to the manual.

Python loading script:


from random import random
from time import time

from init import g, graph

if __name__ == '__main__':

    points = []
    max_zones = 19
    zcache = dict()
    for i in range(0, max_zones + 1):
        zcache[i] = g.addV('Zone').property('id', i).next()

    startZ = zcache[0]
    endZ = zcache[max_zones]

    for i in range(0, 10000):

        if not i % 100:
            print(i)

        start = g.addV('ZoneStep').property('time', int(time())).next()
        g.V(start).addE('belongs').to(startZ).iterate()

        while True:
            pt = g.addV('ZoneStep').property('time', int(time())).next()
            end_chain = random()
            if end_chain < 0.3:
                g.V(pt).addE('belongs').to(endZ).iterate()
                g.V(start).addE('goes').to(pt).iterate()
                break
            else:
                zone_id = int(random() * max_zones)
                g.V(pt).addE('belongs').to(zcache[zone_id]).iterate()
                g.V(start).addE('goes').to(pt).iterate()

            start = pt

    count = g.V().count().next()
    print(count)

We used a VM with 4 cores and 16 GB of RAM on an SSD. JanusGraph was deployed with this command:

docker run --name janusgraph -p8182:8182 janusgraph/janusgraph:latest

In this case, the data and the indexes used for exact-match lookups are stored in BerkeleyDB. Running the query given earlier, I got a response time of several tens of seconds.

By running 4 of the above scripts in parallel, I managed to turn the DBMS into a pumpkin, with a cheerful stream of Java stack traces (and we all love reading Java stack traces) in the Docker logs.

After some thought, I decided to simplify the graph schema to the following:

[Figure: simplified graph schema]

I decided that searching by entity attributes would be faster than searching by edges. As a result, my query became the following:

g.V().hasLabel('ZoneStep').has('id',0).repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',19)).count().next()

In plain language this reads: find the ZoneStep with ID=0, walk forward without going back until you find a ZoneStep with ID=19, and count the number of such chains.

I also simplified the loading script given above so that it no longer creates unnecessary edges and limits itself to attributes.

The query still took several seconds to complete, which was completely unacceptable for our task, since it made ad hoc queries of any kind impractical.

I tried running JanusGraph with Scylla, as the fastest Cassandra implementation, but this did not lead to any significant change in performance either.

So even though it "looks like a graph", I could not get the graph DBMS to process it quickly. I assume there is something I do not know and that JanusGraph can be made to perform this search in a fraction of a second, but I was unable to do it.

Since the problem still had to be solved, I started thinking about JOINs and table pivots, which did not inspire optimism in terms of elegance, but could be a perfectly workable option in practice.

Our project already uses ClickHouse, so I decided to test my research on this analytical DBMS.

I deployed ClickHouse using a simple recipe:

sudo docker run -d --name clickhouse_1 \
     --ulimit nofile=262144:262144 \
     -v /opt/clickhouse/log:/var/log/clickhouse-server \
     -v /opt/clickhouse/data:/var/lib/clickhouse \
     yandex/clickhouse-server

I created a database and a table in it like this:

CREATE TABLE 
db.steps (`area` Int64, `when` DateTime64(1, 'Europe/Moscow') DEFAULT now64(), `zone` Int64, `person` Int64) 
ENGINE = MergeTree() ORDER BY (area, zone, person) SETTINGS index_granularity = 8192

I filled it with data using the following script:

from time import time

from clickhouse_driver import Client
from random import random

client = Client('vm-12c2c34c-df68-4a98-b1e5-a4d1cef1acff.domain',
                database='db',
                password='secret')

max = 20

for r in range(0, 100000):

    if r % 1000 == 0:
        print("CNT: {}, TS: {}".format(r, time()))

    data = [{
            'area': 0,
            'zone': 0,
            'person': r
        }]

    while True:
        if random() < 0.3:
            break

        data.append({
                'area': 0,
                'zone': int(random() * (max - 2)) + 1,
                'person': r
            })

    data.append({
            'area': 0,
            'zone': max - 1,
            'person': r
        })

    client.execute(
        'INSERT INTO steps (area, zone, person) VALUES',
        data
    )

Since the inserts go in batches, loading was much faster than with JanusGraph.
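
The loader above sends one INSERT per visitor. A possible next step, sketched here under assumptions, is to accumulate rows for many visitors and flush them in larger batches, since ClickHouse generally prefers fewer, larger inserts. The helper names (`make_track_rows`, `insert_in_batches`), the batch size and the seed are hypothetical; `execute` stands in for the bound `client.execute` method of `clickhouse_driver`, so the sketch itself needs no server.

```python
from random import Random
from typing import Callable, Dict, List

Row = Dict[str, int]


def make_track_rows(person: int, zone_count: int, rng: Random) -> List[Row]:
    """Build one visitor's track: zone 0, random middle zones, then the last zone."""
    rows = [{'area': 0, 'zone': 0, 'person': person}]
    while rng.random() >= 0.3:
        # Middle zones fall in 1 .. zone_count - 2, as in the loader above.
        rows.append({'area': 0, 'zone': rng.randrange(1, zone_count - 1), 'person': person})
    rows.append({'area': 0, 'zone': zone_count - 1, 'person': person})
    return rows


def insert_in_batches(execute: Callable[[str, List[Row]], object],
                      persons: int,
                      zone_count: int = 20,
                      batch_rows: int = 10_000,
                      seed: int = 0) -> int:
    """Generate tracks and flush them in batches; return the number of rows sent."""
    rng = Random(seed)
    query = 'INSERT INTO steps (area, zone, person) VALUES'
    buffer: List[Row] = []
    sent = 0
    for person in range(persons):
        buffer.extend(make_track_rows(person, zone_count, rng))
        if len(buffer) >= batch_rows:
            execute(query, buffer)
            sent += len(buffer)
            buffer = []  # start a fresh list so the flushed batch is left untouched
    if buffer:
        # Flush whatever is left after the last full batch.
        execute(query, buffer)
        sent += len(buffer)
    return sent
```

With a real connection this would be called as `insert_in_batches(client.execute, 100000)`.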

I built two queries using JOIN. To go from point A to point B:

SELECT s1.person AS person,
       s1.zone,
       s1.when,
       s2.zone,
       s2.when
FROM
  (SELECT *
   FROM steps
   WHERE (area = 0)
     AND (zone = 0)) AS s1 ANY INNER JOIN
  (SELECT *
   FROM steps AS s2
   WHERE (area = 0)
     AND (zone = 19)) AS s2 USING person
WHERE s1.when <= s2.when
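
For readers less used to this kind of join, here is a rough in-memory Python equivalent of the A-to-B query. It is only a sketch: it assumes that keeping one row per person and zone is an acceptable stand-in for what `ANY` selects (in the generated data each person has exactly one row for zone 0 and one for zone 19, so the choice does not matter there), and it models `when` as a plain number. The `Step` type and the function name `a_to_b` are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Step(NamedTuple):
    person: int
    zone: int
    when: float  # timestamp in seconds; a simplification of DateTime64


def a_to_b(steps: Iterable[Step], zone_a: int, zone_b: int) -> List[Tuple[int, float, float]]:
    """Return (person, when_a, when_b) for persons seen in zone_a and then in zone_b."""
    seen_a: Dict[int, float] = {}
    seen_b: Dict[int, float] = {}
    for s in steps:
        # Keep a single row per person for each side of the join.
        if s.zone == zone_a:
            seen_a.setdefault(s.person, s.when)
        if s.zone == zone_b:
            seen_b.setdefault(s.person, s.when)
    result = []
    for person, when_a in seen_a.items():
        when_b = seen_b.get(person)
        # Same filter as WHERE s1.when <= s2.when in the SQL above.
        if when_b is not None and when_a <= when_b:
            result.append((person, when_a, when_b))
    return result


# Hypothetical sample rows.
steps = [
    Step(1, 0, 10.0), Step(1, 5, 20.0), Step(1, 19, 30.0),
    Step(2, 0, 15.0),
    Step(3, 19, 5.0), Step(3, 0, 9.0),
]
pairs = a_to_b(steps, zone_a=0, zone_b=19)
durations = [when_b - when_a for _, when_a, when_b in pairs]
print(pairs, durations)
```

The `durations` list also hints at how the "how long does the trip take" question from the introduction could be answered from the same joined rows.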

To pass through 3 points:

SELECT s3.person,
       s1z,
       s1w,
       s2z,
       s2w,
       s3.zone,
       s3.when
FROM
  (SELECT s1.person AS person,
          s1.zone AS s1z,
          s1.when AS s1w,
          s2.zone AS s2z,
          s2.when AS s2w
   FROM
     (SELECT *
      FROM steps
      WHERE (area = 0)
        AND (zone = 0)) AS s1 ANY INNER JOIN
     (SELECT *
      FROM steps AS s2
      WHERE (area = 0)
        AND (zone = 3)) AS s2 USING person
   WHERE s1.when <= s2.when) p ANY INNER JOIN
  (SELECT *
   FROM steps
   WHERE (area = 0)
     AND (zone = 19)) AS s3 USING person
WHERE p.s2w <= s3.when

The queries, of course, look rather scary; for real use you would need to build a software generator harness around them. However, they work, and they work fast. Both the first and the second query complete in less than 0.1 seconds. Here is an example of the execution time for a count(*) query passing through 3 points:

SELECT count(*)
FROM 
(
    SELECT 
        s1.person AS person, 
        s1.zone AS s1z, 
        s1.when AS s1w, 
        s2.zone AS s2z, 
        s2.when AS s2w
    FROM 
    (
        SELECT *
        FROM steps
        WHERE (area = 0) AND (zone = 0)
    ) AS s1
    ANY INNER JOIN 
    (
        SELECT *
        FROM steps AS s2
        WHERE (area = 0) AND (zone = 3)
    ) AS s2 USING (person)
    WHERE s1.when <= s2.when
) AS p
ANY INNER JOIN 
(
    SELECT *
    FROM steps
    WHERE (area = 0) AND (zone = 19)
) AS s3 USING (person)
WHERE p.s2w <= s3.when

┌─count()─┐
│   11592 │
└─────────┘

1 rows in set. Elapsed: 0.068 sec. Processed 250.03 thousand rows, 8.00 MB (3.69 million rows/s., 117.98 MB/s.)
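
The "software generator" idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the generator used in the project: the function names `zone_subquery` and `build_path_query` and the `w1`, `w2`, ... column aliases are hypothetical, and the generated SQL follows the shape of the hand-written queries above but has not been run against a server.

```python
from typing import Sequence


def zone_subquery(area: int, zone: int) -> str:
    """Subquery selecting all steps in one zone; int() guards against injected text."""
    return f"(SELECT * FROM steps WHERE (area = {int(area)}) AND (zone = {int(zone)}))"


def build_path_query(zones: Sequence[int], area: int = 0, count_only: bool = False) -> str:
    """Build a nested ANY INNER JOIN query for visitors passing the zones in order."""
    if len(zones) < 2:
        raise ValueError("at least two zones are required")
    # Innermost query: join the first two zones on person.
    query = (
        "SELECT s1.person AS person, s1.when AS w1, s2.when AS w2\n"
        f"FROM {zone_subquery(area, zones[0])} AS s1\n"
        f"ANY INNER JOIN {zone_subquery(area, zones[1])} AS s2 USING (person)\n"
        "WHERE s1.when <= s2.when"
    )
    # Each further zone wraps the previous query and adds one more join.
    for i, zone in enumerate(zones[2:], start=3):
        prev_cols = ", ".join(f"p.w{k} AS w{k}" for k in range(1, i))
        query = (
            f"SELECT p.person AS person, {prev_cols}, s{i}.when AS w{i}\n"
            f"FROM ({query}) AS p\n"
            f"ANY INNER JOIN {zone_subquery(area, zone)} AS s{i} USING (person)\n"
            f"WHERE p.w{i - 1} <= s{i}.when"
        )
    if count_only:
        query = f"SELECT count(*) FROM ({query})"
    return query


print(build_path_query([0, 3, 19], count_only=True))
```

Each extra zone costs one more level of nesting, so the query text grows linearly with the length of the path.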

A note on IOPS. While loading data, JanusGraph generated a fairly high number of IOPS (1000-1300 for four data-loading threads) and IOWAIT was fairly high. At the same time, ClickHouse generated minimal load on the disk subsystem.

Conclusion

We decided to use ClickHouse to serve this type of query. We can always optimize the queries further with views and parallelization, by pre-processing the event stream with Apache Flink before loading it into ClickHouse.

Performance is so good that we probably will not even have to think about pivoting tables programmatically. Previously, we had to pivot data retrieved from Vertica via export to Apache Parquet.

Unfortunately, yet another attempt to use a graph DBMS was unsuccessful. I did not find JanusGraph to have a friendly ecosystem that makes it easy to get up to speed with the product. On top of that, the server is configured the traditional Java way, which will make people unfamiliar with Java weep tears of blood:

host: 0.0.0.0
port: 8182
threadPoolWorker: 1
gremlinPool: 8
scriptEvaluationTimeout: 30000
channelizer: org.janusgraph.channelizers.JanusGraphWsAndHttpChannelizer

graphManager: org.janusgraph.graphdb.management.JanusGraphManager
graphs: {
  ConfigurationManagementGraph: conf/janusgraph-cql-configurationgraph.properties,
  airlines: conf/airlines.properties
}

scriptEngines: {
  gremlin-groovy: {
    plugins: { org.janusgraph.graphdb.tinkerpop.plugin.JanusGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.server.jsr223.GremlinServerGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.tinkergraph.jsr223.TinkerGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.jsr223.ImportGremlinPlugin: {classImports: [java.lang.Math], methodImports: [java.lang.Math#*]},
               org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: [scripts/airline-sample.groovy]}}}}

serializers:
# GraphBinary is here to replace Gryo and Graphson
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { serializeResultToString: true }}
  # Gryo and Graphson, latest versions
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  # Older serialization versions for backwards compatibility:
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoLiteMessageSerializerV1d0, config: {ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV2d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}

processors:
  - { className: org.apache.tinkerpop.gremlin.server.op.session.SessionOpProcessor, config: { sessionTimeout: 28800000 }}
  - { className: org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor, config: { cacheExpirationTime: 600000, cacheMaxSize: 1000 }}

metrics: {
  consoleReporter: {enabled: false, interval: 180000},
  csvReporter: {enabled: false, interval: 180000, fileName: /tmp/gremlin-server-metrics.csv},
  jmxReporter: {enabled: false},
  slf4jReporter: {enabled: true, interval: 180000},
  gangliaReporter: {enabled: false, interval: 180000, addressingMode: MULTICAST},
  graphiteReporter: {enabled: false, interval: 180000}}
threadPoolBoss: 1
maxInitialLineLength: 4096
maxHeaderSize: 8192
maxChunkSize: 8192
maxContentLength: 65536
maxAccumulationBufferComponents: 1024
resultIterationBatchSize: 64
writeBufferLowWaterMark: 32768
writeBufferHighWaterMark: 65536
ssl: {
  enabled: false}

I managed to accidentally "bring down" the BerkeleyDB version of JanusGraph.

The documentation is fairly convoluted when it comes to indexes, since managing indexes requires some rather strange shamanism in Groovy. For example, an index has to be created by writing code in the Gremlin console (which, by the way, does not work out of the box). From the official JanusGraph documentation:

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('byNameComposite', Vertex.class).addKey(name).buildCompositeIndex()
mgmt.buildIndex('byNameAndAgeComposite', Vertex.class).addKey(name).addKey(age).buildCompositeIndex()
mgmt.commit()

//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameComposite').call()
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameAndAgeComposite').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REINDEX).get()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndAgeComposite"), SchemaAction.REINDEX).get()
mgmt.commit()

Afterword

In a sense, the experiment above is a comparison of warm with soft, that is, of apples with oranges. If you think about it, a graph DBMS performs different operations to obtain the same results. However, as part of the tests, I also ran an experiment with a query like:

g.V().hasLabel('ZoneStep').has('id',0)
    .repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',1)).count().next()

which reflects walking distance. However, even on such data the graph DBMS showed results exceeding a few seconds... This, of course, is because there were paths like 0 -> X -> Y ... -> 1, which the graph engine also checked.

Even for a query like:

g.V().hasLabel('ZoneStep').has('id',0).out().has('id',1).count().next()

I could not get a usable response with a processing time under one second.

The moral of the story is that a beautiful idea and paradigmatic modeling do not necessarily lead to the desired result, as ClickHouse demonstrates here with far higher efficiency. The use case presented in this article is a clear anti-pattern for graph DBMSs, even though it looks well suited to modeling in their paradigm.

Source: www.habr.com
