An experiment testing the applicability of the JanusGraph graph DBMS to the problem of finding matching paths

Hello everyone. We are developing a product for offline traffic analysis. The project includes a task related to the statistical analysis of visitor movement routes across zones.

As part of this task, users can ask the system queries of the following form:

  • how many visitors went from zone "A" to zone "B";
  • how many visitors went from zone "A" to zone "B" via zone "C" and then via zone "D";
  • how long it took a visitor of a certain type to get from zone "A" to zone "B".

and a number of similar analytical queries.
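To make these query types concrete, here is a minimal pure-Python sketch over in-memory data. The `tracks` layout (a person id mapped to a time-ordered list of `(zone, timestamp)` pairs) and all function names are assumptions made for illustration; they are not part of the product, and the visitor-type filter is left out for brevity.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Track = List[Tuple[str, int]]  # time-ordered (zone, timestamp) pairs


def first_pass_times(track: Track, zones: Sequence[str]) -> Optional[List[int]]:
    """Timestamps at which the track visits `zones` in the given order, or None."""
    times: List[int] = []
    idx = 0
    for zone, ts in track:
        # Greedily match the next expected zone; other zones are skipped.
        if idx < len(zones) and zone == zones[idx]:
            times.append(ts)
            idx += 1
    return times if idx == len(zones) else None


def count_visitors(tracks: Dict[int, Track], zones: Sequence[str]) -> int:
    """How many visitors passed through all `zones` in order."""
    return sum(1 for t in tracks.values() if first_pass_times(t, zones) is not None)


def mean_travel_time(tracks: Dict[int, Track], a: str, b: str) -> Optional[float]:
    """Average time between the matched visit to `a` and the later visit to `b`."""
    durations = []
    for t in tracks.values():
        times = first_pass_times(t, [a, b])
        if times is not None:
            durations.append(times[1] - times[0])
    return sum(durations) / len(durations) if durations else None


# Hypothetical sample data: three visitors.
tracks = {
    1: [("A", 0), ("C", 5), ("B", 9)],
    2: [("A", 1), ("B", 4), ("C", 6), ("D", 8)],
    3: [("B", 2), ("A", 3)],
}

print(count_visitors(tracks, ["A", "B"]))       # 2
print(count_visitors(tracks, ["A", "C", "B"]))  # 1
print(mean_travel_time(tracks, "A", "B"))       # 6.0
```

A scan like this is fine for a handful of tracks; the rest of the article is about getting the same answers from a DBMS at scale.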

Visitor movement across zones forms a directed graph. After reading around the Internet, I found that graph DBMSs are also used for analytical reports. I wanted to see how a graph DBMS would cope with such queries (TL;DR: poorly).

I chose JanusGraph as a prominent representative of open-source graph DBMSs. It relies on a stack of mature technologies which (in my opinion) should give it decent operational characteristics:

  • storage backends: BerkeleyDB, Apache Cassandra, Scylla;
  • complex indexes can be stored in Lucene, Elasticsearch, Solr.

The authors of JanusGraph write that it is suitable for both OLTP and OLAP.

I have worked with BerkeleyDB, Apache Cassandra, Scylla and ES, and these products are often used in our systems, so I was optimistic about testing this graph DBMS. I found it odd that BerkeleyDB was chosen over RocksDB, but that probably has to do with transaction requirements. In any case, for scalable production use a Cassandra or Scylla backend is recommended.

I did not consider Neo4j, because clustering requires the commercial version, that is, the product is not open.

Graph DBMSs say: "If it looks like a graph, treat it like a graph!" Beautiful!

First I drew the graph schema, made strictly according to the canons of graph DBMSs:

[Figure: graph schema]

There is an entity Zone, which represents a zone. If a ZoneStep belongs to a Zone, it refers to it. Ignore the entities Area, ZoneTrack and Person; they belong to the domain and are not considered in the test. For such a graph structure, a chain search query looks like this:

g.V().hasLabel('Zone').has('id',0).in_()
       .repeat(__.out()).until(__.out().hasLabel('Zone').has('id',19)).count().next()

In plain language this reads roughly as follows: find the Zone with ID = 0, take all vertices that have an edge into it (ZoneStep), walk forward without going back until you find a ZoneStep that has an edge to the Zone with ID = 19, and count the number of such chains.

I do not claim to know all the intricacies of graph search, but this query was written on the basis of this book (https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html).

I loaded 50 thousand tracks, from 3 to 20 points long, into JanusGraph with the BerkeleyDB backend and created the indexes according to the manual.

Python loading script:


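from random import random
from time import time

# `init` is a local module that the article does not show; it is assumed to
# expose a connected gremlin-python traversal source `g`.
from init import g, graph

if __name__ == '__main__':

    max_zones = 19
    zcache = dict()
    # Create the Zone vertices (ids 0..max_zones) and cache them for edge creation.
    for i in range(0, max_zones + 1):
        zcache[i] = g.addV('Zone').property('id', i).next()

    startZ = zcache[0]
    endZ = zcache[max_zones]

    # Each iteration writes one track: a chain of ZoneStep vertices.
    for i in range(0, 10000):

        if not i % 100:
            print(i)  # progress indicator

        # Every track starts in zone 0.
        start = g.addV('ZoneStep').property('time', int(time())).next()
        g.V(start).addE('belongs').to(startZ).iterate()

        while True:
            pt = g.addV('ZoneStep').property('time', int(time())).next()
            end_chain = random()
            if end_chain < 0.3:
                # With probability 0.3 the track ends in the last zone.
                g.V(pt).addE('belongs').to(endZ).iterate()
                g.V(start).addE('goes').to(pt).iterate()
                break
            else:
                # Otherwise the step lands in a random zone 0..max_zones - 1.
                zone_id = int(random() * max_zones)
                g.V(pt).addE('belongs').to(zcache[zone_id]).iterate()
                g.V(start).addE('goes').to(pt).iterate()

            start = pt

    count = g.V().count().next()
    print(count)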

I used a VM with 4 cores and 16 GB of RAM on SSD. JanusGraph was deployed with the following command:

docker run --name janusgraph -p8182:8182 janusgraph/janusgraph:latest

In this setup, the data and the indexes used for exact-match lookups are stored in BerkeleyDB. Running the query given earlier took several tens of seconds.

By running 4 of the above scripts in parallel, I managed to turn the DBMS into a pumpkin, with a cheerful stream of Java stack traces (and we all love reading Java stack traces) in the Docker logs.

On reflection, I decided to simplify the graph schema to the following:

[Figure: simplified graph schema]

I decided that lookups by entity attributes would be faster than lookups by edges. As a result, my query became:

g.V().hasLabel('ZoneStep').has('id',0).repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',19)).count().next()

In plain language this reads roughly as follows: find a ZoneStep with ID = 0, walk forward without going back until you find a ZoneStep with ID = 19, and count the number of such chains.
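The walk described above can be emulated in plain Python to show what the engine has to explore. This is only a sketch of the traversal semantics, not of how JanusGraph executes it; the `edges` adjacency map, the `zone_id` attribute map and the function name are hypothetical.

```python
from typing import Dict, List


def count_chains(edges: Dict[str, List[str]],
                 zone_id: Dict[str, int],
                 src_zone: int,
                 dst_zone: int) -> int:
    """Count simple paths from any vertex with src_zone to a vertex with dst_zone."""
    found = 0
    # Each stack entry is the path walked so far, starting at a source vertex.
    stack = [[v] for v, z in zone_id.items() if z == src_zone]
    while stack:
        path = stack.pop()
        for nxt in edges.get(path[-1], []):
            if nxt in path:
                # simplePath(): never revisit a vertex on the same path.
                continue
            if zone_id[nxt] == dst_zone:
                # until(...): the walker stops here and is counted.
                found += 1
            else:
                # repeat(out()): keep walking from the new vertex.
                stack.append(path + [nxt])
    return found


# Three hypothetical tracks; only the first two reach zone 19.
edges = {"s1": ["s2"], "s2": ["s3"], "t1": ["t2"], "u1": ["u2"]}
zone_id = {"s1": 0, "s2": 5, "s3": 19, "t1": 0, "t2": 19, "u1": 0, "u2": 7}

print(count_chains(edges, zone_id, 0, 19))  # 2
```

Note that every walker that never reaches the target zone is still followed to the end of its chain, so the work grows with the total number of steps stored, not with the number of matches.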

I also simplified the loading script above accordingly, so that it no longer creates the unnecessary edges and uses attributes instead.

The query still took several seconds, which is completely unacceptable for our task, since it rules out arbitrary ad hoc queries.

I tried deploying JanusGraph with Scylla, as the fastest Cassandra implementation, but that did not lead to any significant change in performance either.

So even though "it looks like a graph", I could not get the graph DBMS to process it quickly. I fully admit that I may be missing something and that JanusGraph can be made to perform this search in a fraction of a second; however, I did not manage it.

Since I still had to solve the problem, I started thinking about JOINs and pivot tables, which did not inspire optimism in terms of elegance but could be quite a workable option in practice.

Our project already uses ClickHouse, so I decided to test my ideas on this analytical DBMS.

I deployed ClickHouse using a simple recipe:

sudo docker run -d --name clickhouse_1 \
     --ulimit nofile=262144:262144 \
     -v /opt/clickhouse/log:/var/log/clickhouse-server \
     -v /opt/clickhouse/data:/var/lib/clickhouse \
     yandex/clickhouse-server

I created a database in it and a table of the following form:

CREATE TABLE 
db.steps (`area` Int64, `when` DateTime64(1, 'Europe/Moscow') DEFAULT now64(), `zone` Int64, `person` Int64) 
ENGINE = MergeTree() ORDER BY (area, zone, person) SETTINGS index_granularity = 8192

I filled it with data using the following script:

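from random import random
from time import time

from clickhouse_driver import Client

client = Client('vm-12c2c34c-df68-4a98-b1e5-a4d1cef1acff.domain',
                database='db',
                password='secret')

# Number of zones (ids 0..max_zones - 1).
max_zones = 20

# One iteration per visitor; `r` doubles as the person id.
for r in range(0, 100000):

    if r % 1000 == 0:
        print("CNT: {}, TS: {}".format(r, time()))

    # Every track starts in zone 0.
    data = [{
            'area': 0,
            'zone': 0,
            'person': r
        }]

    # Add intermediate steps; the loop stops with probability 0.3 per step.
    while True:
        if random() < 0.3:
            break

        data.append({
                'area': 0,
                'zone': int(random() * (max_zones - 2)) + 1,
                'person': r
            })

    # Every track ends in the last zone.
    data.append({
            'area': 0,
            'zone': max_zones - 1,
            'person': r
        })

    # One batched INSERT per visitor; `when` is filled by its DEFAULT.
    client.execute(
        'INSERT INTO steps (area, zone, person) VALUES',
        data
    )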

Since the inserts go in batches, loading was much faster than with JanusGraph.
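The script above already sends one INSERT per visitor. A further step in the same direction would be to accumulate rows from many visitors and flush them in larger batches. The helper below is a sketch of that idea; `insert_in_batches` and the `execute` callable are hypothetical names, and the default batch size is an arbitrary assumption rather than a measured optimum.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, int]


def insert_in_batches(rows: Iterable[Row],
                      execute: Callable[[List[Row]], None],
                      batch_size: int = 10000) -> int:
    """Send rows through `execute` in chunks; return the number of calls made."""
    batch: List[Row] = []
    calls = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            execute(batch)
            calls += 1
            batch = []  # start a fresh list so the sent batch is not mutated
    if batch:
        # Flush the final, possibly smaller, batch.
        execute(batch)
        calls += 1
    return calls


# Stand-in for a real client: just collect the batches that would be sent.
received: List[List[Row]] = []
rows = [{"area": 0, "zone": i % 20, "person": i} for i in range(25)]
calls = insert_in_batches(rows, received.append, batch_size=10)

print(calls)                       # 3
print([len(b) for b in received])  # [10, 10, 5]
```

With the driver used in the article, `execute` could be a small function that forwards each batch to `client.execute('INSERT INTO steps (area, zone, person) VALUES', batch)`.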

I constructed two queries using JOIN. To go from point A to point B:

SELECT s1.person AS person,
       s1.zone,
       s1.when,
       s2.zone,
       s2.when
FROM
  (SELECT *
   FROM steps
   WHERE (area = 0)
     AND (zone = 0)) AS s1 ANY INNER JOIN
  (SELECT *
   FROM steps AS s2
   WHERE (area = 0)
     AND (zone = 19)) AS s2 USING person
WHERE s1.when <= s2.when

To pass through 3 points:

SELECT s3.person,
       s1z,
       s1w,
       s2z,
       s2w,
       s3.zone,
       s3.when
FROM
  (SELECT s1.person AS person,
          s1.zone AS s1z,
          s1.when AS s1w,
          s2.zone AS s2z,
          s2.when AS s2w
   FROM
     (SELECT *
      FROM steps
      WHERE (area = 0)
        AND (zone = 0)) AS s1 ANY INNER JOIN
     (SELECT *
      FROM steps AS s2
      WHERE (area = 0)
        AND (zone = 3)) AS s2 USING person
   WHERE s1.when <= s2.when) p ANY INNER JOIN
  (SELECT *
   FROM steps
   WHERE (area = 0)
     AND (zone = 19)) AS s3 USING person
WHERE p.s2w <= s3.when

The queries, of course, look rather scary; for real use you would need to write a query generator around them. However, they work, and they work fast. Both the first and the second query complete in under 0.1 seconds. Here is an example of the execution time of the count(*) query for passing through 3 points:

SELECT count(*)
FROM 
(
    SELECT 
        s1.person AS person, 
        s1.zone AS s1z, 
        s1.when AS s1w, 
        s2.zone AS s2z, 
        s2.when AS s2w
    FROM 
    (
        SELECT *
        FROM steps
        WHERE (area = 0) AND (zone = 0)
    ) AS s1
    ANY INNER JOIN 
    (
        SELECT *
        FROM steps AS s2
        WHERE (area = 0) AND (zone = 3)
    ) AS s2 USING (person)
    WHERE s1.when <= s2.when
) AS p
ANY INNER JOIN 
(
    SELECT *
    FROM steps
    WHERE (area = 0) AND (zone = 19)
) AS s3 USING (person)
WHERE p.s2w <= s3.when

┌─count()─┐
│   11592 │
└─────────┘

1 rows in set. Elapsed: 0.068 sec. Processed 250.03 thousand rows, 8.00 MB (3.69 million rows/s., 117.98 MB/s.)
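The article notes that real use would need a generator for such queries. The following is a minimal sketch of one, assuming the `steps` table defined earlier; the function names are hypothetical. The generated text follows the same nested ANY INNER JOIN pattern as the queries above, but it has not been run against a ClickHouse server here.

```python
from typing import Sequence


def zone_subquery(area: int, zone: int) -> str:
    """Subquery selecting the visits to one zone (values are forced to int)."""
    return (
        "SELECT person, `when` FROM steps "
        f"WHERE (area = {int(area)}) AND (zone = {int(zone)})"
    )


def build_path_count_query(area: int, zones: Sequence[int]) -> str:
    """Build a count(*) query for visitors passing through `zones` in order."""
    if len(zones) < 2:
        raise ValueError("need at least two zones")
    # Innermost level: visits to the first zone.
    query = (
        "SELECT person, `when` AS last_when FROM steps "
        f"WHERE (area = {int(area)}) AND (zone = {int(zones[0])})"
    )
    # Each further zone wraps the previous level in one more join.
    for zone in zones[1:]:
        query = (
            "SELECT p.person AS person, s.`when` AS last_when "
            f"FROM ({query}) AS p "
            f"ANY INNER JOIN ({zone_subquery(area, zone)}) AS s USING (person) "
            "WHERE p.last_when <= s.`when`"
        )
    return f"SELECT count(*) FROM ({query})"


sql = build_path_count_query(area=0, zones=[0, 3, 19])
print(sql)
```

Casting every zone and area to `int` keeps user input out of the SQL text; a production generator would more likely rely on the driver's parameter substitution.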

A note on IOPS. While loading data, JanusGraph generated a fairly high number of IOPS (1000-1300 for four data-loading threads), and IOWAIT was quite high. At the same time, ClickHouse put minimal load on the disk subsystem.

Conclusion

We decided to use ClickHouse to serve this type of query. We can always optimize the queries further with materialized views and parallelization, by pre-processing the event stream with Apache Flink before loading it into ClickHouse.

Performance is so good that we will probably not even have to think about pivoting tables programmatically. Previously, we had to pivot data retrieved from Vertica by exporting it to Apache Parquet.

Unfortunately, yet another attempt to use a graph DBMS was unsuccessful. I did not find that JanusGraph has a friendly ecosystem that lets you get up to speed with the product quickly. At the same time, the server is configured in the traditional Java way, which will make people unfamiliar with Java weep tears of blood:

host: 0.0.0.0
port: 8182
threadPoolWorker: 1
gremlinPool: 8
scriptEvaluationTimeout: 30000
channelizer: org.janusgraph.channelizers.JanusGraphWsAndHttpChannelizer

graphManager: org.janusgraph.graphdb.management.JanusGraphManager
graphs: {
  ConfigurationManagementGraph: conf/janusgraph-cql-configurationgraph.properties,
  airlines: conf/airlines.properties
}

scriptEngines: {
  gremlin-groovy: {
    plugins: { org.janusgraph.graphdb.tinkerpop.plugin.JanusGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.server.jsr223.GremlinServerGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.tinkergraph.jsr223.TinkerGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.jsr223.ImportGremlinPlugin: {classImports: [java.lang.Math], methodImports: [java.lang.Math#*]},
               org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: [scripts/airline-sample.groovy]}}}}

serializers:
# GraphBinary is here to replace Gryo and Graphson
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { serializeResultToString: true }}
  # Gryo and Graphson, latest versions
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  # Older serialization versions for backwards compatibility:
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoLiteMessageSerializerV1d0, config: {ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV2d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}

processors:
  - { className: org.apache.tinkerpop.gremlin.server.op.session.SessionOpProcessor, config: { sessionTimeout: 28800000 }}
  - { className: org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor, config: { cacheExpirationTime: 600000, cacheMaxSize: 1000 }}

metrics: {
  consoleReporter: {enabled: false, interval: 180000},
  csvReporter: {enabled: false, interval: 180000, fileName: /tmp/gremlin-server-metrics.csv},
  jmxReporter: {enabled: false},
  slf4jReporter: {enabled: true, interval: 180000},
  gangliaReporter: {enabled: false, interval: 180000, addressingMode: MULTICAST},
  graphiteReporter: {enabled: false, interval: 180000}}
threadPoolBoss: 1
maxInitialLineLength: 4096
maxHeaderSize: 8192
maxChunkSize: 8192
maxContentLength: 65536
maxAccumulationBufferComponents: 1024
resultIterationBatchSize: 64
writeBufferLowWaterMark: 32768
writeBufferHighWaterMark: 65536
ssl: {
  enabled: false}

I even managed to "take down" the BerkeleyDB version of JanusGraph.

The documentation is rather convoluted when it comes to indexes, because managing indexes requires some rather strange shamanism in Groovy. For example, an index has to be created by writing code in the Gremlin console (which, by the way, does not work out of the box). From the official JanusGraph documentation:

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('byNameComposite', Vertex.class).addKey(name).buildCompositeIndex()
mgmt.buildIndex('byNameAndAgeComposite', Vertex.class).addKey(name).addKey(age).buildCompositeIndex()
mgmt.commit()

//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameComposite').call()
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameAndAgeComposite').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REINDEX).get()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndAgeComposite"), SchemaAction.REINDEX).get()
mgmt.commit()

Afterword

In a sense, the experiment above compares apples with oranges. If you think about it, a graph DBMS performs different operations to obtain the same results. However, as part of the tests I also experimented with a query like this:

g.V().hasLabel('ZoneStep').has('id',0)
    .repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',1)).count().next()

which reflects walking distance. However, even on such data the graph DBMS showed times of more than a few seconds. This, of course, is because there were paths of the form 0 -> X -> Y ... -> 1, which the graph engine also checked.

Even for a query of the form:

g.V().hasLabel('ZoneStep').has('id',0).out().has('id',1).count().next()

I could not get a usable response with a processing time of less than a second.

The moral of the fable is that a beautiful idea and paradigmatic modeling do not by themselves lead to the desired result, which ClickHouse delivers here with much higher efficiency. The use case presented in this article is a clear anti-pattern for graph DBMSs, even though it seems well suited to modeling in their paradigm.

Source: www.habr.com
