An experiment testing the applicability of the JanusGraph graph DBMS to the problem of finding suitable paths

Hello everyone. We are developing a product for offline traffic analysis. The project includes a task related to the statistical analysis of visitor routes across zones.

As part of this task, users can ask the system queries of the following kind:

  • how many visitors passed from zone "A" to zone "B";
  • how many visitors passed from zone "A" to zone "B" via zone "C" and then via zone "D";
  • how long it took a certain type of visitor to travel from zone "A" to zone "B".

and a number of similar analytical queries.
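To make the three query types concrete, here is a minimal in-memory sketch in Python. It assumes that each visitor's track is a time-ordered list of (zone, timestamp) pairs. The `Track` alias and all helper names are hypothetical and exist only for illustration; they are not part of the product or of any DBMS.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed representation: a track is a time-ordered list of (zone, timestamp).
Track = List[Tuple[str, int]]


def first_pass_times(track: Track, zones: Sequence[str]) -> Optional[List[int]]:
    """Return the timestamps at which the track visits `zones` in order,
    or None if the track never passes through them in that order."""
    times: List[int] = []
    wanted = iter(zones)
    target = next(wanted, None)
    for zone, ts in track:
        if target is None:
            break
        if zone == target:
            times.append(ts)
            target = next(wanted, None)
    return times if target is None else None


def count_visitors(tracks: Sequence[Track], zones: Sequence[str]) -> int:
    """How many tracks pass through all `zones` in the given order."""
    return sum(1 for t in tracks if first_pass_times(t, zones) is not None)


def travel_time(track: Track, start: str, end: str) -> Optional[int]:
    """Time between the first visit to `start` and the next visit to `end`."""
    times = first_pass_times(track, [start, end])
    return None if times is None else times[1] - times[0]


tracks = [
    [("A", 0), ("C", 5), ("D", 9), ("B", 12)],
    [("A", 1), ("B", 4)],
    [("A", 2), ("D", 3), ("C", 6), ("B", 8)],
    [("C", 0), ("B", 7)],
]
print(count_visitors(tracks, ["A", "B"]))            # from A to B
print(count_visitors(tracks, ["A", "C", "D", "B"]))  # from A to B via C, then D
print(travel_time(tracks[0], "A", "B"))              # duration for one visitor
```

The rest of the article is about getting a DBMS to answer the same questions over many tracks quickly.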

Visitor movement across zones is a directed graph. After reading around the Internet, I found that graph DBMSs are also used for analytical reports. I wanted to see how graph DBMSs would cope with such queries (TL;DR: poorly).
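As a small illustration of why the data looks like a directed graph, the sketch below turns tracks into weighted directed edges between zones. The track format (a list of zone ids per visitor) and the function name are assumptions made for this example only.

```python
from collections import Counter
from typing import List, Tuple


def build_transition_graph(tracks: List[List[int]]) -> "Counter[Tuple[int, int]]":
    """Count directed edges (from_zone, to_zone) over consecutive steps."""
    edges: "Counter[Tuple[int, int]]" = Counter()
    for track in tracks:
        for src, dst in zip(track, track[1:]):
            edges[(src, dst)] += 1
    return edges


tracks = [[0, 3, 19], [0, 19], [0, 3, 5, 19]]
transitions = build_transition_graph(tracks)
for (src, dst), visitors in sorted(transitions.items()):
    print(f"{src} -> {dst}: {visitors}")
```

Note that this aggregated view loses the identity of individual visitors, which is why the queries in this article work on per-visitor chains rather than on edge counts.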

I chose JanusGraph as a prominent representative of open-source graph DBMSs. It relies on a stack of mature technologies which (in my opinion) should give it decent operational characteristics:

  • storage backends: BerkeleyDB, Apache Cassandra, Scylla;
  • complex indexes can be stored in Lucene, Elasticsearch, Solr.

The authors of JanusGraph write that it is suitable for both OLTP and OLAP.

I have worked with BerkeleyDB, Apache Cassandra, Scylla and ES, and these products are often used in our systems, so I was optimistic about testing this graph DBMS. I found it odd to choose BerkeleyDB over RocksDB, but that is probably due to transaction requirements. In any case, for scalable production use, a Cassandra or Scylla backend is recommended.

I did not consider Neo4j because clustering requires the commercial version, that is, the product is not open source.

Graph DBMSs say: "If it looks like a graph, treat it like a graph!" Beautiful!

First, I drew a graph schema, made exactly according to the canons of graph DBMSs:

[Image: diagram of the graph schema]

There is a Zone entity, which represents a zone. If a ZoneStep belongs to that Zone, it references it. Ignore the Area, ZoneTrack and Person entities: they belong to the domain and are not considered part of the test. Overall, a chain search query for such a graph structure would look like this:

g.V().hasLabel('Zone').has('id',0).in_()
       .repeat(__.out()).until(__.out().hasLabel('Zone').has('id',19)).count().next()

In plain language, this means roughly the following: find the Zone with ID=0, take all vertices that have an edge going to it (ZoneStep), and step forward without going back until you find those ZoneSteps that have an edge to the Zone with ID=19; then count the number of such chains.

I don't claim to know all the intricacies of graph search, but this query was written based on this book (https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html).

I loaded 50 thousand tracks, each 3 to 20 points long, into a JanusGraph graph database using the BerkeleyDB backend, and created indexes according to the manual.

Python loading script:


from random import random
from time import time

# g and graph come from a local `init` module that is not shown in the article
from init import g, graph

if __name__ == '__main__':

    max_zones = 19
    # Create the Zone vertices once and cache them by id
    zcache = dict()
    for i in range(0, max_zones + 1):
        zcache[i] = g.addV('Zone').property('id', i).next()

    startZ = zcache[0]
    endZ = zcache[max_zones]

    for i in range(0, 10000):

        # Print progress every 100 tracks
        if not i % 100:
            print(i)

        # Every track starts with a step in the first zone
        start = g.addV('ZoneStep').property('time', int(time())).next()
        g.V(start).addE('belongs').to(startZ).iterate()

        while True:
            pt = g.addV('ZoneStep').property('time', int(time())).next()
            end_chain = random()
            if end_chain < 0.3:
                # With probability 0.3 the track ends in the last zone
                g.V(pt).addE('belongs').to(endZ).iterate()
                g.V(start).addE('goes').to(pt).iterate()
                break
            else:
                # Otherwise the step belongs to a random zone 0..max_zones - 1
                zone_id = int(random() * max_zones)
                g.V(pt).addE('belongs').to(zcache[zone_id]).iterate()
                g.V(start).addE('goes').to(pt).iterate()

            # The new step becomes the previous step of the chain
            start = pt

    count = g.V().count().next()
    print(count)

We used a VM with 4 cores and 16 GB of RAM on an SSD. JanusGraph was deployed with this command:

docker run --name janusgraph -p8182:8182 janusgraph/janusgraph:latest

In this case, the data and the indexes used for exact-match lookups are stored in BerkeleyDB. After running the query given earlier, I got an execution time of several tens of seconds.

By running 4 of the above scripts in parallel, I managed to turn the DBMS into a pumpkin, with a cheerful stream of Java stack traces (and we all love reading Java stack traces) in the Docker logs.

After some thought, I decided to simplify the graph schema to the following:

[Image: diagram of the simplified graph schema]

I decided that searching by entity attributes would be faster than searching by edges. As a result, my query turned into the following:

g.V().hasLabel('ZoneStep').has('id',0).repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',19)).count().next()

In plain language, this means roughly the following: find the ZoneStep with ID=0, step forward without going back until you find a ZoneStep with ID=19, and count the number of such chains.
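To show what such a traversal has to do conceptually, here is a minimal Python sketch of "step forward without revisiting a vertex until a target is found, then count the chains". It works on a plain adjacency dictionary and is only an illustration of the idea; it is not how JanusGraph or Gremlin execute the query internally, and the function and variable names are hypothetical.

```python
from typing import Callable, Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]


def count_simple_paths(
    graph: Graph,
    start: Hashable,
    is_target: Callable[[Hashable], bool],
) -> int:
    """Count simple paths that leave `start` and stop at the first target vertex."""
    count = 0
    # Each stack entry holds the current vertex and the vertices already on its path.
    stack = [(start, frozenset([start]))]
    while stack:
        vertex, visited = stack.pop()
        for nxt in graph.get(vertex, []):
            if nxt in visited:   # like simplePath(): never revisit a vertex
                continue
            if is_target(nxt):   # like until(...): this chain ends here
                count += 1
            else:
                stack.append((nxt, visited | {nxt}))
    return count


graph = {
    "a": ["b", "c"],
    "b": ["c", "z"],
    "c": ["z"],
    "z": [],
}
print(count_simple_paths(graph, "a", lambda v: v == "z"))
```

Every partial chain has to be carried forward until it either reaches a target or runs out of edges, so the amount of work depends on how many chains start at the source, not only on how many results come back.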

I also simplified the loading script given above so that it does not create unnecessary edges and relies on attributes only.

The query still took several seconds to complete, which is completely unacceptable for our task, since it does not suit ad hoc queries of any kind.

I tried deploying JanusGraph with Scylla, as the fastest Cassandra implementation, but this did not lead to any significant change in performance either.

So even though "it looks like a graph", I could not get the graph DBMS to process it quickly. I fully accept that there may be something I don't know and that JanusGraph can be made to run this search in a fraction of a second; however, I was unable to do it.

Since the problem still had to be solved, I started thinking about JOINs and pivot tables, which did not inspire optimism in terms of elegance but could be a perfectly workable option in practice.

Our project already uses ClickHouse, so I decided to test my search on this analytical DBMS.

ClickHouse was launched using a simple recipe:

sudo docker run -d --name clickhouse_1 \
     --ulimit nofile=262144:262144 \
     -v /opt/clickhouse/log:/var/log/clickhouse-server \
     -v /opt/clickhouse/data:/var/lib/clickhouse \
     yandex/clickhouse-server

I created a database and a table in it as follows:

CREATE TABLE 
db.steps (`area` Int64, `when` DateTime64(1, 'Europe/Moscow') DEFAULT now64(), `zone` Int64, `person` Int64) 
ENGINE = MergeTree() ORDER BY (area, zone, person) SETTINGS index_granularity = 8192

I filled it with data using the following script:

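from random import random
from time import time

from clickhouse_driver import Client

client = Client('vm-12c2c34c-df68-4a98-b1e5-a4d1cef1acff.domain',
                database='db',
                password='secret')

# Number of zones (zone ids run from 0 to max_zones - 1)
max_zones = 20

for r in range(0, 100000):

    # Progress report every 1000 visitors
    if r % 1000 == 0:
        print("CNT: {}, TS: {}".format(r, time()))

    # Every visitor starts in zone 0
    data = [{
            'area': 0,
            'zone': 0,
            'person': r
        }]

    # Add random intermediate zones (1..max_zones - 2);
    # the chain stops with probability 0.3 at each step
    while True:
        if random() < 0.3:
            break

        data.append({
                'area': 0,
                'zone': int(random() * (max_zones - 2)) + 1,
                'person': r
            })

    # Every visitor finishes in the last zone
    data.append({
            'area': 0,
            'zone': max_zones - 1,
            'person': r
        })

    # One INSERT per visitor; `when` is filled in by the column DEFAULT
    client.execute(
        'INSERT INTO steps (area, zone, person) VALUES',
        data
    )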

Since the inserts arrive in batches, filling the table was much faster than with JanusGraph.
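The batching idea can be taken one step further by buffering the rows of several visitors before sending them. The sketch below is a hypothetical helper, not something used in the experiment: `BatchWriter`, `sink` and `batch_size` are names invented for this example, and the sink here only records batch sizes so that the snippet runs without a database.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


class BatchWriter:
    """Buffer rows and hand them to `sink` once at least `batch_size` have accumulated."""

    def __init__(self, sink: Callable[[List[Row]], None], batch_size: int = 10000) -> None:
        self._sink = sink
        self._batch_size = batch_size
        self._buffer: List[Row] = []

    def add(self, rows: List[Row]) -> None:
        self._buffer.extend(rows)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is buffered; do nothing when the buffer is empty.
        if self._buffer:
            self._sink(self._buffer)
            self._buffer = []


sent: List[int] = []
writer = BatchWriter(lambda rows: sent.append(len(rows)), batch_size=5)
for person in range(4):
    writer.add([{"area": 0, "zone": z, "person": person} for z in (0, 7, 19)])
writer.flush()  # send the remainder, if any
print(sent)
```

With the `clickhouse_driver` client from the script above, the sink could be a function that calls `client.execute('INSERT INTO steps (area, zone, person) VALUES', rows)`.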

I built two queries using JOIN. To move from point A to point B:

SELECT s1.person AS person,
       s1.zone,
       s1.when,
       s2.zone,
       s2.when
FROM
  (SELECT *
   FROM steps
   WHERE (area = 0)
     AND (zone = 0)) AS s1 ANY INNER JOIN
  (SELECT *
   FROM steps AS s2
   WHERE (area = 0)
     AND (zone = 19)) AS s2 USING person
WHERE s1.when <= s2.when

To pass through 3 points:

SELECT s3.person,
       s1z,
       s1w,
       s2z,
       s2w,
       s3.zone,
       s3.when
FROM
  (SELECT s1.person AS person,
          s1.zone AS s1z,
          s1.when AS s1w,
          s2.zone AS s2z,
          s2.when AS s2w
   FROM
     (SELECT *
      FROM steps
      WHERE (area = 0)
        AND (zone = 0)) AS s1 ANY INNER JOIN
     (SELECT *
      FROM steps AS s2
      WHERE (area = 0)
        AND (zone = 3)) AS s2 USING person
   WHERE s1.when <= s2.when) p ANY INNER JOIN
  (SELECT *
   FROM steps
   WHERE (area = 0)
     AND (zone = 19)) AS s3 USING person
WHERE p.s2w <= s3.when

The queries, of course, look rather scary; for real use you would need to build a software generator for them. However, they work, and they work fast. Both the first and the second query complete in less than 0.1 seconds. Here is an example of the execution time of a count(*) query for a passage through 3 points:

SELECT count(*)
FROM 
(
    SELECT 
        s1.person AS person, 
        s1.zone AS s1z, 
        s1.when AS s1w, 
        s2.zone AS s2z, 
        s2.when AS s2w
    FROM 
    (
        SELECT *
        FROM steps
        WHERE (area = 0) AND (zone = 0)
    ) AS s1
    ANY INNER JOIN 
    (
        SELECT *
        FROM steps AS s2
        WHERE (area = 0) AND (zone = 3)
    ) AS s2 USING (person)
    WHERE s1.when <= s2.when
) AS p
ANY INNER JOIN 
(
    SELECT *
    FROM steps
    WHERE (area = 0) AND (zone = 19)
) AS s3 USING (person)
WHERE p.s2w <= s3.when

┌─count()─┐
│   11592 │
└─────────┘

1 rows in set. Elapsed: 0.068 sec. Processed 250.03 thousand rows, 8.00 MB (3.69 million rows/s., 117.98 MB/s.)
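The article notes that real use would need a generator for such queries. A minimal sketch of what one could look like is shown below, assuming the `steps` table defined earlier. The function names and the intermediate column names (`w0`, `w1`, ...) are hypothetical, the generated SQL follows the same nested ANY INNER JOIN pattern as the hand-written queries, and it has not been run against a ClickHouse server here.

```python
from typing import Sequence


def zone_visits(area: int, zone: int) -> str:
    """Subquery selecting all visits to one zone of one area."""
    return (
        "SELECT person, `when` FROM steps "
        f"WHERE (area = {int(area)}) AND (zone = {int(zone)})"
    )


def build_path_query(area: int, zones: Sequence[int]) -> str:
    """Build a count(*) query for visitors who pass through `zones` in order."""
    if len(zones) < 2:
        raise ValueError("at least two zones are required")
    # Level 0: visits to the first zone; w0 is the time of that visit.
    query = (
        "SELECT person, `when` AS w0 FROM steps "
        f"WHERE (area = {int(area)}) AND (zone = {int(zones[0])})"
    )
    # Each further zone wraps the previous level in one more join on person,
    # keeping only rows where the new visit is not earlier than the previous one.
    for level, zone in enumerate(zones[1:], start=1):
        query = (
            f"SELECT person, s.`when` AS w{level} "
            f"FROM ({query}) AS p "
            f"ANY INNER JOIN ({zone_visits(area, zone)}) AS s USING person "
            f"WHERE p.w{level - 1} <= s.`when`"
        )
    return f"SELECT count(*) FROM ({query})"


print(build_path_query(0, [0, 3, 19]))
```

Zone and area ids are passed through `int()` before being interpolated, so the sketch does not accept arbitrary strings; a production generator would more likely use the driver's parameter substitution.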

A note about IOPS. While loading data, JanusGraph generated a rather high number of IOPS (1000-1300 for four loading threads), and IOWAIT was quite high. At the same time, ClickHouse put only a minimal load on the disk subsystem.

Conclusion

We decided to use ClickHouse to serve this type of query. We can always optimize the queries further using materialized views and parallelization, by pre-processing the event stream with Apache Flink before loading it into ClickHouse.

The performance is so good that we probably won't even have to think about pivoting tables programmatically. Previously, we had to pivot data retrieved from Vertica by exporting it to Apache Parquet.

Unfortunately, yet another attempt to use a graph DBMS was unsuccessful. I did not find JanusGraph to have a friendly ecosystem that makes it easy to get up to speed with the product. On top of that, the server is configured in the traditional Java way, which will make people unfamiliar with Java weep tears of blood:

host: 0.0.0.0
port: 8182
threadPoolWorker: 1
gremlinPool: 8
scriptEvaluationTimeout: 30000
channelizer: org.janusgraph.channelizers.JanusGraphWsAndHttpChannelizer

graphManager: org.janusgraph.graphdb.management.JanusGraphManager
graphs: {
  ConfigurationManagementGraph: conf/janusgraph-cql-configurationgraph.properties,
  airlines: conf/airlines.properties
}

scriptEngines: {
  gremlin-groovy: {
    plugins: { org.janusgraph.graphdb.tinkerpop.plugin.JanusGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.server.jsr223.GremlinServerGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.tinkergraph.jsr223.TinkerGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.jsr223.ImportGremlinPlugin: {classImports: [java.lang.Math], methodImports: [java.lang.Math#*]},
               org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: [scripts/airline-sample.groovy]}}}}

serializers:
# GraphBinary is here to replace Gryo and Graphson
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { serializeResultToString: true }}
  # Gryo and Graphson, latest versions
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  # Older serialization versions for backwards compatibility:
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoLiteMessageSerializerV1d0, config: {ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV2d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}

processors:
  - { className: org.apache.tinkerpop.gremlin.server.op.session.SessionOpProcessor, config: { sessionTimeout: 28800000 }}
  - { className: org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor, config: { cacheExpirationTime: 600000, cacheMaxSize: 1000 }}

metrics: {
  consoleReporter: {enabled: false, interval: 180000},
  csvReporter: {enabled: false, interval: 180000, fileName: /tmp/gremlin-server-metrics.csv},
  jmxReporter: {enabled: false},
  slf4jReporter: {enabled: true, interval: 180000},
  gangliaReporter: {enabled: false, interval: 180000, addressingMode: MULTICAST},
  graphiteReporter: {enabled: false, interval: 180000}}
threadPoolBoss: 1
maxInitialLineLength: 4096
maxHeaderSize: 8192
maxChunkSize: 8192
maxContentLength: 65536
maxAccumulationBufferComponents: 1024
resultIterationBatchSize: 64
writeBufferLowWaterMark: 32768
writeBufferHighWaterMark: 65536
ssl: {
  enabled: false}

I managed to accidentally "take down" the BerkeleyDB version of JanusGraph.

The documentation is rather convoluted when it comes to indexes, since managing indexes requires you to perform some rather strange shamanism in Groovy. For example, an index has to be created by writing code in the Gremlin console (which, by the way, does not work out of the box). From the official JanusGraph documentation:

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('byNameComposite', Vertex.class).addKey(name).buildCompositeIndex()
mgmt.buildIndex('byNameAndAgeComposite', Vertex.class).addKey(name).addKey(age).buildCompositeIndex()
mgmt.commit()

//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameComposite').call()
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameAndAgeComposite').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REINDEX).get()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndAgeComposite"), SchemaAction.REINDEX).get()
mgmt.commit()

Afterword

In a sense, the experiment above compares apples with oranges. If you think about it, a graph DBMS performs different operations to obtain the same results. However, as part of the tests, I also ran an experiment with a query like:

g.V().hasLabel('ZoneStep').has('id',0)
    .repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',1)).count().next()

which reflects walking distance. However, even on such data the graph DBMS took longer than a few seconds... This, of course, is because there were paths such as 0 -> X -> Y ... -> 1, which the graph engine also checked.

Even for a query like:

g.V().hasLabel('ZoneStep').has('id',0).out().has('id',1).count().next()

I could not get a usable response with a processing time of less than a second.

The moral of the story is that a beautiful idea and paradigmatic modeling do not necessarily lead to the desired result, which was achieved far more efficiently with ClickHouse. The use case presented in this article is a clear anti-pattern for graph DBMSs, even though it looks well suited to modeling in their paradigm.

source: www.habr.com
