Hi everyone. We are developing a product for offline traffic analysis. One of the project's tasks involves statistical analysis of visitor routes across zones.
As part of this task, users can ask the system queries of the following kind:
- how many visitors passed from zone "A" to zone "B";
- how many visitors passed from zone "A" to zone "B" through zone "C" and then through zone "D";
- how long it took a certain type of visitor to travel from zone "A" to zone "B".
and a number of similar analytical queries.
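To make the three question types concrete, here is a minimal in-memory sketch in Python. It assumes each observation is a `(person, zone, when)` tuple and that "travel time" means the gap between the first visit to the start zone and the first later visit to the end zone. The function names are hypothetical and are not part of our product.

```python
from collections import defaultdict


def tracks_by_person(steps):
    """Group (person, zone, when) records into per-person tracks sorted by time."""
    tracks = defaultdict(list)
    for person, zone, when in steps:
        tracks[person].append((when, zone))
    for track in tracks.values():
        track.sort()
    return tracks


def visit_times(track, zones):
    """Return the times at which `zones` are visited in order, or None if the track does not match."""
    times = []
    idx = 0
    for when, zone in track:
        if idx < len(zones) and zone == zones[idx]:
            times.append(when)
            idx += 1
    return times if idx == len(zones) else None


def count_visitors(steps, zones):
    """How many visitors passed through all `zones` in the given order."""
    tracks = tracks_by_person(steps)
    return sum(1 for track in tracks.values() if visit_times(track, zones) is not None)


def travel_times(steps, zone_from, zone_to):
    """Per-visitor time between the first visit to zone_from and the next visit to zone_to."""
    result = {}
    for person, track in tracks_by_person(steps).items():
        times = visit_times(track, [zone_from, zone_to])
        if times is not None:
            result[person] = times[1] - times[0]
    return result
```

For example, `count_visitors(steps, ["A", "C", "D", "B"])` would answer the second question in the list, and filtering `steps` by visitor type before calling `travel_times` would answer the third.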
A visitor's movement across zones is a directed graph. After reading around the internet, I found out that graph DBMSs are also used for analytical reports. I wanted to see how graph DBMSs would cope with such queries (TL;DR: poorly).
I chose to use the JanusGraph DBMS:
- storage backends: BerkeleyDB, Apache Cassandra, Scylla;
- complex indexes can be stored in Lucene, Elasticsearch, Solr.
The authors of JanusGraph write that it is suitable for both OLTP and OLAP.
I have worked with BerkeleyDB, Apache Cassandra, Scylla and ES, and these products are often used in our systems, so I was optimistic about testing this graph DBMS. It seemed strange to me that BerkeleyDB was chosen over RocksDB, but that is probably due to transactional requirements. In any case, for scalable production use, a Cassandra or Scylla backend is recommended.
I did not consider Neo4j because clustering requires the commercial edition, that is, the product is not open source.
Graph DBMSs say: "If it looks like a graph, treat it like a graph!" Beautiful!
First, I drew a graph, made exactly according to the canons of graph DBMSs:
There is a Zone entity, which represents a zone. If a ZoneStep belongs to this Zone, then it refers to it. Ignore the Area, ZoneTrack and Person entities: they belong to the domain and are not considered as part of the test. Overall, a chain search query for such a graph structure would look like this:
g.V().hasLabel('Zone').has('id',0).in_()
.repeat(__.out()).until(__.out().hasLabel('Zone').has('id',19)).count().next()
In plain language, this reads roughly as follows: find the Zone with ID=0, take all the vertices that have an edge going to it (ZoneStep), walk forward without going back until you find the ZoneSteps that have an edge to the Zone with ID=19, and count the number of such chains.
I do not claim to know all the intricacies of searching on graphs, but this query was put together following this book (
I loaded 50 thousand tracks, from 3 to 20 points long, into a JanusGraph graph database using the BerkeleyDB backend, and created indexes according to
Python loading script:
from random import random
from time import time

# `init` is a local helper module that opens the Gremlin connection
# and exposes the traversal source `g`.
from init import g, graph

if __name__ == '__main__':
    points = []
    max_zones = 19
    zcache = dict()
    # Create the Zone vertices once and cache them by id.
    for i in range(0, max_zones + 1):
        zcache[i] = g.addV('Zone').property('id', i).next()
    startZ = zcache[0]
    endZ = zcache[max_zones]

    for i in range(0, 10000):
        if not i % 100:
            print(i)
        # Every chain starts with a ZoneStep that belongs to Zone 0.
        start = g.addV('ZoneStep').property('time', int(time())).next()
        g.V(start).addE('belongs').to(startZ).iterate()
        while True:
            pt = g.addV('ZoneStep').property('time', int(time())).next()
            end_chain = random()
            if end_chain < 0.3:
                # With probability 0.3 the chain ends in the last Zone.
                g.V(pt).addE('belongs').to(endZ).iterate()
                g.V(start).addE('goes').to(pt).iterate()
                break
            else:
                # Otherwise attach the step to a random Zone and keep walking.
                zone_id = int(random() * max_zones)
                g.V(pt).addE('belongs').to(zcache[zone_id]).iterate()
                g.V(start).addE('goes').to(pt).iterate()
                start = pt

    count = g.V().count().next()
    print(count)
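To get a feel for what the loader produces without touching the database, the same random walk can be replayed in plain Python. This is only a sketch: `generate_chain` and `estimate_load` are hypothetical names, and the round-trip estimate assumes that every `.next()` and `.iterate()` in the script above is a separate request to the server (the 20 Zone vertices are ignored).

```python
from random import Random


def generate_chain(rng, max_zones=19, stop_probability=0.3):
    """Return the zone ids of one chain, mimicking the loader's loop."""
    zones = [0]  # the first ZoneStep always belongs to Zone 0
    while True:
        if rng.random() < stop_probability:
            zones.append(max_zones)  # the chain ends in the last Zone
            return zones
        zones.append(int(rng.random() * max_zones))  # random Zone 0..max_zones-1


def estimate_load(chains, seed=42):
    """Estimate how many vertices, edges and requests `chains` chains cost."""
    rng = Random(seed)
    steps = sum(len(generate_chain(rng)) for _ in range(chains))
    return {
        "zone_step_vertices": steps,
        "belongs_edges": steps,          # one 'belongs' edge per ZoneStep
        "goes_edges": steps - chains,    # one 'goes' edge between neighbours
        "round_trips": 3 * steps - chains,
        "mean_chain_length": steps / chains,
    }
```

Calling `estimate_load(10000)` gives a rough idea of how many individual requests one run of the loader sends.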
We used a VM with 4 cores and 16 GB RAM on an SSD. JanusGraph was deployed with this command:
docker run --name janusgraph -p8182:8182 janusgraph/janusgraph:latest
In this case, the data and the indexes used for exact-match searches are stored in BerkeleyDB. After executing the query given earlier, I got a time of several tens of seconds.
By running the 4 scripts above in parallel, I managed to turn the DBMS into a pumpkin, with a cheerful stream of Java stack traces (and we all love reading Java stack traces) in the Docker logs.
After some thought, I decided to simplify the graph diagram to the following:
I reasoned that searching by entity attributes would be faster than searching by edges. As a result, my query turned into the following:
g.V().hasLabel('ZoneStep').has('id',0).repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',19)).count().next()
In plain language, this reads roughly as follows: find the ZoneStep with ID=0, walk forward without going back until you find a ZoneStep with ID=19, and count the number of such chains.
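As I understand the traversal, `repeat(out().simplePath()).until(...)` behaves like a breadth-first walk that drops any path revisiting a vertex and emits a path as soon as its last vertex satisfies the condition. The sketch below emulates that on a plain adjacency dictionary. It is an illustration of the semantics under that assumption, not JanusGraph's actual execution plan, and `count_chains`, `out_edges` and `zone_of` are hypothetical names.

```python
def count_chains(out_edges, zone_of, start_zone, end_zone):
    """Count simple paths that start at a vertex with id `start_zone`
    and stop at the first vertex whose id is `end_zone`.

    out_edges: dict mapping a vertex to the vertices its outgoing edges point to
    zone_of:   dict mapping a vertex to its 'id' property
    """
    # Each traverser is represented by the path it has walked so far.
    frontier = [(v,) for v, zone in zone_of.items() if zone == start_zone]
    found = 0
    while frontier:
        next_frontier = []
        for path in frontier:
            for nxt in out_edges.get(path[-1], ()):
                if nxt in path:
                    continue  # simplePath(): never revisit a vertex
                if zone_of[nxt] == end_zone:
                    found += 1  # until(...) matched: the traverser leaves the loop
                else:
                    next_frontier.append(path + (nxt,))
        frontier = next_frontier
    return found
```

Note that every intermediate vertex on every chain is visited before the count is known, which is the kind of work the engine has to do for this query.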
I also simplified the loading script given above so that it does not create unnecessary edges, limiting myself to attributes.
The query still took several seconds to complete, which was completely unacceptable for our task, since it made ad hoc queries of any kind impractical.
I tried deploying JanusGraph with Scylla, as the fastest Cassandra implementation, but this did not lead to any significant performance changes either.
So despite the fact that "it looks like a graph", I could not get the graph DBMS to process it quickly. I fully accept that there is something I do not know and that JanusGraph can be made to perform this search in a fraction of a second; however, I could not do it.
Since the problem still needed solving, I started thinking about JOINs and pivots of tables, which did not inspire optimism in terms of elegance but could be a perfectly workable option in practice.
Our project already uses ClickHouse, so I decided to test my research on this analytical DBMS.
I deployed ClickHouse using a simple recipe:
sudo docker run -d --name clickhouse_1 \
    --ulimit nofile=262144:262144 \
    -v /opt/clickhouse/log:/var/log/clickhouse-server \
    -v /opt/clickhouse/data:/var/lib/clickhouse \
    yandex/clickhouse-server
I created a database and a table in it like this:
CREATE TABLE
db.steps (`area` Int64, `when` DateTime64(1, 'Europe/Moscow') DEFAULT now64(), `zone` Int64, `person` Int64)
ENGINE = MergeTree() ORDER BY (area, zone, person) SETTINGS index_granularity = 8192
I filled it with data using the following script:
from time import time
from clickhouse_driver import Client
from random import random

client = Client('vm-12c2c34c-df68-4a98-b1e5-a4d1cef1acff.domain',
                database='db',
                password='secret')

max_zones = 20  # renamed from `max` so the built-in is not shadowed

for r in range(0, 100000):
    if r % 1000 == 0:
        print("CNT: {}, TS: {}".format(r, time()))

    # Every track starts in zone 0.
    data = [{
        'area': 0,
        'zone': 0,
        'person': r
    }]

    # Add random intermediate zones (1..18); stop with probability 0.3.
    while True:
        if random() < 0.3:
            break
        data.append({
            'area': 0,
            'zone': int(random() * (max_zones - 2)) + 1,
            'person': r
        })

    # Every track ends in the last zone (19).
    data.append({
        'area': 0,
        'zone': max_zones - 1,
        'person': r
    })

    # One INSERT per track: all rows of the track go in a single batch.
    client.execute(
        'INSERT INTO steps (area, zone, person) VALUES',
        data
    )
Since the inserts go in batches, filling the table was much faster than for JanusGraph.
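The batching could be pushed further by collecting many tracks before each insert. The sketch below separates track generation from loading so that it can be tried without a server; `make_track`, `load` and the `insert` callable are hypothetical names, and the batch size of 10000 rows is an arbitrary assumption rather than a tuned value.

```python
from random import Random


def make_track(person, rng, max_zones=20, stop_probability=0.3, area=0):
    """Build the rows of one track: zone 0, random zones 1..18, then zone 19."""
    rows = [{'area': area, 'zone': 0, 'person': person}]
    while rng.random() >= stop_probability:
        zone = int(rng.random() * (max_zones - 2)) + 1
        rows.append({'area': area, 'zone': zone, 'person': person})
    rows.append({'area': area, 'zone': max_zones - 1, 'person': person})
    return rows


def load(insert, persons, batch_rows=10000, seed=0):
    """Generate tracks and pass them to `insert` in batches of about `batch_rows` rows.

    `insert` is any callable that accepts a list of row dicts, for example:
    lambda rows: client.execute('INSERT INTO steps (area, zone, person) VALUES', rows)
    """
    rng = Random(seed)
    batch = []
    total = 0
    for person in range(persons):
        batch.extend(make_track(person, rng))
        if len(batch) >= batch_rows:
            insert(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the last, partially filled batch
        insert(batch)
        total += len(batch)
    return total
```

Passing a list's `append` method as `insert` is a quick way to inspect the generated batches before pointing the loader at a real table.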
Two queries were built using JOIN. To get from point A to point B:
SELECT s1.person AS person,
s1.zone,
s1.when,
s2.zone,
s2.when
FROM
(SELECT *
FROM steps
WHERE (area = 0)
AND (zone = 0)) AS s1 ANY INNER JOIN
(SELECT *
FROM steps AS s2
WHERE (area = 0)
AND (zone = 19)) AS s2 USING person
WHERE s1.when <= s2.when
To pass through 3 points:
SELECT s3.person,
s1z,
s1w,
s2z,
s2w,
s3.zone,
s3.when
FROM
(SELECT s1.person AS person,
s1.zone AS s1z,
s1.when AS s1w,
s2.zone AS s2z,
s2.when AS s2w
FROM
(SELECT *
FROM steps
WHERE (area = 0)
AND (zone = 0)) AS s1 ANY INNER JOIN
(SELECT *
FROM steps AS s2
WHERE (area = 0)
AND (zone = 3)) AS s2 USING person
WHERE s1.when <= s2.when) p ANY INNER JOIN
(SELECT *
FROM steps
WHERE (area = 0)
AND (zone = 19)) AS s3 USING person
WHERE p.s2w <= s3.when
The queries, of course, look rather scary; for real use, you would need to build a query generator harness in software. However, they work, and they work fast. Both the first and the second query complete in less than 0.1 seconds. Here is an example of the execution time of a count(*) query passing through 3 points:
SELECT count(*)
FROM
(
SELECT
s1.person AS person,
s1.zone AS s1z,
s1.when AS s1w,
s2.zone AS s2z,
s2.when AS s2w
FROM
(
SELECT *
FROM steps
WHERE (area = 0) AND (zone = 0)
) AS s1
ANY INNER JOIN
(
SELECT *
FROM steps AS s2
WHERE (area = 0) AND (zone = 3)
) AS s2 USING (person)
WHERE s1.when <= s2.when
) AS p
ANY INNER JOIN
(
SELECT *
FROM steps
WHERE (area = 0) AND (zone = 19)
) AS s3 USING (person)
WHERE p.s2w <= s3.when
┌─count()─┐
│ 11592 │
└─────────┘
1 rows in set. Elapsed: 0.068 sec. Processed 250.03 thousand rows, 8.00 MB (3.69 million rows/s., 117.98 MB/s.)
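The generator harness mentioned above could start out as something like the following sketch. It builds a nested ANY INNER JOIN of the same shape as the hand-written queries for an arbitrary list of zones, against the `steps` table defined earlier. The function names and the `w1`, `w2`, ... column aliases are hypothetical, and I have not run the generated SQL against ClickHouse, so treat it as a starting point.

```python
def zone_subquery(area, zone, alias):
    """Sub-select with every step recorded in one zone of one area."""
    return (
        f"(SELECT person, `when` FROM steps "
        f"WHERE (area = {int(area)}) AND (zone = {int(zone)})) AS {alias}"
    )


def chain_query(zones, area=0):
    """Build a query for visitors who passed through `zones` in the given order."""
    if len(zones) < 2:
        raise ValueError("at least two zones are required")
    first = zone_subquery(area, zones[0], "s1")
    second = zone_subquery(area, zones[1], "s2")
    query = (
        "SELECT s1.person AS person, s1.`when` AS w1, s2.`when` AS w2\n"
        f"FROM {first}\n"
        f"ANY INNER JOIN {second} USING (person)\n"
        "WHERE s1.`when` <= s2.`when`"
    )
    # Each further zone wraps the previous query and joins one more sub-select.
    for k in range(3, len(zones) + 1):
        alias = f"s{k}"
        carried = ", ".join(f"p.w{i} AS w{i}" for i in range(1, k))
        nxt = zone_subquery(area, zones[k - 1], alias)
        query = (
            f"SELECT p.person AS person, {carried}, {alias}.`when` AS w{k}\n"
            f"FROM ({query}) AS p\n"
            f"ANY INNER JOIN {nxt} USING (person)\n"
            f"WHERE p.w{k - 1} <= {alias}.`when`"
        )
    return query


def count_query(zones, area=0):
    """Wrap the chain query in count(*)."""
    return f"SELECT count(*)\nFROM ({chain_query(zones, area)})"
```

`count_query([0, 3, 19])` produces a query equivalent in structure to the three-point example above; the zone and area values are cast to `int` so that they cannot inject arbitrary SQL.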
A note about IOPS. While loading data, JanusGraph generated a fairly high number of IOPS (1000-1300 for four data-loading threads), and IOWAIT was quite high. At the same time, ClickHouse put minimal load on the disk subsystem.
Conclusion
We decided to use ClickHouse to serve this type of query. We can always optimize the queries further by using materialized views and parallelization, pre-processing the event stream with Apache Flink before loading it into ClickHouse.
The performance is so good that we probably will not even have to think about pivoting tables programmatically. Previously, we had to pivot data retrieved from Vertica by exporting it to Apache Parquet.
Unfortunately, yet another attempt to use a graph DBMS was unsuccessful. I did not find that JanusGraph has a friendly ecosystem that makes it easy to get up to speed with the product. On top of that, the server is configured in the traditional Java way, which will make people unfamiliar with Java cry tears of blood:
host: 0.0.0.0
port: 8182
threadPoolWorker: 1
gremlinPool: 8
scriptEvaluationTimeout: 30000
channelizer: org.janusgraph.channelizers.JanusGraphWsAndHttpChannelizer
graphManager: org.janusgraph.graphdb.management.JanusGraphManager
graphs: {
ConfigurationManagementGraph: conf/janusgraph-cql-configurationgraph.properties,
airlines: conf/airlines.properties
}
scriptEngines: {
gremlin-groovy: {
plugins: { org.janusgraph.graphdb.tinkerpop.plugin.JanusGraphGremlinPlugin: {},
org.apache.tinkerpop.gremlin.server.jsr223.GremlinServerGremlinPlugin: {},
org.apache.tinkerpop.gremlin.tinkergraph.jsr223.TinkerGraphGremlinPlugin: {},
org.apache.tinkerpop.gremlin.jsr223.ImportGremlinPlugin: {classImports: [java.lang.Math], methodImports: [java.lang.Math#*]},
org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: [scripts/airline-sample.groovy]}}}}
serializers:
# GraphBinary is here to replace Gryo and Graphson
- { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1, config: { serializeResultToString: true }}
# Gryo and Graphson, latest versions
- { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { serializeResultToString: true }}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
# Older serialization versions for backwards compatibility:
- { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GryoLiteMessageSerializerV1d0, config: {ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV2d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}
processors:
- { className: org.apache.tinkerpop.gremlin.server.op.session.SessionOpProcessor, config: { sessionTimeout: 28800000 }}
- { className: org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor, config: { cacheExpirationTime: 600000, cacheMaxSize: 1000 }}
metrics: {
consoleReporter: {enabled: false, interval: 180000},
csvReporter: {enabled: false, interval: 180000, fileName: /tmp/gremlin-server-metrics.csv},
jmxReporter: {enabled: false},
slf4jReporter: {enabled: true, interval: 180000},
gangliaReporter: {enabled: false, interval: 180000, addressingMode: MULTICAST},
graphiteReporter: {enabled: false, interval: 180000}}
threadPoolBoss: 1
maxInitialLineLength: 4096
maxHeaderSize: 8192
maxChunkSize: 8192
maxContentLength: 65536
maxAccumulationBufferComponents: 1024
resultIterationBatchSize: 64
writeBufferLowWaterMark: 32768
writeBufferHighWaterMark: 65536
ssl: {
enabled: false}
I managed to set up the BerkeleyDB version of JanusGraph "by accident".
The documentation is rather crooked when it comes to indexes, since managing indexes requires you to perform some rather strange shamanism in Groovy. For example, an index must be created by writing code in the Gremlin console (which, by the way, does not work out of the box). From the official JanusGraph documentation:
graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('byNameComposite', Vertex.class).addKey(name).buildCompositeIndex()
mgmt.buildIndex('byNameAndAgeComposite', Vertex.class).addKey(name).addKey(age).buildCompositeIndex()
mgmt.commit()
//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameComposite').call()
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameAndAgeComposite').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REINDEX).get()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndAgeComposite"), SchemaAction.REINDEX).get()
mgmt.commit()
Afterword
In a sense, the experiment above is a comparison of apples and oranges. If you think about it, a graph DBMS performs different operations to obtain the same results. However, as part of the tests, I also ran an experiment with a query like:
g.V().hasLabel('ZoneStep').has('id',0)
.repeat(__.out().simplePath()).until(__.hasLabel('ZoneStep').has('id',1)).count().next()
which reflects walking distance. However, even on such data, the graph DBMS showed results that went beyond a few seconds... This, of course, is because there were paths like 0 -> X -> Y ... -> 1, which the graph engine also checked.
Even for a query like:
g.V().hasLabel('ZoneStep').has('id',0).out().has('id',1).count().next()
I could not get a usable response with a processing time of less than a second.
The moral of the story is that a beautiful idea and paradigmatic modelling do not necessarily lead to the desired result, which the ClickHouse example achieves with much greater efficiency. The use case presented in this article is a clear anti-pattern for graph DBMSs, although it looks suitable for modelling in their paradigm.
Source: www.habr.com