Multi-model DBMS - the basis of modern information systems?

Modern information systems are quite complex. Not least, their complexity stems from the complexity of the data they process, which often lies in the variety of data models involved. For example, when data becomes "big", one of its troublesome characteristics is not only its volume ("volume") but also its variety ("variety").

If you do not yet find a flaw in the reasoning, then read on.



Content

Polyglot persistence
Multimodeling
Multi-model DBMS based on the relational model
     Document Model in MS SQL Server
     Graph model in MS SQL Server
Multi-model DBMS based on the document model
     Relational model in MarkLogic
     Graph model in MarkLogic
Multi-model DBMS "without the main model"
     ArangoDB
     OrientDB
     Azure CosmosDB
Multi-model DBMS based on a graph model?
Conclusion
Poll

Polyglot persistence

The foregoing means that sometimes, even within a single system, one has to use several different DBMSs to store data and to solve various data-processing tasks, each DBMS supporting its own data model. Thanks to M. Fowler, the author of a number of well-known books and a co-author of the Agile Manifesto, this situation has come to be called polyglot persistence.

Fowler is also the author of the following example of organizing data storage in a full-featured, highly loaded e-commerce application.

[Figure: Fowler's example of polyglot persistence in an e-commerce application, with a separate specialized DBMS for each kind of data]

This example, of course, is somewhat exaggerated, but some considerations in favor of choosing one or another DBMS for the corresponding purpose can be found, for example, here.

It is clear that being a keeper in such a zoo is not easy.

  • The amount of code that persists data grows in proportion to the number of DBMSs used; the amount of code that synchronizes data grows, at best, in proportion to the square of that number.
  • The costs of providing the enterprise characteristics (scalability, fault tolerance, high availability) of each DBMS used grow as a multiple of the number of DBMSs.
  • It is impossible to provide the enterprise characteristics of the storage subsystem as a whole, especially transactionality.

From the point of view of the zoo director, everything looks like this:

  • A multiple increase in the cost of licenses and technical support from DBMS vendors.
  • Inflated staff and extended deadlines.
  • Direct financial losses or penalties due to data inconsistency.

All of this significantly increases the total cost of ownership (TCO) of the system. Is there any way out of the situation of polyglot persistence?

Multimodeling

The term "multiple storage" came into use in 2011. Awareness of the problems of the approach and the search for a solution took several years, and by 2015, the answer was formulated by Gartner analysts:

It seems that this time Gartner analysts were not mistaken in their forecast. If you open the main DBMS ranking on DB-Engines, you can see that most of its leaders position themselves precisely as multi-model DBMSs. The same can be seen on the page of any of the per-model rankings.

The table below shows the DBMSs that lead each of the per-model rankings and declare themselves multi-model. For each DBMS, the originally supported model (once the only one) is listed along with the models supported now. The table also lists DBMSs that position themselves as "initially multi-model" and, according to their creators, have no inherited original model.

DBMS             Original model   Additional models
Oracle           relational       graph, document
MS SQL           relational       graph, document
PostgreSQL       relational       graph*, document
MarkLogic        document         graph, relational
MongoDB          document         key-value, graph*
DataStax         wide-column      document, graph
Redis            key-value        document, graph*
ArangoDB         (none)           graph, document
OrientDB         (none)           graph, document, relational
Azure CosmosDB   (none)           graph, document, relational

Notes to the table

Asterisks in the table mark claims that require caveats:

  • PostgreSQL itself does not support the graph data model, but it is supported by products built on top of it, such as AgensGraph.
  • With regard to MongoDB, it is more correct to speak of graph operators in the query language ($lookup, $graphLookup) than of graph model support, although, of course, introducing them required certain optimizations at the physical storage level in the direction of graph model support.
  • In the case of Redis, what is meant is the Redis Graph extension.

Below, for each of these classes, we will show how support for several models is implemented in DBMSs of that class. We will consider the relational, document, and graph models the most important and, using specific DBMSs as examples, show how the "missing" models are implemented.

Multi-model DBMS based on the relational model

Relational DBMSs currently dominate the market, and Gartner's prediction could hardly be considered correct if RDBMSs did not show movement toward multi-modeling. And they do. Now the quip that a multi-model DBMS is like a Swiss Army knife that cannot do anything well can be addressed directly to Larry Ellison.

The author, however, prefers the multi-modeling implementation in Microsoft SQL Server, and it is with this example that support for the document and graph models in an RDBMS will be described.

Document Model in MS SQL Server

There have already been two excellent articles on Habr about how MS SQL Server implements support for the document model, so I will limit myself to a brief retelling and a commentary.

The way the document model is supported in MS SQL Server is quite typical for a relational DBMS: JSON documents are stored in ordinary text fields. Support for the document model then comes down to providing special operators for parsing this JSON.
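Presumably, the operators in question are JSON_VALUE and JSON_QUERY, which extract a scalar value and an object or array fragment respectively. A minimal sketch of their use (the table and column names here are hypothetical):

-- A hypothetical table storing JSON documents in an ordinary text column
CREATE TABLE PersonDoc (
  id   INT PRIMARY KEY,
  info NVARCHAR(MAX) CHECK (ISJSON(info) = 1)
);

-- JSON_VALUE extracts a scalar, JSON_QUERY an object or array
SELECT JSON_VALUE(info, '$.name')    AS name,
       JSON_QUERY(info, '$.address') AS address
  FROM PersonDoc
 WHERE JSON_VALUE(info, '$.surname') = 'Smith';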

The second argument to both operators is an expression in JSONPath-like syntax.

Speaking abstractly, documents stored this way are not "first-class entities" in a relational DBMS, unlike tuples. Concretely, in MS SQL Server there are currently no indexes on the fields of JSON documents, which makes it difficult to join tables on the values of these fields and even to select documents by these values. It is, however, possible to create a computed column over such a field and then an index on that column.
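A sketch of this workaround, continuing the hypothetical PersonDoc table from the example above:

-- A computed column over a field of the JSON document and a regular index on it
ALTER TABLE PersonDoc
  ADD surname AS JSON_VALUE(info, '$.surname');

CREATE INDEX IX_PersonDoc_surname ON PersonDoc (surname);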

Additionally, MS SQL Server makes it possible to conveniently construct a JSON document from the contents of tables using the FOR JSON PATH operator, a capability that is, in a certain sense, the opposite of the previous one. It is clear that, however fast the RDBMS may be, this approach contradicts the ideology of document DBMSs, which essentially store ready-made answers to popular queries, and it can only address development convenience, not speed.
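For illustration, a sketch with hypothetical Client and Orders tables; a dot in a column alias produces a nested object in the resulting JSON:

-- Assemble a nested JSON document from flat relational tables
SELECT c.name    AS [name],
       c.surname AS [surname],
       o.total   AS [lastOrder.total]
  FROM Client AS c
  JOIN Orders AS o ON o.client_id = c.id
   FOR JSON PATH;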

Finally, MS SQL Server allows you to solve the inverse problem: decomposing a JSON document into tables using the OPENJSON operator. If the document is not completely flat, you will need to use CROSS APPLY.
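A sketch of such a decomposition (the document and its fields are, again, hypothetical):

DECLARE @doc NVARCHAR(MAX) = N'{
  "name": "John",
  "surname": "Smith",
  "phones": [ { "type": "home", "number": "555-1234" } ]
}';

-- Top-level fields become columns; the nested array is returned as JSON
-- and unnested with a second OPENJSON via CROSS APPLY
SELECT person.name, person.surname, phone.number
  FROM OPENJSON(@doc)
       WITH (name    VARCHAR(100)  '$.name',
             surname VARCHAR(100)  '$.surname',
             phones  NVARCHAR(MAX) '$.phones' AS JSON) AS person
 CROSS APPLY OPENJSON(person.phones)
       WITH (number VARCHAR(20) '$.number') AS phone;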

Graph model in MS SQL Server

Support for the graph (LPG) model in Microsoft SQL Server is also implemented quite predictably: it is proposed to use special tables for storing graph nodes and for storing graph edges. Such tables are created with the expressions CREATE TABLE AS NODE and CREATE TABLE AS EDGE, respectively.

Tables of the first kind are similar to ordinary tables for storing records, with the only outward difference being that such a table contains a system field $node_id, a unique identifier of the graph node within the database.

Similarly, tables of the second kind have the system fields $from_id and $to_id; the entries in such tables unambiguously define links between nodes. A separate table is used to store links of each type.

Let's illustrate the above with an example. Let the graph data have the schema shown in the figure below: Person and Cafe nodes connected by likes (a person likes a cafe) and friendOf (a person is a friend of another person) edges. Then, to create the corresponding structure in the database, the following DDL queries need to be executed:

CREATE TABLE Person (
  ID INTEGER NOT NULL,
  name VARCHAR(100)
) AS NODE;

CREATE TABLE Cafe (
  ID INTEGER NOT NULL, 
  name VARCHAR(100)
) AS NODE;

CREATE TABLE likes (
  rating INTEGER
) AS EDGE;

CREATE TABLE friendOf
  AS EDGE;

ALTER TABLE likes
  ADD CONSTRAINT EC_LIKES CONNECTION (Person TO Cafe);

The main specific feature of such tables is that queries against them can use graph patterns with a Cypher-like syntax (however, "*" and the like are not supported yet). Judging by performance measurements, it can also be assumed that the data in these tables is stored differently from regular tables and is optimized for executing such graph queries.

SELECT Cafe.name
  FROM Person AS Person1, friendOf, Person AS Person2, likes, Cafe
  WHERE MATCH (Person1-(friendOf)->Person2-(likes)->Cafe)
  AND Person1.name = 'John';

Moreover, when working with such tables it is rather difficult not to use these graph patterns, since ordinary SQL queries solving similar problems would require extra effort to obtain the system "graph" identifiers ($node_id, $from_id, $to_id); for the same reason, data-insertion queries are not shown here, as they are unnecessarily cumbersome.

Summing up the description of the document and graph model implementations in MS SQL Server, I would note that such implementations of one model on top of another do not seem successful, first of all from the point of view of language design. One language has to be extended with another, the languages are not quite "orthogonal", and the compatibility rules can be quite bizarre.

Multi-model DBMS based on the document model

In this section I would like to illustrate the implementation of multi-modeling in document DBMSs not with the most popular of them, MongoDB (which, as already mentioned, has only conditionally "graph" operators $lookup and $graphLookup, and those do not work on sharded collections), but with the more mature and "enterprise" DBMS MarkLogic.

So, let the collection contain a set of XML documents of the following form (MarkLogic also allows you to store JSON documents):

<Person SSN="631803299804">
  <name>John</name>
  <surname>Smith</surname>
</Person>

Relational model in MarkLogic

You can create a relational view over a collection of documents using a mapping template (the content of the value elements in the example below can be an arbitrary XPath expression):

<template >
  <context>/Person</context>
  <rows>
    <row>
      <view-name>Person</view-name>
      <columns>
        <column>
          <name>SSN</name>
          <value>@SSN</value>
          <type>string</type>
        </column>
        <column>
          <name>name</name>
          <value>name</value>
        </column>
        <column>
          <name>surname</name>
          <value>surname</value>
        </column>
      </columns>
    </row>
  </rows>
</template>

A SQL query can be addressed to the created view (for example, via ODBC):

SELECT name, surname FROM Person WHERE name = 'John'

Unfortunately, the relational view created by a mapping template is read-only. When processing a query against it, MarkLogic will try to use document indexes. Previously, MarkLogic also had more limited relational views, entirely index-based and writable, but they are now considered deprecated.

Graph model in MarkLogic

With support for the graph (RDF) model, everything is roughly the same. Again, using a mapping template, you can create an RDF representation of the collection of documents from the example above:

<template >
  <context>/Person</context>
    <vars>
      <var>
        <name>PREFIX</name>
        <val>"http://example.org/example#"</val>
      </var>
    </vars>
  <triples>
    <triple>
      <subject><value>sem:iri( $PREFIX || @SSN )</value></subject>
      <predicate><value>sem:iri( $PREFIX || "surname" )</value></predicate>
      <object><value>xs:string( surname )</value></object>
    </triple>
    <triple>
      <subject><value>sem:iri( $PREFIX || @SSN )</value></subject>
      <predicate><value>sem:iri( $PREFIX || "name" )</value></predicate>
      <object><value>xs:string( name )</value></object>
    </triple>
  </triples>
  </template>

The resulting RDF graph can be addressed with a SPARQL query:

PREFIX : <http://example.org/example#>
SELECT ?name ?surname {
  :631803299804 :name ?name ; :surname ?surname .
}

Unlike the relational model, the graph model is supported in MarkLogic in two more ways:

  1. The DBMS can act as a full-fledged standalone store of RDF data (the triples in it are then called managed, as opposed to the extracted triples described above).
  2. RDF in a special serialization can simply be inserted into XML or JSON documents (such triples are then called unmanaged). This is, apparently, a kind of alternative to mechanisms such as idref, etc.

A good idea of how things "really" work in MarkLogic is given by the Optic API; in this sense it is low-level, although its purpose is rather the opposite: to abstract away from the data model being used and to provide consistent work with data in different models, transactionality, and so on.

Multi-model DBMS "without the main model"

There are also DBMSs on the market that position themselves as inherently multi-model, having no legacy core model. These include ArangoDB, OrientDB (whose developer company has belonged to SAP since 2018), and CosmosDB (a service within the Microsoft Azure cloud platform).

In fact, there are "basic" models in ArangoDB and OrientDB. In both cases, these are their own data models, which are generalizations of the document one. Generalizations consist mainly in facilitating the ability to make queries of a graph and relational nature.

These models are the only ones available in these DBMSs, and their native query languages are designed to work with them. Such models and DBMSs are certainly promising, but the lack of compatibility with standard models and languages makes it impossible to use these DBMSs in legacy systems, that is, to replace the DBMSs already used there.

There was already a wonderful article about ArangoDB and OrientDB on Habré: JOIN in NoSQL databases.

ArangoDB

ArangoDB claims support for the graph data model.

Graph nodes in ArangoDB are ordinary documents, and edges are documents of a special kind that, along with the usual system fields (_key, _id, _rev), also have the system fields _from and _to. Documents in document DBMSs are traditionally grouped into collections; collections of documents representing edges are called edge collections in ArangoDB. By the way, documents of edge collections are still documents, so edges in ArangoDB can themselves act as nodes.

Initial data

Let's have a collection persons, whose documents look like this:

[
  {
    "_id"  : "people/alice" ,
    "_key" : "alice" ,
    "name" : "Алиса"
  },
  {
    "_id"  : "people/bob" ,
    "_key" : "bob" ,
    "name" : "Боб"  
  }
]

Let there also be a collection cafes:

[
  {
    "_id" : "cafes/jd" ,
    "_key" : "jd" ,
    "name" : "Джон Донн"  
  },
  {
    "_id" : "cafes/jj" ,
    "_key" : "jj" ,
    "name" : "Жан-Жак"
  }
]

Then the collection likes might look like this:

[
  {
    "_id" : "likes/1" ,
    "_key" : "1" ,
    "_from" : "persons/alice" ,
    "_to" : "cafes/jd",
    "since" : 2010 
  },
  {
    "_id" : "likes/2" ,
    "_key" : "2" ,
    "_from" : "persons/alice" ,
    "_to" : "cafes/jj",
    "since" : 2011 
  } ,
  {
    "_id" : "likes/3" ,
    "_key" : "3" ,
    "_from" : "persons/bob" ,
    "_to" : "cafes/jd",
    "since" : 2012 
  }
]

Queries and results

A graph-style query in AQL, the query language used in ArangoDB, that returns, in a human-readable form, information about who likes which cafe looks like this:

FOR p IN persons
  FOR c IN OUTBOUND p likes
  RETURN { person : p.name , likes : c.name }

In a relational style, when we "compute" relationships rather than store them, this query can be rewritten like this (by the way, here one could do without likes being an edge collection):

FOR p IN persons
  FOR l IN likes
  FILTER p._id == l._from
    FOR c IN cafes
    FILTER l._to == c._id
    RETURN { person : p.name , likes : c.name }

The result in both cases will be the same:

[
  { "person" : "Алиса" , likes : "Жан-Жак" } ,
  { "person" : "Алиса" , likes : "Джон Донн" } ,
  { "person" : "Боб" , likes : "Джон Донн" }
]

More queries and results

If it seems that the format of the result above is more typical for a relational DBMS than for a document one, you can try this query (or you can use COLLECT):

FOR p IN persons
  RETURN {
    person : p.name,
    likes : (
      FOR c IN OUTBOUND p likes
      RETURN c.name
    )
}

The result will look like this:

[
  { "person" : "Алиса" , likes : ["Жан-Жак" , "Джон Донн"]  } ,
  { "person" : "Боб" , likes : ["Джон Донн"] }
]

OrientDB

The implementation of the graph model on top of the document model in OrientDB is based on the ability of document fields to have, in addition to more or less standard scalar values, values of types such as LINK, LINKLIST, LINKSET, LINKMAP, and LINKBAG. Values of these types are references, or collections of references, to the system identifiers of documents.

The document identifier assigned by the system has a "physical meaning", indicating the position of the record in the database, and looks something like this: @rid : #3:16. The values of reference properties are thus indeed pointers (as in the graph model) rather than selection conditions (as in the relational model).

Like ArangoDB, OrientDB represents edges as separate documents (although if an edge has no properties of its own, it can be made lightweight, and then no separate document corresponds to it).
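As an aside, a sketch of how such vertex and edge documents might be created through OrientDB SQL commands (class and property names follow the example below; the since value is simplified to a plain number). The in_likes/out_likes link fields visible in the dump below are then filled in by the engine rather than by hand:

-- V and E are the built-in base classes for vertices and edges
CREATE CLASS Person EXTENDS V;
CREATE CLASS Cafe   EXTENDS V;
CREATE CLASS likes  EXTENDS E;

CREATE VERTEX Person SET name = 'Алиса';
CREATE VERTEX Cafe   SET name = 'Джон Донн';

-- Creating the edge also updates the link fields of both vertex documents
CREATE EDGE likes
  FROM (SELECT FROM Person WHERE name = 'Алиса')
  TO   (SELECT FROM Cafe   WHERE name = 'Джон Донн')
  SET since = 2010;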

Initial data

In a format similar to the OrientDB database dump format, the data from the previous ArangoDB example would look something like this:

[
     {
      "@type": "document",
      "@rid": "#11:0",
      "@class": "Person",
      "name": "Алиса",
      "out_likes": [
        "#30:1",
        "#30:2"
      ],
      "@fieldTypes": "out_likes=LINKBAG"
    },
    {
      "@type": "document",
      "@rid": "#12:0",
      "@class": "Person",
      "name": "Боб",
      "out_likes": [
        "#30:3"
      ],
      "@fieldTypes": "out_likes=LINKBAG"
    },
    {
      "@type": "document",
      "@rid": "#21:0",
      "@class": "Cafe",
      "name": "Жан-Жак",
      "in_likes": [
        "#30:2",
        "#30:3"
      ],
      "@fieldTypes": "in_likes=LINKBAG"
    },
    {
      "@type": "document",
      "@rid": "#22:0",
      "@class": "Cafe",
      "name": "Джон Донн",
      "in_likes": [
        "#30:1"
      ],
      "@fieldTypes": "in_likes=LINKBAG"
    },
    {
      "@type": "document",
      "@rid": "#30:1",
      "@class": "likes",
      "in": "#22:0",
      "out": "#11:0",
      "since": 1262286000000,
      "@fieldTypes": "in=LINK,out=LINK,since=date"
    },
    {
      "@type": "document",
      "@rid": "#30:2",
      "@class": "likes",
      "in": "#21:0",
      "out": "#11:0",
      "since": 1293822000000,
      "@fieldTypes": "in=LINK,out=LINK,since=date"
    },
    {
      "@type": "document",
      "@rid": "#30:3",
      "@class": "likes",
      "in": "#21:0",
      "out": "#12:0",
      "since": 1325354400000,
      "@fieldTypes": "in=LINK,out=LINK,since=date"
    }
  ]

As we can see, the vertices also store information about incoming and outgoing edges. When using the Document API, you have to take care of this referential integrity yourself; the Graph API takes this work upon itself. But let's see what querying OrientDB looks like in "pure" query languages that are not embedded in a programming language.

Queries and results

A query similar in purpose to the query from the ArangoDB example in OrientDB looks like this:

SELECT name AS person_name, OUT('likes').name AS cafe_name
   FROM Person
   UNWIND cafe_name

The result will be in the following form:

[
  { "person_name": "Алиса", "cafe_name": "Джон Донн" },
  { "person_name": "Алиса", "cafe_name": "Жан-Жак" },
  { "person_name": "Боб",  "cafe_name": "Жан-Жак" }
]

If the format of the result again seems too "relational", it is enough to remove the line with UNWIND:

[
  { "person_name": "Алиса", "cafe_name": [ "Джон Донн", "Жан-Жак" ] },
  { "person_name": "Боб",  "cafe_name": [ "Жан-Жак" ' }
]

OrientDB's query language can be described as SQL with Gremlin-like inserts. Version 2.2 introduced a Cypher-like form of query, MATCH:

MATCH {CLASS: Person, AS: person}-likes->{CLASS: Cafe, AS: cafe}
RETURN person.name AS person_name, LIST(cafe.name) AS cafe_name
GROUP BY person_name

The format of the result will be the same as in the previous query. Think about what needs to be removed to make it more "relational", like in the very first query.

Azure CosmosDB

What was said above about ArangoDB and OrientDB applies to Azure CosmosDB to a lesser extent. CosmosDB provides the following data access APIs: SQL, MongoDB, Gremlin, and Cassandra.

The SQL API and the MongoDB API are used to access data in the document model, while the Gremlin API and the Cassandra API are used to access data in the graph and column-family models, respectively. Data in all models is stored in the format of CosmosDB's internal model, ARS ("atom-record-sequence"), which is also close to the document model.
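As an illustration, through the SQL API, JSON documents in a container are queried with a dialect of SQL. A minimal sketch (the container alias c is conventional, the field names are hypothetical):

SELECT c.name, c.surname
  FROM c
 WHERE c.surname = 'Smith'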

However, the data model chosen by the user and the API used are fixed at the time the account is created in the service. It is not possible to access data loaded via one model in the format of another model.

Thus, multi-modeling in Azure CosmosDB today is only the ability to use several databases supporting different models from a single vendor, which does not solve all the problems of polyglot persistence.

Multi-model DBMS based on a graph model?

It is noteworthy that there are no multi-model DBMSs on the market based on a graph model (apart from multi-model support in the sense of two graph models at once, RDF and LPG; see the previous publication about this). The greatest difficulty lies in implementing the document model, rather than the relational one, on top of a graph model.

The question of how to implement the relational model on top of the graph model was considered back when the latter was taking shape. As David McGovern, for example, said:

There is nothing inherent in the graph approach that prevents creating a layer (eg, by suitable indexing) on ​​a graph database that enables a relational view with (1) recovery of tuples from the usual key value pairs and (2) grouping of tuples by relation type.

When implementing a document model on top of a graph one, you need to keep in mind, for example, the following:

  • Elements of a JSON array are considered ordered, while the edges outgoing from a graph node are not;
  • Data in the document model is usually denormalized; still, one would not want to store several copies of the same subdocument, and subdocuments usually have no identifiers;
  • On the other hand, the ideology of document DBMSs is that documents are ready-made "aggregates" that do not have to be assembled anew each time; the graph model must therefore make it possible to quickly retrieve the subgraph corresponding to a complete document.

Some advertising

The author of this article is involved in the development of the NitrosBase DBMS, whose internal model is a graph model and whose external models, relational and document, are views over it. All models are equal in rights: almost any data is accessible in any of them via its natural query language. Moreover, data can be modified in any view; the changes will be reflected in the internal model and, accordingly, in the other views.

I hope to describe what this correspondence between models looks like in NitrosBase in one of the following articles.

Conclusion

I hope that the reader now has a more or less clear picture of the general contours of what is called multi-modeling. Quite different DBMSs are called multi-model, and "multi-model support" can look quite different. To understand what "multi-modeling" means in each particular case, it is useful to answer the following questions:

  1. Is it support for traditional models, or for some single "hybrid" model?
  2. Are the models "equal in rights", or is one of them subordinate to the others?
  3. Are the models "indifferent" to each other? Can data written in one model be read in another, or even modified there?

I think that the question of whether multi-model DBMSs are relevant can already be answered in the affirmative; a more interesting question is which of their varieties will be in greater demand in the near future. It seems that multi-model DBMSs supporting traditional models, primarily the relational one, will be more in demand; the popularity of multi-model DBMSs offering new models that combine the advantages of various traditional ones is a matter of a more distant future.

Do you use multi-model DBMS?

  • We do not use, we store everything in one DBMS and in one model

  • We use the multi-model capabilities of traditional DBMS

  • Practicing multivariant storage (polyglot persistence)

  • We use new multi-model DBMS (Arango, Orient, CosmosDB)

Source: habr.com
