Database Design Fundamentals - Comparing PostgreSQL, Cassandra, and MongoDB

Hello, friends. Before leaving for the second part of the May holidays, we are sharing with you the material that we have translated in anticipation of the launch of a new stream at the rate "Relational DBMS".

Database Design Fundamentals - Comparing PostgreSQL, Cassandra, and MongoDB

Application developers spend a lot of time comparing multiple operational databases to choose the one that works best for their intended workload. Needs may include simplified data modeling, transactional guarantees, read/write performance, horizontal scaling, and fault tolerance. Traditionally, the choice begins with the category of database, SQL or NoSQL, since each category provides a clear set of trade-offs. High performance in terms of low latency and high throughput is generally seen as a requirement that cannot be compromised and is therefore essential for any database in the sample.

The purpose of this article is to help application developers make the right choice between SQL and NoSQL in the context of application data modeling. We will look at one SQL database, namely PostgreSQL, and two NoSQL databases, Cassandra and MongoDB, to cover the basics of database design, such as creating tables, populating them, reading data from a table, and deleting them. In the next article, we will definitely look at indexes, transactions, JOINs, TTL directives, and database design based on JSON.

What is the difference between SQL and NoSQL?

SQL databases increase application flexibility through ACID transactional guarantees, as well as their ability to query data using JOINs in unexpected ways on top of existing normalized relational database models.

Given their monolithic/single node architecture and use of a master-slave replication model for redundancy, traditional SQL databases lack two important features - linear write scalability (i.e. automatic splitting across multiple nodes) and automatic/zero data loss. This means that the amount of data received cannot exceed the maximum write throughput of a single node. In addition, some temporary loss of data must be taken into account for fault tolerance (in a non-shared architecture). Here you need to keep in mind that recent commits have not yet been reflected in the slave copy. No downtime updates are also difficult to achieve in SQL databases.

NoSQL databases are typically distributed in nature, i.e. in them, the data is divided into sections and distributed over several nodes. They require denormalization. This means that the entered data must also be copied several times in order to respond to the specific requests you send. The overall goal is to get high performance by reducing the number of shards available at read time. This implies that NoSQL requires you to model your queries, while SQL requires you to model your data.

NoSQL emphasizes achieving high performance in a distributed cluster and this is the main rationale for many database design trade-offs, which include loss of ACID transactions, JOINs, and consistent global secondary indexes.

There is an opinion that although NoSQL databases provide linear write scalability and high fault tolerance, the loss of transactional guarantees makes them unsuitable for critical data.

The following table shows how data modeling in NoSQL differs from SQL.

Database Design Fundamentals - Comparing PostgreSQL, Cassandra, and MongoDB

SQL and NoSQL: Why are both needed?

Real-life applications with a large number of users, such as Amazon.com, Netflix, Uber and Airbnb, are responsible for performing complex tasks of various kinds. For example, an e-commerce application like Amazon.com needs to store lightweight, highly sensitive data such as information about users, products, orders, invoices, along with heavy but less sensitive data such as product reviews, support messages. , user activity, user reviews and recommendations. Naturally, these applications rely on at least one SQL database along with at least one NoSQL database. In inter-regional and global systems, NoSQL database works as a geo-distributed cache for data stored in a trusted source, SQL database, operating in any one region.

How does YugaByte DB combine SQL and NoSQL?

Built on a log-oriented mixed storage engine, auto-sharding, sharded distributed consensus replication, and ACID distributed transactions (inspired by Google Spanner), YugaByte DB is the world's first open source database that is simultaneously NoSQL (Cassandra & Redis) compatible. ) and SQL (PostgreSQL). As shown in the table below, YCQL, a YugaByte DB API compatible with Cassandra, adds the concepts of single and multi-key ACID transactions and global secondary indexes to the NoSQL API, thus ushering in the era of transactional NoSQL databases. In addition, YCQL, a YugaByte DB API compatible with PostgreSQL, adds the concepts of linear write scaling and automatic failover to the SQL API, bringing distributed SQL databases to the world. Since the YugaByte DB database is inherently transactional, the NoSQL API can now be used in the context of critical data.

Database Design Fundamentals - Comparing PostgreSQL, Cassandra, and MongoDB

As previously stated in the article "Introducing YSQL: A PostgreSQL Compatible Distributed SQL API for YugaByte DB", the choice between SQL or NoSQL in YugaByte DB depends entirely on the characteristics of the underlying workload:

  • If your primary workload is multi-key JOIN operations, then when choosing YSQL, be aware that your keys may be spread across multiple nodes, resulting in higher latency and/or lower throughput than NoSQL.
  • Otherwise, choose either of the two NoSQL APIs, keeping in mind that you will get better performance as a result of queries served from one node at a time. YugaByte DB can serve as a single operational database for real complex applications that need to manage multiple workloads at the same time.

The Data modeling lab in the next section is based on the PostgreSQL and Cassandra compatible YugaByte DB database APIs, as opposed to the original databases. This approach emphasizes the ease of interacting with two different APIs (on two different ports) of the same database cluster, as opposed to using completely independent clusters of two different databases.
In the following sections, we'll take a look at the Data Modeling Lab to illustrate the difference and some of the commonalities of the databases in question.

Data Modeling Lab

Installing databases

Given the focus on data model design (rather than complex deployment architectures), we will install the databases in Docker containers on the local machine and then interact with them using their respective command line shells.

PostgreSQL & Cassandra compatible, YugaByte DB database

mkdir ~/yugabyte && cd ~/yugabyte
wget https://downloads.yugabyte.com/yb-docker-ctl && chmod +x yb-docker-ctl
docker pull yugabytedb/yugabyte
./yb-docker-ctl create --enable_postgres

MongoDB

docker run --name my-mongo -d mongo:latest

Command line access

Let's connect to the databases using the command line shell for the respective APIs.

PostgreSQL

psql is a command line shell for interacting with PostgreSQL. For ease of use, YugaByte DB comes with psql right in the bin folder.

docker exec -it yb-postgres-n1 /home/yugabyte/postgres/bin/psql -p 5433 -U postgres

Cassandra

sqlsh is a command line shell for interacting with Cassandra and its compatible databases via CQL (Cassandra Query Language). For ease of use, YugaByte DB comes with cqlsh in the catalog bin.
Note that CQL was inspired by SQL and has similar concepts of tables, rows, columns, and indexes. However, as a NoSQL language, it adds a certain set of restrictions, most of which we will also cover in other articles.

docker exec -it yb-tserver-n1 /home/yugabyte/bin/cqlsh

MongoDB

mongo is a command line shell for interacting with MongoDB. It can be found in the bin directory of the MongoDB installation.

docker exec -it my-mongo bash 
cd bin
mongo

Creating a table

Now we can interact with the database to perform various operations using the command line. Let's start by creating a table that stores information about songs written by different artists. These songs may be part of an album. Also optional attributes for the song are release year, price, genre and rating. We need to take into account additional attributes that may be needed in the future through the "tags" field. It can store semi-structured data as key-value pairs.

PostgreSQL

CREATE TABLE Music (
    Artist VARCHAR(20) NOT NULL, 
    SongTitle VARCHAR(30) NOT NULL,
    AlbumTitle VARCHAR(25),
    Year INT,
    Price FLOAT,
    Genre VARCHAR(10),
    CriticRating FLOAT,
    Tags TEXT,
    PRIMARY KEY(Artist, SongTitle)
);	

Cassandra

Creating a table in Cassandra is very similar to PostgreSQL. One of the main differences is the absence of integrity constraints (like NOT NULL), but this is the responsibility of the application, not the NoSQL database.. The primary key consists of a partition key (Artist column in the example below) and a set of clustering columns (SongTitle column in the example below). The partition key determines which partition/shard to put the row into, and the clustering columns indicate how the data should be organized within the current shard.

CREATE KEYSPACE myapp;
USE myapp;
CREATE TABLE Music (
    Artist TEXT, 
    SongTitle TEXT,
    AlbumTitle TEXT,
    Year INT,
    Price FLOAT,
    Genre TEXT,
    CriticRating FLOAT,
    Tags TEXT,
    PRIMARY KEY(Artist, SongTitle)
);

MongoDB

MongoDB organizes data into databases (Database) (similar to Keyspace in Cassandra), where there are collections (Collections) (similar to tables) that contain documents (Documents) (similar to rows in a table). In MongoDB, in principle, no initial schema definition is required. Team "use database", shown below, instantiates the database on the first call and changes the context for the newly created database. Even collections do not need to be created explicitly, they are created automatically, just when the first document is added to a new collection. Note that MongoDB uses a test database by default, so any collection level operation without specifying a specific database will be performed in it by default.

use myNewDatabase;

Getting information about a table
PostgreSQL

d Music
Table "public.music"
    Column    |         Type          | Collation | Nullable | Default 
--------------+-----------------------+-----------+----------+--------
 artist       | character varying(20) |           | not null | 
 songtitle    | character varying(30) |           | not null | 
 albumtitle   | character varying(25) |           |          | 
 year         | integer               |           |          | 
 price        | double precision      |           |          | 
 genre        | character varying(10) |           |          | 
 criticrating | double precision      |           |          | 
 tags         | text                  |           |          | 
Indexes:
    "music_pkey" PRIMARY KEY, btree (artist, songtitle)

Cassandra

DESCRIBE TABLE MUSIC;
CREATE TABLE myapp.music (
    artist text,
    songtitle text,
    albumtitle text,
    year int,
    price float,
    genre text,
    tags text,
    PRIMARY KEY (artist, songtitle)
) WITH CLUSTERING ORDER BY (songtitle ASC)
    AND default_time_to_live = 0
    AND transactions = {'enabled': 'false'};

MongoDB

use myNewDatabase;
show collections;

Entering data into a table
PostgreSQL

INSERT INTO Music 
    (Artist, SongTitle, AlbumTitle, 
    Year, Price, Genre, CriticRating, 
    Tags)
VALUES(
    'No One You Know', 'Call Me Today', 'Somewhat Famous',
    2015, 2.14, 'Country', 7.8,
    '{"Composers": ["Smith", "Jones", "Davis"],"LengthInSeconds": 214}'
);
INSERT INTO Music 
    (Artist, SongTitle, AlbumTitle, 
    Price, Genre, CriticRating)
VALUES(
    'No One You Know', 'My Dog Spot', 'Hey Now',
    1.98, 'Country', 8.4
);
INSERT INTO Music 
    (Artist, SongTitle, AlbumTitle, 
    Price, Genre)
VALUES(
    'The Acme Band', 'Look Out, World', 'The Buck Starts Here',
    0.99, 'Rock'
);
INSERT INTO Music 
    (Artist, SongTitle, AlbumTitle, 
    Price, Genre, 
    Tags)
VALUES(
    'The Acme Band', 'Still In Love', 'The Buck Starts Here',
    2.47, 'Rock', 
    '{"radioStationsPlaying": ["KHCR", "KBQX", "WTNR", "WJJH"], "tourDates": { "Seattle": "20150625", "Cleveland": "20150630"}, "rotation": Heavy}'
);

Cassandra

In general, the expression INSERT in Cassandra looks very similar to the one in PostgreSQL. However, there is one big difference in semantics. In Cassandra INSERT is actually an operation UPSERT, where the latest values ​​are added to the string, in case the string already exists.

Data entry is similar to PostgreSQL INSERT above

.

MongoDB

Even though MongoDB is a NoSQL database like Cassandra, its data entry operation has nothing to do with Cassandra's semantic behavior. In MongoDB insert () has no opportunity UPSERT, which makes it similar to PostgreSQL. Adding default data without _idspecified will result in a new document being added to the collection.

db.music.insert( {
artist: "No One You Know",
songTitle: "Call Me Today",
albumTitle: "Somewhat Famous",
year: 2015,
price: 2.14,
genre: "Country",
tags: {
Composers: ["Smith", "Jones", "Davis"],
LengthInSeconds: 214
}
}
);
db.music.insert( {
artist: "No One You Know",
songTitle: "My Dog Spot",
albumTitle: "Hey Now",
price: 1.98,
genre: "Country",
criticRating: 8.4
}
);
db.music.insert( {
artist: "The Acme Band",
songTitle: "Look Out, World",
albumTitle:"The Buck Starts Here",
price: 0.99,
genre: "Rock"
}
);
db.music.insert( {
artist: "The Acme Band",
songTitle: "Still In Love",
albumTitle:"The Buck Starts Here",
price: 2.47,
genre: "Rock",
tags: {
radioStationsPlaying:["KHCR", "KBQX", "WTNR", "WJJH"],
tourDates: {
Seattle: "20150625",
Cleveland: "20150630"
},
rotation: "Heavy"
}
}
);

Table query

Perhaps the most significant difference between SQL and NoSQL in terms of querying is the use of FROM и WHERE. SQL allows after expression FROM select multiple tables, and an expression with WHERE can be of any complexity (including operations JOIN between tables). However, NoSQL tends to impose a hard limit on FROM, and work with only one specified table, and in WHERE, the primary key must always be specified. This is due to the desire to improve the performance of NoSQL, which we talked about earlier. This desire leads to every possible reduction of any cross-tab and cross-key interaction. It can introduce a large delay in inter-node communication when responding to a request and is therefore best avoided in principle. For example, Cassandra requires requests to be restricted to certain operators (only allowed =, IN, <, >, =>, <=) on partition keys, except when querying a secondary index (only the = operator is allowed here).

PostgreSQL

The following are three examples of queries that can be easily executed by a SQL database.

  • Display all songs of the artist;
  • Display all songs of the artist that match the first part of the title;
  • Display all songs by the artist that have a certain word in the title and have a price less than 1.00.
SELECT * FROM Music
WHERE Artist='No One You Know';
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle LIKE 'Call%';
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle LIKE '%Today%'
AND Price > 1.00;

Cassandra

Of the PostgreSQL queries listed above, only the first one will work unchanged in Cassandra, because the statement LIKE cannot be applied to clustering columns such as SongTitle. In this case, only operators are allowed = и IN.

SELECT * FROM Music
WHERE Artist='No One You Know';
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle IN ('Call Me Today', 'My Dog Spot')
AND Price > 1.00;

MongoDB

As shown in the previous examples, the main method for creating queries in MongoDB is db.collection.find(). This method explicitly contains the name of the collection (music in the example below), so querying multiple collections is not allowed.

db.music.find( {
  artist: "No One You Know"
 } 
);
db.music.find( {
  artist: "No One You Know",
  songTitle: /Call/
 } 
);

Reading all rows of a table

Reading all rows is just a special case of the query pattern we discussed earlier.

PostgreSQL

SELECT * 
FROM Music;

Cassandra

Similar to the PostgreSQL example above.

MongoDB

db.music.find( {} );

Editing data in a table

PostgreSQL

PostgreSQL provides a statement UPDATE to change data. She has no opportunity UPSERT, so this statement will fail if the row is no longer in the database.

UPDATE Music
SET Genre = 'Disco'
WHERE Artist = 'The Acme Band' AND SongTitle = 'Still In Love';

Cassandra

Cassandra has UPDATE similar to PostgreSQL. UPDATE has the same semantics UPSERT, like INSERT.

Similar to the PostgreSQL example above.

MongoDB
Operation update() in MongoDB it can completely update an existing document or update only certain fields. By default it only updates one document with semantics disabled UPSERT. Refresh multiple documents and similar behavior UPSERT can be applied by setting additional flags for the operation. As for example in the example below, the genre of a particular artist is updated by his song.

db.music.update(
  {"artist": "The Acme Band"},
  { 
    $set: {
      "genre": "Disco"
    }
  },
  {"multi": true, "upsert": true}
);

Removing data from a table

PostgreSQL

DELETE FROM Music
WHERE Artist = 'The Acme Band' AND SongTitle = 'Look Out, World';

Cassandra

Similar to the PostgreSQL example above.

MongoDB

MongoDB has two types of operations for deleting documents − deleteOne() /deleteMany() и remove (). Both types delete documents but return different results.

db.music.deleteMany( {
        artist: "The Acme Band"
    }
);

Deleting a table

PostgreSQL

DROP TABLE Music;

Cassandra

Similar to the PostgreSQL example above.

MongoDB

db.music.drop();

Conclusion

The debate over the choice between SQL and NoSQL has been raging for over 10 years. There are two main aspects to this debate: the architecture of the database engine (monolithic, transactional SQL vs. distributed, non-transactional NoSQL) and the approach to database design (data modeling in SQL vs. modeling your queries in NoSQL).

With a distributed transactional database such as YugaByte DB, the database architecture debate can be easily dispelled. As data volumes become larger than what can be written to a single node, a fully distributed architecture that supports linear write scalability with automatic sharding/rebalancing becomes necessary.

In addition to what was said in one of the articles Google Cloud, transactional, strongly consistent architectures are now more widely adopted to provide better development flexibility than non-transactional, ultimately consistent architectures.

Returning to the discussion of database design, it is fair to say that both design approaches (SQL and NoSQL) are necessary for any complex real-world application. SQL's "data modeling" approach allows developers to more easily meet changing business requirements, while NoSQL's "query modeling" approach allows those same developers to handle large amounts of data with low latency and high throughput. It is for this reason that YugaByte DB provides SQL and NoSQL APIs in a common core, and does not advocate any one of the approaches. In addition, by providing compatibility with popular database languages ​​including PostgreSQL and Cassandra, YugaByte DB ensures that developers do not have to learn another language in order to work with a distributed strongly consistent database engine.

In this article, we looked at how database design fundamentals differ between PostgreSQL, Cassandra, and MongoDB. In the following articles, we will dive into advanced design concepts such as indexes, transactions, JOINs, TTL directives, and JSON documents.

We wish you a great weekend and invite you to free webinarwhich will take place on May 14th.

Source: habr.com

Add a comment