Hello, friends. Before leaving for the second part of the May holidays, we are sharing with you the material that we have translated in anticipation of the launch of a new stream at the rate
Application developers spend a lot of time comparing multiple operational databases to choose the one that works best for their intended workload. Needs may include simplified data modeling, transactional guarantees, read/write performance, horizontal scaling, and fault tolerance. Traditionally, the choice begins with the category of database, SQL or NoSQL, since each category provides a clear set of trade-offs. High performance in terms of low latency and high throughput is generally seen as a requirement that cannot be compromised and is therefore essential for any database in the sample.
The purpose of this article is to help application developers make the right choice between SQL and NoSQL in the context of application data modeling. We will look at one SQL database, namely PostgreSQL, and two NoSQL databases, Cassandra and MongoDB, to cover the basics of database design, such as creating tables, populating them, reading data from a table, and deleting them. In the next article, we will definitely look at indexes, transactions, JOINs, TTL directives, and database design based on JSON.
What is the difference between SQL and NoSQL?
SQL databases increase application flexibility through ACID transactional guarantees, as well as their ability to query data using JOINs in unexpected ways on top of existing normalized relational database models.
Given their monolithic/single node architecture and use of a master-slave replication model for redundancy, traditional SQL databases lack two important features - linear write scalability (i.e. automatic splitting across multiple nodes) and automatic/zero data loss. This means that the amount of data received cannot exceed the maximum write throughput of a single node. In addition, some temporary loss of data must be taken into account for fault tolerance (in a non-shared architecture). Here you need to keep in mind that recent commits have not yet been reflected in the slave copy. No downtime updates are also difficult to achieve in SQL databases.
NoSQL databases are typically distributed in nature, i.e. in them, the data is divided into sections and distributed over several nodes. They require denormalization. This means that the entered data must also be copied several times in order to respond to the specific requests you send. The overall goal is to get high performance by reducing the number of shards available at read time. This implies that NoSQL requires you to model your queries, while SQL requires you to model your data.
NoSQL emphasizes achieving high performance in a distributed cluster and this is the main rationale for many database design trade-offs, which include loss of ACID transactions, JOINs, and consistent global secondary indexes.
There is an opinion that although NoSQL databases provide linear write scalability and high fault tolerance, the loss of transactional guarantees makes them unsuitable for critical data.
The following table shows how data modeling in NoSQL differs from SQL.
SQL and NoSQL: Why are both needed?
Real-life applications with a large number of users, such as Amazon.com, Netflix, Uber and Airbnb, are responsible for performing complex tasks of various kinds. For example, an e-commerce application like Amazon.com needs to store lightweight, highly sensitive data such as information about users, products, orders, invoices, along with heavy but less sensitive data such as product reviews, support messages. , user activity, user reviews and recommendations. Naturally, these applications rely on at least one SQL database along with at least one NoSQL database. In inter-regional and global systems, NoSQL database works as a geo-distributed cache for data stored in a trusted source, SQL database, operating in any one region.
How does YugaByte DB combine SQL and NoSQL?
Built on a log-oriented mixed storage engine, auto-sharding, sharded distributed consensus replication, and ACID distributed transactions (inspired by Google Spanner), YugaByte DB is the world's first open source database that is simultaneously NoSQL (Cassandra & Redis) compatible. ) and SQL (PostgreSQL). As shown in the table below, YCQL, a YugaByte DB API compatible with Cassandra, adds the concepts of single and multi-key ACID transactions and global secondary indexes to the NoSQL API, thus ushering in the era of transactional NoSQL databases. In addition, YCQL, a YugaByte DB API compatible with PostgreSQL, adds the concepts of linear write scaling and automatic failover to the SQL API, bringing distributed SQL databases to the world. Since the YugaByte DB database is inherently transactional, the NoSQL API can now be used in the context of critical data.
As previously stated in the article
- If your primary workload is multi-key JOIN operations, then when choosing YSQL, be aware that your keys may be spread across multiple nodes, resulting in higher latency and/or lower throughput than NoSQL.
- Otherwise, choose either of the two NoSQL APIs, keeping in mind that you will get better performance as a result of queries served from one node at a time. YugaByte DB can serve as a single operational database for real complex applications that need to manage multiple workloads at the same time.
The Data modeling lab in the next section is based on the PostgreSQL and Cassandra compatible YugaByte DB database APIs, as opposed to the original databases. This approach emphasizes the ease of interacting with two different APIs (on two different ports) of the same database cluster, as opposed to using completely independent clusters of two different databases.
In the following sections, we'll take a look at the Data Modeling Lab to illustrate the difference and some of the commonalities of the databases in question.
Data Modeling Lab
Installing databases
Given the focus on data model design (rather than complex deployment architectures), we will install the databases in Docker containers on the local machine and then interact with them using their respective command line shells.
PostgreSQL & Cassandra compatible, YugaByte DB database
mkdir ~/yugabyte && cd ~/yugabyte
wget https://downloads.yugabyte.com/yb-docker-ctl && chmod +x yb-docker-ctl
docker pull yugabytedb/yugabyte
./yb-docker-ctl create --enable_postgres
MongoDB
docker run --name my-mongo -d mongo:latest
Command line access
Let's connect to the databases using the command line shell for the respective APIs.
PostgreSQL
docker exec -it yb-postgres-n1 /home/yugabyte/postgres/bin/psql -p 5433 -U postgres
Cassandra
cqlsh
in the catalog bin
.
Note that CQL was inspired by SQL and has similar concepts of tables, rows, columns, and indexes. However, as a NoSQL language, it adds a certain set of restrictions, most of which we will also cover in other articles.
docker exec -it yb-tserver-n1 /home/yugabyte/bin/cqlsh
MongoDB
docker exec -it my-mongo bash
cd bin
mongo
Creating a table
Now we can interact with the database to perform various operations using the command line. Let's start by creating a table that stores information about songs written by different artists. These songs may be part of an album. Also optional attributes for the song are release year, price, genre and rating. We need to take into account additional attributes that may be needed in the future through the "tags" field. It can store semi-structured data as key-value pairs.
PostgreSQL
CREATE TABLE Music (
Artist VARCHAR(20) NOT NULL,
SongTitle VARCHAR(30) NOT NULL,
AlbumTitle VARCHAR(25),
Year INT,
Price FLOAT,
Genre VARCHAR(10),
CriticRating FLOAT,
Tags TEXT,
PRIMARY KEY(Artist, SongTitle)
);
Cassandra
Creating a table in Cassandra is very similar to PostgreSQL. One of the main differences is the absence of integrity constraints (like NOT NULL), but this is the responsibility of the application, not the NoSQL database.. The primary key consists of a partition key (Artist column in the example below) and a set of clustering columns (SongTitle column in the example below). The partition key determines which partition/shard to put the row into, and the clustering columns indicate how the data should be organized within the current shard.
CREATE KEYSPACE myapp;
USE myapp;
CREATE TABLE Music (
Artist TEXT,
SongTitle TEXT,
AlbumTitle TEXT,
Year INT,
Price FLOAT,
Genre TEXT,
CriticRating FLOAT,
Tags TEXT,
PRIMARY KEY(Artist, SongTitle)
);
MongoDB
MongoDB organizes data into databases (Database) (similar to Keyspace in Cassandra), where there are collections (Collections) (similar to tables) that contain documents (Documents) (similar to rows in a table). In MongoDB, in principle, no initial schema definition is required. Team "use database", shown below, instantiates the database on the first call and changes the context for the newly created database. Even collections do not need to be created explicitly, they are created automatically, just when the first document is added to a new collection. Note that MongoDB uses a test database by default, so any collection level operation without specifying a specific database will be performed in it by default.
use myNewDatabase;
Getting information about a table
PostgreSQL
d Music
Table "public.music"
Column | Type | Collation | Nullable | Default
--------------+-----------------------+-----------+----------+--------
artist | character varying(20) | | not null |
songtitle | character varying(30) | | not null |
albumtitle | character varying(25) | | |
year | integer | | |
price | double precision | | |
genre | character varying(10) | | |
criticrating | double precision | | |
tags | text | | |
Indexes:
"music_pkey" PRIMARY KEY, btree (artist, songtitle)
Cassandra
DESCRIBE TABLE MUSIC;
CREATE TABLE myapp.music (
artist text,
songtitle text,
albumtitle text,
year int,
price float,
genre text,
tags text,
PRIMARY KEY (artist, songtitle)
) WITH CLUSTERING ORDER BY (songtitle ASC)
AND default_time_to_live = 0
AND transactions = {'enabled': 'false'};
MongoDB
use myNewDatabase;
show collections;
Entering data into a table
PostgreSQL
INSERT INTO Music
(Artist, SongTitle, AlbumTitle,
Year, Price, Genre, CriticRating,
Tags)
VALUES(
'No One You Know', 'Call Me Today', 'Somewhat Famous',
2015, 2.14, 'Country', 7.8,
'{"Composers": ["Smith", "Jones", "Davis"],"LengthInSeconds": 214}'
);
INSERT INTO Music
(Artist, SongTitle, AlbumTitle,
Price, Genre, CriticRating)
VALUES(
'No One You Know', 'My Dog Spot', 'Hey Now',
1.98, 'Country', 8.4
);
INSERT INTO Music
(Artist, SongTitle, AlbumTitle,
Price, Genre)
VALUES(
'The Acme Band', 'Look Out, World', 'The Buck Starts Here',
0.99, 'Rock'
);
INSERT INTO Music
(Artist, SongTitle, AlbumTitle,
Price, Genre,
Tags)
VALUES(
'The Acme Band', 'Still In Love', 'The Buck Starts Here',
2.47, 'Rock',
'{"radioStationsPlaying": ["KHCR", "KBQX", "WTNR", "WJJH"], "tourDates": { "Seattle": "20150625", "Cleveland": "20150630"}, "rotation": Heavy}'
);
Cassandra
In general, the expression INSERT
in Cassandra looks very similar to the one in PostgreSQL. However, there is one big difference in semantics. In Cassandra INSERT
is actually an operation UPSERT
, where the latest values are added to the string, in case the string already exists.
Data entry is similar to PostgreSQL
INSERT
above
.
MongoDB
Even though MongoDB is a NoSQL database like Cassandra, its data entry operation has nothing to do with Cassandra's semantic behavior. In MongoDB UPSERT
, which makes it similar to PostgreSQL. Adding default data without _idspecified
will result in a new document being added to the collection.
db.music.insert( {
artist: "No One You Know",
songTitle: "Call Me Today",
albumTitle: "Somewhat Famous",
year: 2015,
price: 2.14,
genre: "Country",
tags: {
Composers: ["Smith", "Jones", "Davis"],
LengthInSeconds: 214
}
}
);
db.music.insert( {
artist: "No One You Know",
songTitle: "My Dog Spot",
albumTitle: "Hey Now",
price: 1.98,
genre: "Country",
criticRating: 8.4
}
);
db.music.insert( {
artist: "The Acme Band",
songTitle: "Look Out, World",
albumTitle:"The Buck Starts Here",
price: 0.99,
genre: "Rock"
}
);
db.music.insert( {
artist: "The Acme Band",
songTitle: "Still In Love",
albumTitle:"The Buck Starts Here",
price: 2.47,
genre: "Rock",
tags: {
radioStationsPlaying:["KHCR", "KBQX", "WTNR", "WJJH"],
tourDates: {
Seattle: "20150625",
Cleveland: "20150630"
},
rotation: "Heavy"
}
}
);
Table query
Perhaps the most significant difference between SQL and NoSQL in terms of querying is the use of FROM
и WHERE
. SQL allows after expression FROM
select multiple tables, and an expression with WHERE
can be of any complexity (including operations JOIN
between tables). However, NoSQL tends to impose a hard limit on FROM
, and work with only one specified table, and in WHERE
, the primary key must always be specified. This is due to the desire to improve the performance of NoSQL, which we talked about earlier. This desire leads to every possible reduction of any cross-tab and cross-key interaction. It can introduce a large delay in inter-node communication when responding to a request and is therefore best avoided in principle. For example, Cassandra requires requests to be restricted to certain operators (only allowed =, IN, <, >, =>, <=
) on partition keys, except when querying a secondary index (only the = operator is allowed here).
PostgreSQL
The following are three examples of queries that can be easily executed by a SQL database.
- Display all songs of the artist;
- Display all songs of the artist that match the first part of the title;
- Display all songs by the artist that have a certain word in the title and have a price less than 1.00.
SELECT * FROM Music
WHERE Artist='No One You Know';
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle LIKE 'Call%';
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle LIKE '%Today%'
AND Price > 1.00;
Cassandra
Of the PostgreSQL queries listed above, only the first one will work unchanged in Cassandra, because the statement LIKE
cannot be applied to clustering columns such as SongTitle
. In this case, only operators are allowed =
и IN
.
SELECT * FROM Music
WHERE Artist='No One You Know';
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle IN ('Call Me Today', 'My Dog Spot')
AND Price > 1.00;
MongoDB
As shown in the previous examples, the main method for creating queries in MongoDB is music
in the example below), so querying multiple collections is not allowed.
db.music.find( {
artist: "No One You Know"
}
);
db.music.find( {
artist: "No One You Know",
songTitle: /Call/
}
);
Reading all rows of a table
Reading all rows is just a special case of the query pattern we discussed earlier.
PostgreSQL
SELECT *
FROM Music;
Cassandra
Similar to the PostgreSQL example above.
MongoDB
db.music.find( {} );
Editing data in a table
PostgreSQL
PostgreSQL provides a statement UPDATE
to change data. She has no opportunity UPSERT
, so this statement will fail if the row is no longer in the database.
UPDATE Music
SET Genre = 'Disco'
WHERE Artist = 'The Acme Band' AND SongTitle = 'Still In Love';
Cassandra
Cassandra has UPDATE
similar to PostgreSQL. UPDATE
has the same semantics UPSERT
, like INSERT
.
Similar to the PostgreSQL example above.
MongoDB
Operation UPSERT
. Refresh multiple documents and similar behavior UPSERT
can be applied by setting additional flags for the operation. As for example in the example below, the genre of a particular artist is updated by his song.
db.music.update(
{"artist": "The Acme Band"},
{
$set: {
"genre": "Disco"
}
},
{"multi": true, "upsert": true}
);
Removing data from a table
PostgreSQL
DELETE FROM Music
WHERE Artist = 'The Acme Band' AND SongTitle = 'Look Out, World';
Cassandra
Similar to the PostgreSQL example above.
MongoDB
MongoDB has two types of operations for deleting documents −
db.music.deleteMany( {
artist: "The Acme Band"
}
);
Deleting a table
PostgreSQL
DROP TABLE Music;
Cassandra
Similar to the PostgreSQL example above.
MongoDB
db.music.drop();
Conclusion
The debate over the choice between SQL and NoSQL has been raging for over 10 years. There are two main aspects to this debate: the architecture of the database engine (monolithic, transactional SQL vs. distributed, non-transactional NoSQL) and the approach to database design (data modeling in SQL vs. modeling your queries in NoSQL).
With a distributed transactional database such as YugaByte DB, the database architecture debate can be easily dispelled. As data volumes become larger than what can be written to a single node, a fully distributed architecture that supports linear write scalability with automatic sharding/rebalancing becomes necessary.
In addition to what was said in one of the articles
Returning to the discussion of database design, it is fair to say that both design approaches (SQL and NoSQL) are necessary for any complex real-world application. SQL's "data modeling" approach allows developers to more easily meet changing business requirements, while NoSQL's "query modeling" approach allows those same developers to handle large amounts of data with low latency and high throughput. It is for this reason that YugaByte DB provides SQL and NoSQL APIs in a common core, and does not advocate any one of the approaches. In addition, by providing compatibility with popular database languages including PostgreSQL and Cassandra, YugaByte DB ensures that developers do not have to learn another language in order to work with a distributed strongly consistent database engine.
In this article, we looked at how database design fundamentals differ between PostgreSQL, Cassandra, and MongoDB. In the following articles, we will dive into advanced design concepts such as indexes, transactions, JOINs, TTL directives, and JSON documents.
We wish you a great weekend and invite you to
Source: habr.com