TL; DR: JSONB can greatly simplify database schema development without sacrificing query performance.
Introduction
Here is a classic example of probably one of the oldest use cases in the world of relational databases (database): we have an entity, and we need to save certain properties (attributes) of this entity. But not all instances may have the same set of properties, moreover, in the future, the possible addition of more properties.
The easiest way to solve this problem is to create a column in the database table for each property value, and just fill in the ones that are needed for a particular entity instance. Great! Problem solved...until your table has millions of records and you need to add a new record.
Consider the EAV pattern (
However, I wouldn't be writing this post if there weren't some flaws in the EVA approach. So, for example, to get one or more entities that have 1 attribute each, 2 joins (joins) are required in the query: the first is a join with an attribute table, the second is a join with a value table. If an entity has 2 attributes, then 4 joins are needed! Also, all attributes are usually stored as strings, which results in type casting for both the result and the WHERE clause. If you write a lot of queries, then this is quite wasteful in terms of resource usage.
Despite these obvious shortcomings, EAV has long been used to solve these kinds of problems. These were inevitable drawbacks, and there was simply no better alternative.
But then a new “technology” appeared in PostgreSQL…
Starting with PostgreSQL 9.4, the JSONB data type was added to store binary JSON data. While storing JSON in this format typically takes a bit more space and time than plain text JSON, it is much faster to perform operations on it. JSONB also supports indexing, which makes querying them even faster.
The JSONB data type allows us to replace the cumbersome EAV pattern by adding just one JSONB column to our entity table, which greatly simplifies database design. But many argue that this should be accompanied by a decrease in performance ... That's why I came up with this article.
Setting up a test database
For this comparison, I created the database on a fresh install of PostgreSQL 9.5 on an $80 build.
CREATE TABLE entity (
id SERIAL PRIMARY KEY,
name TEXT,
description TEXT
);
CREATE TABLE entity_attribute (
id SERIAL PRIMARY KEY,
name TEXT
);
CREATE TABLE entity_attribute_value (
id SERIAL PRIMARY KEY,
entity_id INT REFERENCES entity(id),
entity_attribute_id INT REFERENCES entity_attribute(id),
value TEXT
);
Below is a table where the same data will be stored, but with attributes in a JSONB type column − properties.
CREATE TABLE entity_jsonb (
id SERIAL PRIMARY KEY,
name TEXT,
description TEXT,
properties JSONB
);
Looks much easier, doesn't it? Then it was added to entity tables (entity & entity_jsonb) 10 million records, and accordingly, were filled with the same table data where the EAV pattern is used and the JSONB column approach - entity_jsonb.properties. Thus, we received several different data types among the entire set of properties. Sample data:
{
id: 1
name: "Entity1"
description: "Test entity no. 1"
properties: {
color: "red"
lenght: 120
width: 3.1882420
hassomething: true
country: "Belgium"
}
}
So, now we have the same data for the two options. Let's start comparing implementations at work!
Design Simplification
As mentioned earlier, the database design has been greatly simplified: one table, by using a JSONB column for properties, instead of using three tables for EAV. But how is this reflected in requests? Updating a single entity property looks like this:
-- EAV
UPDATE entity_attribute_value
SET value = 'blue'
WHERE entity_attribute_id = 1
AND entity_id = 120;
-- JSONB
UPDATE entity_jsonb
SET properties = jsonb_set(properties, '{"color"}', '"blue"')
WHERE id = 120;
As you can see, the last query does not look simpler. To update the value of a property in a JSONB object, we must use the function
Now let's select the entity we just updated based on its new color:
-- EAV
SELECT e.name
FROM entity e
INNER JOIN entity_attribute_value eav ON e.id = eav.entity_id
INNER JOIN entity_attribute ea ON eav.entity_attribute_id = ea.id
WHERE ea.name = 'color' AND eav.value = 'blue';
-- JSONB
SELECT name
FROM entity_jsonb
WHERE properties ->> 'color' = 'blue';
I think we can agree that the second one is shorter (no join!), and therefore more readable. JSONB wins here! We use the JSON ->> operator to get the color as a text value from a JSONB object. There is also a second way to achieve the same result in a JSONB model using the @> operator:
-- JSONB
SELECT name
FROM entity_jsonb
WHERE properties @> '{"color": "blue"}';
It's a little more complicated: we're checking to see if the JSON object in the property column contains the object to the right of the @> operator. Less readable, more performant (see below).
Let's make it even easier to use JSONB when you need to select multiple properties at the same time. This is where the JSONB approach really comes in: we just select properties as additional columns in our result set without the need for joins:
-- JSONB
SELECT name
, properties ->> 'color'
, properties ->> 'country'
FROM entity_jsonb
WHERE id = 120;
With EAV you will need 2 joins for each property you want to query. In my opinion, the queries above show a great simplification in database design. See more examples of how to write JSONB queries, also see
Now it's time to talk about performance.
Performance
To compare performance I used
Refreshing the data showed the following results in terms of time (in ms). Note that the scale is logarithmic:
We see that JSONB is much (> 50000-x) faster than EAV if no indexes are used, for the reason stated above. When we index columns with primary keys, the difference almost disappears, but JSONB is still 1,3 times faster than EAV. Note that the index on the JSONB column has no effect here, as we are not using a property column in the evaluation criteria.
For selecting data based on property value, we get the following results (normal scale):
You can see that JSONB is again faster than EAV without indexes, but when EAV is with indexes it is still faster than JSONB. But then I saw that the timing for the JSONB requests was the same, which led me to the fact that the GIN index was not firing. Apparently, when you use a GIN index on a column with populated properties, it only takes effect when using the include operator @>. I used this in a new test, which had a huge impact on the time: only 0,153ms! This is 15000 times faster than EAV and 25000 times faster than the ->> operator.
I think it was fast enough!
DB table size
Let's compare the table sizes for both approaches. In psql we can show the size of all tables and indexes with the command dti+
For the EAV approach, table sizes are around 3068MB and indexes are up to 3427MB for a total of 6,43GB. The JSONB approach uses 1817 MB for the table and 318 MB for the indexes, which is 2,08 GB. It turns out 3 times less! This fact surprised me a little because we store property names in every JSONB object.
But still, the numbers speak for themselves: in EAV, we store 2 integer foreign keys per attribute value, resulting in 8 bytes of additional data. Also, in EAV, all property values are stored as text, while JSONB will use numeric and boolean values internally where possible, resulting in a smaller footprint.
Results
In general, I think storing entity properties in JSONB format can greatly simplify the design and maintenance of your database. If you are doing a lot of queries, then everything stored in the same table as the entity will actually work more efficiently. And the fact that this simplifies the interaction between data is already a plus, but the resulting database is 3 times smaller in volume.
Also, according to the tests done, we can conclude that the performance loss is very small. In some cases, JSONB is even faster than EAV, which makes it even better. However, this benchmark does not of course cover all aspects (e.g. entities with a very large number of properties, a significant increase in the number of properties of existing data,…), so if you have any suggestions on how to improve them, please feel free to leave in the comments!
Source: habr.com