Replacing EAV with JSONB in ​​PostgreSQL

TL; DR: JSONB can greatly simplify database schema development without sacrificing query performance.

Introduction

Here is a classic example of probably one of the oldest use cases in the world of relational databases (database): we have an entity, and we need to save certain properties (attributes) of this entity. But not all instances may have the same set of properties, moreover, in the future, the possible addition of more properties.

The easiest way to solve this problem is to create a column in the database table for each property value, and just fill in the ones that are needed for a particular entity instance. Great! Problem solved...until your table has millions of records and you need to add a new record.

Consider the EAV pattern (Entity-Attribute-Value) is quite common. One table contains entities (records), another table contains property names (attributes), and a third table associates entities with their attributes and contains the value of these attributes for the current entity. This gives you the ability to have different sets of properties for different objects, as well as add properties on the fly without changing the database structure.

However, I wouldn't be writing this post if there weren't some flaws in the EVA approach. So, for example, to get one or more entities that have 1 attribute each, 2 joins (joins) are required in the query: the first is a join with an attribute table, the second is a join with a value table. If an entity has 2 attributes, then 4 joins are needed! Also, all attributes are usually stored as strings, which results in type casting for both the result and the WHERE clause. If you write a lot of queries, then this is quite wasteful in terms of resource usage.

Despite these obvious shortcomings, EAV has long been used to solve these kinds of problems. These were inevitable drawbacks, and there was simply no better alternative.
But then a new “technology” appeared in PostgreSQL…

Starting with PostgreSQL 9.4, the JSONB data type was added to store binary JSON data. While storing JSON in this format typically takes a bit more space and time than plain text JSON, it is much faster to perform operations on it. JSONB also supports indexing, which makes querying them even faster.

The JSONB data type allows us to replace the cumbersome EAV pattern by adding just one JSONB column to our entity table, which greatly simplifies database design. But many argue that this should be accompanied by a decrease in performance ... That's why I came up with this article.

Setting up a test database

For this comparison, I created the database on a fresh install of PostgreSQL 9.5 on an $80 build. Digital ocean Ubuntu 14.04. After setting some options in postgresql.conf I ran this script with psql. The following tables have been created to present the data in EAV format:

CREATE TABLE entity ( 
  id           SERIAL PRIMARY KEY, 
  name         TEXT, 
  description  TEXT
);
CREATE TABLE entity_attribute (
  id          SERIAL PRIMARY KEY, 
  name        TEXT
);
CREATE TABLE entity_attribute_value (
  id                  SERIAL PRIMARY KEY, 
  entity_id           INT    REFERENCES entity(id), 
  entity_attribute_id INT    REFERENCES entity_attribute(id), 
  value               TEXT
);

Below is a table where the same data will be stored, but with attributes in a JSONB type column − properties.

CREATE TABLE entity_jsonb (
  id          SERIAL PRIMARY KEY, 
  name        TEXT, 
  description TEXT,
  properties  JSONB
);

Looks much easier, doesn't it? Then it was added to entity tables (entity & entity_jsonb) 10 million records, and accordingly, were filled with the same table data where the EAV pattern is used and the JSONB column approach - entity_jsonb.properties. Thus, we received several different data types among the entire set of properties. Sample data:

{
  id:          1
  name:        "Entity1"
  description: "Test entity no. 1"
  properties:  {
    color:        "red"
    lenght:       120
    width:        3.1882420
    hassomething: true
    country:      "Belgium"
  } 
}

So, now we have the same data for the two options. Let's start comparing implementations at work!

Design Simplification

As mentioned earlier, the database design has been greatly simplified: one table, by using a JSONB column for properties, instead of using three tables for EAV. But how is this reflected in requests? Updating a single entity property looks like this:

-- EAV
UPDATE entity_attribute_value 
SET value = 'blue' 
WHERE entity_attribute_id = 1 
  AND entity_id = 120;

-- JSONB
UPDATE entity_jsonb 
SET properties = jsonb_set(properties, '{"color"}', '"blue"') 
WHERE id = 120;

As you can see, the last query does not look simpler. To update the value of a property in a JSONB object, we must use the function jsonb_set(), and should pass our new value as a JSONB object. However, we don't need to know any identifier beforehand. Looking at the EAV example, we need to know both entity_id and entity_attribute_id in order to update. If you want to update a property in a JSONB column based on the name of an object, it's all done in one simple line.

Now let's select the entity we just updated based on its new color:

-- EAV
SELECT e.name 
FROM entity e 
  INNER JOIN entity_attribute_value eav ON e.id = eav.entity_id
  INNER JOIN entity_attribute ea ON eav.entity_attribute_id = ea.id
WHERE ea.name = 'color' AND eav.value = 'blue';

-- JSONB
SELECT name 
FROM entity_jsonb 
WHERE properties ->> 'color' = 'blue';

I think we can agree that the second one is shorter (no join!), and therefore more readable. JSONB wins here! We use the JSON ->> operator to get the color as a text value from a JSONB object. There is also a second way to achieve the same result in a JSONB model using the @> operator:

-- JSONB 
SELECT name 
FROM entity_jsonb 
WHERE properties @> '{"color": "blue"}';

It's a little more complicated: we're checking to see if the JSON object in the property column contains the object to the right of the @> operator. Less readable, more performant (see below).

Let's make it even easier to use JSONB when you need to select multiple properties at the same time. This is where the JSONB approach really comes in: we just select properties as additional columns in our result set without the need for joins:

-- JSONB 
SELECT name
  , properties ->> 'color'
  , properties ->> 'country'
FROM entity_jsonb 
WHERE id = 120;

With EAV you will need 2 joins for each property you want to query. In my opinion, the queries above show a great simplification in database design. See more examples of how to write JSONB queries, also see This post.
Now it's time to talk about performance.

Performance

To compare performance I used EXPLAIN ANALYZE in requests, to calculate the execution time. Each query was run at least three times because the query planner takes longer the first time. First I ran queries without any indexes. Obviously, this served as an advantage to JSONB, since the joins needed for EAV could not use indexes (foreign key fields were not indexed). After that I created an index on the 2 foreign key columns of the EAV value table and also the index GIN for a JSONB column.

Refreshing the data showed the following results in terms of time (in ms). Note that the scale is logarithmic:

Replacing EAV with JSONB in ​​PostgreSQL

We see that JSONB is much (> 50000-x) faster than EAV if no indexes are used, for the reason stated above. When we index columns with primary keys, the difference almost disappears, but JSONB is still 1,3 times faster than EAV. Note that the index on the JSONB column has no effect here, as we are not using a property column in the evaluation criteria.

For selecting data based on property value, we get the following results (normal scale):

Replacing EAV with JSONB in ​​PostgreSQL

You can see that JSONB is again faster than EAV without indexes, but when EAV is with indexes it is still faster than JSONB. But then I saw that the timing for the JSONB requests was the same, which led me to the fact that the GIN index was not firing. Apparently, when you use a GIN index on a column with populated properties, it only takes effect when using the include operator @>. I used this in a new test, which had a huge impact on the time: only 0,153ms! This is 15000 times faster than EAV and 25000 times faster than the ->> operator.

I think it was fast enough!

DB table size

Let's compare the table sizes for both approaches. In psql we can show the size of all tables and indexes with the command dti+

Replacing EAV with JSONB in ​​PostgreSQL

For the EAV approach, table sizes are around 3068MB and indexes are up to 3427MB for a total of 6,43GB. The JSONB approach uses 1817 MB for the table and 318 MB for the indexes, which is 2,08 GB. It turns out 3 times less! This fact surprised me a little because we store property names in every JSONB object.

But still, the numbers speak for themselves: in EAV, we store 2 integer foreign keys per attribute value, resulting in 8 bytes of additional data. Also, in EAV, all property values ​​are stored as text, while JSONB will use numeric and boolean values ​​internally where possible, resulting in a smaller footprint.

Results

In general, I think storing entity properties in JSONB format can greatly simplify the design and maintenance of your database. If you are doing a lot of queries, then everything stored in the same table as the entity will actually work more efficiently. And the fact that this simplifies the interaction between data is already a plus, but the resulting database is 3 times smaller in volume.

Also, according to the tests done, we can conclude that the performance loss is very small. In some cases, JSONB is even faster than EAV, which makes it even better. However, this benchmark does not of course cover all aspects (e.g. entities with a very large number of properties, a significant increase in the number of properties of existing data,…), so if you have any suggestions on how to improve them, please feel free to leave in the comments!

Source: habr.com

Add a comment