Features of designing a data model for NoSQL

Introduction

"You need to run as fast as you can just to stay in place, and to get somewhere, you have to run at least twice as fast!"
(c) Alice in Wonderland

Some time ago I was asked to give a lecture to the analysts at our company on designing data models: sitting on the same projects for a long time (sometimes for several years), we tend to lose sight of what is happening around us in the world of IT. Many projects at our company (it just so happened) do not use NoSQL databases, at least for now, so in my lecture I paid special attention to them, using HBase as an example, and aimed the material at people who had never worked with them. In particular, I illustrated some features of data model design with an example I had read a few years earlier in "Introduction to HBase Schema Design" by Amandeep Khurana. While walking through the examples, I compared several solutions to the same problem with each other, in order to convey the main ideas to the audience more clearly.

Recently, with time on my hands (the long May weekend in quarantine mode is especially conducive to this), I wondered how well those theoretical calculations would hold up in practice. That, in fact, is how the idea for this article was born. A developer who has been working with NoSQL for more than a day is unlikely to learn anything new from it (and can therefore skip half the article right away). But for analysts who have not yet worked closely with NoSQL, I believe it will be useful as a basic introduction to the features of designing data models for HBase.

Analyzing the example

In my opinion, before adopting a NoSQL database you should think carefully and weigh the pros and cons: quite often the problem can be solved perfectly well with a traditional relational DBMS, so it is better not to reach for NoSQL without good reason. If you do decide to use a NoSQL database, keep in mind that the design approach here is somewhat different, and some of it may feel unusual to those who have only dealt with relational DBMSs (at least in my observation). In the "relational" world we usually start from a model of the domain, and only then, if necessary, denormalize it. In NoSQL we must consider the expected data-access scenarios from the start and denormalize the data up front. Beyond that there are a number of other differences, which are discussed below.

Consider the following "synthetic" problem, with which we will continue to work:

It is necessary to design a storage structure for the friends lists of users of some abstract social network. To simplify, we will assume that all connections are directed (as on Instagram, not on LinkedIn). The structure must allow you to efficiently:

  • Answer whether user A follows user B (read pattern)
  • Add/remove a link when user A subscribes to/unsubscribes from user B (data-change pattern)

Of course, there are many ways to solve this problem. In an ordinary relational database we would most likely just create a link table (possibly typed if, for example, we need to store the group a "friend" belongs to: family, work, etc.), adding indexes/partitioning to optimize access speed. The final table would probably look something like this:

user_id | friend_id
--------|----------
Vasya   | Petya
Vasya   | Olya

Hereinafter, for clarity and easier reading, I will use names instead of IDs.

In the case of HBase, we know that:

  • efficient lookups that avoid a full table scan are possible only by key
    • for this very reason, writing the SQL queries familiar to many against such databases is a bad idea; technically you can, of course, send a SQL query with joins and other logic to HBase (e.g. via Impala), but how efficient it will be is another matter...

Therefore, we have to use the user ID as the key. And the first thought on "where and how do we store the friends' IDs?" might be to store them in columns. This most obvious and "naive" option looks something like this (let's call it Option 1 (default) for future reference):

row key | columns
--------|----------------------------
Vasya   | 1: Petya, 2: Olya, 3: Dasha
Petya   | 1: Masha, 2: Vasya

Here each row corresponds to one network user. The columns are named 1, 2, ... after the ordinal number of the friend, and the friends' IDs are stored in the column values. It is important to note that each row has a different number of columns: in the table above, one row has three columns (1, 2 and 3) and the other only two (1 and 2). Here we have used two HBase properties that relational databases do not have (a minimal write sketch follows the list):

  • the composition of columns can change dynamically (add a friend -> add a column, remove a friend -> remove a column)
  • different rows can have different columns
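To make the dynamic columns tangible, here is a minimal sketch in Python using happybase (the library used later in the article). The table name 'friends' and the column family 'cf' are my assumptions, not fixed by the article.

```python
# A minimal happybase sketch of writing Option 1 rows.
# Assumptions: an HBase table 'friends' with a column family 'cf'
# already exists, and a Thrift server is listening on localhost.
import happybase

connection = happybase.Connection('localhost')
table = connection.table('friends')

# Each friend occupies a numbered column; note that the two rows end up
# with different column sets - something a relational table cannot do.
table.put(b'Vasya', {b'cf:1': b'Petya', b'cf:2': b'Olya', b'cf:3': b'Dasha'})
table.put(b'Petya', {b'cf:1': b'Masha', b'cf:2': b'Vasya'})
```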

Let's check this structure against the requirements of the task (sketches of these operations follow the list):

  • Reading data: to find out whether Vasya follows Olya, we have to read the entire row with RowKey = "Vasya" and iterate over the column values until we "meet" Olya in them; or iterate over the values of all the columns, not meet Olya, and return False.
  • Changing data, adding a friend: for this we likewise have to read the entire row with RowKey = "Vasya" to count his friends, because the total number of friends determines the number of the column into which the new friend's ID must be written.
  • Changing data, removing a friend:
    • read the entire row with RowKey = "Vasya" and iterate over the columns to find the one holding the friend being removed;
    • then, after deleting the friend, "shift" all the data one column over so that no "gaps" appear in the numbering.
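As a sketch of how the first two O(n) operations could look in application code (hypothetical helper names; the 'cf' column family is an assumption):

```python
def is_friend_v1(table, user: bytes, friend: bytes) -> bool:
    """Option 1 read: fetch the whole row and scan every column value."""
    row = table.row(user)              # reads all columns of the row
    return friend in row.values()      # linear search => O(n)

def add_friend_v1(table, user: bytes, friend: bytes) -> None:
    """Option 1 insert: read the whole row just to find the next column number."""
    row = table.row(user)              # O(n) read, only to count the friends
    table.put(user, {b'cf:%d' % (len(row) + 1): friend})
```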

Let's now estimate, in big-O notation, how efficient these algorithms, which we would have to implement on the side of a "conditional application", will be. Let n denote the size of our hypothetical social network. Then the maximum number of friends a single user can have is (n-1); for our purposes the (-1) can be neglected, since it is immaterial in big-O terms.

  • Reading data: in the worst case we must read the entire row and iterate over all of its columns, so the upper bound is roughly O(n)
  • Changing data, adding a friend: to determine the number of friends we must iterate over all the columns of the row and then insert a new column => O(n)
  • Changing data, removing a friend:
    • As with adding, in the worst case we must walk all the columns => O(n)
    • After removing the column we need to "shift" the others. A naive implementation would take up to (n-1) operations in the worst case. Instead, here and in the practical part below, we apply a different approach that performs a "pseudo-shift" in a fixed number of operations (two, to be precise), i.e. in constant time regardless of n; this constant can be neglected next to O(n). The approach is illustrated below: we simply copy the data from the "last" column into the column we are deleting from, and then delete the last column.
[Figure: the "pseudo-shift" - the value of the last column is copied into the freed slot, then the last column is deleted]
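A sketch of this pseudo-shift, continuing the hypothetical helpers above:

```python
def remove_friend_v1(table, user: bytes, friend: bytes) -> None:
    """Option 1 delete with a 'pseudo-shift': instead of renumbering all
    columns after the gap, copy the last column into the freed slot and
    drop the last column - a constant amount of extra work."""
    row = table.row(user)                                   # O(n) read
    target = next(c for c, v in row.items() if v == friend) # find the friend
    last = b'cf:%d' % len(row)                              # last column name
    if target != last:
        table.put(user, {target: row[last]})                # fill the gap
    table.delete(user, columns=[last])                      # drop the tail
```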

In total, all scenarios give an asymptotic computational complexity of O(n).
You have probably noticed that we almost always have to read the entire row from the database, and in two cases out of three only in order to walk all the columns and count the friends. So, as an attempted optimization, we can add a "count" column that stores the total number of friends of each network user. Then, instead of reading the whole row to count the friends, we read just the single "count" column; the main thing is not to forget to update it when manipulating the data. Thus we get an improved Option 2 (count):

row key | columns
--------|--------------------------------------
Vasya   | 1: Petya, 2: Olya, 3: Dasha, count: 3
Petya   | 1: Masha, 2: Vasya, count: 2

Compared with the first option (a sketch of the improved insertion follows the list):

  • Reading data: nothing has changed for answering "does Vasya follow Olya?" => O(n)
  • Changing data, adding a friend: insertion is now simpler, since we no longer need to read the whole row and iterate over its columns; fetching just the value of the "count" column immediately gives the column number for the new friend. This reduces the computational complexity to O(1)
  • Changing data, removing a friend: when removing a friend we can also use this column to reduce the I/O when "shifting" the data one cell to the left; but the need to iterate over the columns to find the one to delete remains, so => O(n)
  • On the other hand, every data update now has to refresh the "count" column, but this takes constant time, which can be neglected in big-O terms
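A sketch of the improved insertion, assuming the counter is kept in a column I have named 'cf:count':

```python
def add_friend_v2(table, user: bytes, friend: bytes) -> None:
    """Option 2 insert: a single-column read of 'count' replaces the
    full-row read, so the operation becomes O(1)."""
    row = table.row(user, columns=[b'cf:count'])   # one column, not the row
    count = int(row.get(b'cf:count', b'0'))
    table.put(user, {b'cf:%d' % (count + 1): friend,
                     b'cf:count': b'%d' % (count + 1)})  # keep 'count' fresh
```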

Overall, option 2 looks slightly more optimal, but it is "evolution rather than revolution". For a "revolution" we need Option 3 (col).
Let's turn everything upside down: make the column name the friend's user ID! What is stored in the column itself no longer matters to us; let it be the number 1 (though, usefully, one could store there, for example, the group "family/friends/etc."). This approach may surprise an unprepared reader with no prior experience of NoSQL databases, but it is exactly what lets us use the potential of HBase far more effectively in this task:

row key | columns
--------|----------------------------
Vasya   | Petya: 1, Olya: 1, Dasha: 1
Petya   | Masha: 1, Vasya: 1

Here we gain several advantages at once. To see them, let's analyze the new structure and estimate its computational complexity (a happybase sketch of all three operations follows the list):

  • Reading data: to answer whether Vasya follows Olya, it is enough to read the single column "Olya": if it exists, the answer is True; if not, False => O(1)
  • Changing data, adding a friend: just add a new column named with the friend's ID => O(1)
  • Changing data, removing a friend: just delete that column => O(1)
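A minimal sketch of the three O(1) operations for option 3 (hypothetical names; 'cf' is an assumed column family):

```python
def is_friend_v3(table, user: bytes, friend: bytes) -> bool:
    """Option 3 read: ask for exactly one column - present means True."""
    return bool(table.row(user, columns=[b'cf:' + friend]))

def add_friend_v3(table, user: bytes, friend: bytes) -> None:
    """Option 3 insert: no prior read is needed at all."""
    table.put(user, {b'cf:' + friend: b'1'})

def remove_friend_v3(table, user: bytes, friend: bytes) -> None:
    """Option 3 delete: drop the single column named after the friend."""
    table.delete(user, columns=[b'cf:' + friend])
```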

As you can see, a significant advantage of this storage model is that in every scenario we need, we touch only a single column, avoiding reading the entire row from the database, let alone iterating over all of its columns. I could stop there, but...

...we can go a little further down the path of optimizing performance and reducing database I/O. What if we store the complete relationship information directly in the row key itself, i.e. make the key a composite of the form userID.friendID? Then we do not even need to read the row's columns at all (Option 4 (row)):

row key     | columns
------------|---------
Vasya.Petya | Petya: 1
Vasya.Olya  | Olya: 1
Vasya.Dasha | Dasha: 1
Petya.Masha | Masha: 1
Petya.Vasya | Vasya: 1

Obviously, all the data-manipulation scenarios in this structure come out, as in the previous option, at O(1). The difference from option 3 lies solely in the efficiency of the I/O operations against the database.
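Under the same assumptions, option 4 might look like this:

```python
def row_key_v4(user: bytes, friend: bytes) -> bytes:
    """Composite key of the form userID.friendID."""
    return user + b'.' + friend

def is_friend_v4(table, user: bytes, friend: bytes) -> bool:
    """Option 4 read: the mere existence of the row answers the question."""
    return bool(table.row(row_key_v4(user, friend)))

def add_friend_v4(table, user: bytes, friend: bytes) -> None:
    table.put(row_key_v4(user, friend), {b'cf:' + friend: b'1'})

def remove_friend_v4(table, user: bytes, friend: bytes) -> None:
    table.delete(row_key_v4(user, friend))  # drops the whole one-column row
```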

And one final touch. It is easy to see that in option 4 the row key has variable length, which may affect performance (recall that HBase stores data as byte arrays and keeps the rows in a table sorted by key). We also have a delimiter that may need special handling in some scenarios. To eliminate this, we can take hashes of userID and friendID; since both hashes have constant length, they can simply be concatenated without a separator. The data in the table then looks like this (Option 5 (hash)):

row key                                                          | columns
-----------------------------------------------------------------|---------
dc084ef00e94aef49be885f9b01f51c01918fa783851db0dc1f72f83d33a5994 | Petya: 1
dc084ef00e94aef49be885f9b01f51c0f06b7714b5ba522c3cf51328b66fe28a | Olya: 1
dc084ef00e94aef49be885f9b01f51c00d2c2e5d69df6b238754f650d56c896a | Dasha: 1
1918fa783851db0dc1f72f83d33a59949ee3309645bd2c0775899fca14f311e1 | Masha: 1
1918fa783851db0dc1f72f83d33a5994dc084ef00e94aef49be885f9b01f51c0 | Vasya: 1

Obviously, the algorithmic complexity of this structure under our scenarios is the same as in option 4, i.e. O(1); a sketch of building the hashed key follows.
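The article mentions MD5 via hashlib for option 5; the hex encoding of the digests is my assumption, consistent with the 64-character keys shown above:

```python
import hashlib

def row_key_v5(user: bytes, friend: bytes) -> bytes:
    """Option 5: MD5(user) + MD5(friend), concatenated without a delimiter;
    two 32-character hex digests give a constant 64-character key."""
    return (hashlib.md5(user).hexdigest() +
            hashlib.md5(friend).hexdigest()).encode('ascii')

def is_friend_v5(table, user: bytes, friend: bytes) -> bool:
    return bool(table.row(row_key_v5(user, friend)))
```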
Finally, let's collect all our complexity estimates into one table:

                   | Adding a friend | Checking a friend | Removing a friend
-------------------|-----------------|-------------------|------------------
Option 1 (default) | O(n)            | O(n)              | O(n)
Option 2 (count)   | O(1)            | O(n)              | O(n)
Option 3 (col)     | O(1)            | O(1)              | O(1)
Option 4 (row)     | O(1)            | O(1)              | O(1)
Option 5 (hash)    | O(1)            | O(1)              | O(1)

As you can see, options 3-5 look the most preferable and, in theory, perform all the required data-manipulation scenarios in constant time. The problem statement has no explicit requirement to fetch the list of all of a user's friends, but on a real project a good analyst would "foresee" that such a task may come up and leave some room for it. That is why my sympathies are with option 3. Quite possibly, though, on a real project that request would already be handled by other means, so without a complete view of the whole problem it is better not to draw final conclusions.

Experiment preparation

I wanted to test the theoretical reasoning above in practice; that was the goal of the idea born over the long weekend. To do so, we need to measure the speed of our "conditional application" in all the described database-usage scenarios, as well as how that time grows with the size of the social network (n). The target parameter we measure in the experiment is the time the "conditional application" spends on one "business operation". A "business operation" means one of the following:

  • Adding one new friend
  • Checking if user A is a friend of user B
  • Removing one friend

Thus, given the requirements in the original statement, the verification scenario looks like this:

  • Writing data. Randomly generate an initial network of size n. To get closer to the "real world", the number of friends per user is also a random variable. Measure the time our "conditional application" takes to write all the generated data to HBase, then divide it by the total number of friends added to obtain the average time per "business operation".
  • Reading data. For each user, build a list of "subjects" for which we must answer whether the user follows them. The list length is roughly the number of the user's friends; for half of the checked subjects the answer should be "Yes" and for the other half "No". The checks are ordered so that "Yes" and "No" answers alternate (i.e. in every second case we will have to walk all the columns of the row under options 1 and 2). The total checking time is then divided by the number of subjects checked to get the average time per check.
  • Deleting data. Remove all of a user's friends, in random order (i.e. we "shuffle" the original list used when writing the data). The total time is then divided by the number of friends removed to get the average time per deletion.

The scenarios must be run for each of the 5 data-model options and for several sizes of the social network, to see how the time changes as it grows. Within a single n, the network connections and the lists of users to check must, of course, be identical for all 5 options. A minimal timing sketch is shown below.
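The article does not show its measurement harness, so this sketch uses hypothetical names; it only illustrates the averaging described above:

```python
import time

def avg_time_per_op(op, tasks) -> float:
    """Run one 'business operation' over a list of argument tuples and
    return the average wall-clock time per operation, in seconds."""
    start = time.perf_counter()
    for args in tasks:
        op(*args)
    return (time.perf_counter() - start) / len(tasks)

# e.g. avg_time_per_op(lambda u, f: add_friend_v3(table, u, f), pairs)
```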
For better understanding, here is an example of the generated data for n = 5. The "generator" I wrote outputs three dictionaries of IDs:

  • the first is for insertion
  • the second is for checking
  • the third is for deletion

{0: [1], 1: [4, 5, 3, 2, 1], 2: [1, 2], 3: [2, 4, 1, 5, 3], 4: [2, 1]} # 15 friends in total

{0: [1, 10800], 1: [5, 10800, 2, 10801, 4, 10802], 2: [1, 10800], 3: [3, 10800, 1, 10801, 5, 10802], 4: [2, 10800]} # 18 subjects to check in total

{0: [1], 1: [1, 3, 2, 5, 4], 2: [1, 2], 3: [4, 1, 2, 3, 5], 4: [1, 2]} # 15 friends in total

As you can see, all IDs greater than 10 in the checking dictionary are precisely the ones that are guaranteed to yield a False answer. Insertion, checking and removal of "friends" are performed exactly in the order given in the dictionaries. A sketch of what such a generator might look like is below.
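The article does not include the generator's code; this is one possible shape matching the output above (names and the 10800 base for "guaranteed False" IDs are inferred from the example):

```python
import random

def generate_network(n: int, false_base: int = 10800):
    """Produce the three dictionaries described above: friends to insert,
    subjects to check (real friends alternating with non-existent IDs),
    and friends to delete in shuffled order."""
    to_insert, to_check, to_delete = {}, {}, {}
    for user in range(n):
        others = [u for u in range(n) if u != user]
        friends = random.sample(others, random.randint(1, len(others)))
        to_insert[user] = friends
        checks = []
        # check roughly half of the friends, alternating 'Yes' with a
        # non-existent ID that is guaranteed to answer 'No'
        for i, friend in enumerate(friends[:(len(friends) + 1) // 2]):
            checks += [friend, false_base + i]
        to_check[user] = checks
        to_delete[user] = random.sample(friends, len(friends))  # shuffled
    return to_insert, to_check, to_delete
```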

The experiment was run on a Windows 10 laptop, with HBase in one Docker container and Python with Jupyter Notebook in another. Docker was allocated 2 CPU cores and 2 GB of RAM. All the logic, both the emulation of the "conditional application" and the "harness" for generating test data and measuring time, was written in Python. The happybase library was used to work with HBase, and hashlib to compute the hashes (MD5) for option 5.

Given the computing power of this particular laptop, the runs were chosen experimentally for n = 10, 30, ..., 170, where the total running time of the full testing cycle (all scenarios, for all options, for all n) remained more or less reasonable and fit within one tea break (about 15 minutes on average).

A caveat is needed here: in this experiment we are not primarily after absolute performance figures; even a relative comparison of two options may not be entirely fair. What interests us is the character of the change in time as a function of n, since with the "test bench" configuration described above it is very difficult to obtain timings "cleansed" of random and other factors (and no such goal was set).

The result of the experiment

The first test measures how the time to write the friends list changes. The result is in the chart below.
[Chart: average time to add one friend vs. network size n]
Options 3-5, as expected, show almost constant "business operation" time that does not depend on the growth of the network, and an indistinguishable difference in performance between them.
Option 2 also shows constant time, but slightly worse: almost exactly 2 times slower than options 3-5. And this is encouraging, because it agrees with theory: in this option the number of I/O operations to/from HBase is exactly 2 times greater. It can serve as indirect evidence that our test bench, in principle, offers decent accuracy.
Option 1, as expected, turns out to be the slowest and demonstrates linear growth of the time to add one friend with the size of the network.
Now let's look at the results of the second test.
[Chart: average time to check one subscription vs. network size n]
Options 3-5 again behave as expected: constant time, independent of the size of the network. Options 1 and 2 demonstrate linear growth of time with network size and similar performance, with option 2 coming out a little slower, apparently because of the need to read and process the extra "count" column, which becomes more noticeable as n grows. I will nevertheless refrain from conclusions here, since the accuracy of this comparison is relatively low, and the ratio (which of options 1 and 2 is faster) changed from run to run (while the character of the dependence and the "neck-and-neck" race persisted).

And the last chart shows the results of the deletion test.

[Chart: average time to remove one friend vs. network size n]

No surprises here either. Options 3-5 perform deletion in constant time.
Interestingly, options 4 and 5, unlike in the previous scenarios, show noticeably worse performance than option 3. Apparently, deleting a row is a more expensive operation than deleting a column, which is logical overall.

Options 1 and 2, as expected, show linear growth of time, with option 2 consistently slower than option 1 because of the extra I/O needed to "maintain" the count column.

General conclusions of the experiment:

  • Options 3-5 are the most efficient, as they exploit HBase's strengths; their performance relative to each other differs by a constant and does not depend on the size of the network.
  • No difference between options 4 and 5 was recorded. But that does not mean option 5 should not be used: quite possibly the experiment scenario used, given the performance characteristics of the test bench, was simply unable to reveal it.
  • The character of the growth of the time needed to perform "business operations" on the data has, on the whole, confirmed the earlier theoretical estimates for all options.

Finale

The rough experiments carried out here should not be taken as absolute truth. Many factors were left unaccounted for and distorted the results (the fluctuations are especially visible in the charts at small network sizes): the speed of Thrift, which happybase uses; the volume and implementation of the logic I wrote in Python (I do not claim the code was optimal or used the capabilities of every component effectively); possibly HBase caching peculiarities; background Windows 10 activity on my laptop; and so on. On the whole, though, we can say that all the theoretical estimates demonstrated their validity experimentally; or, at the very least, this head-on attempt failed to refute them.

In conclusion, a recommendation for everyone just starting to design data models in HBase: set aside previous experience with relational databases and remember the "commandments":

  • When designing, start from the task and the data-manipulation patterns, not from the domain model
  • Efficient access (without a full table scan) is by key only
  • Denormalize
  • Different rows can contain different columns
  • The composition of columns is dynamic

Source: habr.com
