Redundancy codes: in simple words about how to store data reliably and cheaply

Redundancy codes: in simple words about how to store data reliably and cheaply

This is what redundancy looks like

Redundancy codes* are widely used in computer systems to increase the reliability of data storage. Yandex uses them in a lot of projects. For example, using redundancy codes instead of replicating in our internal object storage saves millions without compromising reliability. But despite its wide distribution, a clear description of how redundancy codes work is very rare. Those who want to understand are faced with something like the following (from Wikipedia):

Redundancy codes: in simple words about how to store data reliably and cheaply

My name is Vadim, in Yandex I am developing an internal MDS object storage. In this article, I will describe in simple terms the theoretical foundations of redundancy codes (Reed-Solomon and LRC codes). I'll tell you how it works, without complex mathematics and rare terms. At the end, I will give examples of using redundancy codes in Yandex.

A number of mathematical details I will not go into in detail, but I will give links for those who want to dive deeper. I also note that some mathematical definitions may not be strict, since the article is not intended for mathematicians, but for engineers who want to understand the essence of the issue.

* In English literature, redundancy codes are often referred to as erasure codes.

1. The essence of redundancy codes

The essence of all redundancy codes is extremely simple: store (or transfer) data so that it does not disappear when errors occur (disk failures, data transfer errors, etc.).

In most* redundancy codes, the data is divided into n data blocks, for which m blocks of redundancy codes are counted, for a total of n + m blocks. The redundancy codes are constructed in such a way that it is possible to recover n data blocks using only a fraction of n + m blocks. Further, we will consider only block redundancy codes, that is, those in which data is divided into blocks.

Redundancy codes: in simple words about how to store data reliably and cheaply

To restore all n blocks of data, you need to have at least n of n + m blocks, since you cannot get n blocks with only n-1 blocks (in this case, you would have to take 1 block “out of thin air”). Are n arbitrary blocks out of n + m blocks enough to recover all the data? It depends on the type of redundancy codes, for example, Reed-Solomon codes allow you to recover all data using arbitrary n blocks, and LRC redundancy codes do not always.

Data Storage

In data storage systems, as a rule, each of the data blocks and blocks of redundancy codes is written to a separate disk. Then, if an arbitrary disk fails, the original data can still be restored and read. Data can be recovered even if several disks fail at the same time.

Data transfer

Redundancy codes can be used to securely transfer data over an unreliable network. The transmitted data is divided into blocks, redundancy codes are counted for them. Both data blocks and blocks of redundancy codes are transmitted over the network. If errors occur in arbitrary blocks (up to a certain number of blocks), the data can still be transmitted error-free over the network. Reed-Solomon codes, for example, are used for data transmission over optical communication lines and in satellite communications.

* There are also redundancy codes in which data is not divided into blocks, such as Hamming codes and CRC codes, which are widely used for data transmission in Ethernet networks. These are codes for error-correcting coding, they are designed to detect errors, not to correct them (Hamming code also allows you to partially correct errors).

2. Reed-Solomon codes

Reed-Solomon codes are one of the most widely used redundancy codes, invented as early as the 1960s and first widely used in the 1980s for serial production of CDs.

There are two key questions for understanding Reed-Solomon codes: 1) how to create blocks of redundancy codes; 2) how to recover data using blocks of redundancy codes. Let's find answers to them.
For simplicity, we will further assume that n=6 and m=4. Other schemes are considered by analogy.

How to create blocks of redundancy codes

Each block of redundancy codes is considered independently of the others. All n data blocks are used to count each block. In the diagram below, X1-X6 are data blocks, P1-P4 are redundancy code blocks.

Redundancy codes: in simple words about how to store data reliably and cheaply

All data blocks must be the same size, zero bits can be used for alignment. The redundancy code blocks received will have the same size as the data blocks. All data blocks are divided into words (for example, 16 bits each). Let's say we split data blocks into k words. Then all blocks of redundancy codes will also be divided into k words.

Redundancy codes: in simple words about how to store data reliably and cheaply

To count the i-th word of each redundancy block, the i-th words of all data blocks will be used. They will be calculated according to the following formula:

Redundancy codes: in simple words about how to store data reliably and cheaply

Here the x values ​​are data block words, p are redundancy code block words, all alpha, beta, gamma and delta are specially selected numbers, the same for all i. It must be said right away that all these values ​​​​are not ordinary numbers, but elements of the Galois field, the operations +, -, *, / are not operations familiar to all of us, but special operations introduced on the elements of the Galois field.

Why Galois fields are needed

Redundancy codes: in simple words about how to store data reliably and cheaply

It would seem that everything is simple: we break the data into blocks, blocks - into words, using the words of data blocks we count the words of blocks of redundancy codes - we get blocks of redundancy codes. In general, this is how it works, but the devil is in the details:

  1. As mentioned above, the word size is fixed, in our example 16 bits. The formulas above for Reed-Solomon codes are such that when using ordinary integers, the result of calculating p may not be representable using a word of legal size.
  2. When recovering data, the formulas above will be treated as a system of equations that must be solved in order to recover the data. In the process of solving, it may become necessary to divide integers by each other, resulting in a real number that cannot be accurately represented in computer memory.

These problems prevent the use of integers for Reed-Solomon codes. The solution to the problem is original, it can be described as follows: let's come up with special numbers that can be represented using words of the desired length (for example, 16 bits), and the result of performing all operations on which (addition, subtraction, multiplication, division) will also be presented in computer memory using words of the desired length.

Such "special" numbers have long been studied by mathematics, they are called fields. A field is a set of elements with the operations of addition, subtraction, multiplication, and division defined for them.

Galois fields* are fields for which there is a unique result of each operation (+, -, *, /) for any two elements of the field. Galois fields can be constructed for numbers that are powers of 2: 2, 4, 8, 16, etc. (actually a power of any prime p, but in practice we are only interested in powers of 2). For example, for 16-bit words, this is a field containing 65 elements, for each pair of which you can find the result of any operation (+, -, *, /). The values ​​x, p, alpha, beta, gamma, delta from the equations above will be considered elements of the Galois field for calculations.

Thus, we have a system of equations that can be used to construct blocks of redundancy codes by writing an appropriate computer program. Using the same system of equations, you can perform data recovery.

* This is not a strict definition, but rather a description.

How to recover data

Restoration is needed when some of the blocks out of n + m blocks are missing. It can be both data blocks and blocks of redundancy codes. The absence of data blocks and/or blocks of redundancy codes would mean that the corresponding variables x and/or p are unknown in the equations above.

The equations for Reed-Solomon codes can be viewed as a system of equations in which all alpha, beta, gamma, delta values ​​are constants, all x and p corresponding to the available blocks are known variables, and the remaining x and p are unknown.

For example, let data blocks 1, 2, 3 and redundancy code block 2 are not available, then for the i-th group of words there will be the following system of equations (unknowns are marked in red):

Redundancy codes: in simple words about how to store data reliably and cheaply

We have a system of 4 equations with 4 unknowns, so we can solve it and recover the data!

From this system of equations, a number of conclusions follow about data recovery for Reed-Solomon codes (n data blocks, m blocks of redundancy codes):

  • Data can be recovered if any m blocks or less are lost. If m + 1 or more blocks are lost, the data cannot be restored: it is impossible to solve a system of m equations with m + 1 unknowns.
  • To restore even one block of data, you need to use any n of the remaining blocks, and you can use any of the redundancy codes.

What else do you need to know

In the description above, I bypass a number of important issues that require a deeper dive into mathematics. In particular, I do not say anything about the following:

  • The system of equations for Reed-Solomon codes must have a (unique) solution for any combination of unknowns (no more than m unknowns). Based on this requirement, alpha, beta, gamma and delta values ​​are selected.
  • The system of equations must be able to automatically build (depending on which blocks are not available) and solve.
  • It is necessary to build a Galois field: for a given word length, be able to find the result of any operation (+, -, *, /) for any two elements.

At the end of the article there are links to the literature on these important issues.

Choice of n and m

How to choose n and m in practice? In practice, in data storage systems, redundancy codes are used to save space, so m is always chosen less than n. Their specific values ​​depend on a number of factors, including:

  • Reliability of data storage. The larger m, the more disk failures you can survive, that is, the higher the reliability.
  • Storage redundancy. The higher the m/n ratio, the higher the storage redundancy will be, and the more expensive the system will cost.
  • Request processing time. The larger the sum of n + m, the longer will be the response time for queries. Since to read data (during recovery) you need to read n blocks stored on n different disks, the read time will be determined by the slowest disk.

In addition, storing data in several DCs imposes additional restrictions on the choice of n and m: when 1 DC is disabled, the data must still be available for reading. For example, when storing data in 3 DCs, the following condition must be met: m >= n/2, otherwise, a situation is possible when the data is not available for reading when 1 DC is turned off.

3. LRC - Local Reconstruction Codes

To recover data using Reed-Solomon codes, one has to use n arbitrary data blocks. This is a very significant disadvantage for distributed storage systems, because in order to restore data on one broken disk, you will have to read data from most of the others, creating a large additional load on disks and the network.

The most common errors are the inaccessibility of one block of data due to a failure or overload of one disk. Is it possible to somehow reduce the overhead for data recovery in this (most common) case? It turns out you can: there are LRC redundancy codes specifically for this.

LRC (Local Reconstruction Codes) are redundancy codes invented by Microsoft for use in Windows Azure Storage. The idea of ​​LRC is as simple as possible: divide all data blocks into two (or more) groups and count part of redundancy code blocks for each group separately. Then part of the blocks of redundancy codes will be counted using all data blocks (in LRC they are called global redundancy codes), and part - using one of the two groups of data blocks (they are called local redundancy codes).

LRC is denoted by three numbers: nrl, where n is the number of data blocks, r is the number of global redundancy code blocks, l is the number of local redundancy code blocks. To read data when one data block is unavailable, you need to read only n / l blocks - this is l times less than in Reed-Solomon codes.

For example, consider the LRC 6-2-2 scheme. X1-X6 - 6 data blocks, P1, P2 - 2 global redundancy blocks, P3, P4 - 2 local redundant blocks.

Redundancy codes: in simple words about how to store data reliably and cheaply

Redundancy code blocks P1, P2 are counted using all data blocks. Redundancy code block P3 with data blocks X1-X3, redundancy code block P4 with data blocks X4-X6.

The rest is done in LRC by analogy with Reed-Solomon codes. The equations for counting the words of redundancy code blocks are as follows:

Redundancy codes: in simple words about how to store data reliably and cheaply

To select the numbers alpha, beta, gamma, delta, you need to fulfill a number of conditions that guarantee the possibility of data recovery (that is, solving a system of equations). You can read more about them in article.
Also, in practice, the XOR operation is used to calculate the local redundancy codes P3, P4.

A number of conclusions follow from the system of equations for LRC:

  • To restore any 1 data block, it is enough to read n/l blocks (n/2 in our example).
  • If r + l blocks are unavailable, and all blocks are in the same group, then the data cannot be restored. This is easy to explain with an example. Let blocks X1–X3 and P3 be unavailable: these are r + l blocks from one group, 4 in our case. Then we have a system of 3 equations with 4 unknowns that cannot be solved.
  • In all other cases of unavailability of r + l blocks (when at least one block is available from each group), the data in the LRC can be restored.

Thus, LRC outperforms Reed-Solomon codes in data recovery after single errors. In Reed-Solomon codes, to restore even one data block, you need to use n blocks, and in LRC, to restore one data block, it is enough to use n/l blocks (n/2 in our example). On the other hand, LRC loses to Reed-Solomon codes in terms of the maximum number of allowable errors. In the examples above, Reed-Solomon codes can recover data for any 4 errors, and for LRC there are 2 combinations of 4 errors when data cannot be recovered.

More importantly, it depends on the specific situation, but often the overhead savings that LRC gives outweigh the slightly less storage reliability.

4. Other redundancy codes

In addition to Reed-Solomon and LRC codes, there are many other redundancy codes. Different redundancy codes use different mathematics. Here are some other redundancy codes:

  • Redundancy code using the XOR operator. The XOR operation is performed on n data blocks, resulting in 1 block of redundancy codes, that is, an n+1 scheme (n data blocks, 1 redundancy code). Used in RAID 5, where blocks of data and redundancy codes are cyclically written to all disks in the array.
  • An even-odd algorithm based on the XOR operation. Allows you to build 2 blocks of redundancy codes, that is, the scheme n + 2.
  • STAR algorithm based on the XOR operation. Allows you to build 3 blocks of redundancy codes, that is, the scheme n + 3.
  • Pyramide codes are another redundancy codes from Microsoft.

5. Use in Yandex

A number of Yandex infrastructure projects use redundancy codes for reliable data storage. Here are some examples:

  • Internal object storage MDS, which I wrote about at the beginning of the article.
  • YT — Yandex MapReduce system.
  • YDB (Yandex DataBase) is a newSQL distributed database.

MDS uses LRC redundancy codes, 8-2-2 scheme. Data with redundancy codes is written to 12 different disks in different servers in 3 different DCs: 4 servers in each DC. Read more about this in article.

YT uses both Reed-Solomon codes (Scheme 6-3), which were implemented first, and LRC redundancy codes (Scheme 12-2-2), with LRC being the preferred storage method.

YDB uses redundancy codes based on even-odd (Figure 4-2). About redundancy codes in YDB already told on Highload.

The use of different schemes of redundancy codes is due to different requirements for systems. For example, in MDS, data stored using LRC is placed in 3 DCs at once. It is important for us that the data remains available for reading when 1 of any DC fails, so the blocks must be distributed over the DC so that if any DC is unavailable, the number of inaccessible blocks is not more than allowed. In the 8-2-2 scheme, 4 blocks can be placed in each DC, then when any DC is turned off, 4 blocks will be unavailable, and the data can be read. Whatever scheme we choose when placing in 3 DCs, in any case it should be (r + l) / n >= 0,5, that is, the storage redundancy will be at least 50%.

In YT, the situation is different: each YT cluster is entirely located in 1 DC (different clusters in different DCs), so there is no such restriction. The 12-2-2 scheme gives a redundancy of 33%, that is, it is cheaper to store data, while they can also survive up to 4 simultaneous disk outages, like the scheme in MDS.

There are many more features of the use of redundancy codes in data storage and processing systems: the nuances of data recovery, the impact of recovery on query execution time, features of data recording, etc. I am going to talk separately about these and other features of the use of redundancy codes in practice, if the topic is will be interesting.

6. References

  1. A series of articles about Reed-Solomon codes and Galois fields: https://habr.com/ru/company/yadro/blog/336286/
    https://habr.com/ru/company/yadro/blog/341506/
    They take a deeper look at mathematics in an accessible language.
  2. Article from Microsoft about LRC: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/LRC12-cheng20webpage.pdf
    Section 2 briefly explains the theory, then discusses the experience of applying LRC in practice.
  3. even-odd scheme: https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F12/handouts/papers/p245-blaum.pdf
  4. STAR scheme: https://www.usenix.org/legacy/event/fast05/tech/full_papers/huang/huang.pdf
  5. pyramid codes: https://www.microsoft.com/en-us/research/publication/pyramid-codes-flexible-schemes-to-trade-space-for-access-efficiency-in-reliable-data-storage-systems/
  6. Redundancy codes in MDS: https://habr.com/ru/company/yandex/blog/311806
  7. Redundancy codes in YT: https://habr.com/ru/company/yandex/blog/311104/
  8. Redundancy codes in YDB: https://www.youtube.com/watch?v=dCpfGJ35kK8

Source: habr.com

Add a comment