My implementation of a ring buffer in NOR flash

Prehistory

We have vending machines of our own design. Inside there is a Raspberry Pi and some wiring on a separate board. A coin acceptor, a bill acceptor and a bank terminal are connected to it... Everything is controlled by a program we wrote ourselves. The entire history of operation is written to a journal on a flash card (MicroSD), which is then transmitted over the Internet (via a USB modem) to a server, where it is added to a database. Sales information is loaded into 1C; there is also a simple web interface for monitoring, and so on.

In other words, the journal is vital: for accounting (revenue, sales, etc.) and for monitoring (all kinds of failures and other force majeure); it is, one might say, all the information we have about a given machine.

Problem

Flash cards turn out to be very unreliable devices. They fail with surprising regularity. This leads both to machine downtime and, if for some reason the log could not be transferred online, to data loss.

This is not our first experience with flash drives: before this there was another project with more than a hundred devices, where the journal was stored on USB flash drives, and there were reliability problems there too - at times the number of failures per month ran into the tens. We tried different flash drives, including branded ones with SLC memory, and while some models are more reliable than others, replacing flash drives did not radically solve the problem.

Attention! Long read! If you are not interested in the "why" and only care about the "how", you can skip straight to the end of the article.

Solution

The first thing that comes to mind is to abandon the MicroSD and use, say, an SSD, booting from it. Theoretically possible, probably, but relatively expensive and not that reliable (a USB-SATA adapter is added, and failure statistics for budget SSDs are not encouraging either).

USB HDD does not look like a particularly attractive solution either.

Therefore, we settled on this option: keep booting from MicroSD, but use the cards in read-only mode, and store the work log (and other information unique to a particular unit - serial number, sensor calibrations, etc.) somewhere else.

The topic of a read-only FS for the Raspberry has already been studied inside and out; I will not dwell on the implementation details in this article (but if there is interest, maybe I will write a mini-article on the topic). The only point I want to note: both from personal experience and from reviews of those who have already done it, there is a gain in reliability. Yes, it is impossible to get rid of breakdowns completely, but significantly reducing their frequency is quite possible. Plus, the cards become interchangeable, which greatly simplifies replacement for service personnel.

Hardware

There were no particular doubts about the choice of memory type - NOR Flash.
Arguments:

  • simple connection (most often the SPI bus, with which we already have experience, so no hardware problems are expected);
  • ridiculous price;
  • standard command protocol (the implementation is already in the Linux kernel; if you wish, you can take a third-party one, which also exist, or even write your own, since everything is simple);
  • reliability and resource:
    from a typical datasheet: data retention of 20 years, 100,000 erase cycles per block;
    from third-party sources: extremely low BER, with no need for error correction codes postulated (some papers do consider ECC for NOR, but they usually mean MLC NOR; that happens too).

Let's estimate the requirements for volume and resource.

It is desirable that the data be guaranteed to survive for several days. This is needed so that the sales history is not lost in case of any communication problems. We will aim at 5 days; within this period (even including weekends and holidays) the problem can be resolved.

We currently log about 100Kb per day (3-4 thousand entries), but this figure is gradually growing: the level of detail is increasing and new events are being added. Plus, there are occasional bursts (some sensor starts spamming false positives, for example). We will calculate for 10 thousand records of 100 bytes each - one megabyte per day.

In total, about 5MB of clean (and highly compressible) data comes out. Add to that (a rough guess) 1MB of service data.

That is, we need an 8MB chip if we do not use compression, or a 4MB one if we do. Quite realistic numbers for this type of memory.

As for the resource: if we plan for the entire memory to be rewritten no more than once every 5 days, then over 10 years of service we get fewer than a thousand rewrite cycles.
Let me remind you that the manufacturer promises a hundred thousand.
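
For transparency, the same back-of-the-envelope estimate as a tiny sketch (the figures are the assumptions from the paragraphs above, not measurements):

/* cc estimate.c */
#include <stdio.h>

int main(void) {
    /* assumptions from the text above */
    const int records_per_day = 10000;   /* with headroom over the current 3-4 thousand */
    const int bytes_per_record = 100;
    const int days_to_keep = 5;
    const int service_mb = 1;            /* rough guess for service data */
    const int years = 10;
    const int rated_erase_cycles = 100000;

    double mb_per_day = records_per_day * (double)bytes_per_record / (1024 * 1024);
    double data_mb = mb_per_day * days_to_keep + service_mb;
    /* the whole memory is rewritten at most once per 'days_to_keep' days */
    double cycles = years * 365.0 / days_to_keep;

    printf("uncompressed data to keep: ~%.1f MB\n", data_mb);          /* ~5.8 MB */
    printf("erase cycles over %d years: ~%.0f of %d rated\n",
           years, cycles, rated_erase_cycles);                         /* ~730    */
    return 0;
}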

A little about NOR vs NAND

Today, of course, NAND memory is much more popular, but I would not use it for this project: NAND, unlike NOR, necessarily requires error correction codes, a bad block table, etc., and NAND chips usually have many more pins.

The disadvantages of NOR are:

  • small volume (and, accordingly, high price per megabyte);
  • low transfer speed (largely because a serial interface is used, usually SPI or I2C);
  • slow erase (depending on the size of the block, it takes from fractions of a second to several seconds).

It seems to be nothing critical for us, so let's continue.

If you are interested in the details: the at25df321a chip was chosen (though this is not essential - there are plenty of analogs on the market that are pin- and command-compatible; even if we want to install a chip from another manufacturer and/or of a different size, everything will work without code changes).

I use the driver built into the Linux kernel, on Raspberry, thanks to device tree overlay support, everything is very simple - you need to put the compiled overlay in /boot/overlays and slightly modify /boot/config.txt.

dts file example

Honestly, I am not sure it is written without errors, but it works.

/*
 * Device tree overlay for at25 at spi0.1
 */

/dts-v1/;
/plugin/;

/ {
    compatible = "brcm,bcm2835", "brcm,bcm2836", "brcm,bcm2708", "brcm,bcm2709"; 

    /* disable spi-dev for spi0.1 */
    fragment@0 {
        target = <&spi0>;
        __overlay__ {
            status = "okay";
            spidev@1{
                status = "disabled";
            };
        };
    };

    /* the spi config of the at25 */
    fragment@1 {
        target = <&spi0>;
        __overlay__ {
            #address-cells = <1>;
            #size-cells = <0>;
            flash: m25p80@1 {
                    compatible = "atmel,at25df321a";
                    reg = <1>;
                    spi-max-frequency = <50000000>;

                    /* default to false:
                    m25p,fast-read ;
                    */
            };
        };
    };

    __overrides__ {
        spimaxfrequency = <&flash>,"spi-max-frequency:0";
        fastread = <&flash>,"m25p,fast-read?";
    };
};

And another line in config.txt

dtoverlay=at25:spimaxfrequency=50000000

I will omit the description of how the chip is connected to the Raspberry Pi. On the one hand, I am not an expert in electronics; on the other hand, everything here is trivial even for me: the chip has only 8 pins, of which we need ground, power and SPI (CS, SI, SO, SCK); the levels are the same as on the Raspberry Pi, so no additional wiring is required - just connect the indicated 6 pins.

Formulation of the problem

As usual, the problem statement goes through several iterations, and it seems to me it is time for the next one. So let's stop, put together what has already been written, and clarify the details left in the shadows.

So, we have decided that the log will be stored in SPI NOR Flash.

What is NOR Flash for those who don't know

This is a non-volatile memory with which you can do three operations:

  1. Reading:
    The most common reading: we pass the address and read as many bytes as we need;
  2. Record:
    A write to NOR flash looks like an ordinary write, but it has one peculiarity: you can only change 1 to 0, not vice versa. For example, if a memory cell contained 0x55, then after writing 0x0f to it, it will hold 0x05 (see the table just below);
  3. Erase:
    Of course, we also need the reverse operation - changing 0 to 1 - and that is exactly what the erase operation is for. Unlike the first two, it operates not on bytes but on blocks (the minimum erase block in the selected chip is 4Kb). Erase destroys the entire block, and it is the only way to change 0 to 1. Therefore, when working with flash memory, data structures often have to be aligned to the erase block boundary.
    A write to NOR Flash:

    Binary data
    It was:       01010101
    Recorded:     00001111
    It became:    00000101
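
To make the 1 → 0 rule concrete: a write to NOR flash behaves like a bitwise AND of the old cell contents with the new data, and only an erase brings bits back to 1. A toy model of this behavior (not a driver, just the arithmetic from the table above):

#include <assert.h>
#include <stdint.h>

/* Model of a NOR cell: a write can only clear bits (1 -> 0). */
static uint8_t nor_write(uint8_t cell, uint8_t data) {
    return cell & data;            /* 0x55 & 0x0f == 0x05, as in the table above */
}

/* Erase is the only way to get bits back to 1 (done per block, not per byte). */
static uint8_t nor_erase(void) {
    return 0xff;
}

int main(void) {
    uint8_t cell = nor_erase();            /* 1111 1111 */
    cell = nor_write(cell, 0x55);          /* 0101 0101 */
    cell = nor_write(cell, 0x0f);          /* 0000 0101 - the 1s of 0x0f cannot be restored by writing */
    assert(cell == 0x05);
    return 0;
}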

The log itself is a sequence of variable length entries. A typical record length is about 30 bytes (although records of several kilobytes in length sometimes occur). In this case, we are working with them just as a set of bytes, but, if you are interested, CBOR is used inside the records.

In addition to the log, we need to store some "configuration" information, both updated and not: a certain device ID, sensor calibrations, the "device is temporarily disabled" flag, etc.
This information is a set of key-value records, also stored in CBOR. We do not have much of this information (a few kilobytes at most), it is updated infrequently.
In what follows, we will call it context.

If you remember how this article began, it is very important to ensure the reliability of data storage and, if possible, uninterrupted operation even in the event of hardware failures / data corruption.

What sources of problems can be considered?

  • Power loss during write/erase operations. This is from the category of "there is no defense against a crowbar".
    Information from a discussion on StackExchange: when power is lost during flash operations, both erase (setting bits to 1) and write (setting bits to 0) lead to undefined behavior: data may be written, partially written (say, we transferred 10 bytes / 80 bits, but only 45 bits managed to get written), and some bits may end up in an "intermediate" state (a read can return either 0 or 1);
  • Errors of the flash-memory itself.
    BER, although very low, cannot be equal to zero;
  • Bus errors
    The data transmitted over SPI is not protected in any way; both single-bit errors and synchronization errors - loss or insertion of bits, which leads to massive data corruption - may well occur;
  • Other errors/glitches
    Code bugs, Raspberry glitches, alien interference…

I have formulated the requirements that, in my opinion, must be met to ensure reliability:

  • records must go to flash memory immediately; delayed writes are not considered;
  • if an error occurs, it must be detected and handled as early as possible;
  • the system must, if possible, recover from errors.
    (a real-life example of "how it should not be", which I think everyone has encountered: after an emergency reboot the file system is corrupted and the operating system does not boot)

Ideas, approaches, reflections

When I started thinking about this problem, a lot of ideas were running through my head, for example:

  • use data compression;
  • use tricky data structures, for example, store record headers separately from the records themselves, so that if an error occurs in one record the rest can still be read;
  • use bit fields to track the completion of a write when the power is cut;
  • store checksums for anything and everything;
  • use some kind of error-correcting coding.

Some of these ideas were used, some were abandoned. Let's go in order.

Data Compression

The events we record in the journal are quite uniform and repetitive ("inserted a 5-ruble coin", "pressed the change button", ...). Therefore, compression should be quite effective.

The compression overhead is insignificant (we have a fairly powerful processor - even the first Pi had one core at 700MHz, and current models have several cores at over a gigahertz), the transfer speed to the storage is low (a few megabytes per second), and the records are small. In general, if compression affects performance at all, it is only positively (absolutely uncritical, just stating the fact). Plus, we are not on real embedded hardware but on regular Linux, so the implementation should not take much effort (just link a library and use a few functions from it).

A piece of the log was taken from a working device (1.7Mb, 70 thousand entries) and first checked for compressibility using gzip, lz4, lzop, bzip2, xz, zstd available on the computer.

  • gzip, xz, zstd showed close results (40Kb).
    I was surprised that the fashionable xz performed here only at the level of gzip and zstd;
  • lzip with default settings gave a slightly worse result;
  • lz4 and lzop showed not very good results (150Kb);
  • bzip2 showed surprisingly good results (18Kb).

So, the data compresses very well.
So (unless we find fatal flaws) there will be compression! Simply because more data will fit on the same chip.

Let's think about the disadvantages.

The first problem: we have already agreed that each record must go to flash immediately. Typically, an archiver collects data from the input stream until it decides it is time to write something out. We need to get a compressed block of data immediately and store it in non-volatile memory.

I see three ways:

  1. Compress each record using dictionary compression instead of the algorithms discussed above.
    It works well, but I don't like it. To ensure a more or less decent level of compression, the dictionary must be "tailored" to specific data; any change will cause the compression level to drop catastrophically. Yes, the problem can be solved by creating a new version of the dictionary, but that is a headache: we would have to store all versions of the dictionary and indicate in each record which version it was compressed with...
  2. Compress each record with "classic" algorithms, but independently of the others.
    The compression algorithms under consideration are not designed to work with records of this size (tens of bytes); the compression ratio will obviously be less than 1 (that is, the data would grow instead of shrinking);
  3. Do a FLUSH after each record.
    Many compression libraries support FLUSH. It is a command (or a parameter of the compression procedure) upon which the archiver produces a compressed stream from which all the uncompressed data received so far can be restored - an analogue of sync in file systems or commit in SQL.
    Importantly, subsequent compression operations can use the accumulated dictionary, so the compression ratio does not suffer as much as in the previous option.

I think it is obvious that I chose the third option, let's dwell on it in more detail.
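
To make the idea concrete, here is a minimal sketch of the third option using zlib's C API; the record texts and buffer sizes are made up, and error handling is trimmed to the essentials:

/* cc flush_demo.c -lz */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Compress one log record and flush it so that the part of the stream already
   written out is self-contained (this is what Z_SYNC_FLUSH gives us).
   Returns the number of compressed bytes placed into 'out', or -1 on error. */
static int compress_record(z_stream *strm, const unsigned char *rec, size_t len,
                           unsigned char *out, size_t out_size) {
    strm->next_in   = (Bytef *)rec;
    strm->avail_in  = (uInt)len;
    strm->next_out  = out;
    strm->avail_out = (uInt)out_size;

    /* one call is enough as long as 'out' is large enough for this record */
    if (deflate(strm, Z_SYNC_FLUSH) != Z_OK || strm->avail_in != 0)
        return -1;

    return (int)(out_size - strm->avail_out);
}

int main(void) {
    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    if (deflateInit(&strm, Z_BEST_COMPRESSION) != Z_OK)
        return 1;

    const char *records[] = { "coin accepted: 5 RUB", "change button pressed",
                              "coin accepted: 5 RUB" };
    unsigned char out[1024];

    for (int i = 0; i < 3; i++) {
        int n = compress_record(&strm, (const unsigned char *)records[i],
                                strlen(records[i]), out, sizeof(out));
        if (n < 0) { deflateEnd(&strm); return 1; }
        /* each flushed block ends with 00 00 ff ff - those bytes need not be stored */
        printf("record %d -> %d compressed bytes\n", i, n);
        /* here the n bytes (minus the known tail) would be appended to flash */
    }

    deflateEnd(&strm);
    return 0;
}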

I found a great article about FLUSH in zlib.

Based on that article, I made a quick-and-dirty test: I took 70 thousand log entries from a real device and, with a page size of 60Kb (we will return to page size later), got:

                                       Volume, Kb
    Initial data                       1692
    Compression gzip -9 (no FLUSH)     40
    zlib with Z_PARTIAL_FLUSH          352
    zlib with Z_SYNC_FLUSH             604

At first glance, the price of FLUSH looks prohibitively high, but in fact the choice is poor: either do not compress at all, or compress (very effectively) with FLUSH. Do not forget that we have 70 thousand records, so the overhead introduced by Z_PARTIAL_FLUSH is only 4-5 bytes per record. And the compression ratio came out to almost 5:1, which is more than an excellent result.

It may seem surprising, but in fact Z_SYNC_FLUSH is a more efficient way to do FLUSH

When Z_SYNC_FLUSH is used, the last 4 bytes of each record are always 0x00, 0x00, 0xff, 0xff. And since we know them, we do not have to store them, which brings the total size down to 324Kb.

The article I linked to has an explanation:

A new type 0 block with empty contents is appended.

A type 0 block with empty contents consists of:

  • the three-bit block header;
  • 0 to 7 bits equal to zero, to achieve byte alignment;
  • the four-byte sequence 00 00 FF FF.

As you can see, in the last block these 4 bytes are preceded by 3 to 10 zero bits. In practice, however, there turn out to be at least 10 zero bits.

It turns out that such short data blocks are usually (always?) encoded as a type 1 (fixed) block, which necessarily ends with 7 zero bits, giving 10-17 guaranteed zero bits in total (and the rest will be zero with a probability of about 50%).

So, on the test data, in 100% of cases there is at least one zero byte before 0x00, 0x00, 0xff, 0xff, and in more than a third of cases there are two zero bytes (perhaps because I use binary CBOR; with text JSON, type 2 - dynamic - blocks would be more common, and accordingly there would be blocks without extra zero bytes before 0x00, 0x00, 0xff, 0xff).

In total, on the available test data, you can fit in less than 250Kb of compressed data.

We could save a little more by juggling bits: right now we ignore the presence of a few zero bits at the end of each block, and a few bits at the beginning of the block do not change either...
But then I made a firm decision to stop, otherwise at this pace I would end up writing my own archiver.

In total, my test data came out to 3-4 bytes per record, and the compression ratio turned out to be better than 6:1. Honestly, I did not expect such a result; in my opinion, anything better than 2:1 already justifies the use of compression.

Everything is fine, but zlib (deflate) is still a venerable and somewhat old-fashioned compression algorithm. The mere fact that only the last 32Kb of the uncompressed data stream is used as the dictionary looks strange today (that is, if some data block is very similar to something that appeared in the input stream 40Kb earlier, it will be compressed from scratch rather than referencing the earlier occurrence). In fashionable modern archivers, the dictionary size is often measured in megabytes rather than kilobytes.

So we continue our mini-research of archivers.

bzip2 was tested next (remember, without FLUSH it showed a fantastic compression ratio of almost 100:1). Alas, it performed very poorly with FLUSH: the compressed data turned out larger than the uncompressed data.

My assumptions about the reasons for the failure

libbz2 offers only one flush option, which appears to clear the dictionary (an analogue of Z_FULL_FLUSH in zlib); there is no point in talking about any effective compression after that.

And the last one to try was zstd. Depending on the parameters, it compresses either at the level of gzip, but much faster, or better than gzip.

Alas, with FLUSH it also proved to be “not very good”: the size of the compressed data came out to be about 700Kb.

I asked a question on the project's GitHub page and was told to expect up to 10 bytes of service data per block of compressed data, which is close to the results obtained; there is no catching up with deflate.

At this point I decided to stop experimenting with archivers (let me remind you that xz, lzip, lzo and lz4 did not impress at the no-FLUSH testing stage, and I did not consider more exotic compression algorithms).

We return to the problems of archiving.

The second problem (second in order, not in importance, as they say) is that compressed data is a single stream that constantly refers back to earlier sections. Thus, if a section of compressed data is damaged, we lose not only the uncompressed data block associated with it, but also everything that follows.

There are two approaches to this problem:

  1. Prevent the problem from occurring: add redundancy to the compressed data, which will allow errors to be detected and corrected; we will talk about this later;
  2. Minimize the consequences if a problem does occur.
    We said earlier that it is possible to compress each data block independently, and then the problem disappears by itself (corruption of one block loses the data of that block only). However, that is the extreme case in which compression becomes inefficient. The opposite extreme: use all 4MB of our chip as a single archive, which gives excellent compression but disastrous consequences in case of data corruption.
    Yes, a compromise is needed in terms of reliability. But we should remember that we are designing a storage format for non-volatile memory with an extremely low BER and a declared data retention period of 20 years.

In the course of the experiments, I found that a more or less noticeable loss of compression level begins with compressed data blocks smaller than 10Kb.
It was mentioned earlier that the memory used is organized in pages; I see no reason not to use the correspondence "one page - one block of compressed data".

That is, the minimum reasonable page size is 16Kb (with a margin for service information). However, such a small page size imposes significant restrictions on the maximum record size.

Although I don't expect records of more than a few kilobytes in compressed form yet, I decided to use 32Kb pages (128 pages per chip in total).

Summary:

  • We store data compressed using zlib (deflate);
  • For each entry, set Z_SYNC_FLUSH;
  • Trim the trailing bytes of each compressed record (such as 0x00, 0x00, 0xff, 0xff); in the header we indicate how many bytes were cut off;
  • Data is stored in 32Kb pages; inside a page there is a single stream of compressed data; on each page, compression starts anew.

And, before finishing with compression, I would like to draw your attention to the fact that we get only a few bytes of compressed data per record, so it is extremely important not to inflate the service information, every byte counts here.

Storing Data Headers

Since we have records of variable length, we need to somehow determine the placement / boundaries of the records.

I know three approaches:

  1. All records are stored in a continuous stream, first comes the record header containing the length, and then the record itself.
    In this variant, both headers and data can be of variable length.
    In fact, we get a singly linked list, which is used all over the place;
  2. Headers and entries themselves are stored in separate streams.
    By using headers of constant length, we ensure that corruption of one header does not affect the others.
    A similar approach is used, for example, in many file systems;
  3. Records are stored in a continuous stream; the record boundary is determined by a marker (a character or character sequence that is forbidden inside data blocks). If the marker occurs inside a record, we replace it with some other sequence (i.e., we escape it).
    A similar approach is used, for example, in the PPP protocol.

I will illustrate.

Option 1:
(illustration: a chain of "header - record" pairs)
Everything is very simple here: knowing the length of a record, we can calculate the address of the next header. So we move through the headers until we hit an area filled with 0xff (free space) or the end of the page.

Option 2:
(illustration: constant-size headers at the start of the page, variable-length data at the end)
Due to the variable length of the records, we cannot predict in advance how many records (and therefore headers) we will need per page. We could put the headers and the data themselves on different pages, but I prefer a different approach: we place both on the same page, with the headers (of constant size) growing from the beginning of the page and the data (of variable length) growing from the end. As soon as they "meet" (there is not enough free space for a new record), the page is considered full.

Option 3:
(illustration: records separated by marker bytes)
There is no need to store the length or any other information about the location of the data in the header; markers indicating the record boundaries are enough. However, the data has to be processed on write and read.
As the marker I would use 0xff (which the page is filled with after erase), so a free area will definitely not be treated as data.

Comparison table:

                                  Option 1    Option 2    Option 3
    Error tolerance               -           +           +
    Compactness                   +           -           +
    Complexity of implementation  *           **          **

Option 1 has a fatal flaw: if one of the headers is damaged, the entire subsequent chain is destroyed. The other options allow part of the data to be recovered even after massive damage.
But here it is worth recalling that we decided to store the data in compressed form, so we lose all the data on the page after a "broken" record anyway; even though there is a minus in the table, we do not take it into account.

Compactness:

  • in the first option, we only need to store the length in the header; if we use variable-length integers, in most cases one byte is enough;
  • in the second option, we need to store the start address and the length; the header must be of constant size, which I estimate at 4 bytes per record (two bytes for the offset and two for the length);
  • in the third option, a single marker character is enough to mark the start of a record, plus the record itself grows by 1-2% due to escaping. Overall, roughly on par with the first option.

Initially, I considered the second option as the main one (and even wrote an implementation). I abandoned it only when I finally decided to use compression.

Perhaps someday I will still use a similar option. For example, if I have to deal with the storage of data for a ship cruising between Earth and Mars - completely different requirements for reliability, cosmic radiation, ...

As for the third option: I gave it two stars for implementation complexity simply because I do not like messing around with escaping and lengths that change in the process. Biased, perhaps, but I am the one writing the code - why force myself to do something I dislike.

Summary: we choose the storage option in the form of chains "header with length - data of variable length" because of the efficiency and ease of implementation.

Using bit fields to control the success of write operations

I don't remember where I got the idea, but it looks something like this:
For each entry, we allocate several bits for storing flags.
As we said earlier, after an erase, all bits are filled with 1, and we can change 1 to 0, but not vice versa. So for "flag not set" we use 1, for "flag is set" - 0.

Here's what putting a variable-length record in flash might look like:

  1. Set the flag “length recording started”;
  2. We write down the length;
  3. Set the flag “data recording has begun”;
  4. We write data;
  5. Set the flag "recording ended".

In addition, we will have a flag “an error has occurred”, a total of 4 bit flags.

In this case, we have two stable states "1111" - the recording did not start and "1000" - the recording was successful; in case of an unexpected interruption of the recording process, we will receive intermediate states, which we can then detect and process.

The approach is interesting, but it only protects against a sudden power outage and similar failures, which, of course, is important, but this is far from the only (and not even the main) reason for possible failures.

Summary: moving on in search of a good solution.

Checksums

Checksums also make it possible to make sure (with sufficient probability) that we are reading exactly what should have been written. And, unlike the bit fields discussed above, they always work.

If we consider the list of potential problem sources we discussed above, a checksum can detect an error regardless of its origin (except, perhaps, for malicious aliens - they can forge the checksum too).

So if our goal is to verify that the data is intact, checksums are a great idea.

The choice of the checksum algorithm raised no questions: CRC. On the one hand, its mathematical properties make it possible to catch certain types of errors with 100% certainty; on the other hand, on random data this algorithm usually shows a collision probability not much higher than the theoretical limit of about 2^-n for an n-bit checksum. It is not the fastest algorithm, nor always the one with the fewest collisions, but it has a very important quality: in the tests I came across, there were no patterns on which it would clearly fail. Stability is the main quality here.

An example of a volumetric study: Part 1, Part 2 (links to narod.ru, sorry).

However, the task of choosing a checksum is not finished: CRC is a whole family of checksums. You need to decide on the length and then choose a polynomial.

The choice of checksum length is not as simple a matter as it seems at first glance.

I will illustrate.
Suppose the error probability in each byte is 10^-3 and the checksum is ideal; let's calculate the average number of errors per million records:

    Data, bytes    Checksum, bytes    Undetected errors    False error detections    Total false positives
    1              0                  1000                 0                         1000
    1              1                  4                    999                       1003
    1              2                  ≈0                   1997                      1997
    1              4                  ≈0                   3990                      3990
    10             0                  9955                 0                         9955
    10             1                  39                   990                       1029
    10             2                  ≈0                   1979                      1979
    10             4                  ≈0                   3954                      3954
    1000           0                  632305               0                         632305
    1000           1                  2470                 368                       2838
    1000           2                  10                   735                       745
    1000           4                  ≈0                   1469                      1469
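
If you want to check where these numbers come from, the model is simple: an undetected error is a corrupted record whose ideal k-byte checksum matches by chance (probability 2^-8k), and a false positive is an intact record whose checksum bytes were corrupted. A sketch that reproduces the table under the assumed p = 10^-3:

/* cc crc_table.c -lm */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double p = 1e-3;            /* assumed error probability per byte */
    const double records = 1e6;
    const int data_bytes[] = { 1, 10, 1000 };
    const int crc_bytes[]  = { 0, 1, 2, 4 };

    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 4; j++) {
            int n = data_bytes[i], k = crc_bytes[j];
            double data_bad = 1.0 - pow(1.0 - p, n);    /* record got corrupted   */
            double crc_bad  = 1.0 - pow(1.0 - p, k);    /* checksum got corrupted */
            /* corrupted record that an ideal k-byte checksum fails to notice */
            double undetected = records * data_bad * (k ? pow(2.0, -8.0 * k) : 1.0);
            /* intact record flagged as bad only because its checksum was corrupted */
            double false_pos  = records * (1.0 - data_bad) * crc_bad;
            printf("%5d data, %d crc: undetected %8.0f, false %8.0f, total %8.0f\n",
                   n, k, undetected, false_pos, undetected + false_pos);
        }
    }
    return 0;
}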

It would seem that everything is simple - depending on the length of the protected data, choose the length of the checksum with a minimum of false positives - and it's done.

However, short checksums have a problem: although they are good at detecting single bit errors, completely random data can match them with fairly high probability. There was already an article on Habr describing this problem in real life.

Therefore, to make random checksum matching virtually impossible, checksums of 32 bits or more should be used. (for lengths greater than 64 bits, cryptographic hash functions are usually used).

Despite the fact that I wrote earlier that we need to save space by all means, we will still use a 32-bit checksum (16 bits are too few - the collision probability is more than 0.01%, and 24 bits is, as they say, neither here nor there).

An objection may arise here: did we save every byte when choosing compression only to give away 4 bytes at once now? Wouldn't it have been better not to compress and not add a checksum? Of course not: forgoing compression would not mean that we no longer need an integrity check.

When choosing the polynomial, we will not reinvent the wheel and will take the now-popular CRC-32C.
This code detects 6-bit errors on packets up to 22 bytes (perhaps the most common case for us), 4-bit errors on packets up to 655 bytes (also a common case for us), and 2-bit or any odd number of bit errors on packets of any reasonable length.

If anyone is interested in the details

Wikipedia article about CRC.

CRC-32C polynomial options on the website of Koopman - perhaps the main CRC specialist on the planet.

His article also presents another interesting code, which provides slightly better parameters for the packet lengths relevant to us, but I did not consider the difference significant, and I did not feel competent enough to choose a custom code over the standard, well-researched one.

Also, since our data is compressed, the question arises: to calculate the checksum of compressed or uncompressed data?

Arguments for calculating the checksum of uncompressed data:

  • we ultimately need to verify the integrity of the stored data, so we check exactly that (this also covers possible errors in the compression/decompression implementation, damage caused by faulty memory, etc.);
  • the deflate algorithm in zlib has a fairly mature implementation and should not crash on "crooked" input data; moreover, it is often able to detect errors in the input stream on its own, reducing the overall probability of an undetected error (I did a test inverting a single bit in a short record - zlib detected the error in about a third of cases).

Arguments against calculating the checksum of uncompressed data:

  • CRC is "tailored" precisely to the few-bit errors typical of flash memory (a bit error in a compressed stream can cause a massive change in the output stream on which, purely theoretically, we could "catch" a collision);
  • I do not really like the idea of feeding potentially broken data to the decompressor - who knows how it will react.

In this project, I decided to depart from the generally accepted practice of storing the checksum of the uncompressed data.

Summary: we use CRC-32C, we calculate the checksum from the data in the form in which they are written to flash (after compression).

Redundancy

The use of redundant coding does not, of course, make it possible to exclude data loss, however, it can significantly (often by many orders of magnitude) reduce the probability of irrecoverable data loss.

We can use different kinds of redundancy to correct errors.
Hamming codes can correct single-bit errors, Reed-Solomon codes correct symbol errors, and multiple copies of the data together with checksums, or encodings like RAID-6, can help recover data even after massive damage.
Initially, I was set on wide use of error-correcting coding, but then I realized that we first need a clear idea of which errors we want to protect against, and only then choose the coding.

We said earlier that bugs need to be caught as quickly as possible. At what points can we encounter errors?

  1. Incomplete write (for some reason the power went out during the write, the Raspberry froze, ...)
    Alas, with such an error all we can do is ignore the invalid records and consider the data lost;
  2. Write errors (for some reason, what ended up in the flash memory differs from what was sent)
    We can detect such errors immediately if we do a verification read right after writing;
  3. Data corruption in memory during storage;
  4. Read errors
    To correct these, it is enough to repeat the read several times if the checksum does not match.

That is, only errors of the third type (spontaneous data corruption during storage) cannot be corrected without error-correcting coding. It seems that such errors are still extremely unlikely.

Summary: it was decided to forgo redundant coding for now; if operation shows this decision to be wrong, we will revisit the issue (with accumulated failure statistics that will allow choosing the optimal type of coding).

Other

Of course, the format of the article does not allow me to justify every bit of the format (and frankly, I am already exhausted), so I will briefly go over a few points not touched on earlier.

  • It was decided to make all pages "equal".
    That is, there are no special pages with metadata, separate streams, etc.; instead, a single stream rewrites all pages in turn.
    This ensures even page wear, leaves no single point of failure, and I simply like it this way;
  • Be sure to provide format versioning.
    A format without a version number in its header is evil!
    It is enough to add a Magic Number (signature) field to the page header, indicating the version of the format used (I doubt there will be even a dozen of them in practice);
  • Use a variable-length header for records (of which there are many), trying to keep it to 1 byte in most cases;
  • To encode the header length and the length of the trimmed part of the compressed record, use variable-length binary codes.

An online Huffman code generator helped a lot: in just a few minutes I managed to pick the necessary variable-length codes.

Description of data storage format

Byte order

Fields larger than one byte are stored in big-endian format (network byte order), that is, 0x1234 is written as 0x12, 0x34.

Pagination

All flash memory is divided into pages of equal size.

The default page size is 32Kb, but not more than 1/4 of the total size of the memory chip (for a 4Mb chip, 128 pages are obtained).

Each page stores data independently of the others (that is, data from one page does not refer to data from another page).

All pages are numbered in natural order (in ascending order of addresses), starting from number 0 (zero page starts from address 0, the first - from 32Kb, the second - from 64Kb, etc.)

The memory chip is used as a cyclic buffer (ring buffer), that is, first the recording goes to page number 0, then to page number 1, ..., when we fill the last page, a new cycle begins and the recording continues from page zero.

Inside the page

(illustration: page layout)
At the beginning of the page, a 4-byte page header is stored, then the header checksum (CRC-32C), then records are stored in the format “header, data, checksum”.

The page header (dirty green in the diagram) consists of:

  • a two-byte Magic Number field (which also serves as a marker of the format version);
    for the current version of the format it is computed as 0xed00 ⊕ page number;
  • a two-byte "page version" counter (the number of the memory overwrite cycle). A sketch of checking such a header is given just below.
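
The checks could look roughly like this; the big-endian layout follows the "Byte order" section above, while the exact CRC-32C convention (init 0xffffffff, final inversion) and field order are my assumptions rather than a specification:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected, polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xffffffff;                 /* conventional init, assumed here */
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82f63b78u & (uint32_t)-(int)(crc & 1));
    }
    return ~crc;
}

static uint16_t be16(const uint8_t *p) { return (uint16_t)(p[0] << 8 | p[1]); }
static uint32_t be32(const uint8_t *p) {
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 | (uint32_t)p[2] << 8 | p[3];
}

/* 'raw' is the first 8 bytes of a page: magic, version, CRC-32C of the header. */
static bool page_header_ok(const uint8_t raw[8], unsigned page_no,
                           uint16_t *version_out) {
    if (be16(raw) != (0xed00u ^ page_no))      /* magic also encodes the format version */
        return false;
    if (be32(raw + 4) != crc32c(raw, 4))       /* checksum of the 4 header bytes */
        return false;
    *version_out = be16(raw + 2);              /* memory overwrite cycle counter */
    return true;
}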

Entries on the page are stored in a compressed form (the deflate algorithm is used). All entries on one page are compressed in one stream (a common dictionary is used), compression starts anew on each new page. That is, to decompress any record, all previous records from this page (and only from this page) are required.

Each record is compressed with the Z_SYNC_FLUSH flag, so the compressed stream ends with the 4 bytes 0x00, 0x00, 0xff, 0xff, possibly preceded by one or two more zero bytes.
We discard this sequence (4, 5 or 6 bytes long) when writing to flash memory.
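
On the read side, the discarded tail simply has to be appended back before the data is handed to the decompressor. A sketch of the idea (zlib again; the z_stream must persist across the records of a page, just like on the compression side, and tail_len is the 4, 5 or 6 bytes encoded by the record header described below):

#include <string.h>
#include <zlib.h>

/* Decompress one record that was stored without its Z_SYNC_FLUSH tail.
   The caller has done inflateInit() and keeps the stream across the page.
   Returns the number of decompressed bytes, or -1 on error. */
static int decompress_record(z_stream *strm,
                             const unsigned char *stored, size_t stored_len,
                             size_t tail_len,
                             unsigned char *out, size_t out_size) {
    static const unsigned char tail[6] = { 0x00, 0x00, 0x00, 0x00, 0xff, 0xff };
    unsigned char buf[4096];

    if (tail_len < 4 || tail_len > 6 || stored_len + tail_len > sizeof(buf))
        return -1;
    memcpy(buf, stored, stored_len);
    /* re-append the zero bytes and the 00 00 ff ff marker that were not stored */
    memcpy(buf + stored_len, tail + (6 - tail_len), tail_len);

    strm->next_in   = buf;                       /* fully consumed within this call */
    strm->avail_in  = (uInt)(stored_len + tail_len);
    strm->next_out  = out;
    strm->avail_out = (uInt)out_size;

    if (inflate(strm, Z_SYNC_FLUSH) != Z_OK || strm->avail_in != 0)
        return -1;
    return (int)(out_size - strm->avail_out);
}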

The record header is 1, 2 or 3 bytes containing:

  • one bit (T) indicating the type of entry: 0 - context, 1 - log;
  • a variable-length field (S) of 1 to 7 bits, specifying the length of the header and the length of the "tail" that must be appended to the record for decompression;
  • record length (L).

S value table:

    S          Header length, bytes    Discarded on write, bytes
    0          1                       5 (00 00 00 ff ff)
    10         1                       6 (00 00 00 00 ff ff)
    110        2                       4 (00 00 ff ff)
    1110       2                       5 (00 00 00 ff ff)
    11110      2                       6 (00 00 00 00 ff ff)
    1111100    3                       4 (00 00 ff ff)
    1111101    3                       5 (00 00 00 ff ff)
    1111110    3                       6 (00 00 00 00 ff ff)

I tried to illustrate, I do not know how clearly it turned out:
(illustration: record header layout)
Yellow is the T field, white is the S field, green is L (the length of the compressed data in bytes), blue is the compressed data, red is the final bytes of the compressed data that are not written to flash memory.

Thus, we can write the record headers of the most common length (up to 63 + 5 bytes in compressed form) in one byte.

After each entry, a CRC-32C checksum is stored, in which the inverted value of the previous checksum is used as the initial value (init).

CRC has the "continuation" property: the following formula holds (give or take bit inversions along the way): CRC(A||B) equals CRC(B) computed with init = CRC(A).
That is, in effect, we compute the CRC of all previous bytes of headers and data on this page.
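
A sketch of what such chaining can look like in code; the concrete init/inversion conventions here are my assumptions, the point is only that passing the previous CRC as the starting value yields the CRC of the concatenated data:

#include <stddef.h>
#include <stdint.h>

/* CRC-32C with an explicit initial value, so checksums can be chained. */
static uint32_t crc32c_chained(uint32_t init, const uint8_t *buf, size_t len) {
    uint32_t crc = ~init;                      /* undo the final inversion of the previous CRC */
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82f63b78u & (uint32_t)-(int)(crc & 1));
    }
    return ~crc;
}

/* Usage under this convention:                                                 */
/*   uint32_t crc = crc32c_chained(0, rec1, rec1_len);  // standard CRC-32C(rec1) */
/*   crc = crc32c_chained(crc, rec2, rec2_len);         // CRC-32C(rec1 || rec2)  */
/* so the value stored after each record covers the whole page prefix so far.  */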

Immediately following the checksum is the header of the next entry.

The header is constructed in such a way that its first byte is always different from 0x00 and 0xff (if instead of the first byte of the header we meet 0xff, then this is an unused area; 0x00 signals an error).

Exemplary Algorithms

Reading from flash memory

Any read is accompanied by a checksum check.
If the checksum does not match, the read is repeated several times in the hope of reading the correct data.

(this makes sense - Linux does not cache reads from NOR Flash; I checked)

Write to flash memory

We write the data.
We read it back.

If the read data does not match what was written, we fill the area with zeros and signal an error.
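
Through the Linux MTD character device exposed by the flash driver (e.g. /dev/mtd0), the "write, read back, compare" step looks roughly like this (a sketch: device name, buffer sizes and error handling are simplified; in the real algorithm the failed area would also be zero-filled, as described above):

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write 'len' bytes at 'offset' and verify them by reading back.
   Returns 0 on success, -1 on I/O error, -2 if the verification failed. */
static int flash_write_verify(int fd, off_t offset, const uint8_t *data, size_t len) {
    uint8_t check[256];

    if (len > sizeof(check))
        return -1;                                   /* keep the sketch simple */
    if (pwrite(fd, data, len, offset) != (ssize_t)len)
        return -1;
    if (pread(fd, check, len, offset) != (ssize_t)len)
        return -1;
    if (memcmp(data, check, len) != 0)
        return -2;       /* caller zero-fills the area, signals an error, starts a new page */
    return 0;
}

/* int fd = open("/dev/mtd0", O_RDWR);   -- the MTD device created for the SPI NOR chip */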

Preparing a new chip for work

For initialization, a header with version 1 is written to the first (more precisely, zero) page.
After that, the initial context is written to this page (contains the UUID of the machine and default settings).

That's it - the flash memory is ready for use.

Loading the machine

At boot, the first 8 bytes of each page (header + CRC) are read; pages with an unknown Magic Number or an incorrect CRC are ignored.
Among the "correct" pages, those with the maximum version are selected, and the page with the highest number is taken from them.
The first record is read, and its CRC and the presence of the "context" flag are checked. If everything is fine, this page is considered current. If not, we roll back to the previous one until we find a "live" page,
and on the page found we read all the records, applying those with the "context" flag.
We save the zlib dictionary (it will be needed for appending to this page).

That's it: the boot is complete, the context is restored, and we can work.

Adding a log entry

We compress the record with the correct dictionary, specifying Z_SYNC_FLUSH, and check whether the compressed record fits on the current page.
If it does not fit (or there were CRC errors on the page), we start a new page (see below).
We write the record and its CRC. If an error occurs, we start a new page.

New page

We select the free page with the lowest number (a page is considered free if it has an incorrect checksum in its header or a version lower than the current one). If there are no such pages, we select the page with the lowest number among those whose version equals the current one.
We erase the selected page and check that its contents are all 0xff. If something is wrong, we take the next free page, and so on.
On the erased page we write the header; the first record is the current state of the context, followed by the unwritten log record (if there is one).
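
For reference, erasing a page and checking it through the MTD character device might look like this (a sketch: the /dev/mtdX naming and the 32Kb page constant follow this article's assumptions; MEMERASE requires the range to be aligned to the chip's erase block, which a 32Kb page built from 4Kb blocks satisfies):

#include <fcntl.h>
#include <mtd/mtd-user.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define PAGE_SIZE_BYTES (32 * 1024)

/* Erase one 32Kb page via the MTD ioctl and verify it reads back as all 0xff.
   Returns 0 on success, -1 on error (the caller then takes the next free page). */
static int erase_and_check_page(int fd, uint32_t page_no) {
    struct erase_info_user ei = {
        .start  = page_no * PAGE_SIZE_BYTES,
        .length = PAGE_SIZE_BYTES,
    };
    uint8_t buf[4096];

    if (ioctl(fd, MEMERASE, &ei) != 0)
        return -1;

    for (uint32_t off = 0; off < PAGE_SIZE_BYTES; off += sizeof(buf)) {
        if (pread(fd, buf, sizeof(buf), ei.start + off) != (ssize_t)sizeof(buf))
            return -1;
        for (size_t i = 0; i < sizeof(buf); i++)
            if (buf[i] != 0xff)
                return -1;                  /* not clean after erase */
    }
    return 0;
}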

Format Applicability

In my opinion, it turned out to be a good format for storing any more or less compressible streams of information (plain text, JSON, MessagePack, CBOR, possibly protobuf) in NOR Flash.

Of course, the format is "sharpened" for SLC NOR Flash.

It should not be used with high-BER media such as NAND or MLC NOR (is such memory even sold? I have only seen it mentioned in papers on correction codes).

Moreover, it should not be used with devices that have their own FTL: USB flash, SD, MicroSD, etc. (for such memory I made a format with a 512-byte page size, a signature at the beginning of each page and unique record numbers; sometimes all the data from a "glitchy" flash drive could be recovered by simple sequential reading).

Depending on the task, the format can be used without changes on flash chips from 128Kbit (16KB) to 1Gbit (128MB). If desired, it can also be used on larger chips; only the page size would probably need adjusting (though here the question of economic feasibility arises: the price of large NOR Flash is not encouraging).

If the format seems interesting to someone and they want to use it in an open project, write to me: I will try to find the time, tidy up the code and put it on GitHub.

Conclusion

As you can see, in the end the format turned out to be simple and even boring.

It is difficult to convey the evolution of my point of view in the article, but believe me: initially I wanted to create something fancy and indestructible, capable of surviving even a nuclear explosion in the immediate vicinity. However, reason (I hope) won out, and the priorities gradually shifted toward simplicity and compactness.

Could it be that I'm wrong? Yes, sure. It may well turn out, for example, that we have purchased a batch of low-quality microcircuits. Or for some other reason, the equipment will not live up to reliability expectations.

Do I have a plan for this? I think that after reading the article you have no doubt that there is a plan. And not even one.

If a little more seriously, the format was developed both as a working option and as a “trial balloon”.

At the moment, everything is working fine on the bench; one of these days the solution will be deployed to (approximately) a hundred devices, and we will see how it behaves in "combat" operation (fortunately, I hope, the format allows failures to be detected reliably, so it will be possible to collect proper statistics). In a few months it will be possible to draw conclusions (or sooner, if we are unlucky).

If, as a result of the use, serious problems are found and improvements are required, then I will definitely write about it.

Literature

I didn’t want to make a long boring list of used works, after all, everyone has Google.

Here I decided to leave a list of finds that seemed especially interesting to me, but gradually they migrated directly to the text of the article, and one item remained on the list:

  1. The infgen utility from the author of zlib. It can display the contents of deflate/zlib/gzip archives in a readable form. If you have to deal with the internals of the deflate (or gzip) format, I highly recommend it.

Source: habr.com
