Data compression in Apache Ignite. Sber experience

When working with large amounts of data, lack of disk space can become an acute problem. One way to address it is compression, which lets you fit more data on the same hardware. In this article, we will look at how data compression works in Apache Ignite. The article covers only the on-disk compression methods implemented in the product. Other methods of data compression (over the network, in memory), whether implemented or not, remain outside its scope.

So, with persistence enabled, as data in the caches changes, Ignite writes the following to disk:

  1. Contents of caches
  2. Write Ahead Log (WAL)

A mechanism for compressing the WAL, called WAL compaction, has existed for quite some time. The recently released Apache Ignite 2.8 added two more mechanisms for compressing data on disk: disk page compression for the contents of caches and WAL page snapshot compression for some WAL records. More on all three mechanisms below.

Disk page compression

How it works

To begin with, let's briefly look at how Ignite stores data. It uses page memory for storage. The page size is set at node start and cannot be changed later; it must be a power of two and a multiple of the file system block size. Pages are loaded into RAM from disk as needed, and the size of the data on disk may exceed the amount of allocated RAM. If there is not enough room in RAM to load a page from disk, old, no longer used pages are evicted from RAM.

Data is stored on disk as follows: a separate file is created for each partition of each cache group, and in this file pages follow one another in ascending order of their index. The full page ID contains the cache group ID, the partition number, and the page index within the file. Thus, from the full page ID we can uniquely determine the file and the offset within it for every page. You can read more about how page memory is organized in the article on the Apache Ignite Wiki: Ignite Persistent Store — under the hood.
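
As a small illustration of why a fixed page size matters (this is not Ignite's internal code), the offset of a page within its partition file can be computed directly from the page index:

    // Illustrative only, not Ignite internals: with a fixed page size,
    // a page's position on disk is fully determined by its identifiers.
    final class PageLocation {
        static long offsetInPartitionFile(int pageIndex, int pageSize) {
            // The file itself is chosen by (cache group ID, partition number);
            // within the file, pages simply follow each other in index order.
            return (long) pageIndex * pageSize;
        }
    }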

The disk page compression mechanism, as you might guess from the name, works at the page level. When this mechanism is enabled, work with data in RAM is performed as is, without any compression, but at the time of saving pages from RAM to disk, they are compressed.

But compressing each page individually is not yet a solution: you also need to somehow reduce the size of the resulting data files. If the page size is no longer fixed, we can no longer simply write pages to the file one after another, as this creates a number of problems:

  • We would no longer be able to calculate a page's offset in the file from its index.
  • It is not clear what to do with pages that are not at the end of the file and whose size changes. If a page shrinks, the space it frees up is wasted; if it grows, a new place in the file has to be found for it.
  • If a page ends up at an offset that is not a multiple of the file system block size, reading or writing it requires touching one extra file system block, which can degrade performance.

In order not to solve these problems at its own level, disk page compression in Apache Ignite relies on a file system mechanism called sparse files. A sparse file is one in which some zero-filled regions can be marked as "holes". No file system blocks are allocated for storing these holes, which saves disk space.

Logically, for a file system block to be freed, the size of the hole must be greater than or equal to the file system block size, which imposes an additional restriction on the Apache Ignite page size: for compression to have any effect at all, the page size must be strictly larger than the file system block size. If the page size equals the block size, we can never free a single block, since that would require the compressed page to occupy 0 bytes. If the page size equals 2 or 4 blocks, we can already free at least one block if the page compresses to no more than 50% or 75% of its original size, respectively.

Thus, the final description of how the mechanism works: when writing a page to disk, an attempt is made to compress it. If the size of the compressed page allows one or more file system blocks to be freed, the page is written in compressed form and a "hole" is punched in place of the freed blocks (the fallocate() system call with the "punch hole" flag). If the compressed size does not allow any blocks to be freed, the page is saved as is, uncompressed. All page offsets are calculated the same way as without compression, by multiplying the page index by the page size, so no relocation of pages is required, and page offsets, just as without compression, fall on file system block boundaries.
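
Below is a simplified sketch of this write path in Java; it is not Ignite's actual code, and compress(), writeAt(), and punchHole() are hypothetical stand-ins (the last one represents the fallocate() call with the "punch hole" flag).

    import java.nio.ByteBuffer;

    // A simplified sketch of the write path described above; not Ignite's code.
    class PageWriteSketch {
        void writePage(long pageIndex, ByteBuffer page, int pageSize, int fsBlockSize) {
            ByteBuffer compressed = compress(page); // hypothetical: ZSTD/LZ4/Snappy/SKIP_GARBAGE
            // Round the compressed size up to a whole number of file system blocks.
            int usedBlocks = (compressed.remaining() + fsBlockSize - 1) / fsBlockSize;
            int freedBlocks = pageSize / fsBlockSize - usedBlocks;
            long offset = pageIndex * (long) pageSize; // same offsets as without compression

            if (freedBlocks > 0) {
                writeAt(offset, compressed);
                // Punch a "hole" over the blocks this page no longer needs.
                punchHole(offset + (long) usedBlocks * fsBlockSize, (long) freedBlocks * fsBlockSize);
            } else {
                // Not even one block would be freed: store the page as is.
                writeAt(offset, page);
            }
        }

        // Hypothetical helpers, standing in for the real compression codecs
        // and for fallocate() with the "punch hole" flag.
        ByteBuffer compress(ByteBuffer page) { throw new UnsupportedOperationException(); }
        void writeAt(long offset, ByteBuffer data) { throw new UnsupportedOperationException(); }
        void punchHole(long offset, long length) { throw new UnsupportedOperationException(); }
    }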

In the current implementation, Ignite can work with sparse files only under Linux OS, so disk page compression can only be enabled when using Ignite on this operating system.

Compression algorithms that can be used for disk page compression: ZSTD, LZ4, Snappy. In addition, there is an operation mode (SKIP_GARBAGE), in which only unused space in the page is discarded without applying compression to the remaining data, which allows reducing the load on the CPU compared to the previously listed algorithms.

Impact on performance

Unfortunately, I did not measure performance on real test benches, since we do not plan to use this mechanism in production, but we can reason theoretically about where we will lose and where we will gain.

To do this, we need to recall how pages are read and written when they are accessed:

  • On a read, the page is first looked up in RAM; if it is not found there, it is loaded into RAM from disk by the same thread that performs the read.
  • On a write, the page is marked as dirty in RAM, but it is not physically saved to disk immediately by the thread performing the write. All dirty pages are saved to disk later, during the checkpoint process, by separate threads.

So the impact on read operations is:

  • Positive (disk IO), by reducing the number of file system blocks read.
  • Negative (CPU), due to the additional work the operating system has to do for sparse files. Additional IO operations may also implicitly appear here to maintain the more complex structure of a sparse file (unfortunately, I am not familiar with all the details of how sparse files work).
  • Negative (CPU), due to the need to decompress pages.

There is no effect on write operations.

The impact on the checkpoint process is similar to that on read operations:

  • Positive (disk IO), by reducing the number of file system blocks written.
  • Negative (CPU, possibly disk IO), due to working with sparse files.
  • Negative (CPU), due to the need to compress pages.

Which side of the scale will outweigh? It all depends heavily on the environment, but I am inclined to believe that on most systems disk page compression will rather lead to performance degradation. Moreover, tests on other DBMSs that use a similar approach with sparse files show a performance drop when compression is enabled.

How to enable and configure

As mentioned above, the minimum version of Apache Ignite that supports disk page compression is 2.8, and only the Linux operating system is supported. Enabling and configuring it is done as follows (a configuration sketch follows the list):

  • The class-path must contain the ignite-compression module. By default, it is located in the libs/optional directory of the Apache Ignite distribution and is not included in the class-path. You can simply move the directory one level up, to libs, and then it will be picked up automatically when running via ignite.sh.
  • Persistence must be enabled (via DataRegionConfiguration.setPersistenceEnabled(true)).
  • The page size must be larger than the file system block size (it can be set using DataStorageConfiguration.setPageSize()).
  • For each cache whose data needs to be compressed, the compression method and (optionally) the compression level must be configured (CacheConfiguration.setDiskPageCompression(), CacheConfiguration.setDiskPageCompressionLevel()).
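
Putting these steps together, a minimal configuration sketch might look like the following. The 8 KB page size assumes a 4 KB file system block size; the cache name, algorithm, and level are illustrative choices, not recommendations.

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.DiskPageCompression;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class DiskPageCompressionExample {
        public static void main(String[] args) {
            DataStorageConfiguration storageCfg = new DataStorageConfiguration()
                .setPageSize(8 * 1024) // must be larger than the file system block size
                .setDefaultDataRegionConfiguration(
                    new DataRegionConfiguration().setPersistenceEnabled(true));

            CacheConfiguration<Integer, String> cacheCfg =
                new CacheConfiguration<Integer, String>("myCache")
                    .setDiskPageCompression(DiskPageCompression.ZSTD)
                    .setDiskPageCompressionLevel(3); // optional

            Ignition.start(new IgniteConfiguration()
                .setDataStorageConfiguration(storageCfg)
                .setCacheConfiguration(cacheCfg));
        }
    }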

WAL compaction

How it works

What is WAL and why is it needed? Very briefly: it is a log that records all events that eventually change the page store. It is needed primarily for recovery in the event of a crash. Before returning control to the user, any operation must first write its event to the WAL, so that after a crash the log can be replayed to restore all operations for which the user received a successful response, even if those operations did not make it into the page store on disk (as already described above, the actual write to the page store is done with some delay by separate threads in a process called a "checkpoint").

Records in the WAL are divided into logical and physical. Logical records are the keys and values themselves; physical records reflect page changes in the page store. While logical records can be useful in some other cases, physical records are needed only for crash recovery, and only those written since the last successful checkpoint. Here we will not go into the details of why it works this way; those who are interested can refer to the already mentioned article on the Apache Ignite Wiki: Ignite Persistent Store — under the hood.

There are often multiple physical records per logical record: for example, a single cache put operation affects several pages in page memory (a page with the data itself, index pages, free-list pages). On some synthetic tests, physical records turned out to occupy up to 90% of the WAL file volume. At the same time, they are needed for a very short time (by default, the interval between checkpoints is 3 minutes), so it makes sense to get rid of this data once it is no longer relevant. This is exactly what the WAL compaction mechanism does: it gets rid of physical records and compresses the remaining logical records with zip, reducing the file size very significantly (sometimes by tens of times).

Physically, the WAL consists of several segments (10 by default) of a fixed size (64 MB by default) that are overwritten in a circle. As soon as the current segment fills up, the next segment becomes current, and the filled segment is copied to the archive by a separate thread. WAL compaction operates on these archived segments: in a separate thread it monitors checkpoint execution and starts compressing the archived segments whose physical records are no longer needed.
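
For reference, the segment count and size mentioned above are configurable; the small sketch below simply sets them explicitly to their default values, assuming the standard DataStorageConfiguration setters for these parameters.

    import org.apache.ignite.configuration.DataStorageConfiguration;

    class WalLayoutSketch {
        static DataStorageConfiguration storage() {
            return new DataStorageConfiguration()
                .setWalSegments(10)                   // working segments, rotated in a circle
                .setWalSegmentSize(64 * 1024 * 1024); // 64 MB per segment
        }
    }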

Impact on performance

Since WAL compaction runs in a separate thread, it should not directly affect the operations being performed. However, it still creates additional background load on the CPU (compression) and the disk (reading each WAL segment from the archive and writing the compressed segments), so if the system is already running at its limit, it will also lead to performance degradation.

How to enable and configure

You can enable WAL compaction using the WalCompactionEnabled property in DataStorageConfiguration (DataStorageConfiguration.setWalCompactionEnabled(true)). Also, using the DataStorageConfiguration.setWalCompactionLevel() method, you can set the compression level if the default value (BEST_SPEED) does not suit you.
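
A minimal sketch, assuming everything else stays at its defaults; the level value here is just an example of trading extra CPU for a smaller archive.

    import org.apache.ignite.configuration.DataStorageConfiguration;

    class WalCompactionSketch {
        static DataStorageConfiguration storage() {
            return new DataStorageConfiguration()
                .setWalCompactionEnabled(true)
                // Optional: a higher level than the default BEST_SPEED;
                // see the javadoc for the exact valid range.
                .setWalCompactionLevel(9);
        }
    }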

WAL page snapshot compression

How it works

Earlier we found out that WAL records are divided into logical and physical. For every change to every page in page memory, a physical record is generated in the WAL. Physical records, in turn, are divided into two subtypes: page snapshot records and delta records. Every time we change something on a page and it goes from the clean state to the dirty state, a complete copy of that page (a page snapshot record) is stored in the WAL; even if we changed only one byte, the record will be slightly larger than the page size. If we change something on an already dirty page, a delta record is formed in the WAL, which reflects only the changes relative to the previous state of the page rather than the whole page. Since pages are reset from dirty to clean during the checkpoint process, immediately after a checkpoint starts almost all physical records consist only of page snapshots (all pages are clean right after the checkpoint starts); then, as the next checkpoint approaches, the share of delta records starts to grow, and it resets again at the beginning of the next checkpoint. Measurements on some synthetic tests showed that the share of page snapshots in the total volume of physical records reaches 90%.

The idea of WAL page snapshot compression is to compress page snapshots using the ready-made page compression tooling (see disk page compression). At the same time, WAL records are stored sequentially in append-only mode and there is no need to align records to file system block boundaries, so here, unlike with disk page compression, we do not need sparse files at all, and the mechanism works not only on Linux. In addition, it no longer matters how much we managed to compress a page: even if we freed just 1 byte, that is already a positive result, and we can store the compressed data in the WAL, unlike disk page compression, where a compressed page is stored only if at least one whole file system block is freed.

Pages are highly compressible data, and their share of the total WAL volume is very high, so we can get a significant reduction in file size without changing the WAL format. Compressing logical records as well would require a format change and a loss of compatibility, for example for external consumers who may be interested in logical records, while not yielding a significant further reduction in file size.

As with disk page compression, WAL page snapshot compression can use the ZSTD, LZ4, and Snappy algorithms, as well as the SKIP_GARBAGE mode.

Impact on performance

As you can see, enabling WAL page snapshot compression directly affects only the threads that write data to page memory, that is, the threads that change data in caches. Physical records are read from the WAL only once, when the node starts up after a crash (and only if the crash happened during a checkpoint).

The data-changing threads are affected as follows: a negative effect (CPU) from the need to compress each page before writing it to the WAL, and a positive effect (disk IO) from the reduced amount of data written. Accordingly, everything is simple here: if system performance is bottlenecked by the CPU, we get a slight degradation; if by disk I/O, we get an improvement.

Indirectly, the reduced WAL size also positively affects the threads that copy WAL segments to the archive and the WAL compaction threads.

Real performance tests in our environment on synthetic data showed a slight improvement (throughput increased by 10-15%, latency decreased by 10-15%).

How to enable and configure

The minimum Apache Ignite version is 2.8. Enabling and configuring it is done as follows (a configuration sketch follows the list):

  • The class-path must contain the ignite-compression module. By default, it is located in the libs/optional directory of the Apache Ignite distribution and is not included in the class-path. You can simply move the directory one level up, to libs, and then it will be picked up automatically when running via ignite.sh.
  • Persistence must be enabled (via DataRegionConfiguration.setPersistenceEnabled(true)).
  • The compression mode must be set using DataStorageConfiguration.setWalPageCompression(); compression is disabled by default (DISABLED mode).
  • Optionally, you can set the compression level using DataStorageConfiguration.setWalPageCompressionLevel(); see the method's javadoc for valid values for each mode.
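
Putting these steps together, a minimal configuration sketch might look like this; the algorithm and level are illustrative choices.

    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.DiskPageCompression;

    class WalPageCompressionSketch {
        static DataStorageConfiguration storage() {
            return new DataStorageConfiguration()
                .setDefaultDataRegionConfiguration(
                    new DataRegionConfiguration().setPersistenceEnabled(true))
                .setWalPageCompression(DiskPageCompression.ZSTD)
                // Optional; valid values depend on the chosen algorithm (see the javadoc).
                .setWalPageCompressionLevel(3);
        }
    }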

Conclusion

The data compression mechanisms in Apache Ignite discussed here can be used independently of each other, but any combination of them is also acceptable. Understanding how they work will let you determine how well they fit your tasks in your environment and what you will have to sacrifice when using them. Disk page compression is designed to compress the main storage and can give a medium compression ratio. WAL page snapshot compression gives a medium compression ratio for WAL files, and will most likely even improve performance. WAL compaction will not improve performance, but it will reduce the size of WAL files as much as possible by removing physical records.

Source: habr.com
