Durable Data Storage and Linux File APIs

While researching the durability of data storage in cloud systems, I decided to test myself and make sure I understood the basics. I started by reading the NVMe specification to find out what persistence guarantees (that is, guarantees that data will still be available after a system failure) NVMe disks give us. My main conclusion: data must be considered at risk from the moment the write command is issued until the moment it is actually written to the storage medium. However, most programs quite safely use system calls to write data.

In this article, I explore the persistence mechanisms provided by the Linux file APIs. It would seem that everything should be simple here: the program calls write(), and once that call completes, the data is safely stored on disk. But write() only copies the application's data into the kernel cache located in RAM. To force the system to write the data to disk, additional mechanisms must be used.

In general, this material is a set of notes on what I have learned about a topic that interests me. To summarize the most important point: to achieve durable data storage, you need to call fdatasync() or open files with the O_DSYNC flag. If you're interested in learning more about what happens to data on its way from code to disk, take a look at this article.

Features of using the write() function

The write() system call is defined in the IEEE POSIX standard as an attempt to write data to a file descriptor. After write() completes successfully, read operations must return exactly the bytes that were previously written, even if the data is being accessed from other processes or threads (here is the corresponding section of the POSIX standard). In the section on the interaction of threads with normal file operations, there is a note saying that if two threads each call these functions, then each call must either see all of the consequences of the other call, or none of them. This leads to the conclusion that all file I/O operations must hold a lock on the resource being worked on.

Does this mean that the write() operation is atomic? Technically, yes: read operations must return either all or none of what was written with write(). But write(), according to the standard, does not have to finish having written everything it was asked to write; it is allowed to write only part of the data. For example, two threads might each append 1024 bytes to a file referred to by the same file descriptor. From the standard's point of view, the result is acceptable even if each write operation appends only one byte to the file. The operations remain atomic, but after they complete, the data they wrote may end up interleaved. There is a very interesting discussion on this topic on Stack Overflow.
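As an illustration, here is a minimal sketch in C (the helper name is made up) of the usual way applications deal with partial writes: keep calling write() until the whole buffer has been accepted or an error occurs.

```c
#include <errno.h>
#include <unistd.h>

/* Minimal sketch: keep calling write() until the whole buffer is written,
 * since a single call is allowed to write only part of it. */
ssize_t write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t remaining = count;

    while (remaining > 0) {
        ssize_t n = write(fd, p, remaining);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal, just retry */
            return -1;      /* real error, errno is set */
        }
        p += n;
        remaining -= n;
    }
    return (ssize_t)count;
}
```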

fsync() and fdatasync() functions

The easiest way to flush data to disk is to call fsync(). This function asks the operating system to move all modified blocks from the cache to disk. This includes all of the file's metadata (access time, modification time, and so on). I believe this metadata is rarely needed, so if you know it is not important to you, you can use fdatasync() instead. The man page for fdatasync() says that this function saves to disk only the amount of metadata that is "necessary for the correct execution of the following data reading operations". And that is exactly what most applications care about.
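As a small illustration, here is a hedged sketch of the typical write-then-fdatasync() pattern; the file name is made up for the example.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello, durable world\n";

    int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }

    /* Ask the kernel to flush the file data (plus the metadata needed to
     * read it back later) to the storage device. */
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}
```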

One problem that can arise here is that these mechanisms do not guarantee that the file itself can be found after a crash. In particular, when a new file is created, fsync() should also be called on the directory that contains it. Otherwise, after a crash, that file may turn out not to exist. The reason is that under UNIX, because of hard links, a file can exist in multiple directories, so when fsync() is called on a file there is no way to know which directory's data should also be flushed to disk (here you can read more about this). It looks like ext4 is capable of automatically applying fsync() to directories containing the corresponding files, but this may not be the case with other file systems.
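A minimal sketch of this pattern, with hypothetical paths: after syncing the new file itself, open its parent directory and fsync() that descriptor as well, so that the directory entry survives a crash.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Sketch: make both a newly created file and its directory entry durable.
 * Paths and the helper name are made up for illustration. */
int create_durably(const char *dir_path, const char *file_path)
{
    int fd = open(file_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    /* ... write the file contents here ... */
    if (fsync(fd) < 0) { close(fd); return -1; }
    close(fd);

    /* Now flush the directory so the new entry itself is on disk. */
    int dirfd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;
    int rc = fsync(dirfd);
    close(dirfd);
    return rc;
}
```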

This mechanism is implemented differently in different file systems. I used blktrace to find out which disk operations the ext4 and XFS file systems use. Both issue ordinary write commands to disk for both the file contents and the file system journal, flush the cache, and finish by performing a FUA (Force Unit Access, a write that goes directly to the medium, bypassing the cache) write to the journal, presumably to confirm that the transaction committed. On drives that don't support FUA, this results in two cache flushes. My experiments showed that fdatasync() is slightly faster than fsync(): blktrace shows that fdatasync() usually writes less data to disk (in ext4, fsync() writes 20 KiB, while fdatasync() writes 16 KiB). I also found that XFS is slightly faster than ext4, and here blktrace showed that fdatasync() flushes even less data to disk (4 KiB in XFS).

Ambiguous situations when using fsync()

I can think of three ambiguous situations concerning fsync() that I have come across in practice.

The first such incident occurred in 2008. At that time, the Firefox 3 interface froze whenever a large number of files were being written to disk. The problem was that the interface used a SQLite database to store information about its state, and after every change in the interface fsync() was called, which gave good guarantees of durable data storage. In the ext3 file system used at the time, fsync() flushed to disk all of the "dirty" pages in the system, not just those belonging to the relevant file. This meant that clicking a button in Firefox could cause megabytes of data to be written to a magnetic disk, which could take many seconds. As far as I understood from this material, the solution was to move the database work into asynchronous background tasks. In other words, Firefox used to impose stricter storage persistence requirements than were really necessary, and the peculiarities of the ext3 file system only exacerbated the problem.

The second problem happened in 2009. After a system crash, users of the new ext4 file system found that many newly created files were zero-length, which did not happen with the older ext3 file system. In the previous paragraph, I mentioned that ext3 dumped too much data to disk, which slowed fsync() down considerably. To improve the situation, ext4 flushes only those "dirty" pages that are relevant to a particular file, so the data of other files stays in memory much longer than with ext3. This was done to improve performance (by default, the data stays in this state for 30 seconds; you can configure this using dirty_expire_centisecs; here you can find more information on this). It also means that a large amount of data can be irretrievably lost after a crash. The solution is to use fsync() in applications that need durable data storage and must be protected as much as possible from the consequences of failures. fsync() works much more efficiently with ext4 than with ext3. The disadvantage of this approach is that, as before, it slows down some operations, such as installing programs. See details on this here and here.

The third problem with fsync() arose in 2018. Within the PostgreSQL project, it was discovered that if fsync() encounters an error, it marks the "dirty" pages as "clean". As a result, subsequent calls to fsync() do nothing with those pages. Because of this, modified pages stay in memory and are never written to disk. This is a real disaster: the application will think that some data has been written to disk when in fact it has not. Such fsync() failures are rare, and in such situations the application can do almost nothing to combat the problem. These days, when this happens, PostgreSQL and other applications crash. The article "Can Applications Recover from fsync Failures?" explores this problem in detail. Currently the best solution is to use Direct I/O with the O_SYNC or O_DSYNC flag. With this approach, the system will report errors that occur during specific write operations, but it requires the application to manage its own buffers. Read more about it here and here.

Opening files using the O_SYNC and O_DSYNC flags

Let's return to the discussion of the Linux mechanisms that provide durable data storage, namely the use of the O_SYNC or O_DSYNC flag when opening files with the open() system call. With this approach, each write operation is performed as if every write() were followed by, respectively, fsync() or fdatasync(). The POSIX specifications call this "Synchronized I/O File Integrity Completion" and "Data Integrity Completion". The main advantage of this approach is that only one system call is needed to ensure data durability instead of two (for example, write() and fdatasync()). The main disadvantage is that all write operations using the corresponding file descriptor will be synchronized, which can limit how the application code can be structured.
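A minimal sketch of this approach (the file name is arbitrary): the file is opened once with O_DSYNC, and from then on every successful write() also implies the equivalent of fdatasync().

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* With O_DSYNC every successful write() behaves as if it were
     * followed by fdatasync(), so a single system call is enough. */
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char rec[] = "one record\n";
    if (write(fd, rec, strlen(rec)) < 0) { perror("write"); return 1; }

    close(fd);
    return 0;
}
```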

Using Direct I/O with the O_DIRECT flag

The open() system call supports the O_DIRECT flag, which is designed to bypass the operating system cache and perform I/O by interacting directly with the disk. In many cases this means that the write commands issued by the program are translated directly into commands addressed to the disk. But, in general, this mechanism is not a replacement for fsync() or fdatasync(): the disk itself can delay or cache the corresponding write commands. Even worse, in some special cases the I/O operations performed with the O_DIRECT flag are translated into traditional buffered operations. The easiest way to solve this problem is to also open the file with the O_DSYNC flag, so that each write operation is followed by the equivalent of fdatasync().
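Here is a hedged sketch of opening a file with O_DIRECT|O_DSYNC. Note that O_DIRECT typically requires the buffer, offset, and length to be aligned to the device's logical block size; the 4096-byte alignment and the file name below are assumptions made for the example.

```c
#define _GNU_SOURCE          /* needed for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t align = 4096;   /* assumed logical block size */
    void *buf = NULL;
    if (posix_memalign(&buf, align, align) != 0)
        return 1;
    memset(buf, 'x', align);

    /* O_DSYNC is added so that the drive-level cache is also flushed. */
    int fd = open("direct.dat", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0) { perror("open"); free(buf); return 1; }

    if (write(fd, buf, align) < 0)
        perror("write");

    close(fd);
    free(buf);
    return 0;
}
```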

It turns out that the XFS file system recently added a "fast path" for O_DIRECT|O_DSYNC writes. If a block is overwritten using O_DIRECT|O_DSYNC, then XFS, instead of flushing the cache, executes a FUA write command if the device supports it. I verified this using blktrace on a Linux 5.4/Ubuntu 20.04 system. This approach should be more efficient, since it writes the minimum amount of data to disk and uses one operation instead of two (a write and a cache flush). I found a link to the 2018 kernel patch that implements this mechanism. There is some discussion about applying this optimization to other file systems, but as far as I know, XFS is the only file system that supports it so far.

sync_file_range() function

Linux has a system call sync_file_range(), which allows you to flush only part of a file to disk rather than the entire file. This call initiates an asynchronous flush and does not wait for it to complete. But the man page for sync_file_range() calls this command "very dangerous", and its use is not recommended. The features and dangers of sync_file_range() are very well described in this material. In particular, RocksDB appears to use this call to control when the kernel flushes "dirty" data to disk, while still using fdatasync() to ensure durable data storage. The RocksDB code contains some interesting comments on this topic; for example, it looks like sync_file_range() does not actually flush data to disk when ZFS is used. Experience tells me that rarely used code may contain bugs, so I would advise against using this system call unless absolutely necessary.
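For completeness, a minimal sketch of how sync_file_range() can be used to start writeback of a range (the file name and sizes are arbitrary); as noted above, this by itself does not guarantee durability and is not a substitute for fdatasync().

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 0, sizeof buf);
    if (write(fd, buf, sizeof buf) < 0) { perror("write"); return 1; }

    /* Kick off asynchronous writeback of the first 4 KiB only; this does
     * not wait for completion and does not flush the drive cache. */
    if (sync_file_range(fd, 0, sizeof buf, SYNC_FILE_RANGE_WRITE) < 0)
        perror("sync_file_range");

    close(fd);
    return 0;
}
```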

System calls to help ensure data persistence

I've come to the conclusion that there are three approaches that can be used to perform durable I/O operations. All of them require calling fsync() for the directory where the file was created. These are the approaches:

  1. Calling fdatasync() or fsync() after write() (fdatasync() is preferable).
  2. Working with a file descriptor opened with the O_DSYNC or O_SYNC flag (O_DSYNC is preferable).
  3. Using pwritev2() with the RWF_DSYNC or RWF_SYNC flag (RWF_DSYNC is preferable); a sketch of this approach is shown right after this list.
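Below is a minimal sketch of the third approach: a single pwritev2() call with the RWF_DSYNC flag, which behaves like write() followed by fdatasync(). The file name is arbitrary.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("out.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char data[] = "one durable record\n";
    struct iovec iov = { .iov_base = data, .iov_len = strlen(data) };

    /* RWF_DSYNC makes this single call behave like write() + fdatasync().
     * An offset of -1 means "use and advance the current file offset". */
    if (pwritev2(fd, &iov, 1, -1, RWF_DSYNC) < 0)
        perror("pwritev2");

    close(fd);
    return 0;
}
```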

Performance Notes

I didn't carefully measure the performance of the various mechanisms I investigated. The differences I noticed in their speed are very small, which means that I could be wrong and that the same measurements under other conditions might show different results. First, I will talk about what affects performance more, and then about what affects it less.

  1. Overwriting file data is faster than appending data to a file (the performance gain can be 2-100%). Appending data to a file requires additional changes to the file's metadata, even after the fallocate() system call, but the magnitude of this effect may vary. For best performance, I recommend calling fallocate() to pre-allocate the required space, then explicitly filling that space with zeros and calling fsync(). This causes the corresponding blocks in the file system to be marked as "allocated" rather than "unallocated", which gives a small (about 2%) performance improvement (a sketch of this is shown after the list). Also, some disks may have a slower first access to a block than others, which means that filling the space with zeros can lead to a significant (about 100%) performance improvement. In particular, this can happen with AWS EBS disks (this is unofficial data that I could not confirm) and with GCP Persistent Disk (and this is official information, confirmed by tests). Other experts have made the same observation with various disks.
  2. The fewer system calls, the higher the performance (the gain can be about 5%). It looks like calling open() with the O_DSYNC flag, or calling pwritev2() with the RWF_SYNC flag, is faster than calling fdatasync(). I suspect the reason is that fewer system calls are needed to solve the same task (one call instead of two). But the performance difference is very small, so you can easily ignore it and use whatever keeps the application logic from getting more complicated.
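A sketch of the pre-allocation idea from item 1, with an arbitrary file name and size: reserve the space with fallocate(), explicitly fill it with zeros so the blocks end up marked "allocated", and then fsync().

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const off_t size = 1 << 20;   /* 1 MiB, arbitrary for the example */
    int fd = open("prealloc.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve the space up front. */
    if (fallocate(fd, 0, 0, size) < 0) { perror("fallocate"); return 1; }

    /* Explicitly zero-fill so the blocks are "allocated", not "unwritten". */
    char zeros[4096] = {0};
    for (off_t off = 0; off < size; off += (off_t)sizeof zeros) {
        if (pwrite(fd, zeros, sizeof zeros, off) < 0) {
            perror("pwrite");
            return 1;
        }
    }

    if (fsync(fd) < 0) { perror("fsync"); return 1; }
    close(fd);
    return 0;
}
```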

If you are interested in the topic of durable data storage, here are some useful materials:

  • I/O Access methods - an overview of the basics of input/output mechanisms.
  • Ensuring data reaches disk - a story about what happens to data on its way from the application to the disk.
  • When should you fsync the containing directory - the answer to the question of when to apply fsync() to directories. In a nutshell, you need to do this when creating a new file, and the reason for this recommendation is that in Linux there can be many references to the same file.
  • SQL Server on Linux: FUA Internals - a description of how durable data storage is implemented in SQL Server on the Linux platform. There are some interesting comparisons between the Windows and Linux system calls here. I am almost certain that it was from this material that I learned about the FUA optimization in XFS.

Have you ever lost data that you thought was securely stored on disk?

Source: habr.com