Linux 6.2 kernel will include improvements to RAID5/6 in Btrfs

Btrfs improvements that address the "write hole" problem in its RAID 5/6 implementation have been proposed for inclusion in the Linux 6.2 kernel. The essence of the problem is that if a crash occurs during a write, it is initially impossible to tell which blocks on which RAID devices were written completely and which were not. If the array is rebuilt in this state, the data reconstructed for the incompletely written blocks may be corrupted, because the blocks in the stripe are out of sync. The problem affects any RAID1/5/6 array that takes no special measures against it.

In RAID implementations such as Btrfs RAID1, the problem is solved by keeping checksums for both copies: if a checksum does not match, the data is simply restored from the second copy. This approach also works when a device starts returning incorrect data rather than failing completely.

With RAID5/6, however, the file system does not store checksums for parity blocks. Normally this is not a problem: every data block has a checksum, and the parity block can be recomputed from the data. After a partial write, though, this approach may not work in some situations, and the blocks affected by the incomplete write may be reconstructed incorrectly when the array is rebuilt.
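
The effect can be illustrated with a toy model. The following is a minimal sketch, not Btrfs code: it assumes a stripe of two data blocks plus one XOR parity block, and uses `zlib.crc32` as a stand-in for the file system's data checksums. The block contents and the helper `xor_blocks` are hypothetical names chosen for the illustration.

```python
import zlib
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A consistent stripe: two data blocks and their parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)
d1_csum = zlib.crc32(d1)  # data blocks carry checksums, parity does not

# Normal case: the device holding d1 is lost and parity rebuilds it correctly.
assert xor_blocks(d0, parity) == d1

# Write hole: d0 is overwritten, but the crash hits before parity is updated.
d0_new = b"CCCC"
stale_parity = parity

# If the device holding d1 now fails, the rebuild mixes new data with old parity.
rebuilt_d1 = xor_blocks(d0_new, stale_parity)

print(rebuilt_d1 == d1)                    # the rebuilt block is wrong
print(zlib.crc32(rebuilt_d1) == d1_csum)   # the data checksum exposes it
```

In this toy case the checksum of `d1` reveals the bad reconstruction, but nothing in the stripe can say whether the parity itself is stale, because parity has no checksum of its own.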

In Btrfs, the problem matters most when a write is smaller than the stripe. The file system must then perform a read-modify-write (RMW) operation. If that operation encounters blocks from an interrupted write, it can cause corruption that goes undetected despite the checksums. The developers have changed the RMW path so that it verifies block checksums before proceeding and, if data has to be recovered, verifies the checksum again after writing. For sub-stripe (RMW) writes this adds the overhead of computing checksums, but it significantly improves reliability. The equivalent logic for RAID6 is not ready yet; however, such a failure in RAID6 requires the write to fail on two devices at once, which is less likely.
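
The idea of verifying checksums before an RMW can be sketched as follows. This is a simplified illustration under stated assumptions, not the kernel implementation: single XOR parity, `zlib.crc32` standing in for the data checksum, and the hypothetical names `rmw_write` and `StaleStripeError`.

```python
import zlib
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class StaleStripeError(Exception):
    """Raised when a block cannot be verified or repaired before the RMW."""


def rmw_write(stripe, csums, parity, index, new_block):
    """Overwrite stripe[index], verifying the other data blocks first.

    stripe: data blocks as read from disk; csums: their expected checksums;
    parity: the parity block as read from disk.
    Returns (new_stripe, new_csums, new_parity).
    """
    verified = list(stripe)
    for i, block in enumerate(stripe):
        if i == index:
            continue  # about to be overwritten, no need to verify
        if zlib.crc32(block) != csums[i]:
            # Try to rebuild the bad block from parity and the other blocks.
            others = [b for j, b in enumerate(stripe) if j != i]
            candidate = xor_blocks(parity, *others)
            if zlib.crc32(candidate) != csums[i]:
                raise StaleStripeError(f"block {i} cannot be repaired")
            verified[i] = candidate
    verified[index] = new_block
    new_csums = list(csums)
    new_csums[index] = zlib.crc32(new_block)
    # Parity is recomputed only from blocks that passed verification.
    return verified, new_csums, xor_blocks(*verified)


# Usage: d1 was damaged on disk, yet the sub-stripe write to d0 stays safe.
d0, d1 = b"AAAA", b"BBBB"
csums = [zlib.crc32(d0), zlib.crc32(d1)]
parity = xor_blocks(d0, d1)
stripe, csums, parity = rmw_write([d0, b"XXXX"], csums, parity, 0, b"CCCC")
print(stripe, parity == xor_blocks(b"CCCC", b"BBBB"))
```

Without the verification step, the new parity would be computed from the damaged block, which would silently make the damage permanent.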

The developers' recommendations for using RAID5/6 are also worth noting. In Btrfs, metadata and data can use different storage profiles, so metadata can be kept in RAID1 (mirror) or even RAID1C3 (three copies) while data uses RAID5 or RAID6. This gives metadata reliable protection with no "write hole" and keeps the space efficiency typical of RAID5/6 for data. Metadata corruption is avoided, and corruption in data can be corrected.

In addition, for SSDs, Btrfs in the 6.2 kernel will enable asynchronous "discard" by default (discard marks freed blocks that no longer need to be physically stored). The advantage of this mode is high performance: "discard" operations are efficiently grouped in a queue that a background handler then processes, so normal file system operations are not slowed down as they are with synchronous "discard" issued as blocks are freed, and the SSD can make better decisions. It also means that utilities such as fstrim will no longer be needed, since all freed blocks are discarded by the file system itself, without extra scanning and without slowing down operations.
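
The queueing idea behind asynchronous discard can be sketched in a few lines. This is a deliberately simplified model, not the Btrfs implementation (which, among other things, runs in the kernel and paces its work); the class `AsyncDiscardQueue` and the function `coalesce` are hypothetical names used only for illustration.

```python
def coalesce(ranges):
    """Merge overlapping or adjacent (start, length) ranges."""
    merged = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]


class AsyncDiscardQueue:
    """Collects freed ranges cheaply and discards them later in batches."""

    def __init__(self):
        self.pending = []  # ranges freed since the last drain
        self.issued = []   # stands in for discard requests sent to the SSD

    def free(self, start, length):
        # Hot path: only record the range, no device I/O here.
        self.pending.append((start, length))

    def drain(self):
        # Background path: merge neighbouring ranges and issue fewer requests.
        batch = coalesce(self.pending)
        self.pending.clear()
        self.issued.extend(batch)
        return batch


queue = AsyncDiscardQueue()
queue.free(0, 4)
queue.free(4, 4)
queue.free(16, 8)
print(queue.drain())  # [(0, 8), (16, 8)]
```

Three frees turn into two larger discard requests, and none of the work happens on the path that frees the blocks.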

Source: opennet.ru
