ZFS Basics: Storage and Performance

Earlier this spring we covered some introductory topics, for example how to check the speed of your drives and what RAID is. In the second of those articles we even promised to follow up with a study of the performance of various multi-disk topologies in ZFS - the next-generation file system that is now showing up everywhere, from Apple to Ubuntu.

So today, inquisitive readers, is the day to get acquainted with ZFS. Just know that in the humble opinion of OpenZFS developer Matt Ahrens, "it's really hard."

But before we get to the numbers - and they are coming, I promise, for every option of an eight-disk ZFS configuration - we need to talk about how ZFS stores data on disk in the first place.

Zpool, vdev and device

Image: a full pool diagram including three auxiliary vdevs, one of each class, and four RAIDz2 storage vdevs.

Image: there is usually no reason to create a pool of mismatched vdev types and sizes - but nothing stops you from doing so if you want to.

To really understand the ZFS file system, you need to take a close look at its actual structure. First, ZFS unifies the traditional levels of volume and file system management. Secondly, it uses a transactional copy-on-write mechanism. These features mean that the system is structurally very different from conventional file systems and RAID arrays. The first set of basic building blocks to understand are the storage pool (zpool), virtual device (vdev), and real device (device).

zpool

The zpool storage pool is the topmost ZFS structure. Each pool contains one or more virtual devices (vdevs), and each vdev in turn contains one or more real devices. Storage pools are fully self-contained units: one physical computer can host two or more separate pools, but each is completely independent of the others, and pools cannot share virtual devices.

ZFS redundancy lives at the virtual device level, not at the pool level. There is absolutely no redundancy at the pool level - if any storage vdev or SPECIAL vdev is lost, the entire pool is lost along with it.

Modern storage pools can survive the loss of a CACHE or LOG vdev - although they can lose a small amount of dirty data if the LOG vdev is lost during a power outage or system crash.

There is a common misconception that ZFS "stripes" writes across the entire pool. This is not the case. A zpool is not a funny-looking RAID0 at all - it is more like a funny-looking JBOD with a complex, variable distribution mechanism.

For the most part, writes are distributed among the available virtual devices according to their free space, so in theory they will all fill up at the same time. Later versions of ZFS also take current vdev utilization into account - if one virtual device is significantly busier than another (for example, due to read load), it will be temporarily skipped for writes even though it has the highest ratio of free space.

The utilization detection built into modern ZFS write-allocation methods can reduce latency and increase throughput during periods of unusually high load - but it is not carte blanche to casually mix slow HDDs and fast SSDs in one pool. Such a mismatched pool will still generally operate at the speed of the slowest device, as if it were composed entirely of such devices.

vdev

Each storage pool consists of one or more virtual devices (virtual device, vdev). In turn, each vdev contains one or more real devices. Most virtual devices are used for simple data storage, but there are several vdev helper classes, including CACHE, LOG, and SPECIAL. Each of these vdev types can have one of five topologies: single device (single-device), RAIDz1, RAIDz2, RAIDz3, or mirror (mirror).

RAIDz1, RAIDz2, and RAIDz3 are special varieties of what old-timers would call diagonal-parity RAID. The 1, 2, and 3 refer to how many parity blocks are allocated to each data stripe. Instead of dedicating whole disks to parity, RAIDz vdevs distribute that parity semi-evenly across all the disks. A RAIDz vdev can lose as many disks as it has parity blocks; if it loses one more, it fails and takes the storage pool down with it.

In mirror vdevs, each block is stored on every device in the vdev. Although two-wide mirrors are the most common, a mirror can contain an arbitrary number of devices - three-way mirrors are often used in large installations for better read performance and fault tolerance. A mirror vdev can survive any failure as long as at least one device in the vdev remains healthy.
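
As a rough illustration (pool names and device paths here are purely hypothetical), a pool built from mirror vdevs is created like this:

    # a pool of two 2-wide mirror vdevs - device names are examples only
    zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
    # a 3-wide mirror trades capacity for read performance and fault tolerance
    zpool create tank3 mirror /dev/sde /dev/sdf /dev/sdg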

Single-device vdevs are inherently dangerous. Such a vdev cannot survive any failure - and if it is used as a storage vdev or a SPECIAL vdev, its failure will take the entire pool down with it. Be very, very careful here.

CACHE, LOG, and SPECIAL vdevs can be created using any of the above topologies - but remember that the loss of a SPECIAL vdev means the loss of the pool, so a redundant topology is strongly recommended.
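
As a sketch - assuming an OpenZFS release with allocation-class support, and with hypothetical device names - a SPECIAL vdev would normally be added as a mirror rather than as a single device:

    # losing an unmirrored SPECIAL vdev would mean losing the whole pool
    zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1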

device

This is probably the easiest term in ZFS to understand - it is literally a random-access block device. Remember that virtual devices are made up of individual devices, and a pool is made up of virtual devices.

Disks - either magnetic or solid state - are the most common block devices that are used as the building blocks of vdev. However, any device with a descriptor in /dev will do, so entire hardware RAID arrays can be used as separate devices.

A simple raw file is one of the most important alternative block devices a vdev can be built from. Test pools made from sparse files are a very handy way to practice pool commands and to see how much space is available in a pool or vdev of a given topology.

Image: you can create a test pool from sparse files in just a few seconds - but don't forget to delete the entire pool and its components afterwards.

Let's say you want to build an eight-disk server and plan to use 10 TB (~9300 GiB) disks - but you are not sure which topology best suits your needs. In the example above, we build a test pool from sparse files in a matter of seconds - and now we know that a RAIDz2 vdev of eight 10 TB disks provides 50 TiB of usable capacity.
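
That experiment looks roughly like the following sketch; the file sizes and paths are only an illustration:

    # eight sparse 9300 GiB files standing in for 10 TB disks
    truncate -s 9300G /tmp/disk{1..8}.raw
    # build a throwaway RAIDz2 pool from them and check usable capacity
    zpool create testpool raidz2 /tmp/disk{1..8}.raw
    zfs list testpool
    # clean up: destroy the pool and remove the sparse files
    zpool destroy testpool
    rm /tmp/disk{1..8}.raw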

Another special device class is SPARE. Hot spare devices, unlike regular devices, belong to the entire pool rather than to a single virtual device. If a device in one of the pool's vdevs fails, and a spare is attached to the pool and available, the spare will automatically join the degraded vdev.

After connecting to the affected vdev, the spare device begins to receive copies or reconstructions of the data that should be on the missing device. In traditional RAID this is called rebuilding, while in ZFS it is called resilvering.

It is important to note that spare devices do not permanently replace failed devices. They are only a temporary stand-in to reduce the amount of time a vdev runs degraded. After the administrator replaces the failed device, redundancy is rebuilt onto that permanent replacement, and the SPARE detaches from the vdev and returns to duty as a spare for the entire pool.
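
Attaching a hot spare is a single command - pool and device names below are examples:

    # the spare belongs to the whole pool, not to any single vdev
    zpool add tank spare /dev/sdh
    # it shows up as AVAIL in zpool status until a vdev degrades and it kicks in
    zpool status tank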

Data sets, blocks and sectors

The next set of building blocks to understand on our ZFS journey is less about the hardware and more about how the data itself is organized and stored. We're skipping a few layers here - such as the metaslab - so as not to get bogged down in details while keeping the overall structure clear.

Data set (dataset)

Image: when we first create a dataset, it shows all available pool space. Then we set a quota - and change the mount point. Magic!

Image: a zvol is for the most part just a dataset stripped of its file system layer, which we replace here with a perfectly normal ext4 file system.

A ZFS dataset is roughly the same as a standard mounted file system. Like a regular file system, at first glance it looks like "just another folder". But just like regular mountable filesystems, each ZFS dataset has its own set of basic properties.

First of all, a dataset can have an assigned quota. If you set zfs set quota=100G poolname/datasetname, then you will not be able to write more than 100 GiB to the mounted folder /poolname/datasetname.

Notice the presence - and absence - of slashes at the beginning of each line? Each dataset has its own place in both the ZFS hierarchy and the system mount hierarchy. There is no leading slash in the ZFS hierarchy - you start with the pool name and then the path from one dataset to the next. For example, pool/parent/child for a dataset named child under the parent dataset parent, in a pool creatively named pool.

By default, a dataset's mount point is equivalent to its name in the ZFS hierarchy, with a leading slash - the pool named pool is mounted as /pool, the dataset parent is mounted at /pool/parent, and the child dataset child at /pool/parent/child. However, a dataset's system mount point can be changed.

If we specify zfs set mountpoint=/lol pool/parent/child, then the dataset pool/parent/child is mounted on the system as /lol.
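
Put together, the dataset operations described above look roughly like this (pool and dataset names are illustrative):

    zfs create pool/parent
    zfs create pool/parent/child
    # cap the child dataset at 100 GiB
    zfs set quota=100G pool/parent/child
    # remount it elsewhere in the system hierarchy
    zfs set mountpoint=/lol pool/parent/child
    zfs get quota,mountpoint pool/parent/child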

In addition to datasets, we should mention volumes (zvols). A volume is roughly the same as a dataset, except that it doesn't actually have a file system - it's just a block device. You can, for example, create a zvol named mypool/myzvol, format it with an ext4 file system, and then mount that file system - you now have an ext4 file system, but backed by all the safety features of ZFS! This may seem silly on a single machine, but it makes much more sense as a backend when exporting an iSCSI device.
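
A minimal sketch of that zvol workflow on Linux, where ZFS exposes volumes under /dev/zvol (pool and volume names are examples):

    # -V creates a volume (zvol) instead of a regular dataset
    zfs create -V 20G mypool/myzvol
    # format and mount it like any other block device
    mkfs.ext4 /dev/zvol/mypool/myzvol
    mkdir -p /mnt/myzvol
    mount /dev/zvol/mypool/myzvol /mnt/myzvol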

Blocks

Image: a file is represented by one or more blocks. Each block is stored on a single virtual device. The block size is usually equal to recordsize, but can be reduced to 2^ashift if it holds metadata or a small file.

Image: we really, really are not joking about the huge performance penalty for setting ashift too low.

In a ZFS pool, all data, including metadata, is stored in blocks. The maximum block size for each data set is defined in the property recordsize (record size). The record size can be changed, but this will not change the size or location of any blocks that have already been written to the dataset - it only affects new blocks as they are written.

Unless otherwise specified, the current default record size is 128 KiB. It's a tricky trade-off where performance isn't perfect, but it isn't terrible in most cases either. Recordsize can be set to any value from 4K to 1M (with additional tuning it can be set even larger, but that is rarely a good idea).

Any given block holds data from only one file - you can't cram two different files into one block. Each file consists of one or more blocks, depending on its size. If a file is smaller than the record size, it will be stored in a smaller block - for example, a block holding a 2 KiB file will occupy only a single 4 KiB sector on disk.

If a file is large enough to require several blocks, then all of its records will be of size recordsize - including the last record, which may be mostly slack space.

zvols do not have a recordsize property - instead they have the equivalent property volblocksize.
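
In practice, both properties are set per dataset or per zvol - the names and values below are illustrative, not recommendations:

    # large records suit big sequential files; small records suit small random I/O
    zfs set recordsize=1M pool/media
    zfs set recordsize=16K pool/postgres
    # zvols take volblocksize instead, and only at creation time
    zfs create -o volblocksize=16K -V 50G pool/vmdisk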

Sectors

The last, most basic building block is the sector. It is the smallest physical unit that can be written to or read from the underlying device. For several decades most disks used 512-byte sectors. More recently, most disks use 4 KiB sectors, and some - especially SSDs - have 8 KiB sectors or even larger.

ZFS has a property that lets you set the sector size manually: ashift. Somewhat confusingly, ashift is the binary exponent representing the sector size - for example, ashift=9 means a sector size of 2^9, or 512 bytes.

ZFS queries the operating system for details about each block device when it is added to a new vdev and, in theory, automatically sets ashift correctly based on that information. Unfortunately, many drives lie about their sector size in order to remain compatible with Windows XP (which could not handle drives with any other sector size).

This means a ZFS administrator is strongly advised to know the actual sector size of their devices and set ashift manually. If ashift is set too low, the number of read/write operations increases astronomically. Writing 512-byte "sectors" into a real 4 KiB sector means writing the first "sector", then reading the 4 KiB sector, modifying it with the second 512-byte "sector", writing it back out to a new 4 KiB sector, and so on for every single write.

In the real world, this penalty hits Samsung EVO SSDs, which should use ashift=13 but lie about their sector size, and therefore default to ashift=9. If an experienced system administrator does not change this setting, such an SSD works slower than a conventional magnetic HDD.

By contrast, there is practically no penalty for setting ashift too high. There is no real performance hit, and the increase in slack space is infinitesimal (or zero with compression enabled). We therefore strongly recommend that even drives that really do use 512-byte sectors be set up with ashift=12 or even ashift=13, to face the future with confidence.

The ashift property is set per vdev - not per pool, as many mistakenly think - and cannot be changed after the fact. If you accidentally get ashift wrong when adding a new vdev to a pool, you have irretrievably polluted that pool with a low-performance device, and there is usually no choice but to destroy the pool and start over. Even removing the vdev will not save you from a botched ashift configuration!
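
The safe habit is therefore to set ashift explicitly every time a vdev is created or added - a sketch with example device names; the zdb call is one common way to inspect per-vdev ashift:

    # force 4 KiB sectors (2^12) at pool creation...
    zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb
    # ...and again for every vdev added later
    zpool add -o ashift=12 tank mirror /dev/sdc /dev/sdd
    # verify what each vdev actually ended up with
    zdb -C tank | grep ashift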

Copy-on-write mechanism

Image: when a conventional file system needs to overwrite data, it modifies each block in place.

Image: a copy-on-write file system writes a new version of the block and then unlinks the old version.

Image: in the abstract, if we ignore the actual physical location of the blocks, our "data comet" simplifies to a "data worm" that moves from left to right across the map of available space.

Image: now we can get a good idea of how copy-on-write snapshots work - each block can belong to multiple snapshots, and will persist until all associated snapshots are destroyed.

The Copy on Write (CoW) mechanism is the fundamental basis of what makes ZFS such an amazing system. The basic concept is simple - if you ask a traditional file system to change a file, it will do exactly what you asked. If you ask a copy-on-write file system to do the same, it will say "ok" but lie to you.

Instead, a copy-on-write file system writes a new version of the modified block and then updates the file's metadata to unlink the old block and link in the new block you just wrote.

Unlinking the old block and linking in the new one happens in a single operation, so it cannot be interrupted - if you lose power after it happens, you have the new version of the file; if you lose power before, you have the old version. Either way, the file system remains consistent.

Copy-on-write in ZFS occurs not only at the file system level but also at the disk management level. This means that ZFS is not vulnerable to the RAID write hole - a condition in which a stripe is only partially written before the system crashes, leaving the array corrupted after a reboot. Here the stripe is written atomically, the vdev is always consistent, and Bob's your uncle.

ZIL: ZFS intent log

Image: ZFS treats synchronous writes in a special way - it temporarily but immediately stores them in the ZIL, and later writes them out permanently along with asynchronous writes.

Image: typically, data written to the ZIL is never read again - but it can be, after a system crash.

Image: the SLOG, or secondary LOG device, is just a special - and preferably very fast - vdev where the ZIL can be stored separately from the main storage.

Image: after a crash, all dirty data in the ZIL is replayed - in this case the ZIL is on a SLOG, so it is replayed from there.

There are two main categories of write operations - synchronous (sync) and asynchronous (async). For most workloads, the vast majority of writes are asynchronous - the file system allows them to be aggregated and issued in batches, reducing fragmentation and greatly increasing throughput.

Synchronous writes are a different matter entirely. When an application requests a synchronous write, it tells the file system: "You need to commit this to non-volatile storage right now, and until you do, there is nothing else I can do." Synchronous writes must therefore be committed to disk immediately - and if that increases fragmentation or reduces throughput, so be it.

ZFS handles synchronous writes differently than regular file systems - instead of immediately committing them to regular storage, ZFS commits them to a special storage area called the ZFS Intent Log, or ZIL. The trick is that these writes also remain in memory, aggregated together with normal asynchronous write requests, to be flushed to storage later as perfectly normal TXGs (transaction groups).

In normal operation the ZIL is written to and never read from again. When, a few moments later, the writes stored in the ZIL are committed from RAM to the main storage in ordinary TXGs, they are detached from the ZIL. The only time anything is read from the ZIL is when the pool is imported.

If ZFS fails - an operating system crash or a power outage - while there is data in the ZIL, that data will be read during the next pool import (for example, when the crashed system is restarted). Whatever is in the ZIL will be read in, grouped into TXGs, committed to main storage, and then detached from the ZIL during the import process.

One of the auxiliary vdev classes is LOG, also known as SLOG, the secondary LOG device. It has one purpose: to provide the pool with a separate - and preferably much faster, very write-endurant - vdev to store the ZIL, instead of keeping the ZIL on the main storage vdevs. The ZIL itself behaves the same no matter where it is stored, but if the LOG vdev has very high write performance, synchronous writes will complete faster.

Adding a LOG vdev to a pool cannot improve asynchronous write performance - even if you force all writes into the ZIL with zfs set sync=always, they are still committed to main storage in TXGs in the same way and at the same pace as without the log. The only direct performance improvement is in synchronous write latency (because a faster log means faster sync operations).

However, in an environment that already performs a lot of synchronous writes, a LOG vdev can indirectly speed up asynchronous writes and uncached reads as well. Offloading ZIL writes to a separate LOG vdev means less contention for IOPS on the primary storage, which improves the performance of all reads and writes to some extent.
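
Adding a LOG vdev follows the same zpool add pattern as other auxiliary vdevs - a sketch with hypothetical device and dataset names; mirroring the SLOG is a common precaution:

    # a mirrored SLOG on fast SSDs, ideally with power-loss protection
    zpool add tank log mirror /dev/nvme2n1 /dev/nvme3n1
    # forcing everything through the ZIL does NOT speed up asynchronous writes
    zfs set sync=always tank/databases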

Snapshots

The copy-on-write mechanism is also a necessary foundation for ZFS atomic snapshots and incremental asynchronous replication. The active file system has a pointer tree marking all records with current data - when you take a snapshot, you simply make a copy of this pointer tree.

When a record is overwritten in the live file system, ZFS first writes the new version of the block to unused space. It then detaches the old version of the block from the current file system. But if any snapshot references the old block, it still remains unchanged. The old block will not actually be reclaimed as free space until all snapshots referencing it have been destroyed!
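
Taking and inspecting a snapshot is correspondingly cheap - dataset and snapshot names below are examples:

    # an atomic, essentially free copy of the pointer tree
    zfs snapshot pool/dataset@before-upgrade
    zfs list -t snapshot pool/dataset
    # blocks referenced only by this snapshot are freed once it is destroyed
    zfs destroy pool/dataset@before-upgrade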

Replication

Image: my Steam library in 2015 was 158 GiB and included 126 files. This is pretty close to the optimal situation for rsync - ZFS replication over the network was "only" 927% faster.

Image: on the same network, replicating a single 40 GB Windows 7 virtual machine image file is a completely different story. ZFS replication is 289 times faster than rsync - or "only" 161 times faster if you are savvy enough to call rsync with --inplace.

Image: as a VM image scales up, rsync's problems scale with it. 1.9 TiB is not that big for a modern VM image - but it is big enough that ZFS replication is 1,148 times faster than rsync, even with rsync's --inplace argument.

Once you understand how snapshots work, it is easy to grasp the essence of replication. Since a snapshot is just a tree of pointers to records, it follows that if we zfs send a snapshot, we send both that tree and all the records associated with it. When we pipe that zfs send into a zfs receive on the target, it writes both the actual block contents and the pointer tree referencing those blocks into the target dataset.

Things get even more interesting with your second zfs send. We now have two systems, each containing poolname/datasetname@1, and you take a new snapshot, poolname/datasetname@2. So the source pool has datasetname@1 and datasetname@2, while the target pool so far has only the first snapshot, datasetname@1.

Since we have a common snapshot between the source and the target - datasetname@1 - we can build an incremental zfs send on top of it. When we tell the system zfs send -i poolname/datasetname@1 poolname/datasetname@2, it compares the two pointer trees. Any pointers that exist only in @2 obviously refer to new blocks - so we need the contents of those blocks as well.

On the remote system, processing the incremental send is just as simple: first we write all the new records included in the send stream, then we add the pointers to those blocks. Voila - we have @2 on the new system!
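
A sketch of that two-step replication over SSH - host, pool, and dataset names are examples:

    # initial full send of the first snapshot
    zfs send poolname/datasetname@1 | ssh backuphost zfs receive backuppool/datasetname
    # later: an incremental send of only the blocks that are new in @2
    zfs send -i poolname/datasetname@1 poolname/datasetname@2 | \
        ssh backuphost zfs receive backuppool/datasetname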

ZFS asynchronous incremental replication is an enormous improvement over earlier non-snapshot-based techniques such as rsync. In both cases only changed data is transferred - but rsync must first read all the data from disk on both sides in order to checksum and compare it. In contrast, ZFS replication reads nothing but the pointer trees - and whatever blocks are not present in the shared snapshot.

Built-in compression

The copy-on-write mechanism also simplifies the inline compression system. In a traditional file system compression is problematic - both the old version and the new version of the modified data must fit into exactly the same space.

Consider a chunk of data in the middle of a file that begins its life as a megabyte of zeroes, 0x00000000 and so on - it is very easy to compress it down to a single sector on disk. But what happens if we replace that megabyte of zeroes with a megabyte of incompressible data such as JPEG or pseudo-random noise? Suddenly that megabyte of data requires 256 4 KiB sectors instead of one, while only a single sector is reserved at that spot on the disk.

ZFS doesn't have this problem, because modified records are always written to unused space - the original block occupies only one 4 KiB sector, and the new record occupies 256 of them, but that is not a problem: the freshly modified chunk from the "middle" of the file would have been written to unused space whether its size changed or not, so for ZFS this is a perfectly normal situation.

Native ZFS compression is disabled by default, and the system offers pluggable algorithmsβ€”currently LZ4, gzip (1-9), LZJB, and ZLE.

  • LZ4 is a streaming algorithm offering extremely fast compression and decompression; it is a performance win for most use cases - even on fairly slow CPUs.
  • GZIP is the venerable algorithm that all Unix users know and love. It can be used at compression levels 1-9, with compression ratio and CPU usage increasing as you approach level 9. It works well for text-heavy (or otherwise highly compressible) use cases, but often becomes a CPU bottleneck otherwise - use it with care, especially at higher levels.
  • LZJB is the original ZFS algorithm. It is obsolete and should no longer be used; LZ4 surpasses it in every way.
  • ZLE stands for Zero Level Encoding. It leaves normal data alone entirely and only compresses long runs of zeroes. It is useful for completely incompressible datasets (such as JPEG, MP4, or other already-compressed formats), since it ignores the incompressible data but compresses the slack space in the final records.

We recommend LZ4 compression for almost all use cases; the performance penalty on incompressible data is very small, and the performance gain on typical data is significant. In a 2015 test, copying a virtual machine image of a fresh Windows installation (a freshly installed OS with no data in it yet) went 27% faster with compression=lz4 than with compression=none.
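
Enabling LZ4 is a single property change; note that it only affects blocks written after the change (dataset names are illustrative):

    zfs set compression=lz4 poolname/datasetname
    # ratio achieved so far on data written with compression enabled
    zfs get compressratio poolname/datasetname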

ARC - adaptive replacement cache

ZFS is the only modern file system we know of that uses its own read caching mechanism, rather than relying on the operating system's page cache to store copies of recently read blocks in RAM.

The native cache is not without its problems - ZFS cannot respond to new memory allocation requests as quickly as the kernel can, so a new malloc() call may fail if it needs RAM currently occupied by the ARC. But there are good reasons to use your own cache, at least for now.

All the familiar modern operating systems - including MacOS, Windows, Linux, and BSD - use the LRU (Least Recently Used) algorithm for their page cache. It is a primitive algorithm that bumps a cached block "up the queue" every time it is read, and pushes other blocks "down the queue" as needed to make room at the top for new cache misses (blocks that had to be read from disk rather than from the cache).

The algorithm usually works fine, but on systems with large working datasets, LRU easily leads to thrashing - evicting frequently needed blocks to make room for blocks that will never be read from the cache again.

The ARC is a much less naive algorithm, which can be thought of as a "weighted" cache. Every time a cached block is read it becomes a little "heavier" and harder to evict - and even after eviction, the block continues to be tracked for a period of time. A block that was evicted but then has to be read back into the cache also becomes "heavier".

The end result of all this is a cache with a much higher hit ratio - the ratio of cache hits (reads served from the cache) to cache misses (reads served from disk). This is an extremely important statistic: not only are cache hits themselves served orders of magnitude faster, cache misses can also be served faster, since more cache hits mean fewer concurrent disk requests and lower latency for the remaining misses that must be served from disk.
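
On Linux, the raw ARC counters live in /proc/spl/kstat/zfs/arcstats, so a rough hit ratio can be computed with a one-liner like the sketch below (the arcstat and arc_summary tools shipped with OpenZFS present the same data more comfortably):

    awk '$1 == "hits" || $1 == "misses" { v[$1] = $3 }
         END { printf "ARC hit ratio: %.1f%%\n", 100 * v["hits"] / (v["hits"] + v["misses"]) }' \
        /proc/spl/kstat/zfs/arcstats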

Conclusion

After learning the basic semantics of ZFS - how copy-on-write works, as well as the relationships between storage pools, virtual devices, blocks, sectors, and files - we're ready to discuss real-world performance with real numbers.

In the next part, we'll look at the actual performance of pools built from mirror vdevs and RAIDz - against each other, and against the traditional Linux kernel RAID topologies we explored earlier.

At first we only wanted to cover the basics - the ZFS topologies themselves - but after that we'll be ready to talk about more advanced ZFS setup and tuning, including the use of auxiliary vdev types such as L2ARC, SLOG, and special allocation.

Source: habr.com
