How to check disks with fio for sufficient performance for etcd

Translator's note: this article is the result of a small investigation by IBM Cloud engineers looking for a solution to a real problem with the operation of the etcd database. A similar task was relevant for us, but the authors' line of reasoning and actions may be interesting in a broader context.


Brief summary of the entire article: fio and etcd

The performance of an etcd cluster depends heavily on the speed of the underlying storage. etcd exports various Prometheus metrics for monitoring performance; one of them is wal_fsync_duration_seconds. The etcd documentation says that storage can be considered fast enough if the 99th percentile of this metric is less than 10 ms…

If you are considering setting up an etcd cluster on Linux machines and want to check whether the drives (SSDs, for example) are fast enough, we recommend using the popular I/O tester fio. It is enough to run the following command (the test-data directory must be located on the mounted partition of the drive being tested):

fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
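
For example, a minimal sketch of such a run, assuming the drive under test is mounted at /mnt/etcd-disk (a hypothetical mount point; substitute your own):

# hypothetical mount point of the drive being tested
mkdir -p /mnt/etcd-disk/test-data
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/mnt/etcd-disk/test-data --size=22m --bs=2300 --name=mytest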

Then you only need to look at the output and check whether the 99th percentile of fdatasync durations fits within 10 ms. If it does, your drive is fast enough. Here is an example of the output (in it the 99.00th percentile is 2376 µs, i.e. about 2.4 ms):

fsync/fdatasync/sync_file_range:
  sync (usec): min=534, max=15766, avg=1273.08, stdev=1084.70
  sync percentiles (usec):
   | 1.00th=[ 553], 5.00th=[ 578], 10.00th=[ 594], 20.00th=[ 627],
   | 30.00th=[ 709], 40.00th=[ 750], 50.00th=[ 783], 60.00th=[ 1549],
   | 70.00th=[ 1729], 80.00th=[ 1991], 90.00th=[ 2180], 95.00th=[ 2278],
   | 99.00th=[ 2376], 99.50th=[ 9634], 99.90th=[15795], 99.95th=[15795],
   | 99.99th=[15795]

A few notes:

  1. In the example above we adjusted the --size and --bs parameters for our specific case. To get a meaningful result from fio, specify values appropriate for your use case. How to choose them is discussed below.
  2. During the test, only fio loads the disk subsystem. In real life, other processes are likely to write to the disk as well (in addition to those reflected in wal_fsync_duration_seconds). This extra load can increase wal_fsync_duration_seconds. In other words, if the 99th percentile from the fio test is only slightly below 10 ms, there is a good chance that the storage performance will not be sufficient.
  3. For the test you will need fio version 3.5 or newer, because older versions do not aggregate fdatasync results as percentiles (a quick version check is sketched right after this list).
  4. The output above is only a small excerpt from the full fio output.
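
A quick way to check which version is installed (the output below is just an example of what it might look like):

fio --version
# fio-3.16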

More about fio and etcd

A few words about etcd's WAL

Databases generally use write-ahead logging (WAL), and etcd is no exception. A full discussion of WAL is beyond the scope of this article; for our purposes, it is enough to know that each etcd cluster member stores a WAL in persistent storage. etcd writes certain key-value store operations (such as updates) to the WAL before applying them. If a node crashes and restarts between snapshots, etcd can recover the transactions performed since the previous snapshot from the contents of the WAL.

Thus, every time a client adds a key to the KV store or updates the value of an existing key, etcd appends a description of the operation to the WAL, which is an ordinary file in persistent storage. Before proceeding, etcd MUST be 100% sure that the WAL entry has actually been persisted. On Linux, the write system call alone is not enough for this, since the actual write to the physical media may be deferred. For example, Linux may keep the WAL entry in an in-memory kernel cache (e.g., the page cache) for some time. To guarantee that the data reaches the media, the fdatasync system call must be invoked after the write, and that is exactly what etcd does (as can be seen in the following strace output, where 8 is the WAL file descriptor):

21:23:09.894875 lseek(8, 0, SEEK_CUR)   = 12808 <0.000012>
21:23:09.894911 write(8, "..."..., 2296)  = 2296 <0.000130>
21:23:09.895041 fdatasync(8)            = 0 <0.008314>
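
If you want to watch just these calls on a live etcd member yourself, a rough sketch (assuming strace and pidof are available on the node and etcd is found by pidof):

strace -f -tt -T -e trace=lseek,write,fdatasync -p "$(pidof etcd)"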

Unfortunately, writing to persistent storage takes time. Prolonged fdatasync calls can affect the performance of etcd. The repository documentation states that for sufficient performance, the 99th percentile of the duration of all fdatasync calls made while writing to the WAL file must be less than 10 ms. There are other storage-related metrics, but this article focuses on this one.
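
For reference, the same check against a live cluster can be expressed as a Prometheus query along these lines (the full name of the histogram exported by etcd is etcd_disk_wal_fsync_duration_seconds; the 5-minute window is just an example):

histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))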

Evaluating storage with fio

You can evaluate whether a particular storage device is suitable for use with etcd using fio, a popular I/O tester. Keep in mind that disk I/O can happen in many different ways: synchronously or asynchronously, via many different classes of system calls, and so on. The flip side is that fio is extremely difficult to use. It has many parameters, and different combinations of their values produce completely different results. To get a reasonable estimate for etcd, you need to make sure that the write load generated by fio is as close as possible to etcd's WAL write load:

  • This means that the load generated by fio must at least be a series of sequential writes to a file, where each write consists of a write system call followed by fdatasync.
  • To enable sequential writes, specify the flag --rw=write.
  • To make fio write using write calls (rather than other system calls, such as pwrite), use the flag --ioengine=sync.
  • Finally, the flag --fdatasync=1 ensures that every write is followed by fdatasync.
  • The two remaining parameters in our example, --size and --bs, may vary depending on the specific use case. Their configuration is described in the next section (a job-file form of the same parameters is sketched right after this list).
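
For convenience, the same parameters can be written into a fio job file instead of command-line flags; a sketch (the file name etcd-wal-test.fio is arbitrary):

# etcd-wal-test.fio
[mytest]
rw=write
ioengine=sync
fdatasync=1
directory=test-data
size=22m
bs=2300

It is then run simply as fio etcd-wal-test.fio.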

Why we chose fio and how we learned how to set it up

This note comes from a real case we encountered. We had a Kubernetes v1.13 cluster with Prometheus monitoring. SSDs were used as storage for etcd v3.2.24. The etcd metrics showed fdatasync latencies that were too high, even when the cluster was idle. These metrics seemed dubious to us, and we were not sure what exactly they represented. Moreover, the cluster consisted of virtual machines, so it was impossible to tell whether the delays were caused by virtualization or by the SSDs themselves.

In addition, we were considering various changes to the hardware and software configuration and needed a way to evaluate them. Of course, we could have run etcd in each configuration and looked at the corresponding Prometheus metrics, but that would have required significant effort. What we needed was a simple way to evaluate a specific configuration. We also wanted to test our understanding of the Prometheus metrics coming from etcd.

This required solving two problems:

  • First, what does the I/O load generated by etcd when writing to WAL files look like? Which system calls are used? What is the size of the write blocks?
  • Second, suppose we have answers to these questions: how do we reproduce the corresponding load with fio? After all, fio is an extremely flexible utility with an abundance of parameters (this is easy to verify, for example, here; translator's note).

We solved both problems with the same approach, based on the lsof and strace commands:

  • With lsof you can view all file descriptors used by the process, as well as the files they refer to.
  • With strace you can analyze an already running process, or launch a process and observe it. The command displays all system calls made by the process and, if required, by its descendants. The latter is important for processes that fork, and etcd is one of them.

The first thing we did was to use strace to examine the etcd server in the Kubernetes cluster while it was idle.

It turned out that the WAL write blocks are grouped very densely: the size of most of them was in the range of 2200-2400 bytes. That is why the command at the beginning of this article uses the flag --bs=2300 (bs is the size, in bytes, of each block that fio writes).

Note that the size of etcd's write blocks may vary depending on the version, the deployment, parameter values, and so on, and it affects the duration of fdatasync. If you have a similar use case, analyze your etcd processes with strace to get up-to-date values.

Then, to get a clear and complete picture of how etcd works with the file system, we ran it under strace with the flags -ffttT. This made it possible to capture child processes and write the output of each to a separate file, and also to obtain detailed information about the start time and duration of each system call (a sketch of such a run is shown below).
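
A sketch of such a capture against an already running etcd process, plus a rough histogram of the write sizes going to the WAL descriptor (the output file prefix, the pidof lookup and the descriptor number 8 are assumptions from our environment):

strace -ffttT -o etcd-trace -p "$(pidof etcd)"
# one etcd-trace.<pid> file is produced per process/thread

# count how often each write() size occurs for descriptor 8 (the WAL file in our case)
grep ' write(8, ' etcd-trace.* | sed -n 's/.*= \([0-9]*\) <.*/\1/p' | sort -n | uniq -c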

We also used the lsof command to confirm our understanding of the strace output in terms of which file descriptor was used for which purpose. We got strace output similar to the one shown above. Statistical analysis of the synchronization times confirmed that the wal_fsync_duration_seconds metric reported by etcd corresponds to the fdatasync calls made on the WAL file descriptors.
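
A minimal sketch of that check (the pidof lookup and the "wal" path pattern are assumptions; in a typical deployment the WAL files live under a .../member/wal/ directory):

lsof -p "$(pidof etcd)" | grep -i wal
# the FD column shows the descriptor number (8 in our case) next to the WAL file path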

To make fio generate a workload similar to etcd's, we studied the utility's documentation and selected the parameters suitable for our task. We verified that the correct system calls were being made and confirmed their duration by running fio under strace, just as we had done with etcd.
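
A sketch of that cross-check, assuming the same working directory as in the command at the beginning of the article:

strace -ffttT -o fio-trace fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=test-data --size=22m --bs=2300 --name=mytest
# the fio-trace.<pid> files should show the same write + fdatasync pattern as etcd's WAL writes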

We paid particular attention to determining the value of the --size parameter. It represents the total amount of I/O generated by fio; in our case, that is the total number of bytes written to the media. It is directly proportional to the number of write (and fdatasync) calls: for a given bs, the number of fdatasync calls equals size / bs.

Since we were interested in a high percentile, we wanted the number of samples to be large enough to be statistically significant, and decided that 10^4 samples (which corresponds to a size of about 22 MB) would suffice. Smaller values of --size produced more pronounced noise (for example, fdatasync calls that take much longer than usual and skew the 99th percentile).
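
A quick sanity check of that sample count (shell arithmetic; fio interprets 22m as 22 MiB by default):

echo $(( 22 * 1024 * 1024 / 2300 ))
# prints 10029, i.e. roughly 10^4 write + fdatasync pairs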

It's up to you

This article has shown how, using fio, you can judge whether the media intended for use with etcd is fast enough. Now it's up to you! You can explore virtual machines with SSD-based storage in the IBM Cloud service.

PS from translator

For ready-made fio use cases for other tasks, see the documentation or go directly to the project repository (there are many more of them than the documentation mentions).


Source: habr.com
