Storage speed suitable for etcd? Let's ask fio

A short story about fio and etcd

etcd cluster performance largely depends on the performance of its storage. etcd exports storage-related metrics to Prometheus, for example the wal_fsync_duration_seconds metric. The etcd documentation says that for storage to be considered fast enough, the 99th percentile of this metric must be less than 10 ms. If you are planning to run an etcd cluster on Linux machines and want to evaluate whether your storage (an SSD, for example) is fast enough, you can use fio, a popular tool for testing I/O operations. Run the following command, where test-data is a directory under the mount point of the storage you are testing:

fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest

Now look at the results and check that the 99th percentile of the fdatasync duration is less than 10 ms. If it is, your storage is reasonably fast. Here is an example of the results:

  sync (usec): min=534, max=15766, avg=1273.08, stdev=1084.70
  sync percentiles (usec):
   | 1.00th=[ 553], 5.00th=[ 578], 10.00th=[ 594], 20.00th=[ 627],
   | 30.00th=[ 709], 40.00th=[ 750], 50.00th=[ 783], 60.00th=[ 1549],
   | 70.00th=[ 1729], 80.00th=[ 1991], 90.00th=[ 2180], 95.00th=[ 2278],
   | 99.00th=[ 2376], 99.50th=[ 9634], 99.90th=[15795], 99.95th=[15795],
   | 99.99th=[15795]
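In this example output the number to check is the 99.00th value in the sync percentiles: 2376 µs, i.e. roughly 2.4 ms, comfortably below the 10 ms threshold.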

Notes

  • We tuned the --size and --bs options for our particular scenario. To get a useful result from fio, provide your own values. Where do they come from? Read the section below on how we learned to configure fio.
  • During the test, all of the I/O load comes from fio. In a real-life scenario, the storage will likely receive other write requests besides the WAL writes, and that extra load will increase the value of wal_fsync_duration_seconds. So if the 99th percentile is already close to 10 ms, your storage is running out of headroom.
  • Use fio version 3.5 or later (earlier versions don't report fdatasync duration percentiles); you can check which version you have with the command shown after this list.
  • Above is just a snippet of the results from fio.
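To check which fio version you have installed, run:

fio --version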

Long story about fio and etcd

What is WAL in etcd

Databases commonly use a write-ahead log, and etcd is no exception. We will not discuss the write-ahead log (WAL) in detail here; it is enough to know that each member of the etcd cluster maintains one in persistent storage. etcd writes every key-value operation (such as an update) to the WAL before applying it to the store. If one of the cluster members crashes and restarts between snapshots, it can locally restore the transactions made since the last snapshot from the WAL contents.

When a client adds a key to the key-value store or updates the value of an existing key, etcd records the operation in the WAL, which is a regular file in persistent storage. Before continuing, etcd MUST be completely sure that the WAL entry has actually been persisted. On Linux, a single write system call is not enough for this, because the actual write to the physical storage may be delayed. For example, Linux may keep the WAL entry in an in-kernel cache (the page cache) for some time. To guarantee that the data has reached persistent storage, the write must be followed by an fdatasync system call, and that is exactly what etcd does (as you can see in this strace output, where 8 is the WAL file descriptor):

21:23:09.894875 lseek(8, 0, SEEK_CUR)   = 12808 <0.000012>
21:23:09.894911 write(8, ". 20210220361223255266632$10 20103026"34"rn3fo"..., 2296) = 2296 <0.000130>
21:23:09.895041 fdatasync(8)            = 0 <0.008314>

Unfortunately, writing to persistent storage does not happen instantly. If the fdatasync call is slow, the performance of the whole etcd system suffers. The etcd documentation says that the storage is considered fast enough if, at the 99th percentile, fdatasync calls writing to the WAL file take less than 10 ms. There are other useful storage metrics, but in this post we are only talking about this one.
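If Prometheus is already scraping your etcd cluster, you can check this percentile directly; the full metric name exported by etcd is etcd_disk_wal_fsync_duration_seconds. A query along these lines should do it (the 5-minute rate window is just an assumption, adjust it to your scrape setup):

histogram_quantile(0.99, sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))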

Estimating storage with fio

If you need to evaluate whether your storage is suitable for etcd, use fio, a very popular I/O load testing tool. Keep in mind that disk operations can be very different: synchronous and asynchronous, many classes of system calls, and so on. As a result, fio is quite difficult to use: it has many parameters, and different combinations of values produce very different I/O workloads. To get numbers that are meaningful for etcd, you should make sure that the test write load from fio is as close as possible to etcd's actual load when writing WAL files.

Therefore, fio should, at a minimum, create a load consisting of a series of sequential writes to a file, where each write is a write system call followed by an fdatasync system call. Sequential writes require the --rw=write option. For fio to issue the write system call, rather than pwrite, specify the --ioengine=sync parameter. Finally, to call fdatasync after every write, add the --fdatasync=1 parameter. The other two options in this example (--size and --bs) are scenario-specific; in the next section we will show how to choose them.
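Putting it all together, here is the same command from the beginning of the post with each option annotated (the --size and --bs values are the ones from our scenario; substitute your own):

# --rw=write       sequential writes to a single file
# --ioengine=sync  issue plain write system calls (not pwrite)
# --fdatasync=1    call fdatasync after every write
# --size=22m       total amount written, i.e. number of samples times the block size
# --bs=2300        bytes per write, matching the WAL record size we observed
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest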

Why exactly fio and how we learned to set it up

In this post, we describe a real case. We had a Kubernetes v1.13 cluster that we monitored with Prometheus, with etcd v3.2.24 hosted on SSDs. The etcd metrics showed fdatasync latencies that were too high, even when the cluster was doing nothing. The metrics looked odd, and we didn't really know what they meant. The cluster consisted of virtual machines, so we needed to understand where the problem was: in the physical SSDs or in the virtualization layer. In addition, we often changed the hardware and software configuration and needed a way to evaluate the results of those changes. We could have run etcd in every configuration and looked at the Prometheus metrics, but that would have been too much hassle. We were looking for a reasonably simple way to evaluate a specific configuration. We also wanted to verify that we understood the Prometheus metrics from etcd correctly.

But for this, two problems had to be solved. First, what does the I/O load that etcd creates when writing the WAL look like? Which system calls are used? What is the size of the writes? Second, once those questions are answered, how do we reproduce a similar workload with fio? Don't forget that fio is a very flexible tool with many options. We solved both problems with a single approach, using the lsof and strace commands. lsof lists all file descriptors used by a process and the files associated with them. With strace, you can examine an already running process, or launch a process and examine it; strace prints all system calls made by the process being examined (and its child processes). The latter is important, because that is exactly the approach we took with etcd.

We first used strace to examine the etcd server under Kubernetes while there was no load on the cluster. We saw that almost all WAL records were about the same size: 2200-2400 bytes. Therefore, in the command at the beginning of this post, we specified --bs=2300 (bs is the block size in bytes for each write fio performs). Note that the size of etcd's writes depends on the etcd version, the distribution, the parameter values, and so on, and it affects the fdatasync duration. If you have a similar scenario, examine your etcd processes with strace to find out the exact numbers.
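As a sketch (the PID placeholder is illustrative, and you may need root privileges), attaching strace to a running etcd process and filtering for the relevant system calls looks roughly like this:

strace -f -T -e trace=write,fdatasync -p <etcd-pid>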

Then, to get a complete picture of what etcd was doing with the file system, we launched it under strace with the -ff, -tt, and -T options. This let us follow the child processes and record the output of each one in a separate file, and also get detailed reports on the start time and duration of every system call. We used lsof to confirm our analysis of the strace output and to see which file descriptor was being used for which purpose. This is how the strace results shown above were obtained. The synchronization timing statistics confirmed that wal_fsync_duration_seconds reported by etcd is consistent with the fdatasync calls made on the WAL file descriptors.
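The commands we used looked roughly like this (the output path, PID, and etcd flags are illustrative; with -ff, strace writes one output file per process to the path given with -o):

strace -ff -tt -T -o ./strace-etcd-out etcd <your etcd flags>
lsof -p <etcd-pid>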

We went through the fio documentation and chose options for our scenario so that fio would generate a load as close to etcd's as possible. We also checked the system calls made by fio and their durations by running fio under strace, just as we had done with etcd.

We chose the value of the --size parameter carefully, because it determines the total I/O load produced by fio. In our case, it is the total number of bytes written to the storage, and it is directly proportional to the number of write (and fdatasync) system calls: for a given bs, the number of fdatasync calls equals size / bs. Since we were interested in a percentile, we wanted enough samples for it to be statistically meaningful, and we calculated that 10^4 would be enough for us (that comes to 22 mebibytes). With a smaller --size, outliers would carry too much weight (for example, a few fdatasync calls that take longer than usual could skew the 99th percentile).
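To spell out the arithmetic behind the values in the command above: with --bs=2300, getting 10^4 fdatasync samples requires size = 10^4 × 2300 bytes = 23,000,000 bytes, which is roughly 22 MiB, hence --size=22m.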

Try it yourself

We showed you how to use fio to check whether your storage is fast enough for etcd to perform well. Now you can try it yourself, for example using virtual machines with SSD storage in IBM Cloud.

Source: habr.com
