Why is my NVMe slower than an SSD?

In this article, we will look at some of the nuances of the I/O subsystem and their impact on performance.

A couple of weeks ago I ran into the question of why an NVMe drive on one server was slower than a SATA SSD on another. I looked at the servers' specifications and realized it was a trick question: the NVMe drive was a consumer model, while the SATA SSD was a server-grade one.

Obviously, comparing products from different segments in different environments is not fair, but that is not an exhaustive technical answer either. We will study the basics, run some experiments, and answer the question posed.

What is fsync and where is it used

To speed up work with drives, data is buffered: it is kept in volatile memory until a convenient moment comes to flush the contents of the buffer to the drive. When that moment comes is determined by the operating system and by the characteristics of the drive. In the event of a power failure, all data still in the buffer is lost.

There are tasks in which you need to be sure that changes to a file have actually been written to the drive rather than sitting in an intermediate buffer. This assurance is provided by the POSIX fsync system call, which forces the buffered data to be written to the drive.

Let's demonstrate the effect of buffers with an artificial example in the form of a short C program.

#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void) {
    /* Open answer.txt for writing; create it if it does not exist */
    int fd = open("answer.txt", O_WRONLY | O_CREAT, 0644);
    /* Write the first chunk of data */
    write(fd, "Answer to the Ultimate Question of Life, The Universe, and Everything: ", 71);
    /* Pretend to run calculations for 10 seconds */
    sleep(10);
    /* Write the result of the calculations */
    write(fd, "42\n", 3);
    close(fd);

    return 0;
}

The comments explain the sequence of actions in the program well enough. The text "Answer to the Ultimate Question of Life, The Universe, and Everything:" will be buffered by the operating system, and if you restart the server by pressing the Reset button during the "calculations", the file will end up empty. In our example, losing this text is not a problem, so fsync is not needed. Databases do not share this optimism.

Databases are complex programs that work with many files at once, and they need to be sure that the data they write actually reaches the drive, because the consistency of the data inside the database depends on it. A database is designed to record every committed transaction and be ready for a power outage at any moment. This behavior forces it to call fsync constantly and in large quantities.
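For illustration, here is a minimal sketch of how the example above could be changed so that the first chunk of text survives a sudden reset: an fsync call right after each write forces the buffered data out to the drive.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void) {
    const char msg[] = "Answer to the Ultimate Question of Life, The Universe, and Everything: ";

    int fd = open("answer.txt", O_WRONLY | O_CREAT, 0644);
    /* Write the first chunk and immediately force it out of the buffers to the drive */
    write(fd, msg, sizeof(msg) - 1);
    fsync(fd);
    /* Pretend to compute for 10 seconds; a reset here no longer loses the text above */
    sleep(10);
    /* Write the result and flush it as well */
    write(fd, "42\n", 3);
    fsync(fd);
    close(fd);
    return 0;
}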

What frequent use of fsync affects

During normal I/O, the operating system tries to optimize communication with drives, since they sit at the slowest level of the memory hierarchy. It therefore tries to write as much data as possible in a single access to the drive.
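Before turning to the benchmark tools, here is a minimal sketch of that cost at the level of a single application: it times a batch of small writes with and without an fsync after each one. The file name latency_test.bin is arbitrary; run the program on the file system of the drive you want to observe.

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Measure how long 1000 small writes take, optionally with an fsync after each one. */
static double run(int do_fsync) {
    int fd = open("latency_test.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char block[4096] = {0};
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 1000; i++) {
        write(fd, block, sizeof(block));
        if (do_fsync)
            fsync(fd);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("1000 writes without fsync: %.3f s\n", run(0));
    printf("1000 writes with fsync:    %.3f s\n", run(1));
    return 0;
}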

Let's demonstrate the impact of using fsync with a specific example. We have the following SSDs as test subjects:

  • Intel® DC SSD S4500 480 GB, connected via SATA 3.2, 6 Gbit/s;
  • Samsung 970 EVO Plus 500 GB, connected via PCIe 3.0 x4, ~31 Gbit/s.

The tests are run on an Intel® Xeon® W-2255 under Ubuntu 20.04, using sysbench 1.0.18 to benchmark the disks. Each disk has a single partition formatted as ext4. Preparation for the test consists of creating 100 GB of test files:

sysbench --test=fileio --file-total-size=100G prepare

Running tests:

# Without fsync
sysbench --num-threads=16 --test=fileio --file-test-mode=rndrw --file-fsync-freq=0 run

# With fsync after every write
sysbench --num-threads=16 --test=fileio --file-test-mode=rndrw --file-fsync-freq=1 run

The test results are presented in the table.

Test                          Intel® S4500    Samsung 970 EVO+
Read without fsync, MiB/s     5734.89         9028.86
Write without fsync, MiB/s    3823.26         6019.24
Read with fsync, MiB/s        37.76           3.27
Write with fsync, MiB/s       25.17           2.18

It is easy to see that the consumer NVMe drive leads confidently when the operating system itself decides how to work with the disk, and loses when fsync is used. This raises two questions:

  1. Why does the read speed exceed the physical bandwidth of the link in the test without fsync?
  2. Why is a server segment SSD better at handling a large number of fsync requests?

The answer to the first question is simple: sysbench generates zero-filled files. Thus, the test was carried out over 100 gigabytes of zeros. Since the data is very uniform and predictable, various OS optimizations come into play, and they significantly speed up the execution.
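This is easy to verify: sysbench's fileio prepare step creates its test files (typically named test_file.0, test_file.1, and so on) in the current directory, and a short program can scan one of them for non-zero bytes. The file name below is an assumption about your working directory.

#include <stdio.h>

/* Scan a file and report whether it contains any non-zero byte. */
int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "test_file.0"; /* sysbench fileio naming */
    FILE *f = fopen(path, "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }

    unsigned char buf[1 << 16];
    size_t n;
    long long nonzero = 0;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        for (size_t i = 0; i < n; i++)
            if (buf[i] != 0)
                nonzero++;

    fclose(f);
    printf("non-zero bytes: %lld\n", nonzero);
    return 0;
}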

If this makes you doubt all of sysbench's results, you can repeat the measurements with fio.

# Without fsync
fio --name=test1 --blocksize=16k --rw=randrw --iodepth=16 --runtime=60 --rwmixread=60 --fsync=0 --filename=/dev/sdb

# With fsync after every write
fio --name=test1 --blocksize=16k --rw=randrw --iodepth=16 --runtime=60 --rwmixread=60 --fsync=1 --filename=/dev/sdb

Test                          Intel® S4500    Samsung 970 EVO+
Read without fsync, MiB/s     45.5            178
Write without fsync, MiB/s    30.4            119
Read with fsync, MiB/s        32.6            20.9
Write with fsync, MiB/s       21.7            13.9

The NVMe drive's drop in performance when fsync is used is clearly visible. We can move on to the second question.

Optimization or bluff

Earlier we said that data is stored in a buffer, but did not specify which one, because it did not matter at the time. Even now we will not delve into the intricacies of operating systems; we will simply distinguish two general types of buffers:

  • software;
  • hardware.

The software buffer refers to the buffers inside the operating system, while the hardware buffer is the volatile memory of the disk controller. The fsync system call sends the drive a command to write the data from its buffer to the main storage, but it has no way to verify that the command is actually carried out.
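To make the software side of this picture concrete, here is a small sketch of the two flush steps an application controls: fflush moves data from the C library's user-space buffer into the operating system's page cache, and fsync asks the kernel to push it from the page cache down to the drive, which in turn involves the drive's own hardware cache. The file name journal.log is just an illustration.

#include <stdio.h>
#include <unistd.h>

int main(void) {
    FILE *f = fopen("journal.log", "w");   /* hypothetical file name */
    if (!f)
        return 1;

    fprintf(f, "committed transaction 42\n");

    /* Step 1: flush the C library buffer into the OS page cache. */
    fflush(f);

    /* Step 2: ask the kernel to write the page cache out to the drive.
     * Whether the drive's own volatile cache honours this is exactly
     * what the power-failure test below is meant to reveal. */
    fsync(fileno(f));

    fclose(f);
    return 0;
}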

Since the server SSD handles fsync better, two assumptions can be made:

  • the disk is designed for this kind of load;
  • the disk "bluffs" and ignores the command.

Dishonest behavior by the drive can be detected with a power-failure test. You can check it with the diskchecker.pl script, which was written back in 2005.

The script requires two physical machines: a "server" and a "client". The client writes small amounts of data to the drive under test, calls fsync, and sends the server information about what was written.

# Run on the server
./diskchecker.pl -l [port]

# Run on the client
./diskchecker.pl -s <server[:port]> create <file> <size_in_MB>

Once the script is running, you need to cut power to the "client" and leave it without power for several minutes. It is important to actually disconnect the test machine from electricity, not just perform a hard shutdown. After some time, the client can be powered back on and booted into the OS. Once the OS has booted, run diskchecker.pl again, this time with the verify argument.

./diskchecker.pl -s <server[:port]> verify <file>

At the end of the check you will see the number of errors. If it is 0, the disk has passed the test. To rule out a lucky coincidence, the experiment can be repeated several times.

Our S4500 showed no errors after power loss, which means it is ready for workloads with a lot of fsync calls.

Conclusion

When choosing disks or entire ready-made configurations, keep in mind the specifics of the tasks they will be solving. At first glance it seems obvious that an NVMe drive, that is, an SSD with a PCIe interface, is faster than a "classic" SATA SSD. However, as we have seen today, under specific conditions and with certain workloads this may not be the case.

How do you test server components when renting from an IaaS provider?
We are waiting for you in the comments.


Source: habr.com
