Lots of free RAM, an NVMe Intel P4500, and everything slows down badly - the story of how adding a swap partition went wrong

In this article, I will talk about a situation that recently happened with one of the servers in our VPS cloud and kept me confused for several hours. I have been configuring and troubleshooting Linux servers for about 15 years, but this case did not fit my experience at all - I made several false assumptions and got a little desperate before I managed to pin down the cause of the problem and solve it.

Preamble

We operate a medium-sized cloud, which we build on typical servers of the following configuration - 32 cores, 256 GB RAM and a 4 TB PCI-E NVMe Intel P4500 drive. We really like this configuration, because it lets us not worry about running out of IO, provided we enforce correct limits at the level of VM instance types. Since the NVMe Intel P4500 has impressive performance, we can simultaneously provide full IOPS to the VMs and stream backups to a backup server with zero IOWAIT.

We are among those old believers who do not use hyperconvergence, SDN and other stylish, fashionable, youthful things for storing VM volumes, believing that the simpler the system, the easier it is to troubleshoot when "the main guru has gone off to the mountains." As a result, we store VM volumes as QCOW2 files on XFS or EXT4 deployed on top of LVM2.

We are also pushed toward QCOW2 by the product we use for orchestration - Apache CloudStack.

To perform a backup, we take a full image of the volume as an LVM2 snapshot (yes, we know that LVM2 snapshots are slow, but the Intel P4500 helps us out here too). We run lvcreate -s ... and use dd to send the backup to a remote server with ZFS storage. Here we are still slightly progressive - after all, ZFS can store data in compressed form, and we can quickly restore it with dd or retrieve individual VM volumes with mount -o loop ....
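
As an illustration of the receiving side, here is a minimal sketch assuming a ZFS pool called backup with a dataset backup/volumes exported over NFS (the pool, dataset, mount point and file names are invented for the example, not taken from our production setup):

zfs create -o compression=lz4 backup/volumes    # transparent compression for the raw images
zfs set sharenfs=on backup/volumes              # export the dataset to the hypervisor hosts

# restoring a whole volume back onto a hypervisor
dd if=/mnt/backups/volumes/host1-17.raw of=/dev/images/volume bs=1M

# pulling individual QCOW2 files out of a backup without a full restore
mount -o loop,ro /mnt/backups/volumes/host1-17.raw /mnt/restore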

Of course, instead of taking a full image of the LVM2 volume, one could mount the file system read-only and copy the QCOW2 images themselves; however, we ran into XFS misbehaving when doing this - not immediately, but unpredictably. We really don't like it when hypervisor hosts suddenly get stuck on weekends, nights or holidays because of bugs that strike at unknown times. Therefore, for XFS we do not use read-only snapshot mounts to extract volumes, but simply copy the entire LVM2 volume.

In our case, the speed of backup to the backup server is determined by the performance of the backup server, which is about 600-800 MB/s for incompressible data; a further limiter is the 10 Gbit/s channel that connects the backup server to the cluster.

At the same time, backup copies from 8 hypervisor servers are uploaded to one backup server simultaneously. Thus, the disk and network subsystems of the backup server, being slower, prevent overloading the disk subsystems of the hypervisor hosts, since they simply cannot absorb the, say, 8 GB/s that the hypervisor hosts can easily produce.

The copying process described above is very important for the rest of the story, including the details - the fast Intel P4500 drive, the use of NFS, and, probably, the use of ZFS.

The story of one backup

On each hypervisor node, we have a small 8 GB SWAP partition, and we "roll out" the hypervisor node itself using dd from a reference image. For the system volume on servers, we use 2x SATA SSD RAID1 or 2x SAS HDD RAID1 on an LSI or HP hardware controller. In general, it doesn't really matter to us what's inside, since the system volume operates in an almost read-only mode, except for SWAP. And since we have a lot of RAM on the server - 30-40% of it free - we don't think about SWAP.

The backup process itself looks like this:

#!/bin/bash

mkdir -p /mnt/backups/volumes

DIR=/mnt/images-snap
VOL=images/volume
DATE=$(date "+%d")
HOSTNAME=$(hostname)

# take an LVM2 snapshot of the volume, giving the copy-on-write area all free extents in the VG
lvcreate -s -n $VOL-snap -l100%FREE $VOL
# stream the snapshot to the NFS-mounted backup share: direct reads, idle IO priority
ionice -c3 dd iflag=direct if=/dev/$VOL-snap bs=1M of=/mnt/backups/volumes/$HOSTNAME-$DATE.raw
# drop the snapshot once the copy is finished
lvremove -f $VOL-snap

Pay attention to ionice -c3 - for NVMe devices this is actually completely useless, since the IO scheduler for them is set to:

cat /sys/block/nvme0n1/queue/scheduler
[none] 

However, we have a number of legacy nodes with conventional SSD RAIDs, for which this is relevant, so the script travels over AS IS. Overall, this is just an interesting bit of code that shows the futility of ionice with such a configuration.
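
For comparison, on those legacy nodes the scheduler does support IO priorities, so ionice -c3 really does demote the backup's reads (sda here is just an example device name):

cat /sys/block/sda/queue/scheduler
noop deadline [cfq]

Only priority-aware schedulers such as CFQ or BFQ honour ionice classes; with none, the class is simply ignored.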

Pay attention to the iflag=direct flag for dd. We use direct IO, bypassing the page cache, to avoid needlessly displacing cached data while reading. However, we do not use oflag=direct, as we have seen performance issues with ZFS when using it.
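
A quick way to see what direct reads buy us, purely as an illustration (the device is the snapshot from the script above, the sizes are made up):

free -m        # note the buff/cache figure
dd if=/dev/images/volume-snap of=/dev/null bs=1M count=4096
free -m        # a buffered read has inflated the page cache
dd if=/dev/images/volume-snap of=/dev/null bs=1M count=4096 iflag=direct
free -m        # the direct read leaves the cache untouched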

We had been using this scheme successfully for several years without any problems.

And then it began... We noticed that one of the nodes had stopped being backed up, and the previous backup had finished with a monstrous IOWAIT of about 50%. While trying to understand why the copy was not happening, we ran into this:

Volume group "images" not found

We started thinking "the end has come for the Intel P4500", but before shutting the server down to replace the drive, we still had to perform a backup. We fixed LVM2 by restoring the metadata from an LVM2 backup:

vgcfgrestore images
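
For reference, this works because LVM2 keeps automatic metadata backups on the root file system; a slightly fuller sequence, as a sketch, would be:

vgcfgrestore --list images    # list the available metadata backups for the VG
vgcfgrestore images           # restore the most recent one
vgchange -ay images           # reactivate the logical volumes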

We started the backup and saw this oil painting:
[screenshot: iostat output showing miserable IOPS and very high IOWAIT]

Once again we were very sad - it was clear we couldn't live like this, since all the VPSs would suffer, which meant we would suffer too. What had happened was completely incomprehensible - iostat showed miserable IOPS and sky-high IOWAIT. There were no ideas other than "let's replace the NVMe", but then, just in time, came the insight.

Analysis of the situation step by step

The history. A few days earlier, this server had needed to host a large VPS with 128 GB of RAM. There seemed to be enough memory, but as a safety net another 32 GB was allocated to a swap partition. The VPS was created, successfully did its job, the incident was forgotten - but the SWAP partition remained.

Configuration details. On all cloud servers, vm.swappiness was left at the default value of 60, and SWAP had been created on the SAS HDD RAID1.
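
On such a node this is easy to confirm:

swapon --show                  # shows which device backs the swap space
cat /proc/sys/vm/swappiness    # 60 on these nodes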

What happened (as we reconstruct it). During the backup, dd produced a huge volume of data to write, which was buffered in RAM before being written out over NFS. The kernel, guided by the swappiness policy, moved many pages of VPS memory to the swap area, which lived on the slow HDD RAID1 volume. This made IOWAIT grow enormously - not because of NVMe IO, but because of HDD RAID1 IO.
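
In hindsight, this is exactly the picture that would have pointed straight at the culprit (a sketch, not the output we actually captured):

vmstat 1       # the si/so columns show pages being swapped in and out every second
iostat -x 1    # per-device await and %util reveal that the HDD RAID1, not nvme0n1, is saturated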

How the problem was solved. The 32 GB swap partition was disabled. That took 16 hours - you can read elsewhere about how and why SWAP turns off so slowly. The swappiness setting was changed to 5 across the entire cloud.
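
In command form, the fix boils down to something like this (a sketch; the partition and file names are ours for the example):

swapoff /dev/sda3                                              # migrating pages back to RAM is what took ~16 hours
sysctl -w vm.swappiness=5                                      # apply immediately
echo "vm.swappiness = 5" > /etc/sysctl.d/99-swappiness.conf    # persist across reboots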

How this could have been avoided. First, if SWAP had been on an SSD RAID or NVMe device; second, if there had been no NVMe device but a slower one that could not have produced such a volume of data - ironically, the problem happened because the NVMe is too fast.

After that, everything began to work as before - with zero IOWAIT.

Source: habr.com
