Steal: who steals CPU time from virtual machines

Hello! I want to explain in simple terms how steal arises inside virtual machines, along with some non-obvious artifacts we managed to uncover while researching it, a topic I had to dive into as technical director of the Mail.ru Cloud Solutions cloud platform. The platform runs on KVM.

CPU steal time is the time during which a virtual machine does not receive processor resources for its execution. This time is reported only inside guest operating systems in virtualized environments. Where exactly those allocated resources go is, as in life, rather murky. But we decided to figure it out and even ran a number of experiments. It's not that we now know everything about steal, but we will tell you a few interesting things.

1. What is steal

So, steal is a metric that indicates a shortage of processor time for processes inside a virtual machine. As described in the KVM kernel patch, steal is the time during which the hypervisor executes other processes on the host OS even though the virtual machine process is queued for execution. That is, steal is calculated as the difference between the moment a process becomes ready to run and the moment the process is actually given CPU time.
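
On Linux, this "ready to run but not running" time of a single process can be seen directly: /proc/&lt;pid&gt;/schedstat exposes time spent on the CPU, time spent runnable while waiting for a CPU (run_delay), and the number of timeslices. The KVM steal patches build the guest's steal figure from essentially this run_delay of the vCPU thread. A minimal parsing sketch (the sample values below are invented):

```python
def parse_schedstat(text):
    """Parse /proc/<pid>/schedstat: time on CPU, time spent runnable
    while waiting for a CPU (run_delay), and number of timeslices.
    Both times are in nanoseconds."""
    on_cpu_ns, run_delay_ns, timeslices = (int(f) for f in text.split())
    return on_cpu_ns, run_delay_ns, timeslices

# Example with an invented sample line; on a real Linux host you would
# read open("/proc/self/schedstat").read() instead.
sample = "152404971 9487563 215"
on_cpu, run_delay, slices = parse_schedstat(sample)
print(run_delay)  # the "ready but not scheduled" time, in ns
```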

The virtual machine kernel receives the steal metric from the hypervisor. At the same time, the hypervisor does not specify what other processes it is performing, just β€œwhile I’m busy, I can’t give you time.” On KVM, support for steal calculation is added in patches. There are two key points here:

  • The virtual machine learns about steal from the hypervisor. From the point of view of processes on the virtual machine itself, this is an indirect measurement of their losses, and it can be subject to various distortions.
  • The hypervisor does not tell the virtual machine what else it is doing; it only reports that it has no time for the VM. Because of this, the virtual machine itself cannot detect distortions in the steal figure, which could otherwise be assessed from the nature of the competing processes.

2. What affects steal

2.1. Steal calculation

In fact, steal is calculated in much the same way as ordinary CPU utilization. There is not much information about how utilization is calculated, probably because most people consider the question obvious. But there are pitfalls here too. For an overview of the process, read the article by Brendan Gregg: you will learn about a bunch of nuances in calculating utilization and about situations where the calculation is wrong for the following reasons:

  • Overheating of the processor, in which cycles are skipped.
  • Enable/disable turbo boost, which changes the processor clock frequency.
  • A change in the length of a time slice that occurs when using processor power-saving technologies such as SpeedStep.
  • Calculation of the average problem: a one-minute utilization estimate of 80% can hide a short-term burst of 100%.
  • A spin lock keeps the processor busy, but the user process sees no progress in its execution. As a result, the calculated CPU utilization of the process will be 100%, even though the process makes no real progress.
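
The averaging problem from the list above is easy to show with arithmetic. A toy sketch with invented per-second utilization samples:

```python
# Per-second CPU utilization samples over one minute: a 12-second
# saturation burst at 100% inside an otherwise ~75%-busy minute.
samples = [100.0] * 12 + [75.0] * 48

minute_avg = sum(samples) / len(samples)
print(minute_avg)      # 80.0: the one-minute average looks harmless
print(max(samples))    # 100.0: the saturation the average hides
```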

I did not find an article describing a similar calculation for steal (if you know of one, share it in the comments). But judging by the source code, the mechanism is the same as for utilization. A separate counter is simply added in the kernel, specifically for the KVM process (the virtual machine process), which counts how long the KVM process spends waiting for CPU time. The counter takes CPU information from its specification and checks whether all of the processor's ticks are being used by the virtual machine process. If they all are, the processor is considered to have been busy only with the virtual machine process. Otherwise, we report that the processor was doing something else, and steal appeared.
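
On the guest side, this counter shows up as the steal field of the aggregate cpu line in /proc/stat, and tools like top derive the st percentage from the deltas between two snapshots. A sketch of that calculation (the counter values below are invented):

```python
def steal_percent(stat_line_t0, stat_line_t1):
    """Percentage of steal between two snapshots of the aggregate
    'cpu' line from /proc/stat. Field order after the label:
    user nice system idle iowait irq softirq steal [guest guest_nice]."""
    v0 = [int(x) for x in stat_line_t0.split()[1:]]
    v1 = [int(x) for x in stat_line_t1.split()[1:]]
    total = sum(v1) - sum(v0)          # all jiffies elapsed
    steal = v1[7] - v0[7]              # steal jiffies elapsed
    return 100.0 * steal / total

# Two invented snapshots, taken some seconds apart:
t0 = "cpu 100 0 50 800 10 0 5 35 0 0"
t1 = "cpu 180 0 90 1500 15 0 10 105 0 0"
print(round(steal_percent(t0, t1), 2))  # 7.78 (% steal over the interval)
```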

Steal accounting is subject to the same problems as ordinary utilization accounting. Not that such problems appear often, but when they do, they look discouraging.

2.2. Types of virtualization on KVM

Generally speaking, there are three types of virtualization, and all of them are supported by KVM. The mechanism of occurrence of steal may depend on the type of virtualization.

Translation. In this case, the virtual machine's operating system works with the hypervisor's physical devices roughly like this:

  1. The guest operating system sends a command to its guest device.
  2. The guest device driver receives the command, generates a request for the device's BIOS, and sends it to the hypervisor.
  3. The hypervisor process translates a command into a command for a physical device, making it, among other things, more secure.
  4. The driver of the physical device accepts the modified command and sends it to the physical device itself.
  5. The results of executing commands go back along the same path.

The advantage of translation is that it allows you to emulate any device and does not require special preparation of the operating system kernel. But you have to pay for this, first of all, with speed.

Hardware virtualization. In this case, the device at the hardware level understands commands from the operating system. This is the fastest and best way. But, unfortunately, it is not supported by all physical devices, hypervisors and guest operating systems. Currently, the main devices that support hardware virtualization are processors.

Paravirtualization. The most common option for device virtualization on KVM and in general the most common virtualization mode for guest operating systems. Its peculiarity is that work with some hypervisor subsystems (for example, with the network or disk stack) or allocation of memory pages occurs using the hypervisor API, without translation of low-level commands. The disadvantage of this virtualization method is that the guest operating system kernel needs to be modified so that it can communicate with the hypervisor using this API. But this is usually solved by installing special drivers on the guest operating system. In KVM this API is called virtio API.

With paravirtualization, compared to translation, the path to the physical device is significantly reduced by sending commands directly from the virtual machine to the hypervisor process on the host. This allows you to speed up the execution of all instructions inside the virtual machine. In KVM, the virtio API is responsible for this, which only works for certain devices, such as a network or disk adapter. That is why virtio drivers are installed inside virtual machines.

The flip side of this acceleration is that not all processes running inside the virtual machine stay inside it. This creates some special effects that may cause steal to appear. I recommend starting a detailed study of this issue with An API for virtual I/O: virtio.

2.3. "Fair" scheduling

A virtual machine on a hypervisor is, in fact, a regular process that obeys the laws of scheduling (distribution of resources between processes) in the Linux kernel, so let's take a closer look at it.

Linux uses the so-called CFS, the Completely Fair Scheduler, which became the default scheduler in kernel 2.6.23. To understand this algorithm, you can read Linux Kernel Architecture or the source code. The essence of CFS is distributing processor time among processes depending on how long they have already run: the more CPU time a process has consumed, the less it gets next. This guarantees "fair" execution of all processes, so that no single process occupies all the processors all the time and other processes can run too.

Sometimes this paradigm leads to interesting artifacts. Longtime Linux users will surely remember a regular text editor freezing on the desktop while a resource-intensive application such as a compiler was running. This happened because the lightweight tasks of desktop applications competed with tasks that consume resources heavily, such as the compiler. CFS considers the imbalance unfair, so it periodically stops the text editor and lets the processor handle the compiler's tasks. This was corrected with the sched_autogroup mechanism, but many other quirks in how processor time is distributed among tasks remain. Actually, this story is not about how bad everything is in CFS; it is an attempt to draw attention to the fact that the "fair" distribution of processor time is far from a trivial task.
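
The core idea of CFS, always running the task that has accumulated the least runtime, can be sketched in a few lines. This is a toy model only: real CFS uses weighted vruntime, a red-black tree, sleeper credit, and much more:

```python
import heapq

def fair_schedule(tasks, slice_ms, steps):
    """Toy sketch of the CFS idea: always pick the task with the
    smallest accumulated runtime (vruntime), run it for one slice,
    and put it back. No weights, no sleeper credit, fixed slice."""
    heap = [(0, name) for name in tasks]  # (vruntime_ms, name)
    heapq.heapify(heap)
    timeline = []
    for _ in range(steps):
        vruntime, name = heapq.heappop(heap)
        timeline.append(name)
        heapq.heappush(heap, (vruntime + slice_ms, name))
    return timeline

# A CPU hog and an editor end up with strictly alternating slices:
print(fair_schedule(["compiler", "editor"], slice_ms=4, steps=6))
```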

Another important point in the scheduler is preemption. It is needed to evict a process that has hogged the CPU and let others work. Eviction is called context switching: the processor switches contexts. The entire context of the task is saved: the stack state, registers, and so on, after which the process goes off to wait and another takes its place. This is an expensive operation for the OS, used sparingly, but in itself there is nothing wrong with it. Frequent context switching may indicate a problem in the OS, but usually it happens continuously and does not indicate anything in particular.

Such a long story was needed to explain one fact: the more processor resources a process tries to consume under a fair Linux scheduler, the faster it will be stopped so that other processes can also work. Whether that is right or not is a hard question, solved differently under different loads. In Windows, until recently, the scheduler focused on giving priority to desktop applications, which could cause background processes to hang. Sun Solaris had five different scheduler classes. When virtualization was introduced, a sixth was added, the fair share scheduler, because the previous five did not work adequately with Solaris Zones virtualization. I recommend starting a detailed study of this topic with books like Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture or Understanding the Linux Kernel.

2.4. How to monitor steal?

Monitoring steal inside a virtual machine, like any other processor metric, is simple: you can use any CPU metric tool. The main thing is that the virtual machine must run Linux. For some reason, Windows does not provide such information to its users. 🙁

The output of the top command: a breakdown of processor load, with steal in the rightmost column

The difficulty arises when trying to get this information from the hypervisor. You can try to predict steal on the host machine, for example from the Load Average (LA) parameter, the average number of processes waiting in the run queue. The method for calculating this parameter is not simple, but in general, if LA normalized by the number of processor threads is greater than 1, it indicates that the Linux server is overloaded with something.
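
A sketch of that normalization check (the numbers are invented; on a live Linux host you could plug in os.getloadavg() and os.cpu_count()):

```python
import os

def normalized_load(load_1min, cpu_threads):
    """Load average divided by the number of CPU threads. Values above
    1.0 suggest the host is oversubscribed with something, though not
    necessarily CPU: processes may be waiting on I/O or locks."""
    return load_1min / cpu_threads

# On a live Linux host:
#   la1, _, _ = os.getloadavg()
#   print(normalized_load(la1, os.cpu_count()))
print(normalized_load(36.0, 24))  # 1.5: overloaded
print(normalized_load(12.0, 24))  # 0.5: headroom
```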

What are all these processes waiting for? The obvious answer is the processor. But the answer is not entirely correct, because sometimes the processor is free while LA goes off the scale. Remember how LA skyrockets when an NFS mount dies. The same can happen with a disk and with other I/O devices. In fact, processes can be waiting on the release of any lock: physical, related to an I/O device, or logical, such as a mutex. This includes locks at the hardware level (such as waiting for a response from the disk) and logical ones (the so-called locking primitives, which include a bunch of entities: adaptive and spin mutexes, semaphores, condition variables, rw locks, IPC locks...).

Another feature of LA is that it is computed as an average for the whole operating system. For example, 100 processes compete for one file, and LA = 50. Such a large value would seem to indicate that the operating system is in trouble. But for a piece of crookedly written code this can be a normal state: only that code suffers, while the other processes in the operating system are fine.

Because of this averaging (over at least a minute, at that), determining anything from LA is not the most rewarding task, with very uncertain results in specific cases. If you try to dig in, you will find that articles on Wikipedia and other readily available resources describe only the simplest cases, without a deep explanation of the process. Once again, I refer everyone interested to Brendan Gregg; follow the links there. For those too lazy to read English, there is a translation of his popular article about LA.

3. Special effects

Now let's dwell on the main steal cases that we have encountered. I’ll tell you how they follow from all of the above and how they correlate with the indicators on the hypervisor.

Over-utilization. The simplest and most common case: the hypervisor is over-utilized. Indeed, there are many running virtual machines, high processor consumption inside them, heavy competition, and utilization by LA is greater than 1 (normalized by processor threads). Everything inside all the virtual machines slows down. The steal reported by the hypervisor also grows; you need to redistribute the load or turn something off. On the whole, this is all logical and understandable.

Paravirtualization versus a single instance. There is only one virtual machine on the hypervisor; it consumes a small share of it but generates a heavy I/O load, for example to disk. And out of nowhere a small steal appears in it, up to 10% (as several experiments show).

This case is interesting. Here steal appears precisely because of locks at the level of the paravirtualized drivers. An interrupt is generated inside the virtual machine, processed by the driver, and handed off to the hypervisor. While the hypervisor processes the interrupt, from its point of view the virtual machine looks like a process that has issued a request, is ready to run, and is waiting for the processor, yet is given no processor time. The virtual machine thinks this time was stolen.

This happens at the moment a buffer is sent: it goes into the hypervisor's kernel space, and we start waiting for it, although from the virtual machine's point of view it should return immediately. Therefore, by the steal calculation algorithm, this time counts as stolen. Most likely there are other mechanisms at play in this situation (for example, the handling of some other syscalls), but they should not differ much.

The scheduler versus highly loaded virtual machines. When one virtual machine suffers from steal more than others, that is precisely the scheduler's doing. The more a process loads the processor, the sooner the scheduler kicks it out so that the rest can also work. If a virtual machine consumes little, it will hardly see any steal: its process sat and waited honestly, so it gets more time. If a virtual machine puts maximum load on all its cores, it is kicked off the processor more often, and the scheduler tries not to give it much time.

It gets even worse when processes inside the virtual machine try to grab more processor because they cannot keep up with their data processing. Then the operating system on the hypervisor, in its pursuit of fairness, gives it less and less processor time. The process snowballs, and steal shoots through the roof, while other virtual machines may barely notice anything. And the more cores a machine has, the worse off it is when it gets hit. In short, highly loaded virtual machines with many cores suffer the most.

Low LA, but there is steal. If LA is about 0.7 (that is, the hypervisor seems underloaded) but steal is observed inside individual virtual machines:

  • The paravirtualization option already described above. The virtual machine can receive metrics indicating steal even though the hypervisor is doing fine. According to the results of our experiments, this kind of steal does not exceed 10% and should not significantly affect application performance inside the virtual machine.
  • The LA parameter is computed incorrectly. More precisely, at each specific moment it is correct, but when averaged over one minute it turns out to be underestimated. For example, if one virtual machine occupying a third of the hypervisor consumes all of its processors for exactly half a minute, then the one-minute LA on the hypervisor will be 0.15; four such virtual machines running simultaneously will give 0.6. And the fact that for half a minute each of them suffered wild steal of around 25% can no longer be pulled out of the LA figures.
  • Again, because of the scheduler, which decided that someone was consuming too much and made that someone wait while it switched contexts, handled interrupts, and took care of other important system business. As a result, some virtual machines see no problems, while others experience serious performance degradation.
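
The underestimation from averaging in the second bullet can be modeled numerically. The sketch below is a toy version of the kernel's exponentially decaying one-minute load average (the real code uses fixed-point arithmetic and 5-second sampling ticks); all the numbers are invented:

```python
import math

def load_avg_series(active_samples, period_s=60.0, tick_s=5.0):
    """Toy model of the 1-minute load average: an exponentially
    weighted moving average of the runnable-task count, updated
    once per tick."""
    decay = math.exp(-tick_s / period_s)
    la = 0.0
    series = []
    for active in active_samples:
        la = la * decay + active * (1.0 - decay)
        series.append(la)
    return series

# A 4-vCPU machine runs flat out for 30 s (6 samples of 4 runnable
# tasks), then goes idle for 30 s. True peak load: 4.0.
samples = [4.0] * 6 + [0.0] * 6
series = load_avg_series(samples)
print(round(series[5], 2))   # 1.57: right after the burst
print(round(series[-1], 2))  # 0.95: at the end of the minute
```

Even immediately after 30 seconds of full saturation the averaged figure is well under half the true peak, and by the end of the minute it no longer hints at saturation at all.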

4. Other distortions

There are a million more reasons distorting the honest accounting of processor time on a virtual machine. For example, hyperthreading and NUMA add complexity to the calculations. They thoroughly complicate the choice of core for executing a process, because the scheduler uses weighting coefficients that make the accounting at context switches even harder.

There are distortions due to technologies such as turbo boost or, conversely, power-saving mode, which, when utilization is calculated, can artificially raise or lower the frequency or even the time quantum on the server. Enabling turbo boost reduces the performance of one processor thread at the expense of increasing the performance of another. At that moment, information about the current processor frequency is not passed on to the virtual machine, and it believes someone is stealing its time (for example, it requested 2 GHz but received half that).

In general, there can be many causes of distortion. On a particular system you may find something else entirely. It is better to start with the books linked above and with reading statistics from the hypervisor using utilities like perf, sysdig, and systemtap, of which there are dozens.

5. Findings

  1. A certain amount of steal can occur due to paravirtualization, and it can be considered normal. People write on the Internet that this value can be 5-10%. It depends on the applications inside the virtual machine and on the load it puts on its physical devices. It is important here to pay attention to how the applications inside the virtual machines feel.
  2. The load on the hypervisor and the steal inside a virtual machine are not always unambiguously related; both estimates of steal can be wrong in specific situations under different loads.
  3. The scheduler treats processes that ask for a lot poorly: it tries to give less to those who ask for more. Big virtual machines are evil.
  4. A small steal can be the norm even without paravirtualization (taking into account the load inside the virtual machine, the load characteristics of neighbors, load distribution among threads, and other factors).
  5. If you want to understand steal on a particular system, you have to explore the different options, collect metrics, analyze them carefully, and think about how to distribute the load evenly. Deviations are possible in any case, and they must be confirmed experimentally or examined in a kernel debugger.

Source: habr.com
