CPU limits and aggressive throttling in Kubernetes

Translator's note: This cautionary tale from Omio, a European travel aggregator, takes readers from basic theory to the fascinating practical intricacies of Kubernetes configuration. Familiarity with such cases helps not only to broaden one's horizons, but also to prevent non-trivial problems.


Have you ever seen an application get "stuck", stop responding to health checks, and been unable to figure out why? One possible explanation is related to CPU resource quota limits, and that is what this article is about.

TL;DR:
We strongly recommend disabling CPU limits in Kubernetes (or disabling CFS quotas in the kubelet) if you are running a Linux kernel version affected by the CFS quota bug. The kernel contains a serious and well-known bug that leads to excessive throttling and delays.

At Omio, the entire infrastructure is managed by Kubernetes. All of our stateful and stateless workloads run exclusively on Kubernetes (we use Google Kubernetes Engine). Over the last six months, we began to observe random slowdowns: applications would freeze or stop responding to health checks, lose network connectivity, and so on. This behavior baffled us for a long time, and we finally decided to investigate the problem in earnest.

Article outline:

  • A few words about containers and Kubernetes;
  • How CPU requests and limits are implemented;
  • How CPU limit works in multi-core environments;
  • How to track CPU throttling;
  • Problem solving and details.

A few words about containers and Kubernetes

Kubernetes is essentially the modern standard in the world of infrastructure. Its main task is container orchestration.

Containers

In the past, we had to build artifacts like Java JARs/WARs, Python Eggs, or executables to run on servers. However, to make them work, you had to do extra work: install the runtime (Java/Python), put the necessary files in the right places, ensure compatibility with a particular version of the operating system, and so on. In other words, you had to pay close attention to configuration management (which often caused contention between developers and system administrators).

Containers have changed everything. Now the container image is the artifact. It can be represented as a kind of extended executable file containing not only the program, but also a complete runtime environment (Java / Python / ...), as well as the necessary files / packages, pre-installed and ready to run. Containers can be deployed and run on different servers without any additional steps.

In addition, containers run in their own sandboxed environment. They have their own virtual network adapter, their own file system with limited access, their own process hierarchy, their own CPU and memory restrictions, and so on. All this is implemented thanks to special Linux kernel subsystems: namespaces (for isolation) and cgroups (for resource restrictions, discussed below).
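As a small illustration of this isolation, here is a sketch using the standard unshare utility (it assumes a Linux host with util-linux and root privileges, and is not specific to any container runtime):

    # Run ps in its own PID namespace with a private /proc mount:
    # inside, ps sees only itself instead of all host processes.
    sudo unshare --pid --fork --mount-proc ps aux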

Kubernetes

As mentioned earlier, Kubernetes is a container orchestrator. It works like this: you provide it with a pool of machines, and then you say, "Hey Kubernetes, run ten instances of my container with 2 processors and 3 GB of memory each, and keep them running!". Kubernetes will take care of the rest. It will find free capacities, launch containers and restart them if necessary, roll out an update when changing versions, etc. Essentially, Kubernetes abstracts away the hardware and makes a variety of systems usable for deploying and running applications.

[Image: Kubernetes from the layman's point of view]

What are requests and limits in Kubernetes

Okay, we figured out containers and Kubernetes. We also know that multiple containers can reside on the same machine.

You can draw an analogy with a communal apartment. A spacious apartment (the machines/nodes) is rented out to several tenants (the containers). Kubernetes acts as the realtor. The question is how to keep the tenants from conflicting with each other. What if one of them, say, decides to occupy the bathroom for half a day?

This is where requests and limits come into play. A CPU request is needed only for scheduling purposes. It is something like the container's "wish list" and is used to select the most suitable node. The CPU limit, in turn, can be compared to a lease agreement: as soon as a node has been selected for the container, it cannot go beyond the limit. And this is where the problem begins...
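To make this concrete, here is a minimal sketch of a pod spec with a CPU request and limit (the pod name, image, and values are purely illustrative):

    # Apply a pod whose container asks the scheduler for 1 CPU but is capped
    # at 2 CPUs at runtime (the limit is what triggers CFS throttling).
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: cpu-demo            # illustrative name
    spec:
      containers:
      - name: app
        image: nginx            # placeholder image
        resources:
          requests:
            cpu: "1"            # "wish list": used only to pick a node
          limits:
            cpu: "2"            # "lease agreement": enforced via CFS quota
    EOF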

How requests and limits are implemented in Kubernetes

Kubernetes uses the throttling mechanism built into the kernel (skipping CPU cycles) to implement CPU limits. If an application exceeds its limit, throttling kicks in (i.e., it receives fewer CPU cycles). Memory requests and limits are implemented differently, so they are easier to detect: it is enough to check whether the pod's last restart status is "OOMKilled". CPU throttling is not so easy to spot, since K8s only makes metrics available by usage, not by cgroups.
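By contrast, checking for a memory kill is a one-liner (a sketch; the pod name is a placeholder):

    # Prints "OOMKilled" if the first container's last restart was caused by
    # the out-of-memory killer; CPU throttling has no such obvious marker.
    kubectl get pod my-pod \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'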

CPU Request

[Diagram: how the CPU request is implemented]

For the sake of simplicity, let's look at the process using the example of a machine with a 4-core CPU.

K8s uses the cgroups mechanism to manage resource allocation (memory and CPU). Cgroups are hierarchical: a child inherits the limits of its parent group. The allocation details are stored in a virtual file system (/sys/fs/cgroup). In the case of the CPU, this is /sys/fs/cgroup/cpu,cpuacct/*.

K8s uses the cpu.shares file to allocate processor resources. In our case, the root cgroup receives 4096 CPU shares, i.e. 100% of the available processor power (1 core = 1024; this is a fixed value). The root group distributes resources proportionally to the shares of its descendants specified in cpu.shares, and those, in turn, do the same with their descendants, and so on. On a typical Kubernetes host, the root cgroup has three children: system.slice, user.slice, and kubepods. The first two subgroups are used to allocate resources between critical system workloads and user programs outside of K8s. The last one, kubepods, is created by Kubernetes to distribute resources among pods.

The diagram above shows that the first and second subgroups received 1024 shares each, while the kubepods subgroup was allocated 4096 shares. How is this possible: after all, the root group has access to only 4096 shares, and the sum of its descendants' shares significantly exceeds this number (6144)? The point is that share values are relative, and the Linux scheduler (CFS) uses them to allocate CPU resources proportionally. In our case, the first two groups receive 680 real shares each (16.6% of 4096), and kubepods gets the remaining 2736 shares. When idle, the first two groups will not use their allocated resources.
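These shares can be inspected directly on a node. A cgroup v1 sketch; the exact paths depend on the cgroup driver (for example, they may look like kubepods.slice on systemd-managed nodes):

    # Relative CPU weights of the top-level groups on the node.
    cat /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.shares   # typically 1024
    cat /sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.shares     # typically 1024
    cat /sys/fs/cgroup/cpu,cpuacct/kubepods/cpu.shares       # roughly 1024 * allocatable cores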

Fortunately, the scheduler has a mechanism to avoid wasting unused CPU resources. It transfers "idle" capacity to the global pool, from which it is distributed to groups that need additional processor power (the transfer occurs in batches to avoid rounding losses). The same method applies to all descendants of descendants.

This mechanism ensures a fair distribution of processor power and prevents any process from "stealing" resources from others.

CPU limit

Although limit and request configuration look similar in K8s, their implementation is fundamentally different: this is the most misleading and least documented part.

K8s uses the CFS quota mechanism to enforce limits. Its settings are specified in the cfs_period_us and cfs_quota_us files in the cgroup directory (alongside the cpu.shares file).

Unlike cpu.shares, the quota is based on a period of time, not on available processor power. cfs_period_us specifies the duration of the period (epoch): it is always 100000 µs (100 ms). There is an option in K8s to change this value, but for now it is only available in alpha. The scheduler uses the epoch to reset used quotas. The second file, cfs_quota_us, specifies the available time (quota) in each epoch. Note that it is also specified in microseconds. The quota may exceed the length of the epoch; in other words, it may be greater than 100 ms.
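Inside a container these settings can be read directly (again a cgroup v1 sketch; the path may differ on your nodes):

    # CFS period and quota for the current container's cgroup.
    cat /sys/fs/cgroup/cpu/cpu.cfs_period_us   # 100000 us = 100 ms per epoch
    cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us    # e.g. a limit of "2" CPUs -> 200000 us; -1 means no limit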

Let's look at two scenarios on 16-core machines (the most common type of machine we have at Omio):

[Diagram: Scenario 1: 2 threads and a 200 ms limit. No throttling]

[Diagram: Scenario 2: 10 threads and a 200 ms limit. Throttling starts after 20 ms; access to CPU resources resumes after another 80 ms]

Let's say you set the CPU limit to 2 cores; Kubernetes will translate this value into 200 ms. This means that the container can use a maximum of 200 ms of CPU time per period without being throttled.

And here the fun begins. As mentioned above, the available quota is 200 ms. If ten threads are working in parallel on a 12-core machine (see the illustration for scenario 2) while all other pods are idle, the quota will be exhausted in just 20 ms (since 10 * 20 ms = 200 ms), and all the threads of this pod will be throttled for the next 80 ms. The situation is aggravated by the scheduler bug already mentioned, because of which excessive throttling occurs and the container cannot even use up its existing quota.
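A quick back-of-the-envelope check of these numbers (plain arithmetic, nothing Kubernetes-specific):

    # 10 busy threads burn a 200 ms quota in 200/10 = 20 ms of wall-clock time,
    # then sit throttled for the remaining 80 ms of the 100 ms period.
    quota_ms=200; threads=10; period_ms=100
    busy_ms=$(( quota_ms / threads ))
    echo "quota gone after ${busy_ms} ms; throttled for $(( period_ms - busy_ms )) ms"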

How to evaluate throttling in pods?

Just log in to the pod and run cat /sys/fs/cgroup/cpu/cpu.stat. The fields are as follows (a small sketch that turns these counters into a percentage follows the list):

  • nr_periods is the total number of scheduler periods;
  • nr_throttled is the number of throttled periods out of nr_periods;
  • throttled_time is the cumulative throttled time in nanoseconds.
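A minimal sketch that computes the throttling percentage from these counters (run inside the container; assumes cgroup v1):

    # Share of scheduler periods in which this cgroup was throttled.
    awk '/^nr_periods/   { p = $2 }
         /^nr_throttled/ { t = $2 }
         END { if (p > 0) printf "throttled in %.1f%% of %d periods\n", 100 * t / p, p }' \
      /sys/fs/cgroup/cpu/cpu.stat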


What is actually happening?

As a result, we get high throttling in all applications. Sometimes it is one and a half times higher than expected!

This leads to various errors: failed readiness probes, container freezes, dropped network connections, timeouts inside service calls. Ultimately, this translates into higher latency and higher error rates.

Solution and consequences

Everything is simple here. We abandoned CPU limits and started updating the OS kernel in our clusters to the latest version, in which the bug was fixed. The number of errors (HTTP 5xx) in our services immediately dropped significantly:

HTTP 5xx errors

[Graph: HTTP 5xx errors for one critical service]

p95 response time

[Graph: critical service request latency, 95th percentile]

Operating costs

[Graph: number of instance hours spent]
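For reference, the two knobs involved look roughly like this. This is a sketch rather than a drop-in recipe: the deployment name is a placeholder, and the kubelet flag, while it exists upstream, is not directly configurable on every managed platform (GKE included).

    # Option 1: keep the CPU request for scheduling, but drop the limit from
    # the pod spec so no CFS quota is written for the container at all.
    kubectl patch deployment my-service --type=json \
      -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits/cpu"}]'

    # Option 2: disable CFS quota enforcement for the whole node via the kubelet
    # (on self-managed nodes this normally goes into the kubelet's configuration
    # or systemd unit rather than being run by hand).
    kubelet --cpu-cfs-quota=false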

What's the catch?

As stated at the beginning of the article:

You can draw an analogy with a communal apartment ... Kubernetes acts as the realtor. But how do you keep the tenants from conflicting with each other? What if one of them, say, decides to occupy the bathroom for half a day?

Here's the catch. One careless container can eat up all the available CPU resources on the machine. If you have a smart application stack (for example, the JVM, Go runtime, or Node VM is properly configured), this is not a problem: you can run in such conditions for a long time. But if applications are poorly optimized or not optimized at all (FROM java:latest), the situation can get out of control. At Omio we have automated base Dockerfiles with sensible defaults for our core language stacks, so this problem did not arise.
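For example, on the JVM the number of CPUs the runtime believes it has can be capped explicitly even when no cgroup limit is set (the flag exists in modern JDKs; the jar name is a placeholder):

    # Tell the JVM to size its thread pools and GC as if only 2 CPUs were
    # available, instead of sizing them for the whole node.
    java -XX:ActiveProcessorCount=2 -jar app.jar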

We recommend monitoring the USE metrics (utilization, saturation, and errors), API latencies, and error rates. Make sure the results match your expectations.

References

This is our story. The following materials helped us greatly in understanding what was going on:

Kubernetes bug reports:

Have you encountered similar problems in your practice or have experience related to throttling in containerized production environments? Share your story in the comments!


Source: habr.com
