Kubernetes: Speed up your services by removing CPU limits

Back in 2016 we at Buffer switched to Kubernetes, and our k8s cluster has been running ever since: today it spans about 60 nodes (on AWS) and roughly 1500 containers. We moved to microservices by trial and error, though, and even after several years with k8s we still run into new problems. In this post we'll talk about CPU limits: why we considered them good practice and why they turned out not to be such a good idea.

CPU limits and throttling

Like many other Kubernetes operators, Google strongly recommends setting CPU limits. Without them, containers on a node can consume all available CPU, which in turn can cause important Kubernetes processes (such as the kubelet) to stop responding to requests. Setting CPU limits is therefore a good way to protect your nodes.
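For reference, a CPU limit is set per container in the pod spec. A minimal sketch (the pod name and image below are illustrative, not from our cluster):

```yaml
# Illustrative pod spec: the "limits" block is what triggers CFS quota
# enforcement; "requests" only affects scheduling decisions.
apiVersion: v1
kind: Pod
metadata:
  name: example-api          # hypothetical name
spec:
  containers:
    - name: api
      image: example/api:1.0 # hypothetical image
      resources:
        requests:
          cpu: 100m
        limits:
          cpu: 800m          # CFS-throttled past 80ms of CPU per 100ms period
          memory: 512Mi
```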

CPU limits set the maximum CPU time a container may use within a given period (100 ms by default), and the container will never exceed that allowance. To throttle a container and keep it below its limit, Kubernetes relies on a special mechanism, the CFS quota. In the end, however, these artificial CPU limits reduce performance and increase the response time of your containers.
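As a rough sketch of the arithmetic (the 100 ms period matches the CFS default; the function name is ours, not a Kubernetes API):

```python
# CFS enforces a CPU limit as a runtime quota per scheduling period.
# Kubernetes translates a CPU limit into cpu.cfs_quota_us, using the
# default period of 100 ms (100_000 microseconds).

CFS_PERIOD_US = 100_000  # default cpu.cfs_period_us

def cfs_quota_us(cpu_limit_cores: float) -> int:
    """Microseconds of CPU time the cgroup may use per period."""
    return int(cpu_limit_cores * CFS_PERIOD_US)

# A limit of 800m (0.8 core) allows 80 ms of CPU time per 100 ms period;
# once the quota is spent, the container is throttled until the next period.
print(cfs_quota_us(0.8))  # 80000
print(cfs_quota_us(2.5))  # 250000 (limits above 1.0 span several cores)
```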

What can happen if we don't set CPU limits?

Unfortunately, we had to face this problem ourselves. Each node runs a process responsible for managing containers, the kubelet, and it stopped responding to requests. When this happens, the node goes into the NotReady state, its containers get rescheduled somewhere else, and they create the same problems on the new nodes. Not an ideal scenario, to say the least.

How throttling shows up in response times

The key metric to track for a container is throttling; it shows how many times your container has been throttled. We noticed, with some surprise, throttling in some containers regardless of whether CPU load was anywhere near its maximum. For example, take one of our main APIs:


As you can see below, we set the limit to 800m (0.8 core, or 80% of one core), while peak usage reaches at most 200m (20% of a core). It would seem we still have plenty of CPU headroom before the service gets throttled. However...

You may have noticed that even when CPU load is well below the specified limit, throttling still kicks in.
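You can observe this at the cgroup level too: the kernel exposes CFS counters in cpu.stat (under cgroup v1, /sys/fs/cgroup/cpu/cpu.stat inside the container). A minimal sketch of reading them, parsing a made-up sample of that file so it runs anywhere:

```shell
# Sample contents of a container's cpu.stat (the numbers are invented):
# nr_periods   - CFS periods the cgroup was runnable in
# nr_throttled - periods in which the cgroup hit its quota and was throttled
cpu_stat='nr_periods 1000
nr_throttled 150
throttled_time 4200000000'

# Percentage of periods in which the container was throttled.
throttle_pct=$(printf '%s\n' "$cpu_stat" | awk '
  /^nr_periods/   { periods = $2 }
  /^nr_throttled/ { throttled = $2 }
  END { printf "%.1f", 100 * throttled / periods }')

echo "container throttled in ${throttle_pct}% of CFS periods"
```

On a real node you would replace the sample string with the actual file; a persistently non-zero percentage at low CPU usage is exactly the symptom described above.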

Faced with this, we soon discovered several resources (an issue on GitHub, a presentation from Zalando, a post from Omio) about services losing performance and response time because of throttling.

Why do we see throttling at low CPU load? The short version is: "there is a bug in the Linux kernel that causes unnecessary throttling of containers with CPU limits set." If you are interested in the nature of the problem, you can look at Dave Chiluk's presentation (available in video and text form).

Removing CPU limits (with extreme caution)

After lengthy discussions, we decided to remove CPU limits from all services that directly or indirectly affect functionality critical to our users.

The decision was not an easy one, as we value the stability of our cluster highly. We had experienced cluster instability in the past, when services consumed too many resources and slowed down everything on their node. This time things were different: we had a clear understanding of what we expected from our clusters, as well as a good strategy for rolling out the planned changes.

Internal discussion of the pressing issue (screenshot).

How do you protect your nodes when limits are removed?

Isolate "unlimited" services:

In the past, we had already seen some nodes go into the NotReady state, primarily because of services that consumed too many resources.

We decided to place such services on separate ("tainted") nodes so they would not interfere with the limited services. By adding taints to some nodes and the corresponding "toleration" to the unlimited services, we gained more control over the cluster, and it became easier for us to identify problems with nodes. To set up something similar yourself, refer to the documentation.
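A sketch of that setup, with placeholder names (the taint key, label, and pod name below are ours, not the actual ones from our cluster):

```yaml
# 1. Taint the dedicated nodes, e.g.:
#      kubectl taint nodes node-1 dedicated=no-cpu-limits:NoSchedule
# 2. Give only the unlimited services a matching toleration so the
#    taint keeps everything else off those nodes.
apiVersion: v1
kind: Pod
metadata:
  name: unlimited-service      # hypothetical name
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "no-cpu-limits"
      effect: "NoSchedule"
  # Optionally pin the pod to those nodes as well (assumes the nodes
  # also carry this label):
  nodeSelector:
    dedicated: no-cpu-limits
```

Note that a toleration only allows scheduling onto tainted nodes; the nodeSelector is what actually keeps the pod there.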


Assigning the correct CPU and memory requests:

What we feared most was that a process would gobble up too many resources and its node would stop responding. Since we could now (thanks to Datadog) clearly observe every service in our cluster, I analyzed several months of operation of the ones we planned to mark as "unlimited". I simply set the CPU request to the maximum observed usage plus a 20% margin, reserving room on the node in case k8s tried to schedule other services onto it.


As you can see on the graph, peak CPU usage reached 242m (0.242 of a core). For the CPU request, it is enough to take a value slightly above this. Note that since these services are user-facing, load peaks coincide with traffic.
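Applied to a service like this, the resources block comes out to something like the following (the 300m figure is just the observed 242m peak rounded up with headroom; the memory value is illustrative):

```yaml
# Requests reserve capacity for the scheduler; with no "limits" section
# there is no CFS quota and therefore no throttling.
resources:
  requests:
    cpu: 300m       # slightly above the observed 242m peak
    memory: 256Mi   # derived the same way from observed memory usage
  # no "limits": the container can burst into a node's idle CPU
```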

Do the same with memory usage and requests, and voila, you're all set! For extra safety you can add horizontal pod autoscaling: whenever resource load is high, autoscaling will create new pods and Kubernetes will schedule them onto nodes with free capacity. If the cluster itself runs out of capacity, you can set up an alert or add new nodes via node autoscaling.
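A minimal HorizontalPodAutoscaler along these lines (names and thresholds are illustrative; note that with no limit set, CPU utilization is measured against the request):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: unlimited-service-hpa     # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: unlimited-service       # hypothetical target
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80  # % of the CPU request, since there is no limit
```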

On the downside, we lost some "container density", i.e. the number of containers running on a single node. We may also end up with a lot of slack at low traffic, and there is a chance of hitting high CPU load, but node autoscaling should help with the latter.

Results

I'm happy to report excellent results from the experiments of the last few weeks: we have already seen significant response-time improvements across all modified services:


We achieved the best result on our main page (buffer.com), where the service became twenty-two times faster!


Has the Linux kernel bug been fixed?

Yes, the bug has been fixed, and the fix has shipped in kernels of version 4.19 and later.

However, reading the Kubernetes issues on GitHub as of September 2, 2020, we still see mentions of some Linux projects with a similar bug. I believe some Linux distributions still carry this bug and are working on a fix.

If your distribution runs a kernel version below 4.19, I would recommend upgrading to the latest one, but in any case you should try removing CPU limits and see whether the throttling persists. Below is a partial list of managed Kubernetes services and Linux distributions:

Has the fix solved the throttling problem?

I'm not sure the problem is completely solved. Once we get onto a kernel version with the fix, I will test the cluster and update this post. If you have already upgraded, I'd be interested to read about your results.

Conclusion

  • If you run Docker containers on Linux (under Kubernetes, Mesos, Swarm or anything else), your containers may be losing performance to throttling;
  • Try updating your distribution to the latest version, in the hope that the bug has already been fixed;
  • Removing CPU limits solves the problem, but it is a risky technique that should be used with extreme caution (it is better to update the kernel first and compare the results);
  • If you have removed CPU limits, keep a close eye on CPU and memory usage, and make sure your CPU requests exceed actual consumption;
  • A safer option is horizontal pod autoscaling: when load is high, it creates new pods and Kubernetes schedules them onto nodes with free capacity.

I hope this post helps you improve the performance of your container systems.

P.S. Here the author corresponds with readers and commenters (in English).


Source: habr.com
