Autoscaling and resource management in Kubernetes (overview and video report)

On April 27, the talk "Autoscaling and resource management in Kubernetes" was given in the DevOps section of the Strike 2019 conference. It explains how to use K8s to ensure high availability of applications and get maximum performance out of them.


By tradition, we are pleased to present the video of the talk (44 minutes, much more informative than the article) along with its main points in text form. Let's go!

Let's break the talk's title down word by word, starting from the end.

Kubernetes

Suppose we have Docker containers on a host. What for? To ensure repeatability and isolation, which in turn makes deployment and CI/CD simple and reliable. And we have many such machines running containers.

What does Kubernetes give us in this case?

  1. We stop thinking about individual machines and start working with a "cloud": a cluster of containers, or pods (groups of containers).
  2. Moreover, we do not even think about individual pods, but manage even larger groups. Such high-level primitives let us say: here is a template for running a certain workload, and here is the required number of instances to run. If we later change the template, all instances change with it.
  3. With a declarative API, instead of executing a sequence of specific commands we describe the "desired state of the world" (in YAML), and Kubernetes creates it. And again: when the description changes, the actual state changes with it (a minimal sketch of such a description follows below).
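
As a minimal illustration (the names and image are placeholders, not from the talk), such a declarative description could be a Deployment: a pod template plus the desired number of instances.

```yaml
# A minimal Deployment sketch: the pod template plus the desired number
# of instances (replicas). Names and the image are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3              # "the right number of instances"
  selector:
    matchLabels:
      app: my-app
  template:                # "the template to run a certain workload"
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: nginx:1.21  # placeholder image
```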

Resource management

CPU

Suppose we run nginx, php-fpm and mysql on a server. Each of these services actually has several running processes, and each process requires computing resources:

(the numbers on the slide are "parrots": arbitrary units expressing each process's need for computing power)

To work with this conveniently, it is logical to combine processes into groups (for example, all nginx processes into one "nginx" group). A simple and obvious way to do this is to put each group into its own container:


To continue, we need to recall what a container is (in Linux). Containers became possible thanks to three key kernel features that were implemented long ago: capabilities, namespaces and cgroups. Other technologies (including convenient "wrappers" like Docker) contributed to their further development:


In the context of this talk, we are only interested in cgroups, because control groups are the part of container functionality (Docker, etc.) that implements resource management. Processes combined into groups, just as we wanted, are what control groups are.

Let's go back to the CPU needs of these processes, but now for groups of processes:

(I repeat that all numbers are an abstract expression of the need for resources)

Meanwhile, the CPU itself is a finite resource (1000 in our example), and there may not be enough of it for everyone (the sum of all groups' needs is 150+850+460=1460). What happens in this case?

The kernel distributes the resources and does it "fairly", giving each group the same amount (1000/3 ≈ 333). For the first group this is more than it needs (333 > 150), so the excess (333-150=183) goes into a reserve, which is in turn split equally between the other two containers (roughly 91 extra each):


As a result, the first container has enough resources, while the second and the third do not. This is how the "fair" scheduler in Linux, CFS, behaves. Its operation can be adjusted by assigning a weight to each of the containers. For example, like this:


Let's look at the case of a resource shortage in the second container (php-fpm). A container's resources are distributed equally among its processes. As a result, the master process works fine, while all the workers slow down, receiving less than half of what they need:


This is how the CFS scheduler works. The weights that we assign to containers will later be called requests. Why that is so will become clear below.

Now let's look at the whole situation from the other side. As you know, all roads lead to Rome, and in the case of a computer, to the CPU. One CPU and many tasks means you need a traffic light. The simplest way to manage resources is exactly that, a "traffic light": one process is given a fixed slice of time to access the CPU, then the next one, and so on.


This approach is called hard quoting (hard limits). Let's simply remember it as limits. However, if we hand out limits to all containers, a problem arises: mysql was driving along the road, at some point its need for the CPU ended, yet all the other processes have to wait while the CPU sits idle.


Let's return to the Linux kernel and its interaction with the CPU. The overall picture is as follows:


A cgroup has two settings, essentially two simple "knobs", that let you define:

  1. the weight of the container (requests), which is shares;
  2. the percentage of total CPU time that can be spent on the container's tasks (limits), which is quota (see the note below).
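
As a rough illustration (in cgroups v1 terms, an aside not taken from the slides): the weight is exposed as cpu.shares, while the hard limit comes from CFS bandwidth control, so the limit expressed in cores ≈ cpu.cfs_quota_us / cpu.cfs_period_us.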

How to measure CPU?

There are different ways:

  1. Parrots: nobody knows what they are, so you have to agree on their meaning every time.
  2. Percentages: clearer, but relative: 50% of a 4-core server and 50% of a 20-core server are completely different things.
  3. You can use the already mentioned weights, which Linux understands, but they are relative too.
  4. The most adequate option is to measure computing resources in seconds, i.e. in seconds of processor time relative to seconds of real time: 1 second of processor time per 1 real second is one whole CPU core.

To make it even simpler to talk about, people started measuring directly in cores, meaning the same CPU time relative to real time. Since Linux understands weights rather than CPU time/cores, a mechanism was needed to translate one into the other.

Consider a simple example: a server with 3 CPU cores, where three pods are given weights (500, 1000 and 1500) that easily translate into their respective shares of the allocated cores (0.5, 1 and 1.5).


If we take a second server with twice as many cores (6) and place the same pods there, the distribution of cores is easy to calculate by simply multiplying by 2 (1, 2 and 3 cores, respectively). The important moment comes when a fourth pod appears on this server, with a weight of, say, 3000. It takes away part of the CPU resources (half the cores), and the shares of the remaining pods are recalculated (halved):
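
In general (a clarifying note, not from the slides), each pod's share is weight_i / Σ weights × total cores. On the 6-core server with weights 500, 1000, 1500 and 3000 that gives 3000/6000 × 6 = 3 cores for the new pod and 0.5, 1 and 1.5 cores for the others, i.e. exactly half of their previous shares.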


Kubernetes and CPU Resources

In Kubernetes, CPU resources are usually measured in millicores, i.e. 0.001 of a core is taken as the base unit of weight. (Roughly the same thing is called a CPU share in Linux/cgroups terminology, although, more precisely, 1000 millicores = 1024 CPU shares.) K8s makes sure not to place more pods on a server than there are CPU resources for the sum of the weights of all its pods.

How does this happen? When a server is added to a Kubernetes cluster, it reports how many CPU cores it has available. When a new pod is being created, the Kubernetes scheduler knows how many cores that pod needs, so the pod is scheduled onto a server that has enough cores.
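
As a minimal sketch (the pod name and image are illustrative), a CPU request in millicores looks like this; under the hood it becomes the cgroup weight described above (roughly, 500m turns into about 512 cpu.shares):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-request-demo   # illustrative name
spec:
  containers:
  - name: app
    image: nginx:1.21      # placeholder image
    resources:
      requests:
        cpu: 500m          # half a core; roughly 512 cpu.shares in the container's cgroup
```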

What happens if no request is specified (i.e. the number of cores the pod needs is not defined)? Let's take a look at how Kubernetes counts resources in general.

A pod can have both requests (handled by the CFS scheduler) and limits (remember the traffic light?):

  • If they are equal, the pod is assigned the Guaranteed QoS class: this number of cores is guaranteed to always be available to it.
  • If the request is less than the limit, the QoS class is Burstable. That is, we expect the pod to use, say, 1 core all the time, but this value is not a ceiling for it: sometimes the pod can use more (when the server has free resources for that).
  • There is also the BestEffort QoS class: it includes exactly those pods for which no request is specified. Resources are given to them last. (Sketches of all three classes follow below.)
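
A minimal sketch of how the three classes come about (names and images are illustrative; the QoS class is derived by Kubernetes from the resources section, not set by hand):

```yaml
# Guaranteed: requests == limits for every resource of every container
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
spec:
  containers:
  - name: app
    image: nginx:1.21
    resources:
      requests: { cpu: "1", memory: 512Mi }
      limits:   { cpu: "1", memory: 512Mi }
---
# Burstable: requests are set, but lower than the limits
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
spec:
  containers:
  - name: app
    image: nginx:1.21
    resources:
      requests: { cpu: "1", memory: 256Mi }
      limits:   { cpu: "2", memory: 512Mi }
---
# BestEffort: no requests and no limits at all
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
spec:
  containers:
  - name: app
    image: nginx:1.21
```

The class Kubernetes has assigned can be checked with kubectl get pod qos-guaranteed -o jsonpath='{.status.qosClass}'.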

Memory

With memory the situation is similar, yet slightly different, since the nature of this resource is different. In general, the analogy is as follows:


Let's see how requests are implemented for memory. Pods live on a server, varying the amount of memory they consume, until one of them grows so large that the server runs out of memory. Then the OOM killer appears and kills the largest process:


This does not always suit us, so it is possible to specify which processes are important to us and should not be killed. For this, the oom_score_adj parameter is used.

Let's return to the QoS classes we saw for CPU and draw an analogy with the oom_score_adj values that determine memory priorities for pods:

  • The lowest oom_score_adj value for a pod is -998, which means such a pod should be killed only as a very last resort; this is Guaranteed.
  • The highest value, 1000, is BestEffort; such pods are killed first.
  • For the remaining values (Burstable) there is a formula, the essence of which is that the more memory a pod has requested, the less likely it is to be killed (an approximate form of the formula is given below).
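
Roughly (an approximation added here for clarity, not taken from the slides), the kubelet computes it as oom_score_adj ≈ 1000 − 1000 × memory_request / node_memory_capacity, clamped to the range [2, 999]: the larger the request, the lower the score and the lower the chance of being killed.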


The second "twist" - limit_in_bytes - for limits. With it, everything is simpler: we simply assign the maximum amount of memory to be issued, and here (unlike the CPU) there is no question of what it (memory) should be measured in.

Summing up

Each pod in Kubernetes is given requests and limits, both parameters for CPU and for memory:

  1. requests are what the Kubernetes scheduler uses to distribute pods among servers;
  2. all of these parameters together determine the pod's QoS class;
  3. relative weights are calculated based on CPU requests;
  4. the CFS scheduler is configured based on CPU requests;
  5. the OOM killer is configured based on memory requests;
  6. the "traffic light" is configured based on CPU limits;
  7. the memory limit on the cgroup is set based on memory limits.


In general, this picture answers the question of how the main part of resource management in Kubernetes works.

Autoscaling

K8s cluster-autoscaler

Imagine that the whole cluster is already busy and a new pod needs to be created. While the pod cannot be scheduled, it hangs in the Pending status. For it to appear, we can connect a new server to the cluster ourselves, or... install cluster-autoscaler, which will do it for us: it orders a virtual machine from the cloud provider (via an API request) and joins it to the cluster, after which the pod is scheduled.


This is Kubernetes cluster autoscaling, and it works great (in our experience). However, as everywhere, there are nuances here…

While the cluster was growing, everything was fine, but what happens when the cluster starts to free up? The problem is that migrating pods (to other hosts) is technically very difficult and expensive in terms of resources. Kubernetes takes a completely different approach.

Consider a cluster of 3 servers with a Deployment on it. The Deployment has 6 pods: 2 on each server. For some reason, we want to shut down one of the servers. To do this, we use the kubectl drain command, which:

  • forbids scheduling new pods onto this server;
  • evicts the existing pods from this server.

Since Kubernetes keeps track of the number of pods (6), it will simply recreate them on other nodes, but not on the one being shut down, since that one is already marked as unavailable for new pods. This is a fundamental mechanic of Kubernetes.


However, there is a nuance here as well. In a similar situation with a StatefulSet (instead of a Deployment), the behavior will be different. Now we have a stateful application: for example, three pods with MongoDB, one of which has some kind of problem (the data got corrupted, or another error prevents the pod from starting correctly). Once again we decide to shut down one of the servers. What will happen?


MongoDB could die, because it needs a quorum: in a cluster of three installations, at least two must be operational. However, this does not happen, thanks to PodDisruptionBudget. This object defines the minimum required number of running pods. Knowing that one of the MongoDB pods is no longer working, and seeing that MongoDB's PodDisruptionBudget is set to minAvailable: 2, Kubernetes will not allow the pod to be deleted.

Bottom line: for the movement (or rather, re-creation) of pods to work correctly when the cluster shrinks, you need to configure a PodDisruptionBudget. A minimal example is shown below.
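
A minimal sketch for the MongoDB case above (the names and the label selector are assumptions to be adapted to a real setup; the API version is policy/v1 in current Kubernetes, policy/v1beta1 at the time of the talk):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mongodb-pdb        # illustrative name
spec:
  minAvailable: 2          # at least 2 of the 3 MongoDB pods must stay running
  selector:
    matchLabels:
      app: mongodb         # assumed label on the MongoDB pods
```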

Horizontal scaling

Let's consider another situation. There is an application running as a Deployment in Kubernetes. User traffic comes to its pods (say there are three of them), and we measure a certain metric in them (say, CPU load). When the load grows, we notice it on the graph and increase the number of pods to spread the requests.

Today in Kubernetes this does not need to be done manually: an automatic increase/decrease in the number of pods can be configured based on the values of the measured load metrics.
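
A minimal sketch of such a setup, a HorizontalPodAutoscaler driven by average CPU utilization (the target Deployment name and the thresholds are assumptions; the API version is autoscaling/v2 in current Kubernetes, autoscaling/v2beta2 back then):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa              # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                # the Deployment being scaled (assumed name)
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # add pods when average CPU usage exceeds ~70% of requests
```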


The main questions here are what to measure and how to interpret the obtained values (in order to make a decision about changing the number of pods). A lot of things can be measured.


How to do this technically (collecting metrics and so on) is something I covered in detail in my talk on Monitoring and Kubernetes. And the main advice for choosing the optimal parameters is: experiment!

There is the USE method (Utilization, Saturation and Errors), whose idea is as follows. What does it make sense to scale php-fpm on, for example? On the fact that its workers are running out: that is utilization. And if the workers have run out and new connections are being rejected, that is already saturation. Both of these parameters must be measured, and scaling should be performed depending on their values.

Instead of a conclusion

The talk has a sequel about vertical scaling and how to choose the right resources. I will cover it in future videos on our YouTube channel. Subscribe so you don't miss out!

Videos and slides

Video of the talk (44 minutes):

Slides of the talk:

PS

Other talks about Kubernetes in our blog:

Source: habr.com
