Kubernetes: why is it so important to set up system resource management?
As a rule, any application needs a dedicated pool of resources for correct and stable operation. But what if several applications run on the same capacity at once? How do you provide each of them with the minimum necessary resources? How can resource consumption be limited? How do you correctly distribute the load between nodes? How do you make sure the horizontal scaling mechanism kicks in when the application load grows?
Let's start with the main types of resources that exist in the system: these are, of course, CPU time and RAM. In k8s manifests these resources are measured in the following units:
CPU - in cores (fractional values can be written in millicores: 1000m = 1 core)
RAM - in bytes (suffixes such as Mi and Gi are also accepted)
Moreover, for each resource it is possible to set two types of requirements: requests and limits. Requests describes the minimum free resources a node must have to run a container (and the pod in general), while limits sets a hard cap on the resources available to the container.
It is important to understand that both types do not have to be defined explicitly in the manifest; the behavior is then as follows:
If only limits are explicitly set for a resource, then requests for that resource is automatically set equal to limits (you can verify this by calling describe on the entity). That is, the container's operation will in fact be limited to the same amount of resources it requires to run.
If only requests are explicitly set for a resource, then no upper restriction is placed on this resource: the container is limited only by the resources of the node itself.
It is also possible to configure resource management not only at the level of a specific container, but also at the namespace level using the following entities:
LimitRange - describes the restriction policy at the container/pod level within a namespace. It is needed to set default restrictions on containers/pods, to prevent the creation of deliberately fat (or, conversely, tiny) containers/pods, to limit their number, and to define the allowed difference between limits and requests values
ResourceQuota - describes the restriction policy for all containers in a namespace taken together and is usually used to delimit resources between environments (useful when environments are not strictly separated at the node level)
The following are examples of manifests where resource limits are set:
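For instance, a container with both types of requirements might look like this (a sketch reconstructed from the description below; the pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: nginx-test
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: "1Gi"   # explicit minimum RAM on the node
      limits:
        cpu: "200m"     # requests.cpu is inherited from this limit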
That is, in this case, to run the nginx container a node will need at least 1G of free RAM and 0.2 CPU, while at most the container can eat 0.2 CPU and all the available RAM on the node.
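A ResourceQuota matching the next description might look like this (a sketch; the object name is illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: rq-example
  namespace: default
spec:
  hard:
    requests.cpu: 300m
    requests.memory: 1Gi
    limits.cpu: 700m
    limits.memory: 2Gi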
That is, the sum of all containers' requests in the default ns cannot exceed 300m of CPU and 1G of RAM, and the sum of all limits cannot exceed 700m of CPU and 2G of RAM.
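A container-level LimitRange matching the next description might look like this (a sketch; writing the RAM bounds as 500Mi/4Gi is an assumption about the exact units):

apiVersion: v1
kind: LimitRange
metadata:
  name: lr-example
  namespace: default
spec:
  limits:
  - type: Container
    defaultRequest:   # default requests
      cpu: 100m
      memory: 1Gi
    default:          # default limits
      cpu: "1"
      memory: 2Gi
    min:              # lower bound on request/limit values
      cpu: 50m
      memory: 500Mi
    max:              # upper bound on request/limit values
      cpu: "2"
      memory: 4Gi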
That is, in the default namespace every container will by default have request set to 100m of CPU and 1G of RAM, and limit set to 1 CPU and 2G. This also constrains the possible request/limit values for CPU (50m < x < 2) and RAM (500M < x < 4G).
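And a pod-level restriction might be described like this (a sketch):

apiVersion: v1
kind: LimitRange
metadata:
  name: lr-pod-example
  namespace: default
spec:
  limits:
  - type: Pod
    max:
      cpu: "4"
      memory: 1Gi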
That is, each pod in the default ns will be limited to 4 vCPU and 1G of RAM.
Now let's look at what advantages setting these restrictions can give us.
Load balancing mechanism between nodes
As you know, a k8s component called the scheduler is responsible for distributing pods among nodes, and it works according to a certain algorithm. While choosing the optimal node to launch a pod, this algorithm goes through two stages:
Filtering
Scoring
That is, according to the described policy, the scheduler first selects the nodes on which the pod can be launched, based on a set of predicates (including PodFitsResources, a check that the node has enough resources to run the pod). Then each of these nodes is awarded points according to priorities (among them: the more free resources a node has, the more points it gets - LeastResourceAllocation/LeastRequestedPriority/BalancedResourceAllocation), and the pod is launched on the node with the highest score (if several nodes satisfy this condition at once, a random one is chosen).
At the same time, you need to understand that when assessing a node's available resources, the scheduler is guided by the data stored in etcd, i.e. by the requests/limits of each pod running on this node, and not by the actual resource consumption. This information can be obtained from the output of the command kubectl describe node $NODE, for example:
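The relevant part of that output looks roughly like this (a sketch; the pod row and totals are illustrative, and the exact table layout depends on the kubectl version):

Non-terminated Pods:         (16 in total)
  Namespace    Name                         CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------    ----                         ------------  ----------  ---------------  -------------
  nxs-stage    app-test-78559f8f44-pgs58    60m (0%)      0 (0%)      0 (0%)           0 (0%)
  ...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests          Limits
  --------  --------          ------
  cpu       2343m (5%)        ...
  memory    9640186880 (14%)  ...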
Here we see all the pods running on a particular node, as well as the resources each pod requests. And here is what the scheduler logs look like when the pod cronjob-cron-events-1573793820-xt6q9 is scheduled (this information appears in the scheduler log when logging level 10 is set via the -v=10 launch argument):
Scheduler log
I1115 07:57:21.637791 1 scheduling_queue.go:908] About to try and schedule pod nxs-stage/cronjob-cron-events-1573793820-xt6q9
I1115 07:57:21.637804 1 scheduler.go:453] Attempting to schedule pod: nxs-stage/cronjob-cron-events-1573793820-xt6q9
I1115 07:57:21.638285 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s5 is allowed, Node is running only 16 out of 110 Pods.
I1115 07:57:21.638300 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s6 is allowed, Node is running only 20 out of 110 Pods.
I1115 07:57:21.638322 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s3 is allowed, Node is running only 20 out of 110 Pods.
I1115 07:57:21.638322 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s4 is allowed, Node is running only 17 out of 110 Pods.
I1115 07:57:21.638334 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s10 is allowed, Node is running only 16 out of 110 Pods.
I1115 07:57:21.638365 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s12 is allowed, Node is running only 9 out of 110 Pods.
I1115 07:57:21.638334 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s11 is allowed, Node is running only 11 out of 110 Pods.
I1115 07:57:21.638385 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s1 is allowed, Node is running only 19 out of 110 Pods.
I1115 07:57:21.638402 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s2 is allowed, Node is running only 21 out of 110 Pods.
I1115 07:57:21.638383 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s9 is allowed, Node is running only 16 out of 110 Pods.
I1115 07:57:21.638335 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s8 is allowed, Node is running only 18 out of 110 Pods.
I1115 07:57:21.638408 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s13 is allowed, Node is running only 8 out of 110 Pods.
I1115 07:57:21.638478 1 predicates.go:1369] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s10 is allowed, existing pods anti-affinity terms satisfied.
I1115 07:57:21.638505 1 predicates.go:1369] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s8 is allowed, existing pods anti-affinity terms satisfied.
I1115 07:57:21.638577 1 predicates.go:1369] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s9 is allowed, existing pods anti-affinity terms satisfied.
I1115 07:57:21.638583 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s7 is allowed, Node is running only 25 out of 110 Pods.
I1115 07:57:21.638932 1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: BalancedResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 2343 millicores 9640186880 memory bytes, score 9
I1115 07:57:21.638946 1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: LeastResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 2343 millicores 9640186880 memory bytes, score 8
I1115 07:57:21.638961 1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: BalancedResourceAllocation, capacity 39900 millicores 66620170240 memory bytes, total request 4107 millicores 11307422720 memory bytes, score 9
I1115 07:57:21.638971 1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: BalancedResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 5847 millicores 24333637120 memory bytes, score 7
I1115 07:57:21.638975 1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: LeastResourceAllocation, capacity 39900 millicores 66620170240 memory bytes, total request 4107 millicores 11307422720 memory bytes, score 8
I1115 07:57:21.638990 1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: LeastResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 5847 millicores 24333637120 memory bytes, score 7
I1115 07:57:21.639022 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s10: TaintTolerationPriority, Score: (10)
I1115 07:57:21.639030 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s8: TaintTolerationPriority, Score: (10)
I1115 07:57:21.639034 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s9: TaintTolerationPriority, Score: (10)
I1115 07:57:21.639041 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s10: NodeAffinityPriority, Score: (0)
I1115 07:57:21.639053 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s8: NodeAffinityPriority, Score: (0)
I1115 07:57:21.639059 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s9: NodeAffinityPriority, Score: (0)
I1115 07:57:21.639061 1 interpod_affinity.go:237] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: InterPodAffinityPriority, Score: (0)
I1115 07:57:21.639063 1 selector_spreading.go:146] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: SelectorSpreadPriority, Score: (10)
I1115 07:57:21.639073 1 interpod_affinity.go:237] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: InterPodAffinityPriority, Score: (0)
I1115 07:57:21.639077 1 selector_spreading.go:146] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: SelectorSpreadPriority, Score: (10)
I1115 07:57:21.639085 1 interpod_affinity.go:237] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: InterPodAffinityPriority, Score: (0)
I1115 07:57:21.639088 1 selector_spreading.go:146] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: SelectorSpreadPriority, Score: (10)
I1115 07:57:21.639103 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s10: SelectorSpreadPriority, Score: (10)
I1115 07:57:21.639109 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s8: SelectorSpreadPriority, Score: (10)
I1115 07:57:21.639114 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s9: SelectorSpreadPriority, Score: (10)
I1115 07:57:21.639127 1 generic_scheduler.go:781] Host nxs-k8s-s10 => Score 100037
I1115 07:57:21.639150 1 generic_scheduler.go:781] Host nxs-k8s-s8 => Score 100034
I1115 07:57:21.639154 1 generic_scheduler.go:781] Host nxs-k8s-s9 => Score 100037
I1115 07:57:21.639267 1 scheduler_binder.go:269] AssumePodVolumes for pod "nxs-stage/cronjob-cron-events-1573793820-xt6q9", node "nxs-k8s-s10"
I1115 07:57:21.639286 1 scheduler_binder.go:279] AssumePodVolumes for pod "nxs-stage/cronjob-cron-events-1573793820-xt6q9", node "nxs-k8s-s10": all PVCs bound and nothing to do
I1115 07:57:21.639333 1 factory.go:733] Attempting to bind cronjob-cron-events-1573793820-xt6q9 to nxs-k8s-s10
Here we see that the scheduler first performs filtering and forms a list of 3 nodes on which the pod can be launched (nxs-k8s-s8, nxs-k8s-s9, nxs-k8s-s10). It then calculates scores based on several parameters (including BalancedResourceAllocation and LeastResourceAllocation) for each of these nodes to determine the most suitable one. Ultimately, the pod is scheduled on the node with the highest score (here two nodes share the same top score of 100037, so a random one is chosen: nxs-k8s-s10).
Tip: if pods without limits are running on a node, then for k8s (from the point of view of assessing resource consumption) it is as if these pods were absent from the node altogether. Therefore, if you have a pod with a gluttonous process (for example, wowza) and no restrictions are set for it, a situation may arise when this pod has actually eaten all the node's resources, yet k8s considers this node unloaded and awards it the same number of points during scoring (specifically, in the points assessing available resources) as a node that has no working pods at all, which can ultimately lead to uneven load distribution between nodes.
Pod eviction
As you know, each pod is assigned one of 3 QoS classes:
guaranteed - assigned when requests and limits are set for memory and cpu in every container in the pod, and these values must match (see the sketch after this list)
burstable - at least one container in the pod has request and limit, while request < limit
best effort - when no container in the pod is resource limited
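A sketch of a spec yielding the guaranteed class (values are illustrative; requests and limits are set for both cpu and memory and are equal):

apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed-example
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 500m
        memory: 1Gi

The assigned class can be checked in the pod's status.qosClass field (kubectl get pod qos-guaranteed-example -o yaml).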
At the same time, when a node runs short of resources (disk, memory), kubelet starts ranking and evicting pods according to a certain algorithm that takes into account the pod's priority and its QoS class. For example, if we are talking about RAM, then based on the QoS class, points are awarded according to the following principle:
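In the RAM case this principle boils down to the oom_score_adj values kubelet assigns to container processes (the values below follow the Kubernetes documentation on node out-of-memory behavior; the higher the value, the sooner the process is killed):

guaranteed: oom_score_adj = -998
best effort: oom_score_adj = 1000
burstable: oom_score_adj = min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)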
That is, with the same priority, kubelet will first of all evict pods with the best effort QoS class from the node.
Tip: if you want to reduce the probability of the desired pod being evicted from a node when it runs out of resources, then along with the priority you also need to take care of setting request/limit for it.
Horizontal Pod Autoscaler (HPA) mechanism
When the task is to automatically increase or decrease the number of pods depending on resource usage (system resources such as CPU/RAM, or user-defined ones such as rps), a k8s entity called HPA (Horizontal Pod Autoscaler) comes to the rescue. Its algorithm is as follows:
The current indications of the observable resource are determined (currentMetricValue)
The desired values for the resource are determined (desiredMetricValue), which are set for system resources using request
The current number of replicas is determined (currentReplicas)
The desired number of replicas (desiredReplicas) is calculated by the following formula, where the square brackets denote rounding up to the nearest integer:
desiredReplicas = [ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]
In this case, scaling will not occur when the ratio (currentMetricValue / desiredMetricValue) is close to 1 (the permissible tolerance can be set manually; by default it is 0.1).
Let's consider how hpa works using the example of the app-test application (described as a Deployment), where the number of replicas must change depending on CPU consumption:
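A sketch of such a Deployment (image names are illustrative; which container gets 60m and which gets 30m is an assumption based on the calculation below, where the containers' requests sum to 90m):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-test
  template:
    metadata:
      labels:
        app: app-test
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          requests:
            cpu: 60m
      - name: nginx-exporter
        image: nginx/nginx-prometheus-exporter
        resources:
          requests:
            cpu: 30m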
That is, we see that the application pod is initially launched in two instances, each of which contains the two containers nginx and nginx-exporter, each with CPU requests set.
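The hpa itself can be described like this (a sketch using the autoscaling/v1 API, matching the behavior explained below):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: app-test-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-test
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 30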
That is, we created an hpa that will monitor the app-test Deployment and adjust the number of application pods based on cpu (we expect a pod to consume 30% of the CPU it requests), keeping the number of replicas in the 2-10 range.
Now, let's look at the hpa mechanism, if we apply a load to one of the pods:
# kubectl top pod
NAME CPU(cores) MEMORY(bytes)
app-test-78559f8f44-pgs58 101m 243Mi
app-test-78559f8f44-cj4jz 4m 240Mi
In total we have the following:
DesiredMetricValue - according to the hpa settings, we have 30%
Current value (currentMetricValue) - to calculate it, the controller-manager computes the average resource consumption in %, roughly doing the following:
Gets absolute pod metrics from the metric server, i.e. 101m and 4m
Calculates the average absolute value, i.e. (101m + 4m) / 2 ≈ 53m
Gets the absolute value of the desired resource consumption (for this, the requests of all the pod's containers are summed): 60m + 30m = 90m
Calculates the average percentage of CPU consumption relative to the request Pod, i.e. 53m / 90m * 100% = 59%
Now we have everything we need to determine whether we need to change the number of replicas, for this we calculate the coefficient:
ratio = 59% / 30% = 1.96
That is, the number of replicas should roughly double: [2 * 1.96] = 4 (the square brackets again denote rounding up).
Conclusion: As you can see, for this mechanism to work, a necessary condition is, among other things, the presence of requests for all containers in the monitored pod.
Mechanism of horizontal autoscaling of nodes (Cluster Autoscaler)
Having a configured hpa alone is not enough to mitigate the negative impact of load bursts on the system. For example, based on its settings, the hpa controller manager decides that the number of replicas should be doubled, but the nodes have no free resources to run that many pods (i.e. a node cannot provide the resources requested by the pods), and these pods go into the Pending state.
In this case, if the provider has a corresponding IaaS/PaaS (for example, GKE/GCE, AKS, EKS, etc.), a tool such as Cluster Autoscaler can help. It allows you to set the maximum and minimum number of nodes in the cluster and automatically adjust the current number of nodes (by calling the cloud provider API to order/remove a node) when there is a shortage of resources in the cluster and pods cannot be scheduled (are stuck in the Pending state).
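For example, when Cluster Autoscaler is deployed manually, node group boundaries are passed via its --nodes flag in the min:max:group format (a sketch; the provider and group name are illustrative):

cluster-autoscaler \
  --cloud-provider=gce \
  --nodes=2:10:my-node-group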
Conclusion: to be able to autoscale nodes, you need to set requests in pod containers so that k8s can correctly assess node load and accordingly report that there are no resources in the cluster to launch the next pod.
Conclusion
It should be noted that setting container resource limits is not a prerequisite for successfully launching an application, but it is still better to do so for the following reasons:
For more accurate scheduler work in terms of load balancing between k8s nodes
To reduce the likelihood of a pod eviction event
For horizontal pod autoscaling (HPA) to work
For horizontal autoscaling of nodes (Cluster Autoscaling) for cloud providers