Kubernetes: why is it so important to set up system resource management?

As a rule, any application needs a dedicated pool of resources for correct and stable operation. But what if several applications run on the same capacity at once? How do you provide each of them with the minimum necessary resources? How can resource consumption be limited? How do you distribute the load between nodes correctly? How do you make sure the horizontal scaling mechanism kicks in when the application load grows?

Let's start with the main types of resources that exist in the system - these are, of course, CPU time and RAM. In k8s manifests these resources are measured in the following units:

  • CPU - in cores (either fractional values, e.g. 0.5, or millicores, e.g. 500m)
  • RAM - in bytes (binary suffixes such as Mi and Gi are commonly used, e.g. 128Mi, 1Gi)

Moreover, for each resource it is possible to set two types of requirements - requests and limits. Requests describes the minimum amount of free resources a node must have in order to run the container (and the pod as a whole), while limits sets a hard cap on the resources available to the container.

It is important to understand that you do not have to define both types explicitly in the manifest; the behavior is as follows:

  • If only limits are explicitly set for a resource, then requests for that resource is automatically set equal to limits (you can verify this by calling describe on the entity). In other words, the container will be limited to exactly the amount of resources it requires to run (see the sketch below).
  • If only requests are explicitly set for a resource, then no upper restriction is placed on that resource - the container is limited only by the resources of the node itself.
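
A minimal sketch of the first case (the container name and image here are purely illustrative) - only limits are set, so requests are defaulted to the same values:

containers:
- name: app-nginx
  image: nginx
  resources:
    limits:
      cpu: 200m
      memory: 512Mi

You can verify the defaulting by running kubectl describe pod on the created pod: the Requests section will show cpu: 200m and memory: 512Mi even though they were never specified in the manifest.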

It is also possible to configure resource management not only at the level of a specific container, but also at the namespace level using the following entities:

  • LimitRange - describes the restriction policy at the container/pod level within a namespace; it is used to set default restrictions for containers/pods, to prevent the creation of deliberately oversized (or, conversely, undersized) containers/pods, to limit their number, and to constrain the allowed difference between limits and requests
  • ResourceQuota - describes the restriction policy for all containers in a namespace as a whole; it is usually used to divide resources between environments (useful when environments are not strictly separated at the node level)

The following are examples of manifests where resource limits are set:

  • At the level of a specific container:

    containers:
    - name: app-nginx
      image: nginx
      resources:
        requests:
          memory: 1Gi
        limits:
          cpu: 200m

    That is, in this case the node must have at least 1Gi of free RAM and 0.2 CPU to run the nginx container (requests.cpu is defaulted to limits.cpu), while the container may consume at most 0.2 CPU and any amount of RAM available on the node.

  • At the level of the entire namespace:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: nxs-test
    spec:
      hard:
        requests.cpu: 300m
        requests.memory: 1Gi
        limits.cpu: 700m
        limits.memory: 2Gi

    That is, in the default namespace the sum of requests across all containers cannot exceed 300m of CPU and 1Gi of RAM, and the sum of all limits cannot exceed 700m of CPU and 2Gi of RAM.
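
    If creating a pod would push the namespace over the quota, the API server rejects it at admission time. The exact wording depends on the k8s version and the values below are illustrative, but the error looks roughly like this:

    # kubectl apply -f pod.yaml
    Error from server (Forbidden): ... exceeded quota: nxs-test,
    requested: requests.cpu=500m, used: requests.cpu=300m, limited: requests.cpu=300m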

  • Default limits for containers in ns:

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: nxs-limit-per-container
    spec:
      limits:
      - type: Container
        defaultRequest:
          cpu: 100m
          memory: 1Gi
        default:
          cpu: 1
          memory: 2Gi
        min:
          cpu: 50m
          memory: 500Mi
        max:
          cpu: 2
          memory: 4Gi

    That is, in the default namespace every container that does not specify its own values will get a request of 100m CPU and 1Gi RAM and a limit of 1 CPU and 2Gi RAM by default. The same object also constrains the allowed request/limit values for CPU (50m <= x <= 2) and RAM (500Mi <= x <= 4Gi).

  • Pod-level limits in a namespace:

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: nxs-limit-pod
    spec:
      limits:
      - type: Pod
        max:
          cpu: 4
          memory: 1Gi

    That is, every pod in the default namespace will be limited to a total of 4 CPU and 1Gi of RAM across all of its containers.
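
    Both LimitRange objects can be inspected with kubectl describe limitrange -n <namespace>. A pod that violates the policy (for example, one whose containers together are limited to more than 4 CPU) is rejected by the admission controller; the exact message depends on the version and the values below are illustrative, but it looks roughly like this:

    # kubectl apply -f fat-pod.yaml
    Error from server (Forbidden): ... maximum cpu usage per Pod is 4, but limit is 6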

Now let's look at what benefits setting these restrictions can give us.

Load balancing mechanism between nodes

As you know, the k8s component responsible for distributing pods among nodes is the scheduler, which works according to a specific algorithm. When choosing the optimal node for a pod, it goes through two stages:

  1. Filtering
  2. Scoring (ranking)

That is, according to the configured policy, the scheduler first selects the nodes on which the pod can be launched, based on a set of predicates (including PodFitsResources - a check that the node has enough resources to run the pod). Then each of these nodes is scored according to priorities (among them: the more free resources a node has, the higher its score - LeastResourceAllocation/LeastRequestedPriority/BalancedResourceAllocation), and the pod is launched on the node with the highest score (if several nodes score the same, a random one of them is chosen).

At the same time, you need to understand that when the scheduler evaluates a node's available resources, it relies on the data stored in etcd - i.e. on the requests/limits of every pod running on that node, not on actual resource consumption. This information can be obtained from the output of kubectl describe node $NODE, for example:

# kubectl describe nodes nxs-k8s-s1
..
Non-terminated Pods:         (9 in total)
  Namespace                  Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                         ------------  ----------  ---------------  -------------  ---
  ingress-nginx              nginx-ingress-controller-754b85bf44-qkt2t    0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
  kube-system                kube-flannel-26bl4                           150m (0%)     300m (1%)   64M (0%)         500M (1%)      233d
  kube-system                kube-proxy-exporter-cb629                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
  kube-system                kube-proxy-x9fsc                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
  kube-system                nginx-proxy-k8s-worker-s1                    25m (0%)      300m (1%)   32M (0%)         512M (1%)      233d
  nxs-monitoring             alertmanager-main-1                          100m (0%)     100m (0%)   425Mi (1%)       25Mi (0%)      233d
  nxs-logging                filebeat-lmsmp                               100m (0%)     0 (0%)      100Mi (0%)       200Mi (0%)     233d
  nxs-monitoring             node-exporter-v4gdq                          112m (0%)     122m (0%)   200Mi (0%)       220Mi (0%)     233d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                487m (3%)          822m (5%)
  memory             15856217600 (2%)  749976320 (3%)
  ephemeral-storage  0 (0%)             0 (0%)

Here we see all the pods running on a particular node, as well as the resources that each of them requests. And here is what the scheduler log looks like when the pod cronjob-cron-events-1573793820-xt6q9 is being scheduled (these messages appear in the scheduler log when the logging level is set to 10 via the -v=10 launch argument):

I1115 07:57:21.637791       1 scheduling_queue.go:908] About to try and schedule pod nxs-stage/cronjob-cron-events-1573793820-xt6q9                                                                                                                                           
I1115 07:57:21.637804       1 scheduler.go:453] Attempting to schedule pod: nxs-stage/cronjob-cron-events-1573793820-xt6q9                                                                                                                                                    
I1115 07:57:21.638285       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s5 is allowed, Node is running only 16 out of 110 Pods.                                                                               
I1115 07:57:21.638300       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s6 is allowed, Node is running only 20 out of 110 Pods.                                                                               
I1115 07:57:21.638322       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s3 is allowed, Node is running only 20 out of 110 Pods.                                                                               
I1115 07:57:21.638322       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s4 is allowed, Node is running only 17 out of 110 Pods.                                                                               
I1115 07:57:21.638334       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s10 is allowed, Node is running only 16 out of 110 Pods.                                                                              
I1115 07:57:21.638365       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s12 is allowed, Node is running only 9 out of 110 Pods.                                                                               
I1115 07:57:21.638334       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s11 is allowed, Node is running only 11 out of 110 Pods.                                                                              
I1115 07:57:21.638385       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s1 is allowed, Node is running only 19 out of 110 Pods.                                                                               
I1115 07:57:21.638402       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s2 is allowed, Node is running only 21 out of 110 Pods.                                                                               
I1115 07:57:21.638383       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s9 is allowed, Node is running only 16 out of 110 Pods.                                                                               
I1115 07:57:21.638335       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s8 is allowed, Node is running only 18 out of 110 Pods.                                                                               
I1115 07:57:21.638408       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s13 is allowed, Node is running only 8 out of 110 Pods.                                                                               
I1115 07:57:21.638478       1 predicates.go:1369] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s10 is allowed, existing pods anti-affinity terms satisfied.                                                                         
I1115 07:57:21.638505       1 predicates.go:1369] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s8 is allowed, existing pods anti-affinity terms satisfied.                                                                          
I1115 07:57:21.638577       1 predicates.go:1369] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s9 is allowed, existing pods anti-affinity terms satisfied.                                                                          
I1115 07:57:21.638583       1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s7 is allowed, Node is running only 25 out of 110 Pods.                                                                               
I1115 07:57:21.638932       1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: BalancedResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 2343 millicores 9640186880 memory bytes, score 9        
I1115 07:57:21.638946       1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: LeastResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 2343 millicores 9640186880 memory bytes, score 8           
I1115 07:57:21.638961       1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: BalancedResourceAllocation, capacity 39900 millicores 66620170240 memory bytes, total request 4107 millicores 11307422720 memory bytes, score 9        
I1115 07:57:21.638971       1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: BalancedResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 5847 millicores 24333637120 memory bytes, score 7        
I1115 07:57:21.638975       1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: LeastResourceAllocation, capacity 39900 millicores 66620170240 memory bytes, total request 4107 millicores 11307422720 memory bytes, score 8           
I1115 07:57:21.638990       1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: LeastResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 5847 millicores 24333637120 memory bytes, score 7           
I1115 07:57:21.639022       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s10: TaintTolerationPriority, Score: (10)                                                                                                        
I1115 07:57:21.639030       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s8: TaintTolerationPriority, Score: (10)                                                                                                         
I1115 07:57:21.639034       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s9: TaintTolerationPriority, Score: (10)                                                                                                         
I1115 07:57:21.639041       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s10: NodeAffinityPriority, Score: (0)                                                                                                            
I1115 07:57:21.639053       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s8: NodeAffinityPriority, Score: (0)                                                                                                             
I1115 07:57:21.639059       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s9: NodeAffinityPriority, Score: (0)                                                                                                             
I1115 07:57:21.639061       1 interpod_affinity.go:237] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: InterPodAffinityPriority, Score: (0)                                                                                                                   
I1115 07:57:21.639063       1 selector_spreading.go:146] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: SelectorSpreadPriority, Score: (10)                                                                                                                   
I1115 07:57:21.639073       1 interpod_affinity.go:237] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: InterPodAffinityPriority, Score: (0)                                                                                                                    
I1115 07:57:21.639077       1 selector_spreading.go:146] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: SelectorSpreadPriority, Score: (10)                                                                                                                    
I1115 07:57:21.639085       1 interpod_affinity.go:237] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: InterPodAffinityPriority, Score: (0)                                                                                                                    
I1115 07:57:21.639088       1 selector_spreading.go:146] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: SelectorSpreadPriority, Score: (10)                                                                                                                    
I1115 07:57:21.639103       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s10: SelectorSpreadPriority, Score: (10)                                                                                                         
I1115 07:57:21.639109       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s8: SelectorSpreadPriority, Score: (10)                                                                                                          
I1115 07:57:21.639114       1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s9: SelectorSpreadPriority, Score: (10)                                                                                                          
I1115 07:57:21.639127       1 generic_scheduler.go:781] Host nxs-k8s-s10 => Score 100037                                                                                                                                                                            
I1115 07:57:21.639150       1 generic_scheduler.go:781] Host nxs-k8s-s8 => Score 100034                                                                                                                                                                             
I1115 07:57:21.639154       1 generic_scheduler.go:781] Host nxs-k8s-s9 => Score 100037                                                                                                                                                                             
I1115 07:57:21.639267       1 scheduler_binder.go:269] AssumePodVolumes for pod "nxs-stage/cronjob-cron-events-1573793820-xt6q9", node "nxs-k8s-s10"                                                                                                               
I1115 07:57:21.639286       1 scheduler_binder.go:279] AssumePodVolumes for pod "nxs-stage/cronjob-cron-events-1573793820-xt6q9", node "nxs-k8s-s10": all PVCs bound and nothing to do                                                                             
I1115 07:57:21.639333       1 factory.go:733] Attempting to bind cronjob-cron-events-1573793820-xt6q9 to nxs-k8s-s10

Here we see that the scheduler first performs filtering and produces a list of 3 nodes on which the pod can be launched (nxs-k8s-s8, nxs-k8s-s9, nxs-k8s-s10). It then scores each of these nodes against several priorities (including BalancedResourceAllocation and LeastResourceAllocation) to determine the most suitable one. Ultimately, the pod is scheduled on the node with the highest score (here two nodes share the top score of 100037, so a random one of them is chosen - nxs-k8s-s10).

Takeaway: if a node runs pods for which no restrictions are set, then for k8s (in terms of resource accounting) it is as if those pods did not exist on that node at all. Therefore, if you have a pod with a resource-hungry process (for example, wowza) and no restrictions are set for it, a situation may arise where the pod has actually eaten all of the node's resources, yet k8s still considers the node unloaded and, during ranking, awards it the same score (specifically in the priorities that evaluate available resources) as a node with no running pods. This can ultimately lead to an uneven load distribution between the nodes.

Pod eviction

As you know, each pod is assigned one of 3 QoS classes:

  1. guaranteed - assigned when requests and limits for both memory and cpu are set for every container in the pod, and these values match
  2. burstable - the pod does not meet the guaranteed criteria, but at least one of its containers has a cpu or memory request or limit set (typically with requests < limits)
  3. best effort - when no container in the pod has any resource requests or limits
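
The class that was actually assigned to a pod is recorded in its status and can be checked directly (the pod name here is a placeholder):

# kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
Burstable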

At the same time, when a node runs short of resources (disk, memory), kubelet starts ranking and evicting pods according to an algorithm that takes into account the pod's priority and its QoS class. For example, if we are talking about RAM, then based on the QoS class, scores (oom_score_adj) are assigned according to the following principle:

  • Guaranteed: -998
  • BestEffort: 1000
  • Burstable: min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)
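
As a quick illustration of the burstable formula (the numbers are assumed): a pod that requests 1 GiB of memory on a node with 4 GiB of RAM gets

    min(max(2, 1000 - (1000 * 1GiB) / 4GiB), 999) = min(max(2, 750), 999) = 750

i.e. the larger the memory request relative to the node capacity, the lower the score and the later such a pod gets killed.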

That is, with equal priority, kubelet will first evict pods with the best effort QoS class from the node.

Takeaway: if you want to reduce the probability of a particular pod being evicted from a node when it runs out of resources, then along with the priority you should also take care of setting requests/limits for it.

Horizontal Pod Autoscaler (HPA) mechanism

When the task is to automatically increase and decrease the number of pods depending on resource usage (system metrics - CPU/RAM, or user metrics - rps), a k8s entity called HPA (Horizontal Pod Autoscaler) comes to the rescue. Its algorithm is as follows:

  1. The current readings of the observed metric are obtained (currentMetricValue)
  2. The desired value for the metric is determined (desiredMetricValue), which for system resources is set via request
  3. The current number of replicas is determined (currentReplicas)
  4. The desired number of replicas (desiredReplicas) is calculated with the following formula:
    desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]

Scaling will not occur when the ratio (currentMetricValue / desiredMetricValue) is close to 1 (the permissible tolerance is configurable via the controller-manager flag --horizontal-pod-autoscaler-tolerance; by default it is 0.1).

Let's look at how hpa works using the app-test application (described as a Deployment) as an example, where the number of replicas must change depending on CPU consumption:

  • Application Manifest

    kind: Deployment
    apiVersion: apps/v1beta2
    metadata:
      name: app-test
    spec:
      selector:
        matchLabels:
          app: app-test
      replicas: 2
      template:
        metadata:
          labels:
            app: app-test
        spec:
          containers:
          - name: nginx
            image: registry.nixys.ru/generic-images/nginx
            imagePullPolicy: Always
            resources:
              requests:
                cpu: 60m
            ports:
            - name: http
              containerPort: 80
          - name: nginx-exporter
            image: nginx/nginx-prometheus-exporter
            resources:
              requests:
                cpu: 30m
            ports:
            - name: nginx-exporter
              containerPort: 9113
            args:
            - -nginx.scrape-uri
            - http://127.0.0.1:80/nginx-status

    That is, the application is initially launched as two pod replicas, each containing two containers - nginx and nginx-exporter - and each container has a CPU request defined.

  • HPA Manifest

    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    metadata:
      name: app-test-hpa
    spec:
      maxReplicas: 10
      minReplicas: 2
      scaleTargetRef:
        apiVersion: extensions/v1beta1
        kind: Deployment
        name: app-test
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 30

    That is, we created an hpa that will monitor the app-test Deployment and adjust the number of application pods based on CPU (we expect a pod to consume 30% of the CPU it requests), keeping the number of replicas between 2 and 10.
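
    Once the object is created, its current state can be checked with kubectl get hpa (the TARGETS value below is illustrative):

    # kubectl get hpa app-test-hpa
    NAME           REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    app-test-hpa   Deployment/app-test   4%/30%    2         10        2          1m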

    Now let's look at how the hpa mechanism behaves if we apply load to one of the pods:

    # kubectl top pod
    NAME                          CPU(cores)   MEMORY(bytes)
    app-test-78559f8f44-pgs58     101m         243Mi
    app-test-78559f8f44-cj4jz     4m           240Mi

In total we have the following:

  • The desired value (desiredMetricValue) - according to the hpa settings, it is 30%
  • The current value (currentMetricValue) - to obtain it, the controller-manager calculates the average resource consumption in %, roughly doing the following:
    1. Gets the absolute pod metrics from the metrics server, i.e. 101m and 4m
    2. Calculates the average absolute value, i.e. (101m + 4m) / 2 = 53m
    3. Gets the absolute value of the desired resource consumption (the requests of all containers are summed for this): 60m + 30m = 90m
    4. Calculates the average percentage of CPU consumption relative to the pod's requests, i.e. 53m / 90m * 100% = 59%

Now we have everything we need to determine whether the number of replicas should change; to do so, we calculate the ratio:

ratio = 59% / 30% = 1.96

That is, the number of replicas should roughly double: ceil[ 2 * 1.96 ] = 4.

Conclusion: as you can see, a necessary condition for this mechanism to work is, among other things, that requests are set for all containers of the observed pod.
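
As an illustration of why this matters (the output below is illustrative): if a container in the target pods has no CPU request, the hpa cannot compute the utilization percentage, and kubectl get hpa shows an unknown target instead of a number:

# kubectl get hpa app-test-hpa
NAME           REFERENCE             TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
app-test-hpa   Deployment/app-test   <unknown>/30%   2         10        2          5m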

Horizontal node autoscaling mechanism (Cluster Autoscaler)

Having a configured hpa alone is not enough to smooth out the negative impact of load bursts on the system. For example, the hpa controller in the controller manager may decide that the number of replicas needs to be doubled, but the nodes have no free resources to run that many pods (i.e. a node cannot provide the resources requested by the pods), and these pods end up in the Pending state.
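
Such a pending pod can be diagnosed with kubectl describe pod: the Events section will contain a FailedScheduling record. The exact wording depends on the k8s version, and the pod name and node count below are illustrative, but it looks roughly like this:

# kubectl describe pod app-test-78559f8f44-xxxxx
...
Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  FailedScheduling  0/13 nodes are available: 13 Insufficient cpu.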

In this case, if the provider offers a suitable IaaS/PaaS (for example, GKE/GCE, AKS, EKS, etc.), a tool such as Cluster Autoscaler comes to the rescue. It allows you to set the maximum and minimum number of nodes in the cluster and automatically adjust the current number of nodes (by calling the cloud provider's API to order/remove a node) when there is a shortage of resources in the cluster and pods cannot be scheduled (i.e. are stuck in the Pending state).

Conclusion: for node autoscaling to work, you need to set requests on pod containers so that k8s can correctly assess the load on the nodes and report that there are no resources in the cluster to launch the next pod.

Conclusion

It should be noted that setting container resource restrictions is not a prerequisite for successfully launching an application, but it is still better to do so for the following reasons:

  1. For more accurate scheduler work in terms of load balancing between k8s nodes
  2. To reduce the likelihood of a Pod evict event
  3. For horizontal pod autoscaling (HPA) to work
  4. For horizontal autoscaling of nodes (Cluster Autoscaling) for cloud providers

Source: habr.com
