Nine Kubernetes Performance Tips

Hi all! My name is Oleg Sidorenkov, and I work at DomClick as an infrastructure team lead. We have been running Kubernetes in production for more than three years, and in that time we have been through many interesting moments with it. Today I will tell you how, with the right approach, you can squeeze even more performance out of vanilla Kubernetes for your cluster. Ready, steady, go!

You all know very well that Kubernetes is a scalable open-source container orchestration system; or, if you like, five binaries that do magic by managing the life cycle of your microservices in a server environment. It is also a fairly flexible tool that can be assembled like Lego for maximum customization to different tasks.

And everything seems fine: throw servers into the cluster like firewood into a firebox and know no grief. But if you care about the environment, you will start thinking: "How can I keep the fire going and spare the forest?" In other words, how do you find ways to improve the infrastructure and reduce costs?

1. Keep track of team and application resources

One of the simplest yet most effective methods is to introduce requests/limits. Separate applications by namespaces, and namespaces by development teams. Before deploying an application, set values for its CPU time, memory, and ephemeral storage consumption.

resources:
   requests:
     memory: 2Gi
     cpu: 250m
   limits:
     memory: 4Gi
     cpu: 500m

From experience we came to this conclusion: do not set limits more than twice as high as requests. The cluster size is calculated based on requests, and if you give applications a gap between requests and limits of, say, 5-10x, then imagine what will happen to a node when it fills up with pods and suddenly receives load. Nothing good. At a minimum throttling, and at worst you will say goodbye to the worker and get a cyclic load on the remaining nodes once the pods start moving around.

In addition, with the help of LimitRange you can set the minimum, maximum, and default resource values for containers at creation time:

➜  ~ kubectl describe limitranges --namespace ops
Name:       limit-range
Namespace:  ops
Type        Resource           Min   Max   Default Request  Default Limit  Max Limit/Request Ratio
----        --------           ---   ---   ---------------  -------------  -----------------------
Container   cpu                50m   10    100m             100m           2
Container   ephemeral-storage  12Mi  8Gi   128Mi            4Gi            -
Container   memory             64Mi  40Gi  128Mi            128Mi          2
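
For reference, a LimitRange manifest that would produce defaults similar to the output above might look roughly like this (the exact values are only an illustration):

apiVersion: v1
kind: LimitRange
metadata:
  name: limit-range
  namespace: ops
spec:
  limits:
    - type: Container
      min:
        cpu: 50m
        memory: 64Mi
      max:
        cpu: "10"
        memory: 40Gi
      defaultRequest:        # requests applied when a container specifies none
        cpu: 100m
        memory: 128Mi
      default:               # limits applied when a container specifies none
        cpu: 100m
        memory: 128Mi
      maxLimitRequestRatio:  # limits may exceed requests by at most 2x
        cpu: "2"
        memory: "2"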

Remember to limit namespace resources as well, so that one team cannot grab all the resources of the cluster:

➜  ~ kubectl describe resourcequotas --namespace ops
Name:                   resource-quota
Namespace:              ops
Resource                Used          Hard
--------                ----          ----
limits.cpu              77250m        80
limits.memory           124814367488  150Gi
pods                    31            45
requests.cpu            53850m        80
requests.memory         75613234944   150Gi
services                26            50
services.loadbalancers  0             0
services.nodeports      0             0
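
A quota like the one above can be created with a manifest roughly like this (a sketch; adjust the numbers to your teams):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: resource-quota
  namespace: ops
spec:
  hard:
    requests.cpu: "80"
    requests.memory: 150Gi
    limits.cpu: "80"
    limits.memory: 150Gi
    pods: "45"
    services: "50"
    services.loadbalancers: "0"
    services.nodeports: "0"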

As you can see from the resourcequotas description, if the ops team wants to deploy pods that will consume another 10 CPU, the scheduler will not allow it and will return an error:

Error creating: pods "nginx-proxy-9967d8d78-nh4fs" is forbidden: exceeded quota: resource-quota, requested: limits.cpu=5,requests.cpu=5, used: limits.cpu=77250m,requests.cpu=53850m, limited: limits.cpu=10,requests.cpu=10

To solve this kind of problem, you can write a tool that stores and commits the state of team resources.

2. Choose the best file storage

Here I would like to touch on the topic of persistent volumes and the disk subsystem of Kubernetes worker nodes. I hope no one runs Kubernetes on HDDs in production, but sometimes even a regular SSD is no longer enough. We ran into a problem where logs were killing the disk with I/O operations, and there are not many solutions here:

  • Use high-performance SSDs or switch to NVMe (if you manage your own hardware).

  • Decrease the level of logging.

  • Do "smart" balancing of pods that hammer the disk (podAntiAffinity); a sketch follows below.

With access_logs logging enabled on nginx-ingress-controller (~12k log lines per second), the disk gets hammered with I/O. Such a state, of course, can lead to the degradation of all applications on that node.
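
As for the "smart" balancing mentioned above, here is a rough sketch of spreading log-heavy pods across nodes with podAntiAffinity (the labels and the placeholder image are only an illustration):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: log-heavy-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: log-heavy-app
  template:
    metadata:
      labels:
        app: log-heavy-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["log-heavy-app"]
              topologyKey: kubernetes.io/hostname   # never put two such pods on one node
      containers:
        - name: app
          image: nginx   # placeholder image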

As for PVs, alas, I have not tried every type of Persistent Volume, so use whichever option suits you best. Historically, in our case a small portion of services needed RWX volumes, and a long time ago we started using NFS storage for that task. Cheap and... enough. Of course, it gave us plenty of pain, but we learned how to tune it, and it no longer causes headaches. If possible, switch to S3 object storage.

3. Build Optimized Images

It's best to use container-optimized images so that Kubernetes can fetch them faster and execute them more efficiently. 

Optimization means that images:

  • contain only one application or perform only one function;

  • are small in size, because large images transfer more slowly over the network;

  • have health and readiness endpoints that Kubernetes can use to take action in the event of downtime (a minimal probe sketch follows this list);

  • use container-friendly operating systems (like Alpine or CoreOS) that are more resistant to configuration errors;

  • use multi-stage builds so that you can only deploy compiled applications and not the accompanying sources.
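
As for the health and readiness endpoints, a minimal probe sketch might look like this (the /healthz and /ready paths and the placeholder image are assumptions; use whatever your application actually exposes):

apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: nginx              # placeholder image
      ports:
        - containerPort: 80
      livenessProbe:            # restart the container if it stops responding
        httpGet:
          path: /healthz        # assumed path
          port: 80
        initialDelaySeconds: 10
        periodSeconds: 10
      readinessProbe:           # keep the pod out of the Service until it is ready
        httpGet:
          path: /ready          # assumed path
          port: 80
        periodSeconds: 5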

There are many tools and services that allow you to check and optimize images on the fly. It is important to always keep them up to date and safe. As a result, you get:

  1. Reduced network load on the entire cluster.

  2. Decreased container startup time.

  3. Smaller size of your entire Docker registry.

4. Use a DNS cache

When it comes to high loads, life is pretty miserable without tuning the cluster's DNS system. Once upon a time, the Kubernetes developers maintained their kube-dns solution. We ran it too, but this software could not really be tuned and did not deliver the required performance, even though the task seems simple. Then CoreDNS appeared; we switched to it and knew no grief, and later it became the default DNS service in K8s. At some point we grew to 40 thousand rps to the DNS system, and that solution was no longer enough either. But, by a lucky chance, NodeLocal DNSCache came out, aka node local cache, aka nodelocaldns.

Why do we use it? There is a bug in the Linux kernel where multiple concurrent lookups through conntrack NAT over UDP lead to a race condition when writing to the conntrack tables, and part of the traffic going through NAT is lost (each trip through a Service is a NAT). NodeLocal DNSCache solves this problem by getting rid of NAT, upgrading connectivity to the upstream DNS to TCP, and caching upstream DNS queries locally (including a short 5-second negative cache).

5. Scale pods horizontally and vertically automatically

Can you say with confidence that all your microservices are ready for a two- to three-fold increase in load? How do you properly allocate resources to your applications? Keeping a couple of pods running beyond the current workload can be redundant, while sizing them back to back risks downtime from a sudden increase in traffic to the service. Services such as the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler help you find the golden mean.

VPA allows you to automatically raise the requests/limits of the containers in a pod based on actual usage. How is that useful? If you have pods that for some reason cannot be scaled out horizontally (which is not entirely reliable), you can try trusting VPA to change their resources. Its distinguishing feature is a recommendation system based on historical and current data from the metrics-server, so if you don't want to change requests/limits automatically, you can simply monitor the recommended resources for your containers and optimize the settings to save CPU and memory in the cluster.
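
A minimal VPA object in recommendation-only mode might look roughly like this (assuming the VPA components are installed in the cluster; the target Deployment name is an example):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # only compute recommendations, do not touch running pods

With updateMode set to "Off" you can then check the recommendations with kubectl describe vpa my-app-vpa and adjust the manifests by hand.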

Image taken from https://levelup.gitconnected.com/kubernetes-autoscaling-101-cluster-autoscaler-horizontal-pod-autoscaler-and-vertical-pod-2a441d9ad231

The scheduler in Kubernetes always works from requests. Whatever value you put there, the scheduler will look for a suitable node based on it. The limits value is needed by the kubelet to know when to throttle or kill a pod. And since the only parameter that matters for scheduling is the requests value, VPA works with it. Whenever you scale your application vertically, you define what the requests should be. What happens to the limits then? They are scaled proportionally as well.

For example, here are the typical pod settings:

resources:
   requests:
     memory: 250Mi
     cpu: 200m
   limits:
     memory: 500Mi
     cpu: 350m

The recommendation engine determines that your application needs 300m CPU and 500Mi of memory to run properly. You will get these settings:

resources:
   requests:
     memory: 500Mi
     cpu: 300m
   limits:
     memory: 1000Mi
     cpu: 525m

As mentioned above, this is proportional scaling based on the requests/limits ratio in the manifest:

  • CPU: requests 200m → 300m, limits 350m → 525m (requests:limits ratio 1:1.75);

  • Memory: requests 250Mi → 500Mi, limits 500Mi → 1000Mi (requests:limits ratio 1:2).

As for HPA, the mechanism of operation is more transparent. Thresholds are set for metrics such as CPU and memory, and if the average across all replicas exceeds the threshold, the application is scaled by +1 pod until the value falls below the threshold or until the maximum number of replicas is reached.

Image taken from https://levelup.gitconnected.com/kubernetes-autoscaling-101-cluster-autoscaler-horizontal-pod-autoscaler-and-vertical-pod-2a441d9ad231

In addition to the usual metrics like CPU and Memory, you can set thresholds on your custom Prometheus metrics and work with them if you feel this is the most accurate way to determine when to scale your application. Once the application stabilizes below the specified metric threshold, HPA will start scaling the pods down to the minimum number of replicas or until the load meets the specified threshold.
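
For completeness, a basic CPU-based HPA might look roughly like this (shown with the autoscaling/v2 API; older clusters use autoscaling/v2beta2):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when the average exceeds 70%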

6. Don't Forget About Node Affinity and Pod Affinity

Not all nodes run on the same hardware, and not all pods need to run compute-intensive applications. Kubernetes allows you to specify the specialization of nodes and pods using Node Affinity and Pod Affinity.

If you have nodes suitable for compute-intensive operations, then for maximum efficiency, it is better to bind applications to the appropriate nodes. To do this, use nodeSelector with node label.

Let's say you have two nodes: one with CPUType=HIGHFREQ and a large number of fast cores, and another with MemoryType=HIGHMEMORY and a lot of fast memory. The easiest way to assign a deployment's pods to the HIGHFREQ node is to add a selector like this to the spec section:

…
nodeSelector:
	CPUType: HIGHFREQ

A more expressive and specific way to do this is to use nodeAffinity in the affinity field of the spec section. There are two options:

  • requiredDuringSchedulingIgnoredDuringExecution: a hard requirement (the scheduler will deploy pods only on specific nodes and nowhere else);

  • preferredDuringSchedulingIgnoredDuringExecution: a soft requirement (the scheduler will try to deploy to specific nodes, and if that fails, it will deploy to the next available node).

You can use operators such as In, NotIn, Exists, DoesNotExist, Gt, or Lt to manage node labels. However, remember that complex expressions over long lists of labels will slow down decision-making in critical situations. In other words, don't overcomplicate it.
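
For example, affinity to the HIGHFREQ nodes from the example above could be expressed roughly like this (a sketch combining the hard and soft variants):

spec:
  affinity:
    nodeAffinity:
      # hard requirement: only nodes with this label are considered
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: CPUType
                operator: In
                values: ["HIGHFREQ"]
      # soft preference: prefer such nodes, but fall back if none fit
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: CPUType
                operator: In
                values: ["HIGHFREQ"]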

As mentioned above, Kubernetes also allows you to bind pods to other running pods. That is, you can make certain pods run together with other pods in the same availability zone (relevant for clouds) or on the same nodes.

In the podAffinity field of the affinity section of the spec, the same fields are available as for nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. The only difference is that matchExpressions will bind pods to a node that is already running a pod with that label.

Kubernetes also offers the podAntiAffinity field, which, on the contrary, keeps a pod away from nodes running specific pods.

The same advice applies to nodeAffinity expressions: try to keep the rules simple and logical, and do not overload the pod specification with a complex set of rules. It is very easy to create a rule that does not match the actual conditions of the cluster, putting extra load on the scheduler and degrading overall performance.

7. Taints & Tolerations

There is another way to manage the scheduler. If you have a large cluster with hundreds of nodes and thousands of microservices, it's very difficult to prevent certain pods from being hosted by certain nodes.

The taint mechanism, a set of prohibiting rules, helps with this. For example, you can prevent certain nodes from running pods in certain scenarios. To apply a taint to a specific node, use the taint option in kubectl. Specify a key and a value, and then a taint effect such as NoSchedule or NoExecute:

$ kubectl taint nodes node10 node-role.kubernetes.io/ingress=true:NoSchedule

It is also worth noting that the taint mechanism supports three main effects: NoSchedule, NoExecute, and PreferNoSchedule.

  • NoSchedule means that until there is a matching entry in the pod specification's tolerations, the pod cannot be deployed to the node (in this example, node10).

  • PreferNoSchedule is a softer version of NoSchedule. In this case, the scheduler will try not to place pods without a matching tolerations entry on the node, but this is not a hard limit. If there are no resources left in the cluster, pods will start deploying onto this node anyway.

  • NoExecute: this effect triggers an immediate evacuation of pods that do not have a matching tolerations entry.

Curiously, this behavior can be overridden using the tolerations mechanism. This is convenient when there is a "forbidden" node and you need to place only infrastructure services on it. How? Allow only those pods for which there is a suitable toleration.

Here's what the pod spec would look like:

spec:
  tolerations:
    - key: "node-role.kubernetes.io/ingress"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"

This does not mean that on the next redeploy the pod will land exactly on this node; this is not the Node Affinity mechanism or nodeSelector. But by combining several features, you can achieve a very flexible scheduler setup.

8. Set Pod Deployment Priority

Just because you've configured pod-to-node bindings doesn't mean that all pods should be treated with the same priority. For example, you might want to deploy some Pods before others.

Kubernetes offers different ways to configure Pod Priority and Preemption. The setup consists of several parts: the PriorityClass object and the priorityClassName field in the pod specification. Consider an example:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 99999
globalDefault: false
description: "This priority class should be used for very important pods only"

We create a PriorityClass and give it a name, description, and value. The higher the value, the higher the priority. The value can be any 32-bit integer less than or equal to 1 000 000 000 (one billion). Higher values are reserved for mission-critical system pods, which typically cannot be preempted. Eviction happens only when a high-priority pod has nowhere to go; in that case some of the pods on a particular node will be evacuated. If this mechanism is too rigid for you, you can add the option preemptionPolicy: Never; then there is no preemption, and the pod simply waits at the front of the queue for the scheduler to find free resources for it.
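
A non-preempting class might look roughly like this (a sketch; preemptionPolicy requires a reasonably recent Kubernetes version):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 99999
preemptionPolicy: Never   # wait for free resources instead of evicting other pods
globalDefault: false
description: "High priority without preempting lower-priority pods"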

Next, we create a pod in which we specify the priorityClassName:

apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    role: myrole
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
          protocol: TCP
  priorityClassName: high-priority
          

You can create as many priority classes as you like, although it is recommended not to get carried away with this (say, limit yourself to low, medium and high priority).

Thus, if necessary, you can increase the efficiency of deploying critical services, such as nginx-ingress-controller, coredns, etc.

9. Optimize your ETCD cluster

ETCD can be called the brain of the whole cluster. It is very important to keep this database running well, since the speed of operations in Kubernetes depends on it. A fairly standard and at the same time good solution is to keep the ETCD cluster on the master nodes in order to have minimal latency to kube-apiserver. If that is not possible, place ETCD as close as possible, with good bandwidth between the members. Also pay attention to how many ETCD nodes can fail without harming the cluster.

Keep in mind that excessively increasing the number of members in the cluster can raise fault tolerance at the expense of performance; everything should be in moderation.

As for tuning the service itself, there are a few recommendations:

  1. Have good hardware, based on the size of the cluster (you can read here).

  2. Tweak a few parameters if you have stretched the cluster across a pair of DCs, or if your network and disks leave much to be desired (you can read here); a hedged example of the typical flags follows below.
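
In a kubeadm-style setup these parameters end up as etcd flags in the static pod manifest; a hedged sketch (the exact numbers depend on your RTT and disks):

# fragment of /etc/kubernetes/manifests/etcd.yaml (kubeadm-style static pod)
spec:
  containers:
    - name: etcd
      command:
        - etcd
        - --heartbeat-interval=500   # default 100 ms; raise it for high-latency links
        - --election-timeout=2500    # keep it roughly 5-10x the heartbeat interval
        - --snapshot-count=10000     # commits between snapshots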

Conclusion

This article describes the points that our team tries to follow. It is not a step-by-step guide, but a set of options that can be useful for optimizing a cluster's overhead. Clearly, every cluster is unique in its own way, and tuning solutions can vary greatly, so it would be interesting to get feedback from you: how do you monitor your Kubernetes cluster, and how do you improve its performance? Share your experience in the comments.

Source: habr.com