10 Common Kubernetes Mistakes

Translator's note: the authors of this article are engineers from pipetail, a small Czech company. They have put together a great list of [sometimes banal, yet very real] problems and misconceptions related to operating Kubernetes clusters.


Over the years of using Kubernetes, we have worked with a large number of clusters (both managed and self-managed, on GCP, AWS, and Azure). Over time, we began to notice that certain mistakes are repeated constantly. There is nothing to be ashamed of, though: we have made most of them ourselves!

This article describes the most common ones and also mentions how to fix them.

1. Resources: requests and limits

This item definitely deserves the closest attention and the first place on the list.

CPU requests are usually either not set at all or set to a very low value (to fit as many pods as possible on each node). As a result, the nodes become overcommitted. During high load, the node's processing power is fully utilized, and a particular workload receives only what it "requested" and gets CPU throttled. This leads to increased application latency, timeouts, and other unpleasant consequences. (Read more about this in our other recent translation: "CPU limits and aggressive throttling in Kubernetes" - transl. note.)

BestEffort (extremely not recommended):

resources: {}

Extremely low CPU request (extremely not recommended):

   resources:
      requests:
        cpu: "1m"

On the other hand, having a CPU limit can cause pods to unreasonably skip CPU cycles even when the node's CPU is not fully loaded. Again, this can lead to increased latency. Debate continues around the CPU CFS quota in the Linux kernel, CPU throttling depending on the configured limits, and disabling the CFS quota altogether... Alas, CPU limits can cause more problems than they solve. You can find out more about this at the link below.

Memory over-allocation (overcommitting) can lead to bigger problems. Hitting the CPU limit means skipping cycles, while hitting the memory limit means the pod gets killed. Have you ever seen an OOMkill? Yes, that is exactly what we are talking about.

Want to minimize the chance of this happening? Don't over-allocate memory and use Guaranteed QoS (Quality of Service) by setting the memory request equal to the limit (as in the example below). Read more about this in the presentations of Henning Jacobs (Lead Engineer at Zalando).

Burstable (higher chance of getting OOMkilled):

   resources:
      requests:
        memory: "128Mi"
        cpu: "500m"
      limits:
        memory: "256Mi"
        cpu: 2

Guaranteed:

   resources:
      requests:
        memory: "128Mi"
        cpu: 2
      limits:
        memory: "128Mi"
        cpu: 2

What can potentially help with setting up resources?

With metrics-server you can see the current CPU and memory consumption of pods (and the containers inside them). Chances are you are already using it. Just run the following commands:

kubectl top pods
kubectl top pods --containers
kubectl top nodes

However, these commands only show current usage. They can give you a rough idea of the order of magnitude, but in the end you will need a history of metrics over time (to answer questions such as "What was the peak CPU load?" or "What was the load yesterday morning?"). For this you can use Prometheus, DataDog, and other tools. They simply collect the metrics from metrics-server and store them, and the user can then query and plot them.

VerticalPodAutoscaler allows you to automate this process. It tracks the CPU and memory usage history and sets new requests and limits based on this information.
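For example, a minimal VPA manifest in recommendation-only mode might look like the sketch below (it assumes the VerticalPodAutoscaler CRDs and controllers are installed in the cluster; the my-app names are hypothetical):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # only compute recommendations, never evict pods

With updateMode: "Off" you can inspect the recommendations (kubectl describe vpa my-app-vpa) and apply them by hand before trusting automatic updates.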

Efficient use of computing power is not an easy task. It's like playing Tetris all the time. If you are paying a lot for computing power while your average consumption is low (say, ~10%), we recommend looking at products based on AWS Fargate or Virtual Kubelet. They are built on a serverless/pay-per-usage billing model, which can turn out cheaper under such conditions.

2. Liveness and readiness probes

By default, liveness and readiness probes are not set in Kubernetes. And sometimes people simply forget to configure them...

But how else can you initiate a service restart in the event of a fatal error? And how does the load balancer know that a certain pod is ready to receive traffic? Or that it is capable of handling more traffic?

These two probes are often confused with each other:

  • liveness - a "survivability" probe that restarts the pod on failure;
  • readiness - a readiness probe that, on failure, disconnects the pod from the Kubernetes service (this can be checked with kubectl get endpoints), so that no traffic is sent to it until the next check completes successfully.

Both of these probes are PERFORMED DURING THE POD'S ENTIRE LIFECYCLE. This is very important.

It is a common misconception that readiness probes only run at startup so that the balancer knows the pod is ready (Ready) and can start handling traffic. However, this is only one of the ways they can be used.

Another is the ability to detect that traffic to the pod is excessive and overloads it (or that the pod is doing resource-intensive computations). In this case, a failing readiness check helps reduce the load on the pod and "cool it down". Passing the readiness check later allows the load on the pod to be increased again. In this situation (when the readiness probe fails), also failing the liveness check would be very counterproductive. Why restart a pod that is healthy and working hard?

Therefore, in some cases no probes at all are better than probes with incorrectly configured parameters. As mentioned above, if the liveness check copies the readiness check, you are in big trouble. A possible option is to configure only the readiness probe and leave the dangerous liveness probe aside.
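A minimal readiness-only configuration might look like this sketch (the /ready path, port, and timings are hypothetical and should match your application):

   containers:
     - name: my-app
       image: my-app:1.0.0
       readinessProbe:
         httpGet:
           path: /ready       # hypothetical health endpoint of the app
           port: 8080
         initialDelaySeconds: 5
         periodSeconds: 10
         failureThreshold: 3  # removed from endpoints after 3 failed checks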

Neither type of probe should fail when shared dependencies fail, otherwise this will cause a cascading failure of all pods. In other words, don't hurt yourself.

3. LoadBalancer for each HTTP service

Most likely, you have HTTP services in your cluster that you would like to expose to the outside world.

If you expose a service as type: LoadBalancer, its controller (depending on the cloud provider) will provision and negotiate an external LoadBalancer (not necessarily operating at L7, more likely at L4), and this can affect the cost (external static IPv4 address, computing power, per-second billing) due to the large number of such resources being created.

In this case, it is much more logical to use a single external load balancer and expose services as type: NodePort. Or, better yet, deploy something like nginx-ingress-controller (or traefik), which will be the only NodePort endpoint attached to the external load balancer and will route traffic within the cluster using Ingress Kubernetes resources.
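For example, routing an HTTP service through such a controller could look like this sketch (the hostname and service name are hypothetical; clusters older than Kubernetes 1.19 use the networking.k8s.io/v1beta1 API instead):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  rules:
    - host: app.example.com          # hypothetical external hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app         # a plain ClusterIP service inside the cluster
                port:
                  number: 80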

Other intra-cluster (micro)services that interact with each other can "communicate" via ClusterIP services and the built-in DNS service discovery mechanism. Just don't use their public DNS/IP, since this can affect latency and increase the cost of cloud services.

4. Autoscaling a cluster without taking its specifics into account

When adding nodes to a cluster and removing them from it, you should not rely on basic metrics like the CPU usage of those nodes. Pod scheduling has to take many constraints into account, such as pod/node affinity, taints and tolerations, resource requests, QoS, etc. Using an external autoscaler that is unaware of these nuances can lead to problems.

Imagine that a certain pod needs to be scheduled, but all available CPU has already been requested, and the pod is stuck in the Pending state. An external autoscaler looks at the average current CPU load (not the requested amount) and does not initiate a scale-out, i.e. does not add another node. As a result, the pod never gets scheduled.

Scaling in (removing a node from the cluster) is always harder to implement. Imagine you have a stateful pod (with persistent storage attached). Persistent volumes usually belong to a specific availability zone and are not replicated across the region. Thus, if an external autoscaler deletes the node with this pod, the scheduler will not be able to place the pod on another node, since it can only run in the availability zone where the persistent storage resides. The pod will hang in the Pending state.

The cluster-autoscaler is very popular in the Kubernetes community. It runs inside the cluster, supports the APIs of the major cloud providers, takes all the constraints into account, and can scale out in cases like the one described above. It is also able to scale in while respecting all the configured limits, thereby saving money (which would otherwise be spent on unused capacity).

5. Neglect of IAM/RBAC capabilities

Beware of giving machines and applications IAM users with permanent secrets. Organize temporary access using roles and service accounts instead.

We often see access keys (and secrets) hardcoded in application configuration, as well as secrets never being rotated despite Cloud IAM being available. Use IAM roles and service accounts instead of users where appropriate.


Forget about kube2iam and go straight to IAM roles for service accounts (as described in StΔ›pΓ‘n Vrany's note of the same name):

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-app-role
  name: my-serviceaccount
  namespace: default

One annotation. Not that hard, right?

Also, don't grant service accounts and instance profiles admin and cluster-admin privileges if they don't need them. This is a bit harder to implement, especially in K8s RBAC, but definitely worth the effort.

6. Don't rely on automatic anti-affinity for pods

Imagine that you have three replicas of some deployment running on one node. The node goes down, and all the replicas go down with it. Unpleasant situation, right? But why were all the replicas on the same node? Isn't Kubernetes supposed to provide high availability (HA)?!

Alas, the Kubernetes scheduler does not enforce anti-affinity rules for pods on its own. They have to be declared explicitly:

# omitted for brevity
      labels:
        app: zk
# omitted for brevity
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                    - zk
              topologyKey: "kubernetes.io/hostname"

That's it. Now the pods will be scheduled onto different nodes (this condition is checked only during scheduling, not during operation, hence the requiredDuringSchedulingIgnoredDuringExecution).

Here we are talking about podAntiAffinity across different nodes (topologyKey: "kubernetes.io/hostname") rather than across different availability zones. To implement full-fledged HA, you will have to dig deeper into this topic.
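As a sketch, spreading the same replicas across zones only requires changing the topology key (on older clusters the zone label is failure-domain.beta.kubernetes.io/zone instead of the one shown here):

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                      - zk
              topologyKey: "topology.kubernetes.io/zone"   # one replica per availability zone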

7. Ignoring PodDisruptionBudgets

Imagine you have a production workload running on a Kubernetes cluster. Periodically, nodes and the cluster itself have to be upgraded (or decommissioned). A PodDisruptionBudget (PDB) is a kind of service level agreement between cluster administrators and users.

A PDB helps avoid service disruptions caused by a lack of nodes:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper

In this example, you, as the user of the cluster, are stating to the administrators: "Hey, I have a zookeeper service, and no matter what you do, I would like at least 2 replicas of this service to always be available."

You can read more about this here.

8. Multiple Users or Environments in a Shared Cluster

Kubernetes namespaces do not provide strong isolation.

It is a common misconception that if you deploy a non-prod workload into one namespace and a prod workload into another, they will not affect each other in any way... However, some level of isolation can be achieved with resource requests/limits, quotas, and priorityClasses. Some "physical" isolation in the data plane is provided by affinities, tolerations, and taints (or nodeSelectors), but such separation is quite difficult to implement.
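For example, quotas for a namespace could look like this sketch (the namespace name and the numbers are hypothetical and should come from your own capacity planning):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: non-prod-quota
  namespace: non-prod          # hypothetical namespace
spec:
  hard:
    requests.cpu: "10"         # total CPU all pods in this namespace may request
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi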

Those who need to combine both types of workloads in the same cluster will have to live with this complexity. If there is no such need and you can afford another cluster (say, in a public cloud), it is better to do so. This achieves a much higher level of isolation.

9. externalTrafficPolicy: Cluster

Very often we see that all the traffic into the cluster comes through a NodePort-type service with the default policy externalTrafficPolicy: Cluster. This means the NodePort is open on every node in the cluster, and any of them can be used to interact with the desired service (set of pods).


At the same time, the real pods behind the aforementioned NodePort service are usually available only on a subset of those nodes. In other words, if I connect to a node that does not have the needed pod, it will forward the traffic to another node, adding a hop and increasing latency (if the nodes are in different availability zones/data centers, the latency can be quite high; in addition, egress traffic costs will grow).

On the other hand, if a Kubernetes service has the policy externalTrafficPolicy: Local, the NodePort is opened only on the nodes where the needed pods are actually running. When using an external load balancer that health-checks its endpoints (as AWS ELB does), it will send traffic only to the right nodes, which has a beneficial effect on latency, compute needs, and egress bills (and common sense dictates the same).
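A sketch of such a service (names and ports are hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: my-ingress-controller
spec:
  type: NodePort
  externalTrafficPolicy: Local   # NodePort answers only on nodes running a matching pod
  selector:
    app: my-ingress-controller
  ports:
    - port: 80
      targetPort: 8080

As a bonus, Local also preserves the client source IP, since the traffic does not get SNATed on an extra hop.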

There is a high chance that you are already using something like traefik or nginx-ingress-controller as a NodePort endpoint (or a LoadBalancer, which also relies on NodePort) for routing HTTP ingress traffic, and setting this option can significantly reduce latency for such requests.

In this publication you can learn more about externalTrafficPolicy, its advantages and disadvantages.

10. Do not get attached to clusters and do not abuse the control plane

It used to be customary to call servers by proper names: Anton, HAL9000, Colossus... Today they have been replaced by randomly generated identifiers. The habit, however, remains, and now clusters get their own names.

A typical story (based on real events): it all started with a proof of concept, so the cluster bore the proud name testing... Years have passed, and it is STILL used in production, and everyone is afraid to touch it.

There is nothing fun about clusters turning into pets, so we recommend removing them periodically, practicing disaster recovery along the way (chaos engineering will help here - transl. note). Besides, it doesn't hurt to take care of the control plane. Being afraid to touch it is not a good sign. etcd is dead? Well, you've got a serious problem!

On the other hand, you shouldn't get carried away manipulating it either. Over time, the control plane can become slow. Most likely, this is due to a large number of objects being created and never rotated (a common situation when Helm is used with default settings, since it does not rotate its state in configmaps/secrets; as a result, thousands of objects accumulate in the control plane) or to constant editing of kube-api objects (by autoscaling, CI/CD, monitoring, event logs, controllers, etc.).
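As a quick sanity check, you can count the accumulated Helm release objects; this sketch assumes the standard labels Helm puts on its state objects (Helm 2 stored releases in ConfigMaps, Helm 3 stores them in Secrets):

kubectl get configmaps --all-namespaces -l OWNER=TILLER | wc -l
kubectl get secrets --all-namespaces -l owner=helm | wc -l

If these numbers run into the thousands, it is time to limit the release history (e.g. --history-max in Helm 3) or clean up old releases.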

In addition, we recommend checking the SLA/SLO agreements of your managed Kubernetes provider and paying attention to the guarantees. A vendor may guarantee control plane availability (or that of its subcomponents), but not the p99 latency of the requests you send to it. In other words, you can run kubectl get nodes, get a response only after 10 minutes, and this will not violate the terms of the service agreement.

11. Bonus: using the latest tag

Now this one is a classic. Lately we have not encountered this technique that often, since many, taught by bitter experience, have stopped using the :latest tag and started pinning versions. Hooray!
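For illustration, here is how the image field of a pod spec evolves from risky to strict (the image name and digest are hypothetical placeholders):

# Bad: mutable, you never know what you are actually running
image: my-app:latest
# Better: a pinned version tag
image: my-app:1.4.2
# Strictest: an immutable digest
image: my-app@sha256:25a0d4...   # placeholder digest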

ECR supports image tag immutability; we recommend that you familiarize yourself with this remarkable feature.

Summary

Don't expect everything to work by a wave of the hand: Kubernetes is not a panacea. A bad application will remain bad even in Kubernetes (and will possibly get even worse). Carelessness will lead to excessive complexity and a slow, stressed control plane. In addition, you risk being left without a disaster recovery strategy. Don't expect Kubernetes to provide isolation and high availability out of the box. Spend some time making your application truly cloud native.

You can get acquainted with the unsuccessful experience of various teams in this collection of stories by Henning Jacobs.

Those who wish to supplement the list of errors provided in this article can contact us on Twitter (@MarekBartik, @MstrsObserver).


Source: habr.com
