6 entertaining system bugs in the operation of Kubernetes [and their solution]


Over the years of running Kubernetes in production, we have accumulated plenty of interesting stories about how bugs in various system components led to unpleasant and/or puzzling consequences for the operation of containers and pods. In this article we have collected some of the most common or interesting ones. Even if you are never lucky enough to run into such situations yourself, reading short detective stories like these - especially "first-hand" ones - is always entertaining, isn't it?

Story 1. Supercronic and a stuck Docker

On one of the clusters, we periodically ran into Docker "hanging", which interfered with the normal functioning of the cluster. At the same time, the following was observed in the Docker logs:

level=error msg="containerd: start init process" error="exit status 2: "runtime/cgo: pthread_create failed: No space left on device
SIGABRT: abort
PC=0x7f31b811a428 m=0

goroutine 0 [idle]:

goroutine 1 [running]:
runtime.systemstack_switch() /usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc420026768 sp=0xc420026760
runtime.main() /usr/local/go/src/runtime/proc.go:127 +0x6c fp=0xc4200267c0 sp=0xc420026768
runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc4200267c8 sp=0xc4200267c0

goroutine 17 [syscall, locked to thread]:
runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1

…

In this error, the message we are most interested in is pthread_create failed: No space left on device. A cursory study of the documentation explained that Docker cannot fork a new process, which is why it periodically "hangs".
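
A quick way to check how close a node is to this limit (a diagnostic sketch, not part of the original investigation):

ps -eLf | wc -l                  # total number of threads (tasks) currently running on the node
cat /proc/sys/kernel/pid_max     # system-wide limit on the number of PIDs/threads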

In monitoring, what is happening corresponds to the following picture:

[monitoring graph]

A similar situation is observed on other nodes:

[monitoring graphs from other nodes]

On the same nodes we see:

root@kube-node-1 ~ # ps auxfww | grep curl -c
19782
root@kube-node-1 ~ # ps auxfww | grep curl | head
root     16688  0.0  0.0      0     0 ?        Z    Feb06   0:00      |       _ [curl] <defunct>
root     17398  0.0  0.0      0     0 ?        Z    Feb06   0:00      |       _ [curl] <defunct>
root     16852  0.0  0.0      0     0 ?        Z    Feb06   0:00      |       _ [curl] <defunct>
root      9473  0.0  0.0      0     0 ?        Z    Feb06   0:00      |       _ [curl] <defunct>
root      4664  0.0  0.0      0     0 ?        Z    Feb06   0:00      |       _ [curl] <defunct>
root     30571  0.0  0.0      0     0 ?        Z    Feb06   0:00      |       _ [curl] <defunct>
root     24113  0.0  0.0      0     0 ?        Z    Feb06   0:00      |       _ [curl] <defunct>
root     16475  0.0  0.0      0     0 ?        Z    Feb06   0:00      |       _ [curl] <defunct>
root      7176  0.0  0.0      0     0 ?        Z    Feb06   0:00      |       _ [curl] <defunct>
root      1090  0.0  0.0      0     0 ?        Z    Feb06   0:00      |       _ [curl] <defunct>

It turned out that this behavior comes from a pod running supercronic (a Go utility we use to run cron jobs in pods):

 _ docker-containerd-shim 833b60bb9ff4c669bb413b898a5fd142a57a21695e5dc42684235df907825567 /var/run/docker/libcontainerd/833b60bb9ff4c669bb413b898a5fd142a57a21695e5dc42684235df907825567 docker-runc
|   _ /usr/local/bin/supercronic -json /crontabs/cron
|       _ /usr/bin/newrelic-daemon --agent --pidfile /var/run/newrelic-daemon.pid --logfile /dev/stderr --port /run/newrelic.sock --tls --define utilization.detect_aws=true --define utilization.detect_azure=true --define utilization.detect_gcp=true --define utilization.detect_pcf=true --define utilization.detect_docker=true
|       |   _ /usr/bin/newrelic-daemon --agent --pidfile /var/run/newrelic-daemon.pid --logfile /dev/stderr --port /run/newrelic.sock --tls --define utilization.detect_aws=true --define utilization.detect_azure=true --define utilization.detect_gcp=true --define utilization.detect_pcf=true --define utilization.detect_docker=true -no-pidfile
|       _ [newrelic-daemon] <defunct>
|       _ [curl] <defunct>
|       _ [curl] <defunct>
|       _ [curl] <defunct>
…

The problem is this: when supercronic starts a task, the process it spawns cannot terminate correctly and turns into a zombie.

Note: To be more precise, the processes are spawned by cron tasks, but supercronic is not an init system and cannot "adopt" the processes spawned by its children. When SIGHUP or SIGTERM signals arrive, they are not passed on to the child processes, so the children are not terminated and remain in zombie state. You can read more about all of this, for example, in this article.

There are a couple of ways to solve this problem:

  1. As a temporary workaround, increase the maximum number of PIDs that can exist in the system at any one time:
           /proc/sys/kernel/pid_max (since Linux 2.5.34)
                  This file specifies the value at which PIDs wrap around (i.e., the value in this file is one greater than the maximum PID). PIDs greater than this value are not allocated; thus, the value in this file also acts as a system-wide limit on the total number of processes and threads. The default value for this file, 32768, results in the same range of PIDs as on earlier kernels.
  2. Or launch tasks in supercronic not directly, but via tini, which can terminate processes correctly and does not spawn zombies (see the sketch below).
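
A minimal sketch of both workarounds (the specific value and paths are illustrative):

# 1. Temporarily raise the system-wide PID limit:
sysctl -w kernel.pid_max=4194304      # or: echo 4194304 > /proc/sys/kernel/pid_max

# 2. Run supercronic under tini (e.g. as the container entrypoint), so that PID 1
#    reaps the zombies left behind by cron jobs:
tini -- /usr/local/bin/supercronic -json /crontabs/cron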

Story 2. "Zombies" when deleting a cgroup

Kubelet started consuming a lot of CPU:

[graph: kubelet CPU consumption]

No one likes this, so we armed ourselves with perf and began to deal with the problem. The results of the investigation were as follows:

  • Kubelet spends more than a third of CPU time pulling memory data from all cgroups:

    [perf profile screenshot]

  • In the kernel developers' mailing list you can find a discussion of the problem. In short, the point is that various tmpfs files and other similar things are not completely removed from the system when a cgroup is deleted - the so-called memcg zombies remain. Sooner or later they will be evicted from the page cache, but there is a lot of memory on the server and the kernel sees no point in wasting time on deleting them, so they keep piling up. Why is this happening at all? This is a server with cron jobs that constantly creates new jobs, and with them new pods. New cgroups are created for the containers in those pods and are soon deleted.
  • Why does cAdvisor in kubelet spend so much time on this? It is easy to see by simply measuring the execution time of cat /sys/fs/cgroup/memory/memory.stat. On a healthy machine this operation takes 0.01 seconds, while on the problematic cron02 it takes 1.2 seconds. The thing is that cAdvisor, which reads data from sysfs very slowly, tries to account for the memory used in the zombie cgroups as well (a quick way to reproduce this check is sketched after this list).
  • To forcibly remove the zombies, we tried clearing the caches as recommended on LKML: sync; echo 3 > /proc/sys/vm/drop_caches - but the kernel turned out to be more complicated than that and hung the machine.
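
The difference is easy to see on an affected node (a quick check, not from the original article):

time cat /sys/fs/cgroup/memory/memory.stat > /dev/null   # ~0.01s on a healthy node, over a second on the problematic one
grep memory /proc/cgroups                                 # the num_cgroups column keeps growing as memcg "zombies" accumulate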

What to do? The problem is fixed (see the commit, as well as the description in the release announcement) by updating the Linux kernel to version 4.16.

Story 3. Systemd and its mount

Once again, kubelet consumes too many resources on some nodes, but this time it is memory:

[graph: kubelet memory consumption]

It turned out that there is a problem in the systemd shipped with Ubuntu 16.04, which occurs when managing the mounts created for subPath from ConfigMaps or Secrets. After the pod finishes its work, the systemd service and its mount remain in the system. Over time, a huge number of them accumulate. There are even issues on this topic:

  1. #5916;
  2. kubernetes#57345.

... in the last of which they refer to the PR in systemd: #7811 (issue in systemd - #7798).

The problem no longer exists in Ubuntu 18.04, but if you want to keep using Ubuntu 16.04, our workaround for this may come in handy.
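
You can quickly check whether a node is affected by counting the leftover subPath mount units - the same pattern the cleanup script below looks for:

systemctl list-units | grep subpath | grep "run-" | wc -l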

So, we have made the following DaemonSet:

---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: systemd-slices-cleaner
  name: systemd-slices-cleaner
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: systemd-slices-cleaner
  template:
    metadata:
      labels:
        app: systemd-slices-cleaner
    spec:
      containers:
      - command:
        - /usr/local/bin/supercronic
        - -json
        - /app/crontab
        image: private-registry.org/systemd-slices-cleaner/systemd-slices-cleaner:v0.1.0
        imagePullPolicy: Always
        name: systemd-slices-cleaner
        resources: {}
        securityContext:
          privileged: true
        volumeMounts:
        - name: systemd
          mountPath: /run/systemd/private
        - name: docker
          mountPath: /run/docker.sock
        - name: systemd-etc
          mountPath: /etc/systemd
        - name: systemd-run
          mountPath: /run/systemd/system/
        - name: lsb-release
          mountPath: /etc/lsb-release-host
      imagePullSecrets:
      - name: antiopa-registry
      priorityClassName: cluster-low
      tolerations:
      - operator: Exists
      volumes:
      - name: systemd
        hostPath:
          path: /run/systemd/private
      - name: docker
        hostPath:
          path: /run/docker.sock
      - name: systemd-etc
        hostPath:
          path: /etc/systemd
      - name: systemd-run
        hostPath:
          path: /run/systemd/system/
      - name: lsb-release
        hostPath:
          path: /etc/lsb-release

... and it uses this script:

#!/bin/bash

# we will work only on xenial
hostrelease="/etc/lsb-release-host"
test -f ${hostrelease} && grep xenial ${hostrelease} > /dev/null || exit 0

# sleeping for up to 30 minutes to spread the load across kube-nodes
sleep $((RANDOM % 1800))

stoppedCount=0
# counting actual subpath units in systemd
countBefore=$(systemctl list-units | grep subpath | grep "run-" | wc -l)
# let's go check each unit
for unit in $(systemctl list-units | grep subpath | grep "run-" | awk '{print $1}'); do
  # finding the unit's drop-in description file (to find out which docker container spawned this unit)
  DropFile=$(systemctl status ${unit} | grep Drop | awk -F': ' '{print $2}')
  # reading uuid for docker container from description file
  DockerContainerId=$(cat ${DropFile}/50-Description.conf | awk '{print $5}' | cut -d/ -f6)
  # checking container status (running or not)
  checkFlag=$(docker ps | grep -c ${DockerContainerId})
  # if the container is not running, stop the unit
  if [[ ${checkFlag} -eq 0 ]]; then
    echo "Stopping unit ${unit}"
    # stopping the unit
    systemctl stop $unit
    # just counter for logs
    ((stoppedCount++))
    # logging current progress
    echo "Stopped ${stoppedCount} systemd units out of ${countBefore}"
  fi
done

... and it runs every 5 minutes with the help of the previously mentioned supercronic. Its Dockerfile looks like this:

FROM ubuntu:16.04
COPY rootfs /
WORKDIR /app
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y gnupg curl apt-transport-https software-properties-common wget
RUN add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable" && \
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - && \
    apt-get update && \
    apt-get install -y docker-ce=17.03.0*
RUN wget https://github.com/aptible/supercronic/releases/download/v0.1.6/supercronic-linux-amd64 -O \
    /usr/local/bin/supercronic && chmod +x /usr/local/bin/supercronic
ENTRYPOINT ["/bin/bash", "-c", "/usr/local/bin/supercronic -json /app/crontab"]
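
For completeness, the /app/crontab read by supercronic might look something like this (a sketch: the script name is a placeholder, only the 5-minute schedule comes from the setup described above):

# /app/crontab
*/5 * * * * /app/cleanup-subpath-units.sh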

Story 4. Concurrency in pod scheduling

We noticed that if a pod is scheduled on a node and its image takes a very long time to pull, then another pod that lands on the same node simply does not start pulling its own image. Instead, it waits until the first pod's image has been pulled. As a result, a pod that has already been scheduled and whose image could have been downloaded in just a minute ends up in the ContainerCreating status for a long time.

The events will look something like this:

Normal  Pulling    8m    kubelet, ip-10-241-44-128.ap-northeast-1.compute.internal  pulling image "registry.example.com/infra/openvpn/openvpn:master"

It turns out that a single image from a slow registry can block deployment to the node.
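
One way to see how long pods spend waiting on such pulls is to look at the pull-related events (a sketch; the pod and namespace names are placeholders):

kubectl -n infra describe pod openvpn-0 | grep -A10 Events
kubectl get events --all-namespaces --field-selector reason=Pulling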

Unfortunately, there are not many ways out of the situation:

  1. Use your own Docker registry directly in the cluster or directly alongside it (e.g. GitLab Registry, Nexus, etc.);
  2. Take advantage of utilities such as kraken.

Story 5. Hanging nodes due to lack of memory

While operating various applications, we also ran into situations where a node completely stops being available: SSH does not respond, all monitoring daemons fall off, and then there is nothing (or almost nothing) abnormal in the logs.

I'll show it in pictures using the example of one node where MongoDB was running.

This is what atop looked like before the accident:

[screenshot: atop before the accident]

And this is what it looked like after the accident:

[screenshot: atop after the accident]

Monitoring also shows a sharp spike, at which point the node stops being available:

[monitoring graph]

So from the screenshots you can see that:

  1. The RAM on the machine is close to running out;
  2. There is a sharp jump in RAM consumption, after which access to the entire machine is abruptly lost;
  3. A large task arrives at Mongo, forcing the DBMS process to use more memory and actively read from disk.

It turns out that if Linux runs out of free memory (memory pressure sets in) and there is no swap, then before the OOM killer arrives, the system may settle into a balancing act between putting pages into the page cache and writing them back to disk. This is handled by kswapd, which valiantly frees up as many memory pages as possible for subsequent allocation.

Unfortunately, under a heavy I/O load combined with a small amount of free memory, kswapd becomes the bottleneck of the entire system, because all allocations (page faults) of memory pages in the system are tied to it. This can go on for a very long time if the processes no longer need more memory but stay fixed at the very edge of the OOM-killer abyss.

A natural question arises: why does the OOM killer come so late? In its current iteration, the OOM killer is extremely simple-minded: it kills a process only when an attempt to allocate a memory page fails, i.e. when a page fault ends with an error. This does not happen for quite a while, because kswapd valiantly frees memory pages by flushing the page cache (essentially all of the disk I/O in the system) back to disk. You can read more about this, with a description of the steps needed to eliminate such problems in the kernel, here.
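
On a node that is still responsive, this condition is fairly easy to spot (a diagnostic sketch, not from the original investigation):

vmstat 1 5                                 # free memory near zero while block I/O (bi/bo) stays high
pidstat -u -p "$(pgrep -d, kswapd)" 1 5    # kswapd burning CPU is the telltale sign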

This behavior should improve with Linux 4.6+ kernel.

Story 6. Pods get stuck in the Pending state

In some clusters with a really large number of pods, we started to notice that most of them hang in the Pending state for a very long time, even though the Docker containers themselves are already running on the nodes and can be worked with manually.

At the same time, there is nothing wrong in describe:

  Type    Reason                  Age                From                     Message
  ----    ------                  ----               ----                     -------
  Normal  Scheduled               1m                 default-scheduler        Successfully assigned sphinx-0 to ss-dev-kub07
  Normal  SuccessfulAttachVolume  1m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-6aaad34f-ad10-11e8-a44c-52540035a73b"
  Normal  SuccessfulMountVolume   1m                 kubelet, ss-dev-kub07    MountVolume.SetUp succeeded for volume "sphinx-config"
  Normal  SuccessfulMountVolume   1m                 kubelet, ss-dev-kub07    MountVolume.SetUp succeeded for volume "default-token-fzcsf"
  Normal  SuccessfulMountVolume   49s (x2 over 51s)  kubelet, ss-dev-kub07    MountVolume.SetUp succeeded for volume "pvc-6aaad34f-ad10-11e8-a44c-52540035a73b"
  Normal  Pulled                  43s                kubelet, ss-dev-kub07    Container image "registry.example.com/infra/sphinx-exporter/sphinx-indexer:v1" already present on machine
  Normal  Created                 43s                kubelet, ss-dev-kub07    Created container
  Normal  Started                 43s                kubelet, ss-dev-kub07    Started container
  Normal  Pulled                  43s                kubelet, ss-dev-kub07    Container image "registry.example.com/infra/sphinx/sphinx:v1" already present on machine
  Normal  Created                 42s                kubelet, ss-dev-kub07    Created container
  Normal  Started                 42s                kubelet, ss-dev-kub07    Started container

After digging around, we assumed that the kubelet simply does not have time to send all the information about the state of the pods and the liveness/readiness probes to the API server.

And having studied the kubelet help, we found the following parameters:

--kube-api-qps - QPS to use while talking with kubernetes apiserver (default 5)
--kube-api-burst  - Burst to use while talking with kubernetes apiserver (default 10) 
--event-qps - If > 0, limit event creations per second to this value. If 0, unlimited. (default 5)
--event-burst - Maximum size of a bursty event records, temporarily allows event records to burst to this number, while still not exceeding event-qps. Only used if --event-qps > 0 (default 10) 
--registry-qps - If > 0, limit registry pull QPS to this value.
--registry-burst - Maximum size of bursty pulls, temporarily allows pulls to burst to this number, while still not exceeding registry-qps. Only used if --registry-qps > 0 (default 10)

As you can see, the default values are quite small, and in 90% of cases they cover all needs... In our case, however, this was not enough. Therefore, we set the following values:

--event-qps=30 --event-burst=40 --kube-api-burst=40 --kube-api-qps=30 --registry-qps=30 --registry-burst=40

... and restarted the kubelets, after which we saw the following picture on the graphs for accessing the API server:

[graphs: requests to the API server]

... and yes, everything started to fly!
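
For reference, on a kubeadm-provisioned node these flags could be applied via a systemd drop-in, roughly like this (a sketch; the drop-in path and the KUBELET_EXTRA_ARGS variable are assumptions that depend on how the kubelet is launched):

mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/30-qps.conf <<'EOF'
[Service]
Environment="KUBELET_EXTRA_ARGS=--event-qps=30 --event-burst=40 --kube-api-qps=30 --kube-api-burst=40 --registry-qps=30 --registry-burst=40"
EOF
systemctl daemon-reload && systemctl restart kubelet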

PS

For help in collecting these bugs and preparing the article, I express my deep gratitude to the numerous engineers of our company, and in particular to my colleague from our R&D team, Andrey Klimentyev (zuzzas).


Source: habr.com
