![6 entertaining system bugs in the operation of Kubernetes [and their solution]](/wp-content/uploads/2019/03/bed059552ed86580939aa18fbdf1553e.jpg)
Over the years of using Kubernetes in production, we have accumulated a lot of interesting stories about how bugs in various system components led to unpleasant and / or incomprehensible consequences that affect the operation of containers and pods. In this article, we have compiled some of the most common or interesting ones. Even if you are never lucky enough to encounter such situations, reading about such short detective stories - especially "first-hand" - is always entertaining, isn't it? ..
Story 1. Supercronic and a stuck Docker
On one of the clusters, we periodically received a “hanging” Docker, which interfered with the normal functioning of the cluster. At the same time, the following was observed in the Docker logs
level=error msg="containerd: start init process" error="exit status 2: "runtime/cgo: pthread_create failed: No space left on device
SIGABRT: abort
PC=0x7f31b811a428 m=0
goroutine 0 [idle]:
goroutine 1 [running]:
runtime.systemstack_switch() /usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc420026768 sp=0xc420026760
runtime.main() /usr/local/go/src/runtime/proc.go:127 +0x6c fp=0xc4200267c0 sp=0xc420026768
runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc4200267c8 sp=0xc4200267c0
goroutine 17 [syscall, locked to thread]:
runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1
… In this error, we are most interested in the message: pthread_create failed: No space left on device. A cursory study explained that Docker cannot fork the process, which is why it periodically “hangs”.
In monitoring, what is happening corresponds to the following picture:
![6 entertaining system bugs in the operation of Kubernetes [and their solution]](/wp-content/uploads/2019/03/bd778052c87b338493bae54b26830ef3.jpg)
A similar situation is observed on other nodes:
![6 entertaining system bugs in the operation of Kubernetes [and their solution]](/wp-content/uploads/2019/03/ef512532a95ca982e4342071115dbe9f.jpg)
![6 entertaining system bugs in the operation of Kubernetes [and their solution]](/wp-content/uploads/2019/03/43c32ebca78755dde348ed5e7ac75c79.jpg)
On the same nodes we see:
root@kube-node-1 ~ # ps auxfww | grep curl -c
19782
root@kube-node-1 ~ # ps auxfww | grep curl | head
root 16688 0.0 0.0 0 0 ? Z Feb06 0:00 | _ [curl] <defunct>
root 17398 0.0 0.0 0 0 ? Z Feb06 0:00 | _ [curl] <defunct>
root 16852 0.0 0.0 0 0 ? Z Feb06 0:00 | _ [curl] <defunct>
root 9473 0.0 0.0 0 0 ? Z Feb06 0:00 | _ [curl] <defunct>
root 4664 0.0 0.0 0 0 ? Z Feb06 0:00 | _ [curl] <defunct>
root 30571 0.0 0.0 0 0 ? Z Feb06 0:00 | _ [curl] <defunct>
root 24113 0.0 0.0 0 0 ? Z Feb06 0:00 | _ [curl] <defunct>
root 16475 0.0 0.0 0 0 ? Z Feb06 0:00 | _ [curl] <defunct>
root 7176 0.0 0.0 0 0 ? Z Feb06 0:00 | _ [curl] <defunct>
root 1090 0.0 0.0 0 0 ? Z Feb06 0:00 | _ [curl] <defunct>It turned out that this behavior is a consequence of the pod's work with (a Go utility we use to run cron jobs in pods):
_ docker-containerd-shim 833b60bb9ff4c669bb413b898a5fd142a57a21695e5dc42684235df907825567 /var/run/docker/libcontainerd/833b60bb9ff4c669bb413b898a5fd142a57a21695e5dc42684235df907825567 docker-runc
| _ /usr/local/bin/supercronic -json /crontabs/cron
| _ /usr/bin/newrelic-daemon --agent --pidfile /var/run/newrelic-daemon.pid --logfile /dev/stderr --port /run/newrelic.sock --tls --define utilization.detect_aws=true --define utilization.detect_azure=true --define utilization.detect_gcp=true --define utilization.detect_pcf=true --define utilization.detect_docker=true
| | _ /usr/bin/newrelic-daemon --agent --pidfile /var/run/newrelic-daemon.pid --logfile /dev/stderr --port /run/newrelic.sock --tls --define utilization.detect_aws=true --define utilization.detect_azure=true --define utilization.detect_gcp=true --define utilization.detect_pcf=true --define utilization.detect_docker=true -no-pidfile
| _ [newrelic-daemon] <defunct>
| _ [curl] <defunct>
| _ [curl] <defunct>
| _ [curl] <defunct>
…The problem is this: when a task is started in supercronic, the process spawned by it cannot complete correctly, turning into .
Note: To be more precise, processes are spawned by cron tasks, but supercronic is not an init system and cannot "adopt" the processes that spawned by its children. When SIGHUP or SIGTERM signals occur, they are not passed to child processes, resulting in child processes not being terminated, remaining in zombie status. You can read more about all this, for example, in .
There are a couple of ways to solve problems:
- As a temporary workaround - increase the number of PIDs in the system at a single point in time:
/proc/sys/kernel/pid_max (since Linux 2.5.34) This file specifies the value at which PIDs wrap around (i.e., the value in this file is one greater than the maximum PID). PIDs greater than this value are not allo‐ cated; thus, the value in this file also acts as a system-wide limit on the total number of processes and threads. The default value for this file, 32768, results in the same range of PIDs as on earlier kernels - Or make the launch of tasks in supercronic not directly, but using the same , which is capable of gracefully terminating processes and not spawning zombies.
Story 2. "Zombies" when deleting a cgroup
Kubelet started consuming a lot of CPU:
![6 entertaining system bugs in the operation of Kubernetes [and their solution]](/wp-content/uploads/2019/03/6140058330faaa3785b089dcba857056.jpg)
No one likes this, so we armed ourselves and began to deal with the problem. The results of the investigation were as follows:
- Kubelet spends more than a third of CPU time pulling memory data from all cgroups:
![6 entertaining system bugs in the operation of Kubernetes [and their solution]](data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%20600%20241'%3E%3C/svg%3E)
- In the kernel developers mailing list, you can find . In short, the point is that various tmpfs files and other similar things are not completely removed from the system when deleting a cgroup, the so-called zombie. Sooner or later they will still be deleted from the page cache, however, there is a lot of memory on the server and the kernel does not see the point in wasting time deleting them. So they keep piling up. Why is this even happening? This is a server with cron jobs that constantly creates new jobs, and with them new pods. Thus, new cgroups are created for containers in them, which are soon deleted.
- Why does cAdvisor in kubelet spend so much time? This is easy to see by the simplest execution
time cat /sys/fs/cgroup/memory/memory.stat. If on a healthy machine the operation takes 0,01 seconds, then on a problem cron02 it takes 1,2 seconds. The thing is that cAdvisor, which reads data from sysfs very slowly, tries to take into account the used memory in zombie cgroups as well. - To forcibly remove the zombies, we tried clearing the caches as recommended by LKML:
sync; echo 3 > /proc/sys/vm/drop_caches, - but the core turned out to be more complicated and hung the car.
What to do? The problem is being fixed, and see description in ) kernel update Linux up to version 4.16.
Story 3. Systemd and its mount
Again, kubelet consumes too many resources on some nodes, but this time it is more memory:
![6 entertaining system bugs in the operation of Kubernetes [and their solution]](/wp-content/uploads/2019/03/044c4e23a772c61a6206b9b20aa67c1d.jpg)
It turned out that there was a problem with systemd used in Ubuntu 16.04, and it occurs when managing mounts that are created for connection subPath from ConfigMap'ov or secret'ov. After pod shutdown the systemd service and its service mount remain in system. Over time, they accumulate a huge amount. There are even issues on this topic:
- ;
- .
... in the last of which they refer to the PR in systemd: (issue in systemd - ).
The problem is no longer there Ubuntu 18.04, but if you want to continue using Ubuntu 16.04, you might find our workaround on this topic useful.
So, we have made the following DaemonSet:
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
labels:
app: systemd-slices-cleaner
name: systemd-slices-cleaner
namespace: kube-system
spec:
updateStrategy:
type: RollingUpdate
selector:
matchLabels:
app: systemd-slices-cleaner
template:
metadata:
labels:
app: systemd-slices-cleaner
spec:
containers:
- command:
- /usr/local/bin/supercronic
- -json
- /app/crontab
Image: private-registry.org/systemd-slices-cleaner/systemd-slices-cleaner:v0.1.0
imagePullPolicy: Always
name: systemd-slices-cleaner
resources: {}
securityContext:
privileged: true
volumeMounts:
- name: systemd
mountPath: /run/systemd/private
- name: docker
mountPath: /run/docker.sock
- name: systemd-etc
mountPath: /etc/systemd
- name: systemd-run
mountPath: /run/systemd/system/
- name: lsb-release
mountPath: /etc/lsb-release-host
imagePullSecrets:
- name: antiopa-registry
priorityClassName: cluster-low
tolerations:
- operator: Exists
volumes:
- name: systemd
hostPath:
path: /run/systemd/private
- name: docker
hostPath:
path: /run/docker.sock
- name: systemd-etc
hostPath:
path: /etc/systemd
- name: systemd-run
hostPath:
path: /run/systemd/system/
- name: lsb-release
hostPath:
path: /etc/lsb-release... and it uses this script:
#!/bin/bash
# we will work only on xenial
hostrelease="/etc/lsb-release-host"
test -f ${hostrelease} && grep xenial ${hostrelease} > /dev/null || exit 0
# sleeping max 30 minutes to dispense load on kube-nodes
sleep $((RANDOM % 1800))
stoppedCount=0
# counting actual subpath units in systemd
countBefore=$(systemctl list-units | grep subpath | grep "run-" | wc -l)
# let's go check each unit
for unit in $(systemctl list-units | grep subpath | grep "run-" | awk '{print $1}'); do
# finding description file for unit (to find out docker container, who born this unit)
DropFile=$(systemctl status ${unit} | grep Drop | awk -F': ' '{print $2}')
# reading uuid for docker container from description file
DockerContainerId=$(cat ${DropFile}/50-Description.conf | awk '{print $5}' | cut -d/ -f6)
# checking container status (running or not)
checkFlag=$(docker ps | grep -c ${DockerContainerId})
# if container not running, we will stop unit
if [[ ${checkFlag} -eq 0 ]]; then
echo "Stopping unit ${unit}"
# stoping unit in action
systemctl stop $unit
# just counter for logs
((stoppedCount++))
# logging current progress
echo "Stopped ${stoppedCount} systemd units out of ${countBefore}"
fi
done... and it runs every 5 minutes with the help of the previously mentioned supercronic. Its Dockerfile looks like this:
FROM ubuntu:16.04
COPY rootfs /
WORKDIR /app
RUN apt-get update &&
apt-get upgrade -y &&
apt-get install -y gnupg curl apt-transport-https software-properties-common wget
RUN add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable" &&
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - &&
apt-get update &&
apt-get install -y docker-ce=17.03.0*
RUN wget https://github.com/aptible/supercronic/releases/download/v0.1.6/supercronic-linux-amd64 -O
/usr/local/bin/supercronic && chmod +x /usr/local/bin/supercronic
ENTRYPOINT ["/bin/bash", "-c", "/usr/local/bin/supercronic -json /app/crontab"]Story 4: Concurrency in Pod Planning
It was noticed that: if we have a pod placed on a node and its image is pumped out for a very long time, then another pod that "hit" on the same node simply does not start pulling new pod image. Instead, it waits for the previous pod's image to be 'pulled'. As a result, a pod that was already planned and whose image could be downloaded in just a minute will end up in the status for a long time. containerCreating.
The events will look something like this:
Normal Pulling 8m kubelet, ip-10-241-44-128.ap-northeast-1.compute.internal pulling image "registry.example.com/infra/openvpn/openvpn:master"It turns out that a single image from a slow registry can block deployment to the node.
Unfortunately, there are not many ways out of the situation:
- Try to use your Docker Registry directly in the cluster or directly with the cluster (eg GitLab Registry, Nexus, etc.);
- Take advantage of utilities such as .
Story 5. Hanging nodes due to lack of memory
During the operation of various applications, we also received a situation when the node completely ceases to be available: SSH does not respond, all monitoring daemons fall off, and then there is nothing (or almost nothing) abnormal in the logs.
I’ll tell you in pictures using the example of one node where MongoDB functioned.
This is what atop looks like until accidents:
![6 entertaining system bugs in the operation of Kubernetes [and their solution]](/wp-content/uploads/2019/03/5de916d270a862cbcbb5ed23c31f698e.jpg)
And like this - after accidents:
![6 entertaining system bugs in the operation of Kubernetes [and their solution]](/wp-content/uploads/2019/03/0f32bf1113204cf19f4639a297e40348.jpg)
In monitoring, there is also a sharp jump, at which the node ceases to be available:
![6 entertaining system bugs in the operation of Kubernetes [and their solution]](/wp-content/uploads/2019/03/31e770cac5be32bb7f95cfbbc6b9f1ae.jpg)
So from the screenshots you can see that:
- The RAM on the machine is close to running out;
- There is a sharp jump in the consumption of RAM, after which access to the entire machine is abruptly disabled;
- A big task arrives on Mongo, which causes the DBMS process to use more memory and actively read from disk.
It turns out that if in Linux free memory runs out (memory pressure occurs) and there is no swap, then until the arrival of the OOM killer, there may be an equilibrium between throwing pages into the page cache and writing them back to disk. This is handled by kswapd, which valiantly frees up as many memory pages as possible for later allocation.
Unfortunately, with a heavy I/O load coupled with a small amount of free memory, kswapd becomes the bottleneck of the whole systembecause they are tied to it all allocations (page faults) of memory pages in the system. This can go on for a very long time if the processes do not want to use memory anymore, but fixate on the very edge of the OOM killer abyss.
The question is natural: why does the OOM killer come so late? In its current iteration, the OOM killer is extremely stupid: it will kill the process only when the attempt to allocate a memory page fails, i.e. if page fault passes with an error. This doesn't happen for long enough because kswapd valiantly frees pages of memory by flushing the page cache (all disk I/O on the system, essentially) back to disk. In more detail, with a description of the steps necessary to eliminate such problems in the kernel, you can read .
This behavior with core Linux 4.6 +.
Story 6. Pods get stuck in the Pending state
In some clusters that have a really large number of pods running, we started to notice that most of them hang in the state for a very long time. Pending, although the Docker containers themselves are already running on the nodes and you can manually work with them.
With this in describe there is nothing wrong:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1m default-scheduler Successfully assigned sphinx-0 to ss-dev-kub07
Normal SuccessfulAttachVolume 1m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-6aaad34f-ad10-11e8-a44c-52540035a73b"
Normal SuccessfulMountVolume 1m kubelet, ss-dev-kub07 MountVolume.SetUp succeeded for volume "sphinx-config"
Normal SuccessfulMountVolume 1m kubelet, ss-dev-kub07 MountVolume.SetUp succeeded for volume "default-token-fzcsf"
Normal SuccessfulMountVolume 49s (x2 over 51s) kubelet, ss-dev-kub07 MountVolume.SetUp succeeded for volume "pvc-6aaad34f-ad10-11e8-a44c-52540035a73b"
Normal Pulled 43s kubelet, ss-dev-kub07 Container image "registry.example.com/infra/sphinx-exporter/sphinx-indexer:v1" already present on machine
Normal Created 43s kubelet, ss-dev-kub07 Created container
Normal Started 43s kubelet, ss-dev-kub07 Started container
Normal Pulled 43s kubelet, ss-dev-kub07 Container image "registry.example.com/infra/sphinx/sphinx:v1" already present on machine
Normal Created 42s kubelet, ss-dev-kub07 Created container
Normal Started 42s kubelet, ss-dev-kub07 Started containerAfter digging around, we made the assumption that the kubelet simply does not have time to send all the information about the state of the pods, liveness / readiness samples to the API server.
And having studied help, we found the following parameters:
--kube-api-qps - QPS to use while talking with kubernetes apiserver (default 5)
--kube-api-burst - Burst to use while talking with kubernetes apiserver (default 10)
--event-qps - If > 0, limit event creations per second to this value. If 0, unlimited. (default 5)
--event-burst - Maximum size of a bursty event records, temporarily allows event records to burst to this number, while still not exceeding event-qps. Only used if --event-qps > 0 (default 10)
--registry-qps - If > 0, limit registry pull QPS to this value.
--registry-burst - Maximum size of bursty pulls, temporarily allows pulls to burst to this number, while still not exceeding registry-qps. Only used if --registry-qps > 0 (default 10)As seen, default values are quite small, and in 90% they cover all needs ... However, in our case, this was not enough. Therefore, we set the following values:
--event-qps=30 --event-burst=40 --kube-api-burst=40 --kube-api-qps=30 --registry-qps=30 --registry-burst=40... and restarted the kubelets, after which we saw the following picture on the graphs for accessing the API server:
![6 entertaining system bugs in the operation of Kubernetes [and their solution]](/wp-content/uploads/2019/03/b2ae099729e55a686f6bec3012b96195.jpg)
... and yes, everything started to fly!
P.S.
For help in collecting bugs and preparing the article, I express my deep gratitude to the numerous engineers of our company, and in particular to our colleague from our R&D team Andrey Klimentyev ().
P.P.S
Read also on our blog:
- «».
- Kubernetes tips & tricks cycle:
- «";
- «";
- «";
- «».
Source: habr.com

![6 entertaining system bugs in the operation of Kubernetes [and their solution]](/wp-content/uploads/2019/03/0d15d1de17cd6838fc1cad19615af218.jpg)