6 entertaining system bugs in the operation of Kubernetes [and their solution]
Over the years of running Kubernetes in production, we have accumulated many interesting stories about how bugs in various system components led to unpleasant and/or incomprehensible consequences affecting the operation of containers and pods. In this article we have compiled some of the most common or interesting ones. Even if you are never lucky enough to encounter such situations, reading short detective stories like these, especially first-hand ones, is always entertaining, isn't it?
Story 1. Supercronic and a stuck Docker
On one of the clusters, we periodically observed a "hanging" Docker that interfered with the normal functioning of the cluster. At the same time, the following was observed in the Docker logs:
level=error msg="containerd: start init process" error="exit status 2: "runtime/cgo: pthread_create failed: No space left on device
SIGABRT: abort
PC=0x7f31b811a428 m=0
goroutine 0 [idle]:
goroutine 1 [running]:
runtime.systemstack_switch() /usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc420026768 sp=0xc420026760
runtime.main() /usr/local/go/src/runtime/proc.go:127 +0x6c fp=0xc4200267c0 sp=0xc420026768
runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc4200267c8 sp=0xc4200267c0
goroutine 17 [syscall, locked to thread]:
runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1
…
In this error, the message that interests us most is pthread_create failed: No space left on device. A cursory study of the documentation explained that Docker cannot fork a new process, which is why it periodically "hangs".
In monitoring, what is happening corresponds to the following picture:
The problem is this: when a task is started in supercronic, the process it spawns cannot terminate correctly and turns into a zombie.

Note: To be more precise, processes are spawned by cron tasks, but supercronic is not an init system and cannot "adopt" the processes spawned by its children. When SIGHUP or SIGTERM signals occur, they are not passed on to the child processes, so the children are never terminated and remain in zombie status. You can read more about all this, for example, in an article on the subject.
There are a couple of ways to solve this problem:

As a temporary workaround, increase the maximum number of PIDs available in the system at any one time:
/proc/sys/kernel/pid_max (since Linux 2.5.34)
This file specifies the value at which PIDs wrap around (i.e., the value in this file is one greater than the maximum PID). PIDs greater than this value are not allocated; thus, the value in this file also acts as a system-wide limit on the total number of processes and threads. The default value for this file, 32768, results in the same range of PIDs as on earlier kernels.
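As a sketch, the current limit can be inspected and raised like this (the value 4194304 is an illustrative choice, the maximum on 64-bit kernels, not a recommendation from the article):

```shell
# Inspect the current system-wide PID limit
current=$(cat /proc/sys/kernel/pid_max)
echo "pid_max=${current}"

# Raise it at runtime (requires root):
#   sysctl -w kernel.pid_max=4194304
# Persist the change across reboots:
#   echo 'kernel.pid_max = 4194304' > /etc/sysctl.d/99-pid-max.conf
```

Keep in mind this only postpones the problem: the zombies still accumulate until something reaps them.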
Or launch the tasks in supercronic not directly, but via tini, which is capable of terminating processes gracefully and does not spawn zombies.
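For the second option, a minimal sketch of how such an image might put tini in front of supercronic (the base image, file names, and paths are illustrative assumptions, not the original):

```dockerfile
FROM alpine:3.8
# tini runs as PID 1 and reaps the zombie children left behind by cron tasks
RUN apk add --no-cache tini
COPY supercronic /usr/local/bin/supercronic
COPY crontab /app/crontab
# -g forwards signals to the whole process group, so SIGTERM reaches the children
ENTRYPOINT ["/sbin/tini", "-g", "--", "/usr/local/bin/supercronic", "/app/crontab"]
```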
Story 2. "Zombies" when deleting a cgroup
Kubelet started consuming a lot of CPU:
Nobody likes this, so we armed ourselves with perf and began to investigate. The results of the investigation were as follows:

Kubelet spends more than a third of its CPU time pulling memory data from all cgroups:
In the kernel developers' mailing list you can find a discussion of this problem. In short, the point is that various tmpfs files and other similar things are not completely removed from the system when a cgroup is deleted; these are the so-called memcg zombies. Sooner or later they will be evicted from the page cache, but there is plenty of memory on the server, and the kernel sees no point in wasting time on deleting them, so they keep piling up. Why does this happen at all? This server runs cron jobs that constantly create new jobs, and with them new pods. New cgroups are created for the containers in those pods and are soon deleted.
Why does cAdvisor in kubelet spend so much time on this? It is easy to see by simply measuring the execution time of cat /sys/fs/cgroup/memory/memory.stat. On a healthy machine the operation takes 0.01 seconds, while on the problematic cron02 node it takes 1.2 seconds. The reason is that cAdvisor, which reads data from sysfs very slowly, tries to account for the memory used in zombie cgroups as well.
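One hedged way to gauge the accumulation (assuming cgroup v1; the helper name is ours): the cgroup count reported in /proc/cgroups includes not-yet-freed memcgs, so comparing it with the number of directories visible in sysfs hints at how many zombies have piled up.

```shell
# Extract the kernel's cgroup count for the memory controller from a
# /proc/cgroups-formatted file (taken as a parameter so it can be tested)
memcg_count() {
  awk '$1 == "memory" { print $3 }' "$1"
}

# On a live node (cgroup v1):
#   kernel_count=$(memcg_count /proc/cgroups)
#   visible=$(find /sys/fs/cgroup/memory -type d | wc -l)
#   echo "kernel=${kernel_count} visible=${visible}"
```

A kernel count far above the visible directory count suggests zombie memcgs.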
To forcibly remove the zombies, we tried clearing the caches as recommended on LKML: sync; echo 3 > /proc/sys/vm/drop_caches. But the kernel turned out to be more complicated than that and hung the machine.
What to do? The problem is fixed (see the commit and the description in the release notes) by updating the Linux kernel to version 4.16.
Story 3. Systemd and its mount
Again, kubelet consumes too many resources on some nodes, but this time it is memory:
It turned out that there is a problem in the systemd shipped with Ubuntu 16.04: it occurs when managing the mounts that are created for subPath volumes from ConfigMaps or Secrets. After a pod shuts down, the systemd service and its mount unit remain in the system. Over time, a huge number of them accumulate. There are even issues on this topic:
#!/bin/bash

# We only want to run on Ubuntu 16.04 (xenial)
hostrelease="/etc/lsb-release-host"
test -f ${hostrelease} && grep xenial ${hostrelease} > /dev/null || exit 0

# Sleep up to 30 minutes to spread the load across kube nodes
sleep $((RANDOM % 1800))

stoppedCount=0
# Count the subPath mount units currently known to systemd
countBefore=$(systemctl list-units | grep subpath | grep "run-" | wc -l)

# Check each unit
for unit in $(systemctl list-units | grep subpath | grep "run-" | awk '{print $1}'); do
  # Find the unit's description file (to learn which Docker container spawned this unit)
  DropFile=$(systemctl status ${unit} | grep Drop | awk -F': ' '{print $2}')
  # Read the Docker container ID from the description file
  DockerContainerId=$(cat ${DropFile}/50-Description.conf | awk '{print $5}' | cut -d/ -f6)
  # Check whether that container is still running
  checkFlag=$(docker ps | grep -c ${DockerContainerId})
  # If the container is not running, stop the unit
  if [[ ${checkFlag} -eq 0 ]]; then
    echo "Stopping unit ${unit}"
    # Stop the unit
    systemctl stop $unit
    # Counter for the logs
    ((stoppedCount++))
    # Log current progress
    echo "Stopped ${stoppedCount} systemd units out of ${countBefore}"
  fi
done
... and it runs every 5 minutes with the help of the previously mentioned supercronic, packaged in its own Docker image.
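A minimal sketch of what such a Dockerfile might contain (base image, file names, and schedule are illustrative assumptions, not the original):

```dockerfile
FROM alpine:3.8
# bash for the script, docker-cli and systemd access come from the host mounts
RUN apk add --no-cache bash docker-cli
COPY supercronic /usr/local/bin/supercronic
COPY cleanup-subpath-units.sh /app/cleanup-subpath-units.sh
# crontab entry runs the cleanup script every 5 minutes
COPY crontab /app/crontab
ENTRYPOINT ["/usr/local/bin/supercronic", "/app/crontab"]
```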
Story 4. Concurrency when pulling images

We noticed the following: if a pod is scheduled on a node and its image takes a very long time to pull, then another pod that lands on the same node simply does not start pulling its own image. Instead, it waits until the first pod's image has been pulled. As a result, a pod that was already scheduled and whose image could have been downloaded in just a minute ends up in the containerCreating status for a long time.
The events will look something like this:
Normal Pulling 8m kubelet, ip-10-241-44-128.ap-northeast-1.compute.internal pulling image "registry.example.com/infra/openvpn/openvpn:master"
It turns out that a single image from a slow registry can block deployment to the node.
Unfortunately, there are not many ways out of the situation:

Try to use your own Docker Registry directly in the cluster, or directly connected to the cluster (e.g., GitLab Registry, Nexus, etc.);
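Another knob worth mentioning (an assumption on our part, not an option listed in the article): the kubelet's --serialize-image-pulls flag, which defaults to true and makes the kubelet pull images one at a time. Setting it to false allows parallel pulls, though the kubelet documentation has warned against doing so on nodes whose Docker daemon uses the aufs storage backend:

```shell
# Allow the kubelet to pull several images in parallel instead of
# serially (the default); check your storage driver first
kubelet --serialize-image-pulls=false
```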
Story 5. Node hangs due to lack of memory

While running various applications, we also encountered a situation where a node completely ceases to be available: SSH does not respond, all monitoring daemons fall off, and then there is nothing (or almost nothing) anomalous in the logs.

I'll show you in pictures, using the example of one node where MongoDB was running.
This is what atop looked like before the incident:
And like this after the incident:
In monitoring there is also a sharp spike, at which point the node ceases to be available:
So from the screenshots you can see that:

The RAM on the machine is close to running out;

There is a sharp jump in RAM consumption, after which access to the entire machine abruptly cuts off;

A big task arrives at Mongo, which forces the DBMS process to use more memory and actively read from disk.
It turns out that if Linux runs out of free memory (memory pressure sets in) and there is no swap, then before the OOM killer arrives, an equilibrium may set in between putting pages into the page cache and writing them back to disk. This is handled by kswapd, which valiantly frees as many memory pages as possible for subsequent allocation.

Unfortunately, under a heavy I/O load coupled with a small amount of free memory, kswapd becomes the bottleneck of the whole system, because all memory page allocations (page faults) in the system are tied to it. This can go on for a very long time if processes refuse to give up memory and instead balance on the very edge of the OOM killer abyss.

A natural question arises: why does the OOM killer come so late? In its current iteration, the OOM killer is extremely simple-minded: it kills a process only when an attempt to allocate a memory page fails, i.e. when a page fault ends with an error. This does not happen for quite a long time, because kswapd valiantly frees memory pages by flushing the page cache (essentially all disk I/O in the system) back to disk. You can read more about the steps needed to eliminate such problems in the kernel here.
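A minimal sketch for spotting the described condition on a node (the helper name and the threshold are our illustrative assumptions): when there is no swap and MemAvailable is low, the node is at risk of exactly this kswapd spiral.

```shell
# mem_pressure_risk: given a /proc/meminfo-formatted file and a
# threshold in kB, print "risk" when swap is absent and available
# memory is below the threshold, otherwise "ok"
mem_pressure_risk() {
  awk -v thresh="$2" '
    /^MemAvailable:/ { avail = $2 }
    /^SwapTotal:/    { swap  = $2 }
    END { if (swap == 0 && avail < thresh) print "risk"; else print "ok" }
  ' "$1"
}

# On a live node:
#   mem_pressure_risk /proc/meminfo 524288
```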
Story 6. Pods stuck in the Pending state

In some clusters with a really large number of running pods, we began to notice that most of them hang in the Pending state for a very long time, even though the Docker containers themselves are already running on the nodes and can be worked with manually.

Moreover, there is nothing wrong in describe:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1m default-scheduler Successfully assigned sphinx-0 to ss-dev-kub07
Normal SuccessfulAttachVolume 1m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-6aaad34f-ad10-11e8-a44c-52540035a73b"
Normal SuccessfulMountVolume 1m kubelet, ss-dev-kub07 MountVolume.SetUp succeeded for volume "sphinx-config"
Normal SuccessfulMountVolume 1m kubelet, ss-dev-kub07 MountVolume.SetUp succeeded for volume "default-token-fzcsf"
Normal SuccessfulMountVolume 49s (x2 over 51s) kubelet, ss-dev-kub07 MountVolume.SetUp succeeded for volume "pvc-6aaad34f-ad10-11e8-a44c-52540035a73b"
Normal Pulled 43s kubelet, ss-dev-kub07 Container image "registry.example.com/infra/sphinx-exporter/sphinx-indexer:v1" already present on machine
Normal Created 43s kubelet, ss-dev-kub07 Created container
Normal Started 43s kubelet, ss-dev-kub07 Started container
Normal Pulled 43s kubelet, ss-dev-kub07 Container image "registry.example.com/infra/sphinx/sphinx:v1" already present on machine
Normal Created 42s kubelet, ss-dev-kub07 Created container
Normal Started 42s kubelet, ss-dev-kub07 Started container
After digging around, we assumed that the kubelet simply does not have time to send all the information about the state of the pods and the liveness/readiness probes to the API server.

And having studied the kubelet's help output, we found the following parameters:
--kube-api-qps - QPS to use while talking with kubernetes apiserver (default 5)
--kube-api-burst - Burst to use while talking with kubernetes apiserver (default 10)
--event-qps - If > 0, limit event creations per second to this value. If 0, unlimited. (default 5)
--event-burst - Maximum size of a bursty event records, temporarily allows event records to burst to this number, while still not exceeding event-qps. Only used if --event-qps > 0 (default 10)
--registry-qps - If > 0, limit registry pull QPS to this value.
--registry-burst - Maximum size of bursty pulls, temporarily allows pulls to burst to this number, while still not exceeding registry-qps. Only used if --registry-qps > 0 (default 10)
As you can see, the default values are quite small, and in 90% of cases they cover all needs... In our case, however, this was not enough, so we set significantly higher values.
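The exact numbers are not reproduced above; as an illustration only (the values are our assumptions and should be tuned to your cluster's scale), the kubelet flags might look like:

```shell
# Illustrative values only, not the ones from the article
kubelet \
  --kube-api-qps=50 \
  --kube-api-burst=100 \
  --event-qps=50 \
  --event-burst=100 \
  --registry-qps=50 \
  --registry-burst=100
```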
... and restarted the kubelets, after which we saw the following picture on the API server access graphs:
... and yes, everything started to fly!
PS
For help in collecting these bugs and preparing the article, I express my deep gratitude to the many engineers of our company, and in particular to Andrey Klimentyev (zuzzas) from our R&D team.