Debugging network delays in Kubernetes


Kubernetes was already discussed on the official GitHub blog a couple of years ago. It has since become the standard technology for deploying services and now manages a large share of both internal and public-facing services. As our clusters grew and performance requirements became more stringent, we began to notice that some services running on Kubernetes sporadically experienced latency that could not be explained by the load of the application itself.

Essentially, applications experienced seemingly random network latency of 100 ms or more, resulting in timeouts and retries. Services were expected to respond to requests much faster than 100 ms, which is impossible when establishing the connection alone takes that long. Separately, we saw very fast MySQL queries that should take milliseconds, and MySQL did complete them in milliseconds, yet from the point of view of the requesting application the response took 100 ms or more.

It quickly became clear that the problem only occurs when connecting to a Kubernetes node, even if the call comes from outside Kubernetes. The easiest way to reproduce it is with a Vegeta test that runs from any internal host, hits a Kubernetes service on a specific port, and sporadically records high latency. In this article we will look at how we tracked down the cause of this problem.

Eliminating unnecessary complexity in the path to the failure

With a reproducible example in hand, we wanted to narrow the focus of the problem and strip away unnecessary layers of complexity. Initially there were too many elements in the path between Vegeta and the pods on Kubernetes; to identify a deeper network problem, some of them had to be excluded.


The client (Vegeta) creates a TCP connection to some node in the cluster. Kubernetes operates an overlay network (on top of the existing data center network) that uses IPIP, i.e. it encapsulates the overlay network's IP packets inside the data center's IP packets. When a connection arrives at the first host, stateful Network Address Translation (NAT) is performed to translate the Kubernetes host's IP address and port into an IP address and port on the overlay network (specifically, of the pod running the application). For returning packets, the reverse sequence of steps is performed. It is a complex system with a lot of state and many moving parts that are constantly updated and changed as services are deployed and moved.
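If you want to look at this NAT machinery on a node yourself, the rules and the live translation state are easy to inspect. A minimal sketch, assuming kube-proxy runs in its iptables mode and using the host port from the test below (the conntrack tool may need to be installed separately):

# NAT rules kube-proxy installed for node ports, then the live connection-tracking entries
sudo iptables -t nat -L KUBE-NODEPORTS -n | grep 30927
sudo conntrack -L -p tcp | grep 30927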

Running tcpdump during the Vegeta test showed that the delay occurs during the TCP handshake (between SYN and SYN-ACK). To strip away unnecessary complexity, we can use hping3 for simple "pings" with SYN packets: we check whether the response packet is delayed and then reset the connection. By filtering the data to include only packets over 100 ms, we get a much simpler way to reproduce the problem than a full layer-7 Vegeta test. Here are "pings" of a Kubernetes node using TCP SYN/SYN-ACK on the service's "host port" (30927) at 10 ms intervals, filtered to the slowest responses:

theojulienne@shell ~ $ sudo hping3 172.16.47.27 -S -p 30927 -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms

We can immediately make a first observation: judging by the sequence numbers and timings, these are not one-off spikes. The latency often accumulates and is eventually worked off in a batch.

Next, we want to find out which components could be involved in creating this congestion. Maybe it's one of the hundreds of iptables NAT rules? Or some problem with IPIP tunneling on the network? One way to check is to test each step of the system by excluding it. What happens if we remove the NAT and firewall logic, leaving only the IPIP part?


Luckily, Linux makes it easy to reach the overlay IP directly if the machine is on the same network:

theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms

Judging by the results, the problem still remains! This rules out iptables and NAT. So the problem is with TCP? Let's see how a regular ICMP ping goes:

theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=28 ip=10.125.20.64 ttl=64 id=42594 icmp_seq=104 rtt=110.0 ms
len=28 ip=10.125.20.64 ttl=64 id=49448 icmp_seq=4022 rtt=141.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49449 icmp_seq=4023 rtt=131.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49450 icmp_seq=4024 rtt=121.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49451 icmp_seq=4025 rtt=111.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49452 icmp_seq=4026 rtt=101.1 ms
len=28 ip=10.125.20.64 ttl=64 id=50023 icmp_seq=4343 rtt=126.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50024 icmp_seq=4344 rtt=116.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50025 icmp_seq=4345 rtt=106.8 ms
len=28 ip=10.125.20.64 ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms

The results show that the problem has not gone away. Perhaps it's the IPIP tunnel? Let's simplify the test even further.


Is any packet sent directly between these two hosts affected?

theojulienne@kube-node-client ~ $ sudo hping3 172.16.47.27 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=46 ip=172.16.47.27 ttl=61 id=41127 icmp_seq=12564 rtt=140.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41128 icmp_seq=12565 rtt=130.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41129 icmp_seq=12566 rtt=120.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41130 icmp_seq=12567 rtt=110.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41131 icmp_seq=12568 rtt=100.7 ms
len=46 ip=172.16.47.27 ttl=61 id=9062 icmp_seq=31443 rtt=134.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9063 icmp_seq=31444 rtt=124.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9064 icmp_seq=31445 rtt=114.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms

We have simplified the situation to two Kubernetes nodes sending each other any packet, even an ICMP ping. They still see latency if the target host is "bad" (some worse than others).

Now for the last question: why does the latency only occur on kube-node servers? And does it happen when kube-node is the sender or the receiver? Luckily, this is pretty easy to figure out too, by sending a packet from a host outside of Kubernetes, but with the same "known bad" destination. As you can see, the problem has not disappeared:

theojulienne@shell ~ $ sudo hping3 172.16.47.27 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms

Next, we run the same requests from the previous source kube-node to an external host (which rules out the source host, since a ping includes both an RX and a TX component):

theojulienne@kube-node-client ~ $ sudo hping3 172.16.33.44 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}.'
^C
--- 172.16.33.44 hping statistic ---
22352 packets transmitted, 22350 packets received, 1% packet loss
round-trip min/avg/max = 0.2/7.6/1010.6 ms

Examining packet captures of the delayed packets gave us some additional information. In particular, the sender (bottom capture) sees the delay, while the receiver (top capture) does not, as the Delta column (in seconds) showed.


In addition, looking at the ordering of TCP and ICMP packets (by sequence number) on the receiver side: ICMP packets always arrive in the same order in which they were sent, just with different timings, while TCP packets are sometimes interleaved and some of them stall. In particular, if you examine the source ports of the SYN packets, they are in order on the sender side but not on the receiver side.
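One way to see this for yourself is to capture only the SYN packets on both ends and compare the order of the source ports. A rough sketch, with the interface name and port taken as assumptions from the test above:

# Run on the sender and on the receiver, then compare the source-port order of the captured SYNs
sudo tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-syn != 0 and dst port 9876'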

There is a subtle difference in how the network cards of modern servers (such as those in our data center) handle packets containing TCP versus ICMP. When a packet arrives, the network adapter hashes it "per connection", that is, it tries to split connections across receive queues and dispatch each queue to a separate CPU core. For TCP, this hash includes the source and destination IP addresses and ports, so each connection is (potentially) hashed differently. For ICMP, only the IP addresses are hashed, since there are no ports.
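The number of RX queues and the fields the NIC hashes on can be checked with ethtool. A sketch, assuming the interface is called eth0 (driver support for these queries varies):

sudo ethtool -l eth0                     # how many RX/TX queues the adapter exposes
sudo ethtool -n eth0 rx-flow-hash tcp4   # fields hashed for TCP/IPv4: IP addresses and L4 ports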

Another new observation: during these periods we see delays on all ICMP traffic between the two hosts, but not on all TCP connections. This points to the hashing of the NIC RX queues: the congestion almost certainly occurs in the processing of received packets, not in sending the responses.

This rules out sending as the cause. We now know that the problem is in packet processing on the receive side on some kube-node servers.

Understanding Packet Handling in the Linux Kernel

To understand why the receiver is having a problem on some kube-node servers, let's take a look at how the Linux kernel handles packets.

Going back to the simplest, traditional implementation: the network card receives a packet and sends an interrupt to the Linux kernel to signal that there is a packet to process. The kernel stops other work, switches context to the interrupt handler, processes the packet, and then returns to what it was doing.


This context switching is slow: the latency may not have been noticeable on 10Mbit NICs in the 90s, but on modern 10G cards with a maximum throughput of 15Mpps, every core of a small server can be interrupted millions of times per second.

To avoid handling interrupts all the time, many years ago Linux added NAPI, the networking API that all modern drivers use to improve performance at high packet rates. At low rates, the kernel still accepts interrupts from the network card in the old way. Once enough packets arrive to exceed a threshold, the kernel disables interrupts and instead starts polling the network adapter, picking up packets in batches. Processing happens in softirq, i.e. in a software interrupt context after system calls and hardware interrupts, while the kernel (as opposed to user space) is already running.
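Per-CPU softirq activity is visible in /proc/softirqs; the NET_RX row is where this NAPI receive processing is accounted. A quick way to look at it:

head -n1 /proc/softirqs        # CPU column headers
grep NET_RX /proc/softirqs     # receive softirq count per CPU; a core servicing a busy RX queue stands out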


This is much faster, but it creates a different problem. If there are too many packets, all the time is spent processing packets from the network card, and user-space processes do not get a chance to actually drain these queues (read from TCP connections, etc.). Eventually the queues fill up and we start dropping packets. Trying to strike a balance, the kernel sets a budget for the maximum number of packets processed in the softirq context. Once this budget is exceeded, a separate thread, ksoftirqd (you will see one per core in ps), is woken up to handle these softirqs outside the normal syscall/interrupt path. This thread is scheduled by the standard process scheduler, which tries to allocate resources fairly.
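A few quick ways to see these pieces on a node, as a sketch (the softnet_stat column layout described here matches recent kernels, but check the documentation for yours):

ps -e | grep ksoftirqd           # one ksoftirqd/<n> thread per CPU core
sysctl net.core.netdev_budget    # packet budget per softirq cycle (300 by default)
cat /proc/net/softnet_stat       # per-CPU counters; the 3rd hex column counts times the budget/time limit was hit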


Having looked at how the kernel processes packets, we can see where congestion can come from: if softirq runs less often, packets have to wait for some time in the RX queue on the NIC before being processed. This could be because some task is blocking the CPU core, or something else is preventing the kernel from running the softirqs.

Narrowing the problem down to a CPU core and a handler

Softirq delays are just a hypothesis for now. But it makes sense, and it matches what we are seeing very closely. So the next step is to confirm this theory, and if it is confirmed, to find the cause of the delays.

Let's go back to our slow packets:

len=46 ip=172.16.53.32 ttl=61 id=29573 icmp_seq=1953 rtt=99.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29574 icmp_seq=1954 rtt=89.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29575 icmp_seq=1955 rtt=79.2 ms
len=46 ip=172.16.53.32 ttl=61 id=29576 icmp_seq=1956 rtt=69.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29577 icmp_seq=1957 rtt=59.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29790 icmp_seq=2070 rtt=75.7 ms
len=46 ip=172.16.53.32 ttl=61 id=29791 icmp_seq=2071 rtt=65.6 ms
len=46 ip=172.16.53.32 ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms

As discussed earlier, these ICMP packets are hashed to a single NIC RX queue and processed by a single CPU core. To understand what Linux is doing, it is useful to know where (on which CPU core) and how (softirq, ksoftirqd) these packets are processed, so that we can trace the process.
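To find out which core services a given RX queue, you can look at the NIC's per-queue interrupts and their CPU affinity. A sketch, with the interface name and IRQ number as placeholders:

grep eth0 /proc/interrupts            # per-queue IRQ numbers and how many times each CPU handled them
cat /proc/irq/105/smp_affinity_list   # which CPU core(s) IRQ 105 (one of the RX queues) is pinned to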

Now it's time to use tools that let us watch the Linux kernel in real time. Here we used bcc. This toolkit lets you write small C programs that hook arbitrary kernel functions and buffer events to a user-space Python program, which can process them and return the result to you. Hooking arbitrary kernel functions is tricky, but the toolkit is designed to be as safe as possible and to catch exactly the kind of production issues that are not easy to reproduce in a test or development environment.

The plan here is simple: we know that the kernel processes these ICMP pings, so we hook the kernel function icmp_echo, which accepts an incoming ICMP "echo request" packet and initiates the sending of an ICMP "echo reply". We can identify a given packet by its incrementing icmp_seq number, the same one hping3 shows above.
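Before writing the dedicated script, a quick sanity check can be done with bcc's generic trace utility, which prints who is on the CPU each time a kernel function is entered. This is only a rough sketch, and the tool's path varies by distribution (on some it is installed as trace-bpfcc):

# Print the PID and command name every time the kernel enters icmp_echo, i.e. for every echo request
sudo /usr/share/bcc/tools/trace 'icmp_echo'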

The code of the bcc script looks complicated, but it is not as scary as it seems. The icmp_echo function is passed a struct sk_buff *skb, the packet with the "echo request". We can trace it, pull out the echo.sequence field (which matches the icmp_seq shown by hping3 above) and send it to user space, along with the current process name and ID. Below are the results we see while the kernel is processing the packets:

TGID    PID     PROCESS NAME    ICMP_SEQ
0       0       swapper/11      770
0       0       swapper/11      771
0       0       swapper/11      772
0       0       swapper/11      773
0       0       swapper/11      774
20041   20086   prometheus      775
0       0       swapper/11      776
0       0       swapper/11      777
0       0       swapper/11      778
4512    4542    spokes-report-s 779

It should be noted that in the softirq context, the process that happened to make the system call shows up as the "process", even though it is really the kernel that is processing the packets safely within the kernel context.

With this tool we can correlate specific processes with specific packets that show a delay in hping3, by simply grepping this capture for particular icmp_seq values. Packets matching the icmp_seq values above are flagged along with the RTTs we observed (in parentheses are the expected RTTs for packets we filtered out earlier because their RTTs were under 50 ms):

TGID    PID     PROCESS NAME    ICMP_SEQ ** RTT
--
10137   10436   cadvisor        1951
10137   10436   cadvisor        1952
76      76      ksoftirqd/11    1953 ** 99ms
76      76      ksoftirqd/11    1954 ** 89ms
76      76      ksoftirqd/11    1955 ** 79ms
76      76      ksoftirqd/11    1956 ** 69ms
76      76      ksoftirqd/11    1957 ** 59ms
76      76      ksoftirqd/11    1958 ** (49ms)
76      76      ksoftirqd/11    1959 ** (39ms)
76      76      ksoftirqd/11    1960 ** (29ms)
76      76      ksoftirqd/11    1961 ** (19ms)
76      76      ksoftirqd/11    1962 ** (9ms)
--
10137   10436   cadvisor        2068
10137   10436   cadvisor        2069
76      76      ksoftirqd/11    2070 ** 75ms
76      76      ksoftirqd/11    2071 ** 65ms
76      76      ksoftirqd/11    2072 ** 55ms
76      76      ksoftirqd/11    2073 ** (45ms)
76      76      ksoftirqd/11    2074 ** (35ms)
76      76      ksoftirqd/11    2075 ** (25ms)
76      76      ksoftirqd/11    2076 ** (15ms)
76      76      ksoftirqd/11    2077 ** (5ms)

The results tell us several things. First, all of these packets are handled in the ksoftirqd/11 context, which means that for this particular pair of machines, ICMP packets are hashed to core 11 on the receiving side. We also see that before every stall there are packets processed in the context of a cadvisor system call; then ksoftirqd takes over and works through the accumulated queue, exactly the number of packets that piled up after cadvisor ran.

The fact that cadvisor is always running immediately beforehand implies that it is involved in the problem. Ironically, the purpose of cadvisor is to "analyze resource usage and performance characteristics of running containers", not to cause performance problems itself.

As with other aspects of how containers work, these are all very advanced tools and you can expect performance problems under some unforeseen circumstances.

What does cadvisor do that slows down the packet queue?

We now have a pretty good understanding of how the stall happens, which process causes it, and on which CPU. We can see that because of some hard blocking, the Linux kernel does not manage to schedule ksoftirqd in time, and we can see that the packets are processed in the context of cadvisor. It is logical to assume that cadvisor launches a slow syscall, at the end of which all the packets accumulated during that time get processed.


That is a theory, but how do we test it? What we can do is trace the CPU core throughout this process, find the point where the packet budget is exceeded and ksoftirqd is woken up, and then look a little earlier to see what exactly was running on that CPU core just before that point. It is like taking an X-ray of the CPU every few milliseconds.


Conveniently, all of this can be done with existing tools. For example, perf record samples the specified CPU core at a given frequency and can generate a call graph of the running system, including both user space and the Linux kernel. We can take that recording and process it with a small fork of Brendan Gregg's FlameGraph tooling that preserves the ordering of stack traces: save one-line stack traces every 1 ms, and then extract and keep the 100 ms sample just before a ksoftirqd trace appears:

# record 999 times a second, or every 1ms with some offset so not to align exactly with timers
sudo perf record -C 11 -g -F 999
# take that recording and make a simpler stack trace.
sudo perf script 2>/dev/null | ./FlameGraph/stackcollapse-perf-ordered.pl | grep ksoftir -B 100

Here are the results:

(hundreds of similar-looking traces)

cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll;ixgbe_clean_rx_irq;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;bond_handle_frame;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;ipip_tunnel_xmit;ip_tunnel_xmit;iptunnel_xmit;ip_local_out;dst_output;__ip_local_out;nf_hook_slow;nf_iterate;nf_conntrack_in;generic_packet;ipt_do_table;set_match_v4;ip_set_test;hash_net4_kadt;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;hash_net4_test
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;gro_cell_poll;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;dev_queue_xmit_nit;packet_rcv;tpacket_rcv;sch_direct_xmit;validate_xmit_skb_list;validate_xmit_skb;netif_skb_features;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;__dev_queue_xmit;dev_hard_start_xmit;__bpf_prog_run;__bpf_prog_run

There's a lot going on here, but the main thing is that we find the "cadvisor before ksoftirqd" pattern we saw earlier in the ICMP tracer. What does it mean?

Each line is a trace of the CPU at a certain point in time, with the calls down the stack separated by semicolons. In the middle of the lines we can see the syscall being made, read(): ...;do_syscall_64;sys_read;... So cadvisor spends a lot of time in read() system calls related to the mem_cgroup_* functions (top of the call stack, i.e. the end of the line).

The call trace does not conveniently show what is being read, so let's run strace, watch what cadvisor does, and find the system calls that take longer than 100 ms:

theojulienne@kube-node-bad ~ $ sudo strace -p 10137 -T -ff 2>&1 | egrep '<0.[1-9]'
[pid 10436] <... futex resumed> ) = 0 <0.156784>
[pid 10432] <... futex resumed> ) = 0 <0.258285>
[pid 10137] <... futex resumed> ) = 0 <0.678382>
[pid 10384] <... futex resumed> ) = 0 <0.762328>
[pid 10436] <... read resumed> "cache 154234880nrss 507904nrss_h"..., 4096) = 658 <0.179438>
[pid 10384] <... futex resumed> ) = 0 <0.104614>
[pid 10436] <... futex resumed> ) = 0 <0.175936>
[pid 10436] <... read resumed> "cache 0nrss 0nrss_huge 0nmapped_"..., 4096) = 577 <0.228091>
[pid 10427] <... read resumed> "cache 0nrss 0nrss_huge 0nmapped_"..., 4096) = 577 <0.207334>
[pid 10411] <... epoll_ctl resumed> ) = 0 <0.118113>
[pid 10382] <... pselect6 resumed> ) = 0 (Timeout) <0.117717>
[pid 10436] <... read resumed> "cache 154234880nrss 507904nrss_h"..., 4096) = 660 <0.159891>
[pid 10417] <... futex resumed> ) = 0 <0.917495>
[pid 10436] <... futex resumed> ) = 0 <0.208172>
[pid 10417] <... futex resumed> ) = 0 <0.190763>
[pid 10417] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 576 <0.154442>

As one might expect, we see slow read() calls here. From the content of the read operations and the mem_cgroup context, it is clear that these read() calls are against the memory.stat file, which shows memory usage and cgroup limits (the resource-isolation technology used by Docker). The cadvisor tool queries this file to get resource usage information for containers. Let's check whether it is the kernel or cadvisor doing something unexpected by reading the file by hand:

theojulienne@kube-node-bad ~ $ time cat /sys/fs/cgroup/memory/memory.stat >/dev/null

real 0m0.153s
user 0m0.000s
sys 0m0.152s
theojulienne@kube-node-bad ~ $

Since we can now reproduce the problem this way, we understand that the Linux kernel is hitting some pathological behavior.

Why is the read operation so slow?

At this point it becomes much easier to find reports from other users about similar problems. As it turns out, this bug had already been reported in the cadvisor tracker as a case of excessive CPU usage; nobody had simply noticed that the latency also shows up randomly in the network stack. We had indeed noticed that cadvisor was consuming more CPU time than expected, but our servers have plenty of CPU resources, so this was not given much attention and the problem was not studied thoroughly.

The problem is that memory control groups (cgroups) account for memory usage within a namespace (container). When all processes in a cgroup exit, Docker releases the memory cgroup. However, "memory" is not just process memory: although the process memory itself is no longer used, the kernel also charges cached content such as dentries and inodes (directory and file metadata) to the memory cgroup. From the problem description:

Zombie cgroups: control groups that have no processes and have been deleted, but which still have memory charged to them (in my case, from the dentry cache, but it can also be charged from the page cache or tmpfs).

Checking every page in the cache when a cgroup is freed could be very slow, so the kernel takes the lazy path: it waits until those cached pages are reclaimed, and only when the memory is actually needed does it finally clean up the cgroup. Until that point, the cgroup is still counted when statistics are collected.

In performance terms, this trades memory for speed: the initial cleanup is faster because some cached memory is left behind. That is fine; when the kernel reclaims the last of that cached memory, the cgroup is eventually cleaned up, so it is not a "leak". Unfortunately, the specific way memory.stat traverses these cgroups in this kernel version (4.9), combined with the huge amount of memory on our servers, means it takes much longer to reclaim the last of the cached data and clear out the zombie cgroups.

It turned out that some of our nodes had accumulated so many zombie cgroups that the read time, and therefore the latency, exceeded a second.
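A rough way to gauge how many zombie cgroups a node has accumulated is to compare the kernel's memory-cgroup count with the number of containers actually running; this is only a sketch:

grep memory /proc/cgroups    # 3rd column: total memory cgroups the kernel is tracking
docker ps -q | wc -l         # running containers, which should account for most of them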

A workaround for the cadvisor problem is to drop the dentry/inode caches system-wide, which immediately eliminates the read latency, as well as the network latency on the host, because dropping the caches also touches the cached pages of the zombie cgroups, and they get freed too. This is not a fix, but it confirms the cause of the problem.
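In practice the workaround amounts to dropping the reclaimable slab objects (dentries and inodes) on the affected node; a sketch:

# Flush dirty data first, then drop dentries and inodes (value 2); this also allows the pages
# still charged to zombie memory cgroups to be reclaimed, as described above
sync
echo 2 | sudo tee /proc/sys/vm/drop_caches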

It turned out that newer kernels (4.19+) improve the performance of the memory.stat call, so moving to such a kernel fixed the problem. In the meantime, we had tooling to detect problem nodes in our Kubernetes clusters, drain them gracefully, and reboot them. We combed through all the clusters, found the nodes with sufficiently high latency, and rebooted them, which bought us time to update the OS on the remaining servers.
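The drain-and-reboot step looks roughly like this (the node name is a placeholder, and the exact flags depend on your kubectl version and workloads):

kubectl drain kube-node-bad --ignore-daemonsets   # evict pods gracefully
# ...reboot the node out of band...
kubectl uncordon kube-node-bad                    # return it to the scheduler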

Summing up

Because this bug stalled the processing of the NIC RX queues for hundreds of milliseconds, it caused both high latency on short-lived connections and delays in the middle of a connection, for example between a MySQL query and its response packets.

Understanding and maintaining the performance of the most fundamental systems, such as Kubernetes, is critical to the reliability and speed of all services based on them. All running systems benefit from Kubernetes performance improvements.

Source: habr.com
