The story of missing DNS packets from Google Cloud technical support

From the Google Blog Editor: Have you ever wondered how Google Cloud Technical Solutions Engineers (TSEs) handle your support cases? TSEs are responsible for tracking down and fixing the root causes of the problems users report. Some of these problems are quite simple, but occasionally a ticket arrives that requires the attention of several engineers at once. In this article, one of our TSEs tells the story of a very tricky problem from his recent practice: the case of the missing DNS packets. Along the way, we will see how the engineers managed to resolve the situation and what they learned while fixing the bug. We hope this story not only gives you insight into a deeply rooted bug, but also a feel for what happens behind the scenes when you file a Google Cloud support case.


Troubleshooting is both a science and an art. It starts with forming a hypothesis about why the system is misbehaving, which is then put to the test. But before formulating a hypothesis, we must clearly define and precisely articulate the problem. If the question is too vague, you will have to analyze everything carefully and methodically; that is the "art" of troubleshooting.

On Google Cloud such processes are further complicated by the fact that Google Cloud goes out of its way to guarantee the privacy of its users. Because of this, TSE engineers have no access to edit your systems, nor can they view configurations as broadly as users can. So to test any of our hypotheses, we (the engineers) cannot simply go in and modify the system ourselves.

Some users believe we will fix everything the way mechanics at a car garage do, and simply send us the ID of a virtual machine, while in reality the process runs as a conversation: gathering information, forming and confirming (or refuting) hypotheses, and, in the end, solving the problem together with the customer.

Problem under consideration

Today's story has a happy ending. One of the reasons this case was resolved successfully is the very detailed and precise description of the problem. Below you can see a copy of the first ticket (edited to hide confidential information):
[Screenshot of the original support ticket (redacted)]
The ticket contains a lot of useful information for us:

  • A specific VM is named
  • The problem itself is stated: DNS does not work
  • Where the problem manifests itself is indicated: the VM and a container
  • The steps the user took to diagnose the problem are described

The case was registered as "P1: Critical Impact - Service Unusable in production", which means round-the-clock (24/7) monitoring of the situation under the "Follow the Sun" scheme (you can read more about how support case priorities work), with the case handed from one support team to the next at every time-zone shift. In fact, by the time the problem reached our team in Zurich, it had already circled the globe. By then the user had put mitigation measures in place, but feared a repeat in production, since the root cause had not yet been found.

By the time the ticket reached Zurich, we already had the following information on hand:

  • The contents of /etc/hosts
  • The contents of /etc/resolv.conf
  • The output of iptables-save
  • A pcap file captured with the ngrep command

With this data, we were ready to begin the "investigation" and troubleshooting phase.

Our first steps

First of all, we checked the logs and the status of the metadata server and made sure it was working correctly. The metadata server answers at the IP address 169.254.169.254 and is, among other things, responsible for resolving domain names. We also double-checked that the firewall rules applied to the VM were correct and were not blocking packets.
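
As an aside, the metadata server can also be poked from inside the VM over its HTTP API; a minimal sketch using only the Python standard library (this checks that the server is alive and reachable, which is separate from its DNS role):

import urllib.request

# The GCE metadata server requires this header on every request.
req = urllib.request.Request(
    "http://169.254.169.254/computeMetadata/v1/instance/hostname",
    headers={"Metadata-Flavor": "Google"},
)

# Only works from inside a GCE VM; prints the instance hostname if the
# metadata server is reachable and responding.
print(urllib.request.urlopen(req, timeout=3).read().decode())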

It was a strange problem: the nmap check disproved our main hypothesis of UDP packet loss, so we came up with a few more options and ways to verify them:

  • Are packets being dropped selectively? => Check the iptables rules
  • Is the MTU too small? => Check the output of ip a show
  • Does the problem affect only UDP packets, or TCP as well? => Run dig +tcp
  • Do the packets generated by dig come back? => Run tcpdump
  • Is libdns working correctly? => Run strace to check that packets are sent and received in both directions

At this point we decided to get on a call with the user and troubleshoot live.

During the call, we manage to check several things:

  • After several checks, we rule the iptables rules out of the list of causes
  • We check the network interfaces and routing tables, and double-check that the MTU is correct
  • We discover that dig +tcp google.com (over TCP) works as it should, while dig google.com (over UDP) does not (a way to reproduce this check without dig is sketched right after this list)
  • Running tcpdump alongside dig, we find that UDP responses do come back
  • Running strace dig google.com, we see that dig correctly calls sendmsg() and recvmsg(), but the latter is interrupted by a timeout
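
If you want to reproduce the TCP-versus-UDP check without dig, a rough sketch with the third-party dnspython package (our illustration, not something the user ran) looks like this:

import dns.exception
import dns.message
import dns.query

SERVER = "169.254.169.254"  # the metadata server acting as the resolver
query = dns.message.make_query("google.com", "A")

# UDP, the equivalent of a plain `dig google.com`
try:
    dns.query.udp(query, SERVER, timeout=3)
    print("UDP: got a response")
except dns.exception.Timeout:
    print("UDP: timed out")  # what the user was seeing

# TCP, the equivalent of `dig +tcp google.com`
dns.query.tcp(query, SERVER, timeout=3)
print("TCP: got a response")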

Unfortunately, the end of the shift arrives and we are forced to hand the problem over to the next time zone. The case, however, had piqued our team's interest, and a colleague suggests crafting the initial DNS packet by hand with the scapy Python module.

from scapy.all import *

answer = sr1(IP(dst="169.254.169.254")/UDP(dport=53)/DNS(rd=1,qd=DNSQR(qname="google.com")),verbose=0)
print ("169.254.169.254", answer[DNS].summary())

This snippet creates a DNS packet and sends a request to the metadata server.

The user runs the code, the DNS response is returned and the application receives it, which confirms that there is no problem at the network level.

After the next "trip around the world", the case returns to our team, and I take it over completely, figuring it will be more convenient for the user if the case stops bouncing from place to place.

In the meantime, the user kindly agrees to provide a snapshot of the system image. This is very good news: being able to test the system myself greatly speeds up troubleshooting, because I no longer have to ask the user to run commands and send me the results to analyze; I can do everything myself!

My colleagues start to envy me a little. We discuss the case over lunch, but no one has any idea what is going on. Fortunately, the user has already put mitigations in place and is in no hurry, so we have time to dissect the problem. And since we have an image, we can run any tests we like. Great!

Taking a step back

One of the most popular questions in a systems engineer interview is: "What happens when you ping www.google.com?" It is a great question, because the candidate has to describe everything from the shell to user space, down to the kernel, and out onto the network. I smile: sometimes interview questions turn out to be useful in real life...

I decided to apply this interview question to the problem at hand. Roughly speaking, when you try to resolve a DNS name, the following happens:

  1. The application calls a system library, such as libdns
  2. libdns checks the system configuration to find out which DNS server to contact (in the diagram this is 169.254.169.254, the metadata server)
  3. libdns uses system calls to create a UDP socket (SOCK_DGRAM) and to send and receive UDP DNS packets
  4. The UDP stack can be tuned at the kernel level through the sysctl interface
  5. The kernel talks to the hardware to send the packets out over the network through the network interface
  6. The hypervisor intercepts the packet and forwards it to the metadata server
  7. The metadata server works its magic to resolve the DNS name and returns a response along the same path

[Diagram: the path of a DNS request from the application through libdns and the kernel to the metadata server]
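
To make step 3 concrete, here is a minimal sketch (our own illustration, not the libdns source) of what such a lookup boils down to: open a SOCK_DGRAM socket, hand-build a DNS query, send it to the metadata server, and wait for the answer. A timeout on the receive is exactly the symptom the strace run showed.

import random
import socket
import struct

def build_query(name):
    """Build a minimal DNS query: header with RD=1, one question of type A, class IN."""
    header = struct.pack(">HHHHHH", random.randint(0, 0xFFFF), 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split(".")) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # the UDP socket from step 3
sock.settimeout(5)
sock.sendto(build_query("google.com"), ("169.254.169.254", 53))
try:
    data, _ = sock.recvfrom(512)
    print("got", len(data), "bytes of DNS response")
except socket.timeout:
    print("timed out waiting for the response")           # the behavior seen under strace
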
Let me remind you what hypotheses we have already considered:

Hypothesis: Broken Libraries

  • Test 1: run strace on the system, check that dig is making the correct system calls
  • Result: The correct system calls are called
  • Test 2: use scapy to check whether we can resolve names bypassing the system libraries
  • Result: we can
  • Test 3: run rpm -V on libdns package and md5sum library files
  • Result: the library code is completely identical to the code in the working operating system
  • Test 4: mount the user's root image on a VM that does not show this behavior, chroot into it, and check whether DNS works
  • Result: DNS works correctly

Conclusion based on tests: the problem is not in the libraries

Hypothesis: There is an error in the DNS settings

  • Test 1: check tcpdump and see if DNS packets are sent and returned correctly after running dig
  • Result: packets are transmitted correctly
  • Test 2: recheck /etc/nsswitch.conf and /etc/resolv.conf on the server
  • Result: everything is correct

Conclusion based on tests: the problem is not in the DNS configuration

Hypothesis: a damaged kernel

  • Test: install a new kernel, verify the signature, reboot
  • Result: the behavior is the same

Conclusion based on tests: the kernel is not damaged

Hypothesis: misbehavior of the user's network (or of the hypervisor's network interface)

  • Test 1: check firewall settings
  • Result: firewall passes DNS packets on both host and GCP
  • Test 2: intercept traffic and track the correctness of the transmission and return of DNS queries
  • Result: tcpdump confirms receipt of return packets by host

Conclusion based on tests: the problem is not the network

Hypothesis: the metadata server is down

  • Test 1: check metadata server logs for anomalies
  • Result: no anomalies in the logs
  • Test 2: bypass the metadata server via dig @8.8.8.8
  • Result: resolution is broken even without the metadata server

Conclusion based on tests: the problem is not in the metadata server

The bottom line: we had tested every subsystem except the kernel's runtime settings!

Diving into kernel runtime settings

The kernel's runtime environment can be configured via command-line options (through grub) or via the sysctl interface. I looked into /etc/sysctl.conf and, lo and behold, found a number of custom settings. Feeling I was onto something, I set aside everything that was not network or TCP related and was left with a handful of net.core settings. Then I took a VM where I had host privileges and began applying the broken VM's settings one by one until I arrived at the culprit:

net.core.rmem_default = 2147483647

There it was, the DNS-breaking setting! I had found the murder weapon. But why did it break anything? I still needed a motive.

The default size of the buffer that receives DNS packets is configured via net.core.rmem_default. A typical value is somewhere around 200 KiB, but if your server receives a lot of DNS packets you may want to increase the buffer. If the buffer is full when a new packet arrives, for example because the application is not draining it fast enough, you start losing packets. Our client had increased the buffer size for a good reason: they were afraid of losing data, because they were running an application that collects metrics via DNS packets. The value they set was the maximum possible: 2^31 - 1 (if you try to set 2^31, the kernel returns "INVALID ARGUMENT").
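
For reference, the setting can be inspected and changed on the fly through procfs; a minimal sketch (assuming root on a throwaway VM, not production):

from pathlib import Path

rmem = Path("/proc/sys/net/core/rmem_default")
print("current:", rmem.read_text().strip())   # typically around 212992 bytes (~208 KiB)

try:
    rmem.write_text(str(2**31))               # one past the maximum the kernel accepts...
except OSError as err:
    print("rejected:", err)                   # ...so the write fails with "Invalid argument"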

Suddenly I realized why nmap and scapy were working correctly: they were using raw sockets! Raw sockets are different from normal sockets: they bypass iptables, and they are not buffered!

But why would a buffer that is "too big" cause problems? Clearly something was not working as intended.

By this point I could reproduce the problem on multiple kernels and multiple distributions. It showed up on a 3.x kernel, and it showed up just the same on a 5.x kernel.

Indeed, after running

sysctl -w net.core.rmem_default=$((2**31-1))

DNS stopped working.

I started hunting for working values with a simple binary search and found that the system worked with 2147481343, but that number meant nothing to me by itself. I suggested the client try it, and he replied that with this value google.com worked, but other domains still returned an error, so I continued my investigation.
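
The search itself was nothing fancy; roughly the following (a hypothetical sketch, assuming root on a test VM and that the lookup actually hits the resolver rather than a local cache such as nscd or systemd-resolved):

import socket
from pathlib import Path

RMEM = Path("/proc/sys/net/core/rmem_default")

def dns_works():
    """Return True if a normal (libc, UDP) DNS lookup succeeds."""
    try:
        socket.gethostbyname("google.com")
        return True
    except socket.gaierror:
        return False

lo, hi = 212992, 2**31 - 1                    # known-good default .. known-broken maximum
while hi - lo > 1:
    mid = (lo + hi) // 2
    RMEM.write_text(str(mid))
    if dns_works():
        lo = mid                              # still works: the threshold is higher
    else:
        hi = mid                              # broken: the threshold is at or below mid

print("largest working value:", lo)           # here it came out to 2147481343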

I installed dropwatch, a tool I should have used earlier: it shows exactly where in the kernel a packet gets dropped. The culprit was the function udp_queue_rcv_skb. I downloaded the kernel sources and added a few printk calls to trace exactly where the packet goes. I quickly found the relevant if condition and just stared at it for a while, because that was when everything finally came together into a single picture: 2^31 - 1, the meaningless number, the broken domain... It was this piece of code in __udp_enqueue_schedule_skb:

if (rmem > (size + sk->sk_rcvbuf))
		goto uncharge_drop;

Note:

  • rmem is of type int
  • size is of type u16 (an unsigned 16-bit integer) and holds the packet size
  • sk->sk_rcvbuf is of type int and holds the buffer size, which by definition equals the value of net.core.rmem_default

When sk_rcvbuf is close to 2^31, adding the packet size can overflow the integer. And since it is a signed int, the result wraps around to a negative value, so the condition evaluates to true when it should be false (classic signed integer overflow).
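
A toy model of that comparison (ordinary Python standing in for the kernel's C, with the 32-bit wraparound made explicit) shows the effect:

def to_int32(x):
    """Interpret x as a C 'int': 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

sk_rcvbuf = 2**31 - 1          # net.core.rmem_default = 2147483647
size = 4096                    # size of the incoming datagram (illustrative value)
rmem = 0                       # memory currently charged to the socket's receive queue

rhs = to_int32(size + sk_rcvbuf)   # what 'size + sk->sk_rcvbuf' becomes in C
print(rhs)                         # -2147479553: the sum wrapped around to a negative number
print(rmem > rhs)                  # True -> the packet is dropped even though the buffer is empty

Incidentally, this would also explain the value the binary search produced: 2147483647 - 2147481343 = 2304, so with that setting only datagrams whose accounted size exceeds 2304 bytes still trip the overflow, which fits the observation that small responses such as google.com got through while larger ones did not.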

The fix was trivial: cast to unsigned int. I applied the fix and restarted the system, and DNS worked again.

The taste of victory

I forwarded my findings to the client and sent the kernel patch to LKML. I am pleased: every piece of the puzzle fell into place, I can explain exactly why we observed what we observed, and most importantly, we found the solution thanks to working on it together!

It must be admitted that the case turned out to be a rare one, and fortunately we seldom receive requests this complex from users.



Source: habr.com
