DNS issues in Kubernetes. Public postmortem

Translator's note: this is a translation of a public postmortem from the engineering blog of the company Preply. It describes an issue with conntrack in a Kubernetes cluster that resulted in partial downtime for some production services.

This article may be helpful for those who want to learn a little more about postmortems or prevent some potential DNS problems in the future.

It's not DNS
It can't be DNS
It was DNS

A little about postmortems and processes in Preply

A postmortem describes a failure or some other notable event in production. It includes a timeline of events, user impact, root cause, actions taken, and lessons learned.

Seeking SRE

At our weekly meetings over pizza, we share various kinds of information with the technical team. One of the most important parts of these meetings are postmortems, which are usually accompanied by a presentation with slides and a deeper analysis of the incident. Even though we don't "clap" after postmortems, we try to develop a blameless culture. We believe that writing and presenting postmortems can help us (and not only us) prevent similar incidents in the future, which is why we share them.

People involved in an incident should feel they can talk about it in detail without fear of punishment or retribution. No reprimands! Writing a postmortem is not a punishment, but a learning opportunity for the whole company.

Keep CALMS & DevOps: S is for Sharing

DNS issues in Kubernetes. Postmortem

Date: 28.02.2020

Authors: Amet U., Andrey S., Igor K., Alexey P.

Status: Complete

Summary: Partial DNS downtime (26 minutes) for some services in the Kubernetes cluster

Impact: 15,000 events lost for services A, B and C

Root Cause: kube-proxy was unable to correctly remove old entries from the conntrack table, so some services kept trying to connect to pods that no longer existed

E0228 20:13:53.795782       1 proxier.go:610] Failed to delete kube-system/kube-dns:dns endpoint connections, error: error deleting conntrack entries for UDP peer {100.64.0.10, 100.110.33.231}, error: conntrack command returned: ...

Trigger: Due to the low load inside the Kubernetes cluster, CoreDNS-autoscaler reduced the number of pods in the deployment from three to two

Resolution: The next application deployment triggered the creation of new nodes; CoreDNS-autoscaler added more pods to serve the cluster, which forced the conntrack table to be rewritten

Detection: Prometheus monitoring detected a large number of 5xx errors for services A, B and C and paged the on-call engineers

5xx errors in Kibana
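To make the symptom more concrete, here is a minimal sketch (not part of the original postmortem) of what services A, B and C effectively experienced: a UDP DNS query sent to the kube-dns ClusterIP from the log above (100.64.0.10) can be DNAT'ed according to a stale conntrack entry to a CoreDNS pod that no longer exists, so the query simply times out. The queried name and the timeout value are arbitrary examples.

import socket
import struct
import sys

def build_query(name, qtype=1):
    # Minimal DNS query packet: header (ID, RD flag, one question) + QNAME/QTYPE/QCLASS.
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split("."))
    return header + qname + b"\x00" + struct.pack(">HH", qtype, 1)

def probe(server, name="kubernetes.default.svc.cluster.local", timeout=2.0):
    # Send one UDP query straight to the DNS ClusterIP and report whether an
    # answer came back before the timeout.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_query(name), (server, 53))
        sock.recvfrom(512)
        return True
    except socket.timeout:
        return False
    finally:
        sock.close()

if __name__ == "__main__":
    # 100.64.0.10 is the kube-dns ClusterIP taken from the kube-proxy error above.
    dns_ip = sys.argv[1] if len(sys.argv) > 1 else "100.64.0.10"
    print("ok" if probe(dns_ip) else "timeout: query may have hit a stale conntrack entry")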

Actions

Action                           Type      Responsible   Task
Disable autoscaler for CoreDNS   prevent   Amet U.       DEVOPS-695
Install caching DNS server       reduce    Max W.        DEVOPS-665
Set up conntrack monitoring      prevent   Amet U.       DEVOPS-674
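The postmortem does not describe how exactly the conntrack monitoring (DEVOPS-674) was implemented. As a rough illustration of the kind of check it could start from, the sketch below reads the kernel's conntrack counters and warns when the table approaches its limit; in a real setup these values are usually exported to Prometheus (for example via node_exporter) rather than read by a script. The 80% threshold is an arbitrary example.

from pathlib import Path

# Standard Linux proc interface for connection tracking (requires the
# nf_conntrack module to be loaded on the node).
COUNT = Path("/proc/sys/net/netfilter/nf_conntrack_count")
LIMIT = Path("/proc/sys/net/netfilter/nf_conntrack_max")
THRESHOLD = 0.8  # arbitrary example: warn when the table is 80% full

def conntrack_usage():
    # Return (current number of entries, table limit, fill ratio).
    count = int(COUNT.read_text())
    limit = int(LIMIT.read_text())
    return count, limit, count / limit

if __name__ == "__main__":
    count, limit, ratio = conntrack_usage()
    status = "WARNING" if ratio >= THRESHOLD else "ok"
    print(f"{status}: conntrack {count}/{limit} entries ({ratio:.0%} full)")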

Lessons learned

What went well:

  • The monitoring worked well. The response was fast and organized
  • We did not run into any limits on the nodes

What went wrong:

  • The real root cause is still unknown; it appears to be a specific bug in conntrack
  • All our actions address only the consequences, not the root cause (the bug itself)
  • We knew that sooner or later we might have problems with DNS, but we did not prioritize those tasks

Where we got lucky:

  • Another deployment triggered CoreDNS-autoscaler, which overwrote the conntrack table
  • This bug affected only part of the services

Timeline (EET)

Time     Action
22:13    CoreDNS-autoscaler reduced the number of pods from three to two
22:18    On-call engineers began to receive alerts from the monitoring system
22:21    On-call engineers started investigating the cause of the errors
22:39    On-call engineers began rolling back one of the recently deployed services to the previous version
22:40    5xx errors stopped appearing, the situation stabilized

  • Time to discovery: 4 minutes
  • Time to action: 21 minutes
  • Time to fix: 1 minute

Additional Information

To minimize CPU usage, the Linux kernel uses connection tracking (conntrack). In short, this is a netfilter subsystem that keeps a list of NAT entries in a special table. When the next packet arrives from the same pod to the same destination as before, the final IP address is not recalculated but taken from the conntrack table.
How conntrack works
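For the incident above, the stale state can also be observed directly in that table. The following sketch (an illustration, not a tool from the postmortem) scans /proc/net/nf_conntrack for UDP flows whose original destination is the kube-dns ClusterIP (100.64.0.10, taken from the kube-proxy error above) but whose reply side points to a CoreDNS pod IP that is no longer alive; the set of live pod IPs is a placeholder you would fill in from `kubectl get pods -o wide`. Reading this file typically requires root and the nf_conntrack module.

import re

# Values taken from the kube-proxy error in this postmortem; the live pod set
# is a placeholder for the currently running CoreDNS pod IPs.
DNS_CLUSTER_IP = "100.64.0.10"
LIVE_COREDNS_PODS = {"100.110.33.231"}

def stale_dns_entries(path="/proc/net/nf_conntrack"):
    # Yield conntrack lines for UDP flows whose original destination is the
    # kube-dns ClusterIP but whose reply side points to a pod that is gone.
    with open(path) as f:
        for line in f:
            if " udp " not in line:
                continue
            srcs = re.findall(r"src=([0-9.]+)", line)
            dsts = re.findall(r"dst=([0-9.]+)", line)
            dports = re.findall(r"dport=(\d+)", line)
            if len(srcs) < 2 or len(dsts) < 2 or "53" not in dports:
                continue
            original_dst, reply_src = dsts[0], srcs[1]  # first tuple = original direction, second = reply
            if original_dst == DNS_CLUSTER_IP and reply_src not in LIVE_COREDNS_PODS:
                yield line.rstrip()

if __name__ == "__main__":
    for entry in stale_dns_entries():
        print(entry)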

Results

This was an example of one of our postmortems, with some useful links. In this article specifically, we share information that may be useful to other companies as well. That is why we are not afraid to make mistakes, and why we make one of our postmortems public. Here are some more interesting public postmortems:

Source: habr.com
