Translator's note: this is a translation of a public postmortem from Preply's engineering blog. It describes an issue with conntrack in a Kubernetes cluster that resulted in partial downtime for some production services.
This article may be helpful for those who want to learn a little more about postmortems, or who want to prevent some potential DNS problems of their own.
It's not DNS
It can't be DNS
It was DNS
A little about postmortems and processes in Preply
A postmortem describes a failure or other notable event in production. It includes a timeline of events, user impact, root cause, remediation actions, and lessons learned.
At our weekly pizza meetings with the technical team, we share all kinds of information. One of the most important parts of these meetings is postmortems, which are most often accompanied by a slide presentation and a deeper analysis of the incident. Even though we don't "clap" after postmortems, we try to develop a blameless culture. We believe that writing and presenting postmortems can help us (and others) prevent similar incidents in the future, which is why we share them.
Individuals involved in an incident should feel they can talk about it in detail without fear of punishment or retribution. No reprimands! Writing a postmortem is not a punishment, but a learning opportunity for the whole company.
Summary: Partial DNS downtime (26 minutes) for some services in the Kubernetes cluster
Impact: 15,000 events lost for services A, B, and C
Root cause: kube-proxy failed to correctly remove stale entries from the conntrack table, so some services kept trying to connect to pods that no longer existed
Trigger: Due to low load inside the Kubernetes cluster, the CoreDNS-autoscaler reduced the number of pods in the deployment from three to two
Resolution: The next application deployment triggered the creation of new nodes; the CoreDNS-autoscaler added more pods to serve the cluster, which forced the conntrack table to be rewritten
Detection: Prometheus monitoring detected a large number of 5xx errors for services A, B, and C and paged the on-call engineers
5xx errors in Kibana
Actions
| Action | Type | Owner | Ticket |
| --- | --- | --- | --- |
| Disable the autoscaler for CoreDNS | prevent | Amet W. | DEVOPS-695 |
| Install a caching DNS server | reduce | Max W. | DEVOPS-665 |
| Set up conntrack monitoring | prevent | Amet W. | DEVOPS-674 |
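As a rough illustration of the conntrack-monitoring action item: on Linux nodes, the current and maximum conntrack entry counts are exposed via `/proc/sys/net/netfilter/nf_conntrack_count` and `nf_conntrack_max`. The sketch below is not our actual monitoring setup; the 90% alert threshold and function names are hypothetical:

```python
# Minimal sketch of conntrack-table utilization monitoring.
# Assumption: /proc/sys/net/netfilter/nf_conntrack_count and
# nf_conntrack_max exist (true on typical Linux nodes with the
# conntrack module loaded). The 90% threshold is a made-up example.
from pathlib import Path

COUNT_FILE = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_FILE = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_usage(count: int, maximum: int) -> float:
    """Return conntrack table utilization as a percentage."""
    return 100.0 * count / maximum


def should_alert(count: int, maximum: int, threshold: float = 90.0) -> bool:
    """Fire an alert when utilization crosses the (hypothetical) threshold."""
    return conntrack_usage(count, maximum) >= threshold


def read_live_usage() -> float:
    """Read live values from /proc (only works on a Linux host)."""
    count = int(COUNT_FILE.read_text())
    maximum = int(MAX_FILE.read_text())
    return conntrack_usage(count, maximum)
```

In practice you would more likely export these values as metrics rather than poll them by hand; node_exporter, for example, already exposes `node_nf_conntrack_entries` and `node_nf_conntrack_entries_limit` for exactly this kind of alert.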
Lessons learned
What went well:
The monitoring worked well. The response was fast and organized
We did not run into any limits on the nodes
What went wrong:
The true root cause is still unknown; it appears to be a specific bug in conntrack
All of our actions address only the consequences, not the root cause (the bug itself)
We knew that sooner or later we might have problems with DNS, but we had not prioritized those tasks
Where we got lucky:
Another deployment triggered the CoreDNS-autoscaler, which overwrote the conntrack table
The bug affected only some of the services
Timeline (EET)
| Time | Event |
| --- | --- |
| 22:13 | CoreDNS-autoscaler reduced the number of pods from three to two |
| 22:18 | On-call engineers began receiving pages from the monitoring system |
| 22:21 | On-call engineers began investigating the cause of the errors |
| 22:39 | On-call engineers began rolling back one of the most recently deployed services to its previous version |
| 22:40 | 5xx errors stopped appearing; the situation stabilized |
Time to discovery: 4 minutes
Time to action: 21 minutes
Time to fix: 1 minute
Additional Information
CoreDNS logs:
I0228 20:13:53.507780 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"kube-system", Name:"coredns", UID:"2493eb55-3dc0-11ea-b3a2-02bb48f8c230", APIVersion:"apps/v1", ResourceVersion:"132690686", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set coredns-6cbb6646c9 to 2
To minimize CPU usage, the Linux kernel uses a mechanism called conntrack (connection tracking). In short, it maintains a table of NAT entries for active connections. When the next packet arrives from the same pod to the same destination as before, the translated IP address is not recalculated but is taken from the conntrack table.
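The failure mode above can be sketched with a toy model. To be clear, this is only an illustration with hypothetical names: the real conntrack table is a kernel data structure keyed by the full connection 5-tuple, and kube-proxy programs the actual DNAT rules.

```python
# Toy model of conntrack-style caching of NAT decisions (illustration only;
# real conntrack lives in the kernel and keys on the connection 5-tuple).
import random


class ConntrackTable:
    def __init__(self):
        self._entries = {}  # (source, service IP) -> chosen backend pod IP

    def resolve(self, src, service_ip, backends):
        """Pick a backend once, then reuse the cached NAT entry."""
        key = (src, service_ip)
        if key not in self._entries:
            self._entries[key] = random.choice(backends)
        return self._entries[key]

    def flush(self, service_ip):
        """What kube-proxy should do when a service's backends change."""
        self._entries = {k: v for k, v in self._entries.items()
                         if k[1] != service_ip}


table = ConntrackTable()
dns_vip = "10.96.0.10"                           # cluster DNS service IP
backends = ["10.0.1.5", "10.0.1.6", "10.0.1.7"]  # three CoreDNS pods

first = table.resolve("pod-a", dns_vip, backends)

# CoreDNS scales down and the chosen pod disappears, but without a flush
# the stale entry still sends pod-a's DNS queries to the dead IP.
backends.remove(first)
stale = table.resolve("pod-a", dns_vip, backends)
assert stale == first  # queries black-hole -> timeouts and 5xx errors

# After a flush (what the later scale-up effectively forced), it recovers.
table.flush(dns_vip)
fresh = table.resolve("pod-a", dns_vip, backends)
assert fresh in backends
```

This is why scaling CoreDNS back up "fixed" the incident: it forced the stale entries to be replaced, not because the underlying bug was resolved.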
How conntrack works
Conclusion
This was an example of one of our postmortems, with some useful links. In this article we deliberately shared information that may be useful to other companies. That's why we're not afraid to make mistakes, and that's why we've made one of our postmortems public. Here are a few more interesting public postmortems: