How pod priorities in Kubernetes caused downtime at Grafana Labs

Translator's note: we present the technical details behind the recent outage of the cloud service run by the creators of Grafana. It is a classic example of how a new and seemingly very useful feature, designed to improve the quality of the infrastructure, can do harm if you do not anticipate the many nuances of applying it in real production. It is great when such write-ups appear, letting you learn from mistakes other than your own. The details are in this translation of the post by the VP of Product at Grafana Labs.


On Friday, July 19, the Hosted Prometheus service on Grafana Cloud went down for about 30 minutes. I apologize to all customers affected by the failure. Our goal is to provide the right monitoring tools, and we understand that their unavailability complicates your life. We are taking this incident extremely seriously. This note explains what happened, how we responded to it and what we are doing to ensure that this does not happen again.

Background

The Grafana Cloud Hosted Prometheus service is based on Cortex, a CNCF project for building a horizontally scalable, highly available, multi-tenant Prometheus service. The Cortex architecture consists of a set of individual microservices, each with its own function: replication, storage, queries, and so on. Cortex is under active development, with new features and performance improvements landing all the time. We regularly deploy new Cortex releases to our clusters so that customers can benefit from them - fortunately, Cortex can be updated without downtime.

During rolling updates, the Cortex Ingester service requires an additional Ingester replica for the duration of the update. (Translator's note: the ingester is a core component of Cortex. Its job is to consume a constant stream of samples, group them into Prometheus chunks, and store them in a database such as DynamoDB, BigTable, or Cassandra.) This allows the old Ingesters to hand over their current data to the new ones. It is worth noting that Ingesters are resource-hungry: each pod needs 4 cores and 15 GB of memory, i.e. 25% of the CPU and memory of a base machine in our Kubernetes clusters. In general, we usually have far more than 4 cores and 15 GB of unused resources in the cluster, so we can easily run these additional Ingesters during upgrades.
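For illustration, here is a minimal sketch of how such resource requirements look in a Kubernetes pod spec fragment. The container name and image are hypothetical; only the request sizes mirror the figures above.

```yaml
# Pod spec fragment; only the request sizes come from the text above.
containers:
  - name: ingester                         # hypothetical container name
    image: cortexproject/cortex:latest     # illustrative image
    resources:
      requests:
        cpu: "4"        # 4 cores = 25% of a 16-core base machine
        memory: 15Gi    # 15 GB  = 25% of a ~60 GB base machine
```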

However, during normal operation it often happens that none of the machines has these 25% of unclaimed resources. And we do not aim for that: spare CPU and memory always come in handy for other processes. To solve this problem, we decided to use Kubernetes pod priorities. The idea is to give Ingesters a higher priority than other (stateless) microservices. When we need to run an additional (N+1) Ingester, we temporarily preempt other, smaller pods. Those pods are rescheduled onto spare resources on other machines, leaving a "hole" large enough to run the additional Ingester.
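In practice, a workload opts into a priority by referencing a priority class in its pod template. A minimal sketch, assuming the Ingester runs as a Deployment (consistent with the ReplicaSet mentioned later) and a priority class named high; names and labels are hypothetical:

```yaml
# Sketch: the Ingester pod template opts into the higher priority via
# priorityClassName. Names, labels, and image are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingester
spec:
  replicas: 3
  selector:
    matchLabels:
      name: ingester
  template:
    metadata:
      labels:
        name: ingester
    spec:
      priorityClassName: high   # outranks the default priority of stateless pods
      containers:
        - name: ingester        # resources as sketched above (4 CPU / 15Gi requests)
          image: cortexproject/cortex:latest   # illustrative image
```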

On Thursday, July 18, we rolled out four new priority levels across our clusters: critical, high, medium, and low. They had been tested on an internal cluster with no client traffic for approximately one week. By default, pods without an explicitly assigned priority received the medium priority; Ingesters were given the high priority. Critical was reserved for monitoring (Prometheus, Alertmanager, node-exporter, kube-state-metrics, etc.). Our config is open, and you can see the PR here.
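A minimal sketch of what such a set of PriorityClass objects can look like. The numeric values and descriptions are assumptions; only the four names and the fact that medium is the cluster-wide default come from the text.

```yaml
# Sketch of the four priority levels; numeric values are illustrative.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000
description: "Monitoring components (Prometheus, Alertmanager, node-exporter, ...)"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 100000
description: "Stateful, expensive-to-evict workloads such as Ingesters"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium
value: 10000
globalDefault: true        # pods that specify no priorityClassName get this one
description: "Default for all other workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low
value: 1000
description: "Best-effort workloads that are safe to preempt first"
```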

The incident

On Friday, July 19, one of our engineers launched a new dedicated Cortex cluster for a large client. The config for this cluster did not include the new pod priorities, so all of its new pods were assigned the default priority, medium.

The Kubernetes cluster did not have enough capacity for the new Cortex cluster, and the existing production Cortex cluster had not yet been updated (its Ingesters had not been given the high priority). Since the new cluster's Ingesters received the medium priority by default, while the existing production pods were running with no priority at all, the new cluster's Ingesters preempted the Ingesters of the existing production Cortex cluster.

The ReplicaSet behind the evicted Ingester in the production cluster noticed the missing pod and created a new one to maintain the desired number of replicas. The new pod received the default medium priority, and the next "old" Ingester in production lost its resources. The result was an avalanche that ended up evicting all of the Ingester pods of the production Cortex clusters.

Ingesters are stateful and hold the data of the previous 12 hours. This lets us compress it more efficiently before writing it to long-term storage. To do this, Cortex shards the data by series using a Distributed Hash Table (DHT) and replicates each series across three Ingesters with Dynamo-style quorum consistency: a write must be acknowledged by two of the three Ingesters that own the series. Cortex does not send writes to Ingesters that are shutting down. So when a large number of Ingesters leave the DHT, Cortex can no longer achieve sufficient replication for incoming writes, and they fail.
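For reference, a minimal sketch of where the replication factor lives in a Cortex-style configuration; the exact keys and the choice of Consul as the ring store are assumptions about the setup, not taken from the post:

```yaml
# Sketch of a Cortex ingester ring configuration; keys and values are
# illustrative assumptions (they vary by Cortex version), not the actual config.
ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul          # ring (DHT) membership kept in a KV store
      replication_factor: 3    # each series goes to 3 ingesters, quorum of 2
```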

Detection and remediation

Our new error-budget-based Prometheus alerts (details will appear in a future article) started firing 4 minutes after the outage began. Over the next five minutes or so we ran diagnostics and scaled up the underlying Kubernetes cluster to accommodate both the new and the existing production Cortex clusters.
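The post deliberately leaves the details of these alerts for a later article, so the following is only a generic sketch of an error-budget-style alerting rule, not Grafana Labs' actual rules; the metric names, labels, and threshold are all placeholders:

```yaml
# Generic sketch of an error-budget-style alert; metric names and numbers
# are placeholders, not the rules described in the post.
groups:
  - name: hosted-prometheus-slo
    rules:
      - alert: WriteErrorBudgetBurn
        # Fire when more than 1% of writes have failed over the last 5 minutes.
        expr: |
          sum(rate(write_requests_failed_total[5m]))
            /
          sum(rate(write_requests_total[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
```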

Five minutes later, the old Ingesters had successfully written out their data, the new ones had started up, and the Cortex clusters became available again.

It took another 10 minutes to diagnose and fix out-of-memory (OOM) errors from authentication reverse proxy servers located in front of Cortex. OOM errors were caused by a tenfold increase in QPS (we believe due to overly aggressive requests from the client's Prometheus servers).

Aftermath

The total downtime was 26 minutes. No data was lost. Ingesters successfully flushed all of their in-memory data to long-term storage. During the outage, clients' Prometheus servers buffered the failed remote writes using the new WAL-based remote_write code (authored by Callum Styan at Grafana Labs) and retried them after the outage.
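For context, a minimal sketch of a Prometheus remote_write block; the endpoint is a placeholder and the queue_config values are illustrative, not the settings the clients actually used:

```yaml
# Sketch of a Prometheus remote_write configuration; with the WAL-based
# implementation (Prometheus 2.8+), failed sends are retried from the WAL.
# The endpoint and queue settings below are illustrative.
remote_write:
  - url: https://<hosted-prometheus-endpoint>/api/prom/push   # placeholder URL
    queue_config:
      capacity: 10000               # samples buffered per shard
      max_shards: 50
      max_samples_per_send: 500
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 100ms
```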

[Graph: Production Cluster Writes]

Conclusions

It is important to learn from this incident and take the necessary steps to avoid it happening again.

In retrospect, we should not have made medium the default priority until all Ingesters in production had been given the high priority; we needed to take care of their high priority in advance. That has now been fixed. We hope our experience will help other organizations that are considering pod priorities in Kubernetes.

We will add an extra layer of control over the deployment of any additional objects whose configuration is global to a cluster. From now on, such changes will be reviewed by more people. In addition, the change that caused the outage had been considered too minor to warrant a separate design document and was only discussed in a GitHub issue. From now on, all such config changes will be accompanied by the corresponding design documentation.

Finally, we will automate resizing of the authentication reverse proxy to prevent the OOM overloads we saw, and review Prometheus' default settings related to backoff and scaling to prevent similar issues in the future.
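One way to automate that resizing is a HorizontalPodAutoscaler. The following is only a sketch under assumptions: the Deployment name auth-gateway, the replica bounds, the thresholds, and the choice of an HPA at all are ours, not from the post.

```yaml
# Sketch of automated resizing for an authentication reverse proxy.
# The target name, replica bounds, and utilization thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: auth-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: auth-gateway          # hypothetical Deployment name
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # scale out before pods approach OOM
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```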

The outage also had some positive consequences: once the necessary resources were available, Cortex recovered automatically without further intervention. We also gained valuable experience operating Grafana Loki, our new log aggregation system, which helped us verify that all Ingesters behaved correctly during and after the outage.
