Post Mortem on Quay.io unavailability

Translator's note: in early August, Red Hat published a public account of how it resolved the availability problems that users of its quay.io service (a registry for container images, which the company acquired along with CoreOS) had run into over the preceding months. Regardless of your interest in the service itself, the path the company's SRE engineers took to diagnose and eliminate the causes of the outage is instructive.

On May 19, early in the morning (Eastern Daylight Time, EDT), the quay.io service went down. The outage affected both quay.io customers and the Open Source projects that use quay.io as a platform for building and distributing software. Red Hat values the trust of both.

The SRE team got to work immediately and tried to stabilize the Quay service as quickly as possible. However, while they were doing so, customers were unable to push new images and could only occasionally pull existing ones. For some unknown reason, the quay.io database locked up whenever the service was scaled to full capacity.

"What has changed?" is the first question usually asked in such cases. We noticed that shortly before the issue, the OpenShift Dedicated cluster that runs quay.io had started upgrading to version 4.3.19. Because quay.io runs on Red Hat OpenShift Dedicated (OSD), regular updates were routine and had never caused problems. What's more, over the previous six months we had upgraded the Quay clusters several times without any service interruption.

While we were trying to restore the service, other engineers began preparing a new OSD cluster running the previous version of the software, so that everything could be redeployed onto it if necessary.

Root Cause Analysis

The main symptom of the failure was an avalanche of tens of thousands of database connections that effectively rendered the MySQL instance inoperable, which made the problem difficult to diagnose. We set a limit on the maximum number of client connections to help the SRE team evaluate the issue. We saw no unusual traffic to the database: in fact, most of the queries were reads and only a few were writes.

We also tried to identify a pattern in the database traffic that could have triggered the avalanche, but no patterns could be found in the logs. While waiting for the new cluster with OSD 4.3.18 to be ready, we kept trying to bring the quay.io pods back up. Every time the cluster reached full capacity, the database would hang, which meant restarting the RDS instance in addition to all the quay.io pods.

By evening we had stabilized the service in read-only mode and disabled as many non-essential functions as possible (for example, namespace garbage collection) to reduce the load on the database. The hangs stopped, but the cause was still unknown. The new OSD cluster was ready, so we moved the service, routed traffic to it, and kept monitoring.

Quay.io was stable on the new OSD cluster, so we went back to the database logs but could not find a correlation that explained the lockups. OpenShift engineers worked with us to determine whether the changes in Red Hat OpenShift 4.3.19 might have caused problems for Quay. However, nothing was found, and we were unable to reproduce the problem in a lab environment.

Second failure

On May 28, just before noon EDT, quay.io crashed again with the same symptom: the database locked up. Once again we threw all our efforts into the investigation. First of all, the service had to be restored. However, this time restarting RDS and restarting the quay.io pods accomplished nothing: another avalanche of connections flooded the database. But why?

Quay is written in Python, and each pod runs as a single monolithic container. The container runtime handles many parallel tasks at once. We use the gevent library under gunicorn to process web requests. When a request comes into Quay (via our own API or via Docker's API), it is assigned a gevent worker, and that worker usually needs to talk to the database. After the first crash, we found that the gevent workers were connecting to the database using the default settings.
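
For readers unfamiliar with this stack, here is a minimal sketch of what such a gunicorn-plus-gevent setup typically looks like; the worker counts and module path are illustrative assumptions, not Quay's actual configuration.

```python
# gunicorn_conf.py -- illustrative configuration for a gevent-based gunicorn service.
# The values below are example defaults, not Quay's real settings.

worker_class = "gevent"      # cooperative, greenlet-based workers instead of threads
workers = 8                  # worker processes per container (example value)
worker_connections = 1000    # greenlets (in-flight requests) allowed per worker (gunicorn default)
bind = "0.0.0.0:8080"

# launched roughly as: gunicorn -c gunicorn_conf.py web:application
```

With defaults like these, every greenlet that needs the database can open its own connection unless something explicitly caps them, which matches the "default settings" observation above.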

Given the significant number of Quay pods and the thousands of incoming requests per second, the high number of database connections could theoretically overwhelm the MySQL instance. From monitoring we knew that Quay was handling roughly 5 thousand requests per second on average, and the number of database connections was about the same. Five thousand connections fit comfortably within the capabilities of our RDS instance, which cannot be said of tens of thousands. For some reason there were unexpected spikes in the number of connections, yet we saw no correlation with incoming requests.
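
A back-of-envelope calculation shows how quickly unbounded per-worker connections multiply; the pod and worker counts below are hypothetical, chosen only to illustrate the effect.

```python
# Worst-case database connection estimate (all numbers are hypothetical).

pods = 40                    # quay.io pods at full capacity (example)
workers_per_pod = 8          # gunicorn worker processes per pod (example)
active_greenlets = 100       # greenlets per worker touching the database at once (example)

# Without a per-worker cap, each active greenlet may hold its own connection:
worst_case = pods * workers_per_pod * active_greenlets
print(worst_case)            # 32000 -- the "tens of thousands" regime that stalled MySQL
```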

This time we were determined to find and fix the source of the problem rather than just reboot. Changes were made to the Quay codebase to limit the number of database connections for each gevent worker. This number became a configuration parameter that could be changed on the fly without building a new container image. To find out how many connections could realistically be handled, we ran tests in a staging environment with different values to see how they affected our load-testing scenarios. We found that Quay starts returning 502 errors when the number of connections exceeds 10 thousand.
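
A cap of this kind could look roughly like the sketch below, which bounds connections per worker with a gevent semaphore; the environment variable, connection details, and pymysql usage are illustrative assumptions, not the actual Quay change.

```python
# Sketch: cap database connections per gevent worker (illustrative, not Quay's code).

import os
from contextlib import contextmanager

import pymysql
from gevent.lock import BoundedSemaphore

# Read the cap from the environment so it can be tuned without rebuilding the image.
DB_CONNECTION_LIMIT = int(os.environ.get("DB_CONNECTION_LIMIT", "20"))
_slots = BoundedSemaphore(DB_CONNECTION_LIMIT)

@contextmanager
def db_connection():
    """Yield a MySQL connection, making greenlets wait once the cap is reached."""
    _slots.acquire()          # blocks instead of opening yet another connection
    conn = pymysql.connect(
        host="db.example.internal",   # placeholder host
        user="quay",
        password="secret",
        database="quay",
    )
    try:
        yield conn
    finally:
        conn.close()
        _slots.release()
```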

We immediately deployed the new version to production and began watching the database connection graph. In the past, the database would lock up after about 20 minutes. After 30 problem-free minutes we had hope, and after an hour, confidence. We restored traffic to the site and began the post-mortem analysis.

Although we had managed to work around the problem that caused the lockups, we still had not identified its true cause. We did confirm that it was not related to any changes in OpenShift 4.3.19, since the same thing had happened on version 4.3.18, which had previously run Quay without any problems.

Something else was clearly hiding in the cluster.

Digging deeper

Quay.io had been using the default database connection settings for six years without any problems. What changed? Clearly, traffic on quay.io had been growing steadily all that time. It looked as though some threshold had been reached that triggered the avalanche of connections. We continued to study the database logs after the second failure, but found no patterns or obvious relationships.

In the meantime, the SRE team worked on improving request observability in Quay and the overall health of the service. New metrics and dashboards were deployed, showing which parts of Quay are most in demand by customers.
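
As a rough illustration of the kind of instrumentation involved (the metric names, labels, and port are hypothetical, not Quay's actual metrics), per-endpoint counters and latency histograms exported with prometheus_client make it easy to see which API groups carry the load:

```python
# Illustrative per-endpoint request metrics; names and labels are hypothetical.

import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "quay_http_requests_total",
    "HTTP requests served, by API group and status",
    ["api_group", "status"],
)
LATENCY = Histogram(
    "quay_http_request_duration_seconds",
    "Request latency in seconds, by API group",
    ["api_group"],
)

def observe(api_group, status, started_at):
    """Record one finished request (called from the web framework's teardown hook)."""
    REQUESTS.labels(api_group=api_group, status=str(status)).inc()
    LATENCY.labels(api_group=api_group).observe(time.time() - started_at)

if __name__ == "__main__":
    start_http_server(9091)   # expose /metrics for Prometheus to scrape
```

Labelling requests by API group (Docker pulls, the Quay API, App Registry, and so on) is the kind of breakdown that makes the pattern described below visible.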

Quay.io ran fine until June 9. That morning (EDT) we again saw a significant increase in the number of database connections. This time there was no downtime, since the new parameter capped their number and kept it below what MySQL could handle. However, for about half an hour many users noticed that quay.io was slow. We quickly collected all the data we could with the newly added monitoring tools, and suddenly a pattern emerged.

Just before the jump in the number of connections, a large number of requests had come in to the App Registry API. App Registry is a little-known feature of quay.io: it lets you store things like Helm charts and containers with rich metadata. Most quay.io users don't use it, but Red Hat OpenShift relies on it heavily. OperatorHub, part of OpenShift, stores all of its operators in the App Registry. These operators form the basis of the OpenShift workload ecosystem and its partner-centric operating model (Day 2 operations).

Each OpenShift 4 cluster uses operators from the built-in OperatorHub to publish a catalog of operators available for installation and to provide updates to those already installed. As OpenShift 4 grew in popularity, so did the number of clusters running it around the world. Each of those clusters downloads operator content to run the built-in OperatorHub, using the App Registry inside quay.io as its backend. While looking for the source of the problem, we had missed the fact that as OpenShift gradually became more popular, the load on one of quay.io's rarely used features grew as well.

We analyzed the App Registry request traffic and looked into the registry code. Shortcomings were immediately apparent: the database queries it issued were formed suboptimally. Under light load they caused no trouble, but as the load grew they became a source of problems. The App Registry turned out to have two problematic endpoints that reacted poorly to increasing load: one returned a list of all packages in a repository, the other returned all blobs for a package.
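
The classic shape of such a problem is the N+1 query pattern, sketched below with hypothetical table and column names; this is an illustration of the general issue, not the actual App Registry code.

```python
# N+1 query pattern vs. a single joined query (table/column names are hypothetical).

def package_blobs_slow(cursor, repo_id):
    """One query to list packages, then one more query per package for its blobs."""
    cursor.execute("SELECT id, name FROM appr_package WHERE repo_id = %s", (repo_id,))
    result = {}
    for pkg_id, name in cursor.fetchall():
        cursor.execute("SELECT digest FROM appr_blob WHERE package_id = %s", (pkg_id,))
        result[name] = [digest for (digest,) in cursor.fetchall()]
    return result

def package_blobs_fast(cursor, repo_id):
    """A single round trip: join packages to blobs and group in application code."""
    cursor.execute(
        """
        SELECT p.name, b.digest
        FROM appr_package p
        JOIN appr_blob b ON b.package_id = p.id
        WHERE p.repo_id = %s
        """,
        (repo_id,),
    )
    result = {}
    for name, digest in cursor.fetchall():
        result.setdefault(name, []).append(digest)
    return result
```

At low traffic the extra round trips are invisible; multiplied by thousands of clusters polling the catalog, they translate directly into database load.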

Elimination of causes

Over the next week we optimized the App Registry code and its environment. Obviously inefficient SQL queries were reworked, unnecessary calls to the tar command (which ran every time blobs were retrieved) were eliminated, and caching was added wherever possible. We then ran large-scale performance tests and compared the App Registry's speed before and after the changes.
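
For read-heavy lookups like the package index, even a short-lived cache absorbs most repeated requests. The sketch below is a deliberately simple TTL cache; the helper name and the 60-second TTL are assumptions for illustration, not the caching Quay actually added.

```python
# Minimal TTL cache for read-heavy lookups (illustrative; names and TTL are assumptions).

import time
from threading import Lock

_CACHE = {}
_LOCK = Lock()
TTL_SECONDS = 60

def cached(key, compute):
    """Return a cached value for `key`, recomputing it with `compute()` after the TTL expires."""
    now = time.time()
    with _LOCK:
        entry = _CACHE.get(key)
        if entry and now - entry[0] < TTL_SECONDS:
            return entry[1]
    value = compute()                 # e.g. the optimized package/blob query
    with _LOCK:
        _CACHE[key] = (now, value)
    return value
```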

API requests that used to take up to half a minute now complete in milliseconds. We rolled the changes out to production the following week, and quay.io has been stable ever since. There have since been several traffic spikes on the App Registry endpoints, but the improvements have prevented any database outages.

What have we learned?

Obviously, every service tries to avoid downtime. In our case, we believe the recent outages helped make quay.io better. We learned a few key lessons that we want to share:

  1. Data about who uses your service, and how, is never superfluous. Because Quay "just worked," we never had to spend time optimizing traffic or managing load. That created a false sense of security that the service could scale indefinitely.
  2. When the service goes down, getting it back up and running is the top priority. Because Quay continued to suffer from a locked database during the first outage, our standard procedures did not have the intended effect and we could not restore the service with them. This led to a situation where we had to spend time analyzing and collecting data in the hope of finding the root cause, instead of focusing all our efforts on restoring service.
  3. Evaluate the impact of every service feature. Clients rarely used App Registry, so it was never a priority for our team. When some product features are barely used, their bugs rarely surface and developers stop watching the code. It's easy to fall prey to the delusion that this is how it should be, until one of those features suddenly ends up at the center of a massive incident.

What's next?

The work of keeping the service stable never stops, and we are constantly improving it. Traffic on quay.io continues to grow, and we know we must do everything we can to justify our customers' trust. We are therefore currently working on the following tasks:

  1. Deploying read-only database replicas to help the service handle read traffic if problems arise with the primary RDS instance.
  2. Upgrading the RDS instance. The current version itself is not the problem; rather, we simply want to remove the false trail we followed during the outage, and keeping the software up to date will eliminate one more variable in future outages.
  3. Additional caching across the entire cluster. We continue to look for areas where caching can reduce the load on the database.
  4. Adding a web application firewall (WAF) to see who connects to quay.io and why.
  5. Starting with the next release, Red Hat OpenShift clusters will phase out the App Registry in favor of Operator Catalogs based on the container images available on quay.io.
  6. A long-term replacement for App Registry could be support for the Open Container Initiative (OCI) artifact specification. It is currently being implemented as native Quay functionality and will become available to users once the specification itself is finalized.

All of the above is part of Red Hat's ongoing investment in quay.io as we move from a small "startup-style" team to a mature, SRE-driven platform. We know that many of our customers rely on quay.io in their daily work (including Red Hat!), and we try to be as open as possible about recent outages and our ongoing efforts to improve.

Source: habr.com
