Details of the Cloudflare outage on July 2, 2019


Almost nine years ago, Cloudflare was a tiny company; I didn't work for it, I was just a customer. A month after Cloudflare launched, I received an alert that DNS for my site, jgc.org, didn't seem to be working. Cloudflare had pushed out a change in its use of Protocol Buffers, and it had broken DNS.

I emailed Matthew Prince directly with the subject "Where's my DNS?", and he sent back a long reply full of technical details (you can read the full exchange here), to which I replied:

From: John Graham-Cumming
Date: October 7, 2010, 9:14 am
Subject: Re: Where is my DNS?
To: Matthew Prince

Great report, thanks. I'll definitely call if there's a problem. It's probably worth writing a blog post about this once you've gathered all the technical details; I think people would appreciate an open and honest story. Especially if you add graphs showing how traffic has grown since launch.

My site has good monitoring, and I get an SMS about every failure. Monitoring shows the outage lasted from 13:03:07 to 14:04:12. The tests run every five minutes.

I'm sure you'll figure it all out. Are you sure you don't need your own person in Europe? πŸ™‚

And he answered:

From: Matthew Prince
Date: October 7, 2010, 9:57 am
Subject: Re: Where is my DNS?
To: John Graham-Cumming

Thank you. We have responded to everyone who wrote. I'm on my way to the office right now and we'll write something on the blog or pin an official post on our bulletin board. I completely agree, honesty is everything.

Today Cloudflare is a really big company, I work for it, and now it is my turn to write openly about our mistake, its consequences, and what we are doing about it.

The events of July 2

On July 2, we deployed a new rule in our WAF Managed Rules that caused CPU exhaustion on every CPU core handling HTTP/HTTPS traffic on the Cloudflare network worldwide. We are constantly improving the WAF Managed Rules in response to new vulnerabilities and threats. In May, for example, we rushed out a rule to protect against a serious SharePoint vulnerability. The whole point of our WAF is being able to deploy rules quickly and globally.

Unfortunately, last Thursday's update contained a regular expression whose backtracking consumed the CPU allocated to serving HTTP/HTTPS traffic. Our core proxying, CDN, and WAF functionality suffered as a result. The graph below shows CPU usage for serving HTTP/HTTPS traffic spiking to nearly 100% on the servers in our network.

[Image: CPU usage at one of our points of presence during the incident]

As a result, our customers (and our customers' customers) saw a 502 error page when visiting any Cloudflare domain. The 502 errors were generated by Cloudflare's front-line web servers, which still had free CPU cores but could not reach the processes handling HTTP/HTTPS traffic.


We know how much pain this caused our customers. We are deeply ashamed of it. And the outage also hampered our own ability to deal with the incident.

If you were one of those customers, this was probably scary, frustrating, and infuriating, all the more so because we had not had a global outage in six years. The high CPU usage was caused by a single WAF rule containing a badly written regular expression that led to excessive backtracking. Here is the offending expression: (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

While the regex is interesting in its own right (and I cover it in more detail below), Cloudflare's 27-minute outage was about more than one bad regular expression: a sequence of events led to the crash and slowed our response. At the end of this post, I describe how backtracking works in a regular expression and what can be done about it.

What happened

Let's start in order. All times here are in UTC.

At 13:42, an engineer on the firewall team deployed a minor change to the rules for detecting XSS via an automated process. This created a change request ticket. We manage such tickets through Jira (screenshot below).

Three minutes later, the first PagerDuty page went out reporting a problem with the WAF. It was triggered by a synthetic test that checks WAF functionality from outside Cloudflare (we have hundreds of such tests) to make sure it works correctly. It was quickly followed by pages about failures of other end-to-end tests of Cloudflare services, global traffic problems, widespread 502 errors, and a flood of reports from our points of presence (PoPs) in cities around the world indicating CPU exhaustion.

[Screenshots of the Jira ticket and the PagerDuty alerts]

I got several of these alerts, rushed out of a meeting, and was heading to my desk when a lead from our Solutions team told me we had lost 80% of our traffic. I ran over to our SRE team, who were already working on the problem. At first we thought it might be some new kind of attack.


Cloudflare's SRE team is distributed around the world and provides coverage around the clock. Normally, alerts like these point to specific, limited local problems; they are tracked on internal dashboards and resolved many times a day. But this pattern of pages and notifications pointed to something genuinely serious, so SRE immediately declared a P0 incident and escalated to engineering management and systems engineers.

At that moment our London engineering team was in our main event space listening to an internal talk. The talk was interrupted, everyone assembled in a large conference room, and more people were dialed in. This was not the kind of routine problem SRE could handle on its own; the right specialists had to be brought in urgently.

At 14:00 we identified the WAF as the cause and ruled out an attack. The Performance team pulled live CPU data that clearly pointed at the WAF. Another team member confirmed the theory with strace. Someone else saw WAF errors in the logs. At 14:02, the whole team looked to me when a "global kill" was proposed, a mechanism built into Cloudflare for disabling a single component worldwide.

Actually performing the global WAF kill was another matter. Nothing about it was simple. We use our own products, and with our Access service down, we could not authenticate to our internal control panel (once everything was fixed, we also learned that some team members had lost access because of a security feature that disables credentials if the internal control panel goes unused for a long time).

We also could not reach internal services such as Jira or the build system. Getting to them required a bypass mechanism that we rarely used (something else to drill after the incident). Eventually an engineer executed the global WAF kill at 14:07, and by 14:09 traffic and CPU levels had returned to normal everywhere. The rest of Cloudflare's protection mechanisms continued to operate.

Then we moved on to restoring the WAF. Given the sensitivity of the situation, we ran both negative tests (asking ourselves whether that particular change really was the problem) and positive tests (verifying that the rollback worked) in a single city on a subset of traffic, after moving paying customers' traffic away from that location.

At 14:52 we were satisfied that we understood the cause, had a fix in place, and re-enabled the WAF globally.

How Cloudflare Works

Cloudflare has a team of engineers dedicated to managed rules for WAF. They strive to improve detection rates, reduce false positives, and respond quickly to new threats as they emerge. In the past 60 days, 476 WAF managed rule change requests have been processed (on average, one every 3 hours).

This particular change was deployed in simulate mode, where real customer traffic passes through the rule but nothing is blocked. We use this mode to test a rule's effectiveness and to measure its false positive and false negative rates. But even in simulate mode the rule actually has to be executed, and in this case the rule contained a regular expression that consumed excessive CPU.
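
To make the distinction concrete, here is a minimal Python sketch of simulate versus block modes. The class and function names are hypothetical, not Cloudflare's actual WAF code; the point is simply that the regular expression runs against the request in both modes, so a pathological pattern burns CPU even when nothing is blocked.

```python
import re
from dataclasses import dataclass

# Hypothetical sketch of "simulate" vs "block" rule modes; names are
# illustrative, not Cloudflare's actual WAF implementation.

@dataclass
class WafRule:
    rule_id: str
    pattern: str            # regular expression the rule evaluates
    mode: str = "simulate"  # "simulate": log matches only; "block": reject request

def evaluate(rule: WafRule, request_body: str) -> str:
    # The expensive part runs in BOTH modes: the regex is matched against the
    # request either way, so a pathological pattern costs CPU even in simulation.
    matched = re.search(rule.pattern, request_body) is not None
    if not matched:
        return "pass"
    if rule.mode == "simulate":
        print(f"[simulate] rule {rule.rule_id} would have fired")
        return "pass"       # traffic is never blocked in simulate mode
    return "block"

if __name__ == "__main__":
    rule = WafRule("example-xss-rule", r"on\w+\s*=", mode="simulate")
    print(evaluate(rule, "<img src=x onerror=alert(1)>"))  # logs a hit, but passes
```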

[Screenshot of the WAF rule change request]

As you can see from the change request above, there is a deployment plan, a rollback plan, and a link to the internal standard operating procedure (SOP) for this type of deployment. The SOP for a rule change specifically allows it to be pushed globally. This is very different from everything else we release at Cloudflare, where the SOP first pushes software to an internal point of presence (PoP) used by our own employees, then to a small number of customers in an isolated location, then to a larger number of customers, and only then to the whole world.

Here is what that process looks like. We use git internally via BitBucket. Engineers working on a change commit code, which is built by TeamCity; when the build passes, reviewers are assigned. Once a pull request is approved, the code is built and the test suite runs (again).

If the build and tests pass, a change request is created in Jira and the change has to be approved by the relevant manager or technical lead. Once approved, it is deployed to what we call the "animal PoPs": DOG, PIG, and Canary.

The DOG PoP is a Cloudflare PoP (just like any of our cities) that is used only by Cloudflare employees. This dogfooding PoP lets us catch problems before any customer traffic touches the change, which is very useful.

If the DOG test passes, the code moves on to PIG (as in "guinea pig"): a Cloudflare PoP where a small amount of free-customer traffic passes through the new code.
If that goes well, the code moves to Canary. We have three Canary PoPs spread around the world; there, both paid and free customer traffic runs through the new code as a final check for errors.

[Image: Cloudflare's software release process]

Once the code is healthy in Canary, it is released everywhere. Passing through all the stages (DOG, PIG, Canary, the whole world) takes hours or days, depending on the type of change. Because of the diversity of Cloudflare's network and customers, we test code thoroughly before releasing it globally to all our customers. But the WAF deliberately does not follow this process, because threats have to be dealt with quickly.
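
As a rough illustration of that staged flow, here is a small Python sketch. The stage names (DOG, PIG, Canary, global) come from the description above; the deploy and health-check functions are placeholders, not Cloudflare's real tooling.

```python
import time

# Sketch of a staged release flow (DOG -> PIG -> Canary -> global).
# deploy() and health_check() are placeholders for real tooling and metrics.

STAGES = ["DOG", "PIG", "Canary", "global"]

def deploy(stage: str, build: str) -> None:
    print(f"deploying build {build} to {stage}")

def health_check(stage: str) -> bool:
    # In reality: synthetic tests, error rates, CPU metrics for that PoP.
    return True

def staged_release(build: str, soak_seconds: int = 5) -> bool:
    for stage in STAGES:
        deploy(stage, build)
        time.sleep(soak_seconds)          # let real traffic exercise the change
        if not health_check(stage):
            print(f"rollback: health check failed at {stage}")
            return False
    return True

if __name__ == "__main__":
    staged_release("waf-rules-2019-07-02", soak_seconds=0)
```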

WAF Threats
In the past few years, the number of vulnerabilities in common applications has grown significantly. This is partly due to the wider availability of software testing tools (for example, we recently wrote about fuzzing).

[Chart. Source: https://cvedetails.com/]

Very commonly, a proof of concept is created and published on GitHub almost immediately, so that teams running and maintaining the affected application can test it and make sure they are adequately protected. That is why it is vital for Cloudflare to react to new attacks as quickly as possible, giving our customers time to patch their software.

A great example of Cloudflare's fast response was the deployment of protections for a SharePoint vulnerability in May (read about it here). Almost immediately after the announcement, we saw a huge spike in attempts to exploit the vulnerability in our customers' SharePoint installations. Our team constantly monitors new threats and writes rules to mitigate them on behalf of our customers.

The rule that caused Thursday's outage was intended to protect against cross-site scripting (XSS). The number of such attacks has also grown sharply in recent years.

[Chart. Source: https://cvedetails.com/]

The standard procedure for a WAF Managed Rules change is that it must pass continuous integration (CI) tests before global deployment. That happened last Thursday too, and the rules were deployed. At 13:31, an engineer merged an approved pull request containing the change.


At 13:37 TeamCity built the rules, ran the tests, and gave the green light. The WAF test suite checks the core functionality of the WAF and consists of a large number of unit tests for individual functions. After the unit tests, the WAF rules are exercised with a huge number of HTTP requests: some that the WAF should block (to make sure attacks are caught) and some that it should let through (to make sure it does not over-block and generate false positives). What the suite did not do was test for excessive CPU usage, and examining the logs of previous WAF builds shows no increase in test run time with this rule, so there was nothing to suggest it would exhaust resources.
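
Here is a sketch, in Python, of the kind of guard that was missing from the suite: time each rule's regular expression against deliberately nasty inputs and fail the build if a time budget is exceeded. The rule set, inputs, and budget below are purely illustrative.

```python
import re
import time

# Illustrative performance guard: time each rule's regex against adversarial
# inputs and flag it if a wall-clock budget is exceeded.

RULES = {
    "xss-example": r".*.*=.*",   # simplified core of the problematic pattern
}

PATHOLOGICAL_INPUTS = [
    "x" * 10_000,                # no '=' at all: the worst case for backtracking
    "x=" + "x" * 10_000,
]

BUDGET_SECONDS = 0.02            # per rule, per input

def check_rule_performance() -> bool:
    ok = True
    for name, pattern in RULES.items():
        compiled = re.compile(pattern)
        for payload in PATHOLOGICAL_INPUTS:
            start = time.perf_counter()
            compiled.match(payload)          # one anchored evaluation of the rule
            elapsed = time.perf_counter() - start
            if elapsed > BUDGET_SECONDS:
                print(f"FAIL {name}: {elapsed:.3f}s on a {len(payload)}-byte input")
                ok = False
    return ok

if __name__ == "__main__":
    check_rule_performance()
```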

The tests passed and TeamCity started to automatically deploy the change at 13:42.


Quicksilver

Because WAF rules need to address threats quickly, we deploy them with Quicksilver, our distributed key-value store that propagates changes globally in seconds. All of our customers rely on this technology whenever they change configuration in the dashboard or via the API, and it is what lets us act on changes with such speed.

We have not said much about Quicksilver before. We previously used Kyoto Tycoon as our globally distributed key-value store, but we ran into operational problems with it and wrote our own store, replicated across more than 180 cities. Today Quicksilver pushes customer configuration changes, updates WAF rules, and distributes the JavaScript customers write for Cloudflare Workers.

From a button click on the dashboard or an API call to that configuration change taking effect worldwide takes only seconds. Customers love this speed of configuration, and Workers gives them near-instant global software deployment. On average, Quicksilver distributes about 350 changes per second.

And Quicksilver is very fast: at the 99th percentile it takes about 2.29 seconds to propagate a change to every machine around the world. Normally that speed is a great thing. When you enable a feature or purge your cache, it happens almost instantly, everywhere. Pushing code via Cloudflare Workers happens at the same speed. It is part of Cloudflare's promise of fast updates exactly when you need them.
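
Quicksilver itself is not open source, so the Python sketch below only illustrates the consumer side of such a system under assumed semantics: every edge machine applies a new value for a watched key as soon as it sees it, which is exactly why both good and bad rule changes take effect everywhere within seconds.

```python
import json
import time

# Toy stand-in for a replicated key-value store: a dict plus a version counter.
# The real point is in watch_and_reload: the consumer applies any new version
# of the watched key immediately.

class FakeReplicatedKV:
    def __init__(self):
        self._data = {}
        self._version = 0

    def put(self, key, value):
        self._data[key] = value
        self._version += 1

    def get(self, key):
        return self._version, self._data.get(key)

def watch_and_reload(kv, key, iterations):
    seen_version = -1
    for _ in range(iterations):
        version, raw = kv.get(key)
        if raw is not None and version != seen_version:
            rules = json.loads(raw)
            print(f"hot-reloaded {len(rules)} WAF rules at version {version}")
            seen_version = version
        time.sleep(0.01)   # a real system pushes changes instead of polling

if __name__ == "__main__":
    kv = FakeReplicatedKV()
    kv.put("waf/rules", json.dumps([{"id": "example", "pattern": "on\\w+="}]))
    watch_and_reload(kv, "waf/rules", iterations=3)
```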

In this case, though, that speed worked against us: the bad rules reached every machine in a matter of seconds. You may have noticed that the WAF code is written in Lua. Cloudflare uses Lua extensively in production, and we have written before about the internals of the Lua WAF. The Lua WAF uses PCRE internally, which relies on backtracking for matching and has no mechanism to protect against runaway expressions. Below I go into more detail about that and about what we are doing to address it.
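
One generic way to defend against a runaway expression is to bound how long any single match may run (PCRE also offers its own match and recursion limits). The stdlib-only Python sketch below runs the match in a worker process and abandons it if it exceeds a wall-clock budget; it illustrates the idea, not the Lua WAF's actual design.

```python
import re
from multiprocessing import Pool, TimeoutError

# Generic sketch of a guard against runaway regex matching: run the match in a
# worker process and give up if it exceeds a wall-clock budget.

def _match(pattern, text):
    return re.match(pattern, text) is not None

def match_with_budget(pattern, text, budget_seconds):
    """Return True/False if the match completed, or None if the budget was exceeded."""
    with Pool(processes=1) as pool:
        result = pool.apply_async(_match, (pattern, text))
        try:
            return result.get(timeout=budget_seconds)
        except TimeoutError:
            # Runaway match: decide whether to fail open or closed,
            # but stop burning CPU (the pool is terminated on exit).
            return None

if __name__ == "__main__":
    print(match_with_budget(r".*.*=.*;", "x=" + "x" * 30_000, budget_seconds=0.5))
```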


Everything leading up to the deployment of the rules was done "correctly": a pull request was created and approved, the CI/CD pipeline built and tested the code, a change request was submitted following the SOP that governs deployment and rollback, and the deployment was carried out.

[Image: the Cloudflare WAF deployment process]

Something went wrong
As I said above, we deploy dozens of new WAF rules every week, and we have many systems in place to guard against the negative impact of those deployments. When something goes wrong, it is usually the combination of several causes at once. Pinning it on a single root cause is comforting, but it is rarely the truth. Here are the causes that, together, brought down our HTTP/HTTPS service.

  1. An engineer wrote a regular expression that could lead to excessive backtracking.
  2. A protection that would have prevented the regular expression from consuming excessive CPU had been mistakenly removed during a WAF refactoring a few weeks earlier, a refactoring intended to make the WAF use fewer resources.
  3. The regular expression engine in use had no complexity guarantees.
  4. The test suite had no way of detecting excessive CPU consumption.
  5. The SOP allowed a non-emergency rule change to be deployed globally without a staged rollout.
  6. The rollback plan required running the full WAF build twice, which took too long.
  7. The first alert about a global traffic drop took too long to fire.
  8. We were slow to update our status page.
  9. We had trouble accessing our own systems because of the outage, and the bypass procedure was not well rehearsed.
  10. SREs had lost access to some systems because their credentials had been timed out for security reasons.
  11. Our customers could not access the Cloudflare Dashboard or API, because they themselves pass through the Cloudflare edge.

What has changed since last Thursday

First, we have completely stopped all WAF release work and are doing the following:

  1. Re-introducing the excessive CPU usage protection that had been removed. (Done)
  2. Manually inspecting all 3,868 rules in WAF Managed Rules to find and fix any other potential cases of excessive backtracking. (Inspection complete)
  3. Introducing performance profiling of all rules into the test suite. (Expected: July 19)
  4. Switching to either the re2 or the Rust regex engine, both of which have run-time guarantees. (Expected: July 31)
  5. Changing the SOP so that rules are rolled out in stages, like other software at Cloudflare, while keeping the ability to do an emergency global deployment when attacks are already underway.
  6. Putting in place an emergency ability to move the Cloudflare Dashboard and API off Cloudflare's own edge.
  7. Automating updates to the Cloudflare Status page.

In the longer term, we are moving away from the Lua WAF that I wrote years ago. We are porting the WAF to our new firewall engine, which will make it faster and add an extra layer of protection.

Conclusion

This outage was painful for our customers and for us. We acted quickly to bring the situation under control, and we are now fixing the flaws in the processes that allowed it to happen, as well as going deeper to protect against any future problems with regular expressions by migrating to new technology.

We are deeply embarrassed by this outage and we apologize to our customers. We hope these changes will ensure that nothing like it happens again.

Appendix: Regular expression backtracking

To understand how the expression:

(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

consumed all the CPU, you need to know a little about how a standard regular expression engine works. The critical part is .*(?:.*=.*). The (?: and the matching ) form a non-capturing group (that is, the expression inside the parentheses is grouped as a single expression).

For the purposes of discussing CPU exhaustion, we can simplify the pattern to .*.*=.*. Written like that, it looks needlessly convoluted. More importantly, real-world expressions like this (such as the complex expressions in WAF rules), which ask the engine to "match anything, followed by anything, followed by something specific", can lead to catastrophic backtracking. Here is why.


In a regular expression, . means match a single character, and .* means match zero or more characters "greedily", that is, grabbing as many characters as possible. So .*.*=.* means: match zero or more characters, then match zero or more characters, then find a literal = character, then match zero or more characters.
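
You can see the greediness directly with any backtracking engine. For example, with Python's re module, turning the three pieces into capturing groups shows how they end up splitting x=x:

```python
import re

# "Greedy" means each .* first grabs as much as it can; the engine then
# backtracks until the literal '=' can still match.
print(re.match(r"(.*)(.*)=(.*)", "x=x").groups())   # ('x', '', 'x')
```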

Take the test string x=x. It matches the expression .*.*=.*: the .*.* before the equals sign matches the first x (one of the .* groups matches the x and the other matches zero characters), and the .* after the = matches the final x.

That match takes 23 steps. The first .* in .*.*=.* acts greedily and matches the entire string x=x. The engine then moves on to the second .*. There are no characters left to match, so it matches zero characters (which is allowed). Then the engine moves on to the =. With no characters left (the first .* has consumed all of x=x), the match fails.

At this point the regular expression engine backtracks. It returns to the first .* and matches it against x= (instead of x=x), then moves on to the second .*, which matches the second x. Again there are no characters left, so when the engine reaches the = in .*.*=.*, the match fails, and it backtracks again.

This time the first .* still matches x=, but the second .* matches zero characters instead of the trailing x. The engine then tries to find the literal = in .*.*=.*, but fails (because the = has already been consumed by the first .*). It backtracks yet again.

Now the first .* matches only the first x, and the second .* greedily grabs =x. You can guess what happens: the engine tries to match the literal =, fails, and backtracks once more.

The first .* still matches just the first x, and the second .* now takes only the =. Naturally, the engine cannot then match a literal =, because the second .* has already consumed it. More backtracking. And all this to match a three-character string!

Finally, with the first .* matching only the first x and the second .* matching zero characters, the engine can match the literal = in the expression against the = in the string. The last .* then matches the final x.

That is 23 steps just for x=x. Here is a short video of Perl's Regexp::Debugger showing the steps and the backtracking as they happen.

[Video]

That is already a lot of work, but what if instead of x=x the string is x=xx? That takes 33 steps. x=xxx? 45 steps. The growth is not linear. The chart below shows matching from x=x up to x=xxxxxxxxxxxxxxxxxxxx (20 x's after the =). With 20 x's after the =, the engine needs 555 steps to match. (Worse, if the x= is missing and the string is just 20 x's, the engine takes 4,067 steps to discover that there is no match.)

[Chart]
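
The step counts above come from Perl's Regexp::Debugger; you can see the same superlinear growth with any backtracking engine. The Python sketch below times the worst case (a string of x's with no = at all, so the match must fail); the absolute numbers differ, but the shape of the curve is the same.

```python
import re
import time

# Rough timing illustration of superlinear backtracking for .*.*=.* on
# inputs that cannot match (no '=' anywhere).

pattern = re.compile(r".*.*=.*")

for n in (1_000, 2_000, 4_000, 8_000):
    text = "x" * n
    start = time.perf_counter()
    pattern.match(text)       # anchored attempt, like evaluating the rule once
    print(f"n={n:>5}: {time.perf_counter() - start:.4f} s")
```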

This video shows all the backtracking involved in matching x=xxxxxxxxxxxxxxxxxxxx:

[Video]

The trouble is that as the string grows, the matching time grows superlinearly. Things get even worse if the regular expression is modified slightly. Suppose the pattern had been .*.*=.*; (with a literal semicolon at the end), for example to match a string like foo=bar;.

Then the backtracking becomes catastrophic. Matching x=x takes 90 steps instead of 23, and the number grows very quickly: matching x= followed by 20 x's takes 5,353 steps. Here is the chart; note the values on the Y axis compared with the previous chart.

[Chart]

If you are interested, you can watch all 5,353 steps it takes to fail to match x=xxxxxxxxxxxxxxxxxxxx against .*.*=.*;

[Video]

The amount of backtracking can be controlled by using "lazy" rather than "greedy" matching. If we change the expression to .*?.*?=.*?, matching x=x takes 11 steps (instead of 23), and so does matching x= followed by 20 x's. That is because the ? after each .* tells the engine to match the smallest number of characters first before moving on.

But lazy matching does not completely solve the backtracking problem. If we change the catastrophic example .*.*=.*; to .*?.*?=.*?;, its run time does not change at all: x=x still requires 555 steps, and x= followed by 20 x's still requires 5,353.
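
You can check this yourself by timing both variants against a string that cannot match (there is no semicolon). The Python sketch below uses a backtracking engine too, so its absolute numbers differ from the Perl step counts, but the greedy and lazy patterns come out comparably slow.

```python
import re
import time

# Lazy quantifiers do not rescue the failing case: both patterns explore
# essentially the same search space before giving up.

greedy = re.compile(r".*.*=.*;")
lazy = re.compile(r".*?.*?=.*?;")

text = "x=" + "x" * 5_000     # no ';', so both matches must fail

for name, pat in (("greedy", greedy), ("lazy", lazy)):
    start = time.perf_counter()
    pat.match(text)
    print(f"{name}: {time.perf_counter() - start:.4f} s")
```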

The only real fix (short of completely rewriting the pattern to be more specific) is to move away from a regular expression engine that relies on backtracking. That is what we will be doing over the next few weeks.

The solution to this problem has been known since 1968, when Ken Thompson published the paper "Programming Techniques: Regular expression search algorithm." It describes how to convert a regular expression into a non-deterministic finite automaton (NFA) and then, by following the NFA's state transitions, match the expression with an algorithm whose running time is linear in the size of the string being matched.


Programming Techniques
Regular Expression Search Algorithm
Ken Thompson

Bell Telephone Laboratories, Inc., Murray Hill, New Jersey

A method for locating specific character strings embedded in character text is described, and an implementation of this method in the form of a compiler is discussed. The compiler accepts a regular expression as source language and produces an IBM 7094 program as object language. The object program then accepts the text to be searched as input and produces a signal every time an embedded string in the text matches the given regular expression. Examples, problems, and solutions are also described.

Algorithm
Previous search algorithms resulted in backtracking when a partially successful search path failed.

In its compiled form, the algorithm does not work with characters; it works with compiled instructions. Execution is very fast: after a character is passed to the top of the current list, it searches in parallel for all possible subsequent characters of the regular expression.
The compilation and search algorithm is built into a time-sharing text editor as a context search. This is, of course, far from the only application of such a search procedure; for example, a variant of the algorithm is used as the symbol-table lookup in an assembler.
It is assumed that the reader is familiar with regular expressions and with the programming language of the IBM 7094 computer.

Compiler
The compiler consists of three concurrently running stages. The first stage is a syntax sieve that allows only syntactically correct regular expressions to pass. This stage also inserts the operator "·" for the juxtaposition of regular expressions. The second stage converts the regular expression to reverse Polish (postfix) form. The third stage produces the object code. The first two stages are straightforward, and we will not dwell on them.

Thompson's paper does not use the term "non-deterministic finite automaton", but it explains the linear-time algorithm well and presents an ALGOL-60 program that generates assembly code for the IBM 7094. The implementation looks cryptic, but the idea behind it is very simple.

[Excerpt from the figures in Thompson's paper: the third compilation stage builds a stack entry of generated code (NNODE) for each character of the example regular expression and then combines them into a single stack entry for the whole expression (see Figure 5 of the paper).]

Here is what the regular expression .*.*=.* looks like when drawn in the style of the diagrams in Thompson's paper.

[Figure 0: state machine for .*.*=.*]

In Figure 0 there are five states, starting at state 0, and three loops beginning at states 1, 2, and 3. These loops correspond to the three .* in the regular expression. The three ovals containing a dot each match a single character. The oval with an = sign matches the literal = character. State 4 is the final state: if it is reached, the regular expression has matched.

To see how such a state diagram can be used to match the regular expression .*.*=.*, let's walk through matching the string x=x. The program starts in state 0, as shown in Figure 1.

[Figure 1]

The key to making this algorithm work is that the state machine can be in several states at the same time: the non-deterministic finite automaton follows every possible transition simultaneously.

Before it has even read any input, it immediately transitions into both states 1 and 2, as shown in Figure 2.

[Figure 2]

Figure 2 shows what happens when the machine considers the first x in x=x. The x can match the upper dot by looping from state 1 back to state 1, or it can match the lower dot by looping from state 2 back to state 2.

After matching the first x of x=x, we are still in states 1 and 2; we cannot reach state 3 or 4, because that requires the literal = character.

The algorithm then considers the = in x=x. As with the x before it, it can match either of the two upper loops (state 1 to state 1, or state 2 to state 2), but it can also match the literal = and move from state 2 to state 3 (and immediately on to state 4). This is shown in Figure 3.

[Figure 3]

The algorithm then moves on to the final x in x=x. From states 1 and 2 the same loops back to states 1 and 2 are possible, and from state 3 the x can match the dot on the right and loop back to state 3.

At this point every character of x=x has been considered, and since state 4 has been reached, the regular expression matches the string. Each character is processed exactly once, so the algorithm is linear in the length of the input string. And no backtracking is needed.

Note that once state 4 was reached (after the algorithm had matched x=), the regular expression had already matched, and the algorithm could have terminated without even considering the final x.

The algorithm's running time is linear in the size of the input string.
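
To make the state-set idea concrete, here is a small Python sketch that hard-codes the five-state machine for .*.*=.* described above and simulates it the way Thompson's construction suggests: keep a set of current states, consume each input character exactly once, and accept if state 4 is in the set at the end. It is a toy for this one pattern, not a general regex engine.

```python
# States 0..4 for .*.*=.* as in the diagram above:
#   0 -eps-> 1, 1 -eps-> 2, 3 -eps-> 4 (state 4 is accepting)
#   1 -any-> 1   (first  .*)
#   2 -any-> 2   (second .*)
#   2 -'='-> 3
#   3 -any-> 3   (third  .*)

EPSILON = {0: {1}, 1: {2}, 3: {4}}
ACCEPT = 4

def epsilon_closure(states: set[int]) -> set[int]:
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in EPSILON.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def step(states: set[int], ch: str) -> set[int]:
    nxt = set()
    for s in states:
        if s in (1, 2, 3):          # the three .* loops consume any character
            nxt.add(s)
        if s == 2 and ch == "=":    # the literal '='
            nxt.add(3)
    return epsilon_closure(nxt)

def matches(text: str) -> bool:
    states = epsilon_closure({0})
    for ch in text:                 # each character is examined exactly once
        states = step(states, ch)
    return ACCEPT in states

if __name__ == "__main__":
    for s in ("x=x", "x" * 20, "x=" + "x" * 20):
        print(f"{s!r:>26} -> {matches(s)}")
```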

Source: habr.com
