Chaos Engineering: the art of deliberate destruction. Part 2

Translator's note: this article continues an excellent series by AWS technology evangelist Adrian Hornsby, who set out to explain, simply and clearly, the importance of experiments designed to mitigate the consequences of failures in IT systems.

“If you fail to plan, you are planning to fail.” - Benjamin Franklin

In the first part of this series, I introduced the concept of chaos engineering and explained how it helps find and fix flaws in a system before they lead to production failures. I also discussed how chaos engineering drives positive cultural change within organizations.

At the end of the first part, I promised to talk about "tools and ways to introduce failures into systems." Alas, my head had plans of its own, and in this article I will try to answer the most popular question from people who want to get started with chaos engineering: what should I break first?

Great question! However, this panda does not seem to be particularly worried ...

Don't mess with chaos panda!

Short answer: Target critical services along the request path.

Longer but more helpful answer: to understand where to start experimenting with chaos, pay attention to three areas:

  1. Look at your failure history and identify patterns;
  2. Identify your critical dependencies;
  3. Make use of the so-called overconfidence effect.

It's funny, but this part could just as well be called "A Journey to Self-Knowledge and Enlightenment". In it, we will start "playing" with some cool tools.

1. The answer lies in the past

If you remember, in the first part I introduced the concept of Correction-of-Errors (COE): the method by which we analyze our mistakes (in technology, process, or organization) in order to understand their cause(s) and prevent them from recurring. In general, this is where you should start.

"To understand the present, you need to know the past." — Carl Sagan

Look at your failure history, tag the COEs or postmortems, and categorize them. Identify common patterns that often lead to problems, and for each COE ask yourself the following question:

“Could it have been foreseen, and therefore prevented, by injecting a failure?”

I remember a failure from early in my career. It could easily have been prevented if we had run a couple of simple chaos experiments:

Under normal circumstances, the backend instances respond to health checks from the load balancer (ELB). The ELB uses these checks to route requests to "healthy" instances. When an instance turns out to be "unhealthy", the ELB stops sending requests to it. One day, after a successful marketing campaign, traffic surged and the backends started responding to health checks more slowly than usual. It should be noted that these health checks were deep, that is, they also checked the status of dependencies.

However, for a while everything was fine.

Then, under these already rather stressful conditions, one of the instances started running a non-critical, scheduled ETL-type cron job. The combination of high traffic and the cron job pushed CPU utilization to almost 100%. The CPU overload slowed responses to health checks even further, so much so that the ELB decided the instance was unhealthy. As expected, the balancer stopped routing traffic to it, which in turn increased the load on the remaining instances in the group.

All of a sudden, all the other instances also started failing the health check.

Starting a new instance required downloading and installing packages, and it took far longer than it took the ELB to take the existing instances out of service, one by one, in the auto-scaling group. Unsurprisingly, the whole process soon reached a critical point and the application went down.

It was only then that we finally grasped the following points:

  • Installing software while creating a new instance takes a long time; it is better to favor an immutable approach and a Golden AMI.
  • Under load, responses to the ELB health checks should take priority - the last thing you want is to make life harder for the remaining instances.
  • Caching health-check results locally helps a lot, even for a few seconds (see the sketch after this list).
  • In a difficult situation, don't run cron jobs and other non-critical processes - save the resources for the most important tasks.
  • Use smaller instances when autoscaling. A group of 10 small instances is better than 4 large ones: if one instance fails, in the first case 10% of the traffic is redistributed across 9 instances, in the second case 25% of the traffic is redistributed across three.
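
As an illustration of the caching point above, here is a minimal sketch of a health-check handler that serves a cached result so the ELB check stays cheap even when the deep check is slow. The /deep-health endpoint, the cache path, and the TTL are assumptions made for the example, not part of the original setup:

#!/usr/bin/env bash
# Hypothetical cached health check: refresh the expensive "deep" check
# at most once every TTL seconds, otherwise return the cached result.
CACHE=/tmp/health-cache
TTL=5

age=$(( $(date +%s) - $(stat -c %Y "$CACHE" 2>/dev/null || echo 0) ))
if [ "$age" -gt "$TTL" ]; then
  # Short timeout so the check itself cannot pile up under load.
  curl -sf -m 2 http://localhost:8080/deep-health > "$CACHE" || echo "UNHEALTHY" > "$CACHE"
fi
cat "$CACHE"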

So, could it have been foreseen, and therefore prevented, by injecting a failure?

Yes, and in several ways.

First, by simulating high CPU usage with tools such as stress-ng or cpuburn (here, a single matrix-math worker for 60 seconds):

❯ stress-ng --matrix 1 -t 60s

stress-ng

Second, by overloading the instance with wrk or a similar utility (here, 12 threads and 400 open connections for 20 seconds):

❯ wrk -t12 -c400 -d20s http://127.0.0.1/api/health

The experiments are relatively simple, but they can provide good food for thought without the stress of going through a real failure.

But don't stop there. Try to reproduce the failure in a test environment and check your answer to the question "Could it have been foreseen, and therefore prevented, by injecting a failure?". It is a mini chaos experiment inside a chaos experiment, used to test your assumptions, but starting from a failure.
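
For example, you could combine the two previous experiments and watch what happens to the health check while the instance is under pressure. A rough sketch (the endpoint, durations, and load levels are arbitrary assumptions):

❯ stress-ng --cpu 0 --cpu-load 90 -t 120s &
❯ wrk -t4 -c100 -d120s http://127.0.0.1/api/health &
❯ watch -n 1 'curl -s -o /dev/null -m 2 -w "%{http_code} %{time_total}s\n" http://127.0.0.1/api/health'

If the response codes start turning into errors, or the latency creeps past the ELB health-check timeout, you have reproduced the essence of the failure.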

Was it a dream, or did it really happen?

So study your failure history, analyze the COEs, tag and categorize them by their "blast radius" (or, more specifically, by the number of customers affected), and then look for patterns. Ask yourself whether each incident could have been foreseen and prevented by injecting a failure. Check your answer.

Then move on to the most common patterns with the largest blast radius.

2. Build a dependency map

Take a moment to think about your application. Is there a clear map of its dependencies? Do you know what impact they will have in the event of a failure?

If you're not very familiar with your application's code, or if it has grown too large, it can be hard to understand what the code does and what its dependencies are. Understanding those dependencies and their possible impact on the application and its users is critical for deciding where to start with chaos engineering: the starting point is the component with the largest blast radius.

Identifying and documenting dependencies is called "building a dependency map" (dependency mapping). It is typically done for applications with a large code base, using code-profiling and instrumentation tools. You can also build a map by monitoring network traffic.

However, not all dependencies are equal (which complicates the process further). Some are critical, others secondary (at least in theory, since failures often arise from problems with dependencies that were considered non-critical).

Without its critical dependencies, the service cannot work. Non-critical dependencies "should not" affect the service when they go down. To understand dependencies, you need a clear picture of the APIs your application uses. This can be much harder than it looks, at least for large applications.

Start by going through all the APIs. Highlight the most significant and critical ones. Pull the dependencies from the code repository, examine the connection logs, and then review the documentation (if it exists, of course; otherwise you have even bigger problems). Use profiling and tracing tools, and filter out external calls.
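
A crude but useful first pass (assuming a typical repository layout; the src/ path is just an example) is simply to grep the code base for outbound endpoints and count them:

❯ grep -rEoh 'https?://[a-zA-Z0-9._-]+' src/ | sort | uniq -c | sort -rn | head -20

It will not catch everything (service discovery, IP addresses in configs), but it quickly shows which external hosts your code talks to most often.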

You can use programs like netstat - a command line utility that displays a list of all network connections (active sockets) in the system. For example, to list all current connections, type:

❯ netstat -a | more 

On AWS, you can use VPC Flow Logs, a feature that lets you capture information about the IP traffic going to and from network interfaces in a VPC. Such logs can also help with other tasks, for example finding out why certain traffic never reaches an instance.
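
Enabling them is a one-off call; here is a sketch with placeholder IDs (substitute your own VPC ID, log group, and IAM role):

❯ aws ec2 create-flow-logs --resource-type VPC --resource-ids vpc-0123456789abcdef0 \
    --traffic-type ALL --log-group-name vpc-flow-logs \
    --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role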

You can also use AWS X-Ray. X-Ray gives you a detailed, end-to-end view of requests as they travel through the application and builds a map of the application's underlying components. Very handy if you need to identify dependencies.

AWS X-Ray Console
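
If your services already send traces to X-Ray, you can also pull the same dependency graph from the command line; a sketch for the last hour (the time window is arbitrary):

❯ aws xray get-service-graph --start-time $(date -d '-1 hour' +%s) --end-time $(date +%s)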

The network dependency map is only a partial solution. Yes, it shows which application communicates with which, but there are other dependencies.

Many applications use DNS to connect to dependencies, while others may use service discovery or even hard-coded IP addresses in configuration files (for example, in /etc/hosts).

For example, you can create a DNS "black hole" with iptables and see what breaks. To do this, enter the following command:

❯ iptables -I OUTPUT -p udp --dport 53 -j REJECT -m comment --comment "Reject DNS"

"Black hole" DNS

If, in /etc/hosts or other configuration files, you find IP addresses you know nothing about (yes, unfortunately, this happens), iptables can again come to the rescue. Say you found 8.8.8.8 and don't know that it is the address of Google's public DNS server. With iptables you can block incoming and outgoing traffic to this address using the following commands:

❯ iptables -A INPUT -s 8.8.8.8 -j DROP -m comment --comment "Reject from 8.8.8.8"
❯ iptables -A OUTPUT -d 8.8.8.8 -j DROP -m comment --comment "Reject to 8.8.8.8"

Closing access

The first rule drops all packets coming from Google's public DNS: ping works, but no packets come back. The second rule drops all packets leaving your system for Google's public DNS; in response to ping you get Operation not permitted.

Note: in this particular case it would be better to use whois 8.8.8.8, but this is just an example.
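
And when the experiment is over, don't forget to clean up: the rules can be removed with -D and the same specification (mirroring the commands above, including the DNS one):

❯ iptables -D INPUT -s 8.8.8.8 -j DROP -m comment --comment "Reject from 8.8.8.8"
❯ iptables -D OUTPUT -d 8.8.8.8 -j DROP -m comment --comment "Reject to 8.8.8.8"
❯ iptables -D OUTPUT -p udp --dport 53 -j REJECT -m comment --comment "Reject DNS"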

You can go even deeper down the rabbit hole, since everything that uses TCP and UDP actually depends on IP as well. In most cases, IP is tied to ARP. Don't forget about firewalls...

"You take the red pill, you stay in Wonderland, and I show you how deep the rabbit hole goes."

A more radical approach is to disconnect machines one at a time and see what breaks... in other words, to become a "chaos monkey". Of course, many production systems are not designed for such a brute-force attack, but it can at least be tried in a test environment.
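
If you want to try it, here is a minimal sketch of that brute-force approach for a test environment. It assumes the disposable instances carry a hypothetical Environment=test tag; adjust the filter to your own setup:

# Pick one random running instance from the test group and terminate it.
instance=$(aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=test" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" --output text | tr '\t' '\n' | shuf -n 1)
echo "Terminating $instance"
aws ec2 terminate-instances --instance-ids "$instance"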

Building a dependency map is often a very long undertaking. I recently spoke with a customer who had spent almost two years developing a tool that semi-automatically generates dependency maps for hundreds of microservices and teams.

The result, however, is extremely interesting and useful. You will learn a lot about your system, its dependencies and operations. Again, be patient: the journey itself matters the most.

3. Beware of overconfidence

"Whoever dreams about something, he believes in it." — Demosthenes

Have you ever heard of the overconfidence effect?

According to Wikipedia, the overconfidence effect is "a cognitive bias in which a person's confidence in their actions and decisions is significantly greater than the objective accuracy of those judgments, especially when the level of confidence is relatively high."

Based on intuition and experience...

In my experience, this bias is a great hint about where to start with chaos engineering.

Beware of the overconfident operator:

Charlie: "This thing hasn't been dropped in five years, everything is fine!"
Glitch: "Wait... I'll be there soon!"

Overconfidence bias is an insidious and even dangerous thing because of the many factors that influence it. This is especially true when team members have poured their heart and soul into a particular technology or have spent a lot of time on "fixes".

To summarize

The search for a starting point for chaos engineering always yields more than you expect, and teams that rush to break everything too quickly lose sight of the broader and more interesting essence of (chaos) engineering: the creative application of scientific methods and empirical evidence to the design, development, operation, maintenance, and improvement of (software) systems.

This concludes the second part. Please write reviews, share your opinions, or simply clap your hands on Medium. In the next part, I really will cover tools and methods for injecting failures into systems. See you!


Source: habr.com
