Istio Circuit Breaker: Disable Faulty Containers

The holidays are over and we're back with our second post in the Istio Service Mesh series.

Today's topic is the Circuit Breaker. The name comes from electrical engineering, where a circuit breaker is the automatic switch that cuts off a shorted or overloaded circuit. In Istio, the breaker does not cut off a circuit; it cuts off faulty containers.

How it should ideally work

When microservices are managed by Kubernetes, for example on the OpenShift platform, they scale up and down automatically depending on the load. Since microservices run in pods, several instances of a containerized microservice can sit behind one endpoint at once, and Kubernetes routes requests and balances the load between them. And - ideally - all of this should work fine.

Remember that microservices are small and ephemeral. Ephemerality, which here means the ease with which instances appear and disappear, is often underestimated. The birth and death of yet another microservice instance in a pod is entirely expected; OpenShift and Kubernetes handle it well, and everything works great - but, again, in theory.

How it really works

Now imagine that a particular instance of the microservice, that is, a container, has become unusable: either it does not respond (error 503) or, more unpleasantly, it responds, but too slowly. In other words, it malfunctions or fails to answer requests, yet it is not automatically removed from the pool. What should be done in this case? Retry? Remove it from the routing scheme? And what does "too slow" mean - how much is that in numbers, and who decides? Maybe just give it a break and try again later? If so, how much later?

What is Pool Ejection in Istio

And here Istio comes to the rescue with its Circuit Breakers, which temporarily remove faulty containers from the routing and load balancing resource pool by implementing the Pool Ejection procedure.

Using an outlier detection strategy, Istio detects misbehaving pods and removes them from the resource pool for a specified time, which is called the "sleep window".

To show how this works in Kubernetes on the OpenShift platform, let's start with a screenshot of normally working microservices from the example in the Red Hat Developer Demos repository. Here we have two pods, v1 and v2, each running one container. When Istio routing rules are not used, Kubernetes applies evenly balanced round-robin routing by default:

[Screenshot: requests alternating evenly between v1 and v2]

Preparing for a failure

Before doing Pool Ejection, you need to create an Istio routing rule. Let's say we want to distribute requests between pods in a 50/50 ratio. In addition, we will increase the number of v2 containers from one to two, like this:

oc scale deployment recommendation-v2 --replicas=2 -n tutorial

Now we set a routing rule so that traffic is distributed between pods in a 50/50 ratio.

[Image: Istio routing rule splitting traffic 50/50 between v1 and v2]
And here is what the result of this rule looks like:

[Screenshot: request distribution between v1 and v2 after applying the rule]
You could object that this screenshot shows 14:9 rather than 50/50, but the split evens out over time.
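
Why a short sample can show 14:9 while the rule says 50/50 is easy to see in a small simulation. The sketch below is not Istio code: it assumes that a weighted rule picks a version independently for each request with probability proportional to its weight, and the names route and observed_share are made up for illustration.

```python
import random


def route(weights, rng):
    """Pick a version name at random, proportionally to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def observed_share(weights, n_requests, seed=0):
    """Route n_requests and return the share of traffic each version received."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    counts = {v: 0 for v in weights}
    for _ in range(n_requests):
        counts[route(weights, rng)] += 1
    return {v: counts[v] / n_requests for v in counts}


if __name__ == "__main__":
    # 23 requests is the size of the 14:9 sample; larger samples drift towards 0.5.
    for n in (23, 1000, 100000):
        print(n, observed_share({"v1": 50, "v2": 50}, n))
```

With only a couple of dozen requests the observed split can be noticeably lopsided; with many thousands it settles close to the configured weights.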

Staging a failure

Now let's disable one of the two v2 containers so that we have one healthy v1 container, one healthy v2 container, and one failed v2 container:

[Screenshot: one healthy v1 container, one healthy v2 container, and one failed v2 container]

Fixing a glitch

So, we have a faulty container, and it's time for Pool Ejection. With a very simple config, we will exclude this failed container from all routing for 15 seconds, in the expectation that it will return to a healthy state on its own (by either restarting or recovering its performance). Here is what this config looks like, along with the results:

[Image: Pool Ejection config]
[Screenshot: request routing with the failed v2 container ejected]
As you can see, the failed v2 container is no longer used in request routing, since it has been removed from the pool. After 15 seconds, it automatically returns to the pool. That, in short, is how Pool Ejection works.
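
The ejection logic itself can be illustrated with a toy load balancer. This is a minimal sketch under stated assumptions, not how Istio is implemented: the class EjectingPool, its method names, and the host names are hypothetical, a single error is enough to eject a host, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
import itertools


class EjectingPool:
    """Round-robin pool that ejects a host for a fixed time after an error.

    Hypothetical illustration: one reported error ejects the host.
    """

    def __init__(self, hosts, ejection_seconds=15.0):
        self.hosts = list(hosts)
        self.ejection_seconds = ejection_seconds
        self.ejected_until = {}  # host -> time at which it may return
        self._cycle = itertools.cycle(self.hosts)

    def healthy(self, now):
        """Hosts that are not ejected at time `now`."""
        return [h for h in self.hosts if self.ejected_until.get(h, 0.0) <= now]

    def pick(self, now):
        """Return the next healthy host in round-robin order, or None."""
        if not self.healthy(now):
            return None
        for host in self._cycle:  # terminates: at least one host is healthy
            if self.ejected_until.get(host, 0.0) <= now:
                return host

    def report_error(self, host, now):
        """Eject `host` from the pool for ejection_seconds."""
        self.ejected_until[host] = now + self.ejection_seconds


if __name__ == "__main__":
    pool = EjectingPool(["v1", "v2-a", "v2-b"], ejection_seconds=15.0)
    pool.report_error("v2-b", now=0.0)
    print([pool.pick(now=1.0) for _ in range(4)])   # v2-b is skipped
    print(pool.healthy(now=16.0))                   # v2-b is back
```

The point of the design is that nothing has to "repair" the host: once the ejection time has elapsed, it simply becomes eligible for traffic again.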

Starting to build the architecture

Pool Ejection, combined with the monitoring capabilities of Istio, allows you to start building a framework for automatically replacing failed containers to reduce or even eliminate downtime and failures.
 
NASA has a famous motto, "Failure Is Not an Option", attributed to flight director Gene Kranz. The idea is that anything can be made to work, given enough will. In real life, however, failures do not just happen; they are inevitable, everywhere and in everything. So how do you deal with them in the case of microservices? In our opinion, it is better to rely not on willpower but on the capabilities of containers, Kubernetes, Red Hat OpenShift, and Istio.

Istio, as we wrote above, implements the concept of circuit breakers, which has proven itself in the physical world. Just as an electrical circuit breaker cuts off a problematic section of a circuit, the software Circuit Breaker in Istio breaks the connection between the request flow and the problem container when something is wrong with the endpoint, for example, when the server crashes or starts to slow down.

The second case is the more troublesome one: a slow container not only causes a cascade of delays in the services that call it, reducing the performance of the system as a whole, but also provokes repeated requests to an already slow service, which only makes things worse.

Circuit Breaker in theory

Circuit Breaker is a proxy that controls the flow of requests to an endpoint. When the endpoint stops working or - depending on the configured settings - starts to slow down, the proxy breaks the connection to the container. Traffic is then redirected to other containers simply through load balancing. The circuit stays open for a given sleep window, say two minutes, and is then considered half-open. The next request attempt determines the further state of the connection. If the service is fine, the connection returns to the working state and the circuit is closed again. If something is still wrong with the service, the connection is broken again and the sleep window starts over. Here is what a simplified Circuit Breaker state transition diagram looks like:

[Diagram: simplified Circuit Breaker state transitions between closed, open, and half-open]
It is important to note that all of this happens at the level of the system architecture, so to speak. At some point you will therefore have to teach your applications to work with Circuit Breaker, for example, by returning a default value or, if possible, by ignoring the service altogether. The bulkhead pattern is used for this, but it is beyond the scope of this article.
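
The closed, open, and half-open cycle described above can be written down as a tiny state machine. This is a simplified sketch rather than Istio's implementation: the class and method names are hypothetical, a single failure trips the breaker, and the time is passed in explicitly instead of being read from a clock.

```python
class CircuitBreaker:
    """Minimal closed / open / half-open state machine (illustrative only)."""

    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, sleep_window=120.0):
        self.sleep_window = sleep_window  # seconds the circuit stays open
        self.state = self.CLOSED
        self.opened_at = None

    def allow(self, now):
        """Return True if a request may be sent at time `now`."""
        if self.state == self.OPEN and now - self.opened_at >= self.sleep_window:
            self.state = self.HALF_OPEN  # sleep window is over: let a probe through
        return self.state != self.OPEN

    def record(self, success, now):
        """Feed back the outcome of a request that was allowed through."""
        if success:
            self.state = self.CLOSED
            self.opened_at = None
        else:
            self.state = self.OPEN  # trip, or re-trip after a failed probe
            self.opened_at = now


if __name__ == "__main__":
    cb = CircuitBreaker(sleep_window=120.0)
    cb.record(success=False, now=0.0)
    print(cb.allow(now=60.0), cb.state)    # still inside the sleep window
    print(cb.allow(now=120.0), cb.state)   # probe allowed
    cb.record(success=True, now=120.0)
    print(cb.state)
```

A failed probe in the half-open state sends the breaker straight back to open and restarts the sleep window, which matches the transition described in the text.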

Circuit Breaker in practice

As an example, we will run two versions of our recommendation microservice on OpenShift. Version 1 will work fine, but into v2 we will build a delay to simulate a slow server. To view the results, we use the siege tool:

siege -r 2 -c 20 -v customer-tutorial.$(minishift ip).nip.io

[Screenshot: siege output]
Everything seems to work, but at what cost? At first glance we have 100% availability, but take a closer look: the longest transaction takes as much as 12 seconds. This is clearly a bottleneck, and it needs to be eliminated.

To do this, we will use Istio to exclude calls to slow containers. Here is what the corresponding config looks like using Circuit Breaker:

[Image: Circuit Breaker config]
The last line, with the httpMaxRequestsPerConnection parameter, specifies that the circuit should be broken when an attempt is made to create another - a second - connection in addition to the existing one. Since our container simulates a slow service, such situations will arise periodically; Istio will then return a 503 error, and here is what siege will show:

[Screenshot: siege output with 503 errors]
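
Why a slow service produces 503 responses under such a limit can be pictured with a small sketch. It is an illustration under assumptions, not a model of the proxy's internals: the class ConnectionLimiter and its methods are hypothetical, and the limit of one request in flight mirrors the "second connection" wording above.

```python
class ConnectionLimiter:
    """Rejects requests with 503 once max_in_flight are already running."""

    def __init__(self, max_in_flight=1):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_start(self):
        """Return 200 and reserve a slot, or 503 if the limit is reached."""
        if self.in_flight >= self.max_in_flight:
            return 503
        self.in_flight += 1
        return 200

    def finish(self):
        """Release a slot when a request completes."""
        if self.in_flight > 0:
            self.in_flight -= 1


if __name__ == "__main__":
    limiter = ConnectionLimiter(max_in_flight=1)
    print(limiter.try_start())  # first request is accepted
    print(limiter.try_start())  # arrives while the first is still running
    limiter.finish()
    print(limiter.try_start())  # slot is free again
```

The slower the service, the longer each request holds its slot, and the more often a new request arrives while the slot is still taken.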

OK, we have Circuit Breaker, what's next?

So, we have implemented automatic disconnection without touching the source code of the services at all. Using Circuit Breaker and the Pool Ejection procedure described above, we can remove slow containers from the resource pool until they return to normal, and check their status at a specified interval - in our example, two minutes (the sleepWindow parameter).

Note that the ability of an application to respond to a 503 error is still set at the level of its source code. There are many strategies for working with Circuit Breaker, which are applied depending on the situation.
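
One of the simplest such strategies is to answer with a default value when the breaker returns 503. The sketch below is only an illustration: get_recommendation, the fetch callable (which is assumed to return a status code and a body), and the default text are all hypothetical names, not part of the tutorial's services.

```python
def get_recommendation(fetch, default="no recommendation available"):
    """Call fetch() -> (status, body) and fall back to a default on 503.

    `fetch` is a hypothetical callable standing in for the real HTTP call.
    """
    status, body = fetch()
    if status == 503:
        return default  # the circuit is open or the pool overflowed
    return body


if __name__ == "__main__":
    print(get_recommendation(lambda: (200, "recommendation v1")))
    print(get_recommendation(lambda: (503, "")))
```

Passing the call in as a function keeps the fallback logic separate from the transport, so the same wrapper works with whatever HTTP client the service already uses.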

In the next post, we will talk about tracing and monitoring, which are either already built into Istio or easily added to it, as well as how to deliberately inject errors into the system.

Source: habr.com
