Transl. note: The first part of this series was devoted to getting acquainted with Istio's capabilities and demonstrating them in action. Now we will talk about more complex aspects of configuring and using this service mesh, in particular about fine-grained routing and network traffic management.
We also remind you that the article uses configurations (manifests for Kubernetes and Istio) from the istio-mastery repository.
Traffic management
With Istio, new features are added to the cluster to provide:
Load balancing: simple, as well as consistent hash-based;
Failure recovery: timeouts, retries, circuit breakers;
Fault injection: delays, aborted requests, etc.
As the article continues, these capabilities will be demonstrated on the selected application, and new concepts will be introduced along the way. The first such concept is DestinationRules (i.e. rules about the recipient of traffic/requests - approx. transl.), with which we will activate A/B testing.
A/B Testing: DestinationRules in Practice
A/B testing is used when there are two versions of an application (usually they are visually different) and we are not 100% sure which one will improve the user experience. Therefore, we simultaneously launch both versions and collect metrics.
To deploy the second version of the frontend, which is required for the A/B testing demo, run the following command:
$ kubectl apply -f resource-manifests/kube/ab-testing/sa-frontend-green-deployment.yaml
deployment.extensions/sa-frontend-green created
The deployment manifest for the "green version" differs in two places:
The image is based on a different tag: istio-green;
The pods have the label version: green.
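For orientation, a minimal sketch of the parts of sa-frontend-green-deployment.yaml that matter here might look like this (the image repository name and the replica count are assumptions for illustration; only the tag and the version label are essential):

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: sa-frontend-green
spec:
  replicas: 2                 # assumption: replica count chosen for illustration
  template:
    metadata:
      labels:
        app: sa-frontend      # same label as in the original deployment
        version: green        # distinguishes the green version's pods
    spec:
      containers:
      - name: sa-frontend
        # assumption: hypothetical image repository; only the istio-green tag matters
        image: example/sa-frontend:istio-green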
Because both deployments have the label app: sa-frontend, the requests routed by the virtual service sa-external-services to the service sa-frontend will be redirected to all of its instances, and the load will be distributed by the round-robin algorithm, which leads to the following situation:
Requested files not found
These files were not found because they are named differently in different versions of the application. Let's verify:
It means that index.html, which requests one version of the static files, can be sent by the load balancer to pods hosting a different version, where, for obvious reasons, those files do not exist. Therefore, for the application to work, we need to impose a restriction: "the same version of the application that returned index.html must serve the subsequent requests".
We'll achieve this with hash-based consistent load balancing (Consistent Hash Load Balancing). In this case, requests from the same client are sent to the same backend instance, based on a predefined property - for example, an HTTP header. It is implemented using DestinationRules.
DestinationRules
After a Virtual Service has routed a request to the desired service, with DestinationRules we can define the policies that will be applied to the traffic destined for the instances of this service:
Traffic management with Istio resources
Note: The effect of Istio resources on network traffic is presented here in a simplified form for ease of understanding. To be precise, the decision about which instance to send the request to is made by the Envoy of the Ingress Gateway, configured via CRDs.
With DestinationRules, we can configure load balancing to use consistent hashes and ensure that the same service instance responds to the same user. The following configuration achieves this (destinationrule-sa-frontend.yaml):
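The manifest itself is in the repository; a minimal sketch of it, assuming the hash is computed over an HTTP header named version, looks like this:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: sa-frontend
spec:
  host: sa-frontend
  trafficPolicy:
    loadBalancer:
      consistentHash:
        # requests carrying the same value of this header always land
        # on the same pod; the header name is an assumption for the demo
        httpHeaderName: version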
Note: To add different values in the header and test the results directly in the browser, you can use this extension for Chrome (or this one for Firefox - approx. transl.).
In general, DestinationRules offer more load balancing capabilities - see the details in the official documentation.
Before exploring VirtualService further, let's remove the "green version" of the application and the corresponding traffic routing rule by deleting them with kubectl delete.
shadowing ("shielding") or Mirroring ("mirroring") is used in cases where we want to test a change in production without affecting end users: to do this, we duplicate ("mirror") requests to a second instance where the necessary changes are made, and look at the consequences. To put it simply, it's when your colleague picks the most critical issue and makes a pull request in the form of such a huge ball of dirt that no one can actually review it.
To test this scenario in action, let's create a second, buggy, instance of sa-logic by running the following command:
$ kubectl apply -f resource-manifests/kube/shadowing/sa-logic-service-buggy.yaml
deployment.extensions/sa-logic-buggy created
And now let's run a command to make sure that all pods with the label app=sa-logic also have a corresponding version label:
The service sa-logic targets pods with the label app=sa-logic, so all requests will be distributed among all of its instances:
… but we want requests to be routed to the v1 instances and mirrored to the v2 instances:
We achieve this with a VirtualService in combination with a DestinationRule: the DestinationRule defines the subsets (v1 and v2), and the VirtualService routes requests to one subset while mirroring them to another.
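The full listings are in the repository; a sketch of the pair of resources, assuming the subsets are selected by the version label, looks like this:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: sa-logic
spec:
  host: sa-logic
  subsets:
  - name: v1
    labels:
      version: v1      # pods of the regular deployment
  - name: v2
    labels:
      version: v2      # pods of the buggy deployment
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: sa-logic
spec:
  hosts:
  - sa-logic
  http:
  - route:
    - destination:
        host: sa-logic
        subset: v1     # all real traffic is routed to v1
    mirror:
      host: sa-logic
      subset: v2       # a copy of each request goes to v2; responses are discarded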
Now let's see it in action:
$ kubectl apply -f resource-manifests/istio/shadowing/sa-logic-subsets-shadowing-vs.yaml
virtualservice.networking.istio.io/sa-logic created
Let's generate some load with the following command:
$ while true; do curl -v http://$EXTERNAL_IP/sentiment \
  -H "Content-type: application/json" \
  -d '{"sentence": "I love yogobella"}';
  sleep .8; done
Let's look at the results in Grafana: the buggy version fails for ~60% of requests, but none of these failures affect end users, since they are answered by the working service.
Success rate of responses from different versions of the sa-logic service
Here we saw for the first time how a VirtualService is applied to the Envoys of our services: when sa-web-app makes a request to sa-logic, the request passes through the sidecar Envoy, which - via the VirtualService - is configured to route the request to the v1 subset and mirror it to the v2 subset of the sa-logic service.
I know: by now you may have decided that Virtual Services are simple. In the next section, we will expand on that opinion: they are also truly great.
Canary Rollouts
Canary Deployment is the process of rolling out a new version of an application to a small number of users. It is used to make sure that there are no problems in the release and, only then, being confident in the quality of the release, to distribute it to a bigger audience.
To demonstrate canary rollouts, we will continue working with the buggy subset of sa-logic.
Let's not waste time on trifles: we will immediately send 20% of users to the buggy version (representing our canary rollout), and the remaining 80% to the normal service. To do this, apply the following VirtualService (sa-logic-subsets-canary-vs.yaml):
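The essence of the manifest, assuming the same v1/v2 subsets as in the shadowing example, is sketched below:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: sa-logic
spec:
  hosts:
  - sa-logic
  http:
  - route:
    - destination:
        host: sa-logic
        subset: v1
      weight: 80       # 80% of requests go to the healthy subset
    - destination:
        host: sa-logic
        subset: v2
      weight: 20       # 20% go to the buggy (canary) subset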
... and we will immediately see that some of the requests lead to failures:
$ while true; do
  curl -i http://$EXTERNAL_IP/sentiment \
  -H "Content-type: application/json" \
  -d '{"sentence": "I love yogobella"}' \
  --silent -w "Time: %{time_total}s \t Status: %{http_code}\n" \
  -o /dev/null; sleep .1; done
Time: 0.153075s Status: 200
Time: 0.137581s Status: 200
Time: 0.139345s Status: 200
Time: 30.291806s Status: 500
This is how VirtualServices enable canary rollouts: here we have narrowed the potential impact of the problems down to 20% of the user base. Wonderful! Now, whenever we are not sure about our code (in other words, always...), we can use mirroring and canary rollouts.
Timeouts and retries
But bugs don't only end up in code. In the list of the "8 fallacies of distributed computing", first place is taken by the erroneous belief that "the network is reliable". In reality, the network is not reliable, and for this reason we need timeouts and retries.
For the demonstration we will continue to use the same buggy version of sa-logic, and the unreliability of the network will be simulated by random failures.
Let our buggy service have a 1/3 chance of responding too slowly, a 1/3 chance of ending with an Internal Server Error, and a 1/3 chance of successfully rendering the page.
To mitigate the impact of these issues and improve the lives of our users, we can:
add a timeout if the service takes longer than 8 seconds to respond,
retry failed requests, considering an attempt unsuccessful if the response takes longer than 3 seconds.
This is an optimization: the user will wait no more than 8 seconds, and in case of failures we will make up to three new attempts to get a response, increasing the chance of success.
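Translated into a VirtualService, this gives roughly the following sketch (the even 50/50 split between the regular and buggy subsets is an assumption kept from the demo setup):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: sa-logic
spec:
  hosts:
  - sa-logic
  http:
  - route:
    - destination:
        host: sa-logic
        subset: v1
      weight: 50
    - destination:
        host: sa-logic
        subset: v2
      weight: 50
    timeout: 8s          # fail the request if it takes longer than 8 seconds overall
    retries:
      attempts: 3        # retry failed requests up to three times
      perTryTimeout: 3s  # each individual attempt is failed after 3 seconds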
Apply the updated configuration with kubectl apply, as in the previous steps.
And check in the Grafana charts that the share of successful responses has increased:
Improvements in success statistics after adding timeouts and retries
Before proceeding to the next section (more precisely, to the next part of the article, since there will be no more practical experiments in this one - approx. transl.), delete sa-logic-buggy and the VirtualService with kubectl delete.
Circuit Breaker and Bulkhead
These are two important patterns in microservice architecture that allow services to achieve self-healing.
Circuit Breaker ("circuit breaker") is used to stop requests going to an instance of a service that is considered unhealthy, and to let it recover while client requests are redirected to the healthy instances of that service (which also increases the success rate). (Transl. note: A more detailed description of the pattern can be found, for example, here.)
bulkhead("partition") isolates failures in services from the defeat of the entire system. For example, service B is broken, and another service (a client of service B) makes a request to service B, causing it to use up its thread pool and not be able to serve other requests (even if they do not belong to service B). (Transl. note: A more detailed description of the pattern can be found, for example, here.)
I will omit the implementation details of these patterns because they are easy to find in the official documentation, and I really want to show authentication and authorization, which will be discussed in the next part of the article.