CPU consumption benchmark for Istio and Linkerd

Introduction

We at Shopify have started rolling out Istio as our service mesh. In principle, it suits us fine, except for one thing: it is expensive.

The published Istio benchmarks say:

With Istio 1.1, the proxy consumes approximately 0.6 vCPUs (virtual cores) per 1000 requests per second.

For the first region of our service mesh (2 proxies on each side of a connection), at one million requests per second that works out to 1200 cores for the proxies alone. Google's pricing calculator puts an n1-standard-64 at roughly $40/month per core, so this one region would cost us around $50,000 per month for 1 million requests per second.
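
As a sanity check, here is the back-of-the-envelope arithmetic behind that figure (using only the numbers quoted above, nothing from the Istio docs):

# 0.6 vCPU per 1000 RPS, at 1,000,000 RPS:
#   1,000,000 / 1000 * 0.6  = 600 vCPU per proxy
# two proxies per connection:
#   600 * 2                 = 1200 vCPU
# ~$40 per core per month (n1-standard-64):
#   1200 * $40              ≈ $48,000/month, i.e. on the order of $50K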

Ivan Sim did a great job comparing service mesh latencies last year and promised to do the same for memory and CPU, but couldn't:

Apparently, values-istio-test.yaml will seriously increase the CPU requests. If I counted correctly, about 24 CPU cores are needed for the control plane and 0.5 CPU for each proxy. I don't have that many. I will repeat the tests when more resources are allocated to me.

I wanted to see for myself how Istio's performance compares with another open source service mesh: Linkerd.

Service mesh installation

First of all, I installed SuperGloo in the cluster:

$ supergloo init
installing supergloo version 0.3.12
using chart uri https://storage.googleapis.com/supergloo-helm/charts/supergloo-0.3.12.tgz
configmap/sidecar-injection-resources created
serviceaccount/supergloo created
serviceaccount/discovery created
serviceaccount/mesh-discovery created
clusterrole.rbac.authorization.k8s.io/discovery created
clusterrole.rbac.authorization.k8s.io/mesh-discovery created
clusterrolebinding.rbac.authorization.k8s.io/supergloo-role-binding created
clusterrolebinding.rbac.authorization.k8s.io/discovery-role-binding created
clusterrolebinding.rbac.authorization.k8s.io/mesh-discovery-role-binding created
deployment.extensions/supergloo created
deployment.extensions/discovery created
deployment.extensions/mesh-discovery created
install successful!

I used SuperGloo because it makes bootstrapping a service mesh much easier: deploying each mesh took literally a couple of commands. We don't use SuperGloo in production, but it is perfect for a task like this. I used two separate clusters, one each for Istio and Linkerd.

The experiment was carried out on Google Kubernetes Engine. I used Kubernetes 1.12.7-gke.7 and an n1-standard-4 node pool with node autoscaling (minimum 4, maximum 16 nodes).
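
For reference, a comparable cluster can be spun up with something like the following (a sketch: the cluster name and zone are placeholders, and your gcloud defaults may differ):

$ gcloud container clusters create mesh-bench \
    --cluster-version 1.12.7-gke.7 \
    --machine-type n1-standard-4 \
    --num-nodes 4 \
    --enable-autoscaling --min-nodes 4 --max-nodes 16 \
    --zone us-central1-a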

Then I installed both service meshes from the command line.

Linkerd first:

$ supergloo install linkerd --name linkerd
+---------+--------------+---------+---------------------------+
| INSTALL |     TYPE     | STATUS  |          DETAILS          |
+---------+--------------+---------+---------------------------+
| linkerd | Linkerd Mesh | Pending | enabled: true             |
|         |              |         | version: stable-2.3.0     |
|         |              |         | namespace: linkerd        |
|         |              |         | mtls enabled: true        |
|         |              |         | auto inject enabled: true |
+---------+--------------+---------+---------------------------+

Then Istio:

$ supergloo install istio --name istio --installation-namespace istio-system --mtls=true --auto-inject=true
+---------+------------+---------+---------------------------+
| INSTALL |    TYPE    | STATUS  |          DETAILS          |
+---------+------------+---------+---------------------------+
| istio   | Istio Mesh | Pending | enabled: true             |
|         |            |         | version: 1.0.6            |
|         |            |         | namespace: istio-system   |
|         |            |         | mtls enabled: true        |
|         |            |         | auto inject enabled: true |
|         |            |         | grafana enabled: true     |
|         |            |         | prometheus enabled: true  |
|         |            |         | jaeger enabled: true      |
+---------+------------+---------+---------------------------+

After a few minutes of crash-looping, the control planes stabilized.
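
To watch that happen, it is enough to keep an eye on the pods in the two mesh namespaces until everything reports Running, for example:

$ kubectl get pods -n istio-system -w
$ kubectl get pods -n linkerd -w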

(Note: SuperGloo only supports Istio 1.0.x so far. I repeated the experiment with Istio 1.1.3 and didn't notice any significant difference.)

Configuring automatic Istio sidecar injection

To install the Envoy sidecar, Istio uses a sidecar injector, a MutatingAdmissionWebhook. We won't go into detail about it in this article. In short, it is a controller that watches all new pod admissions and dynamically adds the sidecar and an initContainer responsible for the iptables setup.

At Shopify we wrote our own admission controller to inject sidecars, but for this benchmark I used the controller that ships with Istio. By default, it injects sidecars when the namespace carries the istio-injection: enabled label:

$ kubectl label namespace irs-client-dev istio-injection=enabled
namespace/irs-client-dev labeled

$ kubectl label namespace irs-server-dev istio-injection=enabled
namespace/irs-server-dev labeled
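
A quick way to confirm that the webhook is doing its job is to list the containers of a freshly created pod in a labeled namespace; after injection you should see istio-proxy next to the application container (and istio-init among the initContainers):

$ kubectl get pods -n irs-server-dev -o jsonpath='{.items[0].spec.containers[*].name}'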

Configuring automatic Linkerd sidecar injection

To set up Linkerd sidecar injection, we use annotations (I added them manually via kubectl edit):

metadata:
  annotations:
    linkerd.io/inject: enabled

$ k edit ns irs-server-dev 
namespace/irs-server-dev edited

$ k get ns irs-server-dev -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    linkerd.io/inject: enabled
  name: irs-server-dev
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
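
The same annotation can be set without opening an editor; kubectl annotate does it in one line per namespace (I edited manually, but this is equivalent):

$ kubectl annotate namespace irs-client-dev linkerd.io/inject=enabled
$ kubectl annotate namespace irs-server-dev linkerd.io/inject=enabled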

Istio Failover Simulator

We built the Istio failover simulator to experiment with traffic that is unique to Shopify. We needed a tool to create arbitrary topologies representing specific parts of our service graph, dynamically configurable to model particular workloads.

Shopify's infrastructure comes under heavy load during flash sales, and Shopify encourages merchants to run such sales more often. Our larger customers sometimes warn us about a planned flash sale; others run them unexpectedly, at any time of day or night.

We wanted the failover simulator to model workflows matching the topologies and workloads that have overwhelmed Shopify's infrastructure in the past. The main reason for adopting a service mesh is that we need reliability and fault tolerance at the network level, so it matters to us that the mesh can cope with the kinds of load that have disrupted our services before.

At the heart of the failover simulator is a worker that acts as a node in the service mesh. A worker can be configured statically at startup or dynamically via a REST API. We use dynamic worker configuration to create workflows in the form of regression tests.

Here is an example of such a workflow:

  • We start 10 servers as a bar service that returns a 200/OK response after 100 ms.
  • We launch 10 clients, each sending 100 requests per second to bar.
  • Every 10 seconds we remove one server and monitor 5xx errors on the clients.

At the end of the workflow, we examine the logs and metrics and check whether the test passed. This is how we learn about the performance of our service mesh and run regression tests against our fault tolerance assumptions.
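
Since the simulator is not open source, I can only give a rough idea of what dynamically configuring a worker looks like; the endpoint and payload below are entirely made up for illustration:

$ curl -X POST http://worker:8080/config \
    -H 'Content-Type: application/json' \
    -d '{"role": "server", "name": "bar", "replicas": 10, "delay_ms": 100, "response_code": 200}'
# hypothetical API: the real interface is internal to Shopify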

(Note: We are considering open-sourcing the Istio failover simulator, but we are not ready to do so yet.)

The Istio failover simulator as a service mesh benchmark

We set up several simulator workers:

  • irs-client-loadgen: 3 replicas that send 100 requests per second to irs-client.
  • irs-client: 3 replicas that receive a request, wait 100 ms, and forward the request to irs-server.
  • irs-server: 3 replicas that return 200/OK after 100 ms.

With this configuration, we can measure a steady flow of traffic between 9 endpoints. The sidecars in irs-client-loadgen and irs-server each see 100 requests per second, while irs-client sees 200 (incoming plus outgoing).

We track resource usage through DataDog because we don't run a Prometheus cluster.
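
If you don't have DataDog (or Prometheus) at hand, metrics-server can give a rough per-container picture that is enough for a quick comparison:

$ kubectl top pods -n irs-client-dev --containers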

The results

Control plane

We first looked at CPU consumption.

Linkerd control plane: ~22 millicores

Istio control plane: ~750 millicores

The Istio control plane uses approximately 35 times more CPU resources than Linkerd. Admittedly, this is a default install, and most of that is istio-telemetry, which consumes a lot of CPU here (it can be disabled by turning off some features). Even with that component removed, we are still left with more than 100 millicores, which is 4 times more than Linkerd.
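
For completeness: when Istio is installed from its Helm chart rather than through SuperGloo, telemetry is usually switched off with values along these lines (the exact keys differ between 1.0.x and 1.1.x, so treat this as a sketch):

$ helm template install/kubernetes/helm/istio \
    --name istio --namespace istio-system \
    --set mixer.telemetry.enabled=false \
    --set mixer.policy.enabled=false \
  | kubectl apply -f -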

Sidecar proxies

We then checked sidecar proxy usage. It should scale linearly with the number of requests, but each sidecar carries some overhead that affects the curve.

Linkerd: ~100 millicores for irs-client, ~50 millicores for irs-client-loadgen

The results look logical, because the client proxy receives twice as much traffic as the loadgen proxy: for every outgoing request from loadgen, the client has one incoming and one outgoing request.

Istio/Envoy: ~155 millicores for irs-client, ~75 millicores for irs-client-loadgen

We see similar results for the Istio sidecars.

But overall, the Istio/Envoy proxies consume about 50% more CPU resources than Linkerd.

We see the same pattern on the server side:

Linkerd: ~50 millicores for irs-server

Istio/Envoy: ~80 millicores for irs-server

On the server side, the Istio/Envoy sidecar consumes about 60% more CPU resources than Linkerd.

Conclusion

Istio's Envoy proxy consumes 50+% more CPU than Linkerd on our simulated workload. The Linkerd control plane consumes far fewer resources than Istio, especially the core components.

We are still thinking about how to reduce these costs. If you have ideas, please share!

Source: habr.com
