CNI Performance Evaluation for Kubernetes over 10G Network (August 2020)

TL;DR: all CNIs perform as they should, with the exception of Kube-Router and Kube-OVN. Calico is the best of the bunch; its only flaw is the lack of automatic MTU detection.

This article updates my previous benchmarks (2018 and 2019); at the time of testing I am running Kubernetes 1.19 on Ubuntu 18.04 with CNI versions current as of August 2020.

Before diving into metrics…

What's new from April 2019?

  • You can test on your own cluster: run the tests on your own cluster using our tool, Kubernetes Network Benchmark (knb); see the usage sketch right after this list
  • New contenders: new CNIs have joined the comparison
  • New scenarios: the existing tests measure "Pod-to-Pod" network performance, and a new "Pod-to-Service" scenario has been added that is closer to real-world conditions; in practice, your API Pod consumes its database through a Service, not through the Pod's IP address (naturally, we check both TCP and UDP in both scenarios)
  • Resource Consumption: Each test now has its own resource comparison
  • Removal of application tests: we no longer run HTTP, FTP, and SCP tests, since our fruitful collaboration with the community and CNI maintainers revealed that the gap between iperf-over-TCP results and curl results was caused by a delay in CNI startup (the first few seconds after a Pod starts, which is not representative of real life)
  • Open source: all test sources (scripts, yml settings and raw data) are available here
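
For reference, a minimal knb run looks roughly like this (a sketch based on the knb README at the time of writing; the node names are placeholders, check knb --help for your version):

    # Run a full benchmark between two nodes of your cluster
    # (node1/node2 are hypothetical names from `kubectl get nodes`)
    ./knb --verbose --client-node node1 --server-node node2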

Reference test protocol

The protocol is detailed here; please note that this article covers Ubuntu 18.04 with the default kernel.

Selecting a CNI for Evaluation

This test aims to compare CNIs that can be configured with a single YAML file (so everything deployed by scripts, such as VPP and others, is excluded).
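
To illustrate what "a single yaml file" means in practice, here is the entire installation procedure for such a CNI (the Calico v3.16 manifest URL below is the one its documentation published at the time; verify it against the current docs):

    # Deploying the CNI: one manifest, one command
    kubectl apply -f https://docs.projectcalico.org/v3.16/manifests/calico.yaml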

Our selected CNIs for comparison:

  • Antrea v.0.9.1
  • Calico v3.16
  • Canal v3.16 (Flannel network + Calico Network Policies)
  • Cilium 1.8.2
  • Flannel 0.12.0
  • Kube-OVN 1.3.0
  • Kube-router latest (2020-08-25)
  • WeaveNet 2.7.0

MTU setting for CNI

First of all, we check the impact of automatic MTU detection on TCP performance:

[Chart: Impact of MTU on TCP performance]

An even bigger gap is found when using UDP:

[Chart: Impact of MTU on UDP performance]

Given the HUGE performance impact uncovered by these tests, we'd like to send a letter of hope to all CNI maintainers: please add automatic MTU detection to your CNI. You'll save kittens, unicorns, and even the cutest creature of all: a little DevOps engineer.

Nevertheless, if you have no choice but to use a CNI without automatic MTU detection, you can configure it manually to regain that performance. Note that this applies to Calico, Canal, and WeaveNet; a sketch for Calico follows.
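
As an example, here is a minimal sketch of a manual MTU setup for Calico v3.16, following its documented ConfigMap approach; the value is an assumption for a 9000-byte network (IPIP encapsulation takes 20 bytes, giving a veth MTU of 8980; VXLAN would need 8950 and WireGuard 8940):

    # Set the veth MTU used by Calico (jumbo-frame network, IPIP mode assumed)
    kubectl patch configmap/calico-config -n kube-system --type merge \
      -p '{"data":{"veth_mtu":"8980"}}'
    # Restart calico-node so that newly created Pods pick up the change
    kubectl rollout restart daemonset/calico-node -n kube-system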

[Image: My little request to the CNI maintainers...]

CNI Testing: Raw Data

In this section, we compare the CNIs with a correct MTU (either detected automatically or set manually). The main goal here is to present the raw data in the form of graphs.

Color legend:

  • gray - baseline (i.e. bare metal)
  • green - throughput above 9500 Mbps
  • yellow - throughput above 9000 Mbps
  • orange - throughput above 8000 Mbps
  • red - throughput below 8000 Mbps
  • blue - neutral (not related to throughput)

Resource consumption without load

First of all, let's check resource consumption while the cluster is "sleeping".

[Chart: Resource consumption without load]

Pod to Pod

This scenario assumes that the client Pod connects directly to the server Pod by its IP address.

[Diagram: Pod-to-Pod scenario]
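
If you would like to reproduce this scenario by hand rather than through knb, a minimal sketch might look like this (the networkstatic/iperf3 image and the 30-second duration are my assumptions, not part of the reference protocol):

    # Server Pod listening on the default iperf3 port (5201)
    kubectl run iperf-server --image=networkstatic/iperf3 -- iperf3 -s
    kubectl wait --for=condition=Ready pod/iperf-server
    # Client Pod connecting to the server directly by its Pod IP
    SERVER_IP=$(kubectl get pod iperf-server -o jsonpath='{.status.podIP}')
    kubectl run iperf-client --rm -it --image=networkstatic/iperf3 -- \
      iperf3 -c "$SERVER_IP" -t 30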

TCP

Pod-to-Pod TCP results and corresponding resource consumption:

[Charts: Pod-to-Pod TCP throughput and resource consumption]

UDP

Pod-to-Pod UDP results and corresponding resource consumption:

[Charts: Pod-to-Pod UDP throughput and resource consumption]

Pod-to-Service

This scenario matches real-world use cases: the client Pod connects to the server Pod through a ClusterIP Service.

[Diagram: Pod-to-Service scenario]
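
The same hand-rolled sketch adapted for this scenario: the only change from Pod-to-Pod is that the client now talks to a ClusterIP Service name instead of the server's Pod IP (the names continue the hypothetical iperf example above):

    # Put a ClusterIP Service in front of the iperf server
    kubectl expose pod iperf-server --name=iperf-server-svc --port=5201
    # The client resolves the Service name; kube-proxy (or the CNI) does the rest
    kubectl run iperf-client --rm -it --image=networkstatic/iperf3 -- \
      iperf3 -c iperf-server-svc -t 30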

TCP

Pod-to-Service TCP results and corresponding resource consumption:

[Charts: Pod-to-Service TCP throughput and resource consumption]

UDP

Pod-to-Service UDP results and corresponding resource consumption:

[Charts: Pod-to-Service UDP throughput and resource consumption]

Network policy support

Of all the CNIs listed above, the only one that does not support network policies is Flannel. All the others correctly implement them, both ingress and egress. Great job!
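
For instance, all of them honor a standard Kubernetes NetworkPolicy like this one (a minimal ingress-only sketch; the labels continue the hypothetical iperf example above):

    # Allow only the iperf client to reach the iperf server on TCP/5201
    kubectl apply -f - <<'EOF'
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-iperf-client-only
    spec:
      podSelector:
        matchLabels:
          run: iperf-server
      policyTypes:
        - Ingress
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  run: iperf-client
          ports:
            - protocol: TCP
              port: 5201
    EOF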

CNI Encryption

Among the tested CNIs, the following can encrypt network communication between Pods:

  • Antrea with IPsec
  • Calico with WireGuard
  • Cilium with IPsec
  • WeaveNet with IPsec

Throughput

Since there are fewer CNIs left, let's summarize all the scenarios in one graph:

[Chart: encrypted throughput, all scenarios combined]

Resource consumption

In this section, we evaluate the resources used when handling encrypted TCP and UDP Pod-to-Pod communication. There is no point in drawing a Pod-to-Service graph, as it provides no additional information.

[Charts: resource consumption for encrypted Pod-to-Pod TCP and UDP]

Putting it all together

Let's try to bring all the graphs together. A little subjectivity creeps in here, as we replace the actual values with words like "very fast", "low", and so on.

[Table: subjective summary of all results]

Conclusion and my findings

This is a bit subjective, as I am giving my own interpretation of the results.

I am glad that new CNIs have appeared. Antrea performed well: many features are implemented even in these early versions, including automatic MTU detection, encryption, and easy installation.

In terms of performance, all CNIs work well except Kube-OVN and Kube-Router. Kube-Router was also unable to detect the MTU, and I found no way to configure it anywhere in the documentation (there is an open issue on this topic).

In terms of resource consumption, Cilium still uses more RAM than the others, but its developers are clearly aiming at large clusters, which is hardly the same as a test on a three-node cluster. Kube-OVN also consumes a lot of CPU time and RAM, but it is a young CNI based on Open vSwitch (as is Antrea, which performs better and consumes less).

Everyone has network policies except Flannel, and it will very likely never support them, since its goal is dead simple: the lighter, the better.

Also, among other things, the encryption performance is a delight. Calico is one of the oldest CNIs, yet encryption was added only a couple of weeks ago. They chose WireGuard over IPsec, and, to put it simply, it works wonderfully, completely outclassing the other CNIs in this part of the test. Of course, resource consumption rises because of the encryption, but the throughput achieved is worth it (in the encrypted test, Calico showed a six-fold advantage over second-place Cilium). Moreover, you can enable WireGuard at any time after Calico is deployed in the cluster, and you can just as easily disable it, briefly or for good, if you wish (a sketch follows below). It's incredibly convenient. Just remember that Calico currently cannot detect the MTU automatically (the feature is planned for future releases), so don't forget to configure the MTU if your network supports jumbo frames (MTU 9000).
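
For the record, this is how toggling WireGuard looks with Calico v3.16, per its documentation (it requires calicoctl and WireGuard support in the node kernels):

    # Turn encryption on for the whole cluster
    calicoctl patch felixconfiguration default --type='merge' \
      -p '{"spec":{"wireguardEnabled":true}}'
    # ...and off again, at any time
    calicoctl patch felixconfiguration default --type='merge' \
      -p '{"spec":{"wireguardEnabled":false}}'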

Also note that Cilium can encrypt traffic between cluster nodes (and not just between Pods), which can be very important for publicly exposed cluster nodes.

As a conclusion, I suggest the following use cases:

  • You need a CNI for a very small cluster OR you don't need security: use Flannel, the lightest and most stable CNI (it is also one of the oldest; according to legend, it was invented by Homo Kubernautus or Homo Containerus). You may also be interested in the ingenious k3s project, check it out!
  • You need a CNI for a regular cluster: Calico is your choice, but don't forget to set the MTU if necessary. You can easily and naturally play with network policies, toggle encryption on and off, and so on.
  • You need a CNI for a (very) large-scale cluster: well, the test does not reflect the behavior of large clusters. I would be happy to run such tests, but we don't have hundreds of servers with 10 Gbps connectivity. So the best option is to run a modified test on your own nodes, at least with Calico and Cilium.

Source: habr.com
