“Kubernetes increased latency by 10 times”: who is to blame?

Translator's note: This article by Galo Navarro, a Principal Software Engineer at the European company Adevinta, is a fascinating and instructive "investigation" into infrastructure operations. Its original title has been slightly expanded in translation, for the reason the author explains at the very beginning.


Note from the author: this post attracted far more attention than expected. I still get angry comments that the title is misleading and that some readers are disappointed. I understand why, so at the risk of spoiling the whole intrigue, I will say right away what this article is about. As teams migrate to Kubernetes, I keep observing a curious pattern: whenever a problem arises (for example, latency increasing after a migration), Kubernetes gets blamed first, and then it turns out that the orchestrator is not really at fault. This article describes one such case. Its title repeats the exclamation of one of our developers (you will see later that Kubernetes has nothing to do with it). You will not find unexpected revelations about Kubernetes here, but you can expect a couple of good lessons about complex systems.

A couple of weeks ago, my team was migrating a single microservice to a core platform that includes CI/CD, a Kubernetes-based runtime, metrics, and other goodies. The move was a trial run: we planned to use it as a basis and migrate about 150 more services in the coming months. All of them power some of the largest online sites in Spain (Infojobs, Fotocasa, etc.).

After we deployed the application in Kubernetes and redirected some of the traffic to it, an alarming surprise awaited us: request latency in Kubernetes was 10 times higher than in EC2. We either had to find a solution to this problem or abandon the migration of the microservice (and, possibly, the entire project).

Why is latency so much higher in Kubernetes than in EC2?

To find the bottleneck, we collected metrics along the entire request path. Our architecture is simple: an API gateway (Zuul) proxies requests to microservice instances in EC2 or Kubernetes. In Kubernetes, we use the NGINX Ingress Controller, and the backends are ordinary Deployment objects running a JVM application built on the Spring platform.

                                  EC2
                            +---------------+
                            |  +---------+  |
                            |  |         |  |
                       +-------> BACKEND |  |
                       |    |  |         |  |
                       |    |  +---------+  |                   
                       |    +---------------+
             +------+  |
Public       |      |  |
      -------> ZUUL +--+
traffic      |      |  |              Kubernetes
             +------+  |    +-----------------------------+
                       |    |  +-------+      +---------+ |
                       |    |  |       |  xx  |         | |
                       +-------> NGINX +------> BACKEND | |
                            |  |       |  xx  |         | |
                            |  +-------+      +---------+ |
                            +-----------------------------+

It seemed that the problem was related to a delay at the initial stage of talking to the backend (I marked the problem area on the diagram with "xx"). In EC2, the application responded in about 20 ms. In Kubernetes, the latency grew to 100-200 ms.

We quickly ruled out the likely suspects related to the change of runtime environment. The JVM version remained the same. There were no containerization problems either: the application already ran successfully in containers on EC2. Load? We observed high latency even at 1 request per second. Garbage collection pauses could also be neglected.

One of our Kubernetes admins asked whether the application had any external dependencies, since DNS resolution had caused similar problems in the past.

Hypothesis 1: DNS Name Resolution

With each request, our application makes one to three calls to an AWS-hosted Elasticsearch instance at a domain like elastic.spain.adevinta.com. A shell is available inside our containers, so we could check whether the domain lookup really takes that long.

DNS queries from container:

[root@be-851c76f696-alf8z /]# while true; do dig "elastic.spain.adevinta.com" | grep time; sleep 2; done
;; Query time: 22 msec
;; Query time: 22 msec
;; Query time: 29 msec
;; Query time: 21 msec
;; Query time: 28 msec
;; Query time: 43 msec
;; Query time: 39 msec

Similar requests from one of the EC2 instances where the application is running:

bash-4.4# while true; do dig "elastic.spain.adevinta.com" | grep time; sleep 2; done
;; Query time: 77 msec
;; Query time: 0 msec
;; Query time: 0 msec
;; Query time: 0 msec
;; Query time: 0 msec

Considering that the lookup takes about 30ms, it became clear that DNS resolution when accessing Elasticsearch really contributes to the increase in latency.

However, this was strange for two reasons:

  1. We already run plenty of applications in Kubernetes that talk to AWS resources, and they do not suffer from high latency. Whatever the cause, it applies specifically to this case.
  2. We know that the JVM does in-memory DNS caching. In our images, the TTL is set in $JAVA_HOME/jre/lib/security/java.security to 10 seconds: networkaddress.cache.ttl = 10. In other words, the JVM should cache all DNS lookups for 10 seconds (see the sketch right after this list).
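
By the way, this JVM setting can be checked and overridden programmatically. Below is a minimal sketch: the security property name is the standard JDK one, while the class name is made up purely for illustration.

import java.security.Security;

public class DnsCacheTtlCheck {
    public static void main(String[] args) {
        // TTL (in seconds) for successful lookups, read from
        // $JAVA_HOME/jre/lib/security/java.security unless overridden here.
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));

        // Setting the property before the first lookup has the same effect
        // as editing java.security in the image.
        Security.setProperty("networkaddress.cache.ttl", "10");
    }
}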

To test the first hypothesis, we decided to eliminate DNS lookups for a while and see whether the problem went away. Our first idea was to reconfigure the application to talk to Elasticsearch directly by IP address rather than by domain name. That would require a code change and a new deployment, so we simply mapped the domain to its IP address in /etc/hosts:

34.55.5.111 elastic.spain.adevinta.com

Now the container resolved the name almost instantly. This brought some improvement, but we were only slightly closer to the expected latency level. Although DNS resolution was indeed slow, the real cause still eluded us.

Network Diagnostics

We decided to analyze the traffic from the container with tcpdump to see exactly what was happening on the network:

[root@be-851c76f696-alf8z /]# tcpdump -leni any -w capture.pcap

Then we sent several requests and downloaded the capture (kubectl cp my-service:/capture.pcap capture.pcap) for further analysis in Wireshark.

There was nothing suspicious in the DNS queries (except for one small detail, which I will come back to later). But there were certain oddities in how our service handled each request. Below is a screenshot of the capture, from the acceptance of a request to the start of the response:

[Screenshot of the Wireshark capture: the TCP streams of a single request, highlighted by color]

Packet numbers are shown in the first column. For clarity, I have highlighted the different TCP streams.

The green stream, starting at packet 328, shows how the client (172.17.22.150) established a TCP connection with the container (172.17.36.147). After the initial handshake (328-330), packet 331 carried HTTP GET /v1/.., an incoming request to our service. The whole process took 1 ms.

The gray stream (from packet 339) shows that our service sent an HTTP request to the Elasticsearch instance (there is no TCP handshake because an existing connection is being used). It took 18ms.

So far, everything is fine, and the times roughly correspond to the expected delays (20-30 ms when measured from the client).

However, the blue section takes 86 ms. What is going on there? With packet 333, our service sent an HTTP GET request to /latest/meta-data/iam/security-credentials, and immediately after it, over the same TCP connection, another GET request to /latest/meta-data/iam/security-credentials/arn:...

We found that this pattern repeats for every request across the entire trace. DNS resolution is indeed a little slower in our containers (the explanation for this phenomenon is quite interesting, but I will save it for a separate article). But the real cause of the large delays turned out to be the calls to the AWS Instance Metadata service on every request.

Hypothesis 2: Extra Calls to AWS

Both endpoints belong to the AWS Instance Metadata API. Our microservice uses this service while working with Elasticsearch. Both calls are part of the basic authorization process. The endpoint hit by the first request returns the IAM role associated with the instance:

/ # curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
arn:aws:iam::<account_id>:role/some_role

The second request then fetches temporary credentials for that role:

/ # curl http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::<account_id>:role/some_role
{
    "Code" : "Success",
    "LastUpdated" : "2012-04-26T16:39:16Z",
    "Type" : "AWS-HMAC",
    "AccessKeyId" : "ASIAIOSFODNN7EXAMPLE",
    "SecretAccessKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "Token" : "token",
    "Expiration" : "2017-05-17T15:09:54Z"
}

The client can use them for a short period of time and must periodically obtain new credentials (before their Expiration). The model is simple: AWS rotates temporary keys frequently for security reasons, but clients can cache them for a few minutes to offset the performance cost of obtaining new credentials.

The AWS Java SDK is supposed to take care of this process, but for some reason it was not happening.
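
For context, here is roughly how an application picks up instance credentials through the SDK. This is a minimal sketch assuming the AWS SDK for Java v1: InstanceProfileCredentialsProvider is that SDK's own class, which calls the two metadata endpoints shown above and is expected to cache the result between calls; the wrapper class name here is just for illustration.

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.InstanceProfileCredentialsProvider;

public class MetadataCredentialsSketch {
    public static void main(String[] args) {
        // The provider first asks the metadata service for the role name,
        // then fetches temporary credentials for that role.
        InstanceProfileCredentialsProvider provider =
                InstanceProfileCredentialsProvider.getInstance();

        // Subsequent calls should return cached credentials until the SDK
        // decides, based on its thresholds, that a refresh is needed.
        AWSCredentials credentials = provider.getCredentials();
        System.out.println("Access key: " + credentials.getAWSAccessKeyId());
    }
}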

After searching through issues on GitHub, we came across issue #1921, which helped us determine the direction in which to "dig" further.

The AWS SDK refreshes credentials when one of the following conditions is met:

  • The expiration time (Expiration) falls within EXPIRATION_THRESHOLD, hardcoded to 15 minutes.
  • More time has passed since the last refresh attempt than REFRESH_THRESHOLD, hardcoded to 60 minutes.

To see the actual expiration time of the credentials we receive, we ran the cURL commands above from the container and from an EC2 instance. The validity period of the credentials received from the container turned out to be much shorter: exactly 15 minutes.

Now everything became clear: on the first request, our service received temporary credentials. Since they were valid for no more than 15 minutes, on the next request the AWS SDK decided to refresh them. And this happened with every request.

Why is the credential lifetime shorter?

The AWS Instance Metadata service is designed for EC2 instances, not Kubernetes. At the same time, we did not want to change the application's interface. For this we use KIAM, a tool that, using an agent on each Kubernetes node, allows users (engineers deploying applications to the cluster) to assign IAM roles to containers in pods as if they were EC2 instances. KIAM intercepts calls to the AWS Instance Metadata service and serves them from its own cache, having fetched the credentials from AWS beforehand. From the application's point of view, nothing changes.

KIAM issues short-lived credentials to pods. This makes sense, given that the average lifespan of a pod is shorter than that of an EC2 instance. The default credential lifetime is 15 minutes.

As a result, if you layer both default values on top of each other, a problem arises: every credential issued to an application expires after 15 minutes, while the AWS Java SDK forcibly refreshes any credential that has less than 15 minutes left before its expiration.
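
To make the interaction of the two defaults concrete, here is an illustrative sketch: the threshold value is the SDK constant mentioned above, but the check itself is a simplification of the SDK's logic, not its actual code, and the class and method names are made up.

import java.time.Duration;
import java.time.Instant;

public class RefreshDecisionSketch {

    // Hardcoded in the AWS Java SDK, as described above.
    static final Duration EXPIRATION_THRESHOLD = Duration.ofMinutes(15);

    // Simplified check: refresh when the credentials expire within the threshold.
    static boolean shouldRefresh(Instant now, Instant expiration) {
        return Duration.between(now, expiration).compareTo(EXPIRATION_THRESHOLD) <= 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();

        // KIAM default: credentials issued for 15 minutes.
        boolean kiam = shouldRefresh(now, now.plus(Duration.ofMinutes(15)));
        // Plain EC2 instance profile: credentials typically live for hours.
        boolean ec2 = shouldRefresh(now, now.plus(Duration.ofHours(6)));

        // With the KIAM default the condition is already true at issue time,
        // so every request triggers a refresh; on EC2 it is not.
        System.out.println("KIAM (15 min lifetime): refresh? " + kiam); // true
        System.out.println("EC2 (hours):            refresh? " + ec2);  // false
    }
}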

Consequently, the temporary credentials are forcibly refreshed on every request, which adds a couple of calls to the AWS API and causes a significant increase in latency. In the AWS Java SDK we found a feature request that mentions a similar issue.

The solution turned out to be simple: we reconfigured KIAM to issue credentials with a longer lifetime. As soon as this happened, requests started flowing without involving the AWS Metadata service, and latency dropped to levels even lower than in EC2.

Conclusions

Based on our experience with migrations, one of the most common sources of problems is not errors in Kubernetes or other elements of the platform. It also doesn't involve any fundamental flaws in the microservices we're migrating. Problems often arise simply because we put different elements together.

We mix complex systems that have never interacted before with the expectation that together they form a single, larger system. Alas, the more elements, the more room for errors, the higher the entropy.

In our case, the high latency was not the result of bugs or bad decisions in Kubernetes, KIAM, the AWS Java SDK, or our microservice. It was the result of combining two independent default settings: one in KIAM and one in the AWS Java SDK. Taken separately, both defaults make sense: the eager credential refresh policy in the AWS Java SDK and the short credential lifetime in KIAM. But put together, the results become unpredictable. Two independent and reasonable decisions do not have to make sense when combined.

P.S. from the translator

To learn more about the architecture of the KIAM utility for integrating AWS IAM with Kubernetes, see this article from its creators.


Source: habr.com
