Sometimes more is less. When a decrease in load leads to an increase in latency

As in most of these posts, it all starts with a problem in a distributed service; let's call this service Alvin. This time I didn't find the problem myself: I was alerted by the engineers on the client side.

One day I woke up to an unhappy email about long delays in Alvin, which we were planning to launch soon. Specifically, the client was seeing 99th-percentile latency around 50 ms, well above our latency budget. This was surprising, since I had tested the service extensively, with a particular focus on latency, because latency is a frequent subject of complaints.

Before handing Alvin over for testing, I had run many experiments at 40k queries per second (QPS), all showing latency below 10 ms. I was ready to reply that I disagreed with their results. But reading the email again, I noticed something new: I hadn't tested exactly the conditions they described. Their QPS was much lower than mine: I had tested at 40k QPS, while they were running at only 1k. I ran another experiment, this time at the lower QPS, mostly to appease them.

Since I'm blogging about this, you have probably already guessed that their numbers turned out to be correct. I tested my test client again and again, always with the same result: a lower request rate not only raised latency, it increased the number of slow requests. In other words, if at 40k QPS about 50 requests per second exceeded 50 ms, then at 1k QPS there were 100 requests per second above 50 ms. Paradox!


Narrowing the search

When faced with a latency problem in a distributed system with many components, the first step is to make a short list of suspects. Let's dig a little deeper into Alvin's architecture:

[Figure: Alvin's architecture]

A good starting point is the list of I/O hops involved (network calls, disk lookups, and so on). Let's try to figure out where the delay comes from. Besides the obvious I/O with the client, Alvin takes one extra step: it queries a data store. That store runs in the same cluster as Alvin, though, so its latency should be lower than the latency to the client. So here's the list of suspects:

  1. Network call from client to Alvin.
  2. Network call from Alvin to the data store.
  3. Search on disk in the data store.
  4. Network call from the data store to Alvin.
  5. Network call from Alvin to the client.

Let's try to cross a few of these off the list.

Nothing to do with the data store

The first thing I did was turn Alvin into a ping-pong server that doesn't process requests at all: upon receiving a request, it immediately returns an empty response. If latency drops, then a bug in the implementation of Alvin or the data store would be nothing unheard of. The first experiment produced the following graph:

[Figure: latency graph for the ping-pong experiment]

As you can see, the ping-pong server brings no improvement. That means the data store is not the source of the latency, and the list of suspects is cut in half:

  1. Network call from client to Alvin.
  2. Network call from Alvin to the client.

Great! The list was shrinking fast. I thought I had almost figured out the cause.

gRPC

Now is the time to introduce a new player: gRPC, an open-source library from Google for inter-process RPC communication. Although gRPC is well optimized and widely used, this was my first time using it on a system of this scale, and I expected my integration to be suboptimal, to say the least.

Having gRPC in the stack raised a new question: maybe my integration, or gRPC itself, was causing the latency problem? Time to add new suspects to the list:

  1. The client calls the gRPC library
  2. The gRPC library on the client makes a network call to the gRPC library on the server
  3. The gRPC library calls Alvin (a no-op in the case of the ping-pong server)

To give you an idea of what the code looks like, my client/Alvin implementation doesn't differ much from the client/server async examples.
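
As a rough illustration only, the sketch below shows the general shape of such a gRPC C++ service and client. The EchoService/Ping names and ping.grpc.pb.h are invented for the example, not Alvin's real interface, and it uses the simpler synchronous API rather than the async one mentioned above.

    // Illustrative only: a minimal gRPC "ping-pong" service and client in C++.
    // EchoService, PingRequest/PingReply and ping.grpc.pb.h are assumed to come
    // from a hypothetical ping.proto; they are not Alvin's actual interface.
    #include <memory>
    #include <string>

    #include <grpcpp/grpcpp.h>
    #include "ping.grpc.pb.h"  // hypothetical generated header

    // Server side: the "ping-pong" Alvin, which returns an empty reply
    // immediately, with no data-store access and no other work.
    class EchoServiceImpl final : public EchoService::Service {
      grpc::Status Ping(grpc::ServerContext* /*ctx*/, const PingRequest* /*req*/,
                        PingReply* /*reply*/) override {
        return grpc::Status::OK;
      }
    };

    void RunServer(const std::string& address) {
      EchoServiceImpl service;
      grpc::ServerBuilder builder;
      builder.AddListeningPort(address, grpc::InsecureServerCredentials());
      builder.RegisterService(&service);
      std::unique_ptr<grpc::Server> server = builder.BuildAndStart();
      server->Wait();  // serve until shut down
    }

    // Client side: open a channel and issue a single Ping call.
    bool PingOnce(const std::string& target) {
      auto channel =
          grpc::CreateChannel(target, grpc::InsecureChannelCredentials());
      std::unique_ptr<EchoService::Stub> stub = EchoService::NewStub(channel);
      grpc::ClientContext ctx;
      PingRequest request;
      PingReply reply;
      grpc::Status status = stub->Ping(&ctx, request, &reply);
      return status.ok();
    }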

Note: the list above is slightly simplified, since gRPC lets you plug in your own threading model, so the execution stack ends up interleaving gRPC code and the user implementation. For the sake of simplicity, we'll stick with the list as described.

Profiling fixes everything

With the data store crossed out, I thought I was almost done: "Now it's easy! Let's take a profile and find out where the delay comes from." I'm a big fan of precision profiling, because CPUs are very fast and are most often not the bottleneck. Most delays occur when the processor must stop processing in order to do something else. Precise CPU profiling is built exactly for this: it accurately records all context switches and therefore makes it clear where the delays occur.

I took four profiles of the ping-pong setup: at high QPS (low latency) and at low QPS (high latency), on both the client side and the server side. And just in case, I also took a sampling CPU profile. When comparing profiles, I usually look for anomalous call stacks. For example, the bad side with high latency often shows many more context switches (ten times as many or more). But in my case the number of context switches was almost the same. To my dismay, there was nothing significant there.

Additional debugging

I was in despair. I didn't know what other tools to use, and my next plan was essentially to repeat the experiments with different variations rather than to diagnose the problem precisely.

What if

From the very beginning, the specific 50 ms delay had bothered me. That is a huge amount of time. I decided I would keep cutting pieces out of the code until I could figure out exactly which part was causing the error. Then came the experiment that worked.

As usual, in hindsight it all seems obvious. I put the client on the same machine as Alvin and sent requests to localhost. And the latency increase was gone!

[Figure: latency with the client talking to Alvin over localhost]

Something was wrong with the network.

Learning the skills of a network engineer

I must admit that my knowledge of networking is terrible, which is especially awkward given that I work with networks every day. But the network was now the prime suspect, and I needed to learn how to debug it.

Fortunately, the Internet loves those who want to learn. The combination of ping and tracert seemed like a good enough starting point for debugging network transport problems.

First, I ran PsPing against Alvin's TCP port. I used the default options, nothing special. Out of more than a thousand pings, not one exceeded 10 ms, except for the very first one, which was just warm-up. This contradicts the observed 50 ms latency at the 99th percentile: there, for every 100 requests we should have seen about one request with a 50 ms delay.

Then I tried tracert: maybe the problem was at one of the hops along the route between Alvin and the client. But the trace also came back empty-handed.

So the delay was not caused by my code, not by the gRPC implementation, and not by the network. I was starting to worry that I would never figure it out.

So, what OS are we on?

gRPC is widely used on Linux, but on Windows it is exotic. I decided to try an experiment, and it worked: I created a Linux virtual machine, compiled Alvin for Linux, and deployed it.

[Figure: latency of the ping-pong server running on Linux]

And here's what happened: the Linux ping-pong server did not show the latency seen on an equivalent Windows host, even though the data source was the same. So the problem was in the gRPC implementation for Windows.

Nagle's algorithm

All this time I had thought I was simply missing some gRPC flag. Now I realized what it actually was: the gRPC Windows implementation itself was missing a flag. I found an internal RPC library that I was confident worked well, and looked at the Winsock flags it set. Then I added all of those flags to gRPC, deployed Alvin on Windows, and got a fixed Windows ping-pong server!

[Figure: latency of the Windows ping-pong server with the added flags]

Almost done: I started removing the added flags one at a time until the regression returned, so I could pinpoint the cause. It was the infamous TCP_NODELAY, the switch for Nagle's algorithm.

Nagle's algorithm tries to reduce the number of packets sent over the network by delaying transmission of messages until the amount of data to send exceeds a certain number of bytes. While this may be nice for the average user, it's devastating for real-time servers, because the OS will hold back some messages, introducing delays at low QPS. (Roughly speaking, while previously sent data is still unacknowledged, small messages are buffered instead of being sent immediately; at high QPS writes coalesce into full packets, so the delay is barely noticeable.) In gRPC, this flag was set for TCP sockets in the Linux implementation, but not in the Windows one. I fixed that.
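
For reference, disabling Nagle's algorithm boils down to a single socket option. The sketch below is only an illustration of that flag on a raw Winsock socket, not the actual patch inside gRPC's Windows socket code:

    // Illustration only: disable Nagle's algorithm on an existing Winsock TCP
    // socket. gRPC sets this internally on the sockets it owns.
    #include <winsock2.h>
    #include <ws2tcpip.h>
    #pragma comment(lib, "ws2_32.lib")

    // With TCP_NODELAY set, small writes are sent immediately instead of being
    // held back until earlier data is acknowledged or a full segment builds up.
    bool DisableNagle(SOCKET sock) {
      BOOL no_delay = TRUE;
      int rc = setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
                          reinterpret_cast<const char*>(&no_delay),
                          sizeof(no_delay));
      return rc != SOCKET_ERROR;
    }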

Conclusion

The large delay at low QPS was caused by an OS optimization. In hindsight, profiling did not detect the latency because it happened in kernel mode rather than in user mode. I don't know whether Nagle's algorithm can be observed through ETW captures, but it would be interesting to find out.

As for the localhost experiment: it probably never touched the real networking code, so Nagle's algorithm never kicked in, which is why the latency problems disappeared when the client reached Alvin via localhost.

The next time you see an increase in latency as requests per second decrease, Nagle's algorithm should be on your list of suspects!

Source: habr.com
