I woke up one day to a dissatisfied email about long delays in Alvin, a service we were planning to launch soon. Specifically, the client was seeing 99th-percentile latency of around 50 ms, well above our latency budget. This was surprising, because I had tested the service extensively, especially for latency, a frequent subject of complaints.
Before giving Alvin a test run, I had run many experiments at 40k queries per second (QPS), all showing sub-10 ms latency. I was ready to reply that I disagreed with their results. But looking at the email again, I noticed something new: I hadn't tested exactly the conditions they mentioned — their QPS was much lower than mine. I had tested at 40k QPS, while they were running at only 1k. I ran another experiment, this time at the lower QPS, just to appease them.
Since I'm blogging about this, you've probably already figured out that their numbers turned out to be correct. I tested my virtual client again and again, always with the same result: a lower request rate not only raised latency, it increased the number of requests with latency above 10 ms. In other words, if at 40k QPS about 50 requests per second exceeded 50 ms, then at 1k QPS there were 100 requests above 50 ms every second. A paradox!
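To see why this is a paradox, it helps to express the two regimes as fractions of traffic. The numbers below are taken from the text above; the script is just illustrative arithmetic:

```python
# Illustrative arithmetic for the two regimes described above.
# At 40k QPS, ~50 requests/sec exceeded 50 ms; at 1k QPS, ~100 requests/sec did.

def slow_fraction(qps: int, slow_per_sec: int) -> float:
    """Fraction of all requests whose latency exceeds the 50 ms threshold."""
    return slow_per_sec / qps

high_load = slow_fraction(40_000, 50)   # 0.125% of requests are slow
low_load = slow_fraction(1_000, 100)    # 10% of requests are slow

# Cutting the load 40x made an individual request 80x more likely to be slow.
print(f"high load: {high_load:.3%}, low load: {low_load:.0%}")
print(f"relative increase: {low_load / high_load:.0f}x")
```

So at the lower rate one request in ten blew the budget, versus roughly one in eight hundred under heavy load — the opposite of what queueing intuition suggests.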
Narrowing the search
When you face a latency problem in a distributed system with many components, the first step is to draw up a short list of suspects. Let's dig a little deeper into Alvin's architecture:
A good starting point is the list of I/O transitions performed (network calls, disk lookups, and so on); let's try to figure out where the delay is. Besides the obvious I/O with the client, Alvin takes one extra step: it queries the data store. However, that store runs in the same cluster as Alvin, so its latency should be lower than the latency to the client. So here's the list of suspects:
- Network call from client to Alvin.
- Network call from Alvin to the data store.
- Search on disk in the data store.
- Network call from the data store back to Alvin.
- Network call from Alvin to the client.
Let's try to cross some of them off.
Nothing to do with data storage
The first thing I did was convert Alvin into a ping-pong server that doesn't process requests at all: upon receiving a request, it returns an empty response. If the latency decreases, then the bug is in the implementation of Alvin or the data store — nothing unheard of. The first experiment produced the following graph:
As you can see, the ping-pong server brings no improvement. This means the data store is not adding the latency, and the list of suspects is halved:
- Network call from client to Alvin.
- Network call from Alvin to the client.
Great! The list is shrinking fast. I thought I had almost figured out the cause.
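The ping-pong experiment is easy to reproduce in miniature. The sketch below is not Alvin's real code, just a minimal TCP server that answers every request with a fixed tiny reply, so that any latency you still measure cannot come from request handling or the data store:

```python
# A minimal sketch (not Alvin's real code) of the ping-pong experiment:
# the server reads a request and immediately returns a fixed tiny reply,
# doing no real work, so remaining latency must come from transport.
import socket
import threading

def ping_pong_server(srv: socket.socket) -> None:
    """Accept one connection and answer every request with a canned reply."""
    conn, _ = srv.accept()
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(b"pong")  # no request processing at all

def run_once() -> bytes:
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))     # OS picks a free port
    srv.listen(1)
    port = srv.getsockname()[1]
    threading.Thread(target=ping_pong_server, args=(srv,), daemon=True).start()
    with socket.create_connection(("127.0.0.1", port)) as cli:
        cli.sendall(b"ping")
        return cli.recv(1024)

print(run_once())  # b'pong'
```

Timing many such round trips at different request rates is exactly the shape of the experiment described above.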
gRPC
Now it's time to introduce a new player: gRPC, Google's open-source RPC library. Although gRPC is well optimized and widely used, this was my first time using it on a system of this scale, and I expected my integration to be suboptimal, to put it mildly.
Having gRPC on the stack raised a new question: maybe my implementation, or gRPC itself, was causing the latency problem? Let's add new suspects to the list:
- The client calls the gRPC library.
- The gRPC library on the client makes a network call to the gRPC library on the server.
- The gRPC library on the server calls Alvin (a no-op in the case of the ping-pong server).
To give you an idea of what the code looks like, my client/Alvin implementation doesn't differ much from the standard client/server examples.
Note: the list above is a bit oversimplified, because gRPC makes it possible to use your own (template?) threading model, in which the execution stacks of gRPC and the user implementation are interleaved. For the sake of simplicity, we will stick with this model.
Profiling fixes everything.
Having crossed out the data store, I thought I was almost done: "Now it's easy! Apply a profiler and find out where the delay occurs."
I took four profiles: under high QPS (low latency) and with the ping-pong server under low QPS (high latency), both on the client side and on the server side. Just in case, I also took a sampling CPU profile. When comparing profiles, I usually look for an anomalous call stack — for example, the bad side with high latency often shows many more context switches (10x or more). But in my case, the number of context switches was almost the same. To my dismay, there was nothing significant.
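The comparison itself is mechanical. As a toy illustration (not a real profiler), given two profiles as maps from call stack to sample count, you can flag stacks whose counts grew by a large factor in the "bad" run:

```python
# Toy illustration of comparing two profiles: flag call stacks whose
# sample count grew by at least `factor` between the good and bad runs.

def anomalous_stacks(good: dict, bad: dict, factor: float = 10.0) -> list:
    """Return stacks sampled at least `factor` times more often in `bad`."""
    suspects = []
    for stack, bad_count in bad.items():
        good_count = good.get(stack, 0)
        if bad_count >= factor * max(good_count, 1):
            suspects.append(stack)
    return sorted(suspects)

good = {"recv->parse": 100, "context_switch": 40}
bad = {"recv->parse": 110, "context_switch": 42, "lock_wait": 500}
print(anomalous_stacks(good, bad))  # ['lock_wait']
```

In this story, nothing crossed any such threshold — the two sides looked almost identical, which is precisely why profiling was a dead end.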
Additional Debugging
I was in despair. I didn't know what other tools to use, and my next plan was essentially to repeat the experiments with different variations rather than to diagnose the problem directly.
What if
From the very beginning, the specific 50 ms delay bothered me. That is a lot of time. I decided I would cut pieces out of the code until I could pinpoint exactly which part was producing this delay. Then came an experiment that worked.
As usual, in hindsight it all seems obvious. I put the client on the same machine as Alvin and sent requests to localhost — and the latency increase was gone!
Something was wrong with the network.
Learning the skills of a network engineer
I must admit that my knowledge of networking is terrible, which is embarrassing given that I work with networks daily. But the network was now the prime suspect, and I needed to learn how to debug it.
Fortunately, the Internet loves those who want to learn. The combination of ping and tracert seemed like a good enough start to debug network transport issues.
First, I ran ping. Then I tried tracert. Neither showed anything unusual: the cause of the delay was not my code, not the gRPC implementation, and not the network. I was beginning to worry that I would never figure it out.
Wait, what OS are we on?
gRPC is widely used on Linux, but on Windows it is exotic. I decided to try an experiment, and it worked: I created a Linux virtual machine, compiled Alvin for Linux, and deployed it.
Here's what happened: the Linux ping-pong server did not show the latency of the comparable Windows host, even though the data source was the same. It turned out the problem was in the gRPC implementation for Windows.
Nagle's algorithm
All this time I thought I was missing a gRPC flag. Now I understood what was really going on: it was gRPC's Windows implementation that was missing a flag. I found an internal RPC library that I was confident worked well, with all its socket flags set, and added those flags to gRPC. Almost done: I started removing the added flags one at a time until the regression came back, so I could pinpoint the cause. It was the infamous TCP_NODELAY. In gRPC this flag was set in the Linux implementation for TCP sockets, but not in the Windows one. I fixed this.
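At the socket level, the fix is a single option. The sketch below shows the portable view of it (gRPC applies the equivalent option inside its own endpoint code, not via Python, so treat this as an illustration):

```python
# Disabling Nagle's algorithm on a TCP socket via TCP_NODELAY.
# This is the portable socket-level view of the fix; gRPC sets the
# equivalent option inside its own transport code.
import socket

def make_low_latency_socket() -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # With Nagle enabled (the default), a small write may sit in the kernel
    # buffer waiting for an ACK of earlier data, adding tens of milliseconds
    # per request at low request rates.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

s = make_low_latency_socket()
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))  # nonzero when set
s.close()
```

This also explains the shape of the symptom: under heavy load the buffer fills and flushes constantly, so Nagle rarely delays anything; at low QPS each small request waits alone for an ACK.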
Conclusion
The large delay at low QPS was caused by an OS optimization. In hindsight, profiling did not detect the latency because it was happening in kernel mode, not in user mode.
As for the localhost experiment: it probably never exercised the actual network code, so Nagle's algorithm never kicked in, which is why the latency issues disappeared when the client reached Alvin via localhost.
The next time you see an increase in latency as requests per second decrease, Nagle's algorithm should be on your list of suspects!
Source: habr.com