QUIC protocol in action: how Uber implemented it to optimize performance

The QUIC protocol is fascinating to watch, which is why we love writing about it. But while previous publications on QUIC were mostly historical (local lore, if you like) and theoretical in nature, today we are pleased to publish a translation of a different kind: the practical application of the protocol in 2019. And this is not about a small outfit running out of a proverbial garage, but about Uber, which operates almost all over the world. How the company's engineers decided to use QUIC in production, how they ran their tests, and what they saw after rolling it out to production - below the cut.



Uber operates at global scale: 600 cities, and in every one of them the app relies entirely on wireless Internet from more than 4,500 mobile carriers. Users expect the app to be not just fast but effectively real-time; to deliver this, the Uber app needs low latency and a very reliable connection. Unfortunately, the HTTP/2 stack does not perform well in dynamic, lossy wireless networks, and we found that in these cases poor performance is directly related to the TCP implementations in operating-system kernels.

To solve the problem, we adopted QUIC, a modern multiplexed transport protocol that gives us more control over transport-layer performance. The IETF working group is currently standardizing QUIC as HTTP/3.

After extensive testing, we concluded that introducing QUIC into our application would reduce tail latencies compared to TCP. We saw reductions in the 10-30% range for HTTPS traffic in the Driver and Passenger apps. QUIC also gave us end-to-end control over user packets.

In this article, we share our experience of optimizing TCP for Uber applications using a stack that supports QUIC.

State of the Art: TCP

Today, TCP is the most widely used transport protocol for delivering HTTPS traffic on the Internet. TCP provides a reliable byte stream, coping with network congestion and link-layer losses along the way. The widespread use of TCP for HTTPS traffic is due to its ubiquity (almost every OS includes TCP), its availability on most infrastructure (for example, load balancers, HTTPS proxies, and CDNs), and the out-of-the-box functionality it provides on almost all platforms and networks.

Most users use our application on the move, and TCP's tail latencies were nowhere near the requirements of our real-time HTTPS traffic. Simply put, users all over the world experienced this; Figure 1 shows latencies in major cities:

Figure 1. Tail latencies vary across major cities where Uber operates.

Although latencies in the Indian and Brazilian networks were higher than in the US and UK, tail latencies are significantly higher than median latencies everywhere, and this holds even for the US and UK.

TCP performance over the air

TCP was designed for wired networks, that is, with an emphasis on highly predictable links. Wireless networks, however, have their own characteristics and challenges. First, wireless networks are prone to loss from interference and signal attenuation. For example, Wi-Fi networks are sensitive to microwave ovens, Bluetooth, and other radio sources, while cellular networks suffer from signal loss (path loss) due to reflection and absorption of the signal by objects and buildings, as well as from interference from neighboring cell towers. This leads to significantly higher (4-10x) and more variable round-trip times (RTT) and packet loss compared to a wired connection.

To cope with bandwidth fluctuations and loss, cellular networks typically use large buffers for traffic bursts. This can lead to excessive queuing, which means more latency. Very often TCP treats this queuing as loss because of an inflated timeout, so it tends to retransmit and fill the buffer even further. This problem is known as bufferbloat (excessive network buffering), and it is a very serious problem on the modern Internet.

Finally, cellular network performance varies by carrier, region, and time. In Figure 2, we collected the median latencies of HTTPS traffic across cells within a 2-kilometer range, for the two largest mobile carriers in Delhi, India. As you can see, performance varies from cell to cell, and one carrier's performance differs from the other's. This is driven by factors such as network usage patterns by time and location, user mobility, and the network infrastructure itself, including tower density and the mix of network types (LTE, 3G, etc.).

Figure 2. Latencies across cells within a 2 km radius. Delhi, India.

The performance of cellular networks also varies over time. Figure 3 shows the median latency by day of the week. We observed differences on a smaller scale as well: within a single day and even a single hour.

Figure 3. Tail latencies can vary significantly across days, even for the same carrier.

All of the above makes TCP perform inefficiently in wireless networks. However, before looking for alternatives to TCP, we wanted to develop a clear understanding of the following points:

  • Is TCP the main culprit behind tail latencies in our applications?
  • Do modern networks have significant and highly variable round-trip times (RTT)?
  • What is the impact of RTT and loss on TCP performance?

TCP performance analysis

To understand how we analyzed TCP performance, let's briefly recall how TCP transfers data from a sender to a receiver. The sender first establishes a TCP connection with a three-way handshake: it sends a SYN packet, waits for a SYN-ACK packet from the receiver, and then sends an ACK packet. The second and third legs of the handshake add an extra round trip to connection setup. The receiver acknowledges each packet (ACK) to ensure reliable delivery.

If a packet or an ACK is lost, the sender retransmits after a timeout (RTO, retransmission timeout). The RTO is computed dynamically based on various factors, such as the expected RTT between the sender and the receiver.
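To make that timeout calculation more concrete, here is a minimal sketch of the classic smoothed-RTT estimator in the spirit of RFC 6298. It is an illustration, not Uber's code; real TCP stacks also account for clock granularity, exponential backoff, and upper bounds.

```java
// Illustrative RTO estimator in the style of RFC 6298 (not Uber's code).
final class RtoEstimator {
    private double srttMs = -1;   // smoothed RTT; negative means "no sample yet"
    private double rttVarMs = 0;  // RTT variation estimate

    // Feed one RTT measurement in milliseconds, get back the resulting RTO.
    double onRttSample(double rttMs) {
        if (srttMs < 0) {                 // first measurement initializes both estimators
            srttMs = rttMs;
            rttVarMs = rttMs / 2;
        } else {                          // exponentially weighted moving averages
            rttVarMs = 0.75 * rttVarMs + 0.25 * Math.abs(srttMs - rttMs);
            srttMs = 0.875 * srttMs + 0.125 * rttMs;
        }
        // RTO = SRTT + 4 * RTTVAR, with the common 1-second floor.
        return Math.max(1000.0, srttMs + 4 * rttVarMs);
    }
}
```

The key point for the analysis below is that the RTO grows with both the average RTT and its variance, so highly variable wireless RTTs translate directly into long, conservative timeouts.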

Figure 4. Packet exchange over TCP/TLS includes a retransmit mechanism.

To determine how TCP behaved in our applications, we captured TCP packets with tcpdump for a week on production traffic arriving at our Indian edge servers, and then analyzed the TCP connections with tcptrace. In addition, we built an Android application that sends emulated traffic to a test server, imitating real traffic as closely as possible. Smartphones with this application were handed out to several employees, who collected logs over several days.

The results of both experiments were consistent with each other. We saw high RTT latencies: tail values were almost 6 times the median, and the arithmetic mean of latencies exceeded 1 second. Many connections were lossy, causing TCP to retransmit 3.5% of all packets. In congested areas, such as airports and train stations, we observed 7% loss. These results cast doubt on the conventional wisdom that the advanced retransmission schemes used in cellular networks significantly reduce losses at the transport layer. Below are the test results from the emulation application:

  • RTT, milliseconds [50%, 75%, 95%, 99%]: [350, 425, 725, 2300]
  • RTT spread, seconds: ~1.2 s on average
  • Packet loss on lossy links: ~3.5% on average (7% in congested areas)

Nearly half of these connections experienced at least one packet loss, mostly of SYN and SYN-ACK packets. Most TCP implementations use an RTO of 1 second for SYN packets, which grows exponentially for subsequent losses. Application load times increase because TCP takes that much longer to establish connections.
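To see how quickly this adds up: with a 1-second initial RTO and exponential backoff, a lost SYN is retried at roughly t = 1 s, and a second loss pushes the next retry to roughly t = 3 s, so a couple of lost handshake packets alone can delay the very first request by several seconds.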

For data packets, high RTO values greatly reduce useful network utilization in the presence of transient losses on wireless networks. We found that the average retransmission delay is approximately 1 second, with a tail of almost 30 seconds. Such high latencies at the TCP layer caused HTTPS timeouts and retried requests, further increasing network latency and inefficiency.

While the 75th percentile of the measured RTTs was around 425 ms, the 75th percentile for TCP was almost 3 seconds. This suggests that losses forced TCP to spend 7-10 round trips to transfer data successfully. Possible causes include inefficient RTO calculation, TCP's inability to react quickly to the loss of the last packets in a window, and the inefficiency of the congestion-control algorithm, which does not distinguish wireless losses from losses caused by network congestion. Below are the results of the TCP loss measurements:

  • Connections with at least one packet loss: 45%
  • Connections with losses during connection establishment: 30%
  • Connections with losses during data exchange: 76%
  • Distribution of retransmission delays, seconds [50%, 75%, 95%, 99%]: [1, 2.8, 15, 28]
  • Distribution of the number of retransmissions per packet or TCP segment: [1, 3, 6, 7]

Application of QUIC

Originally designed by Google, QUIC is a modern, multi-stream transport protocol that runs on top of UDP. QUIC is currently going through the standardization process (we have already written that there are, in effect, two flavors of QUIC; the curious can follow the link - translator's note). As shown in Figure 5, QUIC sits under HTTP/3 (in fact, HTTP/2 on top of QUIC is what is now being actively standardized as HTTP/3). It partially replaces the HTTPS and TCP layers, using UDP to form packets. QUIC supports only secure data transfer, since TLS is fully built into QUIC.

Figure 5: QUIC runs under HTTP/3, replacing TLS which used to run under HTTP/2.

Here are the reasons that convinced us to use QUIC instead of trying to shore up TCP:

  • 0-RTT connection establishment. QUIC allows credentials from previous connections to be reused, reducing the number of handshakes needed for a secure connection. In the future, TLS 1.3 will also support 0-RTT, but the three-way TCP handshake will still be required.
  • overcoming HoL blocking. HTTP/2 uses a single TCP connection per client to improve performance, but this can lead to head-of-line (HoL) blocking. QUIC multiplexes streams natively, so a lost packet blocks only its own stream, and requests are delivered to the application independently of one another.
  • congestion control. QUIC resides at the application layer, making it easier to update the transport algorithm that governs sending based on network parameters (loss rate or RTT). Most TCP implementations use the CUBIC algorithm, which is not optimal for latency-sensitive traffic. Recently developed algorithms such as BBR model the network more accurately and optimize for latency. QUIC lets us use BBR and update the algorithm as it improves.
  • loss recovery. QUIC fires two tail loss probes (TLP) before the RTO kicks in, even when losses are very noticeable; this differs from TCP implementations. A TLP retransmits mostly the last packet (or a new one, if there is one) to trigger fast recovery. Handling tail delays is especially useful for the way Uber uses the network, namely short, sporadic, latency-sensitive transfers.
  • optimized ACKs. Since each packet carries a unique sequence number, there is no ambiguity in distinguishing retransmitted packets from the originals. ACK packets also carry the time the client spent processing the packet before generating the ACK. These properties let QUIC compute RTT more accurately. ACKs in QUIC support up to 256 NACK ranges, helping the sender stay resilient to packet reordering and use fewer bytes in the process. Selective ACK (SACK) in TCP does not solve this problem in all cases (a small sketch of this RTT calculation follows this list).
  • connection migration. QUIC connections are identified by a 64-bit connection ID, so if a client changes IP addresses, the old connection ID can continue to be used on the new IP address without interrupting the connection. This is a very common scenario for mobile apps, where the user switches between Wi-Fi and cellular connections.
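As a concrete illustration of the "optimized ACKs" point above, here is a minimal sketch (not Uber's or Chromium's code) of how a QUIC-style RTT sample can subtract the ack delay reported by the peer, something a plain TCP ACK cannot convey:

```java
// Illustrative QUIC-style RTT sample that accounts for the peer's reported ack delay.
final class QuicRttSample {
    // sentAtMs:   when the packet with this packet number was sent
    // ackedAtMs:  when the ACK frame covering it arrived
    // ackDelayMs: time the peer reports it held the ACK before sending it
    static double latestRttMs(long sentAtMs, long ackedAtMs, long ackDelayMs) {
        double rawRtt = ackedAtMs - sentAtMs;      // wall-clock round trip
        double adjusted = rawRtt - ackDelayMs;     // remove peer-side processing/delayed-ack time
        return Math.max(adjusted, 0);              // never report a negative sample
    }
}
```

Because every transmission carries a fresh packet number, the sample is never polluted by retransmission ambiguity, which is what makes the estimate more accurate than TCP's.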

Alternatives to QUIC

We considered alternative approaches to solving the problem before choosing QUIC.

First of all, we tried deploying TCP PoPs (Points of Presence) to terminate TCP connections closer to users. Essentially, a PoP terminates the TCP connection from the mobile device closer to the cellular network and proxies the traffic back to the origin infrastructure. By terminating TCP closer to users, we could potentially reduce the RTT and make TCP more responsive to a dynamic wireless environment. However, our experiments showed that most of the RTT and loss comes from the cellular networks themselves, so PoPs did not deliver a significant performance improvement.

We also looked at tuning TCP parameters. Tuning the TCP stack on our heterogeneous edge servers was difficult, because TCP has disparate implementations across OS versions, and implementing and testing the various network configurations was hard. Configuring TCP directly on mobile devices was impossible due to lack of permissions. More importantly, features such as 0-RTT connections and improved RTT prediction are fundamental to the protocol's architecture, so significant gains cannot be achieved by tuning TCP alone.

Finally, we evaluated several UDP-based protocols aimed at video streaming to see whether they would help in our case. Unfortunately, they lacked many of the security features we needed, and they also required an additional TCP connection for metadata and control information.

Our research showed that QUIC is perhaps the only protocol that could help with our Internet traffic problem while addressing both security and performance.

Integration of QUIC into the platform

To embed QUIC successfully and improve application performance under poor connectivity, we replaced the old stack (HTTP/2 over TLS/TCP) with the QUIC protocol. We used Cronet, the networking library from the Chromium Projects, which contains the original Google version of the protocol, gQUIC. This implementation is also constantly being improved to follow the latest IETF specification.

We first integrated Cronet into our Android apps to add QUIC support. The integration was designed to minimize migration costs: instead of completely replacing the old networking stack built on the OkHttp library, we integrated Cronet underneath the OkHttp API. This way, we avoided changes to our network calls (which use Retrofit) at the API level.
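As a rough sketch of what enabling QUIC through Cronet looks like on Android (the host name below is a placeholder, and Uber's actual integration additionally wraps Cronet underneath the OkHttp API, which takes considerably more code than shown here):

```java
import android.content.Context;
import org.chromium.net.CronetEngine;

final class QuicEngineFactory {
    // Builds a Cronet engine with QUIC enabled. addQuicHint lets Cronet attempt QUIC
    // for the given host without first waiting for an alt-svc advertisement.
    static CronetEngine create(Context context) {
        return new CronetEngine.Builder(context)
                .enableQuic(true)
                .addQuicHint("api.example.com", 443, 443) // placeholder host, not an Uber endpoint
                .enableHttp2(true)                        // fallback when QUIC is unavailable
                .build();
    }
}
```

Requests made through such an engine, directly or via an OkHttp-compatible wrapper, can then use QUIC transparently whenever the server supports it.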

Similarly to the Android approach, we implemented Cronet in the Uber iOS apps by intercepting HTTP traffic in the networking API using NSURLProtocol. This abstraction, provided by the iOS Foundation framework, handles protocol-specific URL data and ensures that we can integrate Cronet into our iOS applications without significant migration costs.

QUIC termination on Google Cloud load balancers

On the backend side, QUIC termination is provided by the Google Cloud load-balancing infrastructure, which uses alt-svc headers in responses to advertise QUIC support. The balancer adds an alt-svc header to every HTTP response, announcing that QUIC is available for the domain. When a Cronet client receives an HTTP response with this header, it uses QUIC for subsequent HTTP requests to that domain. Once the balancer terminates QUIC, our infrastructure forwards the traffic over HTTP/2/TCP to our data centers.
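For illustration, such an advertisement looks roughly like the header below; the exact QUIC version list varies by balancer and is shown here only as an example:

alt-svc: quic=":443"; ma=2592000; v="46,43"

It tells the client that QUIC is available on UDP port 443 and that this fact may be cached ("ma", max-age, in seconds) for up to 30 days.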

Performance: Results

Performance is, after all, the main reason we went looking for a better protocol. To begin with, we built a test rig with network emulation to find out how QUIC behaves under different network profiles. To test how QUIC works on real networks, we ran experiments driving around New Delhi while sending emulated network traffic that closely resembled the HTTP calls in the Passenger app.

Experiment 1

Equipment for the experiment:

  • Android test devices with the OkHttp and Cronet stacks, so we could send HTTPS traffic over TCP and QUIC, respectively;
  • a Java-based emulation server that returns the same kind of HTTPS response headers and absorbs the load of client requests (a minimal sketch of such a server follows this list);
  • cloud proxies physically located close to India to terminate the TCP and QUIC connections. While we used an Nginx reverse proxy for TCP termination, it was hard to find an open-source reverse proxy for QUIC, so we built one ourselves on top of the base QUIC stack from Chromium and published it to Chromium as open source.
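A minimal stand-in for such an emulation server might look like the sketch below. It is illustrative only: it serves plain HTTP on a hypothetical port and path, whereas the real server used HTTPS and mimicked production response headers.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Toy emulation server: returns a small, fixed response so that client-side latency
// measurements are dominated by the network rather than by server-side work.
public final class EmulationServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/ping", exchange -> {
            byte[] body = "{\"status\":\"ok\"}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}
```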

Figure 6. The TCP vs. QUIC road test setup consisted of Android devices with OkHttp and Cronet, cloud proxies to terminate connections, and an emulation server.

Experiment 2

When Google made QUIC available through Google Cloud Load Balancing, we used the same setup with one modification: instead of NGINX, we used Google's load balancers to terminate TCP and QUIC connections from the devices and to route the HTTPS traffic to the emulation server. The balancers are distributed all over the world, but they use the PoP server closest to the device (thanks to geolocation).

Figure 7. In the second experiment, we wanted to compare TCP and QUIC termination latency: via Google Cloud versus via our own cloud proxy.

The results held several revelations for us:

  • termination via a PoP improved TCP performance. Because the balancers terminate TCP connections closer to users and are highly optimized, RTTs are lower, which improves TCP performance. And although QUIC was affected less, it still beat TCP in reducing tail latencies (by 10-30 percent).
  • tail latencies are affected by the last-mile hop. Although our QUIC proxy was farther from the devices (about 50 ms higher latency) than Google's balancers, it delivered similar performance: a 15% reduction in latency versus a 20% reduction at the 99th percentile for TCP. This suggests that the last-mile hop is the bottleneck in the network.

Figure 8. The results of two experiments show that QUIC is significantly superior to TCP.

Production traffic

Inspired by the experiments, we rolled out QUIC support in our Android and iOS apps and ran A/B tests to determine QUIC's impact in the cities where Uber operates. Overall, we saw a significant reduction in tail latencies across regions, carriers, and network types.

The graphs below show the percentage improvement in the tails (95th and 99th percentiles) by macro-region and by network type - LTE, 3G, 2G.
Figure 9. In production tests, QUIC outperformed TCP in latency.

Moving forward

Perhaps this is just the beginning: rolling QUIC out to production has opened up remarkable opportunities to improve application performance on both stable and unstable networks, namely:

Coverage increase

After analyzing the protocol's performance on real traffic, we saw that roughly 80% of sessions successfully used QUIC for all requests, while 15% of sessions used a mix of QUIC and TCP. We assume that the mix comes from the Cronet library falling back to TCP on a timeout, since it cannot distinguish real UDP failures from poor network conditions. We are currently looking into a solution to this problem as we work on the subsequent rollout of QUIC.

QUIC optimization

Traffic from mobile apps is latency-sensitive but not bandwidth-sensitive, and our applications are used mainly on cellular networks. Our experiments show that tail latencies are still high even when a proxy terminates TCP and QUIC close to users. We are actively looking for ways to improve congestion control and increase the efficiency of QUIC's loss-recovery algorithms.

With these and several other improvements, we plan to improve the user experience regardless of the network and region, making convenient and seamless packet transport more accessible around the world.

Source: habr.com
