[Translation] Envoy threading model

Article translation: Envoy threading model - https://blog.envoyproxy.io/envoy-threading-model-a8d44b922310

This article seemed quite interesting to me, and since Envoy is most often used as part of Istio or simply as the Kubernetes ingress controller, most people do not interact with it as directly as they do with typical Nginx or HAProxy installations. However, if something breaks, it is good to understand how it works from the inside. I have translated as much of the text as possible, keeping the original English terms in brackets where a translation might be confusing. Details below the cut.

Low-level technical documentation on the Envoy codebase is currently quite sparse. To fix this, I'm planning a series of blog posts about the various Envoy subsystems. Since this is the first article, please let me know what you think and what subsystems you might be interested in for future articles.

One of the most common technical questions I get about Envoy is a request for a low-level description of the threading model it uses. In this post, I will describe how Envoy maps connections to threads, as well as the Thread Local Storage (TLS) system that is used internally to make the code more parallel and performant.

Threading overview

[Figure: threading overview]

Envoy uses three different types of threads:

  • Main: This thread owns process startup and shutdown, all xDS (x Discovery Service) API handling (including DNS, health checking, and general upstream cluster management), runtime, stats flushing, administration, and general process management (Linux signals, hot restart, etc.). Everything that happens on this thread is asynchronous and "non-blocking". In general, the main thread coordinates all critical process functionality that does not require a large amount of CPU to accomplish. This allows the majority of management code to be written as if it were single-threaded.
  • Worker: By default, Envoy spawns a worker thread for every hardware thread in the system; this can be controlled with the --concurrency option. Each worker thread runs a "non-blocking" event loop that is responsible for listening on every listener (at the time of the original writing, July 29, 2017, there is no listener sharding), accepting new connections, instantiating a filter stack for each connection, and processing all IO for the lifetime of the connection. Again, this allows the majority of connection handling code to be written as if it were single-threaded. (A minimal sketch of this thread layout follows this list.)
  • File flusher: Every file that Envoy writes (mostly access logs) currently has an independent blocking thread. This is because writing to files cached by the file system can sometimes block, even when using O_NONBLOCK (sigh). When worker threads need to write to a file, the data is actually moved into an in-memory buffer, from which it is eventually flushed by the file-flush thread. This is one area of the code where, technically, all worker threads can block on the same lock while trying to fill the memory buffer.
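
To make the layout concrete, here is a minimal C++ sketch of the structure described above. It is not Envoy's actual code (Envoy drives its event loops through a Dispatcher abstraction built on libevent); the names and the placeholder loop body are illustrative only:

    // Minimal sketch of the thread layout described above; illustrative only.
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic<bool> shutdown_requested{false};

    // Each worker runs its own non-blocking event loop. Nothing in the loop is
    // shared with other workers, so its body can be written as if single-threaded.
    void worker_event_loop(unsigned id) {
      while (!shutdown_requested.load(std::memory_order_relaxed)) {
        // poll listeners, accept connections, run filter chains, do IO...
        std::this_thread::yield(); // stand-in for an epoll/libevent wait
      }
      std::printf("worker %u exiting\n", id);
    }

    int main() {
      // Default concurrency: one worker per hardware thread (--concurrency
      // overrides this in the real binary).
      unsigned concurrency = std::thread::hardware_concurrency();
      std::vector<std::thread> workers;
      for (unsigned i = 0; i < concurrency; i++) {
        workers.emplace_back(worker_event_loop, i);
      }
      // The main thread would now run its own event loop: xDS, DNS, health
      // checking, stats flushing, admin, signals, and so on.
      shutdown_requested = true; // immediate shutdown, to keep the sketch short
      for (auto& w : workers) w.join();
    }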

Connection handling

As discussed briefly above, all worker threads listen on all listeners without any sharding. The kernel is thus relied upon to intelligently dispatch accepted sockets to worker threads. Modern kernels are in general very good at this; they use features such as IO priority boosting to attempt to fill up a thread's work before starting to use other threads that are also listening on the same socket, as well as not using a single spinlock to process each accept.
Once a connection is accepted on a worker thread, it never leaves that thread. All further processing of the connection is handled entirely in the worker thread, including any forwarding behavior.
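
A minimal sketch of this accept model, assuming plain blocking POSIX sockets rather than the non-blocking event loops Envoy actually uses: every worker calls accept() on the same listening socket, the kernel chooses which sleeping thread receives each new connection, and the connection then stays with that thread for its whole lifetime.

    // Sketch: four threads accept on one listening socket; the kernel picks
    // which sleeping thread gets each connection. Blocking sockets are used
    // here for brevity.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <thread>
    #include <vector>

    void worker(int listen_fd) {
      for (;;) {
        int conn = accept(listen_fd, nullptr, nullptr); // kernel balances accepts
        if (conn < 0) break;
        // From here on the connection belongs to this thread for its entire
        // lifetime: reading, filtering, forwarding, writing.
        close(conn);
      }
    }

    int main() {
      int fd = socket(AF_INET, SOCK_STREAM, 0);
      sockaddr_in addr{};
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
      addr.sin_port = htons(10000);
      bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
      listen(fd, SOMAXCONN);
      std::vector<std::thread> workers;
      for (int i = 0; i < 4; i++) workers.emplace_back(worker, fd);
      for (auto& w : workers) w.join(); // runs until the process is killed
    }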

This has several important implications:

  • All connection pools in Envoy are per worker thread. Thus, although HTTP/2 connection pools only make one connection to each upstream host at a time, if there are four worker threads, there will be four HTTP/2 connections per upstream host in a steady state.
  • The reason Envoy works this way is that by keeping everything on a single worker thread, almost all code can be written lock-free and as if it were single-threaded. This design makes most code easy to write and scales incredibly well to an almost unlimited number of worker threads.
  • However, one of the key takeaways is that, in terms of memory and connection pool efficiency, it is actually very important to tune the --concurrency parameter. Having more worker threads than needed will waste memory, create more idle connections, and lower the connection pool hit rate. At Lyft, our Envoy sidecar containers run at very low concurrency so that their performance roughly matches the services they sit next to. We only run Envoy at maximum concurrency as an edge proxy.

What non-blocking means

The term "non-blocking" has been used several times so far when discussing how the main and worker threads operate. All code is written on the assumption that nothing ever blocks. However, this is not entirely true.

Envoy does have a few process-wide locks:

  • As discussed earlier, when writing access logs, all worker threads acquire the same lock before filling the in-memory log buffer. The lock hold time should be very low, but it is possible for this lock to be contended under high concurrency and high throughput.
  • Envoy uses a very complex system for handling statistics that are local to a thread. This will be the topic of a separate post. However, I will briefly mention that as part of thread-local statistics handling, it is sometimes necessary to acquire a lock on the central "stat store". This lock should never be highly contended.
  • The main thread periodically needs to coordinate with all worker threads. This is done by "posting" from the main thread to the worker threads, and sometimes from worker threads back to the main thread. Posting requires a lock so that the posted message can be queued for later delivery. These locks should never be highly contended, but they can still technically block. (See the sketch after this list.)
  • When Envoy logs to standard error, it acquires a process-wide lock. In general, Envoy's in-process logging is considered terrible for performance, so not much attention has been paid to improving it.
  • There are a few other random locks, but none of them are performance-critical and they should never be contended.
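
A minimal sketch of the posting mechanism from the third bullet, with illustrative names (Envoy's real Dispatcher API differs): the lock only protects the queue of pending callbacks, so it is held just long enough to enqueue or swap.

    // Sketch of cross-thread "posting": the mutex guards only the queue of
    // pending callbacks, so it is held very briefly and is rarely contended.
    #include <functional>
    #include <mutex>
    #include <queue>

    class Dispatcher {
    public:
      // Called from any thread (e.g. main -> worker): enqueue a callback.
      void post(std::function<void()> callback) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push(std::move(callback));
        // A real implementation would also wake the owning event loop here.
      }

      // Called from the owning thread's event loop between IO events.
      void drain() {
        std::queue<std::function<void()>> ready;
        {
          std::lock_guard<std::mutex> lock(mutex_);
          ready.swap(pending_); // hold the lock only for the swap
        }
        while (!ready.empty()) {
          ready.front()();
          ready.pop();
        }
      }

    private:
      std::mutex mutex_;
      std::queue<std::function<void()>> pending_;
    };

A worker's event loop would call drain() between IO events; the main thread calls post() whenever it needs to hand work to that worker.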

Thread local storage

Because of the way Envoy separates main thread responsibilities from worker thread responsibilities, there is a requirement that complex processing can be done on the main thread and then provided to each worker thread with a high degree of concurrency. This section describes the Envoy Thread Local Storage (TLS) system at a high level. In the next section, I will describe how it is used to manage a cluster.
[Figure: Thread Local Storage (TLS) system]

As already described, the main thread handles essentially all management and control plane functionality in the Envoy process. ("Control plane" is a bit of an overloaded term here, but when considered within the Envoy process itself and contrasted with the forwarding that the worker threads do, it seems appropriate.) As a general pattern, the main thread process does some work and then needs to update each worker thread with the result of that work, without the worker threads needing to acquire a lock on every access.

Envoy's TLS (Thread Local Storage) system works as follows (a simplified sketch follows this list):

  • Code running on the main thread can allocate a TLS slot for the entire process. Although this is abstract, in practice it is an index into a vector, providing O(1) access.
  • The main thread can set arbitrary data into its slot. When this is done, the data is posted to each worker thread as a normal event loop event.
  • Worker threads can read from their TLS slot and retrieve any thread-local data available there.
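
The following is a highly simplified sketch of the slot idea, with hypothetical names (Envoy's actual ThreadLocal interfaces are richer): a slot is just an index into a per-thread vector, so reads are O(1) and lock-free, and in the real system the write would be posted into each worker's event loop rather than called directly.

    // Highly simplified sketch of TLS slots; names are illustrative.
    #include <memory>
    #include <vector>

    // Each thread has its own vector of slot contents.
    thread_local std::vector<std::shared_ptr<const void>> tls_slots;

    // Allocated on the main thread; the index is global, the data per-thread.
    struct Slot {
      size_t index;
    };

    Slot allocate_slot() {
      static size_t next_index = 0; // main-thread only in this sketch
      return Slot{next_index++};
    }

    // In real Envoy, the main thread posts this call into every worker's
    // event loop so the write happens on the owning thread.
    void set_on_this_thread(const Slot& slot, std::shared_ptr<const void> data) {
      if (tls_slots.size() <= slot.index) tls_slots.resize(slot.index + 1);
      tls_slots[slot.index] = std::move(data);
    }

    // Workers read their own copy with no locking at all.
    template <class T>
    std::shared_ptr<const T> get(const Slot& slot) {
      return std::static_pointer_cast<const T>(tls_slots[slot.index]);
    }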

Though very simple, this is an incredibly powerful paradigm that is very similar to the RCU (Read-Copy-Update) locking concept. Essentially, worker threads never see any change to the data in the TLS slots while work is executing. Change happens only during the quiescent period between work events.

Envoy uses this in two different ways:

  • By storing truly separate data on each worker thread, which is then accessed without any locking.
  • By keeping a shared pointer to read-only global data on each worker thread. Each worker thread thus holds a reference count on the data that cannot be decremented while work is executing. Only when all workers quiesce and load the new shared data will the old data be destroyed. This is identical to RCU. (A short sketch of this pattern follows this list.)
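
A short sketch of the second pattern, assuming an illustrative Snapshot type: each worker holds a std::shared_ptr to immutable data, and the old version is destroyed automatically when the last reference is released.

    // Sketch of the RCU-like pattern with shared_ptr reference counting.
    #include <memory>
    #include <string>
    #include <vector>

    struct Snapshot {
      std::vector<std::string> healthy_hosts; // immutable after construction
    };

    struct Worker {
      std::shared_ptr<const Snapshot> current; // read on the hot path, no lock
    };

    int main() {
      auto v1 = std::make_shared<const Snapshot>(Snapshot{{"10.0.0.1", "10.0.0.2"}});
      std::vector<Worker> workers(4, Worker{v1});

      // The main thread publishes v2; in Envoy this assignment is posted to
      // each worker and runs between work events, never mid-request.
      auto v2 = std::make_shared<const Snapshot>(Snapshot{{"10.0.0.1"}});
      for (auto& w : workers) w.current = v2;
      // v1 is destroyed once the last holder (here, this local) releases it.
    }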

Cluster update threading

In this section, I will describe how TLS (Thread local storage) is used to manage a cluster. Cluster management includes xDS and/or DNS API processing and health checking.
[Figure: cluster update threading]

Cluster management threading involves the following components and steps:

  1. The cluster manager is the component within Envoy that manages all known upstream clusters, the CDS (Cluster Discovery Service) API, the SDS (Service Discovery Service) and EDS (Endpoint Discovery Service) APIs, DNS, and active (out-of-band) health checking. It is responsible for creating an eventually consistent view of every upstream cluster, including the discovered hosts as well as their health status.
  2. The health checker performs active health checking and reports health state changes back to the cluster manager.
  3. CDS (Cluster Discovery Service) / SDS (Service Discovery Service) / EDS (Endpoint Discovery Service) / DNS are run to determine cluster membership. State changes are reported back to the cluster manager.
  4. Each worker thread continuously runs an event loop.
  5. When the cluster manager determines that the state of a cluster has changed, it creates a new read-only snapshot of the cluster state and posts it to each worker thread.
  6. During the next quiescent period, the worker thread updates the snapshot in its allocated TLS slot.
  7. During an IO event that needs to determine which host to load balance to, the load balancer queries the TLS slot for host information; this requires no locks. (A sketch of this lock-free read follows this list.) Note also that TLS can fire events on update so that load balancers and other components can recompute caches, data structures, and so on. This is beyond the scope of this post, but is used in various places in the code.
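
A sketch of the lock-free read in step 7, with illustrative names: on the request path the load balancer only touches the worker-local snapshot, here with trivial round-robin selection.

    // Sketch of a lock-free host pick from a worker-local cluster snapshot.
    #include <memory>
    #include <string>
    #include <vector>

    struct ClusterSnapshot {
      std::vector<std::string> hosts; // read-only view published by the main thread
    };

    thread_local std::shared_ptr<const ClusterSnapshot> tls_cluster;
    thread_local size_t rr_index = 0;

    const std::string* pick_host() {
      const auto& snap = tls_cluster; // lock-free thread-local read
      if (!snap || snap->hosts.empty()) return nullptr;
      // The thread-local shared_ptr keeps the snapshot alive for this request.
      return &snap->hosts[rr_index++ % snap->hosts.size()];
    }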

Using the above procedure, Envoy is able to process every request without any locking (other than those previously described). Aside from the complexity of the TLS code itself, most of the code does not need to understand how multithreading works and can be written as if it were single-threaded. This makes most of the code easier to write, in addition to yielding excellent performance.

Other subsystems that make use of TLS

TLS (Thread local storage) and RCU (Read Copy Update) are widely used in Envoy.

Some examples of its use:

  • Runtime feature flags: The current set of enabled feature flags is computed on the main thread. A read-only snapshot is then provided to each worker thread using RCU semantics.
  • Route table swapping: For route tables provided by RDS (Route Discovery Service), the route tables are instantiated on the main thread. A read-only snapshot is then provided to each worker thread using RCU semantics. This makes route table swaps both atomic and efficient.
  • HTTP date header caching: As it turns out, computing the HTTP date header on every request (when each core handles ~25K+ RPS) is quite expensive. Envoy computes the date header centrally about every half second and provides it to each worker via TLS and RCU. (A sketch follows this list.)
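
A sketch of the date header idea, assuming the half-second period mentioned above: format the date once on a timer, publish the string via TLS/RCU, and let request handling reuse it instead of re-formatting per request.

    // Sketch: format the HTTP date header once per timer tick, not per request.
    #include <ctime>
    #include <memory>
    #include <string>

    std::shared_ptr<const std::string> format_date_header() {
      char buf[64];
      std::time_t now = std::time(nullptr);
      std::tm tm_utc;
      gmtime_r(&now, &tm_utc);
      std::strftime(buf, sizeof(buf), "%a, %d %b %Y %H:%M:%S GMT", &tm_utc);
      return std::make_shared<const std::string>(buf);
    }
    // The main thread would call format_date_header() on a ~500ms timer and
    // publish the result to each worker's TLS slot; request handling then
    // copies the cached string instead of re-formatting it.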

There are other cases, but the previous examples should provide a good understanding of what TLS is used for.

Known performance pitfalls

While Envoy performs quite well overall, there are a few notable areas that require attention when used at very high concurrency and throughput:

  • As described earlier in this article, all worker threads currently acquire a lock when writing to the access log's in-memory buffer. At high concurrency and high throughput it will be necessary to batch the access logs per worker thread, at the cost of out-of-order delivery when writing to the final file. Alternatively, a separate access log could be created for each worker thread.
  • Although statistics are very heavily optimized, at very high concurrency and throughput there is likely to be atomic contention on individual stats. The solution to this is per-worker-thread counters with periodic flushing to the central counters (see the sketch after this list). This will be discussed in a subsequent post.
  • The current architecture will not work well if Envoy is deployed in a scenario with a very small number of connections that each require substantial resources to process. There is no guarantee that connections will be evenly distributed among worker threads. This could be solved by implementing worker connection balancing, in which a worker would be able to hand a connection off to another worker thread.
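
A sketch of the per-worker counter idea from the statistics bullet above: the hot path increments a plain thread-local integer, and each worker periodically flushes it into a central atomic. Names and flush policy are illustrative.

    // Sketch: per-worker counters flushed periodically to a central atomic.
    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> central_requests{0};
    thread_local uint64_t local_requests = 0;

    inline void bump_request_counter() {
      ++local_requests; // hot path: no atomic operation, no cache-line sharing
    }

    void flush_counters() {
      // Called periodically from each worker's event loop.
      central_requests.fetch_add(local_requests, std::memory_order_relaxed);
      local_requests = 0;
    }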

Conclusion

The Envoy threading model is designed for ease of programming and massive parallelism at the cost of potentially wasteful memory and connections if not configured correctly. This model allows it to perform very well at very high thread counts and throughputs.
As I briefly mentioned on Twitter, the design is also amenable to running on top of a full user-mode networking stack such as DPDK (Data Plane Development Kit), which could enable commodity servers to handle millions of requests per second while doing full L7 processing. It will be very interesting to see what gets built over the next few years.
One last quick comment: I have been asked many times why we chose C++ for Envoy. The reason remains that it is still the only widely deployed production-grade language in which it is possible to build the architecture described in this post. C++ is certainly not suitable for all, or even many, projects, but for certain use cases it is still the only tool for the job.

Links to code

Links to the interfaces and header implementations discussed in this post:
