Linux optimization to handle 1.2 million JSON requests per second

A detailed guide has been published on tuning Linux for maximum HTTP request processing performance. The described methods increased the throughput of a JSON processor built on the libreactor library in an Amazon EC2 environment (4 vCPUs) from 224 thousand API requests per second with the stock settings of Amazon Linux 2 and its 4.14 kernel to 1.2 million requests per second after optimization (a 436% increase), and also reduced request processing latency by 79%. The suggested methods are not specific to libreactor and apply to other HTTP servers as well, including nginx, Actix, Netty and Node.js (libreactor was chosen for the tests because a solution based on it showed the best performance).

Main optimizations:

  • libreactor code optimization. The R18 variant from the TechEmpower benchmark suite was used as a basis and improved by removing the code that limited the number of CPU cores used (this alone sped things up by 25-27%), building with GCC using the "-O3" option (a 5-10% gain) and "-march=native" (5-10%), replacing read/write calls with recv/send (5-10%; a C sketch of this substitution appears after the list), and reducing the overhead of using pthreads (2-3%). The overall performance gain from the code optimization was 55%, and throughput increased from 224k req/s to 347k req/s.
  • Disabling protections against vulnerabilities caused by speculative instruction execution. Booting the kernel with the parameters "nospectre_v1 nospectre_v2 pti=off mds=off tsx_async_abort=off" increased performance by 28%, and throughput grew from 347k req/s to 446k req/s. Taken separately, the gain from "nospectre_v1" (protection against Spectre v1 + SWAPGS) was 1-2%, from "nospectre_v2" (Spectre v2) 15-20%, from "pti=off" (Spectre v3/Meltdown) 6%, and from "mds=off tsx_async_abort=off" (MDS/Zombieload and TSX Asynchronous Abort) 6%. The protections against L1TF/Foreshadow (l1tf=flush), iTLB multihit, Speculative Store Bypass and SRBDS were left unchanged, as they did not affect performance: they do not apply to the tested configuration (for example, they are specific to KVM, nested virtualization or other CPU models).
  • Disabling the syscall auditing and blocking mechanisms with the "auditctl -a never,task" command and the "--security-opt seccomp=unconfined" option when starting the Docker container. The overall performance gain was 11%, and throughput increased from 446k req/s to 495k req/s.
  • Disabling iptables/netfilter by unloading the associated kernel modules. The idea of disabling a firewall that was not needed in this particular server setup was prompted by profiling results, which showed the nf_hook_slow function consuming 18% of execution time. It is noted that nftables is more efficient than iptables, yet Amazon Linux continues to use iptables. After disabling iptables, performance gained 22% and throughput increased from 495k req/s to 603k req/s.
  • Reducing the migration of handlers between CPU cores to improve processor cache utilization. Optimization was done both by binding libreactor processes to CPU cores (CPU pinning) and by pinning the kernel's network handlers (Receive Side Scaling): irqbalance was disabled and the CPU affinity of each network interrupt queue was set explicitly in /proc/irq/$IRQ/smp_affinity_list. To have the same CPU core process both the libreactor worker and the incoming packet queue, a classic BPF program is attached to the socket via the SO_ATTACH_REUSEPORT_CBPF option at socket creation (see the sketch after this list). Outgoing packet queues were bound to CPUs via the /sys/class/net/eth0/queues/tx-<n>/xps_cpus settings. The overall performance gain was 38%, and throughput increased from 603k req/s to 834k req/s.
  • Optimizing interrupt handling and using polling. Enabling adaptive-rx mode in the ENA driver and tuning the net.core.busy_read sysctl increased performance by 28% (throughput grew from 834k req/s to 1.06M req/s, and latency dropped from 361μs to 292μs); a per-socket busy-poll variant is sketched after this list.
  • Disabling system services that lead to unnecessary locking in the network stack. Disabling dhclient and setting the IP address manually gave a 6% performance improvement, and throughput increased from 1.06M req/s to 1.12M req/s. dhclient affects performance because it analyzes traffic through a raw socket.
  • Fighting spinlock contention. Switching the network stack to "noqueue" mode via the sysctl "net.core.default_qdisc=noqueue" and "tc qdisc replace dev eth0 root mq" yielded a 2% performance gain, and throughput increased from 1.12M req/s to 1.15M req/s.
  • Final minor optimizations, such as disabling GRO (Generic Receive Offload) with "ethtool -K eth0 gro off" and replacing the cubic congestion control algorithm with reno using the sysctl "net.ipv4.tcp_congestion_control=reno". The overall performance increase was 4%, and throughput grew from 1.15M req/s to 1.2M req/s.
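
The following sketches illustrate three of the techniques above in C. First, the read/write-to-recv/send substitution mentioned in the libreactor item. This is not libreactor's actual code, only a minimal illustration of the kind of change described, assuming fd is a connected TCP socket:

    #include <sys/types.h>
    #include <sys/socket.h>

    /* On sockets, recv/send with zero flags behave like read/write but
     * bypass the generic VFS read/write path; the article attributes a
     * 5-10% gain to this substitution. */
    static ssize_t sock_read(int fd, void *buf, size_t len)
    {
        return recv(fd, buf, len, 0);        /* was: read(fd, buf, len) */
    }

    static ssize_t sock_write(int fd, const void *buf, size_t len)
    {
        return send(fd, buf, len, 0);        /* was: write(fd, buf, len) */
    }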
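
Next, the CPU pinning and SO_ATTACH_REUSEPORT_CBPF steering referenced above. This is a sketch rather than the article's exact implementation; it assumes one SO_REUSEPORT listening socket per worker, one worker per vCPU, and Linux 4.5 or newer for SO_ATTACH_REUSEPORT_CBPF:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/socket.h>
    #include <linux/filter.h>

    /* Pin the calling thread to a single CPU core. */
    static void pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
    }

    /* Attach a two-instruction classic BPF program that returns the CPU
     * number on which the packet was received; the kernel uses that value
     * as an index into the SO_REUSEPORT socket group, so with one worker
     * socket per CPU the connection lands on the worker pinned to the
     * receiving core. */
    static int attach_cpu_steering(int listen_fd)
    {
        struct sock_filter code[] = {
            { BPF_LD  | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
            { BPF_RET | BPF_A,           0, 0, 0 },
        };
        struct sock_fprog prog = {
            .len    = sizeof(code) / sizeof(code[0]),
            .filter = code,
        };
        return setsockopt(listen_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                          &prog, sizeof(prog));
    }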
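
Finally, busy polling. The article tunes the global net.core.busy_read sysctl; the per-socket counterpart shown below (SO_BUSY_POLL, Linux 3.11+) is an alternative sketch rather than what the article used. It sets how long a blocking receive may spin on the device queue before falling back to interrupt-driven waiting:

    #include <sys/socket.h>

    /* Enable busy polling on one socket for 'usecs' microseconds per
     * blocking receive; the per-socket default for this option comes from
     * the net.core.busy_read sysctl. Returns 0 on success, -1 on error. */
    static int enable_busy_poll(int fd, int usecs)
    {
        return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                          &usecs, sizeof(usecs));
    }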

In addition to the optimizations that worked, the article also discusses methods that did not lead to the expected performance increase. For example, the following turned out to be ineffective:

  • Running libreactor standalone performed no differently from running it in a container. Replacing writev with send, increasing maxevents in epoll_wait, and experimenting with GCC versions and flags had no effect (the only noticeable effect came from the "-O3" and "-march=native" flags).
  • Updating the Linux kernel to versions 4.19 and 5.4, using the SCHED_FIFO and SCHED_RR schedulers, manipulating sysctl kernel.sched_min_granularity_ns, kernel.sched_wakeup_granularity_ns, transparent_hugepages=never, skew_tick=1 and clocksource=tsc did not affect performance.
  • In the ENA driver, enabling offload modes (segmentation, scatter-gather, rx/tx checksum), building the driver with the "-O3" flag, and the ena.rx_queue_size and ena.force_large_llq_header parameters had no effect.
  • Changes in the networking stack did not lead to performance improvements:
    • Disable IPv6: ipv6.disable=1
    • Disable VLAN: modprobe -rv 8021q
    • Disable packet source validation (reverse path filtering)
      • net.ipv4.conf.all.rp_filter=0
      • net.ipv4.conf.eth0.rp_filter=0
      • net.ipv4.conf.all.accept_local=1 (degraded performance)
    • net.ipv4.tcp_sack = 0
    • net.ipv4.tcp_dsack=0
    • net.ipv4.tcp_mem/tcp_wmem/tcp_rmem
    • net.core.netdev_budget
    • net.core.dev_weight
    • net.core.netdev_max_backlog
    • net.ipv4.tcp_slow_start_after_idle=0
    • net.ipv4.tcp_moderate_rcvbuf = 0
    • net.ipv4.tcp_timestamps = 0
    • net.ipv4.tcp_low_latency = 1
    • SO_PRIORITY
    • TCP_NODELAY

Source: opennet.ru
