Linux 5.16 kernel release

After two months of development, Linus Torvalds has released the Linux 5.16 kernel. Among the most notable changes: the futex_waitv system call to improve the performance of Windows games in Wine, file system error tracking through fanotify, the concept of folios in the memory management system, support for AMX processor instructions, the ability to reserve memory for network sockets, netfilter support for classifying packets at the egress stage, use of the DAMON subsystem to proactively evict unused memory areas, improved handling of congestion under heavy write loads, and support for multi-actuator hard drives.

The new version includes 15415 patches from 2105 developers; the patch size is 45 MB (the changes affected 12023 files, 685198 lines of code were added, and 263867 lines were deleted). About 44% of all changes introduced in 5.16 relate to device drivers, about 16% to updating code specific to hardware architectures, 16% to the networking stack, 4% to file systems, and 4% to internal kernel subsystems.

Key innovations in kernel 5.16:

  • Disk Subsystem, I/O and File Systems
    • Added tools to the fanotify mechanism for monitoring the state of a file system and tracking errors. Error information is reported through a new event type, FAN_FS_ERROR, which user-space monitoring systems can intercept to promptly notify the administrator or start recovery processes. When errors cascade, fanotify guarantees delivery of the first error message together with a total count of the problems, to simplify later troubleshooting. Error tracking is currently implemented only for the Ext4 file system.
    • Improved handling of write congestion, which occurs when the volume of write operations exceeds the drive's bandwidth and the system has to block a process's write requests until already submitted requests complete. The new version completely redesigns the kernel mechanism used to obtain information about congestion and to block tasks, because the old implementation had problems coordinating write congestion handling with the eviction of memory pages to the swap partition when the system ran low on memory.
    • Btrfs adds support for the Zoned Namespace technology used in hard drives and NVMe SSDs, which divides storage space into zones made up of groups of blocks or sectors; data can only be appended sequentially within a zone, and the entire group of blocks is updated at once. In addition, minor inode logging optimizations yield a 3% throughput increase in the dbench test and an 11% reduction in latency. The directory logging mechanism has been redesigned to reduce the number of tree searches and locks. Batch insertion of elements into the btree structure is faster (mass insertion time is reduced by 4% and deletion time by 12%). Added limited support for compression when writing incomplete pages, as well as the ability to defragment subpages. Preparations have been made for supporting the second version of the protocol for the "send" command.
    • The XFS file system uses less memory thanks to separate slab caches for frequently used objects and smaller versions of some data structures.
    • In the Ext4 file system, the only notable changes are bug fixes and a more accurate calculation of parameters for lazy initialization of the inode table.
    • Optimizations at the block device level significantly improve the efficiency of binding operations to CPU cores.
    • Added initial support for hard drives with multiple independent actuators (multi-actuator drives), which can access several sectors in different areas of the magnetic platters at the same time.
    • Added a new ioctl command, CDROM_TIMED_MEDIA_CHANGE, for detecting media change events in an optical disc drive.
    • The EROFS (Enhanced Read-Only File System) file system can now work on top of multiple storage devices. Different devices can be mapped into the same 32-bit block address space. Support for compression with the LZMA algorithm has also been added.
    • Mount options have been added to F2FS for controlling the fragmentation of files as they are placed in storage (for example, to debug optimizations for fragmented storage).
    • Ceph now enables asynchronous directory creation and deletion by default (mount with the '-o wsync' flag to restore the old behavior). Added metrics that track operations copying external objects.
    • The tcpnodelay mount option has been added to CIFS; it sets no-delay mode on the network socket through tcp_sock_set_nodelay, so that the TCP stack does not wait for the queue to fill before sending. Added support for nested DFS (Distributed File System) links when remounting.
    • Added support for completing block device requests in batches. In testing, the change raised random read performance on Optane drives from 6.1 to 6.6 million IOPS per CPU core.
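The FAN_FS_ERROR reporting described in the fanotify item above can be sketched from user space roughly as follows. This is a minimal illustration, not a reference implementation: Python's standard library has no fanotify wrapper, so the constants are copied by hand from the kernel's <linux/fanotify.h>, and the helper names parse_fs_error_info and watch_filesystem are hypothetical. It assumes a 5.16 or newer kernel, a file system that reports errors (Ext4 in this release) and the CAP_SYS_ADMIN capability.

```python
import ctypes
import os
import struct

# Constants copied from <linux/fanotify.h>; they are not exposed by the Python stdlib.
FAN_CLASS_NOTIF = 0x00000000
FAN_REPORT_FID = 0x00000200
FAN_MARK_ADD = 0x00000001
FAN_MARK_FILESYSTEM = 0x00000100
FAN_FS_ERROR = 0x00008000
FAN_EVENT_INFO_TYPE_ERROR = 5
AT_FDCWD = -100

# Layout of struct fanotify_event_info_error:
# u8 info_type, u8 pad, u16 len, s32 error, u32 error_count
ERROR_INFO = struct.Struct("=BBHiI")


def parse_fs_error_info(record: bytes):
    """Decode one error info record; return (first_errno, error_count) or None."""
    if len(record) < ERROR_INFO.size:
        return None
    info_type, _pad, _length, error, count = ERROR_INFO.unpack_from(record)
    if info_type != FAN_EVENT_INFO_TYPE_ERROR:
        return None
    return error, count


def watch_filesystem(path: str) -> int:
    """Return a fanotify fd that reports FAN_FS_ERROR for the file system at path."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.fanotify_init.argtypes = [ctypes.c_uint, ctypes.c_uint]
    libc.fanotify_mark.argtypes = [ctypes.c_int, ctypes.c_uint, ctypes.c_uint64,
                                   ctypes.c_int, ctypes.c_char_p]
    # FAN_FS_ERROR needs a group created with FAN_REPORT_FID and a filesystem-wide mark.
    fd = libc.fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, os.O_RDONLY)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "fanotify_init failed")
    rc = libc.fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM, FAN_FS_ERROR,
                            AT_FDCWD, os.fsencode(path))
    if rc < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, "fanotify_mark failed")
    return fd
```

A monitoring daemon would read events from the returned descriptor and pass the info record that follows each event header to parse_fs_error_info, which yields the first error code and the total number of errors seen since the last report.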
  • Memory and system services
    • A new futex_waitv system call has been added that allows you to monitor the status of multiple futexes at once with a single system call. This feature resembles the WaitForMultipleObjects functionality available in Windows, emulation of which via futex_waitv can be useful for improving the performance of Windows games running under Wine or Proton. In addition, simultaneous waiting for futexes can also be used to optimize the performance of native builds of games for Linux.
    • The concept of page folios has been implemented; its use in some kernel subsystems will speed up memory management under typical loads. At present, the core memory management subsystem and the page cache implementation have already been converted to folios, and file systems are to be converted later. Support for multi-page folios is also planned for a future kernel release.

      Folios resemble compound pages, but offer improved semantics and a clearer organization of work. To manage system memory, available RAM is divided into memory pages whose size depends on the architecture; on x86 systems it is measured in kilobytes (typically 4096 bytes). Today's systems come with tens of gigabytes of RAM, which complicates memory management because a huge number of pages must be processed. To reduce the number of pages, the kernel earlier introduced compound pages, structures that span more than one physical page of memory. But the API for manipulating compound pages left much to be desired and introduced additional overhead.

    • A handler has been added to the task scheduler that takes into account cache clustering in the CPU. In some processors, such as Kunpeng 920 (ARM) and Intel Jacobsville (x86), a certain number of CPU cores, usually 4, can share L3 or L2 cache. Accounting for such topologies can significantly increase the efficiency of distributing tasks across CPU cores in the task scheduler, since moving tasks within the same CPU cluster allows you to increase memory access throughput and reduce cache contention.
    • Added support for AMX (Advanced Matrix Extensions) instructions implemented in the upcoming Intel Xeon Scalable server processors, codenamed Sapphire Rapids. AMX offers new configurable TMM "TILE" registers and instructions for manipulating data in these registers, such as TMUL (Tile matrix MULtiply) for matrix multiplication.
    • Several new features have been built on the DAMON (Data Access MONitor) subsystem added in the previous release, which monitors access to data in RAM for a selected process running in user space. For example, the subsystem makes it possible to analyze which memory areas a process has accessed over its whole run time and which areas have remained unused. The new features are:
      • DAMON_RECLAIM, to identify and evict memory areas that have not been accessed. The mechanism can be used to proactively and gently page out memory when free memory is close to exhaustion.
      • DAMOS (Data Access Monitoring-based Operation Schemes), to apply specified madvise() operations, such as reclaiming additional free memory, to areas of process memory that match a given access frequency. DAMOS settings are configured via debugfs.
      • The ability to monitor the physical address space of memory (previously only virtual addresses could be monitored).
    • The implementation of the zstd compression algorithm has been updated to version 1.4.10, which significantly improves the performance of kernel subsystems that use compression (for example, decompression of the kernel image is 35% faster, decompression of compressed data in Btrfs and SquashFS is 15% faster, and in ZRAM 30% faster). Until now the kernel used a separate implementation of zstd based on version 1.3.1, which was released over three years ago and lacked many important optimizations. Besides moving to the current version, the patch also makes it easier to synchronize with the upstream zstd branch by generating the code for the kernel directly from the main zstd repository. In the future, the zstd code in the kernel is to be updated as new versions of the zstd library are released.
    • A large batch of improvements has been made to the eBPF subsystem. BPF programs can now call kernel module functions. The bpf_trace_vprintk() function has been implemented, which, unlike bpf_trace_printk(), can print more than three arguments at a time. A new data storage structure (BPF map), the bloom filter, has been added; it uses the probabilistic data structure of the same name to test whether an element is present in a set. A new attribute, BTF_KIND_TAG, can be used in BPF programs to attach tags to function parameters, for example, to make errors in user programs easier to identify. libbpf can now create custom .rodata.*/.data.* sections, supports uprobe and kprobe tracing events, and provides an API for copying all BTF types from one object to another. AF_XDP support has moved from libbpf to the separate libxdp library. A JIT compiler for the BPF virtual machine has been implemented for the MIPS architecture.
    • For the ARM64 architecture, support for the ARMv8.6 timer extensions is implemented, including self-synchronizing views of system registers that do not require ISB instructions.
    • For the PA-RISC architecture, the ability to use the KFENCE mechanism to detect errors when working with memory has been implemented, and support for the KCSAN race condition detector has been added.
    • Access rights to tracefs can now be configured at the level of individual users and groups; for example, access to tracefs tools can be restricted to members of a particular group.
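The futex_waitv call described at the start of this section takes an array of waiter descriptions and returns the index of the futex that was woken. The sketch below shows the shape of that interface through ctypes. It is an illustration under stated assumptions: the syscall number (449, as on x86-64 and arm64) and the flag values are copied by hand from the kernel headers, and build_waiters and wait_any are hypothetical helper names, not part of any library.

```python
import ctypes
import os

# Values copied from the kernel headers (assumed: x86-64 or arm64, Linux 5.16+).
SYS_futex_waitv = 449
FUTEX_32 = 2                # the futex word is 32 bits wide
FUTEX_PRIVATE_FLAG = 128    # the futex is not shared between processes
FUTEX_WAITV_MAX = 128       # upper limit on futexes per call


class FutexWaitv(ctypes.Structure):
    """Mirror of the kernel's struct futex_waitv."""
    _fields_ = [
        ("val", ctypes.c_uint64),       # value the futex word is expected to hold
        ("uaddr", ctypes.c_uint64),     # user-space address of the futex word
        ("flags", ctypes.c_uint32),
        ("reserved", ctypes.c_uint32),  # must be zero
    ]


def build_waiters(words, expected):
    """Describe each 32-bit futex word and the value it is expected to hold."""
    if len(words) > FUTEX_WAITV_MAX:
        raise ValueError("too many futexes for one futex_waitv call")
    waiters = (FutexWaitv * len(words))()
    for i, (word, value) in enumerate(zip(words, expected)):
        waiters[i].val = value
        waiters[i].uaddr = ctypes.addressof(word)
        waiters[i].flags = FUTEX_32 | FUTEX_PRIVATE_FLAG
    return waiters


def wait_any(waiters):
    """Block until one futex is woken and return its index (no timeout)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    index = libc.syscall(ctypes.c_long(SYS_futex_waitv), ctypes.byref(waiters),
                         ctypes.c_uint(len(waiters)), ctypes.c_uint(0),
                         None, ctypes.c_int(0))
    if index < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return index
```

If any futex word no longer holds its expected value when the call is made, the kernel returns immediately with EAGAIN instead of sleeping, which is what lets an emulation layer implement WaitForMultipleObjects-style semantics without one thread per object.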
  • Virtualization and Security
    • The io_uring and device-mapper subsystems support the generation of audit events. io_uring provides the ability to control access through LSM modules. Added the ability to audit the openat2() system call.
    • The kernel code has been completely cleared of implicit fall-through in switch statements (case blocks that end without a return or break). The kernel can now be built with the "-Wimplicit-fallthrough" option.
    • Included changes to tighten up bounds checks when executing the memcpy() function.
    • The io_uring asynchronous I/O interface implements the ability to apply security policies defined by the SELinux and Smack modules to I/O operations.
    • The IMA (Integrity Measurement Architecture) subsystem, which allows an external service to verify the state of kernel subsystems in order to verify their authenticity, implements the ability to apply rules based on the group identifier (GID) to which the file belongs or to which the user accessing the file belongs.
    • Some advanced mechanisms for protecting seccomp() threads against Spectre-class attacks are now disabled by default; they were deemed redundant, as they do not significantly improve security but do hurt performance. The application of Retpoline protection has been revised.
    • Removed the cryptoloop implementation, which was superseded in 2004 by dm-crypt; dm-crypt supports the same algorithms where they are still needed.
    • Unprivileged access to the eBPF subsystem is now denied by default. The change is intended to prevent BPF programs from being used to bypass protections against side-channel attacks. If necessary, the administrator can re-enable eBPF for unprivileged users.
    • The ACRN hypervisor, designed to perform real-time tasks and use in critical systems, has added support for creating / deleting virtual devices and forwarding MMIO devices.
    • Support for KPP (Key-agreement Protocol Primitives) definitions has been added to the crypto engine, which simplifies the logic of developing drivers for cryptosystems.
    • For the Hyper-V hypervisor, support for the virtual machine isolation mode has been implemented, which implies encryption of the contents of memory.
    • Support for the RISC-V architecture has been added to the KVM hypervisor. Virtual machines running with the AMD SEV and SEV-ES extensions can now be migrated within the host environment. Added an API for live migration of guest systems encrypted with AMD SEV (Secure Encrypted Virtualization).
    • For the PowerPC architecture, the STRICT_KERNEL_RWX mode is enabled by default, which blocks the use of memory pages that are simultaneously available for writing and execution.
    • On 32-bit x86 systems, support for memory hotplug has been dropped; it had been non-functional for over a year.
    • The liblockdep library has been removed from the kernel tree and will now be maintained separately.
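The default for unprivileged eBPF mentioned above is visible at run time through the kernel.unprivileged_bpf_disabled sysctl. The following sketch reads it and explains the value; the helper names describe and current_state are hypothetical, and the interpretation of the values (0, 1, 2) follows the kernel's sysctl documentation as I understand it.

```python
from pathlib import Path

SYSCTL = Path("/proc/sys/kernel/unprivileged_bpf_disabled")

# Meaning of each value, per the kernel sysctl documentation.
MEANING = {
    0: "enabled: unprivileged users may call bpf()",
    1: "disabled until reboot: the setting cannot be changed back",
    2: "disabled: an administrator may still write 0 to re-enable",
}


def describe(value: int) -> str:
    """Translate a sysctl value into a human-readable state."""
    return MEANING.get(value, f"unknown value {value}")


def current_state(path: Path = SYSCTL):
    """Return the description for this machine, or None if the sysctl is unavailable."""
    try:
        return describe(int(path.read_text().strip()))
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print(current_state())
```

On a kernel built with the new default the value is expected to be 2, which is the state from which an administrator can restore unprivileged access by writing 0 to the same file.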
  • Network subsystem
    • A new socket option, SO_RESERVE_MEM, has been implemented; it reserves a given amount of memory for a socket, which always remains available to that socket and is not reclaimed. The option can improve performance by reducing memory allocation and reclamation operations in the network stack, especially when the system is low on memory.
    • Added support for the Automatic Multicast Tunneling protocol (RFC 7450), which delivers multicast traffic from multicast-capable networks to recipients in networks without multicast support. The protocol works by encapsulating traffic in UDP packets.
    • Improved encapsulation of IOAM (In-situ Operations, Administration, and Maintenance) data in transit packets.
    • Added the ability to control transceiver power modes to the ethtool netlink API.
    • The netfilter subsystem can now classify packets at the egress level, i.e. at the stage when the driver receives a packet from the kernel network stack. In nftables, support for the corresponding filters appeared in version 1.0.1. Netfilter can also match and modify the inner headers and payload of UDP and TCP packets, i.e. the data that follows the transport header.
    • Added new sysctl parameters, arp_evict_nocarrier and ndisc_evict_nocarrier; when they are set, the ARP cache and the ndisc (neighbor discovery) table are cleared if the link is lost (NOCARRIER).
    • Low Latency, Low Loss and Scalable Throughput (L4S) modes have been added to the fq_codel (Controlled Delay) network queue management mechanism.
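Setting the SO_RESERVE_MEM option described above might look like the sketch below. Treat the details as assumptions: Python's socket module may not define the constant, so the code falls back to the value 73 from the generic kernel socket header, and reserve_socket_memory is a hypothetical helper. The kernel can reject the request (for example on older kernels or when the socket is not accounted to a memory cgroup), so the helper reports failure instead of raising.

```python
import socket

# Fall back to the value from asm-generic/socket.h if the stdlib lacks the constant.
SO_RESERVE_MEM = getattr(socket, "SO_RESERVE_MEM", 73)


def reserve_socket_memory(sock, nbytes: int) -> bool:
    """Ask the kernel to keep nbytes reserved for sock; return True on success."""
    if nbytes < 0:
        raise ValueError("reservation size must not be negative")
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_RESERVE_MEM, nbytes)
    except OSError:
        # Unsupported kernel, or the socket is not eligible for a reservation.
        return False
    return True


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("reserved:", reserve_socket_memory(s, 1 << 20))
```

Returning a boolean keeps the reservation a best-effort optimization: a server can request it for long-lived connections and carry on unchanged where the kernel does not honor it.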
  • Hardware
    • The amdgpu driver provides initial support for the DP 2.0 specification (DisplayPort 2.0) and DisplayPort tunneling capability over USB4. Added support for display controllers for Cyan Skillfish APUs (equipped with GPU Navi 1x). Expanded support for Yellow Carp APUs (Ryzen 6000 "Rembrandt" mobile processors).
    • The i915 driver has stabilized support for Intel Alder Lake S chips and implemented support for Intel PXP (Protected Xe Path) technology, which makes it possible to set up a hardware-protected graphics session on systems with Intel Xe chips.
    • Work has been done in the nouveau driver to fix bugs and improve the style of the code.
    • Added support for x86-compatible Vortex CPUs (Vortex86MX). Linux already ran on these processors, but explicit identification of the CPUs was needed to disable protection against Spectre/Meltdown attacks, which do not apply to these chips.
    • Added initial x86 support for Surface Pro 8 and Surface Laptop Studio.
    • Added a driver for the sound chips used in AMD Yellow Carp and Van Gogh APUs, as well as support for Cirrus CS35L41, Maxim MAX98520/MAX98360A, Mediatek MT8195, Nuvoton NAU8821, NVIDIA Tegra210, NXP i.MX8ULP, Qualcomm AudioReach, Realtek ALC5682I-VS, RT5682S, RT9120, Rockchip RV1126 and RK3568 sound systems and codecs.
    • Added the ishtp_eclite driver for accessing integrated Intel PSE (Programmable Service Engine) controllers over ISHTP (Integrated Sensor Hub Transport Protocol), for example, to obtain battery, temperature, and UCSI (USB Type-C Connector System Software Interface) related information.
    • Added a driver for Nintendo Switch game controllers that supports Switch Pro and Joy-Con devices. Added support for Wacom Intuos BT tablets (CTL-4100WL/CTL-6100WL) and the Apple 2021 Magic Keyboard. Improved support for Sony PlayStation DualSense controllers. Added support for the side buttons of Xiaomi Mi mice.
    • Added the rtw89 driver with support for Realtek 802.11ax wireless chips, as well as drivers for Asix AX88796C-SPI Ethernet adapters and Realtek RTL8365MB-VC switches.
    • Added PCI and PASemi i2c drivers for Apple M1 chips.
    • Added support for the following ARM SoCs, devices, and boards: Raspberry Pi Compute Module 4, Fairphone 4, Snapdragon 690, LG G Watch R, Sony Xperia 10 III, Samsung Galaxy S4 Mini Value Edition, Xiaomi MSM8996 (Mi 5, Mi Note 2, Mi 5s, Mi Mix and Mi 5s Plus), Sony Yoshino (Sony Xperia XZ1 and Sony Xperia XZ Premium), F(x)tec Pro1 QX1000, Microchip LAN966, CalAmp LMU5000, Exegin Q5xR5, sama7g5, Samsung ExynosAutov9, Rockchip RK3566, RK3399 ROCK Pi 4A+, RK3399 ROCK Pi 4B+, Firefly ROC-RK3328-PC, Firefly ROC-RK3399-PC-PLUS, ASUS Chromebook Tablet CT100, Pine64 Quartz64-A, Netgear GS110EMX, Globalscale MOCHAbin 7040, NXP S32G2, Renesas R8A779M*, Xilinx Kria, Radxa Zero, JetHub D1/H1, Netronix E70K02.

Source: opennet.ru
