After two months of development, Linus Torvalds released the Linux kernel 6.14. Among the most notable changes: the ntsync driver with Windows NT synchronization primitives, read balancing settings for Btrfs RAID1, reflink support in XFS realtime mode, uncached buffered I/O, the dmem cgroup for limiting GPU memory, io_uring support in FUSE, attribute delegation in NFS, atomic write support in Device mapper, faster handling of symbolic links, control over the ability to execute scripts, support for Qualcomm Snapdragon 8 Elite chips, and a driver for AMD NPUs.
The new version includes 12115 fixes from 1984 developers, the patch size is 39 MB (the changes affected 10170 files, 531586 lines of code were added, 235999 lines were deleted). The previous release had 14172 fixes from 2086 developers, the patch size was 46 MB. About 41% of all changes presented in 6.14 are related to device drivers, about 13% of changes are related to updating code specific to hardware architectures, 14% are related to the network stack, 7% are related to file systems and 4% are related to internal kernel subsystems.
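The figures above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration built from the numbers quoted in this paragraph; the function and constant names are made up for the example.

```python
# Figures quoted in the release summary above.
ADDED_LINES = 531_586
DELETED_LINES = 235_999
FIXES_CURRENT = 12_115    # kernel 6.14
FIXES_PREVIOUS = 14_172   # "the previous release"

# Shares of changes by area, in percent, as listed above.
SHARES = {
    "device drivers": 41,
    "architecture-specific code": 13,
    "network stack": 14,
    "file systems": 7,
    "internal kernel subsystems": 4,
}


def net_growth(added: int, deleted: int) -> int:
    """Net number of lines by which the source tree grew."""
    return added - deleted


def relative_change(new: int, old: int) -> float:
    """Relative change from old to new, in percent (negative means a drop)."""
    return (new - old) / old * 100.0


def unlisted_share(shares: dict) -> int:
    """Percentage of changes not attributed to any listed area."""
    return 100 - sum(shares.values())


if __name__ == "__main__":
    print("net growth, lines:", net_growth(ADDED_LINES, DELETED_LINES))
    print("change in number of fixes, %:",
          round(relative_change(FIXES_CURRENT, FIXES_PREVIOUS), 1))
    print("share not listed, %:", unlisted_share(SHARES))
```

In other words, the tree grew by roughly 296 thousand lines, and about a fifth of the changes fall outside the five listed areas.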
Key innovations in kernel 6.14:
- Disk Subsystem, I/O and File Systems
- The Btrfs file system now supports new methods for balancing read operations between drives in a RAID1 array. In addition to the existing load distribution based on process identifiers (pid), the new version offers three balancing modes: "rotation" (uniform load distribution across all drives, enabled by default); "latency" (distribution that takes latency into account, which can be useful when drives fail or behave unstably); and "devid" (manual control). The balancing mode is changed through a new "read_policy" file under "/sys/fs/btrfs/". Among other changes in Btrfs is the implementation of the FS_IOC_READ_VERITY_METADATA ioctl.
- Added a non-cached buffered I/O mode, in which data is removed from the page cache immediately after the read or write operations are completed. The change can be useful when using very fast storage devices, for which caching operations in RAM is redundant. For such devices, the new mode allows eliminating unnecessary memory consumption by the page cache without resorting to the complex Direct I/O API.
- A new FS_PRE_ACCESS event has been added to fsnotify, the mechanism for tracking file system changes. It is generated before the file contents are accessed. The event is processed synchronously: the kernel sends the event and waits for a response. If the response allows the access, the operation is performed; otherwise the system call returns an error code to user space. Using FS_PRE_ACCESS, a user-space process can, for example, fill in a file on demand as its data arrives from slow storage.
- The FUSE subsystem, which enables the creation of user-space file system implementations, has been updated to support the exchange of data between the kernel and a user-space handler using the io_uring I/O mechanism. The change improves FUSE performance by reducing context switches between the kernel and user space.
- The XFS file system now supports reverse mapping (rmap) in realtime device mode. Reverse mapping makes it possible to determine which file owns a given block on a storage device. Building on rmap, XFS in realtime mode now supports the reflink operation, which creates copies of files by cloning file metadata and linking to the existing data without actually copying it.
- VFS now caches symbolic link sizes, which speeds up the readlink operation by 1.5% (in a test with /initrd.img on ext4). Caching is enabled in ext4 and tmpfs.
- The NFSv4.2 implementation adds support for file attribute delegation, allowing file attributes such as modification time (mtime) to be managed on the NFS client side without having to flush changes to the server. NFS also improves support for the "LOCALIO" protocol, which allows determining whether the NFS client and server are on the same host, to enable appropriate optimizations such as the client using Direct I/O.
- Improved performance of read operations in NETFS, CIFS and AFS (Andrew File System) file systems.
- Squashfs now includes a direct block loading mode into the page cache (SQUASHFS_FILE_DIRECT), which eliminates the need for a separate read_page cache. This change reduces the amount of memory consumed by Squashfs.
- The statx() system call implements the STATX_DIO_READ_ALIGN flag to determine the required alignment for file read operations.
- The Bcachefs file system has an updated and stabilized on-disk format. Any further format changes will be implemented as optional add-ons. The speed of file system integrity checking has been significantly increased. In addition, read-only mode has been improved; use-after-free memory access issues have been eliminated; issues with reflink pointers in fsck have been resolved; and transaction restart handling has been fixed.
- The md-linear module, designed to combine block devices, has been returned. This module was previously declared obsolete and removed from the 6.8 kernel, but as it turned out, it was in demand and therefore has now been restored.
- The F2FS and SQUASHFS file systems have been converted to use page folios.
- The OCFS2 and DLMFS file systems have been migrated to use the new partition mounting API.
- The null_blk driver implements the "rotational" attribute, which is exposed via configfs and allows simulating work with a device based on rotating disks to simplify testing of kernel functions.
- The Device mapper system and the dm-mirror, dm-io, dm-table, dm-linear, dm-stripe, and dm-raid1 modules support atomic writes.
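The reflink operation mentioned in the XFS item above is available to applications through the FICLONE ioctl. Below is a rough sketch of a "clone if possible, otherwise copy" helper. The helper name clone_or_copy is made up for the example, and the numeric FICLONE value is the one used on x86-64 and arm64; other architectures may encode it differently.

```python
import fcntl
import shutil

# FICLONE from linux/fs.h, i.e. _IOW(0x94, 9, int).
# Assumption: this numeric value applies to x86-64 and arm64.
FICLONE = 0x40049409


def clone_or_copy(src: str, dst: str) -> bool:
    """Make `dst` a copy of `src`.

    Returns True if the data was shared through a reflink, False if the
    file system does not support cloning and the bytes were copied.
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            # Share the extents of `src` with `dst` without copying data.
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return True
        except OSError:
            # No reflink support here: fall back to a regular copy.
            shutil.copyfileobj(s, d)
            return False
```

On a file system without reflink support, the function simply degrades to a regular copy, so callers get the same result either way.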
- Memory and system services
- The ntsync driver has been integrated into the kernel. It implements the /dev/ntsync character device and a set of synchronization primitives used in the Windows NT kernel. Implementing such primitives at the kernel level can significantly increase the performance of Windows games launched using Wine. The performance gain is achieved by eliminating the overhead associated with using RPC in user space. The creation of a separate driver for the Linux kernel is explained by the problematic nature of correctly implementing the NT synchronization API over existing primitives in the kernel.
- Added a new DMEM cgroup controller for separately accounting for device memory regions, such as GPU memory. DMEM allows creating separate cgroups for different GPU-based tasks so that they can run without interfering with each other. By accounting for GPU-mapped memory and driver-used CPU memory in separate cgroups, the new feature addresses the problem of GPU operations being forced to terminate when available memory is exhausted.
- Optimizations have been made to improve the scalability of flushing the TLB (Translation Lookaside Buffer), the cache used to speed up the translation of virtual addresses into physical ones. The optimizations come down to deferring the update of some data structures during context switching, which improves performance in some tests.
- Improved performance of the MGLRU (Multi-Generational LRU) mechanism used to determine which memory pages are in use and which can be pushed to the swap partition.
- The changes from the Rust-for-Linux branch related to using Rust as a second language for developing drivers and kernel modules have been ported (Rust support is not active by default, and does not result in Rust being included in the list of mandatory build dependencies for the kernel). The ability to use the "derive(CoercePointee)" macro in the kernel code has been introduced, allowing the use of smart pointers with trait objects. The kernel now includes Rust bindings for PCI, platforms, Open Firmware, character devices, and some I/O functions. Greg Kroah-Hartman, who is responsible for maintaining the stable branch of the Linux kernel, described the current state as "almost ready to write a real driver in Rust."
- The build scripts include new code for generating symbol versions for loadable modules; it uses debugging information in DWARF format instead of parsing the source code directly. The change makes symbol versioning possible for modules written in Rust. The old implementation remains in the kernel, and the generator is selected via build options.
- For the PowerPC architecture, support for the lazy preemption mode (PREEMPT_LAZY) is implemented, which corresponds to the full preemption mode for realtime tasks (RR/FIFO/DEADLINE), but delays the preemption of normal tasks (SCHED_NORMAL) until the tick boundary.
- The performance profiling subsystem "perf" has been updated to support AMD processor power consumption counters. The ability to work on systems with up to 2048 CPU cores has been added.
- The pid_max sysctl parameter has been made available for use with process ID namespaces. The pid_max parameter is intended to limit the maximum value of process IDs (PIDs), and can now be used to limit the maximum number of processes that can run in a given namespace. The parameter is processed hierarchically, meaning that restrictions in outer namespaces propagate to nested namespaces.
- When using the execveat system call to start a process, the /proc filesystem will now display the name of the file being run, rather than the file descriptor number.
- A mountinfo utility has been added to the kernel source code (in the samples/vfs directory), demonstrating the use of the statmount() and listmount() system calls.
- The BPF subsystem introduces new functions bpf_local_irq_save() and bpf_local_irq_restore() to temporarily disable interrupts on the local CPU. The functions can be used to implement structures whose processing is not suspended by interrupts.
- In the madvise() system call, when the MADV_DONTNEED and MADV_FREE flags are used, the page tables associated with the address range being released are now freed as well, since in some situations empty page tables can occupy quite a lot of memory.
- The OpenRISC architecture now supports the restartable sequences (rseq) mechanism, designed for fast atomic execution of operations: if a sequence is interrupted by another thread, its partial work is discarded and the operation is retried.
- The code implementing the CRC32 and CRC-T10DIF algorithms has been reorganized: it no longer goes through the crypto subsystem and is called directly via the library interface. The change made it possible to simplify the code and increase its efficiency.
- The io_uring asynchronous I/O system has been updated to include an interface for passing additional integrity metadata when performing read and write operations.
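The hierarchical handling of pid_max described in the list above can be pictured with a small model. This is only a sketch under a simplifying assumption: the effective limit for a process is taken to be the smallest pid_max among its own namespace and all enclosing ones. The function names are invented for the illustration; only the /proc/sys/kernel/pid_max path is an existing kernel interface.

```python
from typing import Sequence


def effective_pid_max(limits_outer_to_inner: Sequence[int]) -> int:
    """Model of hierarchical pid_max handling.

    `limits_outer_to_inner` holds the pid_max value of every PID namespace
    on the path from the initial namespace to the innermost one.
    Assumption: a limit set in an outer namespace also constrains all
    nested namespaces, so the effective limit is the minimum of the chain.
    """
    if not limits_outer_to_inner:
        raise ValueError("at least one namespace limit is required")
    return min(limits_outer_to_inner)


def read_pid_max(path: str = "/proc/sys/kernel/pid_max") -> int:
    """Read pid_max as seen from the current PID namespace."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    # A nested namespace asks for 500, but its parent allows only 400.
    print(effective_pid_max([4194304, 400, 500]))
```

Raising pid_max inside a container therefore cannot lift a stricter limit imposed by the host or by an enclosing container.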
- Virtualization and Security
- The AT_EXECVE_CHECK flag has been added to the execveat system call, which allows checking the permissibility of file execution without actually running it, but taking into account security policies, access rights, and active LSM modules. The securebit flags SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE have been proposed for use in combination with AT_EXECVE_CHECK, which can be used to restrict the execution of files with scripts in interpreted programming languages. The SECBIT_EXEC_RESTRICT_FILE flag instructs linkers and interpreters to use the AT_EXECVE_CHECK option to check the permissibility of execution, and the SECBIT_EXEC_DENY_INTERACTIVE flag prohibits processing interactive commands. The main idea of the change is the ability to apply security policies not only to traditional executable files, but also to text files with scripts that can be loaded by running the interpreter (i.e. the ban on execution can be implemented not only when running "./script.sh" but also when running in the form of "sh script.sh").
- On x86 systems, support for secure time counters for guest systems has been implemented, preventing the guest system clock from being modified from the host environment. This feature is based on the AMD SEV (Secure Encrypted Virtualization) mechanism, used in virtualization systems to protect virtual machines from interference by the hypervisor or host system administrator.
- The SELinux mandatory access control system has been updated to support xperm rules, which allow SELinux policies to be bound to specific ioctl() calls or netlink messages.
- Kernel modules are now digitally signed with the SHA512 algorithm by default instead of SHA1.
- The VirtualBox guest drivers now include support for the ARM64 architecture.
- Work continued in the KVM hypervisor on using the Intel TDX (Trust Domain Extensions) mechanism for encrypting guest system memory.
- Added support for error recovery mode to virtio_blk.
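The interplay between AT_EXECVE_CHECK and the two securebits described in the list above is easiest to see from the interpreter's side. The sketch below is a simplified model of the decisions an interpreter is expected to make, not kernel or interpreter code: ExecPolicy and both functions are hypothetical names, and the execve_check_passed argument stands for the outcome of an execveat() call made with the AT_EXECVE_CHECK flag on the script file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecPolicy:
    """Hypothetical model of the two securebits described above."""
    restrict_file: bool      # models SECBIT_EXEC_RESTRICT_FILE
    deny_interactive: bool   # models SECBIT_EXEC_DENY_INTERACTIVE


def may_interpret_file(policy: ExecPolicy, execve_check_passed: bool) -> bool:
    """Decide whether an interpreter should run a script file.

    `execve_check_passed` stands for the result of asking the kernel,
    via execveat() with AT_EXECVE_CHECK, whether the file may be executed.
    """
    if policy.restrict_file:
        # The interpreter must honour the kernel's verdict.
        return execve_check_passed
    # Without the restriction, scripts are interpreted as before.
    return True


def may_accept_interactive(policy: ExecPolicy) -> bool:
    """Decide whether interactive commands may be processed."""
    return not policy.deny_interactive


if __name__ == "__main__":
    locked_down = ExecPolicy(restrict_file=True, deny_interactive=True)
    # "sh script.sh" is refused when the script itself is not executable
    # under the active policy, just like "./script.sh" would be.
    print(may_interpret_file(locked_down, execve_check_passed=False))
    print(may_accept_interactive(locked_down))
```

The model makes the main point of the change explicit: with both bits set, running a script through its interpreter is subject to the same policy as executing it directly.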
- Network subsystem
- The RxRPC protocol implementation introduces the ability to use larger UDP frames to increase throughput.
- For TCP, support for the RACK-TLP algorithm for detecting packet loss has been added.
- Added a new sysctl parameter tcp_tw_reuse_delay, which operates on a network namespace basis and allows you to specify a delay before the system can reuse a network port number after a TCP socket is closed.
- Added the ability to select a Precision Time Protocol (PTP) provider to generate timestamps at the PHY and MAC levels.
- For IPsec, support for the mechanism for aggregation and fragmentation of encapsulated IP packets has been implemented - IP-TFS/AGGFRAG (IP Traffic Flow Security/Aggregation and Fragmentation Mode for Encapsulating Security Payload).
- The network sockets system has been updated to support the transmission of priority information (SO_PRIORITY) in the form of control messages (cmsg). The SO_RCVPRIORITY option has been added to network sockets, enabling the transmission of socket priority information in the recvmsg() function.
- Equipment
- Added amdxdna driver for AMD CPUs' integrated NPU (Neural Processing Unit) accelerators based on the XDNA architecture, designed to accelerate operations related to machine learning. The XDNA-based NPU is available in the 7040 and 8040 series of AMD Ryzen processors, AMD Alveo V70 accelerators, and AMD Versal SoCs.
- The i915 driver has been updated to include new GPU IDs, an HDMI initialization failure handler, and improved reliability of GPU engine resets on Haswell and older systems.
- Work continued on the Xe drm (Direct Rendering Manager) driver for GPUs based on the Intel Xe architecture, which is used in Intel Arc family graphics cards and integrated graphics, starting with Tiger Lake processors.
- The Nouveau driver now has the ability to transfer buffers with GSP-RM logs via debugfs.
- The AMDGPU driver now supports the DRM panic mechanism, which displays a "blue screen of death" when a crash occurs. Preparations for supporting the upcoming Radeon RX 9000 series of graphics cards based on the RDNA4 architecture have continued. Support for DCN 3.5, GG 9.5, IH 4.4, PSP 13.x, SMU 13.x, VCN 5.x, JPEG 5.x, GC 12.x, DC FAMS, RAS, and ISP has been updated.
- Added support for the Qualcomm SM6150 (QCS615) platform to the msm (GPU Qualcomm Adreno) DRM driver.
- Added support for the MediaTek MT8188 SoC with the Mali-G57 GPU to the panfrost DRM driver.
- Added support for the Broadcom BCM2712 SoC (Raspberry Pi 5) to the vc4 DRM driver.
- The vfio driver nvgrace-gpu has been updated to support NVIDIA Grace Blackwell 200 chips.
- The kernel now includes a driver for Intel THC (Touch Host Controller), used to interact with touch screens and touchpads on some laptops. Added support for Wacom devices with a PCI interface. Added support for QH Electronics game controllers.
- Added support for ARM boards, SoC and devices: Qualcomm Snapdragon 8 Elite (SM8750), Qualcomm Snapdragon AR2 (SAR2130P), Qualcomm IQ6/IQ8, Snapdragon 425 (MSM8917), Samsung Exynos 9810, Blaize BLZP1600, Microchip SAMA7D65, Renesas R-Car V4H ES3.0, Renesas RZ/G3E. Added support for SoC SpacemiT K1 based on RISC-V architecture.
- The rawmidi and sequencer APIs have been expanded in the ALSA sound subsystem for MIDI 2.0. The API for offloading compression operations to the sound card has been updated to support ASRC (Asynchronous Sample Rate Conversion).
- Added support for the sound systems of Allwinner suniv F1C100s, Awinic AW88083, Realtek ALC5682I-VE, TAS2781, and Focusrite Scarlett 4th Gen 16i16, 18i16 and 18i20 devices. Added support for SteelSeries Arctis 9 wireless headphones.
Source: opennet.ru
