Linux 5.14 kernel release

After two months of development, Linus Torvalds has released the Linux 5.14 kernel. Notable changes include the new quotactl_fd() and memfd_secret() system calls, the removal of the ide and raw drivers, a new cgroup I/O priority controller, the SCHED_CORE task scheduling mode, and infrastructure for creating loaders for verified BPF programs.

The new version received 15883 fixes from 2002 developers; the patch size is 69 MB (the changes affected 12580 files, with 861501 lines of code added and 321654 lines deleted). About 47% of all changes introduced in 5.14 relate to device drivers, approximately 14% to updating code specific to hardware architectures, 13% to the networking stack, 3% to file systems, and 3% to internal kernel subsystems.

Main innovations:

  • Disk Subsystem, I/O and File Systems
    • A new I/O prioritization controller, rq-qos, has been implemented for cgroups, which can control the processing priority of requests to block devices generated by members of each cgroup. Support for a new priority controller has been added to the mq-deadline I/O scheduler.
    • The ext4 file system implements a new ioctl command, EXT4_IOC_CHECKPOINT, which forces all pending transactions from the journal and their associated buffers to be flushed to disk, and overwrites the area used by the journal in storage. The change was prepared as part of an initiative to prevent information leaks from file systems.
    • Performance optimizations have been made to Btrfs: eliminating unnecessary extended attribute logging during fsync has increased the performance of intensive extended attribute operations by up to 17%. In addition, full synchronization is now skipped for truncation operations that do not affect extents, which reduced the operation time by 12%. A setting has been added to sysfs to limit the I/O bandwidth used when checking the file system. ioctl calls have been added to cancel device resize and remove operations.
    • In XFS, the buffer cache implementation has been redesigned to allocate memory pages in batches, and cache efficiency has been improved.
    • F2FS added an option to run in read-only mode and implemented a compressed block caching mode (compress_cache) to improve random read performance. Implemented support for compressing files mapped to memory using the mmap() operation. To selectively disable file compression by mask, a new mount option, nocompress, has been proposed.
    • Work has been done in the exFAT driver to improve compatibility with some digital camera storage.
    • The quotactl_fd() system call has been added, which allows you to manage quotas not through a special device file, but by specifying a file descriptor associated with the file system for which the quota is applied.
    • The old IDE block device drivers, which were replaced by the libata subsystem a long time ago, have been removed from the kernel.
    • The "raw" driver, which provided unbuffered access to block devices via the /dev/raw interface, has been removed from the kernel. Applications have long obtained this functionality by using the O_DIRECT flag.
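As a rough illustration of the O_DIRECT approach that replaces /dev/raw, the Python sketch below reads from a file or block device while bypassing the page cache. The helper names (aligned_length, read_direct) and the 4096-byte block size are assumptions made for this example, not part of the kernel interface; real devices may need a different alignment, and some file systems (tmpfs, for instance) reject O_DIRECT.

```python
import mmap
import os


def aligned_length(length, block=4096):
    """Round length up to the next multiple of block."""
    return -(-length // block) * block


def read_direct(path, length, block=4096):
    """Read up to `length` bytes from the start of `path` using O_DIRECT.

    O_DIRECT generally requires the buffer address, the file offset and the
    transfer size to be aligned; an anonymous mmap gives a page-aligned buffer.
    The block size of 4096 is an assumption for this sketch.
    """
    size = aligned_length(length, block)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        with mmap.mmap(-1, size) as buf:
            got = os.readv(fd, [buf])  # may be short near end of file
            return buf[:min(got, length)]
    finally:
        os.close(fd)
```

Calling read_direct("/dev/sdX", 512) on a real block device usually needs root privileges; the device path here is a placeholder.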
  • Memory and system services
    • The task scheduler has a new scheduling mode, SCHED_CORE, that allows you to control which processes can run together on the same CPU core. Each process can be assigned a cookie identifier that defines the scope of trust between processes (for example, belonging to the same user or container). When scheduling code for execution, the scheduler can ensure that only processes associated with the same owner share a single CPU core, which can be used to block some Spectre attacks by preventing trusted and untrusted tasks from running on the same SMT (Hyper-Threading) core.
    • For cgroup, support for the kill operation is implemented, which allows you to immediately kill all processes attached to the group (send SIGKILL) by writing "1" to the virtual cgroup.kill file.
    • Handling of split locks has been extended. A split lock occurs when an atomic instruction operates on unaligned data that crosses two CPU cache lines. Such locks cause a significant drop in performance, so previously the kernel could be set to forcibly terminate the application that caused one. The new release adds the kernel command-line parameter "split_lock_detect=ratelimit:N", which sets a system-wide limit on the number of split-lock operations per second; once the limit is exceeded, any process that triggers a split lock is paused for 20 ms instead of being terminated.
    • The CFS bandwidth controller for cgroups, which determines how much processor time can be allocated to each cgroup, can now define limits that apply for a bounded duration, which allows better regulation of latency-sensitive workloads. For example, setting cpu.cfs_quota_us to 50000 and cpu.cfs_period_us to 100000 allows the process group to use 50 ms of CPU time in every 100 ms period.
    • An initial infrastructure for creating BPF program loaders has been added, which will later allow loading only BPF programs signed with a trustworthy digital key.
    • A new futex operation, FUTEX_LOCK_PI2, has been added that uses a monotonic timer to calculate a timeout that takes into account the time the system has spent in sleep mode.
    • For the RISC-V architecture, support for large memory pages (Transparent Huge-Pages) and the ability to use the KFENCE mechanism to detect errors when working with memory are implemented.
    • The MADV_POPULATE_READ and MADV_POPULATE_WRITE flags have been added to the madvise() system call, which provides a means of optimizing process memory management. The flags trigger a page fault on every page of a mapping, for read or for write access, without actually performing a read or write (prefault). This can reduce delays during program execution by running the page fault handler for all unallocated pages up front, without waiting for them to actually be accessed.
    • Support for running tests in the QEMU environment has been added to the kunit unit testing system.
    • New tracers have been added: "osnoise" to monitor application delays caused by interrupt handling, and "timerlat" to display detailed information about delays on timer wake-ups.
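The prefault behaviour of the new madvise() flags described above can be sketched in Python. The numeric flag values (22 and 23) are taken from the generic Linux UAPI header and are an assumption here, since some architectures may define them differently and Python's mmap module may not export the constants; the prefault helper name is also hypothetical. On kernels older than 5.14 the call fails, and the sketch falls back to touching one byte per page.

```python
import mmap

# Assumed values from the generic Linux UAPI header (not exported by
# Python's mmap module on all versions).
MADV_POPULATE_READ = 22
MADV_POPULATE_WRITE = 23


def prefault(region, writable=True):
    """Populate all pages of an mmap object.

    Returns True if the kernel handled the request via madvise(),
    False if the manual page-touching fallback was used instead.
    """
    advice = MADV_POPULATE_WRITE if writable else MADV_POPULATE_READ
    try:
        region.madvise(advice)
        return True
    except (OSError, AttributeError):
        # Older kernel or platform without madvise(): touch each page.
        for offset in range(0, len(region), mmap.PAGESIZE):
            if writable:
                region[offset] = region[offset]  # write fault
            else:
                region[offset]  # read fault
        return False
```

A typical use would be to call prefault() on a freshly created anonymous mapping before entering a latency-sensitive loop.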
  • Virtualization and Security
    • The memfd_secret() system call has been added to create a private memory area in an isolated address space that is visible only to the owner process, not reflected to other processes, and not directly accessible to the kernel.
    • When system call handlers are moved to user space, the seccomp system call filtering mechanism can now use a single atomic operation to create a file descriptor in the isolated task and return it as the result of the system call being processed. The new operation solves the problem of the user-space handler being interrupted when a signal arrives.
    • A new mechanism has been added to manage resource limits in the user ID namespace, which binds individual rlimit counters to a user in "user namespace". The change solves the problem with the use of common resource counters when the same user starts processes in different containers.
    • In the KVM hypervisor for ARM64 systems, the ability to use the MTE (MemTag, Memory Tagging Extension) extension in guest systems has been added. MTE allows you to bind tags to each memory allocation operation and check that pointers are used correctly, in order to block the exploitation of vulnerabilities caused by access to already freed memory blocks, buffer overflows, use before initialization, and use outside the current context.
    • Pointer Authentication provided by the ARM64 platform can now be configured separately for kernel and user space. The technology allows the use of specialized ARM64 instructions to verify return addresses using digital signatures that are stored in the unused upper bits of the pointer itself.
    • User-mode Linux adds support for using drivers for PCI devices with a virtual PCI bus implemented by the PCI-over-virtio driver.
    • For x86 systems, support has been added for the virtio-iommu paravirtualized device, which allows IOMMU requests such as ATTACH, DETACH, MAP, and UNMAP to be sent over the virtio transport without emulating page tables.
    • For Intel CPUs from the Skylake family through Coffee Lake, Intel TSX (Transactional Synchronization Extensions), which improves the performance of multi-threaded applications by dynamically eliminating unnecessary synchronization operations, is now disabled by default. The extensions are disabled because of possible Zombieload attacks, which leak information through side channels that arise during the operation of the TSX Asynchronous Abort (TAA) mechanism.
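For readers who want to try the memfd_secret() call mentioned above, here is a minimal Python sketch that invokes it through the generic syscall() wrapper in libc. The syscall number 447 is an assumption that holds for x86-64 and arm64 as far as I know; other architectures may differ. The call can also fail if the running kernel is older than 5.14 or has the feature disabled, in which case the helper returns None.

```python
import ctypes
import sys

# Assumption: syscall number of memfd_secret on x86-64 and arm64.
SYS_MEMFD_SECRET = 447


def memfd_secret(flags=0):
    """Return a file descriptor for a secret memory area, or None.

    None is returned when the platform is not Linux or the kernel
    rejects the call (unsupported, disabled, or filtered).
    """
    if not sys.platform.startswith("linux"):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    fd = libc.syscall(ctypes.c_long(SYS_MEMFD_SECRET), ctypes.c_uint(flags))
    if fd < 0:
        return None
    return int(fd)
```

If a descriptor is returned, the usual next steps are os.ftruncate() to set the size and mmap.mmap() to map the area into the process.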
  • Network subsystem
    • Integration of MPTCP (MultiPath TCP) into the kernel has continued. MPTCP is an extension of the TCP protocol that lets a single TCP connection deliver packets simultaneously along several routes through different network interfaces bound to different IP addresses. The new release adds a mechanism for setting custom traffic hashing policies for IPv4 and IPv6 (multipath hash policy), which lets user space choose which packet fields, including encapsulated ones, are used to calculate the hash that determines the path chosen for a packet.
    • Support for SOCK_SEQPACKET sockets (ordered and reliable datagram transmission) has been added to the virtio virtual transport.
    • The SO_REUSEPORT socket mechanism has been extended. It allows several listening sockets to bind to the same port and distributes incoming connections across all sockets bound with SO_REUSEPORT, which simplifies the creation of multi-threaded server applications. The new version adds the means to hand a request over to another socket if the initially selected socket fails while processing it (this solves the problem of individual connections being lost when services are restarted).
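The base SO_REUSEPORT mechanism referred to above can be demonstrated with a short Python sketch in which several listening sockets are bound to the same port. The function name reuseport_listeners and the loopback address are choices made for this example; the sketch shows only the long-standing port-sharing behaviour, not the new hand-over logic added in 5.14.

```python
import socket


def reuseport_listeners(count, host="127.0.0.1", port=0):
    """Create `count` TCP listening sockets that share one port.

    With port=0 the first socket receives an ephemeral port from the
    kernel and the remaining sockets bind to that same port.
    """
    listeners = []
    for _ in range(count):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Must be set before bind() on every socket sharing the port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        sock.listen()
        port = sock.getsockname()[1]
        listeners.append(sock)
    return listeners
```

In a real server each listener would typically be handed to its own thread or process, and the kernel would spread incoming connections among them.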
  • Equipment
    • The amdgpu driver adds support for the new AMD Radeon RX 6000 series of GPUs codenamed "Beige Goby" (Navi 24) and "Yellow Carp", as well as improved support for Aldebaran (gfx90a) GPUs and Van Gogh APUs. Added the ability to work with multiple eDP panels at the same time. For APU Renoir, support for working with encrypted buffers in video memory (TMZ, Trusted Memory Zone) has been implemented. Added hot-unplug support for graphics cards. For Radeon RX 6000 (Navi 2x) GPUs and older AMD GPUs, support for the ASPM (Active State Power Management) power saving mechanism is enabled by default, which was previously only enabled for Navi 1x, Vega and Polaris GPUs.
    • For AMD chips, support for shared virtual memory (SVM) has been added based on the HMM (Heterogeneous Memory Management) subsystem, which allows the use of devices that have their own memory management units (MMUs) and can access main memory. In particular, HMM makes it possible to set up a shared address space between the GPU and the CPU, in which the GPU can access the main memory of the process.
    • Added initial support for AMD Smart Shift technology, which dynamically changes CPU and GPU power settings on laptops with an AMD chipset and graphics card to boost gaming, video editing, and 3D rendering performance.
    • The i915 driver for Intel graphics cards includes support for Intel Alder Lake P chips.
    • Added drm/hyperv driver for Hyper-V virtual graphics adapter.
    • Added support for the Raspberry Pi 400 all-in-one computer.
    • Added the dell-wmi-privacy driver to support Dell-supplied camera and microphone hardware switches.
    • For Lenovo laptops, a WMI interface has been added to change BIOS settings via sysfs /sys/class/firmware-attributes/.
    • Extended support for USB4 devices.
    • Added support for AmLogic SM1 TOACODEC, Intel AlderLake-M, NXP i.MX8, NXP TFA1, TDF9897, Rockchip RK817, Qualcomm Quinary MI2 and Texas Instruments TAS2505 sound cards and codecs. Improved sound support on HP and ASUS laptops. Added patches to reduce delays before audio playback starts on USB devices.
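The sysfs interface for BIOS settings mentioned in the Lenovo item can be explored with a small read-only Python sketch. Only the /sys/class/firmware-attributes/ path comes from the text above; the directory layout below it (one directory per driver, an "attributes" subdirectory, and a "current_value" file per setting) is an assumption based on the kernel's firmware-attributes class convention, and the function name is hypothetical.

```python
from pathlib import Path


def read_firmware_attributes(root="/sys/class/firmware-attributes"):
    """Return a dict mapping setting names to their current values.

    Assumed layout: <root>/<driver>/attributes/<setting>/current_value.
    An empty dict is returned if the class directory does not exist.
    """
    settings = {}
    base = Path(root)
    if not base.is_dir():
        return settings
    for value_file in sorted(base.glob("*/attributes/*/current_value")):
        try:
            settings[value_file.parent.name] = value_file.read_text().strip()
        except OSError:
            # Some attributes may be unreadable without root privileges.
            continue
    return settings
```

Changing a setting is done by writing to the same files, which normally requires root and may need a firmware password, so the sketch deliberately stays read-only.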

Source: opennet.ru
