Linux kernel 5.14

After two months of development, Linus Torvalds released Linux kernel 5.14. Notable changes include the new quotactl_fd() and memfd_secret() system calls, removal of the ide and raw drivers, a new cgroup I/O priority controller, the SCHED_CORE task scheduling mode, and infrastructure for creating loaders for verified BPF programs.

The new version includes 15883 fixes from 2002 developers; the patch size is 69 MB (the changes affected 12580 files, adding 861501 lines of code and deleting 321654). About 47% of all changes in 5.14 relate to device drivers, approximately 14% to updating architecture-specific code, 13% to the networking stack, 3% to file systems, and 3% to internal kernel subsystems.

Key changes:

  • disk subsystem, I/O and file systems:
    • a new I/O priority controller, rq-qos, has been implemented for cgroups; it controls the processing priority of requests to block devices generated by members of each cgroup. Support for the new priority controller has been added to the mq-deadline I/O scheduler;
    • the ext4 file system has a new ioctl command, EXT4_IOC_CHECKPOINT, which forces all pending journal transactions and their associated buffers to be flushed to disk, and also overwrites the area used by the journal on storage. The change was prepared as part of an initiative to prevent information leaks from file systems;
    • btrfs received performance optimizations: eliminating unnecessary extended attribute logging during fsync increased the performance of intensive operations with extended attributes by up to 17%. In addition, full synchronization is now skipped for truncate operations that do not affect extents, which reduced the operation time by 12%. A sysfs setting has been added to limit I/O bandwidth when checking the file system. ioctl calls have been added to cancel device resize and delete operations;
    • in XFS, the buffer cache implementation has been reworked to allocate memory pages in batches, and cache efficiency has been improved;
    • F2FS gained an option to work in read-only mode and a compressed block caching mode (compress_cache) to improve random read performance. Support has been implemented for compressing files mapped into memory with mmap(). A new mount option, nocompress, has been proposed to selectively disable file compression by mask;
    • work has been done in the exFAT driver to improve compatibility with the storage of some digital cameras;
    • added the quotactl_fd() system call, which manages quotas not through a special device file but through a file descriptor associated with the file system to which the quota applies;
    • the old drivers for block devices with the IDE interface, long since superseded by the libata subsystem, have been removed from the kernel. Support for old devices is fully preserved; the change only removes the option of using the old drivers, under which drives were named /dev/hd* rather than /dev/sd*;
    • the "raw" driver, which provided unbuffered access to block devices via the /dev/raw interface, has been removed from the kernel. Applications have long implemented this functionality with the O_DIRECT flag;
  • memory and system services:
    • the task scheduler has a new scheduling mode, SCHED_CORE, which controls which processes can run together on the same CPU core. Each process can be assigned a cookie identifier that defines the scope of trust between processes (for example, belonging to the same user or container). When scheduling code for execution, the scheduler can ensure that a CPU core is shared only between processes associated with the same owner. This can be used to block some Spectre-class attacks by preventing trusted and untrusted tasks from running at the same time on the SMT (Hyper-Threading) threads of one core;
    • the cgroup mechanism now supports a kill operation, which immediately kills all processes associated with the group (sends SIGKILL) when "1" is written to the virtual cgroup.kill file;
    • the options for responding to split locks have been expanded. Split locks occur when an atomic instruction accesses unaligned data in memory that crosses two CPU cache lines. Such locks cause a significant drop in performance, so previously the kernel could forcibly terminate the application that caused the lock. The new release adds the kernel command-line option "split_lock_detect=ratelimit:N", which sets a system-wide limit on the number of split lock operations per second; once the limit is exceeded, any process that causes a split lock is paused for 20 ms instead of being terminated;
    • the cgroup CFS bandwidth controller, which determines how much processor time can be allocated to each cgroup, can now define limits bounded by a given duration, which allows better regulation of latency-sensitive workloads. For example, setting cpu.cfs_quota_us to 50000 and cpu.cfs_period_us to 100000 allows the process group to use 50 ms of CPU time in every 100 ms period;
    • added initial infrastructure for creating BPF program loaders, which will later make it possible to load only BPF programs signed with a trusted digital key;
    • added a new futex operation, FUTEX_LOCK_PI2, which calculates the timeout using a monotonic timer that takes into account the time the system spends in sleep mode;
    • for the RISC-V architecture, added support for large memory pages (Transparent Huge Pages) and for using the KFENCE mechanism to detect memory errors;
    • the madvise() system call, which provides a means to optimize process memory management, gained the MADV_POPULATE_READ and MADV_POPULATE_WRITE flags. They trigger a page fault on every page of a mapping, for read or for write access, without actually performing a read or write (prefault). This can reduce latency while the program runs, because the page fault handler executes up front for all unallocated pages at once instead of waiting for the first real access to each;
    • the KUnit unit testing framework gained support for running tests in a QEMU environment;
    • added new tracers: "osnoise" to monitor application latencies caused by interrupts, and "timerlat" to display detailed information about delays on timer wake-ups;
  • virtualization and security:
    • added the memfd_secret() system call, which creates a private memory area in an isolated address space that is visible only to the owning process, not mapped into other processes, and not directly accessible to the kernel;
    • in the seccomp system call filtering mechanism, when handlers are moved to user space, a single atomic operation can now create a file descriptor for the isolated task and return it while processing a system call. The operation solves the problem of a user-space handler being interrupted when a signal arrives;
    • added a new mechanism for managing resource limits in user namespaces, which binds individual rlimit counters to a user within a user namespace. The change solves the problem of shared resource counters being used when the same user starts processes in different containers;
    • the KVM hypervisor for ARM64 systems can now use the MTE (MemTag, Memory Tagging Extension) extension in guest systems. MTE binds tags to each memory allocation and checks that pointers are used correctly, which blocks the exploitation of vulnerabilities caused by accessing already freed memory blocks, buffer overflows, use before initialization, and use outside the current context;
    • Pointer Authentication, provided by the ARM64 platform, can now be configured separately for kernel and user space. The technology uses specialized ARM64 instructions to verify return addresses with digital signatures stored in the unused upper bits of the pointer itself;
    • user-mode Linux gained support for using PCI device drivers with a virtual PCI bus implemented by the PCI-over-virtio driver;
    • for x86 systems, support for the paravirtualized virtio-iommu device has been added, allowing IOMMU requests such as ATTACH, DETACH, MAP, and UNMAP to be sent over the virtio transport without emulating page tables;
    • for Intel CPUs from the Skylake family through Coffee Lake, Intel TSX (Transactional Synchronization Extensions) is now disabled by default. TSX provides tools for improving the performance of multi-threaded applications by dynamically eliminating unnecessary synchronization operations. The extensions are disabled because of possible ZombieLoad attacks, which exploit information leakage through side channels during the operation of the asynchronous abort mechanism (TAA, TSX Asynchronous Abort);
  • network subsystem:
    • integration of MPTCP (MultiPath TCP) into the kernel has continued. MPTCP is an extension of the TCP protocol that lets a TCP connection deliver packets simultaneously along several routes through different network interfaces bound to different IP addresses. The new release adds a mechanism for setting custom traffic hashing policies for IPv4 and IPv6 (multipath hash policy), which lets user space determine which packet fields, including encapsulated ones, are used when calculating the hash that selects the path for a packet;
    • added support for SOCK_SEQPACKET sockets (ordered and reliable transmission of datagrams) to the virtio virtual transport;
    • the capabilities of the SO_REUSEPORT socket mechanism have been expanded. SO_REUSEPORT allows several listening sockets to bind to one port at once and distributes incoming connections across all of them, which simplifies the creation of multi-threaded server applications. The new version adds a means of transferring control to another socket if the initially selected socket fails while processing a request (this solves the problem of individual connections being lost when services are restarted);
  • hardware:
    • the amdgpu driver gained support for the new AMD Radeon RX 6000 series GPUs codenamed "Beige Goby" (Navi 24) and "Yellow Carp", as well as improved support for Aldebaran (gfx90a) GPUs and Van Gogh APUs. It is now possible to work with multiple eDP panels at the same time. For Renoir APUs, support for encrypted buffers in video memory (TMZ, Trusted Memory Zone) has been implemented. Hot-unplug support for graphics cards has been added. For Radeon RX 6000 (Navi 2x) GPUs and older AMD GPUs, the ASPM (Active State Power Management) power saving mechanism is now enabled by default; previously it was enabled only for Navi 1x, Vega and Polaris GPUs;
    • for AMD chips, added support for shared virtual memory (SVM) based on the HMM (Heterogeneous Memory Management) subsystem, which allows the use of devices with their own memory management units (MMU) that can access main memory. In particular, HMM makes it possible to set up a shared address space between the GPU and the CPU, in which the GPU can access the main memory of the process;
    • added initial support for AMD Smart Shift technology, which dynamically changes CPU and GPU power settings on laptops with an AMD chipset and graphics card to boost performance in gaming, video editing, and 3D rendering;
    • the i915 driver for Intel graphics gained support for Intel Alder Lake P chips;
    • added drm/hyperv driver for Hyper-V virtual graphics adapter;
    • added simpledrm graphics driver that uses the EFI-GOP or VESA framebuffer provided by the UEFI firmware or BIOS for output. The main purpose of the driver is to provide graphical output during the initial boot stages, before a full DRM driver can be used. The driver can also be used as a temporary solution for hardware that does not yet have native DRM drivers;
    • added support for the Raspberry Pi 400 all-in-one computer;
    • added the dell-wmi-privacy driver to support the hardware camera and microphone switches supplied by Dell;
    • for Lenovo laptops, added a WMI interface for changing BIOS settings via sysfs (/sys/class/firmware-attributes/);
    • expanded support for devices with USB4 interface;
    • added support for AmLogic SM1 TOACODEC, Intel AlderLake-M, NXP i.MX8, NXP TFA1, TDF9897, Rockchip RK817, Qualcomm Quinary MI2 and Texas Instruments TAS2505 sound cards and codecs. Improved sound support on HP and ASUS laptops. Added patches to reduce delays before audio playback starts on devices with a USB interface.
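
The cpu.cfs_quota_us and cpu.cfs_period_us example from the CFS bandwidth controller item in the list above comes down to simple arithmetic. The Python sketch below is illustrative only: cpu_fraction and runtime_ms_per_period are hypothetical helper names, not part of any kernel interface, and treating a negative quota as "no limit" is an assumption based on the cgroup v1 convention.

```python
def cpu_fraction(quota_us: int, period_us: int) -> float:
    """Return the share of one CPU a cgroup may use (hypothetical helper)."""
    if period_us <= 0:
        raise ValueError("period_us must be positive")
    if quota_us < 0:
        # Assumption: a negative quota means the group is not throttled.
        return float("inf")
    return quota_us / period_us


def runtime_ms_per_period(quota_us: int) -> float:
    """Convert the per-period quota from microseconds to milliseconds."""
    return quota_us / 1000


if __name__ == "__main__":
    quota_us, period_us = 50000, 100000  # values from the item in the list
    print("CPU share:", cpu_fraction(quota_us, period_us))
    print("runtime per period (ms):", runtime_ms_per_period(quota_us))
```

With a quota of 50000 and a period of 100000, the group may run for 50 ms in each 100 ms period, which is half of one CPU.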
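
Killing a whole group through the cgroup.kill file, mentioned in the list above, is an ordinary file write. A minimal Python sketch, assuming a kernel with cgroup.kill support and sufficient privileges; the helper name kill_cgroup and the path in the comment are hypothetical.

```python
from pathlib import Path


def kill_cgroup(cgroup_dir: str) -> None:
    """Write "1" to the cgroup.kill file inside cgroup_dir (hypothetical helper).

    On a kernel that supports cgroup.kill, this write asks the kernel to
    send SIGKILL to every process in the group.
    """
    kill_file = Path(cgroup_dir) / "cgroup.kill"
    kill_file.write_text("1")


# Example call (hypothetical group path, adjust to an existing cgroup):
# kill_cgroup("/sys/fs/cgroup/mygroup")
```

The helper only performs the write; whether the processes are actually killed depends on the kernel handling the cgroup.kill file.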
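
The basic SO_REUSEPORT mechanism from the network subsystem item, several listening sockets bound to one port, can be sketched in Python as follows. This is an illustration for Linux only, it does not demonstrate the new failover behaviour added in 5.14, and the helper names reuseport_listener and listener_pool are hypothetical.

```python
import socket


def reuseport_listener(host: str, port: int) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEPORT set (Linux)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set before bind() so other sockets can share the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


def listener_pool(host: str, count: int) -> list:
    """Open `count` listeners that all share one kernel-chosen port."""
    first = reuseport_listener(host, 0)  # port 0: let the kernel pick a port
    port = first.getsockname()[1]
    return [first] + [reuseport_listener(host, port) for _ in range(count - 1)]
```

In a server, each worker thread or process would typically own one socket from such a pool and call accept() on it, leaving the kernel to spread incoming connections across the pool.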

Source - opennet.ru.

Source: linux.org.ru