Hello everyone, here is the second part of the publication "Virtual file systems in Linux: why are they needed and how do they work?" The first part was published separately.
How to monitor VFS with eBPF and bcc tools
The easiest way to understand how the kernel operates on sysfs files is to see it in practice, and the easiest way to watch on ARM64 is to use eBPF. eBPF (short for extended Berkeley Packet Filter) consists of a virtual machine running inside the kernel that privileged users can query from the command line. Kernel sources tell the reader what the kernel can do; running the eBPF tools on a booted system shows what the kernel is actually doing.
Luckily, getting started with eBPF is easy enough with the help of the bcc tools. They are Python scripts with small inserts of C code, which means that anyone familiar with both languages can easily modify them. In bcc/tools there are 80 Python scripts, so a developer or system administrator will most likely find something suitable for the problem at hand.
To get at least a rough idea of what VFS does on a running system, try vfscount or vfsstat. They will show that dozens of calls to vfs_open() and its relatives happen literally every second.
vfsstat.py is a Python script with inserts of C code that simply counts VFS function calls.
For a less trivial example, let's see what happens when we insert a USB flash drive into a computer and the system detects it.
With eBPF, you can see what's going on in /sys when a USB flash drive is inserted. Shown here are a simple example and a complex one.
In the first, simple example shown above, the bcc tool trace.py prints a message whenever sysfs_create_files() is called. We see that sysfs_create_files() was started by a kworker thread in response to the flash drive being inserted, but what file was created? The second example shows the power of eBPF. Here trace.py prints the kernel backtrace (option -K) and the name of the file that sysfs_create_files() created. The snippet inside the single quotes is C code that includes an easily recognizable format string; the Python script hands it to the LLVM just-in-time compiler, which compiles it, and the result is executed in a virtual machine inside the kernel. The full function signature of sysfs_create_files() must be reproduced in the second command so that the format string can refer to one of the parameters. Mistakes in this piece of C code result in recognizable C compiler errors. For example, if the -I option is omitted, you will see "Failed to compile BPF text." Developers who are familiar with C and Python will find the bcc tools easy to extend and change.
When the USB stick is inserted, the kernel backtrace shows that PID 7711 is a kworker thread that created the file "events" in sysfs. Correspondingly, tracing sysfs_remove_files() shows that removing the drive resulted in the deletion of the events file, which is consistent with the general concept of reference counting. At the same time, watching sysfs_create_link() with eBPF while inserting the USB drive shows that at least 48 symbolic links are created.
So what's the point of the events file? Following the use of disk_add_events() in the kernel sources shows that either "media_change" or "eject_request" can be written to the events file. Here, the kernel block layer informs userspace that the "disk" has appeared or been ejected. Note how informative this method of investigation, simply inserting a USB drive, is compared to trying to figure out how things work purely from the source code.
Read-only root filesystems enable embedded devices
Of course, no one turns off a server or desktop computer by pulling the plug out of the socket. But why? Because mounted file systems on physical storage devices may have pending writes, and the data structures that record their state may not be synchronized with what is actually written to storage. When this happens, system owners have to wait at the next boot for the fsck filesystem-recovery utility to run and, in the worst case, lose data.
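How much data is still waiting to be written can be seen in /proc/meminfo, whose Dirty and Writeback lines report the amount in kB. The sketch below parses text in that format; the function name and the sample values are made up for illustration.

```python
def pending_writeback_kb(meminfo_text):
    """Extract the Dirty and Writeback counters (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            # Each line looks like "Dirty:   812 kB"; keep the number only.
            result[name] = int(rest.split()[0])
    return result


# Made-up sample in the /proc/meminfo format, not taken from a real system.
SAMPLE = """MemTotal:       16303428 kB
Dirty:               812 kB
Writeback:             0 kB
WritebackTmp:          0 kB
"""

pending = pending_writeback_kb(SAMPLE)
```

On a live system the same function can be fed the contents of /proc/meminfo; a non-zero Dirty value is data that would be lost if the power were cut at that moment.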
However, we all know that many IoT devices, as well as routers, thermostats, and cars, now run Linux. Many of these devices have little to no user interface, and there is no way to turn them off "cleanly". Imagine jump-starting a car with a dead battery, where power to the control unit comes and goes. How does the system boot without a long fsck when the engine finally starts running? The answer is simple: embedded devices rely on a read-only root filesystem (ro-rootfs for short).
A ro-rootfs offers many benefits that are less obvious than incorruptibility. One advantage is that malware cannot write to /usr or /lib if no Linux process can write there. Another is that a largely immutable file system is critical to field support of remote devices, as support personnel use local systems that are nominally identical to those in the field. Perhaps the most important (but also most subtle) benefit is that a ro-rootfs forces developers to decide at the system design stage which system objects will be immutable. Working with a ro-rootfs can be inconvenient and painful, as is often the case with const variables in programming languages, but its benefits easily outweigh the extra overhead.
Creating a read-only rootfs requires some extra effort from embedded developers, and this is where VFS comes into play. Linux requires that files in /var be writable, and in addition, many popular applications that run on embedded systems will attempt to create configuration dot-files in $HOME. The usual solution for configuration files in the home directory is to pre-generate them and build them into the rootfs. For /var, one possible approach is to mount it on a separate writable partition, while / itself is mounted read-only. Another popular alternative is to use bind or overlay mounts.
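Whether a given layout is in effect can be checked from the mount table. The sketch below assumes the /proc/mounts format (device, mount point, filesystem type, comma-separated options, then two numeric fields); the function name and the sample table are hypothetical.

```python
def mount_is_read_only(mounts_text, mount_point="/"):
    """Return True or False for the last mount at mount_point, None if absent."""
    state = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        # Later entries shadow earlier ones at the same path, so keep the last.
        state = "ro" in fields[3].split(",")
    return state


# Made-up mount table: read-only root, writable tmpfs on /var.
SAMPLE_MOUNTS = """/dev/root / ext4 ro,relatime 0 0
tmpfs /var tmpfs rw,nosuid,nodev 0 0
"""

root_is_ro = mount_is_read_only(SAMPLE_MOUNTS)
var_is_ro = mount_is_read_only(SAMPLE_MOUNTS, "/var")
```

On a device, passing the contents of /proc/mounts instead of the sample shows at a glance whether / really is read-only while /var stays writable.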
Bind and overlay mounts and their use by containers
Running man mount is the best way to learn about bind and overlay mounts, which give developers and system administrators the ability to create a file system in one path and then expose it to applications at another. For embedded systems, this means the ability to store the files in /var on a read-only flash device, while overlay- or bind-mounting a path from tmpfs onto /var at boot allows applications to scrawl notes there. At the next power-on, the changes in /var will be lost. An overlay mount creates a union between the tmpfs and the underlying file system and allows ostensible changes to existing files in the ro-rootfs, whereas a bind mount can make new, empty tmpfs directories show up as writable at ro-rootfs paths. While overlayfs is a proper filesystem type, bind mounts are implemented by the VFS namespace facility.
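The union behaviour of an overlay mount can be pictured with a toy model. The sketch below is not how overlayfs is implemented; it only uses Python's collections.ChainMap, where lookups fall through from the first mapping to the next and writes always land in the first, to mimic a writable tmpfs layer stacked on a read-only rootfs. All paths and contents are invented.

```python
from collections import ChainMap

# Lower layer: stands in for the read-only rootfs (invented paths and contents).
lower = {"/etc/hostname": "device-01\n", "/etc/motd": "welcome\n"}
# Upper layer: stands in for the writable tmpfs, which is empty at every boot.
upper = {}
# Lookups search upper first, then lower; writes always go to upper.
merged = ChainMap(upper, lower)

merged["/etc/motd"] = "edited\n"          # "modify" an existing file
merged["/var/log/app.log"] = "started\n"  # create a new file

changed_view = merged["/etc/motd"]    # the application sees its edit
untouched_lower = lower["/etc/motd"]  # the rootfs copy is unchanged

# A power cycle empties the tmpfs, so the original contents reappear.
upper.clear()
after_reboot = merged["/etc/motd"]
```

The model leaves out what a real overlay needs for deletions and directories, but it shows why the changes are only ostensible: the lower layer is never written to.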
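Based on the description of overlay and bind mounts, no one will be surprised that Linux containers make heavy use of them. Let's watch what happens when systemd-nspawn starts a container, using the mountsnoop tool from bcc.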
A call to systemd-nspawn starts a container while mountsnoop.py is running.
Let's see what happened:
Running mountsnoop during the "boot" of the container shows that the container runtime relies heavily on bind mounts (only the beginning of the long output is shown).
Here systemd-nspawn provides selected files in the host's procfs and sysfs to the container at paths inside its rootfs. Besides the MS_BIND flag that sets up a bind mount, some of the other flags passed to the mount system call determine the relationship between changes in the host namespace and in the container namespace. For example, a bind mount can either propagate changes in /proc and /sys to the container, or hide them, depending on the invocation.
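To make such a flags value easier to read, it can be decoded into names. The sketch below covers only a subset of the MS_* constants, with bit values as defined in the kernel's <linux/mount.h>; the helper name decode_mount_flags is hypothetical and not part of bcc.

```python
# Subset of mount(2) flags; bit positions follow <linux/mount.h>.
MS_FLAGS = {
    "MS_RDONLY": 1 << 0,
    "MS_NOSUID": 1 << 1,
    "MS_NODEV": 1 << 2,
    "MS_NOEXEC": 1 << 3,
    "MS_REMOUNT": 1 << 5,
    "MS_BIND": 1 << 12,
    "MS_REC": 1 << 14,
    "MS_PRIVATE": 1 << 18,  # propagation: changes stay local
    "MS_SLAVE": 1 << 19,    # propagation: receive changes, do not send them
    "MS_SHARED": 1 << 20,   # propagation: changes flow both ways
}


def decode_mount_flags(flags):
    """Return the names of the known bits set in flags, plus any remainder in hex."""
    names = [name for name, bit in MS_FLAGS.items() if flags & bit]
    known = 0
    for bit in MS_FLAGS.values():
        known |= bit
    leftover = flags & ~known
    if leftover:
        # Bits outside this table are reported rather than silently dropped.
        names.append(hex(leftover))
    return names
```

For instance, a recursive bind mount combines MS_BIND with MS_REC, while a later call with MS_SLAVE or MS_PRIVATE changes how mount events propagate between the host and the container.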
Conclusion
Understanding Linux internals can seem like an impossible task, as the kernel itself contains a huge amount of code, leaving aside Linux user-space applications and the system call interfaces in C libraries such as glibc. One way to make progress is to read the source code of one kernel subsystem, with an emphasis on understanding the system calls and headers that face user space, as well as the main internal kernel interfaces, such as the file_operations table. File operations are what makes the "everything is a file" principle work, so getting a handle on them is especially satisfying. The kernel C source files in the top-level fs/ directory constitute the implementation of virtual file systems, which are the wrapper layer that provides broad and relatively simple compatibility between popular file systems and storage devices. Bind and overlay mounts via Linux namespaces are the VFS magic that makes containers and read-only root filesystems possible. Combined with studying the source code, the eBPF kernel facility and its bcc interface make probing the kernel easier than ever.
Friends, let us know whether this article was useful to you. Do you have any comments or remarks?
Source: habr.com