Virtual file systems in Linux: why are they needed and how do they work? Part 2

Hello everyone! We are sharing with you the second part of the article "Virtual file systems in Linux: why are they needed and how do they work?" The first part can be read here. Recall that this series of publications coincides with the launch of a new session of the "Linux Administrator" course, which starts very soon.

How to monitor VFS with eBPF and bcc tools

The easiest way to understand how the kernel manages sysfs files is to watch it in action, and the easiest way to watch on ARM64 or x86_64 is to use eBPF. eBPF (extended Berkeley Packet Filter) consists of a virtual machine running inside the kernel that privileged users can query from the command line. Kernel sources tell the reader what the kernel can do; running the eBPF tools on a booted system shows what the kernel actually does.


Luckily, getting started with eBPF is fairly easy thanks to the bcc tools, which are available as packages in most Linux distributions and are well documented by Brendan Gregg. The bcc tools are Python scripts with small embedded fragments of C code, which means that anyone familiar with both languages can easily modify them. There are about 80 Python scripts in bcc/tools, so a developer or system administrator will most likely find one suited to the problem at hand.
To get at least a rough idea of what VFS is doing on a running system, try vfscount or vfsstat. They will show, for example, that dozens of calls to vfs_open() and its friends happen literally every second.
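For readers who want to try this themselves, a typical session looks roughly like the following. This is a sketch: the install path is distribution-specific (on Debian/Ubuntu the tools are packaged with a -bpfcc suffix), loading a BPF program requires root, and the sample output shape is illustrative, not captured from a real run.

```shell
$ sudo /usr/share/bcc/tools/vfsstat 1
TIME         READ/s  WRITE/s  FSYNC/s   OPEN/s  CREATE/s
18:30:01       1923      456        0      297         0
18:30:02       2014      431        2      310         0
```

Each row is a one-second snapshot of how often the core vfs_* entry points fired across the whole system.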


vfsstat.py is a Python script with embedded C code that simply counts VFS function calls.

Let's take a less trivial example and see what happens when a USB flash drive is inserted into a running system.


With eBPF, you can see what happens in /sys when a USB flash drive is inserted. Here is a simple and a more complex example.

In the first example above, the bcc tool trace.py prints a message whenever the function sysfs_create_files() runs. We see that sysfs_create_files() was launched by a kworker thread in response to the flash drive insertion, but what file was created? The second example illustrates the full power of eBPF. Here trace.py prints the kernel backtrace (the -K option) plus the name of the file created by sysfs_create_files(). The snippet inside the single quotes is C source code, including an easily recognizable format string, which the Python script feeds to an LLVM just-in-time compiler; the compiled code is then executed inside a virtual machine in the kernel. The full signature of sysfs_create_files() must be reproduced in the second command so that the format string can refer to one of its parameters. Mistakes in this fragment of C code result in recognizable C-compiler errors: for example, if the -I option is omitted, the result is "Failed to compile BPF text." Developers familiar with C and Python will find the bcc tools easy to extend and modify.
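Since the screenshots did not survive translation, here is a sketch of what the two invocations might look like. The exact probe syntax, header name, and parameter list are assumptions reconstructed from trace.py's conventions, so treat this as a shape to adapt rather than copy:

```shell
# Simple: report every call to sysfs_create_files() (requires root and bcc)
$ sudo /usr/share/bcc/tools/trace 'sysfs_create_files'

# Complex: print the kernel backtrace (-K) and the created file's name.
# The full kernel signature is repeated so the format string can use ptr;
# -I pulls in the header that defines struct attribute.
$ sudo /usr/share/bcc/tools/trace -K -I 'linux/sysfs.h' \
    'sysfs_create_files(struct kobject *kobj, const struct attribute **ptr) "%s", ptr[0]->name'
```

Omitting the -I flag here is exactly the kind of mistake that produces the "Failed to compile BPF text" error mentioned above.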

When the USB stick is inserted, the kernel backtrace shows that PID 7711 is a kworker thread that created the file "events" in sysfs. Correspondingly, a call to sysfs_remove_files() shows that removing the drive resulted in deleting the events file, which fits the general concept of reference counting. Meanwhile, watching sysfs_create_link() with eBPF during USB insertion shows that at least 48 symbolic links are created.

So what is the point of the events file? Using cscope to search for __device_add_disk() shows that it calls disk_add_events(), and either "media_change" or "eject_request" can be written to the events file. Here, the kernel's block layer is informing userspace that a "disk" has appeared or been ejected. Notice how informative this method of investigating, by inserting a USB drive and watching, is compared to trying to figure out how things work purely from source code.
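If no cscope index is at hand, plain grep over a kernel source tree answers the same question; the exact file that defines disk_add_events() varies between kernel versions (block/genhd.c in older trees, block/disk-events.c in newer ones), so search the whole block/ directory:

```shell
# Inside an unpacked kernel source tree (path is illustrative):
$ grep -rn "disk_add_events" block/
$ grep -rn "media_change" block/
```

The second search leads to the strings that the block layer writes to the sysfs events file.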

Read-only root filesystems enable embedded devices

Of course, no one shuts down a server or a desktop by pulling the plug out of the socket. Why? Because mounted filesystems on physical storage devices may have pending write-backs, and the data structures that record their state may be out of sync with what is actually on storage. When this happens, system owners have to wait for the fsck filesystem-recovery utility to run at the next boot and, in the worst case, actually lose data.

However, we all know that many IoT devices, as well as routers, thermostats, and cars, now run Linux. Many of these devices have little or no user interface, and there is no way to shut them down "cleanly". Imagine jump-starting a car with a dead battery, where power to the Linux-running control unit constantly bounces up and down. How is it that the system boots without a long fsck when the engine finally starts running? The answer is simple: embedded devices rely on a read-only root filesystem (ro-rootfs for short).

A ro-rootfs offers many benefits that are less obvious than incorruptibility. One is that malware cannot write to /usr or /lib if no Linux process can write there. Another is that a largely immutable filesystem is critical to field support of remote devices, since support personnel rely on local systems that are nominally identical to those in the field. Perhaps the most important (but also the most insidious) benefit is that ro-rootfs forces developers to decide, at the design stage, which system objects will be immutable. Working with a ro-rootfs can be inconvenient and painful, as const variables in programming languages often are, but the benefits easily repay the extra overhead.

Creating a read-only rootfs does require some extra effort from embedded developers, and this is where VFS comes in. Linux requires that files in /var be writable, and in addition many popular applications that run on embedded systems try to create configuration dot-files in $HOME. One solution for configuration files in the home directory is typically to pre-generate them and build them into the rootfs. For /var, one possible approach is to mount it on a separate writable partition while / itself is mounted read-only. Another popular alternative is to use bind or overlay mounts.
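As a sketch, the fstab of such a device might combine a read-only root with a RAM-backed /var along these lines (device name and tmpfs size are illustrative, not taken from any real product):

```
# ro-rootfs sketch: / is mounted read-only, /var lives in RAM
/dev/mmcblk0p2  /     ext4   ro,noatime         0  1
tmpfs           /var  tmpfs  size=16m,mode=755  0  0
```

Everything written under /var vanishes at power-off, which is precisely the point: the flash device underneath is never written to.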

Bind and overlay mounts and their use by containers

Running man mount is the best way to learn about bind and overlay mounts, which give developers and system administrators the ability to create a filesystem at one path and then expose it to applications at another. For embedded systems, this means the ability to store files in /var on a read-only flash device, but to bind- or overlay-mount a path from tmpfs onto /var at boot, so that applications can scrawl notes there. At the next power-on, the changes in /var will be gone. An overlay mount creates a union between a tmpfs and the underlying filesystem and allows apparent modification of existing files in a ro-rootfs, while a bind mount can make new, empty tmpfs directories appear writable at ro-rootfs paths. While overlayfs is a proper filesystem type, a bind mount is implemented by the VFS namespace machinery.
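The two techniques can be sketched as follows. Every mount here needs CAP_SYS_ADMIN (i.e., root), the paths are illustrative, and the two approaches are alternatives, not steps to run together:

```shell
# Prepare a RAM-backed scratch area
mkdir -p /mnt/scratch && mount -t tmpfs tmpfs /mnt/scratch
mkdir -p /mnt/scratch/bindvar /mnt/scratch/upper /mnt/scratch/work

# Variant 1, bind mount: an empty tmpfs directory simply appears
# at the read-only path /var; new files can be created there.
mount --bind /mnt/scratch/bindvar /var

# Variant 2, overlay mount: files already present in the read-only
# lower layer appear modifiable; writes land in the tmpfs upper layer.
mount -t overlay overlay \
      -o lowerdir=/var,upperdir=/mnt/scratch/upper,workdir=/mnt/scratch/work \
      /var
```

The overlay variant is what you want when applications expect to edit files that shipped in the rootfs; the bind variant suffices when they only need somewhere to write.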

Based on this description of overlay and bind mounts, no one will be surprised that Linux containers make heavy use of them. Let's watch what happens when we use systemd-nspawn to run a container, observing it with the mountsnoop tool from bcc.

Invoking systemd-nspawn starts the container while mountsnoop.py is running.

Let's see what happened:

The output of mountsnoop while the container is "booting" shows that the container runtime relies heavily on bind mounts (only the beginning of the long output is shown).

Here, systemd-nspawn exposes selected files from the host's procfs and sysfs to the container at paths within its rootfs. Besides the MS_BIND flag that establishes a bind mount, some of the other flags on the mount call determine the relationship between changes in the host namespace and changes in the container namespace. For example, a bind mount can either propagate changes in /proc and /sys into the container or hide them, depending on the invocation.

Conclusion

Understanding Linux internals can seem like an impossible task: the kernel itself contains an enormous amount of code, leaving aside Linux userspace applications and the system-call interfaces in C libraries such as glibc. One way to make progress is to read the source code of a single kernel subsystem, with an emphasis on understanding its system calls and userspace-facing headers as well as its major internal interfaces, such as the file_operations table. File operations are what make "everything is a file" actually work, so getting a handle on them is especially satisfying. The kernel C source files in the top-level fs/ directory implement the virtual filesystem, the wrapper layer that provides broad and relatively simple compatibility between popular filesystems and storage devices. Bind and overlay mounts via Linux namespaces are the VFS magic that makes read-only containers and root filesystems possible. In combination with studying the source code, the eBPF kernel facility and its bcc interface make exploring the kernel easier than ever.
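A tiny illustration of "everything is a file": files under /proc are backed by file_operations but by no storage at all; the kernel synthesizes their contents at read time. This sketch assumes a Linux system with procfs mounted:

```python
# /proc/uptime occupies no disk blocks, yet reads like an ordinary file,
# because the VFS dispatches the read to procfs's file_operations.
with open("/proc/uptime") as f:
    uptime, idle = (float(x) for x in f.read().split())

print(uptime > 0)  # a live system has been up for a positive time
```

The same open()/read() calls work unchanged on ext4, tmpfs, or procfs; the VFS layer is what makes that uniformity possible.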

Friends, write in and tell us: was this article useful for you? Do you have any comments or remarks? And those interested in the "Linux Administrator" course are invited to the Open Day, which will take place on April 18.

The first part.

Source: habr.com
