Second interview with Eduard Shishkin, developer of the Reiser4 file system

The second interview with Eduard Shishkin, the developer of the Reiser4 file system, has been published.

To begin with, please remind readers where you work and what your position is.

I work as a Principal Storage Architect at Huawei Technologies, German Research Center. In the virtualization department I deal with various aspects of data storage. My work is not tied to any specific operating system.

Are you currently committing to the main kernel branch?

Very rarely, and only if my employer requires it. The last time was about three years ago, when I sent patches to increase throughput for storage shared on hosts over the 9p protocol (another name for this business is VirtFS). One important remark has to be made here: although I have worked alongside Linux for a long time, I have never been a fan of it; I "breathe evenly" towards it, as towards everything else. In particular, if I notice a flaw, I will point it out at most once. Chasing after someone afterwards and trying to persuade them is something that will not happen.

I remember that last time, ten years ago, you were quite critical of the style of kernel development. Has anything changed from your (or perhaps your corporate) point of view? Has the community become more responsive, or not? If not, who do you think is to blame?

I have not seen any changes for the better. The main problem of the community is the substitution of science with political maneuvering, personal relationships, majority opinion, populism, advice from "inner voices", rotten compromises, anything but science. Computer science, whatever one may say, is first and foremost an exact science. And if someone starts proclaiming, under the "Linux way" flag or any other flag, their own value of 2 × 2 that differs from 4, it is unlikely to bring anything but harm.

All the trouble stems primarily from the incompetence and lack of education of those who make decisions. An incompetent leader is not able to make an objective, adequate decision. If he is also uncultivated, he is not able to find a competent specialist who would give him the right advice. With high probability, the choice will fall on a swindler who says "seemingly the right things". A corrupt environment always forms around incompetent lone leaders. History knows no exceptions to this, and the community is the clearest confirmation of it.

How do you rate the progress in Btrfs development? Has this FS gotten rid of its childhood diseases? How do you position it for yourself: as a file system "for home", or also suitable for corporate use?

It has not. Everything I mentioned 11 years ago is still relevant today. One of the problems with Btrfs that makes it unsuitable for serious needs is the free-space problem. I am not even talking about the fact that the user is invited to run to the store for a new disk in situations where any other file system would show plenty of free space on the partition. The inability to complete an operation on a logical volume due to lack of free space is not the worst thing either. Worst of all is that an unprivileged user can almost always, in a fairly short time, bypass any disk quotas and deprive everyone of free space.

It looks like this (tested on the Linux 5.12 kernel). On a freshly installed system, a script is launched that, in a loop, creates files with certain names in the home directory, writes data to them at certain offsets, and then deletes those files. After a minute of running this script, nothing unusual happens. After five minutes, the share of occupied space on the partition increases slightly. After two or three hours it reaches 50% (from an initial value of 15%). And after five or six hours of work the script fails with the error "no free space left on the partition". After that, you can no longer write even a 4K file to your partition.
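For illustration only, here is a minimal sketch of the kind of loop described above. The interview does not give the concrete file names, sizes, or offsets that trigger the behaviour, so the values below are hypothetical placeholders, not a recipe:

```python
#!/usr/bin/env python3
# Illustrative sketch of the create / write-at-offset / delete loop described
# above. File names, sizes and offsets are made-up placeholders; the exact
# parameters that reproduce the Btrfs degeneration are not given in the text.
import os

HOME = os.path.expanduser("~")

def one_round(i: int) -> None:
    path = os.path.join(HOME, f"victim-{i % 1000}.dat")
    with open(path, "wb") as f:
        f.seek((i * 4096) % (1 << 20))   # write a little data at some offset...
        f.write(b"x" * 64)
        f.flush()
        os.fsync(f.fileno())
    os.unlink(path)                       # ...and delete the file again

i = 0
try:
    while True:                           # net user data stays ~0 the whole time
        one_round(i)
        i += 1
except OSError as e:
    # Per the interview, on Btrfs this eventually ends with ENOSPC even though
    # nothing was actually left stored on the partition.
    print(f"stopped after {i} rounds: {e}")
```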

An interesting situation arises: in the end you have written nothing to the partition, yet all the free space (about 85%) has vanished somewhere. Analysis of a partition subjected to such an attack shows many tree nodes containing only a single item (an object provided with a key) a few bytes in size. That is, the content that previously occupied 15% of the disk space has been evenly "smeared" over the entire partition, so that there is nowhere to write a new file: its key is larger than all existing ones, and there are no free blocks left on the partition.

Moreover, all of this happens on a basic Btrfs configuration (without any snapshots, subvolumes, etc.), and it does not matter how you decide to store file bodies in that FS (as "fragments" in the tree, or as extents of unformatted blocks): the end result will be the same.

You will not manage to subject any other upstream file system to such an attack (whatever anyone tells you). I explained the cause of the problem long ago: it is the complete perversion of the B-tree concept in Btrfs, which makes it possible for the tree to degenerate either spontaneously or deliberately. In particular, under certain workloads your FS will continuously "fall apart" on its own during operation, without outside help. It is clear that all sorts of background processes "rushing to the rescue" will save the day only on individual desktops.

On shared servers, an attacker will always be able to get ahead of them. The system administrator will not even be able to determine who exactly mocked him. The fastest way to fix this problem in Btrfs would be to restore the structure of a regular B-tree, that is, to redesign the disk format and rewrite a significant portion of the Btrfs code. This would take 8-10 years together with debugging, provided the developers strictly follow the original papers on the relevant algorithms and data structures rather than playing "broken telephone", as is accepted (and encouraged) in the "Linux way".

To that you must add the time the developers need to understand all of this, and that is harder. In any case, 10 years was not enough for them to understand it. Until then, you cannot hope for a miracle. It will not arrive in the form of a mount option "we didn't know about", or of a patch that is "just a matter of work" to prepare. For every such hasty "fix" I will present a new degeneration scenario. B-trees are one of my favorite topics, and I must say that these structures do not tolerate liberties!

How do I position Btrfs for myself? As something that categorically cannot be called a file system, let alone used. For, by definition, a file system is the OS subsystem responsible for efficient management of the "disk space" resource, and we do not observe that in the case of Btrfs. Imagine you went to the store to buy a watch so as not to be late for work, and instead of a watch you were sold an electric grill with a timer that runs for at most 30 minutes. With Btrfs the situation is even worse.

Looking through the mailing lists, I often come across the claim that efficient disk space management is no longer relevant because drives are cheap. This is complete nonsense. Without an effective disk space manager, the OS becomes vulnerable and unusable, no matter what capacity the disks in your machine have.

I would like to ask for comment on the end of support for Btrfs in RHEL.

There is nothing special to comment on here; everything is perfectly clear. For them it was a "technology preview", and that "preview" did not pass. You cannot leave that label hanging forever! And you cannot launch a product that is flawed by design with full support. RHEL is an enterprise, that is, contractually spelled-out commodity-money relations. Red Hat cannot mock its users the way they do on the Btrfs mailing list. Just imagine the situation: a client who paid his hard-earned money for a disk, and for your support as well, wants to understand where his disk space went after he wrote nothing to it. What will you answer him?

Further: Red Hat's clients include well-known large banks and exchanges. Imagine what would happen if they were subjected to DoS attacks based on the mentioned vulnerability in Btrfs. Who, do you think, would be responsible? To those who are about to point at the line of the GPL license that says the author is responsible for nothing, I will say right away: put it away! Red Hat would be the one to answer, and in such a way that it would not seem a trifle! But I know that Red Hat is not threatened by this kind of problem, given its exceptionally strong team of QA engineers, with whom I had the opportunity to work closely in my time.

Why do some companies continue to support Btrfs in their enterprise products?

Note that the prefix "enterprise" in a product name does not mean much. Enterprise is a measure of responsibility embodied in the contractual relationship with the client. The only GNU/Linux-based enterprise I know of is RHEL. Everything else, from my point of view, is merely presented as enterprise, but is not. And finally, if there is demand for something, there will always be supply (in our case, the mentioned "support"). There is demand for absolutely everything, including unusable software. How such demand is formed, and who feeds it, is another topic.

So I would not draw any conclusions from the rumor that Facebook has deployed Btrfs on its servers. Moreover, I would recommend keeping the addresses of those servers carefully secret, for the reasons mentioned above.

Why has so much effort been put into polishing the XFS code lately? After all, it is originally a third-party FS, while ext4 has long been stable and inherits from previous stable versions. What is Red Hat's interest in XFS? Does it make sense to actively develop two file systems of similar purpose, ext4 and XFS?

I don't remember what motivated it. It is quite possible that the initiative came from Red Hat customers. I remember that research of this kind was carried out: on some upstream file systems, a huge number of objects were created on high-end drives of the new generation. According to the results, XFS behaved better than ext4, so it began to be promoted as the more promising one. In any case, I would not look for anything sensational here.

To me, they traded an awl for soap, as the saying goes. There is no point in developing ext4 and XFS, whether in parallel or either one of them alone. Nothing good will come of it. Although, in nature there are often situations when there is plenty of potential for growth but no room to grow. In that case, various bizarre, ugly growths appear, at which everyone points a finger ("Oh, look, the things you see in this life!").

Do you consider the question of layering violation settled (in the negative sense) with the arrival of encryption functions in ext4 and F2FS (not to mention RAID in Btrfs)?

In general, the introduction of any layers and the decision not to violate them is usually a matter of policy, and I do not undertake to comment on anything here. The objective aspects of layering violations are of little interest to anyone, but we can consider some of them using the example of a violation "from above", that is, the implementation in the FS of functionality already available at the block layer. Such a "violation" is justified only in rare exceptions. In each such case you must first prove two things: that it is really needed, and that the design of the system will not suffer from it.

For example, mirroring, which has traditionally been an activity of the block layer, makes sense to implement at the file system level, for various reasons. For instance, "silent" data corruption (bit rot) occurs on disk drives: the device works properly, but the data of a block is unexpectedly damaged under the influence of a hard gamma-ray quantum emitted by a distant quasar, etc. Worst of all is if that block turns out to be an FS system block (a superblock, a bitmap block, a node of the storage tree, etc.), because this will certainly lead to a kernel panic.

Note that the mirrors offered by the block layer (so-called RAID-1) will not save you from this problem. Indeed: someone has to verify the checksums and read the replica in case of failure. Furthermore, it makes sense to mirror not everything indiscriminately, but only metadata. Some important data (for example, executable files of critical applications) can be stored as metadata; in that case they receive the same security guarantees. Protection of the rest of the data makes sense to entrust to other subsystems (perhaps even to user applications) - we have provided all the necessary conditions for this.
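As a rough illustration of the read path such a file-system-level mirror implies (a conceptual sketch, not Reiser4 or any real FS code; the helpers are hypothetical), the file system itself verifies the checksum and falls back to the replica only when verification fails:

```python
# Conceptual sketch: a file-system-level metadata mirror. The hypothetical
# helpers read from two file-like devices; zlib.crc32 stands in for whatever
# checksum the FS actually uses.
import zlib

BLOCK_SIZE = 4096

def read_block(dev, nr: int) -> bytes:
    dev.seek(nr * BLOCK_SIZE)
    return dev.read(BLOCK_SIZE)

def verified_read(primary, replica, nr: int, expected_csum: int) -> bytes:
    """Read block `nr`, check its checksum, heal from the replica on mismatch."""
    data = read_block(primary, nr)
    if zlib.crc32(data) == expected_csum:
        return data
    # Checksum failed: read the same block from the replica immediately,
    # instead of waiting for a background "scrub" pass to reach it.
    data = read_block(replica, nr)
    if zlib.crc32(data) != expected_csum:
        raise IOError(f"block {nr}: both copies are corrupted")
    # Optionally the primary copy could be rewritten here to repair it in place.
    return data
```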

Such "economical" mirrors have a right to exist and they can be effectively organized only at the file system level. The rest of the layering violation is littering the subsystem with duplicated code for the sake of some microscopic benefits. A vivid example of this is the implementation of RAID-5 using FS tools. Such solutions (own RAID/LVM in the file system) kills the latter architecturally. It should also be noted here that the layering violation is “put on stream” by various marketing scammers. In the absence of any ideas, functionality that has long been implemented at neighboring levels is added to the subsystems, this is presented as a new extremely useful feature and is actively pushed through.

Reiser4 was accused of violating layers "from below". Based on the fact that this file system is not monolithic, like all the others, but modular, an unsubstantiated assumption was made that it does what the layer above (VFS) is supposed to do.

Can we talk about the death of ReiserFS v3.6 and, say, JFS? Lately almost no attention has been paid to them in the kernel. Are they obsolete?

Here it is necessary to define what the death of a software product means. On the one hand, they are successfully used (they were created for that, after all), which means they live. On the other hand, I cannot speak for JFS (I do not know it well), but ReiserFS (v3) is very difficult to adapt to new trends (tested in practice). This means that developers will continue to pay attention not to it, but to those that are easier to adapt. From that side it turns out that, alas, it is dead in architectural terms. I would not manipulate the notion of "morally obsolete" at all. It applies well, say, to a wardrobe, but not to software products. There is the notion of being inferior or superior in something. I can definitely say that ReiserFS v3 is now inferior to Reiser4 in everything, but on some types of workload it surpasses all other upstream file systems.

Are you aware of the development of Tux3 and of HAMMER/HAMMER2 (the file systems for DragonFly BSD)?

Yes, I am aware of them. In Tux3, I was once interested in the technology of their snapshots (the so-called "version pointers"), but in Reiser4 we will most likely go a different way. I have been thinking about snapshot support for a long time and have not yet decided how to implement it for simple Reiser4 volumes. The fact is that the newfangled technique of "lazy" reference counters, proposed by Ohad Rodeh, only works for B-trees. We do not have them. For the data structures used in Reiser4, "lazy" counters are not defined; to introduce them, certain algorithmic problems have to be solved, which nobody has taken up yet.

As for HAMMER: I read an article by its creator. Not interested. Again, B-trees. This data structure is hopelessly outdated. We abandoned it in the last century.

How do you assess the growing demand for networked cluster file systems such as CephFS/GlusterFS/etc.? Does this demand mean a shift of developers' priorities towards network file systems and insufficient attention to local ones?

Yes, there has been such a shift in priorities. The development of local file systems has stagnated. Alas, doing something significant for local volumes is now quite difficult, and not everyone can do it. Nobody wants to invest in their development. This is about the same as asking a commercial organization to allocate money for mathematical research: you will be asked, without enthusiasm, how you can make money on a new theorem. Now the local FS is something that magically appears "out of the box" and "should always work", and if it doesn't, it provokes unaddressed grumbling along the lines of "what are they even thinking!".

Hence the lack of attention to local file systems, although there is still plenty of work in that area. And yes, everyone has turned to distributed storage built on top of existing local file systems. It is very fashionable now. The phrase "Big Data" gives many people an adrenaline rush, being associated with conferences, workshops, high salaries, etc.

How reasonable, in principle, is the approach in which a network FS is implemented in kernel space rather than in user space?

A very reasonable approach, which has not yet been implemented anywhere. In general, the question of which space a network FS should be implemented in is a "double-edged sword". Well, let's look at an example. A client writes data to a remote machine. The data lands in its page cache as dirty pages. This is the job of a "thin gateway" network FS in kernel space. Sooner or later the operating system will ask for those pages to be written to disk in order to free them. At that point the IO-forwarding (sending) module of the network FS comes into play. It determines which server machine (server node) those pages will go to.

Then the baton is taken over by the network stack (which, as we know, is implemented in kernel space). Next, the server node receives that packet with data or metadata and instructs the backend storage module (that is, the local file system operating in kernel space) to write it all. So we have reduced the question to where the "sending" and "receiving" modules should run. If either of those modules runs in user space, this will inevitably lead to context switches (because of the need to use kernel services). The number of such switches depends on implementation details.

If there are many such switches, storage throughput (I/O performance) will drop. If your backend storage consists of slow disks, you will not notice a significant drop. But if you have fast drives (SSD, NVRAM, etc.), context switching becomes a "bottleneck", and by saving on context switches, performance can be increased significantly. The standard way to do this is to move the modules into kernel space. For example, we found that moving the 9p server from QEMU into the kernel on the host machine resulted in a threefold increase in VirtFS performance.

This, of course, is not a network FS, but it fully reflects the essence of things. The downside of this optimization is portability problems. For some, those may be critical. For example, GlusterFS has no modules in the kernel at all, and thanks to that it now runs on many platforms, including NetBSD.
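To make the context-switch argument concrete, here is a toy model (not tied to any real implementation; stage names and the one-switch-per-boundary-crossing rule are simplifications) of the write path described above, comparing an all-kernel pipeline with one where the sending and receiving modules live in user space:

```python
# Toy model of the write path: each stage is tagged with the space it runs in,
# and every user<->kernel boundary crossing is counted as one context switch.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    space: str  # "kernel" or "user"

def count_switches(path):
    return sum(1 for a, b in zip(path, path[1:]) if a.space != b.space)

# Everything in kernel space (the approach advocated above).
in_kernel = [
    Stage("client page cache", "kernel"),
    Stage("IO-forwarding (sending) module", "kernel"),
    Stage("network stack", "kernel"),
    Stage("receiving module on server node", "kernel"),
    Stage("backend storage (local FS)", "kernel"),
]

# The same path with the sending and receiving modules moved to user space.
with_userspace_modules = [
    Stage("client page cache", "kernel"),
    Stage("IO-forwarding (sending) module", "user"),
    Stage("network stack", "kernel"),
    Stage("receiving module on server node", "user"),
    Stage("backend storage (local FS)", "kernel"),
]

print(count_switches(in_kernel))               # 0
print(count_switches(with_userspace_modules))  # 4
```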

What concepts could local file systems borrow from network ones and vice versa?

Network file systems today are, as a rule, superstructures over local file systems, so I don't quite see how they could borrow anything from the latter. Really, consider a company of four employees where everyone does their own thing: one distributes, another sends, a third receives, a fourth stores. The question of what the company can borrow from the employee who stores sounds somewhat incorrect (it took from him long ago whatever there was to borrow).

But local file systems have a lot to learn from network ones. First, they should learn to aggregate logical volumes at a high level. Now the so-called "advanced" local file systems aggregate logical volumes exclusively using the "virtual devices" technique borrowed from LVM (the same contagious layering violation that was first implemented in ZFS). In other words, virtual addresses (block numbers) are translated into real ones and back at a low level (that is, after the file system has issued an I/O request).

Note that adding devices to and removing them from logical volumes (not mirrors) assembled at the block layer leads to problems that the vendors of such "features" are modestly silent about. I am talking about fragmentation on the real devices, which can reach monstrous values, while on the virtual device everything is fine. However, few people are interested in virtual devices: everyone is interested in what happens on the real ones. But ZFS-like file systems (as well as any file system in conjunction with LVM) work only with virtual disk devices (they allocate virtual disk addresses from among the free ones, defragment those virtual devices, etc.). What happens on the real devices, they have no idea!

Now imagine that on the virtual device you have zero fragmentation (that is, only one giant extent lives there); you add a disk to your logical volume, then remove another, random disk from it, and then rebalance. And you do this many times. It is easy to see that on the virtual device you will still have that same single extent, but on the real devices you will not see anything good.

The worst thing is that you are not even able to fix this situation! The only thing you can do is ask the file system to defragment the virtual device. But it will tell you that everything is wonderful there: there is only one extent, fragmentation is zero, and it cannot get any better! So, logical volumes assembled at the block layer are not intended for repeated addition and removal of devices. Properly speaking, you should assemble a logical volume at the block layer only once, hand it over to the file system, and then do nothing more with it.
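A toy simulation of the effect just described, under simplifying assumptions of my own (made-up device sizes, first-fit physical allocation, a naive "move blocks to the emptiest device" rebalance): the virtual address space stays one contiguous extent by construction, while the count of physically contiguous runs on the real devices grows with every add/remove/rebalance cycle.

```python
# Toy model of a logical volume whose virtual address space is a single
# contiguous extent striped over real devices. All sizes and policies are
# invented for illustration; no real volume manager works exactly like this.
import random

VBLOCKS = 256        # size of the single extent on the virtual device
DEV_CAP = 256        # physical capacity of each real device, in blocks

class Device:
    def __init__(self):
        self.slots = [None] * DEV_CAP          # physical slot -> virtual block

    def used(self):
        return sum(v is not None for v in self.slots)

    def put(self, vblock):
        self.slots[self.slots.index(None)] = vblock   # first-fit allocation

    def take_one(self):
        p = next(i for i, v in enumerate(self.slots) if v is not None)
        v, self.slots[p] = self.slots[p], None
        return v

    def extents(self):
        """Count physically contiguous runs of virtually contiguous blocks."""
        placed = sorted((v, p) for p, v in enumerate(self.slots) if v is not None)
        runs, prev = 0, None
        for v, p in placed:
            if prev is None or v != prev[0] + 1 or p != prev[1] + 1:
                runs += 1
            prev = (v, p)
        return runs

def rebalance(devices):
    target = sum(d.used() for d in devices) // len(devices)
    dst = min(devices, key=Device.used)
    for src in devices:
        while src.used() > target and dst.used() < target:
            dst.put(src.take_one())

random.seed(1)
devices = [Device(), Device()]
for v in range(VBLOCKS):                       # initial layout: no fragmentation
    devices[v * len(devices) // VBLOCKS].put(v)

for _ in range(8):
    devices.append(Device())                   # add a fresh device...
    rebalance(devices)
    victim = devices.pop(random.randrange(len(devices)))   # ...remove a random one
    while victim.used():
        min(devices, key=Device.used).put(victim.take_one())

print("extents on the virtual device:", 1)     # by construction: still one extent
print("extents on the real devices:", [d.extents() for d in devices])
```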

In addition, a bundle of independent FS + LVM subsystems does not allow taking into account the different nature of the drives from which logical volumes are aggregated. Indeed, suppose you have assembled a logical volume from a hard drive and solid-state devices. The former will require defragmentation, while the latter will not; for the latter you need to issue discard requests, but not for the former, and so on. However, it is rather difficult to show such selectivity in that bundle.

Note that if you create your own LVM inside the file system, things do not get much better. Moreover, by doing so you actually put an end to the prospect of ever improving it in the future. This is very bad. Drives of different types can live on the same machine, and if the file system does not distinguish between them, then who will?

Another problem lies in wait for the so-called "write-anywhere" file systems (this includes Reiser4, if you selected the appropriate transactional model at mount time). Such file systems must provide defragmentation power unprecedented in its efficiency. A low-level volume manager does not help here, but only gets in the way. The fact is that with such a manager, your FS keeps a map of free blocks of only one device - the virtual one. Accordingly, you can only defragment the virtual device. This means that your defragmenter will plow away for a long, long time over a single huge space of virtual addresses.

And if you have many users doing random overwrites, the useful effect of such a defragmenter will be reduced to zero. Your system will inevitably begin to slow down, and all you will be able to do is fold your hands before the disappointing diagnosis of "broken design". Several defragmenters running over the same address space will only interfere with one another. It is quite another matter if you maintain your own map of free blocks for each real device. This effectively parallelizes the defragmentation process.

But this can only be done if you have a high-level logical volume manager. Local file systems with such managers did not exist before (at least I am not aware of them); only network file systems (for example, GlusterFS) had them. Another very important example is the volume integrity checker (fsck). If you keep an independent map of free blocks for each subvolume, the procedure for checking a logical volume can be effectively parallelized. In other words, logical volumes with high-level managers scale better.
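A minimal sketch of that difference (device names and the per-device work are purely illustrative): with one free-block map per real device, the check or defragmentation pass can be handed to one worker per device instead of a single pass over the whole virtual address space.

```python
# Conceptual sketch: independent per-device free-block bitmaps allow the
# per-device checks to run in parallel. Device names are made up.
from concurrent.futures import ThreadPoolExecutor

# One free-block bitmap per real device (True = free), instead of a single
# bitmap covering the virtual address space of the whole logical volume.
free_maps = {
    "sda":     [True, False, True, True],
    "sdb":     [False, False, True, True],
    "nvme0n1": [True, True, False, True],
}

def check_device(name, bitmap):
    # Stand-in for fsck/defrag work that touches only this device's map.
    return name, sum(bitmap)

with ThreadPoolExecutor() as pool:
    # The maps are independent, so the workers do not interfere with each other.
    for name, free in pool.map(lambda kv: check_device(*kv), free_maps.items()):
        print(f"{name}: {free} free blocks")
```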

In addition, with low-level volume managers you cannot organize full-fledged snapshots. With LVM and ZFS-like file systems you can only take local snapshots, not global ones. Local snapshots allow you to instantly roll back only ordinary file operations; nobody will roll back operations on the logical volume itself (adding or removing devices). Let's look at an example. At some point in time, when you have a logical volume of two devices A and B containing 100 files, you take a snapshot S of the system and then create another hundred files.

After that you add device C to your volume, and finally you roll your system back to snapshot S. Question: how many files and devices does your logical volume contain after the rollback to S? There will be 100 files, as you guessed, but there will be 3 devices - the same A, B and C, even though at the moment the snapshot was taken there were only two devices in the system (A and B). The operation of adding device C did not roll back, and if you now simply remove device C from the machine, you will corrupt your data. So before removing it you first have to perform an expensive removal of the device from the logical volume with rebalancing, which will scatter all the data from device C over devices A and B. But if your FS supported global snapshots, such rebalancing would not be required, and after an instant rollback to S you could safely remove device C from the machine.

So, global snapshots are good in that they allow you to avoid the costly removal (or addition) of a device with a lot of data from (or to) a logical volume (provided, of course, that you did not forget to "photograph" your system at the right moment). Let me remind you that creating a snapshot and rolling the file system back to it are instant operations. The question may arise: how is it even possible to instantly roll back an operation on a logical volume that took you three days? But it is possible, provided your FS is designed correctly. The idea of such "3D snapshots" came to me three years ago, and last year I patented the technique.
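A toy model of the A/B/C example above (purely illustrative; this is not the patented mechanism, just the bookkeeping difference): a "local" snapshot records only the file set, while a "global" one also records the device set, so rolling back to the global snapshot returns the volume to devices A and B without any rebalancing.

```python
# Toy model of the example above: what a snapshot records determines what a
# rollback can undo. Not an actual Reiser4 data structure.
from dataclasses import dataclass, field

@dataclass
class Volume:
    devices: set = field(default_factory=set)
    files: set = field(default_factory=set)

    def snapshot_local(self):
        return {"files": set(self.files)}                 # device set not recorded

    def snapshot_global(self):
        return {"files": set(self.files), "devices": set(self.devices)}

    def rollback(self, snap):
        self.files = set(snap["files"])
        if "devices" in snap:                             # only a global snapshot
            self.devices = set(snap["devices"])           # undoes volume operations

vol = Volume(devices={"A", "B"}, files={f"f{i}" for i in range(100)})
snap = vol.snapshot_global()                              # take snapshot S
vol.files |= {f"g{i}" for i in range(100)}                # create 100 more files
vol.devices.add("C")                                      # add device C

vol.rollback(snap)
print(len(vol.files), sorted(vol.devices))                # 100 files, ['A', 'B']
# With only a local snapshot, the devices would still be {'A', 'B', 'C'} after
# the rollback, and C could not be pulled out without an expensive rebalance.
```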

The next thing local file systems should learn from network ones is to store metadata on separate devices, in the same way network file systems store it on separate machines (so-called metadata servers). There are applications that work primarily with metadata, and such applications can be greatly accelerated by placing the metadata on expensive, high-performance drives. With the FS + LVM bundle you cannot show such selectivity: LVM does not know what is in the block you handed to it (data or metadata).

You will not gain much from implementing your own low-level LVM inside a file system, compared with the FS + LVM bundle, but what you will certainly manage is to clutter the file system so badly that working with its code later becomes impossible. ZFS and Btrfs, with their enthusiasm for virtual devices, are clear examples of how layering violation kills a system architecturally. So why am I telling you all this? Because there is no need to build your own low-level LVM into the file system. Instead, devices should be aggregated into logical volumes at a high level, as some network file systems do with different machines (storage nodes). True, they do it abominably because of bad algorithms.

Examples of downright terrible algorithms are the DHT translator in GlusterFS and the so-called CRUSH map in Ceph. None of the algorithms I saw satisfied me in terms of simplicity and good scalability. So I had to recall my algebra and invent everything myself. In 2015, while experimenting with bundles of hash functions, I came up with and patented something that suits me. Now I can say that the attempt to put it all into practice has been successful. I do not see any scalability problems in the new approach.

Yes, each subvolume requires a separate superblock-like structure in memory. Is that really so scary? In general, I do not know who is going to "boil the ocean" and create logical volumes out of hundreds of thousands or more devices on a single local machine. If someone can explain that to me, I would be very grateful. In the meantime, to me it is marketing bullshit.

How have changes in the kernel's block device subsystem (for example, the appearance of blk-mq) affected the requirements for file system implementation?

They have not affected them at all. I do not know what would have to happen at the block layer to force the design of a new file system. The interaction interface between these subsystems is very poor. The only way drivers should affect the FS is through the appearance of new types of drives, to which the block layer adapts first and then the FS (for Reiser4 this will mean the appearance of new plugins).

Does the emergence of new types of media (for example, SMR, or the ubiquity of SSDs) pose fundamentally new challenges for file system design?

Yes. And these are normal incentives for FS development. The challenges can be varied and completely unexpected. For example, I have heard of drives where the speed of an I/O operation strongly depends on the size of a piece of data and its offset. On Linux, where the FS block size cannot exceed the page size, such a drive will not show its full capabilities by default. However, if your FS is designed properly, there is a chance to "squeeze" much more out of it.

How many people are currently working with the Reiser4 code besides you?

Fewer than I would like, but I do not experience an acute shortage of resources either. The pace of Reiser4 development suits me perfectly well. I am not going to "drive the horses": this is not the kind of area where that is appropriate. Here, "the slower you go, the farther you'll get!" A modern FS is the most complex subsystem of the kernel, and wrong design decisions in it can undo many subsequent years of people's work.

When offering volunteers something to implement, I always guarantee that the effort will certainly lead to a correct result that will be in demand for serious needs. As you understand, there cannot be many such guarantees at once. At the same time, I cannot stand the "doers" who shamelessly promote the "features" of obviously unusable software, deceiving hundreds of users and developers, and on top of that sit and smile at kernel summits.

Has any company expressed its willingness to support the development of Reiser4?

Yes, there were such offers, including from a major vendor. But for that I would have had to move to another country. Unfortunately, I am no longer 30 years old; I cannot just break loose and leave like that at the first whistle.

What features are missing in Reiser4 now?

The "resize" function is missing for simple volumes, similar to the one in ReiserFS (v3). In addition, file operations with the DIRECT_IO flag would not interfere. Further, one would like to be able to segregate a volume into "semantic subvolumes", which do not have a fixed size, and which can be mounted as independent volumes. These tasks are good for beginners who want to try their hand at the "real business".

And finally, I would like to have network logical volumes with simple implementation and administration (modern algorithms already allow this). But what Reiser4 will definitely never have is RAID-Z, scrubs, free space caches, 128-bit variables and other marketing nonsense that arose against the backdrop of a lack of ideas among the developers of certain file systems.

Can everything that you need be implemented by plugins?

If we speak only in terms of interfaces and the plugins (modules) that implement them, then no, not everything. But if you also introduce relations on those interfaces, then, among other things, you get the notions of higher-order polymorphisms, and with those you can already get by. Imagine that you hypothetically freeze an object-oriented runtime system, change the instruction pointer so that it points to another plugin implementing the same interface X, and then unfreeze the system so that it continues execution.

If the end user does not notice such a "substitution", we say that the system has zero-order polymorphism in the interface X (or that the system is heterogeneous in the interface X, which is the same thing). If you now have not just a set of interfaces but also relations on them (an interface graph), then you can introduce polymorphisms of higher orders, which characterize the heterogeneity of the system in the "neighborhood" of any given interface. I introduced such a classification long ago, but unfortunately it never got published.
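A rough sketch of the zero-order case in plain code (my own illustrative model, not Reiser4's plugin machinery): two plugins implement the same interface X with identical observable behaviour, and the running system is repointed from one to the other without the caller being able to tell.

```python
# Illustrative model of zero-order polymorphism in an interface X: the plugin
# behind X is swapped at run time and the substitution is unobservable, because
# both plugins implement X with the same externally visible behaviour.
from bisect import bisect_left
from typing import Optional, Protocol, Sequence

class SearchPlugin(Protocol):                  # "interface X"
    def find(self, keys: Sequence[int], key: int) -> Optional[int]: ...

class LinearSearch:
    def find(self, keys, key):
        for i, k in enumerate(keys):
            if k == key:
                return i
        return None

class BinarySearch:
    def find(self, keys, key):
        i = bisect_left(keys, key)
        return i if i < len(keys) and keys[i] == key else None

class System:
    def __init__(self, plugin):
        self.plugin = plugin                   # the slot that gets re-pointed

    def lookup(self, keys, key):
        return self.plugin.find(keys, key)     # callers only ever talk to X

keys = [3, 8, 15, 42, 99]
sys_ = System(LinearSearch())
assert sys_.lookup(keys, 42) == 3
# "Freeze" the system, point the slot at another plugin implementing X, resume.
sys_.plugin = BinarySearch()
assert sys_.lookup(keys, 42) == 3              # the swap goes unnoticed
```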

So, with the help of plugins and such higher polymorphisms you can describe any known feature, as well as "predict" features that have never even been mentioned. I have not managed to prove this rigorously, but I do not yet know of a counterexample either. Actually, this question reminded me of Felix Klein's Erlangen Program: at one time he tried to present all of geometry as a branch of algebra (specifically, of group theory).

Now to the main question: how are things going with promoting Reiser4 into the mainline kernel? Have there been any publications on the architecture of this FS, which you mentioned in the last interview? How relevant is this question from your point of view?

In fact, we have been asking for inclusion in the mainline branch for three years. Reiser's last comment in the public thread where the pull request was made went unanswered, so all further questions are not for us. Personally, I do not understand why we need to "merge" into one specific operating system. The world does not revolve around Linux. So there is a separate repository, which will contain several branches/ports for different operating systems. Whoever needs it can clone the corresponding port and do whatever they want with it (within the framework of the license, of course). And if someone does not need it, that is not my problem. At this point, I propose to consider the question of "moving into the mainline Linux kernel" exhausted.

Publications on the FS architecture are relevant, but so far I have only found time for my new results, which I consider a higher priority. Another thing is that I am a mathematician, and in mathematics any publication is a summary of theorems and their proofs; publishing anything without a proof is a sign of bad taste. And if I rigorously prove or disprove any statement about the FS architecture, the result will be such piles of material that it will be quite difficult to wade through them. Who needs that? This is probably why everything remains in its old form: the source code and the comments to it.

What's new in Reiser4 over the past few years?

The long-awaited stability has finally materialized. One of the last to give in was a bug that led to "unremovable" directories. The difficulty was that it showed up only against the background of name hash collisions, and only with a certain arrangement of directory entries in a tree node. However, I still cannot recommend Reiser4 for production: for that, some work needs to be done in active interaction with production system administrators.

We finally managed to implement our long-standing idea of different transactional models. Before that, only one hard-coded MacDonald-Reiser model worked in Reiser4, and this created design problems. In particular, snapshots are impossible in such a transactional model: they would be spoiled by the atom component called OVERWRITE SET. Reiser4 currently supports three transactional models. In one of them (Write-Anywhere), the OVERWRITE SET atom component includes only system pages (images of disk bitmaps, etc.), which are not subject to "photography" (the chicken-and-egg problem).

So snapshots can now be implemented in the best possible way. In another transactional model, all modified pages go only to the OVERWRITE SET (that is, it is essentially pure journaling). This model is for those who complained about fast fragmentation of Reiser4 partitions: in this model your partition will fragment no faster than with ReiserFS (v3). All three existing models, with some reservations, guarantee the atomicity of operations, but models that sacrifice atomicity and preserve only the integrity of the partition could also be useful; they would suit all sorts of applications (databases, etc.) that have already taken on some of these functions themselves. It is very easy to add such models to Reiser4, but I have not done it, because nobody has asked me to and I personally do not need them.

Metadata checksums have appeared, and I recently supplemented them with "economical" mirrors (still unstable material). If the checksum of a block fails verification, Reiser4 immediately reads the corresponding block from the replica device. Note that ZFS and Btrfs cannot do this: their design does not allow it. There you have to run a special background scanning process called "scrub" and wait until it reaches the problematic block. Programmers figuratively call such contrivances "crutches".

And finally, heterogeneous logical volumes have appeared, offering everything that ZFS, Btrfs, the block layer, and FS + LVM bundles cannot give in principle: parallel scaling, an O(1) disk address allocator, and transparent data migration between subvolumes. The latter also has a user interface: you can now easily move the hottest data to the highest-performance drive in your volume.

In addition, it is possible to urgently flush any dirty pages to such a drive, and thereby significantly speed up applications that frequently call fsync(2). I note that the block layer functionality called bcache does not provide any such freedom of action. The new logical volumes are based on my algorithms (there are corresponding patents). The software is already quite stable; you can certainly try it, measure performance, and so on. The only inconvenience is that, for now, you have to update the volume configuration manually and store it somewhere.

So far I have managed to implement about 10 percent of my ideas. However, I did succeed in what I considered the most difficult part: "making friends" between logical volumes and the flush procedure that performs all deferred actions in Reiser4. All of this is still in the experimental "format41" branch.

Does Reiser4 pass xfstests?

At least it passed them for me when I was preparing the last release.

Is it possible, in principle, to make Reiser4 a network (cluster) file system with the help of plugins?

It is possible, and even necessary! If a network file system is created on the basis of a properly designed local file system, the result will be very impressive! In modern network file systems I am not satisfied with the backend storage level, which is implemented with an arbitrary local file system. The existence of this level is completely unjustified: the network file system should interact directly with the block layer, rather than asking the local file system to create additional service files for it!

In general, the very division of file systems into local and network ones is, as they say, from the evil one. It arose from the imperfection of the algorithms in use thirty years ago, in place of which nothing new has been proposed so far. It is also the reason for the appearance of a mass of unnecessary software components (various services, etc.). Properly speaking, there should be only one FS, in the form of a kernel module and a set of user utilities installed on every machine, that is, every cluster node. This FS is both local and network. And nothing more!

If nothing works out with Reiser4 on Linux, would you like to offer the FS to FreeBSD? (A quote from the previous interview: "...FreeBSD...has academic roots... And this means that with a high degree of probability we will find a common language with the developers.")

Well, as we just found out, everything has already worked out fine with Linux: there is a separate working Reiser4 port for it in the form of the master branch of our repository. And I have not forgotten about FreeBSD! Propose it! I am ready to work closely with those who know FreeBSD internals well. By the way, what I really like about their community is that decisions there are made by a regularly renewed council of independent experts, which has nothing to do with the swindle of one permanent person.

How do you rate the Linux user community today? Has it become more "pop" (mainstream)?

By the nature of my work it is quite difficult for me to assess this. Mostly, users come to me with bug reports and requests to fix a partition. Users are users: some are more savvy, some less. I treat everyone the same. Well, if a user ignores my instructions, then excuse me: the ignoring will be reciprocated on my part.

Is it possible to predict the development of file systems for the next five to ten years? What, in your opinion, are the main challenges that FS developers may face?

Yes, such a prediction is easy to make. There has been no development of file systems in upstream for a long time; only the appearance of it is created. The developers of local file systems have run into problems tied to unfortunate design. A reservation must be made here: I do not count so-called code maintenance, "polishing", and porting as development. And I do not count the misunderstanding called "Btrfs" as development either, for the reasons I have already explained.

Each patch only aggravates its problems. Well, and there are always various kinds of "evangelists" for whom "everything works". Mostly these are schoolchildren and students skipping lectures. Just imagine: it works for him, but not for the professor. What an adrenaline rush! The greatest harm, from my point of view, comes from the "craftsmen" who have enthusiastically rushed to "bolt" the miracle features of Btrfs onto all kinds of layers like systemd, docker, etc. This already resembles metastases.

Now let's try to make a forecast for five to ten years. What we will do in Reiser4 I have already briefly listed. The main challenge for upstream developers of local file systems will be (indeed, it already is) the inability to do decent work for a salary. Without any ideas in the area of storage, they will keep trying to patch the poor old VFS, XFS and ext4. Against this background, the situation with VFS looks especially comical, reminiscent of the frenzied modernization of a restaurant in which there are no chefs, and none are expected.

Now the VFS code unconditionally locks several memory pages at a time and invites the underlying FS to operate on them. This was introduced to improve Ext4's performance on deletions, but, as is easy to see, such unconditional locking is completely incompatible with advanced transactional models. That is, you simply cannot add support for some smarter FS to the kernel. I do not know how things stand in other areas of Linux, but as far as file systems are concerned, any development here is hardly compatible with the policy pursued by Torvalds in practice (academic projects are driven out, while swindlers who have no idea what a B-tree is are issued infinite credits of trust). So the course has been set for slow decay. Of course, they will try with all their might to pass it off as "development".

Further, the "custodians" of file systems, realizing that you cannot earn much on "storage" alone, will try their hand at more profitable business: as a rule, distributed file systems and virtualization. Perhaps somewhere they will port the fashionable ZFS to where it does not yet exist. But ZFS, like all upstream file systems, resembles a New Year's tree: you can still hang a few more small things on it, but you can no longer get any deeper. I admit that a serious enterprise system can be built on the basis of ZFS, but since we are now discussing the future, I can only state with regret that in this respect ZFS is hopeless: with their virtual devices, the guys have cut off the oxygen for themselves and for future generations as far as further development is concerned. ZFS is yesterday. And ext4 and XFS are not even the day before yesterday.

Separately, it is worth mentioning the much-hyped concept of the "Linux file system of the next generation". This is a completely political and marketing project, created so that certain characters could, so to speak, stake out the future of file systems in Linux for themselves. The fact is that Linux used to be "just for fun", whereas now it is, first of all, a machine for making money. Money is made on everything possible. For example, creating a good software product is very difficult, but clever "developers" realized long ago that there is no need to strain at all: you can successfully sell non-existent software, announced and promoted at various public events; the main thing is that the presentation slides have more "features".

File systems are perfectly suited for this, because one can safely haggle over the result for ten years. And if someone then complains about the absence of that very result, well, he simply understands nothing about file systems! It resembles a financial pyramid: at the top are the adventurers who started this mess, together with the few who got "lucky": they have "withdrawn their dividends", that is, received money for development, landed highly paid managerial jobs, "shone" at conferences, and so on.

Next come those who were "unlucky": they will count the losses and disentangle the consequences of deploying an unusable software product into production, etc. There are many more of them. And at the base of the pyramid is a huge mass of developers "sawing away" at useless code. They lose the most, because wasted time cannot be returned. Such pyramids are extremely beneficial to Torvalds and his associates, and the more of them there are, the better for them. Anything can be taken into the kernel to feed such pyramids. Of course, in public they claim otherwise. But I judge not by words but by deeds.

So, "the future of file systems in Linux" is yet another highly publicized but not very usable piece of software. After Btrfs, this "future" will most likely be taken over by Bcachefs, which is another attempt to cross the Linux block layer with a file system (a bad example is contagious). And what is characteristic: the problems there are the same as in Btrfs. I suspected this for a long time, and then at some point I could not resist and looked into the code - and there they are!

The authors of Bcachefs and Btrfs, when creating their file systems, actively used other people's sources while understanding little of them. The situation is very reminiscent of the children's game of "broken telephone". And I can roughly imagine how this code will be included in the kernel. Actually, no one will see the "rake" (everyone will step on it later). After numerous nit-picks about code style, accusations of non-existent violations, etc., a conclusion will be drawn about the "loyalty" of the author, about how well he "interacts" with other developers, and about how successfully all of it can then be sold to corporations.

The end result is of no interest to anyone. Twenty years ago, perhaps, it would have been, but now the questions are posed differently: will it be possible to promote this so that certain people are kept employed for the next ten years? And asking about the end result is, alas, not the done thing.

In general, I would strongly advise against starting to invent your own file system from scratch. Even significant financial investments will not be enough to get something competitive within ten years. Of course, I am talking about serious projects, not about those intended to be "pushed" into the kernel. So a more effective way to make a name for yourself is to join a real development effort, for example, ours. This is, of course, not easy to do, but that is the case with any high-level project.

First you will have to independently overcome a problem that I will offer you. After that, convinced of the seriousness of your intentions, I will begin to help. Traditionally, we use only our own developments; the exceptions are compression algorithms and some hash functions. We do not send developers off to travel around conferences and then sit and combine other people's ideas ("maybe something will come of it"), as is customary in most startups.

We develop all the algorithms ourselves. At the moment I am interested in the algebraic and combinatorial aspects of the science of data storage: in particular, finite fields, asymptotics, and proofs of inequalities. There is work for ordinary programmers too, but I must warn you right away: all proposals to "look at another file system and do the same" are ignored. Patches aimed at closer integration with Linux along the VFS line will go the same way.

So, we have no rakes lying around here to step on, but we do have an understanding of where we need to move, and the confidence that this direction is the right one. That understanding did not arrive like manna from heaven. Let me remind you that behind it are 29 years of development experience, two file systems written from scratch, and the same number of data recovery utilities. And that is a lot!

Source: opennet.ru
