Reiser5 announces support for Burst Buffers (Data Tiering)

Edward Shishkin announced new features developed within the framework of the Reiser5 project. Reiser5 represents a significantly redesigned version of the ReiserFS file system, in which support for parallel scalable logical volumes is implemented at the file system level, and not at the block device level, which makes it possible to efficiently distribute data across a logical volume.

Among the recently developed features is the ability to add a small high-performance block device (e.g. NVRAM), called a proxy disk, to a relatively large logical volume composed of slow, inexpensive drives. This creates the impression that the whole volume is built from the same expensive high-performance devices as the proxy disk.

The method is based on a simple observation: in practice, disk writes do not happen constantly, and the I/O load curve consists of peaks. In the intervals between these peaks it is always possible to flush the proxy disk, rewriting all of its data (or only part of it) to the main, "slow" storage in the background. The proxy disk is thus always ready to receive a new portion of data.
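
The effect of a peaky load can be illustrated with a toy model. The sketch below is not Reiser5 code: the function name, the tick-based timing and all numbers are assumptions chosen for illustration. Each tick first absorbs incoming writes on the proxy disk while it has room, then flushes a fixed amount to the main storage in the background.

```python
def simulate_burst_buffer(writes_per_tick, proxy_capacity, flush_per_tick):
    """Toy model of a burst buffer; all quantities are abstract "blocks".

    Returns (absorbed, spilled, left_on_proxy). Names and numbers are
    illustrative assumptions, not taken from Reiser5.
    """
    used = 0      # blocks currently held on the proxy disk
    absorbed = 0  # blocks that landed on the fast proxy disk
    spilled = 0   # blocks that had to go straight to the slow storage
    for incoming in writes_per_tick:
        to_proxy = min(incoming, proxy_capacity - used)
        used += to_proxy
        absorbed += to_proxy
        spilled += incoming - to_proxy
        # Background flush: move data from the proxy disk to main storage.
        used -= min(used, flush_per_tick)
    return absorbed, spilled, used


# Two bursts of 80 blocks separated by idle ticks.
peaky_load = [80, 0, 0, 0, 80, 0, 0, 0]
print(simulate_burst_buffer(peaky_load, proxy_capacity=100, flush_per_tick=30))
# prints (160, 0, 0)
```

With the same proxy disk under a constant load of 80 blocks per tick, the model starts spilling writes to the slow storage from the second tick on, which is why the technique relies on idle intervals between peaks.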

This technique (known as Burst Buffers) originated in high-performance computing (HPC). It has also turned out to be in demand for ordinary applications, especially those that place high demands on data integrity (usually databases of various kinds). Such applications make every change to a file atomically, as follows:

  • first, a new file is created that contains the changed data;
  • this new file is then written to disk with fsync(2);
  • after that, the new file is renamed over the old one, which automatically
    releases the blocks occupied by the old data.

All these steps, to one degree or another, significantly degrade performance on any file system. The situation improves if the new file is first written to a dedicated high-performance device, which is exactly what happens in a file system with Burst Buffers support.
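
The three steps listed above can be sketched in Python as follows. This is a minimal illustration of the general write, fsync, rename pattern, not code from any particular database. The function name `atomic_replace` is made up for the example.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the contents of `path` with `data` (bytes) atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # Step 1: create a new file that holds the changed data.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Step 2: force the new file to disk with fsync(2).
            os.fsync(tmp.fileno())
        # Step 3: rename the new file over the old one.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "record.db")
        atomic_replace(target, b"version 1")
        atomic_replace(target, b"version 2")
        with open(target, "rb") as f:
            print(f.read())
```

The temporary file is created in the same directory as the target so that the rename stays within one file system. A reader of `path` sees either the old contents or the new ones, never a partially written file.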

In Reiser5, it is planned to optionally send to the proxy disk not only new logical file blocks but all dirty pages in general: not just data pages, but also the metadata written in steps (2) and (3).

Proxy disk support is implemented as part of the regular handling of Reiser5 logical volumes, announced at the beginning of the year. That is, the combined system of proxy disk and main storage is an ordinary logical volume, the only difference being that the proxy disk takes precedence over the other volume components in the disk address allocation policy.

Adding a proxy disk to a logical volume involves no data rebalancing, and removing it works the same way as removing a normal disk. All proxy disk operations are atomic. Error handling and system deployment (including after a system crash) work exactly as if the proxy disk were a normal component of the logical volume.

After a proxy disk is added, the total capacity of the logical volume increases by the capacity of that disk. Free space on the proxy disk is monitored in the same way as for the other volume components, i.e. using the volume.reiser4(8) utility.

The proxy disk needs to be cleaned periodically, i.e. its data must be flushed to the main storage. Once Reiser5 reaches beta stability, cleaning is planned to happen automatically (a dedicated kernel thread will be in charge of it). For now, cleaning is the user's responsibility. Data is flushed from the proxy disk to the main storage by calling the volume.reiser4 utility with the "-b" option and the mount point of the logical volume as the argument. Cleaning must be run regularly. A simple shell script can be written for this purpose.
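
As an alternative to a shell script, periodic cleaning could be driven from a small Python program. The sketch below is an illustration only. The "-b" option and the mount point argument come from the description above; the function names, the mount point, the interval and the number of rounds are assumptions. The `run` and `sleep` parameters exist only so the loop can be exercised without a real Reiser5 volume.

```python
import subprocess
import time


def build_flush_command(mount_point):
    """Command line that flushes the proxy disk of the volume at mount_point."""
    return ["volume.reiser4", "-b", mount_point]


def flush_periodically(mount_point, interval_seconds, rounds,
                       run=subprocess.run, sleep=time.sleep):
    """Flush the proxy disk `rounds` times, pausing between the calls."""
    for i in range(rounds):
        # check=True raises CalledProcessError if the utility reports failure.
        run(build_flush_command(mount_point), check=True)
        if i + 1 < rounds:
            sleep(interval_seconds)
```

For example, `flush_periodically("/mnt/reiser5", 600, 6)` would flush a volume mounted at the hypothetical path /mnt/reiser5 six times at ten-minute intervals. In practice the same single command could equally be scheduled with cron.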

If there is no free space on the proxy disk, all data is automatically written to the main storage. By default, however, the overall performance of the file system drops in this case (because all existing transactions are constantly being committed). Optionally, a mode without this performance loss can be set, but then the proxy disk space is used less efficiently.

It is convenient to use the metadata brick as the proxy disk, provided that it is created on a sufficiently high-performance block device.

Source: opennet.ru
