What do LVM and matryoshka have in common?

Good day.
I want to share with the community my practical experience in building a storage system for KVM using md RAID + LVM.

On the agenda:

  • Building md RAID 1 from NVMe SSD.
  • Building md RAID 6 from SATA SSD and regular drives.
  • Features of TRIM/DISCARD on SSD RAID 1/6.
  • Creating a bootable md RAID 1/6 array on a shared set of disks.
  • Installing the system on NVMe RAID 1 in the absence of NVMe support in the BIOS.
  • Using LVM cache and LVM thin.
  • Using BTRFS snapshots and send/receive for backup.
  • Using LVM thin snapshots and thin_delta for BTRFS style backups.

If you are interested, welcome under the cut.

Disclaimer

The author does not bear any responsibility for the consequences of using or not using materials/examples/code/tips/data from this article. By reading or in any way using this material, you assume responsibility for all consequences of these actions. Possible consequences include:

  • Crispy fried NVMe SSDs.
  • Fully consumed write resource and failure of SSD drives.
  • Complete loss of all data on all drives, including backups.
  • Faulty computer hardware.
  • Wasted time, nerves and money.
  • Any other consequences not listed above.

Hardware

On hand were:

A motherboard from around 2013 with the Z87 chipset, paired with an Intel Core i7 / Haswell.

  • Processor 4 cores, 8 threads
  • 32 GB DDR3 RAM
  • 1 x 16 or 2 x 8 PCIe 3.0
  • 1 x 4 + 1 x 1 PCIe 2.0
  • 6 x SATA 3 (6 Gbps) ports

An LSI SAS9211-8I SAS adapter flashed to IT / HBA mode. The RAID-capable firmware was deliberately replaced with HBA firmware so that:

  1. The adapter could be thrown out at any time and replaced with whatever other one comes to hand.
  2. TRIM/DISCARD would work properly on the disks: the RAID firmware does not support these commands at all, while an HBA, by and large, does not care which commands travel over the bus.

Hard drives: 8 x HGST Travelstar 7K1000 1 TB in the 2.5" laptop form factor. These drives previously served in a RAID 6 array, and in the new system they will also find a use: storing local backups.

Additionally added:

6 x SATA SSD, Samsung 860 QVO 2TB. What mattered for these SSDs was large capacity, an SLC cache, reasonable reliability, and a low price. Support for discard/zero was mandatory, which is confirmed by this line in dmesg:

kernel: ata1.00: Enabling discard_zeroes_data
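
Another way to double-check discard support per device (a generic check, not taken from the author's log) is to look at the DISC-GRAN and DISC-MAX columns; non-zero values mean the kernel sees working TRIM/DISCARD support:

#lsblk --discard /dev/sda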

2 x NVMe SSD, Samsung SSD 970 EVO 500GB.

For these SSDs, what matters is random read / write speed and enough endurance for your workload. Plus a heatsink for them. Mandatory. Absolutely mandatory. Otherwise you will fry them to a crisp during the first RAID synchronization.

A StarTech PEX8M2E2 adapter for 2 x NVMe SSD, installed in a PCIe 3.0 x8 slot. This, again, is just an HBA, but for NVMe. It differs from the cheap adapters in that it does not require PCIe bifurcation support from the motherboard, thanks to its integrated PCIe switch. It will work even in the most ancient system that has PCIe, even in a PCIe 1.0 x1 slot - naturally, at the corresponding speed. There is no RAID and no built-in BIOS on board, so your system will not magically learn to boot from NVMe, let alone do NVMe RAID, thanks to this device.

This component is here solely because there was only one free PCIe 3.0 x8 slot in the system; if you have 2 free slots, it can easily be replaced with two cheap PEX4M2E1 adapters or their analogues, which can be bought anywhere for about 600 rubles.

All kinds of hardware and chipset / BIOS RAIDs were deliberately rejected so that the entire system, except for the SSDs / HDDs themselves, could be completely replaced while preserving all the data. Ideally, even the installed operating system should survive a move to completely new / different hardware; all that is needed is SATA and PCIe ports. It's like a live CD or a bootable flash drive, only very fast and a bit bulky.

Humor: And then, you know how it happens - sometimes you urgently need to grab the whole array and take it away with you. And you don't want to lose your data. For that, all the mentioned media are conveniently mounted on sleds in the 5.25" bays of a standard case.

Well, and, of course, for experimenting with different ways of SSD caching in Linux.

Hardware raids are boring. You turn it on. It either works or it doesn't. And with mdadm there are always options.

Software

Previously, the hardware ran Debian 8 Jessie, which is close to EOL. A RAID 6 array was assembled from the above-mentioned HDDs and paired with LVM, and it ran virtual machines under kvm/libvirt.

Since the author has relevant experience in building portable bootable SATA / NVMe drives, and in order not to break the familiar apt workflow, Ubuntu 18.04 was chosen as the target system: it has already stabilized sufficiently, yet still has 3 years of support ahead of it.

The mentioned system contains all the hardware drivers we need out of the box. We do not need any third-party software and drivers.

Preparing for installation

To install the system we need the Ubuntu Desktop image. The server flavour comes with a rather pushy installer that shows excessive, non-switchable independence: it insists on shoving a UEFI system partition onto one of the disks, spoiling all the beauty. Accordingly, it installs only in UEFI mode and offers no options.

We are not satisfied with this.

Why? Unfortunately, UEFI booting gets along very poorly with bootable software RAID, because nobody offers any redundancy for the UEFI ESP partition. There are recipes on the net that suggest placing the ESP partition on a flash drive in a USB port, but that is a point of failure. There are recipes using software mdadm RAID 1 with metadata version 0.9, which does not prevent the UEFI BIOS from seeing the partition, but that only lives until the happy moment when the BIOS or another OS on the hardware writes something to the ESP and forgets to synchronize it to the other mirrors.

In addition, UEFI booting depends on NVRAM, which will not move to a new system along with the disks, because it is part of the motherboard.

So, we will not reinvent the wheel. We already have a ready-made, time-tested grandfather's bicycle, now called Legacy / BIOS boot and bearing the proud name of CSM on UEFI-compatible systems. We will just take it off the shelf, oil it, pump up the tyres and wipe it down with a damp cloth.

The desktop version of Ubuntu also does not know how to install normally with the Legacy bootloader, but here, as they say, at least there are options.

So, we assemble the hardware and boot the system from an Ubuntu Live USB flash drive. We will need to download packages, so we set up a network connection that actually works. If that is not possible, the necessary packages can be downloaded to a flash drive in advance.

We go into the Desktop environment, launch the terminal emulator, and let's go:

#sudo bash

How…? The line above is the canonical trigger for a sudo holy war. With greater power comes greater responsibility. The question is whether you are able to take it upon yourself. Many people think that using sudo this way is, at the very least, careless. However:

#apt-get install mdadm lvm2 thin-provisioning-tools btrfs-tools util-linux lsscsi nvme-cli mc

Why not ZFS...? When we install software on our computer, we are essentially lending our hardware to the developers of that software.
When we trust this software with the safety of our data, we take a loan equal to the cost of restoring this data, which we will have to pay someday.

From this point of view, ZFS is a Ferrari, and mdadm+lvm is more like a bicycle.

Subjectively, the author would rather lend strangers a bicycle than a Ferrari. The stakes are lower. No licence is needed. The traffic rules are simpler. Parking is free. Cross-country ability is better. You can always attach legs to a bicycle, and you can fix a bicycle with your own hands.

Why BTRFS then...? To boot the operating system we need a file system that is supported by Legacy / BIOS GRUB out of the box and at the same time supports live snapshots. We will use it for the /boot partition. In addition, the author prefers to use this FS for / (root) as well, while noting that for any other software you can always create separate volumes on LVM and mount them at the desired directories.

We will not store images of virtual machines or databases on this FS.
This FS will be used only to create snapshots of the system without shutting it down, followed by transferring these snapshots to a backup disk using send / receive.

In addition, the author generally prefers to keep a minimum of software directly on the hardware and run the rest of the software in virtual machines using such things as forwarding GPU and PCI-USB Host controllers to KVM via IOMMU.

Only data storage, virtualization and backup remain on the hardware.

If you trust ZFS more, then, in principle, for the specified application, they are interchangeable.

However, the author deliberately ignores the built-in mirroring/RAID and redundancy features of ZFS, BTRFS, and LVM.

As an additional argument, BTRFS has the ability to turn random writes into sequential writes, which has a very positive effect on the speed of synchronization of snapshots / backups on the HDD.

Rescan all devices:

#udevadm control --reload-rules && udevadm trigger

Let's look around:

#lsscsi && nvme list
[0:0:0:0] disk ATA Samsung SSD 860 2B6Q /dev/sda
[1:0:0:0] disk ATA Samsung SSD 860 2B6Q /dev/sdb
[2:0:0:0] disk ATA Samsung SSD 860 2B6Q /dev/sdc
[3:0:0:0] disk ATA Samsung SSD 860 2B6Q /dev/sdd
[4:0:0:0] disk ATA Samsung SSD 860 2B6Q /dev/sde
[5:0:0:0] disk ATA Samsung SSD 860 2B6Q /dev/sdf
[6:0:0:0] disk ATA HGST HTS721010A9 A3J0 /dev/sdg
[6:0:1:0] disk ATA HGST HTS721010A9 A3J0 /dev/sdh
[6:0:2:0] disk ATA HGST HTS721010A9 A3J0 /dev/sdi
[6:0:3:0] disk ATA HGST HTS721010A9 A3B0 /dev/sdj
[6:0:4:0] disk ATA HGST HTS721010A9 A3B0 /dev/sdk
[6:0:5:0] disk ATA HGST HTS721010A9 A3B0 /dev/sdl
[6:0:6:0] disk ATA HGST HTS721010A9 A3J0 /dev/sdm
[6:0:7:0] disk ATA HGST HTS721010A9 A3J0 /dev/sdn
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S466NXXXXXXX15L Samsung SSD 970 EVO 500GB 1 0,00 GB / 500,11 GB 512 B + 0 B 2B2QEXE7
/dev/nvme1n1 S5H7NXXXXXXX48N Samsung SSD 970 EVO 500GB 1 0,00 GB / 500,11 GB 512 B + 0 B 2B2QEXE7

Partitioning "disks"

NVMe SSDs

We will not partition these at all. Our BIOS does not see these drives anyway, so they will go into the software RAID whole. We will not even create partitions on them. If you want to do it "by the canon" or "on principle", create one large partition, as on an HDD.

SATA HDD

There is nothing special to invent here. We will create one partition spanning everything. We create a partition at all because the BIOS sees these disks and may even try to boot from them. We will even install GRUB on these drives later, so that the system can actually do so should the need arise.

#cat >hdd.part << EOF
label: dos
label-id: 0x00000000
device: /dev/sdg
unit: sectors

/dev/sdg1 : start= 2048, size= 1953523120, type=fd, bootable
EOF
#sfdisk /dev/sdg < hdd.part
#sfdisk /dev/sdh < hdd.part
#sfdisk /dev/sdi < hdd.part
#sfdisk /dev/sdj < hdd.part
#sfdisk /dev/sdk < hdd.part
#sfdisk /dev/sdl < hdd.part
#sfdisk /dev/sdm < hdd.part
#sfdisk /dev/sdn < hdd.part

SATA SSD

Here we have the most interesting.

First, we have 2 TB drives. This is within the acceptable range for MBR, which is what we will use. If necessary, it can be replaced with GPT. GPT has a compatibility layer that allows MBR-compatible systems to see the first 4 partitions if they fit within the first 2 terabytes. The main thing is that the boot partition and the bios_grub partition on these drives should be at the beginning. This even allows booting from GPT disks in Legacy/BIOS mode.

But, this is not our case.

Here we will create two partitions. The first one will be 1 GB in size and used for the RAID 1 /boot.

The second one will be used for RAID 6 and will take up all the remaining free space except for a small unallocated area at the end of the drive.

What is the unallocated area for? According to sources on the net, our SATA SSDs have an on-board dynamically expanding SLC cache ranging in size from 6 to 78 gigabytes. We get 6 gigabytes "for free" thanks to the difference between "gigabytes" and "gibibytes" in the drive's data sheet. The remaining 72 gigabytes are allocated from unused space.

Here it should be noted that the cache is SLC, while the space it occupies is normally used in 4-bit MLC (QLC) mode. For us this effectively means that for every 4 gigabytes of free space we get only 1 gigabyte of SLC cache.

We multiply 72 gigabytes by 4 and get 288 gigabytes. This is the free space that we will not mark up in order to allow drives to fully use the SLC cache.

Thus, from six drives we will effectively get up to 312 gigabytes of SLC cache in total, since two drives' worth of the array goes to RAID 6 redundancy.

With this much cache, real-life writes will only rarely land outside it. This compensates extremely well for the saddest drawback of QLC memory - the very low write speed when data bypasses the cache. If your workloads do not fit this pattern, I recommend thinking hard about how long your SSDs will last under such a load, taking the TBW from the data sheet into account.
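
A quick sanity check of the numbers used in the partition table below, assuming the usual 3,907,029,168-sector capacity of a "2 TB" drive (an illustration, not output from the author's system): the total minus the 2048-sector gap, the 2,097,152-sector /boot partition and the 3,300,950,016-sector RAID 6 partition leaves roughly 288 GiB unallocated at the end of each drive.

#echo $(( (3907029168 - 2048 - 2097152 - 3300950016) * 512 / 1024 / 1024 / 1024 ))
288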

#cat >ssd.part << EOF
label: dos
label-id: 0x00000000
device: /dev/sda
unit: sectors

/dev/sda1 : start= 2048, size= 2097152, type=fd, bootable
/dev/sda2 : start= 2099200, size= 3300950016, type=fd
EOF
#sfdisk /dev/sda < ssd.part
#sfdisk /dev/sdb < ssd.part
#sfdisk /dev/sdc < ssd.part
#sfdisk /dev/sdd < ssd.part
#sfdisk /dev/sde < ssd.part
#sfdisk /dev/sdf < ssd.part

Creating arrays

First we need to rename the machine. This is necessary because the hostname becomes part of the array name somewhere inside mdadm and affects something somewhere. Of course, arrays can be renamed later, but those are unnecessary extra steps.

#mcedit /etc/hostname
#mcedit /etc/hosts
#hostname
vdesk0

NVMe SSDs

#mdadm --create --verbose --assume-clean /dev/md0 --level=1 --raid-devices=2 /dev/nvme[0-1]n1

Why --assume-clean...? So as not to initialize the arrays. For both RAID levels 1 and 6 this is acceptable: everything works without initialization as long as the array is new. Moreover, initializing an SSD array at creation time is a waste of TBW resource. Where possible, we use TRIM/DISCARD on the assembled SSD arrays to "initialize" them instead.

For SSD arrays, RAID 1 DISCARD is supported out of the box.

For SSD RAID 6 arrays, DISCARD must be enabled in the kernel module parameters.

This should only be done if all the SSDs used in RAID 4/5/6 arrays on this system have working discard_zeroes_data support. Sometimes you come across strange drives that tell the kernel this feature is supported when in fact it is not there, or it does not always work. At the moment support is available almost everywhere, but old drives and buggy firmware do occur. For this reason, DISCARD support is disabled by default for RAID 6.
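
The persistent setting via /etc/modprobe.d is shown later in the article; for the currently running kernel the switch can be inspected, and on kernels where the parameter is writable also flipped, through sysfs. A sketch:

#cat /sys/module/raid456/parameters/devices_handle_discard_safely
#echo Y >/sys/module/raid456/parameters/devices_handle_discard_safely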

Attention, the following command will destroy all data on NVMe drives by "initializing" the array with "zeros".

#blkdiscard /dev/md0

If something goes wrong, try specifying the step size:

#blkdiscard --step 65536 /dev/md0

SATA SSD

#mdadm --create --verbose --assume-clean /dev/md1 --level=1 --raid-devices=6 /dev/sd[a-f]1
#blkdiscard /dev/md1
#mdadm --create --verbose --assume-clean /dev/md2 --chunk=512K --level=6 --raid-devices=6 /dev/sd[a-f]2

Why so big...? Increasing the chunk size has a positive effect on random read speed for blocks up to the chunk size inclusive, because one operation of that size or smaller can be served entirely by a single device, so the IOPS of all devices add up. According to statistics, 99% of IO does not exceed 512K.

RAID 6 write IOPS are always less than or equal to the IOPS of a single drive, whereas for random reads IOPS can be several times higher than that of a single drive, and here the block size is of key importance.
The author sees no point in trying to optimize a parameter that RAID 6 is bad at by design and instead optimizes what RAID 6 does well.
We will compensate for RAID 6's poor random writes with an NVMe cache and thin-provisioning tricks.

We have not yet enabled DISCARD for RAID 6, so we will not "initialize" this array for now. We will do it later, after installing the OS.

SATA HDD

#mdadm --create --verbose --assume-clean /dev/md3 --chunk=512K --level=6 --raid-devices=8 /dev/sd[g-n]1

LVM on NVMe RAID

For speed, we want to place the root FS on an NVMe RAID 1 which is /dev/md0.
However, we will still need this fast array for other needs, such as swap, LVM-cache metadata and cache, and LVM-thin metadata, so we will create an LVM VG on this array.

#pvcreate /dev/md0
#vgcreate root /dev/md0

Let's create a partition for the root FS.

#lvcreate -L 128G --name root root

Let's create a swap partition according to the size of the RAM.

#lvcreate -L 32G --name swap root

OS installation

At this point we have everything needed to install the system.

We launch the system installation wizard from the Ubuntu Live environment. Normal installation. Only at the stage of selecting disks for installation, you need to specify the following:

  • /dev/md1, - mount point /boot, FS - BTRFS
  • /dev/root/root (aka /dev/mapper/root-root), - mount point / (root), FS - BTRFS
  • /dev/root/swap (aka /dev/mapper/root-swap), - use as swap partition
  • Install bootloader on /dev/sda

When selecting BTRFS as the root FS, the installer will automatically create two BTRFS volumes named "@" for / (root), and "@home" for /home.

Let's start the installation...

The installation will end with a modal dialog box reporting an error installing the bootloader. Unfortunately, you will not be able to exit this dialog using standard means and continue the installation. We log out from the system and log in again, getting into a clean Ubuntu Live desktop. Open a terminal and again:

#sudo bash

Create a chroot environment to continue with the installation:

#mkdir /mnt/chroot
#mount -o defaults,space_cache,noatime,nodiratime,discard,subvol=@ /dev/mapper/root-root /mnt/chroot
#mount -o defaults,space_cache,noatime,nodiratime,discard,subvol=@home /dev/mapper/root-root /mnt/chroot/home
#mount -o defaults,space_cache,noatime,nodiratime,discard /dev/md1 /mnt/chroot/boot
#mount --bind /proc /mnt/chroot/proc
#mount --bind /sys /mnt/chroot/sys
#mount --bind /dev /mnt/chroot/dev

Set up network and hostname in chroot:

#cat /etc/hostname >/mnt/chroot/etc/hostname
#cat /etc/hosts >/mnt/chroot/etc/hosts
#cat /etc/resolv.conf >/mnt/chroot/etc/resolv.conf

We go into the chroot environment:

#chroot /mnt/chroot

First of all, let's install the packages:

#apt-get install --reinstall mdadm lvm2 thin-provisioning-tools btrfs-tools util-linux lsscsi nvme-cli mc debsums hdparm

Let's find and fix all the packages that were installed crookedly because the system installation did not finish:

#CORRUPTED_PACKAGES=$(debsums -s 2>&1 | awk '{print $6}' | uniq)
#apt-get install --reinstall $CORRUPTED_PACKAGES

If something does not work out, you may need to edit /etc/apt/sources.list before doing this.
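
If apt cannot find the packages inside the chroot, a minimal /etc/apt/sources.list for Ubuntu 18.04 looks roughly like this (the mirror is an example, substitute your own):

#cat >/etc/apt/sources.list << EOF
deb http://archive.ubuntu.com/ubuntu bionic main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu bionic-updates main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu bionic-security main restricted universe multiverse
EOF
#apt-get update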

Let's fix the parameters for the RAID 6 module to enable TRIM/DISCARD:

#cat >/etc/modprobe.d/raid456.conf << EOF
options raid456 devices_handle_discard_safely=1
EOF

Let's tweak our arrays a bit:

#cat >/etc/udev/rules.d/60-md.rules << EOF
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}="32768"
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", TEST=="md/sync_speed_min", ATTR{md/sync_speed_min}="48000"
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", TEST=="md/sync_speed_max", ATTR{md/sync_speed_max}="300000"
EOF
#cat >/etc/udev/rules.d/62-hdparm.rules << EOF
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", RUN+="/sbin/hdparm -B 254 /dev/%k"
EOF
#cat >/etc/udev/rules.d/63-blockdev.rules << EOF
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", RUN+="/sbin/blockdev --setra 1024 /dev/%k"
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md*", RUN+="/sbin/blockdev --setra 0 /dev/%k"
EOF

What was it..?We have created a set of udev rules that will do the following:

  • Set a stripe cache size for RAID 6 that is adequate for 2020. The default value does not seem to have changed since Linux was created and stopped being adequate long ago.
  • Reserve a minimum of IO bandwidth for the duration of array checks / synchronization, so that your arrays do not get stuck in a state of eternal synchronization under load.
  • Limit the maximum IO during array checks / synchronization, so that syncing or checking SSD RAIDs does not fry your drives to a crisp. This is especially true for NVMe. (Remember the heatsink? I wasn't joking.)
  • Prevent the disks from stopping their spindles (HDD) via APM and set the controller sleep timeout to 7 hours. You can disable APM entirely if your drives can do it (-B 255). With the default value the disks stop after five seconds, then the OS wants to flush the disk cache, the disks spin up again, and the whole thing repeats. Disks have a limited maximum number of spindle start/stop cycles, so such a simple default cycle can easily kill your disks in a couple of years. Not all disks suffer from this, but our "laptop" ones do, and with the corresponding default settings they turn the RAID into a crooked mini-MAID.
  • Set readahead on the (spinning) disks to 1 megabyte - two consecutive RAID 6 chunks.
  • Disable readahead on arrays themselves.

Edit /etc/fstab:

#cat >/etc/fstab << EOF
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
# file-system mount-point type options dump pass
/dev/mapper/root-root / btrfs defaults,space_cache,noatime,nodiratime,discard,subvol=@ 0 1
UUID=$(blkid -o value -s UUID /dev/md1) /boot btrfs defaults,space_cache,noatime,nodiratime,discard 0 2
/dev/mapper/root-root /home btrfs defaults,space_cache,noatime,nodiratime,discard,subvol=@home 0 2
/dev/mapper/root-swap none swap sw 0 0
EOF

Why is that..? We refer to the /boot partition by UUID because array naming can theoretically change.

We refer to the remaining volumes by their LVM names in /dev/mapper/vg-lv notation, because these identify the volumes unambiguously.

We do not use UUIDs for LVM: the UUID of an LVM volume and of its snapshots can be the same. Mount /dev/mapper/root-root twice..? Yes. Exactly. It is a feature of BTRFS: this FS can be mounted several times with different subvol options.

As a result of this feature, I recommend that you never create LVM snapshots of active BTRFS volumes. You may get a surprise when you reboot.

Regenerate the mdadm config:

#/usr/share/mdadm/mkconf | sed 's/#DEVICE/DEVICE/g' >/etc/mdadm/mdadm.conf

Let's adjust the LVM settings:

#cat >>/etc/lvm/lvmlocal.conf << EOF

activation {
thin_pool_autoextend_threshold=90
thin_pool_autoextend_percent=5
}
allocation {
cache_pool_max_chunks=2097152
}
devices {
global_filter=["r|^/dev/.*_corig$|","r|^/dev/.*_cdata$|","r|^/dev/.*_cmeta$|","r|^/dev/.*gpv$|","r|^/dev/images/.*$|","r|^/dev/mapper/images.*$|","r|^/dev/backup/.*$|","r|^/dev/mapper/backup.*$|"]
issue_discards=1
}
EOF

What was that..? We have enabled automatic expansion of LVM thin pools: on reaching 90% occupancy they grow by 5% of their volume.

We have increased the maximum number of cache blocks for LVM cache.

We have prevented LVM from searching for LVM volumes (PVs) on:

  • devices holding LVM cache data (_cdata)
  • devices cached with LVM cache, accessed bypassing the cache (_corig); the cached device itself will still be scanned through the cache
  • devices holding LVM cache metadata (_cmeta)
  • all devices in a VG called images. Here we will have disk images of virtual machines, and we do not want LVM on the host to activate volumes belonging to the guest OS.
  • all devices in the VG called backup. Here we will have backup copies of virtual machine images.
  • all devices whose name ends with "gpv" ( guest physical volume )

We have enabled DISCARD when space is freed on an LVM VG. Be careful: this makes deleting an LV on an SSD quite slow, especially on SSD RAID 6. However, according to the plan we will be using thin provisioning, so this will not bother us at all.
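
To make sure the filter is actually picked up, you can ask LVM to print the effective setting and check that the excluded devices no longer appear during scanning (a generic check, using the names configured above):

#lvmconfig devices/global_filter
#pvs -a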

Update the initramfs image:

#update-initramfs -u -k all

Install and configure grub:

#apt-get install grub-pc
#apt-get purge os-prober
#dpkg-reconfigure grub-pc

Which disks should you choose? All of the sd* ones. The system must be able to boot from any working SATA drive or SSD.

Why did os-prober get nailed..? For excessive independence and meddling hands.

It does not work correctly if one of the RAIDs is in a degraded state. It tries to search for OS on partitions that are used in virtual machines running on this hardware.

If you need it, you can keep it, but bear all of the above in mind. I recommend searching the net for recipes for taming its meddling hands.

This completes the initial installation. It's time to reboot into the newly installed OS. Don't forget to remove the bootable Live CD/USB.

#exit
#reboot

As a boot device, select any of the SATA SSDs.

LVM on SATA SSD

At this point we have already booted into the new OS, set up the network and apt, opened a terminal emulator, and run:

#sudo bash

Let's continue.

"Initialize" an array of SATA SSD:

#blkdiscard /dev/md2

If it doesn’t work, then we try:

#blkdiscard --step 65536 /dev/md2

Create an LVM VG on the SATA SSD:

#pvcreate /dev/md2
#vgcreate data /dev/md2

Why another VG..? Indeed, we already have a VG called root. Why not put everything into one VG?

If there are several PVs in the VG, then all PVs must be present (online) for the correct activation of the VG. The exception is LVM RAID, which we intentionally don't use.

We really want the operating system to boot normally and give us a chance to fix the problem if any of the RAID 6 arrays fails (read: loses data).

To do this, at the first level of abstraction we will isolate each type of physical "media" into a separate VG.

Scientifically speaking, different RAID arrays belong to different "reliability domains". You should not create an extra common point of failure for them by stuffing them into one VG.

Having LVM at the "bare metal" level lets us arbitrarily carve up pieces of the different RAID arrays and combine them in different ways: for example, run bcache + LVM thin, bcache + BTRFS, LVM cache + LVM thin, a complex ZFS configuration with caches, or any other hellish mixture side by side, in order to try it all out and compare.

At the "bare metal" level we will not use anything other than good old "thick" LVM volumes. The exception to this rule may be the backup partition.

I think by this point, many readers have already begun to suspect something about the matryoshka.

LVM on SATA HDD

#pvcreate /dev/md3
#vgcreate backup /dev/md3

Another new VG..? We really want our operating system to keep working normally, and to keep access to the non-backup data, if the disk array used for backups fails. Therefore, to avoid VG activation problems, we create a separate VG.

Setting up LVM cache

Let's create an LV on NVMe RAID 1 to use it as a caching device.

#lvcreate -L 70871154688B --name cache root

Why so little...? The thing is that our NVMe SSDs also have an SLC cache: 4 gigabytes "for free" plus 18 gigabytes of dynamic cache carved out of free space used in 3-bit MLC mode. Once this cache is depleted, the NVMe SSDs will not be much faster than our cached SATA SSDs. For that very reason it makes no sense to make the LVM cache partition much larger than twice the SLC cache size of the NVMe drive. For the NVMe drives used here, the author considers 32-64 gigabytes of cache reasonable.

The given partition size is required to organize 64 gigabytes of cache, cache metadata and metadata backup.

Additionally, I note that after a dirty shutdown of the system, LVM will mark the entire cache as dirty and will synchronize again. Moreover, this will repeat every time you use lvchange on that device until a new system reboot. Therefore, I recommend immediately recreating the cache with the appropriate script.

Let's create an LV on SATA RAID 6 to use it as a cacheable device.

#lvcreate -L 3298543271936B --name cache data

Why only three terabytes..? So that, if necessary, the SATA SSD RAID 6 can also be used for other needs. The cached space can be grown dynamically, on the fly, without stopping the system. To do this the cache has to be temporarily detached and re-attached, but the distinctive advantage of LVM cache over, say, bcache, is that this can be done on the fly.

Let's create a new VG for caching.

#pvcreate /dev/root/cache
#pvcreate /dev/data/cache
#vgcreate cache /dev/root/cache /dev/data/cache

Let's create an LV on the cached device.

#lvcreate -L 3298539077632B --name cachedata cache /dev/data/cache

Here we immediately took all the free space on /dev/data/cache so that all other necessary partitions were created immediately on /dev/root/cache. If you have something created in the wrong place, you can move it using pvmove.

Let's create and enable the cache:

#lvcreate -y -L 64G -n cache cache /dev/root/cache
#lvcreate -y -L 1G -n cachemeta cache /dev/root/cache
#lvconvert -y --type cache-pool --cachemode writeback --chunksize 64k --poolmetadata cache/cachemeta cache/cache
#lvconvert -y --type cache --cachepool cache/cache cache/cachedata

Why this chunksize..? Through practical experiments the author found that the best result is achieved when the LVM cache block size matches the LVM thin block size. And the smaller the size, the better the configuration performs on random writes.

64k is the minimum block size allowed for LVM thin.

Careful, writeback..! Yes. This cache mode defers synchronizing writes to the cached device, which means that if the cache is lost, you can lose data on the cached device. Later, the author will describe what measures, besides NVMe RAID 1, can be taken to compensate for this risk.

This type of cache was chosen intentionally to compensate for the poor performance of RAID 6 on random writes.

Let's check what we got:

#lvs -a -o lv_name,lv_size,devices --units B cache
LV LSize Devices
[cache] 68719476736B cache_cdata(0)
[cache_cdata] 68719476736B /dev/root/cache(0)
[cache_cmeta] 1073741824B /dev/root/cache(16384)
cachedata 3298539077632B cachedata_corig(0)
[cachedata_corig] 3298539077632B /dev/data/cache(0)
[lvol0_pmspare] 1073741824B /dev/root/cache(16640)

/dev/data/cache should only contain [cachedata_corig]. If something is wrong, then use pvmove.

You can disable the cache if necessary with one command:

#lvconvert -y --uncache cache/cachedata

This is done online. LVM simply syncs the cache to disk, deletes it, and renames cachedata_corig back to cachedata.
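
Recreating the cache - the "appropriate script" recommended above for use after a dirty shutdown or when the cache is 100% full - then boils down to --uncache followed by the same creation commands. A minimal sketch (the script path is arbitrary; the commands repeat those used above):

#cat >/root/recreate-cache.sh << 'EOF'
#!/bin/bash
# Flush and detach the cache, then build a fresh empty one with the same parameters.
lvconvert -y --uncache cache/cachedata
lvcreate -y -L 64G -n cache cache /dev/root/cache
lvcreate -y -L 1G -n cachemeta cache /dev/root/cache
lvconvert -y --type cache-pool --cachemode writeback --chunksize 64k --poolmetadata cache/cachemeta cache/cache
lvconvert -y --type cache --cachepool cache/cache cache/cachedata
EOF
#chmod +x /root/recreate-cache.sh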

Setting up LVM thin

Let's roughly estimate how much space we need for LVM thin metadata:

#thin_metadata_size --block-size=64k --pool-size=6terabytes --max-thins=100000 -u bytes
thin_metadata_size - 3385794560 bytes estimated metadata area size for "--block-size=64kibibytes --pool-size=6terabytes --max-thins=100000"

Round up to 4 gigabytes: 4294967296B

Multiply by two and add 4194304B for LVM PV metadata: 8594128896B
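
The number used below is just that arithmetic:

#echo $(( 2 * 4294967296 + 4194304 ))
8594128896
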
Let's create a separate partition on NVMe RAID 1 to place LVM thin metadata and its backup on it:

#lvcreate -L 8594128896B --name images root

What for..? The question may arise: why place the LVM thin metadata separately if it will be cached on NVMe anyway and will work quickly?

Speed does matter here, but it is not the main reason. The thing is that the cache is a point of failure: something can happen to it, and if the LVM thin metadata were cached, that would mean losing everything. Without intact metadata, reassembling the thin volumes is next to impossible.

By moving the metadata to a separate, non-cached, but fast volume, we guarantee the safety of the metadata in case the cache is lost or corrupted. In this case, all damage caused by cache loss will be localized inside thin volumes, which will simplify the recovery procedure by orders of magnitude. With a high probability, these damages will be recoverable using FS logs.

Moreover, if a snapshot of a thin volume was taken earlier and the cache was fully synchronized at least once after that, then, thanks to the internal design of LVM thin, the integrity of that snapshot is guaranteed even if the cache is lost.

Let's create a new VG that will be responsible for thin-provisioning:

#pvcreate /dev/root/images
#pvcreate /dev/cache/cachedata
#vgcreate images /dev/root/images /dev/cache/cachedata

Let's create a pool:

#lvcreate -L 274877906944B --poolmetadataspare y --poolmetadatasize 4294967296B --chunksize 64k -Z y -T images/thin-pool

Why -Z y..? Besides what this mode is actually intended for - preventing data from one virtual machine leaking into another when space is reallocated - zeroing is additionally used to increase random write speed for blocks smaller than 64k. Any write smaller than 64k into a previously unallocated area of a thin volume becomes 64K-aligned in the cache, which lets the operation be performed entirely through the cache, bypassing the cached device.

Let's move the LVs to the appropriate PVs:

#pvmove -n images/thin-pool_tdata /dev/root/images /dev/cache/cachedata
#pvmove -n images/lvol0_pmspare /dev/cache/cachedata /dev/root/images
#pvmove -n images/thin-pool_tmeta /dev/cache/cachedata /dev/root/images

Check:

#lvs -a -o lv_name,lv_size,devices --units B images
LV LSize Devices
[lvol0_pmspare] 4294967296B /dev/root/images(0)
thin-pool 274877906944B thin-pool_tdata(0)
[thin-pool_tdata] 274877906944B /dev/cache/cachedata(0)
[thin-pool_tmeta] 4294967296B /dev/root/images(1024)

Let's create a thin volume for tests:

#lvcreate -V 64G --thin-pool thin-pool --name test images

Let's install packages for testing and monitoring:

#apt-get install sysstat fio

This is how you can observe the behavior of our storage configuration in real time:

#watch 'lvs --rows --reportformat basic --quiet -ocache_dirty_blocks,cache_settings cache/cachedata && (lvdisplay cache/cachedata | grep Cache) && (sar -p -d 2 1 | grep -E "sd|nvme|DEV|md1|md2|md3|md0" | grep -v Average | sort)'

This is how we can test our configuration:

#fio --loops=1 --size=64G --runtime=4 --filename=/dev/images/test --stonewall --ioengine=libaio --direct=1 \
--name=4kQD32read --bs=4k --iodepth=32 --rw=randread \
--name=8kQD32read --bs=8k --iodepth=32 --rw=randread \
--name=16kQD32read --bs=16k --iodepth=32 --rw=randread \
--name=32KQD32read --bs=32k --iodepth=32 --rw=randread \
--name=64KQD32read --bs=64k --iodepth=32 --rw=randread \
--name=128KQD32read --bs=128k --iodepth=32 --rw=randread \
--name=256KQD32read --bs=256k --iodepth=32 --rw=randread \
--name=512KQD32read --bs=512k --iodepth=32 --rw=randread \
--name=4Kread --bs=4k --rw=read \
--name=8Kread --bs=8k --rw=read \
--name=16Kread --bs=16k --rw=read \
--name=32Kread --bs=32k --rw=read \
--name=64Kread --bs=64k --rw=read \
--name=128Kread --bs=128k --rw=read \
--name=256Kread --bs=256k --rw=read \
--name=512Kread --bs=512k --rw=read \
--name=Seqread --bs=1m --rw=read \
--name=Longread --bs=8m --rw=read \
--name=Longwrite --bs=8m --rw=write \
--name=Seqwrite --bs=1m --rw=write \
--name=512Kwrite --bs=512k --rw=write \
--name=256write --bs=256k --rw=write \
--name=128write --bs=128k --rw=write \
--name=64write --bs=64k --rw=write \
--name=32write --bs=32k --rw=write \
--name=16write --bs=16k --rw=write \
--name=8write --bs=8k --rw=write \
--name=4write --bs=4k --rw=write \
--name=512KQD32write --bs=512k --iodepth=32 --rw=randwrite \
--name=256KQD32write --bs=256k --iodepth=32 --rw=randwrite \
--name=128KQD32write --bs=128k --iodepth=32 --rw=randwrite \
--name=64KQD32write --bs=64k --iodepth=32 --rw=randwrite \
--name=32KQD32write --bs=32k --iodepth=32 --rw=randwrite \
--name=16KQD32write --bs=16k --iodepth=32 --rw=randwrite \
--name=8KQD32write --bs=8k --iodepth=32 --rw=randwrite \
--name=4kQD32write --bs=4k --iodepth=32 --rw=randwrite \
| grep -E 'read|write|test' | grep -v ioengine

Careful! Resource! This code will run 36 different tests, each for 4 seconds. Half of them are write tests. You can write a lot in 4 seconds on NVMe - up to 3 gigabytes per second. So each run of the write tests can eat up to 216 gigabytes of your SSDs' write resource.

Reads and writes mixed together? Yes. It makes sense to run the read and write tests separately. It also makes sense to make sure all the caches are synchronized so that an earlier write does not affect the reads.

The results will vary greatly on the first run and subsequent ones as the cache and thin volume fill up, and also, depending on whether the system managed to synchronize the caches filled at the last run.

Among other things, I recommend measuring the speed on an already filled thin volume from which a snapshot has just been taken. The author had the chance to observe random writes accelerating dramatically right after the first snapshot is created, especially while the cache is not yet completely full. This happens thanks to the copy-on-write semantics of writes, the alignment of cache and thin volume blocks, and the fact that a random write to RAID 6 turns into a random read from RAID 6 followed by a write to the cache. In our configuration, random reads from RAID 6 are up to 6 times (the number of SATA SSDs in the array) faster than writes. And because CoW blocks are allocated sequentially from the thin pool, the writes, for the most part, also become sequential.

Both of these features can be used to advantage.

Cache "coherent" snapshots

To reduce the risk of data loss in case of cache damage / loss, the author proposes to introduce the practice of snapshot rotation, which guarantees their integrity in this case.

First, because the thin volume metadata resides on a non-cached device, the metadata will be consistent and possible losses will be isolated within data blocks.

The following snapshot rotation cycle guarantees data integrity within snapshots in case of cache loss:

  1. For each thin volume named <name>, create a snapshot named <name>.cached
  2. Set the migration threshold to a reasonably high value: #lvchange --quiet --cachesettings "migration_threshold=16384" cache/cachedata
  3. In a loop, check the number of dirty blocks in the cache: #lvs --rows --reportformat basic --quiet -ocache_dirty_blocks cache/cachedata | awk '{print $2}' until it reaches zero. If the zero refuses to appear for too long, it can be forced by temporarily switching the cache to writethrough mode. However, given the speed characteristics of our SATA and NVMe SSD arrays and their TBW resource, you will either catch the moment quickly without changing the cache mode, or your hardware will eat up its entire resource within a few days. Because of resource limits, the system is in principle incapable of sustaining a 100% write load all the time: our NVMe SSDs would burn through their resource in 3-4 days under 100% write load, and the SATA SSDs would last only twice as long. So we assume that most of the load is reads, while writes come as relatively short bursts of extremely high activity on top of a low average load.
  4. As soon as we caught (or made) a zero, we rename <name>.cached to <name>.committed. The old <name>.committed is removed.
  5. Optionally, if the cache is 100% full, it can be recreated by the script, thus clearing it. With a half-empty cache, the system is much faster on writes.
  6. Set the migration threshold to zero: #lvchange --quiet --cachesettings "migration_threshold=0" cache/cachedata This will temporarily prevent the cache from being synchronized to the primary media.
  7. Wait until enough changes have accumulated in the cache, #lvs --rows --reportformat basic --quiet -ocache_dirty_blocks cache/cachedata | awk '{print $2}', or until a timer fires.
  8. Repeat from the beginning.

Why the fuss with the migration threshold...? The thing is that in real practice a "random" write is actually not completely random. If we have written something to a 4-kilobyte sector, there is a high probability that within the next couple of minutes something will be written to the same sector or to one of its neighbours (+- 32K).

By setting the migration threshold to zero we defer write sync to the SATA SSD and aggregate multiple changes to the same 64K block in the cache. Thus, the SATA SSD resource is noticeably saved.

Where is the code..? Unfortunately, the author considers himself insufficiently competent in bash scripting: he is 100% self-taught and practices "google"-driven development, and therefore believes that the terrible code that comes out of his hands is best not used by anyone else.

I think professionals in this field will be able to reproduce all the logic described above on their own if needed, and maybe even package it nicely as a systemd service, as the author tried to do.
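
For reference, here is a minimal sketch of that loop. It is an illustration only: the volume list, threshold, polling interval and script path are made-up placeholders, and the lvs/lvchange/lvcreate/lvrename invocations simply repeat commands already shown in this article.

#cat >/root/cache-snap-rotate.sh << 'EOF'
#!/bin/bash
# Illustrative skeleton of the snapshot rotation described above.
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
VOLUMES="test"              # thin volumes in VG "images" to protect (placeholder)
CACHE="cache/cachedata"

dirty_blocks() {
    lvs --rows --reportformat basic --quiet -ocache_dirty_blocks "$CACHE" | awk '{print $2}'
}

while true ; do
    # 1. take fresh <name>.cached snapshots
    for LV in $VOLUMES ; do
        lvremove -y "images/$LV.cached" 2>/dev/null
        lvcreate --quiet --snapshot --name "$LV.cached" "images/$LV"
    done
    # 2-3. let the cache drain and wait until it is clean
    lvchange --quiet --cachesettings "migration_threshold=16384" "$CACHE"
    while [ "$(dirty_blocks)" != "0" ] ; do sleep 10 ; done
    # 4. promote the snapshots: they are now fully backed by RAID 6
    for LV in $VOLUMES ; do
        lvremove -y "images/$LV.committed" 2>/dev/null
        lvrename -y "images/$LV.cached" "$LV.committed"
    done
    # 6. stop migration so that writes aggregate in the cache again
    lvchange --quiet --cachesettings "migration_threshold=0" "$CACHE"
    # 7. wait for enough changes to accumulate (simple timer here)
    sleep 3600
done
EOF
#chmod +x /root/cache-snap-rotate.sh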

Such a simple snapshot rotation scheme will not only let us keep one snapshot fully synchronized on the SATA SSD at all times, but will also let us find out, using the thin_delta utility, which blocks were changed after it was created, and thus localize damage on the main volumes, greatly simplifying recovery.

TRIM/DISCARD in libvirt/KVM

Since the data store will be used for KVM running libvirt, it would be nice to teach our VMs not only to take up space, but also to free up what they no longer need.

This is done by emulating TRIM/DISCARD support on virtual disks. To do this, you need to change the controller type to virtio-scsi and edit the xml.

#virsh edit vmname
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='writethrough' io='threads' discard='unmap'/>
<source dev='/dev/images/vmname'/>
<backingStore/>
<target dev='sda' bus='scsi'/>
<alias name='scsi0-0-0-0'/>
<address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>

<controller type='scsi' index='0' model='virtio-scsi'>
<alias name='scsi0'/>
<address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</controller>

Such DISCARDs from the guest OS are handled correctly by LVM, and blocks are freed both in the cache and in the thin pool. In our case this mostly happens in a deferred fashion, when the next snapshot is deleted.
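
With discard='unmap' in place it is up to the guest to actually issue TRIM. A quick manual check from inside a Linux guest (an assumption about the guest, not part of the host setup):

#fstrim -av

On the host the effect shows up in the Data% column of lvs for the thin pool, although, as noted above, much of it appears only after the next snapshot is deleted.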

BTRFS backup

Use ready-made scripts with extreme caution and at your own risk. The author wrote this code himself and exclusively for himself. I am sure that many experienced Linux users have similar crutches of their own, and there is no need to copy someone else's.

Let's create a volume on the backup device:

#lvcreate -L 256G --name backup backup

Format in BTRFS:

#mkfs.btrfs /dev/backup/backup

Let's create mount points and mount the root subvolumes of the file systems:

#mkdir /backup
#mkdir /backup/btrfs
#mkdir /backup/btrfs/root
#mkdir /backup/btrfs/back
#ln -s /boot /backup/btrfs
# cat >>/etc/fstab << EOF

/dev/mapper/root-root /backup/btrfs/root btrfs defaults,space_cache,noatime,nodiratime 0 2
/dev/mapper/backup-backup /backup/btrfs/back btrfs defaults,space_cache,noatime,nodiratime 0 2
EOF
#mount -a
#update-initramfs -u
#update-grub

Create directories for backups:

#mkdir /backup/btrfs/back/remote
#mkdir /backup/btrfs/back/remote/root
#mkdir /backup/btrfs/back/remote/boot

Let's create a directory for backup scripts:

#mkdir /root/btrfs-backup

Let's copy the script:

Lots of scary bash code. Use at your own risk. Do not write angry letters to the author...

#cat >/root/btrfs-backup/btrfs-backup.sh << 'EOF'
#!/bin/bash
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

SCRIPT_FILE="$(realpath $0)"
SCRIPT_DIR="$(dirname $SCRIPT_FILE)"
SCRIPT_NAME="$(basename -s .sh $SCRIPT_FILE)"

LOCK_FILE="/dev/shm/$SCRIPT_NAME.lock"
DATE_PREFIX='%Y-%m-%d'
DATE_FORMAT=$DATE_PREFIX'-%H-%M-%S'
DATE_REGEX='[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]'
BASE_SUFFIX=".@base"
PEND_SUFFIX=".@pend"
SNAP_SUFFIX=".@snap"
MOUNTS="/backup/btrfs/"
BACKUPS="/backup/btrfs/back/remote/"

function terminate ()
{
echo "$1" >&2
exit 1
}

function wait_lock()
{
flock 98
}

function wait_lock_or_terminate()
{
echo "Wating for lock..."
wait_lock || terminate "Failed to get lock. Exiting..."
echo "Got lock..."
}

function suffix()
{
FORMATTED_DATE=$(date +"$DATE_FORMAT")
echo "$SNAP_SUFFIX.$FORMATTED_DATE"
}

function filter()
{
FORMATTED_DATE=$(date --date="$1" +"$DATE_PREFIX")
echo "$SNAP_SUFFIX.$FORMATTED_DATE"
}

function backup()
{
SOURCE_PATH="$MOUNTS$1"
TARGET_PATH="$BACKUPS$1"
SOURCE_BASE_PATH="$MOUNTS$1$BASE_SUFFIX"
TARGET_BASE_PATH="$BACKUPS$1$BASE_SUFFIX"
TARGET_BASE_DIR="$(dirname $TARGET_BASE_PATH)"
SOURCE_PEND_PATH="$MOUNTS$1$PEND_SUFFIX"
TARGET_PEND_PATH="$BACKUPS$1$PEND_SUFFIX"
if [ -d "$SOURCE_BASE_PATH" ] then
echo "$SOURCE_BASE_PATH found"
else
echo "$SOURCE_BASE_PATH File not found creating snapshot of $SOURCE_PATH to $SOURCE_BASE_PATH"
btrfs subvolume snapshot -r $SOURCE_PATH $SOURCE_BASE_PATH
sync
if [ -d "$TARGET_BASE_PATH" ] then
echo "$TARGET_BASE_PATH found out of sync with source... removing..."
btrfs subvolume delete -c $TARGET_BASE_PATH
sync
fi
fi
if [ -d "$TARGET_BASE_PATH" ] then
echo "$TARGET_BASE_PATH found"
else
echo "$TARGET_BASE_PATH not found. Synching to $TARGET_BASE_DIR"
btrfs send $SOURCE_BASE_PATH | btrfs receive $TARGET_BASE_DIR
sync
fi
if [ -d "$SOURCE_PEND_PATH" ] then
echo "$SOURCE_PEND_PATH found removing..."
btrfs subvolume delete -c $SOURCE_PEND_PATH
sync
fi
btrfs subvolume snapshot -r $SOURCE_PATH $SOURCE_PEND_PATH
sync
if [ -d "$TARGET_PEND_PATH" ] then
echo "$TARGET_PEND_PATH found removing..."
btrfs subvolume delete -c $TARGET_PEND_PATH
sync
fi
echo "Sending $SOURCE_PEND_PATH to $TARGET_PEND_PATH"
btrfs send -p $SOURCE_BASE_PATH $SOURCE_PEND_PATH | btrfs receive $TARGET_BASE_DIR
sync
TARGET_DATE_SUFFIX=$(suffix)
btrfs subvolume snapshot -r $TARGET_PEND_PATH "$TARGET_PATH$TARGET_DATE_SUFFIX"
sync
btrfs subvolume delete -c $SOURCE_BASE_PATH
sync
btrfs subvolume delete -c $TARGET_BASE_PATH
sync
mv $SOURCE_PEND_PATH $SOURCE_BASE_PATH
mv $TARGET_PEND_PATH $TARGET_BASE_PATH
sync
}

function list()
{
LIST_TARGET_BASE_PATH="$BACKUPS$1$BASE_SUFFIX"
LIST_TARGET_BASE_DIR="$(dirname $LIST_TARGET_BASE_PATH)"
LIST_TARGET_BASE_NAME="$(basename -s .$BASE_SUFFIX $LIST_TARGET_BASE_PATH)"
find "$LIST_TARGET_BASE_DIR" -maxdepth 1 -mindepth 1 -type d -printf "%fn" | grep "${LIST_TARGET_BASE_NAME/$BASE_SUFFIX/$SNAP_SUFFIX}.$DATE_REGEX"
}

function remove()
{
REMOVE_TARGET_BASE_PATH="$BACKUPS$1$BASE_SUFFIX"
REMOVE_TARGET_BASE_DIR="$(dirname $REMOVE_TARGET_BASE_PATH)"
btrfs subvolume delete -c $REMOVE_TARGET_BASE_DIR/$2
sync
}

function removeall()
{
DATE_OFFSET="$2"
FILTER="$(filter "$DATE_OFFSET")"
while read -r SNAPSHOT ; do
remove "$1" "$SNAPSHOT"
done < <(list "$1" | grep "$FILTER")

}

(
COMMAND="$1"
shift

case "$COMMAND" in
"--help")
echo "Help"
;;
"suffix")
suffix
;;
"filter")
filter "$1"
;;
"backup")
wait_lock_or_terminate
backup "$1"
;;
"list")
list "$1"
;;
"remove")
wait_lock_or_terminate
remove "$1" "$2"
;;
"removeall")
wait_lock_or_terminate
removeall "$1" "$2"
;;
*)
echo "None.."
;;
esac
) 98>$LOCK_FILE

EOF

What does it even do..? It contains a set of simple commands for creating BTRFS snapshots and copying them to another file system using BTRFS send/receive.

The first run can take a relatively long time, because all the data is copied initially. Subsequent runs will be very fast, because only the changes are copied.

Another script that we will push into cron:

Some more bash code

#cat >/root/btrfs-backup/cron-daily.sh << 'EOF'
#!/bin/bash
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

SCRIPT_FILE="$(realpath $0)"
SCRIPT_DIR="$(dirname $SCRIPT_FILE)"
SCRIPT_NAME="$(basename -s .sh $SCRIPT_FILE)"

BACKUP_SCRIPT="$SCRIPT_DIR/btrfs-backup.sh"
RETENTION="-60 day"
$BACKUP_SCRIPT backup root/@
$BACKUP_SCRIPT removeall root/@ "$RETENTION"
$BACKUP_SCRIPT backup root/@home
$BACKUP_SCRIPT removeall root/@home "$RETENTION"
$BACKUP_SCRIPT backup boot/
$BACKUP_SCRIPT removeall boot/ "$RETENTION"
EOF

What does it do..? It creates and synchronizes incremental snapshots of the listed BTRFS volumes onto the backup FS. After that, it deletes the snapshots that were created 60 days ago. After a run, the /backup/btrfs/back/remote/ subdirectories will contain dated snapshots of the listed volumes.

Let's make the scripts executable:

#chmod +x /root/btrfs-backup/cron-daily.sh
#chmod +x /root/btrfs-backup/btrfs-backup.sh

Let's check it and put it in cron:

#/usr/bin/nice -n 19 /usr/bin/ionice -c 3 /root/btrfs-backup/cron-daily.sh 2>&1 | /usr/bin/logger -t btrfs-backup
#cat /var/log/syslog | grep btrfs-backup
#crontab -e
0 2 * * * /usr/bin/nice -n 19 /usr/bin/ionice -c 3 /root/btrfs-backup/cron-daily.sh 2>&1 | /usr/bin/logger -t btrfs-backup
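
The same script can also be run by hand, for example to take an off-schedule snapshot of /home and to list the dated copies already present on the backup disk (arguments follow the case statement in the script above):

#/root/btrfs-backup/btrfs-backup.sh backup root/@home
#/root/btrfs-backup/btrfs-backup.sh list root/@home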

Backup LVM thin

Let's create a thin pool on the backup device:

#lvcreate -L 274877906944B --poolmetadataspare y --poolmetadatasize 4294967296B --chunksize 64k -Z y -T backup/thin-pool

Install ddrescue, because scripts will use this tool:

#apt-get install gddrescue

Let's create a directory for scripts:

#mkdir /root/lvm-thin-backup

Let's copy the scripts:

Lots of bash inside...

#cat >/root/lvm-thin-backup/lvm-thin-backup.sh << 'EOF'
#!/bin/bash
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

SCRIPT_FILE="$(realpath $0)"
SCRIPT_DIR="$(dirname $SCRIPT_FILE)"
SCRIPT_NAME="$(basename -s .sh $SCRIPT_FILE)"

LOCK_FILE="/dev/shm/$SCRIPT_NAME.lock"
DATE_PREFIX='%Y-%m-%d'
DATE_FORMAT=$DATE_PREFIX'-%H-%M-%S'
DATE_REGEX='[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]'
BASE_SUFFIX=".base"
PEND_SUFFIX=".pend"
SNAP_SUFFIX=".snap"
BACKUPS="backup"
BACKUPS_POOL="thin-pool"

export LVM_SUPPRESS_FD_WARNINGS=1

function terminate ()
{
echo "$1" >&2
exit 1
}

function wait_lock()
{
flock 98
}

function wait_lock_or_terminate()
{
echo "Wating for lock..."
wait_lock || terminate "Failed to get lock. Exiting..."
echo "Got lock..."
}

function suffix()
{
FORMATTED_DATE=$(date +"$DATE_FORMAT")
echo "$SNAP_SUFFIX.$FORMATTED_DATE"
}

function filter()
{
FORMATTED_DATE=$(date --date="$1" +"$DATE_PREFIX")
echo "$SNAP_SUFFIX.$FORMATTED_DATE"
}

function read_thin_id {
lvs --rows --reportformat basic --quiet -othin_id "$1/$2" | awk '{print $2}'
}

function read_pool_lv {
lvs --rows --reportformat basic --quiet -opool_lv "$1/$2" | awk '{print $2}'
}

function read_lv_dm_path {
lvs --rows --reportformat basic --quiet -olv_dm_path "$1/$2" | awk '{print $2}'
}

function read_lv_active {
lvs --rows --reportformat basic --quiet -olv_active "$1/$2" | awk '{print $2}'
}

function read_lv_chunk_size {
lvs --rows --reportformat basic --quiet --units b --nosuffix -ochunk_size "$1/$2" | awk '{print $2}'
}

function read_lv_size {
lvs --rows --reportformat basic --quiet --units b --nosuffix -olv_size "$1/$2" | awk '{print $2}'
}

function activate_volume {
lvchange -ay -Ky "$1/$2"
}

function deactivate_volume {
lvchange -an "$1/$2"
}

function read_thin_metadata_snap {
dmsetup status "$1" | awk '{print $7}'
}

function thindiff()
{
DIFF_VG="$1"
DIFF_SOURCE="$2"
DIFF_TARGET="$3"
DIFF_SOURCE_POOL=$(read_pool_lv $DIFF_VG $DIFF_SOURCE)
DIFF_TARGET_POOL=$(read_pool_lv $DIFF_VG $DIFF_TARGET)

if [ "$DIFF_SOURCE_POOL" == "" ] then
(>&2 echo "Source LV is not thin.")
exit 1
fi

if [ "$DIFF_TARGET_POOL" == "" ] then
(>&2 echo "Target LV is not thin.")
exit 1
fi

if [ "$DIFF_SOURCE_POOL" != "$DIFF_TARGET_POOL" ] then
(>&2 echo "Source and target LVs belong to different thin pools.")
exit 1
fi

DIFF_POOL_PATH=$(read_lv_dm_path $DIFF_VG $DIFF_SOURCE_POOL)
DIFF_SOURCE_ID=$(read_thin_id $DIFF_VG $DIFF_SOURCE)
DIFF_TARGET_ID=$(read_thin_id $DIFF_VG $DIFF_TARGET)
DIFF_POOL_PATH_TPOOL="$DIFF_POOL_PATH-tpool"
DIFF_POOL_PATH_TMETA="$DIFF_POOL_PATH"_tmeta
DIFF_POOL_METADATA_SNAP=$(read_thin_metadata_snap $DIFF_POOL_PATH_TPOOL)

if [ "$DIFF_POOL_METADATA_SNAP" != "-" ] then
(>&2 echo "Thin pool metadata snapshot already exist. Assuming stale one. Will release metadata snapshot in 5 seconds.")
sleep 5
dmsetup message $DIFF_POOL_PATH_TPOOL 0 release_metadata_snap
fi

dmsetup message $DIFF_POOL_PATH_TPOOL 0 reserve_metadata_snap
DIFF_POOL_METADATA_SNAP=$(read_thin_metadata_snap $DIFF_POOL_PATH_TPOOL)

if [ "$DIFF_POOL_METADATA_SNAP" == "-" ] then
(>&2 echo "Failed to create thin pool metadata snapshot.")
exit 1
fi

#We keep the output in a variable because the metadata snapshot needs to be released early.
DIFF_DATA=$(thin_delta -m$DIFF_POOL_METADATA_SNAP --snap1 $DIFF_SOURCE_ID --snap2 $DIFF_TARGET_ID $DIFF_POOL_PATH_TMETA)

dmsetup message $DIFF_POOL_PATH_TPOOL 0 release_metadata_snap

echo $"$DIFF_DATA" | grep -E 'different|left_only|right_only' | sed 's/</"/g' | sed 's/ /"/g' | awk -F'"' '{print $6 "t" $8 "t" $11}' | sed 's/different/copy/g' | sed 's/left_only/copy/g' | sed 's/right_only/discard/g'

}

function thinsync()
{
SYNC_VG="$1"
SYNC_PEND="$2"
SYNC_BASE="$3"
SYNC_TARGET="$4"
SYNC_PEND_POOL=$(read_pool_lv $SYNC_VG $SYNC_PEND)
SYNC_BLOCK_SIZE=$(read_lv_chunk_size $SYNC_VG $SYNC_PEND_POOL)
SYNC_PEND_PATH=$(read_lv_dm_path $SYNC_VG $SYNC_PEND)

activate_volume $SYNC_VG $SYNC_PEND

while read -r SYNC_ACTION SYNC_OFFSET SYNC_LENGTH ; do
SYNC_OFFSET_BYTES=$((SYNC_OFFSET * SYNC_BLOCK_SIZE))
SYNC_LENGTH_BYTES=$((SYNC_LENGTH * SYNC_BLOCK_SIZE))
if [ "$SYNC_ACTION" == "copy" ] then
ddrescue --quiet --force --input-position=$SYNC_OFFSET_BYTES --output-position=$SYNC_OFFSET_BYTES --size=$SYNC_LENGTH_BYTES "$SYNC_PEND_PATH" "$SYNC_TARGET"
fi

if [ "$SYNC_ACTION" == "discard" ] then
blkdiscard -o $SYNC_OFFSET_BYTES -l $SYNC_LENGTH_BYTES "$SYNC_TARGET"
fi
done < <(thindiff "$SYNC_VG" "$SYNC_PEND" "$SYNC_BASE")
}

function discard_volume()
{
DISCARD_VG="$1"
DISCARD_LV="$2"
DISCARD_LV_PATH=$(read_lv_dm_path "$DISCARD_VG" "$DISCARD_LV")
if [ "$DISCARD_LV_PATH" != "" ] then
echo "$DISCARD_LV_PATH found"
else
echo "$DISCARD_LV not found in $DISCARD_VG"
exit 1
fi
DISCARD_LV_POOL=$(read_pool_lv $DISCARD_VG $DISCARD_LV)
DISCARD_LV_SIZE=$(read_lv_size "$DISCARD_VG" "$DISCARD_LV")
lvremove -y --quiet "$DISCARD_LV_PATH" || exit 1
lvcreate --thin-pool "$DISCARD_LV_POOL" -V "$DISCARD_LV_SIZE"B --name "$DISCARD_LV" "$DISCARD_VG" || exit 1
}

function backup()
{
SOURCE_VG="$1"
SOURCE_LV="$2"
TARGET_VG="$BACKUPS"
TARGET_LV="$SOURCE_VG-$SOURCE_LV"
SOURCE_BASE_LV="$SOURCE_LV$BASE_SUFFIX"
TARGET_BASE_LV="$TARGET_LV$BASE_SUFFIX"
SOURCE_PEND_LV="$SOURCE_LV$PEND_SUFFIX"
TARGET_PEND_LV="$TARGET_LV$PEND_SUFFIX"
SOURCE_BASE_LV_PATH=$(read_lv_dm_path "$SOURCE_VG" "$SOURCE_BASE_LV")
SOURCE_PEND_LV_PATH=$(read_lv_dm_path "$SOURCE_VG" "$SOURCE_PEND_LV")
TARGET_BASE_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_BASE_LV")
TARGET_PEND_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_PEND_LV")

if [ "$SOURCE_BASE_LV_PATH" != "" ] then
echo "$SOURCE_BASE_LV_PATH found"
else
echo "Source base not found creating snapshot of $SOURCE_VG/$SOURCE_LV to $SOURCE_VG/$SOURCE_BASE_LV"
lvcreate --quiet --snapshot --name "$SOURCE_BASE_LV" "$SOURCE_VG/$SOURCE_LV" || exit 1
SOURCE_BASE_LV_PATH=$(read_lv_dm_path "$SOURCE_VG" "$SOURCE_BASE_LV")
activate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
echo "Discarding $SOURCE_BASE_LV_PATH as we need to bootstrap."
SOURCE_BASE_POOL=$(read_pool_lv $SOURCE_VG $SOURCE_BASE_LV)
SOURCE_BASE_CHUNK_SIZE=$(read_lv_chunk_size $SOURCE_VG $SOURCE_BASE_POOL)
discard_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
sync
if [ "$TARGET_BASE_LV_PATH" != "" ] then
echo "$TARGET_BASE_LV_PATH found out of sync with source... removing..."
lvremove -y --quiet $TARGET_BASE_LV_PATH || exit 1
TARGET_BASE_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_BASE_LV")
sync
fi
fi
SOURCE_BASE_SIZE=$(read_lv_size "$SOURCE_VG" "$SOURCE_BASE_LV")
if [ "$TARGET_BASE_LV_PATH" != "" ] then
echo "$TARGET_BASE_LV_PATH found"
else
echo "$TARGET_VG/$TARGET_LV not found. Creating empty volume."
lvcreate --thin-pool "$BACKUPS_POOL" -V "$SOURCE_BASE_SIZE"B --name "$TARGET_BASE_LV" "$TARGET_VG" || exit 1
echo "Have to rebootstrap. Discarding source at $SOURCE_BASE_LV_PATH"
activate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
SOURCE_BASE_POOL=$(read_pool_lv $SOURCE_VG $SOURCE_BASE_LV)
SOURCE_BASE_CHUNK_SIZE=$(read_lv_chunk_size $SOURCE_VG $SOURCE_BASE_POOL)
discard_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
TARGET_BASE_POOL=$(read_pool_lv $TARGET_VG $TARGET_BASE_LV)
TARGET_BASE_CHUNK_SIZE=$(read_lv_chunk_size $TARGET_VG $TARGET_BASE_POOL)
TARGET_BASE_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_BASE_LV")
echo "Discarding target at $TARGET_BASE_LV_PATH"
discard_volume "$TARGET_VG" "$TARGET_BASE_LV"
sync
fi
if [ "$SOURCE_PEND_LV_PATH" != "" ] then
echo "$SOURCE_PEND_LV_PATH found removing..."
lvremove -y --quiet "$SOURCE_PEND_LV_PATH" || exit 1
sync
fi
lvcreate --quiet --snapshot --name "$SOURCE_PEND_LV" "$SOURCE_VG/$SOURCE_LV" || exit 1
SOURCE_PEND_LV_PATH=$(read_lv_dm_path "$SOURCE_VG" "$SOURCE_PEND_LV")
sync
if [ "$TARGET_PEND_LV_PATH" != "" ] then
echo "$TARGET_PEND_LV_PATH found removing..."
lvremove -y --quiet $TARGET_PEND_LV_PATH
sync
fi
lvcreate --quiet --snapshot --name "$TARGET_PEND_LV" "$TARGET_VG/$TARGET_BASE_LV" || exit 1
TARGET_PEND_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_PEND_LV")
SOURCE_PEND_LV_SIZE=$(read_lv_size "$SOURCE_VG" "$SOURCE_PEND_LV")
lvresize -L "$SOURCE_PEND_LV_SIZE"B "$TARGET_PEND_LV_PATH"
activate_volume "$TARGET_VG" "$TARGET_PEND_LV"
echo "Synching $SOURCE_PEND_LV_PATH to $TARGET_PEND_LV_PATH"
thinsync "$SOURCE_VG" "$SOURCE_PEND_LV" "$SOURCE_BASE_LV" "$TARGET_PEND_LV_PATH" || exit 1
sync

TARGET_DATE_SUFFIX=$(suffix)
lvcreate --quiet --snapshot --name "$TARGET_LV$TARGET_DATE_SUFFIX" "$TARGET_VG/$TARGET_PEND_LV" || exit 1
sync
lvremove --quiet -y "$SOURCE_BASE_LV_PATH" || exit 1
sync
lvremove --quiet -y "$TARGET_BASE_LV_PATH" || exit 1
sync
lvrename -y "$SOURCE_VG/$SOURCE_PEND_LV" "$SOURCE_BASE_LV" || exit 1
lvrename -y "$TARGET_VG/$TARGET_PEND_LV" "$TARGET_BASE_LV" || exit 1
sync
deactivate_volume "$TARGET_VG" "$TARGET_BASE_LV"
deactivate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
}

function verify()
{
SOURCE_VG="$1"
SOURCE_LV="$2"
TARGET_VG="$BACKUPS"
TARGET_LV="$SOURCE_VG-$SOURCE_LV"
SOURCE_BASE_LV="$SOURCE_LV$BASE_SUFFIX"
TARGET_BASE_LV="$TARGET_LV$BASE_SUFFIX"
TARGET_BASE_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_BASE_LV")
SOURCE_BASE_LV_PATH=$(read_lv_dm_path "$SOURCE_VG" "$SOURCE_BASE_LV")

if [ "$SOURCE_BASE_LV_PATH" != "" ] then
echo "$SOURCE_BASE_LV_PATH found"
else
echo "$SOURCE_BASE_LV_PATH not found"
exit 1
fi
if [ "$TARGET_BASE_LV_PATH" != "" ] then
echo "$TARGET_BASE_LV_PATH found"
else
echo "$TARGET_BASE_LV_PATH not found"
exit 1
fi
activate_volume "$TARGET_VG" "$TARGET_BASE_LV"
activate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
echo Comparing "$SOURCE_BASE_LV_PATH" with "$TARGET_BASE_LV_PATH"
cmp "$SOURCE_BASE_LV_PATH" "$TARGET_BASE_LV_PATH"
echo Done...
deactivate_volume "$TARGET_VG" "$TARGET_BASE_LV"
deactivate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
}

function resync()
{
SOURCE_VG="$1"
SOURCE_LV="$2"
TARGET_VG="$BACKUPS"
TARGET_LV="$SOURCE_VG-$SOURCE_LV"
SOURCE_BASE_LV="$SOURCE_LV$BASE_SUFFIX"
TARGET_BASE_LV="$TARGET_LV$BASE_SUFFIX"
TARGET_BASE_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_BASE_LV")
SOURCE_BASE_LV_PATH=$(read_lv_dm_path "$SOURCE_VG" "$SOURCE_BASE_LV")

if [ "$SOURCE_BASE_LV_PATH" != "" ] then
echo "$SOURCE_BASE_LV_PATH found"
else
echo "$SOURCE_BASE_LV_PATH not found"
exit 1
fi
if [ "$TARGET_BASE_LV_PATH" != "" ] then
echo "$TARGET_BASE_LV_PATH found"
else
echo "$TARGET_BASE_LV_PATH not found"
exit 1
fi
activate_volume "$TARGET_VG" "$TARGET_BASE_LV"
activate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
SOURCE_BASE_POOL=$(read_pool_lv $SOURCE_VG $SOURCE_BASE_LV)
SYNC_BLOCK_SIZE=$(read_lv_chunk_size $SOURCE_VG $SOURCE_BASE_POOL)

echo Synchronizing "$SOURCE_BASE_LV_PATH" to "$TARGET_BASE_LV_PATH"

CMP_OFFSET=0
while [[ "$CMP_OFFSET" != "" ]] ; do
CMP_MISMATCH=$(cmp -i "$CMP_OFFSET" "$SOURCE_BASE_LV_PATH" "$TARGET_BASE_LV_PATH" | grep differ | awk '{print $5}' | sed 's/,//g' )
if [[ "$CMP_MISMATCH" != "" ]] ; then
CMP_OFFSET=$(( CMP_MISMATCH + CMP_OFFSET ))
SYNC_OFFSET_BYTES=$(( ( CMP_OFFSET / SYNC_BLOCK_SIZE ) * SYNC_BLOCK_SIZE ))
SYNC_LENGTH_BYTES=$(( SYNC_BLOCK_SIZE ))
echo "Synching $SYNC_LENGTH_BYTES bytes at $SYNC_OFFSET_BYTES from $SOURCE_BASE_LV_PATH to $TARGET_BASE_LV_PATH"
ddrescue --quiet --force --input-position=$SYNC_OFFSET_BYTES --output-position=$SYNC_OFFSET_BYTES --size=$SYNC_LENGTH_BYTES "$SOURCE_BASE_LV_PATH" "$TARGET_BASE_LV_PATH"
else
CMP_OFFSET=""
fi
done
echo Done...
deactivate_volume "$TARGET_VG" "$TARGET_BASE_LV"
deactivate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
}

function list()
{
LIST_SOURCE_VG="$1"
LIST_SOURCE_LV="$2"
LIST_TARGET_VG="$BACKUPS"
LIST_TARGET_LV="$LIST_SOURCE_VG-$LIST_SOURCE_LV"
LIST_TARGET_BASE_LV="$LIST_TARGET_LV$SNAP_SUFFIX"
lvs -olv_name | grep "$LIST_TARGET_BASE_LV.$DATE_REGEX"
}

function remove()
{
REMOVE_TARGET_VG="$BACKUPS"
REMOVE_TARGET_LV="$1"
lvremove -y "$REMOVE_TARGET_VG/$REMOVE_TARGET_LV"
sync
}

function removeall()
{
DATE_OFFSET="$3"
FILTER="$(filter "$DATE_OFFSET")"
while read -r SNAPSHOT ; do
remove "$SNAPSHOT"
done < <(list "$1" "$2" | grep "$FILTER")

}

(
COMMAND="$1"
shift

case "$COMMAND" in
"--help")
echo "Help"
;;
"suffix")
suffix
;;
"filter")
filter "$1"
;;
"backup")
wait_lock_or_terminate
backup "$1" "$2"
;;
"list")
list "$1" "$2"
;;
"thindiff")
thindiff "$1" "$2" "$3"
;;
"thinsync")
thinsync "$1" "$2" "$3" "$4"
;;
"verify")
wait_lock_or_terminate
verify "$1" "$2"
;;
"resync")
wait_lock_or_terminate
resync "$1" "$2"
;;
"remove")
wait_lock_or_terminate
remove "$1"
;;
"removeall")
wait_lock_or_terminate
removeall "$1" "$2" "$3"
;;
*)
echo "None.."
;;
esac
) 98>$LOCK_FILE

EOF

What does it do? It contains a set of commands for manipulating thin snapshots and for synchronizing the difference between two thin snapshots, obtained via thin_delta, to another block device using ddrescue and blkdiscard.
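For orientation, here is an illustrative way to call the subcommands by hand; the VG/LV names and the retention argument are only examples matching the listings further below:

#/root/lvm-thin-backup/lvm-thin-backup.sh backup images linux-dev
#/root/lvm-thin-backup/lvm-thin-backup.sh list images linux-dev
#/root/lvm-thin-backup/lvm-thin-backup.sh verify images linux-dev
#/root/lvm-thin-backup/lvm-thin-backup.sh removeall images linux-dev "-60 days"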

Another script that we will cram into cron:

Some more bash
#cat >/root/lvm-thin-backup/cron-daily.sh << 'EOF'
#!/bin/bash
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

SCRIPT_FILE="$(realpath $0)"
SCRIPT_DIR="$(dirname $SCRIPT_FILE)"
SCRIPT_NAME="$(basename -s .sh $SCRIPT_FILE)"

BACKUP_SCRIPT="$SCRIPT_DIR/lvm-thin-backup.sh"
RETENTION="-60 days"

$BACKUP_SCRIPT backup images linux-dev
$BACKUP_SCRIPT backup images win8
$BACKUP_SCRIPT backup images win8-data
#etc

$BACKUP_SCRIPT removeall images linux-dev "$RETENTION"
$BACKUP_SCRIPT removeall images win8 "$RETENTION"
$BACKUP_SCRIPT removeall images win8-data "$RETENTION"
#etc

EOF

What does it do? It uses the previous script to create and synchronize backups of the listed thin volumes. The script leaves inactive snapshots of the listed volumes behind, which are needed to track changes since the last synchronization.

This script needs to be edited with the list of thin volumes that should be backed up; the names given are examples only. If desired, you can write a loop that synchronizes all volumes automatically; a rough sketch follows.
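For example, the explicit list in cron-daily.sh could be replaced with something like the sketch below. It assumes VG "images" and the thin pool "thin-pool" from the listings further down, and it assumes that only the helper volumes (.base, .snap.*) contain a dot in their names:

#!/bin/bash
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
BACKUP_SCRIPT="/root/lvm-thin-backup/lvm-thin-backup.sh"
RETENTION="-60 days"
# back up every thin volume of "thin-pool" whose name contains no dot,
# so the .base/.snap helper volumes are skipped
while read -r LV_NAME ; do
"$BACKUP_SCRIPT" backup images "$LV_NAME"
"$BACKUP_SCRIPT" removeall images "$LV_NAME" "$RETENTION"
done < <(lvs --noheadings -o lv_name -S 'pool_lv="thin-pool"' images | tr -d ' ' | grep -v '\.')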

Let's make the scripts executable:

#chmod +x /root/lvm-thin-backup/cron-daily.sh
#chmod +x /root/lvm-thin-backup/lvm-thin-backup.sh

Let's check it and put it in cron:

#/usr/bin/nice -n 19 /usr/bin/ionice -c 3 /root/lvm-thin-backup/cron-daily.sh 2>&1 | /usr/bin/logger -t lvm-thin-backup
#cat /var/log/syslog | grep lvm-thin-backup
#crontab -e
0 3 * * * /usr/bin/nice -n 19 /usr/bin/ionice -c 3 /root/lvm-thin-backup/cron-daily.sh 2>&1 | /usr/bin/logger -t lvm-thin-backup

The first run will take a long time, because the thin volumes will be synchronized in full by copying all of the used space. Thanks to the LVM thin metadata we know which blocks are actually in use, so only the really used blocks of the thin volumes will be copied.

Subsequent runs will copy the data incrementally thanks to change tracking through the LVM thin metadata.
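To get a rough idea of how much space the base copies and the incremental snapshots actually occupy in the backup pool, you can, for example, look at the data_percent column, which for thin volumes shows what fraction of the virtual size is really allocated:

#lvs -o lv_name,lv_size,data_percent backup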

Let's see what happened:

#time /root/btrfs-backup/cron-daily.sh
real 0m2,967s
user 0m0,225s
sys 0m0,353s

#time /root/lvm-thin-backup/cron-daily.sh
real 1m2,710s
user 0m12,721s
sys 0m6,671s

#ls -al /backup/btrfs/back/remote/*
/backup/btrfs/back/remote/boot:
total 0
drwxr-xr-x 1 root root 1260 Mar 26 09:11 .
drwxr-xr-x 1 root root 16 Mar 6 09:30 ..
drwxr-xr-x 1 root root 322 Mar 26 02:00 .@base
drwxr-xr-x 1 root root 516 Mar 6 09:39 [email protected]
drwxr-xr-x 1 root root 516 Mar 6 09:39 [email protected]
...
/backup/btrfs/back/remote/root:
total 0
drwxr-xr-x 1 root root 2820 Mar 26 09:11 .
drwxr-xr-x 1 root root 16 Mar 6 09:30 ..
drwxr-xr-x 1 root root 240 Mar 26 09:11 @.@base
drwxr-xr-x 1 root root 22 Mar 26 09:11 @home.@base
drwxr-xr-x 1 root root 22 Mar 6 09:39 @[email protected]
drwxr-xr-x 1 root root 22 Mar 6 09:39 @[email protected]
...
drwxr-xr-x 1 root root 240 Mar 6 09:39 @[email protected]
drwxr-xr-x 1 root root 240 Mar 6 09:39 @[email protected]
...

#lvs -olv_name,lv_size images && lvs -olv_name,lv_size backup
LV LSize
linux-dev 128,00g
linux-dev.base 128,00g
thin-pool 1,38t
win8 128,00g
win8-data 2,00t
win8-data.base 2,00t
win8.base 128,00g
LV LSize
backup 256,00g
images-linux-dev.base 128,00g
images-linux-dev.snap.2020-03-08-10-09-11 128,00g
images-linux-dev.snap.2020-03-08-10-09-25 128,00g
...
images-win8-data.base 2,00t
images-win8-data.snap.2020-03-16-14-11-55 2,00t
images-win8-data.snap.2020-03-16-14-19-50 2,00t
...
images-win8.base 128,00g
images-win8.snap.2020-03-17-04-51-46 128,00g
images-win8.snap.2020-03-18-03-02-49 128,00g
...
thin-pool <2,09t

And what about the matryoshka dolls?

As you have most likely already guessed, LVM LVs can themselves serve as LVM physical volumes (PVs) for other VGs. LVM can be nested recursively, like matryoshka dolls. This gives LVM extreme flexibility.
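A minimal sketch of such nesting (all names here are hypothetical); note that, depending on the LVM version, you may need to allow scanning of LVs as PVs via the devices/scan_lvs option in lvm.conf:

#lvcreate -L 100G --name outer-lv outer-vg
#pvcreate /dev/outer-vg/outer-lv
#vgcreate inner-vg /dev/outer-vg/outer-lv
#lvcreate -L 50G --name inner-lv inner-vg
#mkfs.ext4 /dev/inner-vg/inner-lv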

PS

In the next article, we will try to use several similar mobile storage/KVM boxes as the basis for building a geo-distributed storage/VM cluster with redundancy across several continents, using home desktops, home Internet connections and P2P networks.

Source: habr.com
