Backup storage for thousands of virtual machines with free tools

Hello! I recently faced an interesting task: setting up storage for backing up a large number of block devices.

Every week we back up all the virtual machines in our cloud, so the storage has to handle thousands of backups as quickly and efficiently as possible.

Unfortunately, the standard RAID5 and RAID6 configurations won't work here: on disks as large as ours the rebuild process would be painfully long and would most likely never finish.

Let's take a look at the alternatives:

Erasure coding - similar to RAID5 and RAID6, but with a configurable parity level. Here redundancy is applied not block by block but per object. The easiest way to try erasure coding is to deploy minio.

DRAID - a ZFS feature that has not been released yet. Unlike RAIDZ, DRAID uses distributed parity and, during resilvering, engages all the disks of the array at once, thanks to which it tolerates disk failures better and recovers faster after a failure.

The server available was a Fujitsu Primergy RX300 S7 with an Intel Xeon CPU E5-2650L 0 @ 1.80GHz, nine sticks of Samsung DDR3-1333 8GB PC3L-10600R ECC Registered RAM (M393B1K70DH0-YH9), a Supermicro SuperChassis 847E26-RJBOD1 disk shelf connected via a Dual LSI SAS2X36 Expander, and 45 Seagate ST6000NM0115-1YZ110 disks of 6TB each.

Before we decide anything, we first need to properly test everything.

To do this, I prepared and tested various configurations, using minio as the S3 backend and running it in different modes with different numbers of targets.

Essentially, I compared minio in erasure coding mode against minio on top of software RAID with the same number of disks and parity disks: RAID6, RAIDZ2 and DRAID2.

For reference: when you run minio with a single target, it works in S3 gateway mode, exposing your local file system as S3 storage. If you start minio with several targets, erasure coding mode is enabled automatically, spreading the data across the targets and providing fault tolerance.

By default, minio divides the targets into groups of 16 disks with 2 parity disks per group, i.e. two drives can fail at the same time without data loss.
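
For illustration, this is roughly how the two modes are launched; the mount points and the number of disks are just placeholders for this setup:

# single target: minio simply exposes the local file system as S3
minio server /mnt/storage

# 16 targets: erasure coding mode is enabled automatically across the disks
minio server /mnt/disk{1...16}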

For performance testing, I used 16 disks of 6TB each and wrote small 1MB objects to them. This closely matches our future workload, since all modern backup tools split data into blocks of a few megabytes and write them that way.

The benchmark was run with the s3bench utility, launched on a remote server; it sends tens of thousands of such objects to minio in a hundred threads and then requests them back in the same way.
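
For reference, an s3bench run for this scenario could look roughly like the following; the endpoint, credentials and bucket are placeholders, and the flag names follow the s3bench README from memory, so treat them as approximate:

# 1MB objects, 100 parallel clients, tens of thousands of samples
./s3bench -endpoint=http://backup-host:9000 \
  -accessKey=minioadmin -accessSecret=minioadmin \
  -bucket=bench -objectNamePrefix=bench \
  -objectSize=1048576 -numClients=100 -numSamples=10000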

The benchmark results are shown in the following table:

[benchmark results table]

As we can see, minio in its own erasure coding mode works much worse for writing than minio running on top of software RAID6, RAIDZ2 and DRAID2 in the same configuration.

Separately, I was asked to test minio on ext4 vs XFS. Surprisingly, for my type of workload XFS turned out to be significantly slower than ext4.

In the first batch of tests, mdadm outperformed ZFS, but later gmelikov suggested that ZFS performance can be improved by setting the following options:

xattr=sa atime=off recordsize=1M

and after that tests with ZFS got much better.
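
These are ordinary dataset properties; assuming the pool is called data (as in the zpool status output below), they are applied like this:

# store extended attributes directly in inodes
zfs set xattr=sa data
# do not update access times on every read
zfs set atime=off data
# 1M records to match the ~1MB objects written by backup tools
zfs set recordsize=1M data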

You can also notice that DRAID does not provide much performance gain over RAIDZ, but in theory it should be much safer.

In the last two tests, I also tried moving the metadata (special) and the ZIL (log) onto a mirror of SSDs. Moving the metadata off did not give much gain in write speed, and with the ZIL moved off my SSDSC2KI128G8 hit the ceiling at 100% utilization, so I consider that test a failure. I do not rule out that faster SSDs could have improved the results considerably, but unfortunately I did not have any.
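
For reference, adding such devices to an existing pool looks roughly like this (the pool here is called data, as later in the article; the SSD device names are placeholders):

# dedicated metadata (special allocation class) on an SSD mirror
zpool add data special mirror /dev/sdX /dev/sdY
# separate ZIL (SLOG) on an SSD mirror
zpool add data log mirror /dev/sdX /dev/sdY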

In the end, I decided to stick with DRAID: despite its beta status, it is the fastest and most efficient storage solution in our case.

I created a simple DRAID2 in a configuration with three groups and two distributed spares:

# zpool status data
  pool: data
 state: ONLINE
  scan: none requested
config:

    NAME                 STATE     READ WRITE CKSUM
    data                 ONLINE       0     0     0
      draid2:3g:2s-0     ONLINE       0     0     0
        sdy              ONLINE       0     0     0
        sdam             ONLINE       0     0     0
        sdf              ONLINE       0     0     0
        sdau             ONLINE       0     0     0
        sdab             ONLINE       0     0     0
        sdo              ONLINE       0     0     0
        sdw              ONLINE       0     0     0
        sdak             ONLINE       0     0     0
        sdd              ONLINE       0     0     0
        sdas             ONLINE       0     0     0
        sdm              ONLINE       0     0     0
        sdu              ONLINE       0     0     0
        sdai             ONLINE       0     0     0
        sdaq             ONLINE       0     0     0
        sdk              ONLINE       0     0     0
        sds              ONLINE       0     0     0
        sdag             ONLINE       0     0     0
        sdi              ONLINE       0     0     0
        sdq              ONLINE       0     0     0
        sdae             ONLINE       0     0     0
        sdz              ONLINE       0     0     0
        sdan             ONLINE       0     0     0
        sdg              ONLINE       0     0     0
        sdac             ONLINE       0     0     0
        sdx              ONLINE       0     0     0
        sdal             ONLINE       0     0     0
        sde              ONLINE       0     0     0
        sdat             ONLINE       0     0     0
        sdaa             ONLINE       0     0     0
        sdn              ONLINE       0     0     0
        sdv              ONLINE       0     0     0
        sdaj             ONLINE       0     0     0
        sdc              ONLINE       0     0     0
        sdar             ONLINE       0     0     0
        sdl              ONLINE       0     0     0
        sdt              ONLINE       0     0     0
        sdah             ONLINE       0     0     0
        sdap             ONLINE       0     0     0
        sdj              ONLINE       0     0     0
        sdr              ONLINE       0     0     0
        sdaf             ONLINE       0     0     0
        sdao             ONLINE       0     0     0
        sdh              ONLINE       0     0     0
        sdp              ONLINE       0     0     0
        sdad             ONLINE       0     0     0
    spares
      s0-draid2:3g:2s-0  AVAIL   
      s1-draid2:3g:2s-0  AVAIL   

errors: No known data errors
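
For reference: in released OpenZFS (2.1 and later) a dRAID vdev is specified as draid[parity][:<data>d][:<children>c][:<spares>s], while the pre-release branch used here took a group-based spec (hence draid2:3g:2s in the output above), so the exact create command differs. A small sketch with the released syntax, for a hypothetical 14-disk pool called tank, could look like this:

# draid2: 4 data disks per redundancy group, 14 children, 2 distributed spares
zpool create tank draid2:4d:14c:2s sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn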

Well, the storage is sorted out; now let's talk about what to back up with. Here I want to cover three solutions that I managed to try:

Benji Backup - a fork of Backy2, a dedicated block-device backup solution with tight Ceph integration. It can take diffs between snapshots and build incremental backups from them. It supports a large number of storage backends, both local and S3, and requires a separate database for the deduplication hash table. Downsides: it is written in Python and has a somewhat sluggish CLI.

BorgBackup - a fork of Attic, a well-known and proven backup tool that deduplicates data well. It can store backups both locally and on a remote server over ssh, and it can back up block devices when launched with the --read-special flag. Downside: while a backup is being created the repository is completely locked, so it is recommended to create a separate repository for each virtual machine; in practice this is not a problem, since repositories are very easy to create.
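
A minimal sketch of the per-VM repository approach with a block device; the paths and device name are hypothetical:

# one repository per virtual machine
borg init --encryption=repokey /backup/vm1
# --read-special makes borg read the block device as if it were a regular file
borg create --read-special /backup/vm1::$(date +%F) /dev/mapper/vm1-disk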

Restic - an actively developed project written in Go; it is quite fast and supports a large number of storage backends, including local storage, sftp, S3 and many more. I would also note that there is a purpose-built rest-server for restic, which lets you quickly export storage for remote use. Of all the tools above, I liked it the most. It can back up from stdin. It has almost no noticeable drawbacks, but there are a few peculiarities:

  • Firstly, I tried using one shared repository for all virtual machines (as with Benji), and it even worked quite well, but restore operations took a very long time, because before every restore restic reads the metadata of all backups. The problem was easily solved, as with borg, by creating a separate repository for each virtual machine. This approach also turned out to be very convenient for managing backups: each repository can have its own access password, and we don't have to worry about the global repo somehow breaking. New repositories can be spawned just as easily as with borg backup.

    In any case, deduplication is performed only against the previous version of the backup, and the previous backup is determined by the path of the object being backed up; so if you back up different objects from stdin into a common repository, do not forget to specify the --stdin-filename option, or explicitly pass the --parent option each time (see the sketch after this list).

  • Secondly, restoring to stdout takes much longer than restoring to the file system, because of how the parallel restore works. Closer support for backing up block devices is planned for the future.

  • Thirdly, it is currently recommended to use the version from master, because version 0.9.6 has a bug that makes restoring large files take a very long time.
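
To illustrate the stdin workflow mentioned above, a per-VM repository could be used roughly like this; the repository path, device and file name are placeholders:

export RESTIC_PASSWORD=secret
restic -r /backup/vm1 init
# back up a block device via stdin; --stdin-filename keeps parent detection stable
dd if=/dev/mapper/vm1-disk bs=1M | restic -r /backup/vm1 backup --stdin --stdin-filename vm1.raw
# restore the image back to stdout and write it to the device
restic -r /backup/vm1 dump latest vm1.raw | dd of=/dev/mapper/vm1-disk bs=1M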

To test backup efficiency and the speed of writing to / restoring from a backup, I created a separate repository and backed up a small virtual machine image (21 GB). With each of the listed solutions, two backups were made without changing the source, to check how much faster / slower deduplicated data is copied.

[table: backup efficiency and write/restore speed for Benji, Borg and Restic]

As we can see, Borg Backup has the best initial backup efficiency, but loses in terms of both write and restore speed.

Restic turned out to be faster than Benji Backup, but it takes longer to restore to stdout, and, unfortunately, it still cannot write directly to a block device.

After weighing all the pros and cons, I settled on restic with rest-server as the most convenient and promising backup solution.
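
A minimal sketch of how this pair is wired together; the host, port and paths are placeholders:

# on the storage server: export /data/backups over HTTP for restic clients
rest-server --path /data/backups --listen :8000

# on a client: a per-VM repository through the rest backend
restic -r rest:http://backup-host:8000/vm1 init
restic -r rest:http://backup-host:8000/vm1 backup --stdin --stdin-filename vm1.raw < /dev/mapper/vm1-disk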

[screencast: several simultaneous backup operations]

In this screencast you can see how the 10-gigabit link is fully utilized while several backup operations run simultaneously. It is worth noting that disk utilization does not rise above 30%.

I was more than happy with the result!

Source: habr.com
