Backup Part 3: Review and testing of duplicity, duplicati

This note discusses backup tools that perform backups by creating archives on a backup storage server.

Of the tools that meet the requirements, two are covered here: duplicity (which has a nice front-end in the form of deja dup) and duplicati.

Another very remarkable backup tool is dar, but since it has an extremely extensive list of options - the testing methodology would cover hardly 10% of what it is capable of - we do not test it within the current series.

Expected results

Since both candidates create archives in one way or another, plain tar can be used as a baseline.

Additionally, let's evaluate how well data storage on the backup storage server is optimized by creating backups that contain only the difference between a full copy and the current state of the files, or between the previous and current archives (incremental, decremental, etc.).

Behavior when creating backups:

  1. A relatively small number of files on the backup storage server (comparable to the number of backups or to the data size in GB), but each file is quite large (tens to hundreds of megabytes).
  2. The repository will contain only changes - no duplicates will be kept - so it will be smaller than with rsync-based software.
  3. High CPU usage is expected when compression and/or encryption is used, and probably a fairly heavy load on the network and disk subsystem if the archiving and/or encryption process runs on the backup storage server.

As a reference value, run the following command:

cd /src/dir; tar -cf - * | ssh backup_server "cat > /backup/dir/archive.tar"

The execution results are as follows:

[Graph: tar over ssh, no compression]

Runtime: 3m12s. As in the rsync example, the speed was limited by the disk subsystem of the backup storage server, only a little faster, because writing goes to a single file.

Also, to evaluate compression, let's run the same variant, but with compression enabled on the backup server side:

cd /src/dir; tar -cf - * | ssh backup_server "gzip > /backup/dir/archive.tgz"

The results are as follows:

[Graph: tar over ssh, gzip on the backup server]

Runtime: 10m11s. Most likely, the bottleneck is the single-threaded compressor on the receiving side.

Now the same command, but with compression moved to the server holding the source data, to test the hypothesis that the single-threaded compressor is the bottleneck:

cd /src/dir; tar -czf - * | ssh backup_server "cat > /backup/dir/archive.tgz"

It turned out like this:

[Graph: tar over ssh, gzip on the source server]

The execution time was 9m37s. One core is clearly saturated by the compressor, since network throughput and the load on the source disk subsystem are similar to the previous run.

To evaluate encryption, you can use openssl or gpg by including an additional openssl or gpg command in the pipe. For reference, here is such a command:

cd /src/dir; tar -cf - * | ssh backup_server "gzip | openssl enc -e -aes256 -pass pass:somepassword -out /backup/dir/archive.tgz.enc"

The results came out like this:

[Graph: tar over ssh, gzip plus openssl encryption]

The execution time turned out to be 10m30s, with 2 processes running on the receiving side: the bottleneck is again the single-threaded compressor, plus a small overhead for encryption.
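
The gpg-based variant mentioned above was not part of the measured runs; purely for illustration, it might look roughly like this (the passphrase handling via --batch/--passphrase is an assumption and may need adjusting for gpg2):

cd /src/dir; tar -cf - * | ssh backup_server "gzip | gpg --batch --symmetric --passphrase somepassword -o /backup/dir/archive.tgz.gpg"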

UPD: At the request of bliznezz, I am adding tests with pigz. With the compressor alone it took 6m30s; with encryption added as well, about 7m. The dip in the bottom graph is an unflushed disk cache:

[Graph: tests with pigz]
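
The exact pigz invocations are not shown here; most likely they are the same pipelines with gzip replaced by pigz on the receiving side, roughly:

cd /src/dir; tar -cf - * | ssh backup_server "pigz > /backup/dir/archive.tgz"
cd /src/dir; tar -cf - * | ssh backup_server "pigz | openssl enc -e -aes256 -pass pass:somepassword -out /backup/dir/archive.tgz.enc"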

duplicity testing

Duplicity is backup software written in Python that creates encrypted tar archives.

For incremental archives, librsync is used, so you can expect the behavior described in the previous note of this series.

Backups can be encrypted and signed using gnupg, which is important when various providers are used for storing backups (s3, backblaze, gdrive, etc.).
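
The article does not show the exact invocations; a minimal duplicity run against an ssh-accessible backup server looks roughly like the following sketch, where the first command runs without encryption and the second encrypts and signs with gnupg (the 512 MB volume size matches the graphs below, the key id is a placeholder, and the remaining defaults are assumed):

duplicity --no-encryption --volsize 512 /src/dir sftp://backup_server//backup/dir
duplicity --encrypt-key KEYID --sign-key KEYID --volsize 512 /src/dir sftp://backup_server//backup/dir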

Let's see what the results will be:

These are the results obtained when running without encryption:

[Graph: duplicity without encryption]

Running time of each test run:

Launch 1    Launch 2    Launch 3
16m33s      17m20s      16m30s
8m29s       9m3s        8m45s
5m21s       6m04s       5m53s

And here are the results when gnupg encryption is enabled, with a key size of 2048 bits:

[Graph: duplicity with gnupg encryption]

Running time on the same data, with encryption:

Launch 1    Launch 2    Launch 3
17m22s      17m32s      17m28s
8m52s       9m13s       9m3s
5m48s       5m40s       5m30s

The block size was set to 512 megabytes, which is clearly visible on the graphs; CPU utilization stayed at around 50%, which means the program uses no more than one processor core.

The principle of the program's operation is also quite clear from the graphs: it takes a chunk of data, compresses it, and sends it to the backup storage server, which can be quite slow.
Another feature is the predictable running time, which depends only on the size of the changed data.

Enabling encryption did not significantly increase the running time and raised the processor load by only about 10%, which is a very nice bonus.

Unfortunately, the program was not able to correctly detect the directory rename, so the resulting growth of the repository turned out to be equal to the size of the change (i.e. all 18GB); however, the ability to use an untrusted server for backups clearly outweighs this behavior.

duplicati testing

This software is written in C# and runs using a set of libraries from Mono. There is a GUI as well as a CLI version.

The rough list of main features is close to duplicity, including support for various backup storage providers; however, unlike duplicity, most of the features are available without third-party tools. Whether this is a plus or a minus depends on the specific case, but for beginners it is probably easier to have the whole list of features in front of them at once than to install additional Python packages, as is the case with duplicity.

Another small nuance is that the program actively writes a local sqlite database on behalf of the user who starts the backup, so when working from the CLI you additionally have to make sure the correct database is specified every time the process is started (see the sketch below). When working through the GUI or web GUI, these details are hidden from the user.
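
For reference, a CLI run in the spirit of these tests might look roughly like this, with the three variants corresponding to no encryption, the built-in aes module, and the external gnupg program (option names are taken from Duplicati 2; the database path, passphrase and ssh URL form are illustrative assumptions, since the exact commands are not shown here):

duplicati-cli backup ssh://backup_server//backup/dir /src/dir --dbpath=/root/duplicati.sqlite --no-encryption
duplicati-cli backup ssh://backup_server//backup/dir /src/dir --dbpath=/root/duplicati.sqlite --passphrase=somepassword
duplicati-cli backup ssh://backup_server//backup/dir /src/dir --dbpath=/root/duplicati.sqlite --passphrase=somepassword --encryption-module=gpg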

Let's see what indicators this solution can give:

If you turn off encryption (the web GUI does not recommend doing this), the results are as follows:

[Graph: duplicati without encryption]

Running time:

Launch 1    Launch 2    Launch 3
20m43s      20m13s      20m28s
5m21s       5m40s       5m35s
7m36s       7m54s       7m49s

With encryption enabled, using aes, it looks like this:

[Graph: duplicati with aes encryption]

Running time:

Launch 1    Launch 2    Launch 3
29m9s       30m1s       29m54s
5m29s       6m2s        5m54s
8m44s       9m12s       9m1s

And if you use the external program gnupg, you get the following results:

[Graph: duplicati with external gnupg]

Launch 1    Launch 2    Launch 3
26m6s       26m35s      26m17s
5m20s       5m48s       5m40s
8m12s       8m42s       8m15s

As you can see, the program can work in several threads, but this does not make it a more productive solution; and comparing the encryption options, launching the external program turned out to be faster than using the library from the Mono set. This may be because the external program is better optimized.

Another pleasant discovery was that the repository grows by exactly as much as the data actually changed: duplicati detected the directory rename and handled the situation correctly. This can be seen in the second test.

In general, the program leaves a fairly positive impression, including being reasonably friendly to beginners.

The results

Both candidates worked rather slowly, but in general, compared to plain tar, there is progress, at least for duplicati. The price of that progress is also clear - a noticeable CPU load. Overall, there were no particular deviations from the predicted results.

Conclusions

If there is no need to hurry anywhere, and there is CPU headroom to spare, either of the considered solutions will do; in any case, quite a lot of work has already been done that should not be repeated by writing wrapper scripts around tar. Encryption is a very necessary property if the backup storage server cannot be fully trusted.

Compared to rsync-based solutions, performance can be several times worse, even though plain tar worked 20-30% faster than rsync.
There are savings in repository size, but only with duplicati.

Announcement

Backup, part 1: Why backup is needed, an overview of methods, technologies
Backup Part 2: Reviewing and testing rsync-based backup tools
Backup Part 3: Review and testing of duplicity, duplicati, deja dup
Backup Part 4: Reviewing and testing zbackup, restic, borgbackup
Backup Part 5: Testing bacula and veeam backup for linux
Backup Part 6: Comparing Backup Tools
Backup Part 7: Conclusions

Post Author: Pavel Demkovich

Source: habr.com
