How GitLab helps you back up large NextCloud storages

Hey Habr!

Today I want to talk about our experience automating backups of large amounts of data in Nextcloud storages in different configurations. I work in the service team at Molniya AK, where we do configuration management of IT systems; Nextcloud is used for data storage, including distributed setups with redundancy.

The problem that stems from the nature of these installations is the sheer amount of data. The versioning that Nextcloud provides, redundancy, subjective reasons, and more create a lot of duplication.

Background

When administering Nextcloud, organizing an effective backup is an acute problem, and the backup must be encrypted, since the data is valuable.

We offer options for storing backups either on our side or at the customer's, on machines separate from Nextcloud, which requires a flexible, automated approach to administration.

There are many clients, each with a different configuration, on their own sites and with their own peculiarities. The standard technique, where the whole site belongs to you and backups are run from cron, does not fit well here.

First, let's look at the input data. We need:

  • Scalability, for one node or several. For large installations, we use minio as storage.
  • To find out about backup problems promptly.
  • To keep backups at the client's site and/or at ours.
  • To deal with problems quickly and easily.
  • Clients and installations differ a lot from one another, so uniformity cannot be achieved.
  • Recovery time must be minimal in two scenarios: a full recovery (disaster) and a single folder deleted by mistake.
  • Deduplication is required.


To solve the backup management problem, we bolted GitLab onto it. More on that below.

Of course, we are not the first to solve this kind of problem, but it seems to us that our hard-won practical experience may be interesting, and we are ready to share it.

Since our company has an open source policy, we were looking for an open source solution. In turn, we share our own developments and publish them. For example, on GitHub there is our plugin for Nextcloud, which we deploy to customers to improve data safety in case of accidental or intentional deletion.

Backup tools

We began by choosing a tool for creating the backups.

Plain tar + gzip does not work well: the data gets duplicated. An increment often contains very few actual changes, and most of the data within a single file is repeated.
There is another problem: the redundancy of the distributed data storage. We use minio, and its data is redundant by design. So either we had to back up through minio itself, loading it and going through all the layers between it and the file system, while also risking forgetting some buckets and meta-information, or we had to use deduplication.

Deduplicating backup tools are available in open source (there have been articles on this topic), and our finalists were Borg and Restic. Our comparison of the two applications is below, but first let's talk about how we organized the whole scheme.

Backup management

Borg and Restic are good, but neither product has a centralized management mechanism. For management and control, we chose a tool that we already had in place and without which we cannot imagine our work, including automation: the well-known CI/CD of GitLab.

The idea is as follows: a gitlab-runner is installed on every node that stores Nextcloud data. The runner runs a script on a schedule that supervises the backup process and starts Borg or Restic.

What did we get? Feedback from execution, convenient control over changes, details in case of an error.

Here on GitHub we posted examples of the script for various tasks, and we ended up attaching it to the backup not only of Nextcloud but also of many other services. There is also a scheduler there, in case you do not want to configure it by hand (and we do not), and a .gitlab-ci.yml.

The GitLab API does not yet allow changing the CI/CD job timeout, and the default there is small. It must be increased, say to 1d.

Fortunately, GitLab can run a job not only on a commit but also on a schedule, which is exactly what we need.
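
For illustration, a minimal sketch of such a scheduled job in .gitlab-ci.yml might look like this (the job name, runner tag, and script name here are only examples; the real files are in the repository mentioned above):

backup_nextcloud:
  tags:
    - nextcloud-node-1
  only:
    - schedules
  script:
    - ./backup-wrapper.sh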

Now about the wrapper script.

We have set the following conditions for this script:

  • It should be launched both by the runner and by hand from the console, with the same functionality.
  • There must be error handlers:
    • return code;
    • searching the log for a line (for example, for us an error may be a message that the program itself does not consider fatal);
    • timeout handling - the execution time must be reasonable.
  • We need a detailed log, but only in case of an error.
  • A number of checks are also performed before starting.
  • Small convenience features that we found useful during support:
    • the start and end are recorded in the local machine's syslog - this helps link system errors with backup runs;
    • part of the error log, if there is any, is printed to stdout, while the entire log is written to a separate file - it is convenient to look at CI right away and evaluate the error if it is trivial;
    • debug modes.

The full log is saved as an artifact in GitLab, if there is no error, then the log is deleted. The script is written in bash.
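
To show the overall structure, here is a heavily simplified sketch of how such a wrapper can be organized. This is not the published script: the variable names follow the lists below, while the borg invocation, paths and defaults are only illustrative.

#!/bin/bash
# simplified wrapper sketch; assumes BORG_REPO and BORG_PASSCOMMAND are already exported
LOG=/tmp/backup-$$.log
ERROR_STRING=${ERROR_STRING:-"ERROR"}
TIMEOUT=${TIMEOUT:-12h}
TAIL=${TAIL:-20}

logger "backup: start"                      # start mark goes to the local syslog

timeout "$TIMEOUT" borg create ::'{hostname}-{now}' /var/www/nextcloud/data >"$LOG" 2>&1
rc=$?
[ "$rc" -eq 124 ] && echo "backup timed out after $TIMEOUT"   # timeout(1) exits with 124

if [ "$rc" -ne 0 ] || grep -q "$ERROR_STRING" "$LOG"; then
    tail -n "$TAIL" "$LOG"                  # short extract to stdout for the CI view
    logger "backup: finished with errors"
    exit 1                                  # non-zero exit marks the CI job as failed
fi

[ "${SAVELOGSONSUCCES:-0}" = "1" ] || rm -f "$LOG"
logger "backup: finished ok"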

We will be happy to consider any suggestions and comments on open source - welcome.

How it works

A runner with a bash executor is launched on the backup node. According to the scheduler, a CI/CD job is started in a dedicated repository. The runner launches a universal wrapper script for such tasks; it checks the validity of the backup repository, the mount points and whatever else we want, then performs the backup and prunes old copies. The finished backup itself goes to S3.

We work according to this scheme with either an external provider - AWS or a Russian equivalent (which is faster, and the data does not leave the Russian Federation) - or a separate minio cluster that we deploy at the client's site for this purpose. We usually do the latter for security reasons, when the client does not want the data to leave their perimeter at all.

We did not use the option of sending the backup over ssh. It does not add security, and the network capacity of the S3 provider is much higher than that of any single one of our ssh machines.

To protect against an attacker on the local machine - after all, they could erase the data in S3 - you must enable bucket versioning.
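
For S3-compatible storage that supports it, versioning can be enabled, for example, with the AWS CLI (the endpoint and bucket name here are illustrative):

aws s3api put-bucket-versioning --endpoint-url https://storage.yandexcloud.net --bucket nextcloud-backups --versioning-configuration Status=Enabled
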
The backup itself is always encrypted.

Borg has a no-encryption mode, none, but we categorically do not recommend enabling it. In this mode, not only is there no encryption, but the checksum of what is written is not calculated either, which means that integrity can only be checked indirectly, via indexes.
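
So the repository should be initialized with one of the key-based modes, for example (the path matches the goofys mount used in the verification example below):

borg init --encryption=repokey /mnt/goofys/borg1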

On a separate schedule, the backups are checked for the integrity of indexes and contents. The check is slow and long, so we run it separately, once a month. It may take several days.

The readme is in Russian; below is a brief overview.

Main functions

  • prepare - preparatory actions.
  • testcheck - readiness check.
  • maincommand - the main command.
  • forcepostscript - a function that is executed at the end or on error. We use it to unmount the partition.

Service functions

  • cleanup - records errors or deletes the log file.
  • checklog - parses the log for a line matching the error string.
  • ret - exit handler.
  • checktimeout - timeout check.

Environment

  • VERBOSE=1 - print errors to the screen (stdout) immediately.
  • SAVELOGSONSUCCES=1 - save the log on success.
  • INIT_REPO_IF_NOT_EXIST=1 - create the repository if it does not exist. Disabled by default.
  • TIMEOUT - maximum time for the main operation. It can be suffixed with 'm', 'h' or 'd'.
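
For example, a manual run from the console might look like this (the wrapper file name here is only an example):

VERBOSE=1 TIMEOUT=1d INIT_REPO_IF_NOT_EXIST=1 ./backup-wrapper.sh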

Retention of old copies. Defaults:

  • KEEP_DAILY=7
  • KEEP_WEEKLY=4
  • KEEP_MONTHLY=6
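
These defaults correspond, roughly, to the standard borg prune retention flags:

borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /mnt/goofys/borg1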

Variables inside a script

  • ERROR_STRING - the string searched for in the log to detect an error.
  • EXTRACT_ERROR_STRING - the expression used to extract and display the error line.
  • KILL_TIMEOUT_SIGNAL - the signal used to kill the process on timeout.
  • TAIL - how many error lines to show on screen.
  • COLORMSG - message color (yellow by default).

The script that we conditionally call wordpress has a nice trick: it also backs up the MySQL database. This means it can be used for single-node Nextcloud installations, where you can back up the database at the same time. The convenience is not only that everything is in one place, but also that the contents of the database stay close to the contents of the files, since the time difference between them is minimal.
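
In a simplified form the idea is roughly the following (the database name and paths are illustrative; BORG_REPO and BORG_PASSCOMMAND are assumed to be exported):

mysqldump --single-transaction nextcloud > /var/backups/nextcloud-db.sql
borg create ::'{hostname}-{now}' /var/www/nextcloud/data /var/backups/nextcloud-db.sql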

Restic vs Borg

There are comparisons of Borg and Restic out there, including here on Habr, and our goal was not to make just one more, but to see how they behave on our data, with our specifics. Here is what we found.

Our selection criteria, in addition to those already mentioned (deduplication, fast recovery, etc.):

  • Resilience to interruption mid-run. Tested with kill -9.
  • Size on disk.
  • Resource requirements (CPU, memory).
  • The size of the stored blobs.
  • Working with S3.
  • Integrity checking.

For testing, we took real data from one client, 1.6 TB in total.

Test conditions:

Borg does not work with S3 directly, so we mounted the bucket as a FUSE disk via goofys. Restic talks to S3 natively.

Goofys works very fast and well, and it has a disk cache module, which speeds things up even more. It is in beta and, to be honest, it crashed with data loss in our (other) tests. But the convenience is that the backup procedure itself requires little reading and mostly writes, so we use the cache only during the integrity check.

To reduce the impact of the network, we used a local provider - Yandex Cloud.
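
For reference, the two tools were pointed at S3 roughly like this (the bucket name is illustrative and credentials come from environment variables; the goofys mount command itself is shown in the verification section below):

# Restic talks to S3 natively
restic -r s3:https://storage.yandexcloud.net/nextcloud-backups backup /var/www/nextcloud/data
# Borg writes to the bucket mounted via goofys
borg create /mnt/goofys/borg1::'{hostname}-{now}' /var/www/nextcloud/data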

Comparison test results.

  • Kill -9 with a subsequent restart: both handled it successfully.
  • Size on disk. Borg can compress, so the results are as expected.

    • Borg: 562 GB
    • Restic: 628 GB

  • CPU.
    Borg itself consumes little with the default compression, but it should be evaluated together with the goofys process. Taken together they are comparable and utilize about 1.2 cores on the same test virtual machine.
  • Memory. Restic uses about 0.5 GB, Borg about 200 MB. But all of this is insignificant compared to the system file cache, so it makes sense to allocate more memory anyway.
  • The difference in the size of the blobs was striking.

    • Borg: about 500 MB
    • Restic: about 5 MB

  • Working with S3 in Restic is great. Borg's work through goofys raises no questions either, but we noticed that it is desirable to do an umount at the end of the backup to fully flush the cache. A nice property of S3 is that incompletely uploaded chunks are never committed to the bucket, so partially uploaded data does not cause serious corruption.
  • The integrity check works well in both cases, but the speed differs significantly:
    Restic: 3.5 hours;
    Borg, with a 100 GB SSD file cache: 5 hours (roughly the same speed as when the data is on a local disk);
    Borg reading directly from S3 without a cache: 33 hours. Amazingly long.

The bottom line is that Borg can compress and produces larger blobs, which makes storage and GET/PUT operations in S3 cheaper. But this comes at the price of a more complex and slower check. As for recovery speed, we did not notice a difference. Restic makes subsequent backups (after the first one) somewhat longer, but not significantly.

The size of the community was also not the last factor in the choice.

And we chose Borg.

A few words about compression

Borg has a great newer compression algorithm in its arsenal: zstd. The compression quality is no worse than gzip, but it is much faster and comparable in speed to the default lz4.

For example, a MySQL database dump compresses twice as well as with lz4 at the same speed. However, experience with real data shows that there is very little difference in the compression ratio of a Nextcloud node.

Borg also has a nice bonus compression mode: if a file has high entropy, compression is not applied at all, which increases the speed. It is enabled with the option -C auto,zstd when creating the backup (for the zstd algorithm). With this option, compared with the default compression, we got 560 GB versus 562 GB. The data from the example above, let me remind you, is 628 GB without compression. The mere 2 GB difference surprised us a little, but we decided to go with auto,zstd anyway.
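
A create command with this option then looks roughly like this (paths are illustrative):

borg create -C auto,zstd /mnt/goofys/borg1::'{hostname}-{now}' /var/www/nextcloud/data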

Backup verification method

According to the scheduler, a virtual machine is launched directly at the provider's site or at the client's, which greatly reduces the network load. At the very least, it is cheaper than hosting the check ourselves and moving the traffic around.

# mount the bucket via goofys with a local disk cache
goofys --cache "--free:5%:/mnt/cache" -o allow_other --endpoint https://storage.yandexcloud.net --file-mode=0666 --dir-mode=0777 xxxxxxx.com /mnt/goofys
# the passphrase is read from a file rather than entered interactively
export BORG_PASSCOMMAND="cat /home/borg/.borg-passphrase"
# list the archives, then run the full integrity check with data verification
borg list /mnt/goofys/borg1/
borg check --debug -p --verify-data /mnt/goofys/borg1/

According to the same scheme, we scan the files with an antivirus (after the fact). After all, users upload all sorts of things to Nextcloud, and not everyone has an antivirus. Scanning at upload time would take too much time and would interfere with business.
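
One way to do such an after-the-fact scan is to mount an archive and run the scanner over it; a sketch, assuming clamav is installed and with an illustrative archive name:

borg mount /mnt/goofys/borg1::node1-2020-05-01 /mnt/borg
clamscan -r -i /mnt/borg
borg umount /mnt/borg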

Scalability is achieved by running runners on different nodes with different tags.
Our monitoring collects backup statuses through the GitLab API into one view; if necessary, problems are easily noticed, and just as easily localized.
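
Pulling these statuses boils down, for example, to querying the pipelines endpoint (the GitLab address and project ID are illustrative):

curl --header "PRIVATE-TOKEN: $GITLAB_TOKEN" "https://gitlab.example.com/api/v4/projects/123/pipelines?status=failed&per_page=5"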

Conclusion

As a result, we know for sure that we make backups, that our backups are valid, that the problems that arise with them take little time, and that they are resolved at the level of the administrator on duty. And the backups take up really little space compared to tar.gz or Bacula.

