Reduce backups by 99.5% with hashget

Hashget is a free, open source deduplicator: a utility, similar to an archiver, that lets you dramatically reduce the size of backups and also organize incremental and differential backup schemes, among other things.

This is an overview article describing the features. Using hashget itself (which is quite simple) is covered in the project's README and wiki documentation.

Comparison

As the genre demands, I'll start right away with the intrigue: a comparison of results.

data sample               unpacked size   .tar.gz         hashget.tar.gz
WordPress-5.1.1           43 MB           11 MB (26%)     155 KB (0.3%)
Linux kernel 5.0.4        934 MB          161 MB (20%)    4.7 MB (0.5%)
Debian 9 (LAMP) LXC VM    724 MB          165 MB (23%)    4.1 MB (0.5%)

Background: what an ideal, efficient backup should look like

Every time I made a backup of a freshly created virtual machine, I was haunted by the feeling that I was doing something wrong. Why do I get a hefty backup from a system where my priceless, imperishable creative output amounts to a one-line index.html with the text "Hello world"?

Why is there a 16 MB /usr/sbin/mysqld in my backup? Am I really the one in this world honored with safekeeping this important file, such that if I fail, it will be lost to humanity? Most likely not. It is stored on highly reliable Debian servers (whose reliability and uptime are far beyond anything I can provide) and in the backups (millions of them) of other admins. Do we really need to create 10,000+ copies of this important file to improve reliability?

This is exactly the problem hashget solves. When packing, it creates a very small backup. When unpacking, it produces a fully restored system, identical to what plain tar -c / tar -x would give you. (In other words, packing is lossless.)

How hashget works

Hashget operates with two concepts, Package and HashPackage, which it uses to perform deduplication.

Package: a file (usually a .deb or .tar.gz archive) that can be reliably downloaded from the web and from which one or more files can be obtained.

HashPackage: a small JSON file representing a Package, containing the URL of the package and the hash sums (sha256) of the files in it. For example, for the 5-megabyte mariadb-server-core package, the HashPackage is only 6 kilobytes, about a thousand times smaller.
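Purely for illustration (this is not the actual file format; the field names, URL and values here are hypothetical), a HashPackage conceptually holds something like:

{
    "url": "http://snapshot.debian.org/.../mariadb-server-core_10.1_amd64.deb",
    "files": [
        { "path": "usr/sbin/mysqld", "sha256": "<sha256 of usr/sbin/mysqld>" },
        { "path": "usr/share/man/man8/mysqld.8.gz", "sha256": "<sha256 of that file>" }
    ]
}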

Deduplication: creating an archive without duplicate files (if the deduplicator knows where the original package can be downloaded, it drops the duplicates from the archive).

Packaging

When packing, hashget walks all files in the directory being packed and computes their hash sums. If a sum is found in one of the known HashPackages, the file's metadata (name, hash, access rights, etc.) is saved to a special .hashget-restore.json file, which is also included in the archive.

The packaging itself in the simplest case looks no more complicated than tar:

hashget -zf /tmp/mybackup.tar.gz --pack /path/to/data
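As a quick sanity check after packing (using the archive name from the command above; exact contents and sizes will of course vary):

ls -lh /tmp/mybackup.tar.gz
tar -tzf /tmp/mybackup.tar.gz | grep hashget-restore   # the restore metadata file travels inside the archive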

Unpacking

Unpacking is done in two stages. First, the usual tar unpacking:

tar -xf mybackup.tar.gz -C /path/to/data

then restore from network:

hashget -u /path/to/data

When restoring, hashget reads the .hashget-restore.json file, downloads the necessary packages, unpacks them, and extracts the needed files, placing them at the correct paths with the correct owner/group/permissions.

More complex things

What is described above is already enough for anyone who "wants it to work like tar, but pack my Debian into 4 megabytes". Now let's look at more complex things.

Indexing

If hashget didn’t have a single HashPackage at all, then it simply wouldn’t be able to dedupe anything.

You can also create a HashPackage manually (simply: hashget --submit https://wordpress.org/wordpress-5.1.1.zip -p my), but there is a better way.

To obtain the required HashPackages, there is an indexing stage (executed automatically by the --pack command) and heuristics. During indexing, hashget "feeds" each file it finds to every available heuristic that is interested in it. A heuristic can then index some Package to create a HashPackage.

For example, the Debian heuristic is interested in the /var/lib/dpkg/status file and detects installed Debian packages; if they are not yet indexed (no HashPackages created for them), it downloads and indexes them. The result is a very nice effect: hashget will always dedupe a Debian OS effectively, even one running the newest packages.

Hint files (hints)

If your systems use an in-house package of yours, or a public package not covered by hashget heuristics, you can add a simple hashget-hint.json hint file like this:

{
    "project": "wordpress.org",
    "url": "https://ru.wordpress.org/wordpress-5.1.1-ru_RU.zip"
}

From then on, every time the archive is created, this package will be indexed (if it has not been already), and its files will be deduplicated out of the archive. No programming is needed; everything can be done from vim, and it saves space on every backup. Note that thanks to the hash-sum approach, if some files from the package are changed locally (for example, a configuration file is edited), the changed files will be saved in the archive "as is" and will not be dropped.
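Putting it together, a minimal workflow with a hint file might look like this (the directory and archive name are illustrative; the hint contents are the ones shown above):

# put the hint next to the data being packed
cat > /var/www/wordpress/hashget-hint.json <<'EOF'
{
    "project": "wordpress.org",
    "url": "https://ru.wordpress.org/wordpress-5.1.1-ru_RU.zip"
}
EOF

# pack as usual; the hint heuristic indexes the package on the first run
hashget -zf /tmp/site-backup.tar.gz --pack /var/www/wordpress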

If some of your own packages are updated periodically but the changes are small, you can hint only for major versions. For example, for version 1.0 you make a hint pointing to mypackage-1.0.tar.gz and it gets fully deduplicated; then version 1.1 is released, which differs slightly, and the hint is not updated. That's OK: only the files that match version 1.0 (and can therefore be recovered from it) are deduplicated.

The heuristic that processes the hint file is a good example for understanding how heuristics work internally. It processes only hashget-hint.json (or .hashget-hint.json, with a leading dot) files and ignores all others. From this file it determines which package URL should be indexed, and hashget indexes it (if that has not been done already).

hashserver

Doing full indexing every time a backup is created would be quite laborious: each package would have to be downloaded, unpacked, and indexed. So hashget uses a scheme with a hashserver. When an installed Debian package is found and there is no HashPackage for it locally, hashget first tries simply to download the HashPackage from the hash server. Only if this fails does hashget download and hash the package itself (and upload the result to the hashserver, so that the hashserver can provide it in the future).

The HashServer is an optional, non-critical element of the scheme; it serves solely to speed things up and reduce the load on the repositories. It is easy to turn off (the --hashserver option without a value). In addition, you can easily run your own hashserver.
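If the flag behaves as described, disabling the hashserver for a single run would look roughly like this (the exact syntax of an empty --hashserver is an assumption on my part; the pack command is the one from earlier):

hashget -zf /tmp/mybackup.tar.gz --pack /path/to/data --hashserver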

Incremental and differential backups, planned obsolescence

Hashget makes it very easy to build a scheme of incremental and differential backups. Why not index our backup itself (with all our unique files)? One --submit command and you're done! The next backup that hashget creates will not include the files from this archive.

But this alone is not a very good approach, because when restoring we might have to pull every hashget backup in the entire history (if each one contains at least one unique file). For this there is a mechanism of planned expiration of backups. When indexing, you can specify the expiration date of a HashPackage with --expires 2019-06-01, and after that date (from 00:00) it will no longer be used. The archive itself does not have to be deleted after that date (although hashget can conveniently show the URLs of all backups that have expired / will expire at the moment or by any given date).

For example, if we make a full backup on the 1st of each month and index it with a lifetime until the end of the month, we get a differential backup scheme.

If we index each new backup in the same way, we get a scheme of incremental backups.
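A sketch of such a scheme, assuming our backups are published at URLs of our own choosing (the server, project name, paths and dates are all hypothetical; the flags are the ones used elsewhere in this article):

# 1st of the month: full backup, then index it so later backups can dedupe against it
hashget -zf /backups/full-2019-06-01.tar.gz --pack /srv/data
hashget --submit https://backups.example.com/full-2019-06-01.tar.gz --file /backups/full-2019-06-01.tar.gz --project mybackups --expires 2019-07-01

# later in the month: new backups contain only files that are not in the indexed full backup
hashget -zf /backups/diff-2019-06-15.tar.gz --pack /srv/data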

Unlike traditional schemes, hashget allows multiple underlying sources. A backup shrinks both by dropping files present in previous backups (if any) and by dropping public files (anything that can be downloaded).

If for some reason we do not trust the reliability of the Debian resources (https://snapshot.debian.org/), or we use another distribution, we can simply make one full backup with all packages and then rely on it (disabling the heuristics). Then, even if all of our distribution's servers become unreachable (on a sovereign Internet or during a zombie apocalypse), as long as our backups are in order we can recover from any short diff backup that relies only on our earlier backups.

Hashget relies only on recovery sources of YOUR choice: whatever you consider reliable is what will be used.

FilePool and Glacier

The FilePool mechanism lets you avoid constantly contacting external servers to download packages and instead use packages from a local directory or a corporate server, for example:

$ hashget -u . --pool /tmp/pool

or

$ hashget -u . --pool http://myhashdb.example.com/

To make a pool in a local directory, you just create the directory and throw files into it; hashget finds what it needs by hash. To make the pool available over HTTP, you need to create symlinks in a special way, which is done with one command (hashget-admin --build /var/www/html/hashdb/ --pool /tmp/pool). An HTTP FilePool is just static files, so any simple web server can serve it, and the load on the server is close to zero.
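For example, a local pool and its HTTP mirror could be set up roughly like this (the directories, web root and the source of the package files are illustrative; the commands are the ones mentioned above):

# a local pool is just a directory with package files in it; hashget matches them by hash
mkdir -p /tmp/pool
cp /var/cache/apt/archives/*.deb /tmp/pool/

# publish it over HTTP as a static tree of symlinks
hashget-admin --build /var/www/html/hashdb/ --pool /tmp/pool

# restore against either form of the pool
hashget -u /path/to/data --pool /tmp/pool
hashget -u /path/to/data --pool http://myhashdb.example.com/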

Thanks to FilePool, the underlying resources do not have to live on http(s); they can also live in, for example, Amazon Glacier.

After uploading a backup to Glacier, we get its Upload ID and use it in place of a URL. For example:

hashget --submit Glacier_Upload_ID --file /tmp/my-glacier-backup.tar.gz --project glacier --hashserver --expires 2019-09-01

Now new (differential) backups will rely on this backup and will be smaller. After tar-unpacking the diff backup, we can see which resources it relies on:

hashget --info /tmp/unpacked/ list

and just use a shell script to download all these files from Glacier into the pool, then start the usual recovery: hashget -u /tmp/unpacked --pool /tmp/pool
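The download step itself is outside of hashget; with the AWS CLI it could be sketched roughly like this (the vault name is hypothetical, the article's Glacier_Upload_ID is treated as the archive ID, and the asynchronous wait is simplified, since Glacier retrievals normally take hours):

# request retrieval of one of the archives listed by hashget --info ... list
aws glacier initiate-job --account-id - --vault-name my-backups --job-parameters '{"Type": "archive-retrieval", "ArchiveId": "Glacier_Upload_ID"}'

# ...wait for the job to complete, then fetch the result into the pool
aws glacier get-job-output --account-id - --vault-name my-backups --job-id JOB_ID /tmp/pool/my-glacier-backup.tar.gz

# now the usual restore
hashget -u /tmp/unpacked --pool /tmp/pool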

Is the game worth the candle

In the simplest case, you will simply pay less for backups (if you store them somewhere in the cloud for money). Maybe much, much less.

But that is not the only benefit. Quantity turns into quality: you can use the savings to upgrade your backup scheme qualitatively. For example, since backups are now smaller, you can make them daily instead of monthly, and keep them for 5 years instead of six months. Where you previously used slow but cheap "cold" storage (Glacier), you can now use hot storage, from which a backup can always be downloaded quickly and restored in minutes rather than in a day.

You can also increase the reliability of backup storage. If we currently keep backups in a single storage, then, with their volume reduced, we can keep them in 2-3 storages and survive painlessly if one of them is damaged.

How to try and start using?

Go to the GitLab page https://gitlab.com/yaroslaff/hashget, install with one command (pip3 install hashget[plugins]) and just read and follow the quick start. I think doing all the simple things will take 10-15 minutes. Then you can try packing your virtual machines, make hint files where needed to shrink them further, play around with pools, a local hash database and a hashserver if that interests you, and the next day see how small the incremental backup over yesterday turns out to be.

Source: habr.com
