A strange method of saving hard disk space

Yet another user wants to write a new piece of data to their hard disk but does not have enough free space for it. They also refuse to delete anything, because "everything is very important and necessary." So what are we supposed to do?

They are not the only one with this problem. Our hard drives hold terabytes of information, and that amount is not inclined to shrink. But how unique is it? After all, every file is just a bit sequence of a certain length, and most likely the new one does not differ much from something that is already stored.

Obviously, searching the hard drive for already stored pieces of data is, if not a hopeless task, then at least an inefficient one. On the other hand, if the difference is small, then we could adjust things a little...


TL;DR: a second attempt to describe a strange method of optimizing data storage with the help of JPEG files, this time in a more understandable form.

About bits and differences

If we take two completely random pieces of data, then on average half of their bits match. Indeed, among the possible combinations for each pair of bits ('00', '01', '10', '11'), exactly half have equal values; it is as simple as that.
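
A minimal sketch, assuming nothing beyond the C standard library, that checks this claim empirically: it compares two buffers of random bytes and prints the fraction of matching bits, which hovers around 0.5.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    enum { N = 1 << 20 };              /* one mebibyte of random data per buffer */
    unsigned long long same = 0;
    srand((unsigned)time(NULL));

    for (long i = 0; i < N; ++i) {
        unsigned char a = (unsigned char)rand();
        unsigned char b = (unsigned char)rand();
        unsigned char diff = a ^ b;    /* set bits mark the positions that differ */
        for (int bit = 0; bit < 8; ++bit)
            same += !((diff >> bit) & 1);
    }
    printf("fraction of matching bits: %.4f\n", (double)same / (8.0 * N));
    return 0;
}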

But of course, if we simply take two files and fit one to the other, we will lose one of them. If we store the changes instead, we merely reinvent delta encoding, which already exists perfectly well without us, although it is not normally used for this purpose. We could try to embed the smaller sequence into the larger one, but even then we risk losing critical segments of data if we apply this recklessly to everything.

Between what and what, then, can the difference be eliminated? A new file written by the user is just a sequence of bits, and there is nothing we can do with it on its own. So we need to find bits already on the hard disk that can be changed without having to store the difference, and whose loss can be survived without serious consequences. And it makes sense to change not the file itself at the file-system level, but some less sensitive information inside it. But which information, and how?

Fitting Methods

Lossy compressed files come to the rescue. All those JPEGs, MP3s and the like, even though they are already lossily compressed, contain plenty of bits available for safe modification. There are advanced techniques that imperceptibly modify their components in various parts of the encoding. Wait. Advanced techniques... imperceptible modifications... bits into bits... that is almost steganography!

Indeed, embedding one piece of information into another resembles steganographic methods like nothing else. The imperceptibility of the changes to the human senses is also appealing. But this is where the paths diverge: the point of steganography is secrecy, while our task is simply to add extra information to the user's hard drive, and hiding it from the user would only harm them. They would just forget about it again.

So although we can borrow these methods, they need some modification. Below I will describe and demonstrate the changes using one of the existing methods and a common file format as an example.

About jackals

If we are going to squeeze something, then let it be the most squeezed thing in the world. We are talking, of course, about JPEG files. Not only does the format come with a ton of tools and existing methods for embedding data into it, it is also the most popular graphics format on the planet.


However, so as not to breed even more of those jackals, we need to narrow the field of work within files of this format. Nobody likes the single-color squares that appear because of excessive compression, so we must limit ourselves to working with an already compressed file and avoid re-encoding it. More specifically, we work with the integer coefficients that remain after the operations responsible for the data loss, DCT and quantization, which is nicely shown in the encoding scheme below (thanks to the wiki of the Bauman National Library):
[Diagram: the JPEG encoding scheme, showing the DCT, quantization and subsequent entropy coding steps]
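
To make "working with the integer coefficients" more concrete, here is a minimal sketch (my own illustration, not the utility's source) that uses libjpeg's jpeg_read_coefficients to walk the quantized DCT blocks of the first component (the luminance channel in a typical YCbCr JPEG) without ever decoding the image to pixels:

#include <stdio.h>
#include <jpeglib.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s file.jpg\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    struct jpeg_decompress_struct cinfo;
    struct jpeg_error_mgr jerr;
    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);
    jpeg_stdio_src(&cinfo, f);
    jpeg_read_header(&cinfo, TRUE);

    /* Get the quantized coefficients exactly as they are stored in the file. */
    jvirt_barray_ptr *coefs = jpeg_read_coefficients(&cinfo);
    jpeg_component_info *comp = &cinfo.comp_info[0];

    long nonzero_ac = 0;
    for (JDIMENSION row = 0; row < comp->height_in_blocks; ++row) {
        JBLOCKARRAY buf = (*cinfo.mem->access_virt_barray)
            ((j_common_ptr)&cinfo, coefs[0], row, 1, FALSE);
        for (JDIMENSION col = 0; col < comp->width_in_blocks; ++col) {
            JCOEFPTR block = buf[0][col];       /* the 64 coefficients of one 8x8 block */
            for (int i = 1; i < DCTSIZE2; ++i)  /* i = 0 is the DC coefficient */
                if (block[i] != 0) ++nonzero_ac;
        }
    }
    printf("nonzero AC coefficients in component 0: %ld\n", nonzero_ac);

    jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
    fclose(f);
    return 0;
}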

There are many methods for optimizing JPEG files. There is lossless optimization (jpegtran), and there is "lossless" optimization (in quotes because it does introduce some losses), but neither concerns us here. After all, if a user is ready to embed one piece of information into another in order to gain free disk space, then they either optimized their images long ago or do not want to do so at all for fear of quality loss.

F5

Under such conditions, a whole family of algorithms is suitable; they can be found in this nice presentation. The most advanced of them is the F5 algorithm by Andreas Westfeld, which works with the coefficients of the luminance component, since the human eye is least sensitive to changes in it. Moreover, it uses an embedding technique based on matrix encoding, which makes it possible to introduce fewer changes while embedding the same amount of information, the larger the container being used.
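
For the curious, here is a tiny illustration of the matrix-encoding idea (a sketch of the (1, 2^k - 1, k) scheme, not code from the article): k message bits are embedded into the LSBs of a group of n = 2^k - 1 nonzero coefficients by changing at most one of them.

#include <stdio.h>

/* Returns the 1-based index of the single coefficient in the group whose LSB
   must be flipped so that the group's "hash" equals the k message bits,
   or 0 if the group already matches and nothing needs to change. */
unsigned matrix_encode(const unsigned char *lsb, unsigned n, unsigned msg) {
    unsigned hash = 0;
    for (unsigned i = 1; i <= n; ++i)   /* XOR of the indices whose LSB is 1 */
        if (lsb[i - 1]) hash ^= i;
    return hash ^ msg;
}

int main(void) {
    /* k = 3 message bits hidden in the LSBs of n = 7 coefficients. */
    unsigned char lsb[7] = {1, 0, 1, 1, 0, 0, 1};
    unsigned msg = 0x5;                 /* the three bits we want to embed: 101 */

    unsigned pos = matrix_encode(lsb, 7, msg);
    if (pos)
        printf("change coefficient #%u of the group\n", pos);
    else
        printf("the group already encodes the message\n");
    return 0;
}

One change (or none) per group of seven coefficients carries three bits, which is exactly the "fewer changes for the same amount of information" trade-off mentioned above.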

The changes themselves come down to decreasing the absolute value of a coefficient by one under certain conditions (that is, not always), which is what lets us use F5 to optimize data storage on the hard drive. The point is that after such a change the coefficient is very likely to occupy fewer bits after Huffman coding, thanks to the statistical distribution of values in JPEG, and the new zeros will give an additional gain when they are encoded with RLE.
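
The change itself is tiny; a sketch of it in libjpeg terms (an illustration, not the utility's actual code) looks roughly like this. The absolute value is only ever decreased, and if the coefficient reaches zero the embedded bit is lost ("shrinkage" in F5 terminology) and has to be re-embedded with the next group.

#include <stdio.h>
#include <jpeglib.h>   /* for JCOEF, libjpeg's coefficient type */

/* Decrease |coef| by one, as F5 does when a coefficient has to be changed.
   Returns 0 on shrinkage (the coefficient became zero), 1 otherwise. */
int f5_adjust(JCOEF *coef) {
    if (*coef > 0)
        *coef -= 1;    /*  5 ->  4,  2 ->  1,  1 -> 0 (shrinkage) */
    else if (*coef < 0)
        *coef += 1;    /* -3 -> -2, -1 ->  0 (shrinkage) */
    return *coef != 0;
}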

The necessary modifications come down to removing the secrecy part (the password-based permutation), which saves resources and execution time, and adding a mechanism for working with many files instead of one at a time. The details of these changes are unlikely to interest the reader, so let us move on to a description of the implementation.

High tech

To demonstrate how this approach works, I implemented the method in pure C and carried out a number of optimizations in terms of both execution speed and memory (you have no idea how much these pictures weigh uncompressed, even before the DCT). Cross-platform support is achieved with a combination of the libjpeg, pcre and tinydir libraries, for which we thank their authors. Everything is built with 'make', so Windows users will want to install Cygwin for evaluation, or deal with Visual Studio and the libraries on their own.

The implementation is available as a console utility and as a library. Those interested can learn more about using the latter from the readme in the repository on GitHub; the link is at the end of the post.

How to use?

Carefully. The images used for packing are selected by a regular-expression search within the given root directory. Once packing is done, the files can be moved, renamed and copied freely within it; you can even change the file system or the operating system, and so on. However, you must be extremely careful not to change their actual contents in any way. The loss of even a single bit can make it impossible to recover the information.

On completion, the utility leaves a special archive file containing all the information needed for unpacking, including data about the images used. By itself it weighs about a couple of kilobytes and has no significant impact on the occupied disk space.

You can analyze the potential capacity with the '-a' flag: './f5ar -a [search folder] [Perl-compatible regular expression]'. Packing is done with './f5ar -p [search folder] [Perl-compatible regular expression] [file to pack] [archive name]', and unpacking with './f5ar -u [archive file] [recovered file name]'.

Demonstration of work

To show the effectiveness of the method, I downloaded a collection of 225 completely free dog photos from Unsplash and dug out of my documents a large 45-megabyte PDF of the second volume of Knuth's The Art of Computer Programming.

The sequence is pretty simple:

$ du -sh knuth.pdf dogs/
44M knuth.pdf
633M dogs/

$ ./f5ar -p dogs/ .*jpg knuth.pdf dogs.f5ar
Reading compressing file... ok
Initializing the archive... ok
Analysing library capacity... done in 17.0s
Detected somewhat guaranteed capacity of 48439359 bytes
Detected possible capacity of upto 102618787 bytes
Compressing... done in 39.4s
Saving the archive... ok

$ ./f5ar -u dogs/dogs.f5ar knuth_unpacked.pdf
Initializing the archive... ok
Reading the archive file... ok
Filling the archive with files... done in 1.4s
Decompressing... done in 21.0s
Writing extracted data... ok

$ sha1sum knuth.pdf knuth_unpacked.pdf
5bd1f496d2e45e382f33959eae5ab15da12cd666 knuth.pdf
5bd1f496d2e45e382f33959eae5ab15da12cd666 knuth_unpacked.pdf

$ du -sh dogs/
551M dogs/

Screenshots for fans


And the unpacked file can (and should) still be read:


As you can see, from the original 633 + 44 == 677 megabytes of data on the hard disk we arrived at a more pleasant 551. Such a radical difference is explained by the decrease in the coefficient values, which affects their subsequent lossless compression: decreasing a value by just one can easily "cut" a couple of bytes from the final file. Nevertheless, this is still data loss, however tiny, and one has to put up with it.
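
A rough illustration of where those bytes come from (a simplified sketch, not an exact model of the codec): the entropy coder stores each nonzero coefficient as a Huffman symbol for its run/size pair plus "size" extra bits, where "size" is the bit length of the magnitude. Decreasing a value can therefore shave off an extra bit, and a value that reaches zero disappears into the run length entirely.

#include <stdio.h>

/* Bit length of |value|: the number of "extra bits" JPEG appends after the
   Huffman symbol for a nonzero coefficient (its magnitude category). */
static int magnitude_category(int value) {
    int v = value < 0 ? -value : value;
    int bits = 0;
    while (v) { ++bits; v >>= 1; }
    return bits;   /* 0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3, ... */
}

int main(void) {
    printf("extra bits for 2: %d, for 1: %d, for 0: %d\n",
           magnitude_category(2), magnitude_category(1), magnitude_category(0));
    return 0;
}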

Fortunately, the losses are not visible to the eye at all. Under the spoiler (since habrastorage cannot handle large files), the reader can judge the difference both by eye and by intensity, obtained by subtracting the values of the changed component from the original one: the original, the one with information inside, and the difference (the dimmer the color, the smaller the difference within the block).

Instead of a conclusion

Given all these complications, buying another hard drive or uploading everything to the cloud may seem like a much simpler solution to the problem. But even though we live in such a wonderful time, there is no guarantee that tomorrow you will still be able to go online and upload all your extra data somewhere, or walk into a store and buy yourself yet another thousand-terabyte hard drive. You can, however, always use the ones already lying around at home.

-> GitHub

Source: habr.com
