Save hard drive space with steganography

When steganography comes up, people imagine terrorists, pedophiles, spies or, at best, crypto-anarchists and other researchers. And really, who else might need to hide something from prying eyes? What benefit could it possibly bring to the common man?

It turns out there is some. That is why today we will compress data using steganographic methods. And by the end, the reader will even be able to use his precious JPEG photo archives to increase the number of free gigabytes on the file system.


What?

If the reader remembers, steganography is that strange family of algorithms that let you hide the very presence of one piece of information inside another. In even simpler terms: picture + file == approximately the same picture, but not quite (it does not have to be pictures, but pictures make everything clearer). At the same time, there should be no easy way to determine whether something is hidden inside or not.

But if one cannot be distinguished from the other, is there any difference at all? From the consumer's point of view, the user does not care about mathematical precision (the particular set of bits), only about what he perceives.

For example, let's look at three images of a cute dog:

Beware JPEGs!

[Image: the same dog photo at three different quality levels]

Despite the huge difference in file size, few people would choose the third version. On the other hand, the difference between the first two photos is not so noticeable, and the amount of information in them (from my point of view) can be considered equal.

This principle is old in itself and has been actively exploited by lossy compression methods for many years. But breaking is not building; we are interested in the more advanced side of the question: is it possible to embed additional information of size N into a file so that its size grows by only M < N, with the changes invisible to the user?

Of course it is. But a couple of reservations are worth making right away:

  • First, the method must be universal and give a positive result on most inputs. That is, on average, for arbitrary input, there should be an actual decrease in the amount of information stored. "On average" means that opposite cases can occur, but must not predominate.
  • Second, the compressed container before embedding must be larger than its similarly compressed modification. Simply stuffing a pile of bits into BMP images with the LSB method is not steganographic compression, since after being run through some kind of DEFLATE the original image will most likely end up noticeably smaller.
  • Third, the result must be produced and compared against data already compressed by classical methods. This removes the probabilistic effect of differences in their redundancy and gives more efficient compression in the general case.

Where?

Using steganography implies that, in addition to the information being compressed, we need containers to embed it in. The maximum amount of information that can be embedded in a single container depends largely on its individual properties, but the total capacity scales much more easily with their number. Therefore the container format must be widespread, so that the user has enough of them to get any benefit from the "compression" process.

In this context, graphics, audio and video files are good candidates. But due to the variety of formats, codecs and so on, in practice we are left with not so many options.

Given all this, my choice fell on JPEG. Almost everyone has some, it is widely used for both personal and business purposes, and it is nearly the de facto format for most images.


How?

What follows are fairly dry technical diagrams and descriptions without much explanation, so those who wish can skip them by scrolling down to the "High tech" section.

Common features

To embed data somewhere, you must first determine where. There can be any number of photos on the file system, of which the user may want to use only a few. Such a desired set of containers will be called a library.

It is formed in two cases: before compression and before decompression. In the first case you can simply use a set of file names (or, better, a regular expression for them), but in the second something more reliable is required: the user can copy and move the files within the file system, preventing them from being identified correctly. Therefore their hashes (MD5 is enough) must be stored after all modifications are done.

There is no point in searching the entire file system by regular expression; it is enough to specify a certain root directory. A special archive file will be stored there, containing those hashes along with the other meta-information needed to later recover the compressed data.
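The idea of identifying containers by content rather than by path can be sketched in a few lines of C. The real archiver stores MD5 hashes; since MD5 needs an external library, the 64-bit FNV-1a hash below is a stand-in used purely for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Identify a container by a hash of its contents, so that renaming or
 * moving the file does not matter. FNV-1a (64-bit) is a stand-in here;
 * the actual implementation stores MD5 hashes. */
uint64_t content_id(const unsigned char *data, size_t len)
{
    uint64_t h = 1469598103934665603ULL;       /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;                 /* FNV prime */
    }
    return h;
}
```

A file renamed from 1.jpg to dog.jpg keeps the same content hash, so the unpacker can still locate it anywhere under the search root.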

All of this applies equally to any implementation of any steganographic data compression algorithm. The processes of compressing and recovering the data will be called packing and unpacking.

F5

Now that it is clear what we are doing and why, it remains to describe the algorithm itself. Recall the JPEG encoding process (thanks to the Bauman National Library wiki):

[Diagram: the JPEG encoding process]

Looking at it, a few remarks are worth making right away:

  • The size of a JPEG file can already be considered optimal, so there is no point trying to squeeze it further with some WinRAR;
  • Only the stored information (the output of the discrete cosine transform, DCT) can be changed if we want at least somewhat acceptable performance;
  • To avoid losing data on a scale noticeable to the user, a minimum of modifications must be made to each individual image;

Under these conditions, a whole family of algorithms fits, which can be found in this nice presentation. The most advanced of them is F5 by Andreas Westfeld, which works with the DCT coefficients of the luminance component (the human eye is least sensitive to its changes). Its general layout when working with an existing JPEG file is shown in the following scheme:

[Diagram: the general scheme of F5 applied to an existing JPEG file]

The F5 block uses an advanced embedding technique based on matrix coding. The reader can learn more about it and about the algorithm itself at the link above; what matters most to us is that, with its help, the larger the container used, the fewer changes are needed to embed the same amount of information, and that to carry out the embedding the algorithm only needs simple Huffman and RLE (de)coding operations.

The changes themselves are made to integer coefficients and come down to decreasing their absolute value by one, which is what makes it possible to use F5 for data compression. The point is that a coefficient decreased in absolute value will most likely occupy fewer bits after Huffman coding, due to the statistical distribution of coefficient values in JPEG.

[Chart: distribution of DCT coefficient values in JPEG]
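The matrix-coding trick itself can be sketched with the classic (1, n, k) construction: the XOR of the (1-based) positions of the set bits among n = 2^k - 1 coefficient LSBs forms a k-bit syndrome, and any desired syndrome value can be reached by flipping at most one of those bits. This is a simplified model of what F5 does; in the real algorithm the "flip" is a decrement of a coefficient's absolute value, retried when shrinkage occurs:

```c
#include <stddef.h>

/* XOR of the 1-based positions of set bits among n = 2^k - 1 LSBs:
 * the k-bit syndrome that carries one embedded message chunk. */
unsigned syndrome(const unsigned char *lsb, unsigned n)
{
    unsigned s = 0;
    for (unsigned i = 1; i <= n; i++)
        if (lsb[i - 1])
            s ^= i;
    return s;
}

/* 1-based position of the single bit to flip so that the syndrome
 * becomes msg, or 0 if no change is needed at all. In real F5 the flip
 * is a decrement of |coefficient|, repeated if it shrinks to zero. */
unsigned embed_pos(const unsigned char *lsb, unsigned n, unsigned msg)
{
    return syndrome(lsb, n) ^ msg;
}
```

So k = 3 message bits ride on n = 7 coefficients at the cost of at most one modification, which is exactly why larger containers need proportionally fewer changes.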

When a zero is produced (so-called shrinkage), the amount of stored information is reduced by the coefficient's size, since the formerly independent coefficient becomes part of an RLE-encoded run of zeros:

[Diagram: a zeroed coefficient merging into an RLE zero run]
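Why shrinkage saves extra space can be seen with a toy model of JPEG's run-length stage, where each nonzero coefficient costs one (zero-run, value) pair and zeros cost nothing on their own (end-of-block markers and the Huffman stage are deliberately ignored here):

```c
#include <stddef.h>

/* Toy model of JPEG's RLE stage: each nonzero coefficient produces one
 * (preceding-zero-run, value) pair; zeros are absorbed into the runs.
 * EOB handling and Huffman coding are left out for brevity. */
size_t rle_pairs(const int *coef, size_t n)
{
    size_t pairs = 0;
    for (size_t i = 0; i < n; i++)
        if (coef[i] != 0)
            pairs++;   /* one (run, value) pair per nonzero value */
    return pairs;
}
```

When F5 decrements a 1 down to 0, that coefficient's pair disappears entirely: its position simply merges into the neighbouring zero run.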

Modifications

Data protection and data compression are orthogonal tasks, so the secret password permutation from the original algorithm can be dropped. Moreover, we need to know exactly how to extract the data, so all the information needed for that (which containers were used, in what order, and so on) should be written to a separate file that the archiver can freely read.

The original algorithm is designed for transmitting secret messages, so it works with one container at a time, assuming the user will split the data into pieces himself if necessary. Moreover, when embedding into each container independently, you would need to know in advance how many bits of data to put into each one. Therefore the coefficients of all the library's elements should be combined into one abstract large container, which is then processed by the original algorithm.

Since the original F5 allows embedding up to 12% of the container's size, this modification also increases the maximum capacity: "up to 12%" of the size of the whole library is greater than or equal to the sum of "up to 12%" of each of its elements.
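The "greater than or equal" claim is easy to check in a toy capacity model. If matrix coding packs k bits into every full group of n = 2^k - 1 usable coefficients, a container holds floor(m / n) * k bits and any leftover partial group is wasted; merging the library lets the leftovers of one image combine with those of the next. (The model ignores shrinkage and other real-world effects.)

```c
#include <stdint.h>

/* Toy capacity model: matrix coding stores k bits per full group of
 * n = 2^k - 1 usable coefficients; incomplete groups store nothing. */
uint64_t cap_bits(uint64_t coeffs, unsigned k)
{
    uint64_t n = (1ULL << k) - 1;
    return coeffs / n * k;
}
```

With k = 3 (n = 7), two images with 6 and 8 usable coefficients hold 0 + 3 bits separately, but 6 bits once merged.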

The modified general scheme looks like this:

[Diagram: the modified general scheme]

The algorithm itself

Now it's time to describe the algorithm itself from beginning to end, so as not to keep the reader in the dark:

  • The user defines the binary data M to be compressed and the library L via a regular expression and a search root directory;
  • In file-system order, the library's elements form MC:
    • A series of coefficients C is decoded from the file data;
    • MC <- MC | C;
  • The parameter k is determined from the terrible inequality: |M| * 8 / (count_full(MC) + count_ones(MC) * k_rate(k)) < k / ((1 << k) - 1);
  • The next n = (1 << k) - 1 least significant bits of non-zero elements of MC are taken and written into the word a:
    • The magic hash function f is computed, mapping the n-bit word a to a k-bit value s;
    • If s == 0, nothing needs to change and the algorithm moves on to the next coefficients;
    • Otherwise the absolute value of the coefficient responsible for the s-th bit of the word a is decreased;
    • If the decrease produced shrinkage (the coefficient became 0), the step is repeated from the beginning;
  • All coefficients are RLE- and Huffman-encoded and written back to the source files;
  • The parameter k is written to the archive file;
  • An MD5 hash is computed for each file of L, in the order of their original location, and written to the archive file.
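The choice of k from the inequality above can be sketched as a simple search. Here count_full and count_ones are taken as precomputed counts over MC, and k_rate is simplified to a plain number (in the real code it depends on k); those three names come from the step list, the rest is a hypothetical simplification:

```c
#include <stdint.h>

/* Pick the largest k (biggest groups, fewest changes) for which the
 * capacity inequality from the step list still holds:
 *   |M| * 8 / (count_full + count_ones * k_rate) < k / (2^k - 1)
 * k_rate is treated as a constant here; real code would re-evaluate it
 * per candidate k. */
unsigned pick_k(uint64_t msg_bytes, uint64_t count_full,
                uint64_t count_ones, double k_rate)
{
    double need = (double)msg_bytes * 8.0 /
                  ((double)count_full + (double)count_ones * k_rate);
    unsigned best = 1;
    for (unsigned k = 1; k <= 16; k++) {
        double rate = (double)k / (double)((1u << k) - 1);
        if (need < rate)
            best = k;   /* rate falls as k grows, so the last hit wins */
    }
    return best;
}
```

The larger the library relative to the message, the larger the k that still fits, and hence the fewer coefficients end up modified.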

High tech

A naive form of the algorithm, or an implementation in a high-level language (especially one with garbage collection), would give terrible performance, so I implemented the whole thing in pure C and carried out a number of optimizations in both execution speed and memory (you have no idea how much these pictures weigh without compression, even before the DCT). But even so, at first the execution speed left much to be desired, so I will not describe the whole process and the methods used.

Cross-platform support is achieved with a combination of the libjpeg, pcre and tinydir libraries, for which we thank them. By default everything is built with a regular make, so Windows users will want to install Cygwin, or deal with Visual Studio and the libraries on their own.

The implementation is available as a console utility and as a library. Those interested can learn more about using the latter in the readme of the GitHub repository, the link to which I will attach at the end of the post. And here we turn to the description and demonstration of its work.

How to use?

Carefully. The images used can be moved, renamed and copied as desired. However, you should be extremely careful not to change their contents in any way: changing even one bit will break the hash and make restoring the information impossible.

Suppose that after compilation we have an executable file f5ar. You can analyse the library to estimate how much it can hold with the -a flag: ./f5ar -a [search folder] [Perl-compatible regular expression]. Packing is done with the command ./f5ar -p [search folder] [Perl-compatible regular expression] [file to pack] [archive name], and unpacking with ./f5ar -u [archive file] [name of the restored file].

Demonstration of work

To show the effectiveness of the method, I downloaded a collection of 225 absolutely free dog photos from the Unsplash service. Each of them is of somewhat higher quality than regular user photos, but nonetheless. Each was re-encoded with libjpeg to reduce the influence of the library's encoding peculiarities on the overall size. To represent the worst case of compressible data, a uniformly distributed random 36 MB file (slightly more than 5% of the total size) was generated with dd.

The testing process is quite simple:

$ ls
binary_data dogs f5ar
$ du -sh dogs/
633M dogs/
$ du -h binary_data
36M binary_data

$ ./f5ar -p dogs/ .*jpg binary_data dogs.f5ar
Reading compressing file... ok
Initializing the archive... ok
Analysing library capacity... done in 16.8s
Detected somewhat guaranteed capacity of 48439359 bytes
Detected possible capacity of upto 102618787 bytes
Compressing... done in 32.6s
Saving the archive... ok

$ ./f5ar -u dogs/dogs.f5ar unpacked
Initializing the archive... ok
Reading the archive file... ok
Filling the archive with files... done in 1.2s
Decompressing... done in 17.5s
Writing extracted data... ok

$ sha1sum binary_data unpacked
ba7ade4bc77881ab463121e77bbd4d41ee181ae9 binary_data
ba7ade4bc77881ab463121e77bbd4d41ee181ae9 unpacked
$ du -sh dogs/
563M dogs/

Or a screenshot for fans

[Screenshot: the same terminal session]

As you can see, from the original 633 + 36 == 669 megabytes of data on the hard drive we arrived at a more pleasant 563, giving a compression ratio of ~1.188. Such a radical difference is explained by the extremely small losses, similar to those obtained when optimizing JPEG files by classical methods (such as tinyjpg). Naturally, with steganographic compression the information is not simply "lost" but is used to encode other data. Moreover, thanks to F5, the number of "optimized" coefficients is much smaller than with traditional optimization.

Whatever the modifications are, they are completely invisible to the eye. Under the spoiler below, the reader can evaluate the difference both by eye and by subtracting the values of the modified component from the original (the more muted the color, the smaller the difference):

Links to images that do not fit on habrastorage

Original - https://i.ibb.co/wNDLNcZ/1.jpg
Modified - https://i.ibb.co/qWvpfFM/1.jpg
Difference - https://i.ibb.co/2ZzhHfD/diff.jpg

Instead of a conclusion

I hope I have convinced the reader that such methods are possible and have a right to exist. However, buying a hard drive or an extra channel (for network transmission) may seem a much easier option than trying to save space this way. On the one hand this is true: extensive growth is often simpler and more reliable. But on the other hand, do not forget about the intensive kind. After all, there is no guarantee that tomorrow you will be able to walk into a store and buy yourself another thousand-terabyte hard drive, but you can always make use of the ones already at home.

-> GitHub

Source: www.habr.com
