Paradoxes about data compression

Paradoxes about data compression The problem of data compression in its simplest form can be about numbers and their notation. Numbers can be denoted by numerals ("eleven" for the number 11), mathematical expressions ("two in the twentieth" for 1048576), string expressions ("five nines" for 99999), proper names ("number of the beast" for 666, "year of Turing's death" for 1954), or arbitrary combinations of them. Any designation is suitable, according to which the interlocutor can clearly determine what number we are talking about. It is obvious that to inform the interlocutor "factorial of eight" more efficient than the equivalent notation "forty thousand three hundred twenty". Here a logical question arises: what is the shortest notation for a given number?

The philosopher Bertrand Russell in 1908 published "Berry's paradox", which touches on the issue of notation of numbers from the opposite side: What is the smallest number for which eighty letters are not enough?
Such a number must exist: out of eighty Russian letters and spaces, only 3480 designations can be made, which means that using eighty letters, no more than 3480 numbers can be designated. This means that a certain number, no more than 3480, cannot be designated in this way.

So this number will correspond to the designation "the smallest number for which eighty letters are not enough", which has only 78 letters! On the one hand, this number must exist; on the other hand, if this number exists, then its designation does not correspond to it. Paradox!

The easiest way to dismiss this paradox is to refer to the informality of verbal designations. Like, if only a specific set of expressions were allowed in the notation, then "the smallest number for which eighty letters are not enough" would not be a valid notation, whereas practically useful notations like "factorial of eight" would remain acceptable.

Are there formal ways to describe a sequence (algorithm) of operations on numbers? There are, and in abundance - they are called programming languages. Instead of verbal notation, we will use programs (for example, in Python) that display the necessary numbers. For example, for five nines, the program is suitable print("9"*5). As before, we will be interested in the shortest program for a given number. The length of such a program is called Kolmogorov complexity numbers; this is the theoretical limit to which a given number can be compressed.

Instead of Berry's paradox, we can now consider a similar one: What is the smallest number for which a kilobyte program is not enough to output?

We will argue in the same way as before: there are 2561024 kilobyte texts, which means that no more than 2561024 numbers can be output by kilobyte programs. This means that some number not greater than 2561024 cannot be deduced in this way.

But let's write a Python program that generates all possible kilobyte texts, launches them for execution, and if they output some number, then it adds this number to the reachable dictionary. After checking all 2561024 possibilities, no matter how long it takes, the program looks for the smallest number that is not in the dictionary and outputs that number. It seems obvious that such a program will fit into a kilobyte of code - and will display the very number that cannot be output by a kilobyte program!

What's the catch now? It can no longer be attributed to the informality of designations!

If you are confused by the fact that our program will require an astronomical amount of memory to work - a dictionary (or bit array) of 2561024 elements - then you can do the same thing without it: for each of the 2561024 numbers, iterate over all 2561024 possible programs in turn, until no suitable one can be found. It doesn't matter that such an enumeration will take a very long time: after checking less than (2561024) 2 pairs from the number and the program, it will end and find that very coveted number.

Or won't it finish? After all, among all the programs that will be tried, there will be while True: pass (and its functional analogues) - and the matter will not go further than checking such a program!

In contrast to Berry's paradox, where the catch was in the informality of notation, in the second case we have a well-disguised reformulation "stop problems". The fact is that it is impossible to determine its output from the program in a finite time. In particular, the Kolmogorov complexity uncomputable: there is no algorithm that would allow for a given number to find the length of the shortest program that displays this number; which means that there is no solution for Berry's problem - to find for a given number the length of the shortest verbal designation.

Source: habr.com

Add a comment