"The goal of this course is to prepare you for your technical future."
Hey Habr. Remember the awesome article "You and Your Research" (+219, 2588 bookmarks, 429k reads)?
So Hamming (yes, that Hamming, of the error-detecting and error-correcting Hamming codes) has a whole book based on his lectures. We are translating it, because the man speaks his mind.
This book is not just about IT; it is a book about the way of thinking of incredibly cool people. "It's not just a charge of positive thinking; it describes the conditions that increase the chances of doing great work."
Thanks to Andrey Pakhomov for the translation.
Information Theory was developed by C. E. Shannon in the late 1940s. Bell Labs management insisted that he call it "Communication Theory", because that is a much more accurate name. For obvious reasons, the name "Information Theory" has a much greater impact on the public, which is why Shannon chose it, and it is the name we know to this day. The very name suggests that the theory deals with information, which makes it important as we move deeper into the information age. In this chapter I will touch on several main conclusions of this theory, giving not strict but rather intuitive proofs of some of its individual provisions, so that you understand what "Information Theory" really is, where you can apply it, and where you cannot.
First of all, what is "information"? Shannon equates information with uncertainty. He chose the negative logarithm of the probability of an event as the quantitative measure of the information you get when an event of probability p occurs. For example, if I tell you that it is foggy in Los Angeles, then p is close to 1, and this really does not give us much information. But if I say that it will rain in Monterey in June, there will be uncertainty in this message, and it will contain more information. A certain event contains no information at all, since log 1 = 0.
Let's dwell on this in more detail. Shannon believed that the quantitative measure of information should be a continuous function of the event probability p, and that for independent events it should be additive - the amount of information obtained from the occurrence of two independent events should equal the amount of information obtained from the joint event. For example, the outcomes of a die roll and a coin toss are usually treated as independent events. Translating this into the language of mathematics: if I(p) is the amount of information contained in an event of probability p, then for a joint event consisting of two independent events, x with probability p1 and y with probability p2, we get
I(p1 · p2) = I(p1) + I(p2)   (x and y independent events)
This is the Cauchy functional equation, true for all p1 and p2. To solve this functional equation, suppose that

p1 = p2 = p;

this gives

I(p^2) = 2I(p).

If p1 = p^2 and p2 = p, then

I(p^3) = I(p^2) + I(p) = 3I(p),

and so on. Extending this process by the standard method for exponentials, for all rational numbers m/n the following is true:

I(p^(m/n)) = (m/n) I(p).
From the assumed continuity of the information measure, it follows that the logarithmic function is the only continuous solution of the functional Cauchy equation.
In information theory, it is customary to take the base of the logarithm equal to 2, so a binary choice contains exactly 1 bit of information. Therefore, information is measured by the formula

I(p) = -log2(p).
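As a quick numeric sketch (my illustration, not part of the original lectures), the measure I(p) = -log2(p) in code:

```python
import math

def information_bits(p: float) -> float:
    """Information (in bits) received when an event of probability p occurs."""
    return -math.log2(p)

print(information_bits(1.0))    # a certain event carries 0 bits
print(information_bits(0.5))    # a fair coin toss carries 1 bit
print(information_bits(0.125))  # a 1-in-8 event carries 3 bits
```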
Let's pause and see what has happened so far. First of all, we have not given a definition of the concept of "information"; we have simply defined a formula for its quantitative measure.
Secondly, this measure depends on uncertainty, and although it is quite suitable for machines - for example, telephone systems, radio, television, computers, etc. - it does not reflect the normal human attitude to information.
Thirdly, it is a relative measure: it depends on the current state of your knowledge. If you look at a stream of "random numbers" from a random number generator, you assume that each next number is uncertain, but if you know the formula for calculating the "random numbers", the next number is known and accordingly contains no information.
Thus, the definition Shannon gave for information is in many cases suitable for machines, but does not seem to match the human understanding of the word. It is for this reason that "Information Theory" should have been called "Communication Theory". However, it is too late to change the definitions (thanks to which the theory gained its initial popularity, and which still make people think that this theory deals with "information"), so we have to live with them, but at the same time you must clearly understand how far Shannon's definition of information is from its commonly used meaning. Shannon's information deals with something completely different, namely uncertainty.
Here's what to think about when you suggest any terminology. How much does a proposed definition, such as Shannon's definition of information, fit with your original idea, and how different is it? There is almost no term that exactly reflects your previous vision of the concept, but in the end, it is the terminology used that reflects the meaning of the concept, so formalizing something through clear definitions always introduces some noise.
Consider a system whose alphabet consists of q symbols, each with probability pi. In this case the average amount of information in the system (its expected value) is:

H = Σ pi log2(1/pi)
This is called the entropy of the system with the probability distribution {pi}. We use the term "entropy" because the same mathematical form occurs in thermodynamics and statistical mechanics. That is why the term "entropy" creates around itself a certain aura of importance, which, in the end, is not justified. The same mathematical notation does not imply the same interpretation of symbols!
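As a sketch (my illustration), the entropy of a distribution can be computed as:

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum p_i * log2(p_i), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))  # a fair coin: 1.0 bit per toss
print(entropy_bits([0.9, 0.1]))  # a biased coin: ~0.469 bits per toss
```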
The entropy of the probability distribution plays a major role in coding theory. The Gibbs inequality for two different probability distributions pi and qi is one of the important consequences of this theory. We have to prove that

Σ pi log2(pi/qi) ≥ 0.
The proof is based on the obvious graph (fig. 13.I), which shows that

ln(x) ≤ x - 1,

with equality achieved only at x = 1. Apply the inequality with x = qi/pi to each term of the sum on the left side:

Σ pi ln(qi/pi) ≤ Σ pi (qi/pi - 1) = Σ qi - Σ pi = 0,

which is equivalent to the stated inequality, since changing the base of the logarithm only multiplies by a positive constant.
If the alphabet of the communication system consists of q symbols, then taking the probability of each symbol qi = 1/q and substituting into the Gibbs inequality, we obtain

Σ pi log2(1/pi) ≤ Σ pi log2(q) = log2(q).
Figure 13.I
This says that if all q symbols are transmitted with the same probability 1/q, then the entropy reaches its maximum value log2(q); otherwise the strict inequality holds.
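A quick numeric check (my illustration) that no distribution over q symbols beats the uniform one:

```python
import math
import random

def entropy_bits(probs):
    """Shannon entropy H = -sum p_i * log2(p_i), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

random.seed(1)
q = 4
uniform = [1.0 / q] * q

# Random distributions over q symbols never exceed the uniform entropy.
for _ in range(1000):
    raw = [random.random() for _ in range(q)]
    total = sum(raw)
    p = [x / total for x in raw]
    assert entropy_bits(p) <= entropy_bits(uniform) + 1e-9

print(entropy_bits(uniform))  # log2(4) = 2.0
```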
In the case of a uniquely decodable code with codeword lengths li, we have Kraft's inequality

K = Σ 2^(-li) ≤ 1.
Now if we define the pseudoprobabilities

Qi = 2^(-li) / K,

where of course Σ Qi = 1, then from the Gibbs inequality it follows that

Σ pi log2(1/pi) ≤ Σ pi log2(1/Qi),

and applying some algebra (remember that K ≤ 1, so we can drop the logarithmic term log2(K) ≤ 0, perhaps tightening the inequality later), we get

H ≤ Σ pi li = L,

where L is the average code length.
Thus, the entropy is a lower bound for the average codeword length L of any symbol-by-symbol encoding. This is Shannon's theorem for a noiseless channel.
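Here is a small numeric check (my illustration, for a hypothetical prefix code, not an example from the book) of both inequalities at once:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A hypothetical prefix code with codewords 0, 10, 110, 111.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

K = sum(2.0 ** -l for l in lengths)             # Kraft sum
L = sum(p * l for p, l in zip(probs, lengths))  # average code length
H = entropy_bits(probs)

assert K <= 1.0  # Kraft's inequality holds for this decodable code
assert H <= L    # entropy is a lower bound on the average length
print(K, L, H)   # 1.0 1.75 1.75 - this code meets the bound exactly
```

Because every probability here is a power of 1/2, the average length equals the entropy; for other distributions L is strictly larger.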
Now consider the main theorem on the limitations of communication systems in which information is transmitted as a stream of independent bits and there is noise. It is assumed that the probability of correctly transmitting one bit is P > 1 / 2, and the probability that the value of a bit will be inverted during transmission (an error will occur) is equal to Q = 1 - P. For convenience, we assume that the errors are independent and the error probability is the same for each sent bit - that is, there is "white noise" in the communication channel.
Suppose a long stream of n bits is encoded into a single message; this is the n-dimensional extension of the one-bit code. We will define the value of n later. Consider a message of n bits as a point in n-dimensional space. Since we have an n-dimensional space - and for simplicity we will assume that each message has the same probability of occurring - there are M possible messages (M will also be determined later), and therefore the probability of sending any particular message is

P(ai) = 1/M.
Figure 13.II (sender)
Next, consider the idea of channel capacity. Without going into details, the channel capacity is defined as the maximum amount of information that can be reliably transmitted over a communication channel, assuming the most efficient coding is used. It is not possible to transmit more information through a communication channel than its capacity allows. This can be proved for the binary symmetric channel (which we use in our case). For bit-by-bit transmission, the channel capacity is given by

C = 1 + P log2(P) + Q log2(Q),

where, as before, P is the probability of no error in any sent bit. When sending n independent bits, the channel capacity is

nC = n(1 + P log2(P) + Q log2(Q)).
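In code, the per-bit capacity formula above looks like this (my illustration):

```python
import math

def bsc_capacity(P: float) -> float:
    """Capacity per bit of a binary symmetric channel:
    C = 1 + P*log2(P) + Q*log2(Q), with Q = 1 - P."""
    Q = 1.0 - P
    c = 1.0
    if P > 0.0:
        c += P * math.log2(P)
    if Q > 0.0:
        c += Q * math.log2(Q)
    return c

print(bsc_capacity(1.0))  # error-free channel: 1 bit per bit sent
print(bsc_capacity(0.9))  # ~0.531 bits per bit sent
print(bsc_capacity(0.5))  # coin-flip channel: nothing gets through
```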
If we are close to the channel capacity, then we must send almost that amount of information with each of the symbols ai, i = 1, ..., M. Since the probability of occurrence of each symbol ai is 1/M, its information content is

I(ai) = log2(M);

so when we send any of the M equiprobable messages ai, we must have

log2(M) = n(C - e1)

for some arbitrarily small e1 > 0.
When sending n bits, we expect nQ errors to occur; in practice, for a message of n bits we will have approximately nQ errors in the received message. For large n, the relative width of the distribution of the number of errors (the standard deviation is sqrt(nPQ), so the spread relative to n is sqrt(PQ/n)) becomes narrower as n grows.
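This concentration is easy to see numerically. A small Monte Carlo sketch (my illustration, with arbitrary parameters):

```python
import random

random.seed(2)
Q = 0.1          # per-bit error probability
P = 1.0 - Q

# Empirical relative spread of the error count: the width of the
# binomial distribution grows like sqrt(n*P*Q), so the spread
# relative to n shrinks like sqrt(P*Q/n).
spread = {}
for n in [100, 2500]:
    trials = [sum(random.random() < Q for _ in range(n)) for _ in range(1000)]
    mean = sum(trials) / len(trials)
    var = sum((t - mean) ** 2 for t in trials) / len(trials)
    spread[n] = var ** 0.5 / n
    print(n, round(mean / n, 3), round(spread[n], 4))
```

The mean error fraction stays near Q, while the relative spread drops by a factor of about 5 when n grows by a factor of 25, as sqrt(PQ/n) predicts.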
So, from the sender's side, I take a message ai to send and draw a sphere around it with radius

r = n(Q + e2),

which exceeds, by the amount ne2, the expected number of errors nQ (figure 13.II). If n is large enough, the probability that the received message point bj falls outside this sphere is arbitrarily small. Let's sketch the situation as seen from the transmitter: the number of errors between the transmitted message ai and the received message bj follows a distribution that is approximately normal, peaking at nQ. For any given e2, there is an n large enough that the probability that the received point bj falls outside my sphere is as small as you like.
Now consider the same situation from your side (fig. 13.III). On the receiver's side there is a sphere S(r) of the same radius r around the received point bj in n-dimensional space, such that if the received message bj is inside my sphere, then the message ai that I sent is inside your sphere.
How can an error occur? An error can occur in the cases described in the table below:
Figure 13.III
Here we see that if the sphere built around the received point contains at least one other point corresponding to a possible sent codeword, then an error occurred during transmission, since you cannot determine which of these messages was transmitted. The sent message is decoded without error only if the point corresponding to it is inside the sphere and no other point possible in this code lies in the same sphere.
We have a mathematical expression for the error probability Pe when the message ai was sent:

Pe(ai) = P{bj outside S(r) around ai} + P{bj inside S(r) around ai} · P{some other code point ak inside S(r) around bj}.

We can drop the first factor in the second term, taking it as 1. Thus we get the inequality

Pe(ai) ≤ P{bj outside S(r) around ai} + Σ over k ≠ i of P{ak inside S(r) around bj}.

Obviously, by symmetry each term of this sum is the same. Consequently

Pe(ai) ≤ P{bj outside S(r) around ai} + (M - 1) · P{ak inside S(r) around bj}.

Assuming n is large enough, the first term can be made as small as desired, say less than some number d. Therefore we have

Pe(ai) ≤ d + (M - 1) · P{ak inside S(r) around bj}.
Now consider how to construct a simple codebook for encoding M messages of n bits each. Having no idea how to build a code (error-correcting codes had not yet been invented), Shannon chose random coding: flip a coin for each of the n bits of a message, and repeat the process for all M messages. There are nM coin tosses in total, so there are

2^(nM)

possible codebooks, all having the same probability 1/2^(nM). Of course, the random process of creating a codebook means there is a chance of duplicates, as well as of code points lying close to each other and therefore being a source of probable errors. We need to prove that, for n large enough, this happens with a probability smaller than any small chosen error level.
The crucial point is that Shannon averaged over all possible codebooks to find the average error! We will use the symbol Av[.] to denote the average over the set of all possible random codebooks. Averaging over the constant d of course yields d, and since on average each term of the sum is the same as every other,

Av[Pe(ai)] ≤ d + (M - 1) · Av[P{ak inside S(r) around bj}],

which can be increased (M - 1 goes to M) to

Av[Pe(ai)] ≤ d + M · Av[P{ak inside S(r) around bj}].
For any particular message, averaging over all codebooks makes the encoding run through all possible values, so the average probability that a random code point falls inside the sphere is the ratio of the volume of the sphere to the total volume of the space, 2^n. The number of points in a sphere of radius ns is

Σ (k = 0 to ns) C(n, k),

where s = Q + e2 < 1/2, and ns must be an integer.
The last term on the right is the largest in this sum. First, we estimate its value using Stirling's formula for factorials. We then look at the factor by which each term is reduced compared with its right-hand neighbor, note that this reduction grows as we move to the left, and so we can: (1) bound the sum by a geometric progression with this initial ratio, (2) extend the geometric progression from ns terms to an infinite number of terms, (3) compute the sum of the infinite geometric progression (standard algebra, nothing significant), and finally obtain the bound (for sufficiently large n)

Σ (k = 0 to ns) C(n, k) < 2^(nH(s)),

up to factors that are insignificant for large n.
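The bound can be checked numerically. A small sketch (my illustration) comparing the per-bit logarithm of the sphere volume with H(s):

```python
import math

def binary_entropy(s: float) -> float:
    """H(s) = -s*log2(s) - (1-s)*log2(1-s), in bits."""
    return -(s * math.log2(s) + (1 - s) * math.log2(1 - s))

def log2_sphere_volume(n: int, s: float) -> float:
    """log2 of the number of points in a Hamming sphere of radius n*s,
    i.e. of sum over k = 0..n*s of C(n, k)."""
    total = sum(math.comb(n, k) for k in range(int(n * s) + 1))
    return math.log2(total)

s = 0.3
for n in [100, 1000, 4000]:
    # The per-bit log-volume approaches H(s) from below as n grows.
    print(n, round(log2_sphere_volume(n, s) / n, 4), round(binary_entropy(s), 4))
```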
Notice how the entropy H(s) appeared through the binomial identity. Note that the Taylor expansion of H(s) = H(Q + e2) gives an estimate that keeps only the first derivative and ignores all the others. Now let's assemble the final expression:

Av[Pe] ≤ d + M · 2^(nH(Q + e2)) / 2^n = d + 2^(n(C - e1)) · 2^(-n(1 - H(Q + e2))) = d + 2^(-n(e1 - e3)),

where

1 - H(Q + e2) = C - e3,

with e3 coming from the first-order Taylor expansion of H(Q + e2).
All we have to do is choose e2 so that e3 < e1; then the last term becomes arbitrarily small for sufficiently large n. Therefore, the average error Pe can be made arbitrarily small while the transmission rate stays arbitrarily close to the channel capacity C.
If the average over all codes has a sufficiently small error, then at least one code must be suitable, hence there is at least one suitable coding system. This is an important result obtained by Shannon - "Shannon's noisy channel theorem", although it should be noted that he proved this for a much more general case than for the simple binary symmetric channel used by me. For the general case, the mathematical calculations are much more complicated, but the ideas are not so different, so very often, using the example of a particular case, you can reveal the true meaning of the theorem.
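The whole argument can be caricatured in a toy simulation (my illustration; the parameters n, M, Q are arbitrary and far from the asymptotic regime): build a random codebook, pass a codeword through white noise, and decode to the nearest codeword:

```python
import random

random.seed(3)

def transmit(word, Q):
    """Flip each bit independently with probability Q (white noise)."""
    return [b ^ (random.random() < Q) for b in word]

def distance(a, b):
    """Hamming distance between two bit lists."""
    return sum(x != y for x, y in zip(a, b))

n, M, Q = 60, 16, 0.05
# Shannon-style random codebook: M codewords of n coin-flip bits each.
codebook = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]

trials, errors = 500, 0
for _ in range(trials):
    i = random.randrange(M)
    received = transmit(codebook[i], Q)
    # Decode to the nearest codeword (the sphere around the received point).
    decoded = min(range(M), key=lambda k: distance(codebook[k], received))
    errors += decoded != i

print(errors / trials)
```

Even at this small scale the observed error rate is tiny, because random 60-bit codewords sit far apart while the noise flips only about nQ = 3 bits per message.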
Let's criticize the result. We kept repeating: "for sufficiently large n". But how big is n? Very, very large, if you really want to be close to the channel capacity and still be sure of correct data transfer! So big that, in fact, you would have to wait a very long time to accumulate a message of that many bits before encoding it. And the size of the random codebook will be simply huge (after all, such a codebook cannot be represented in any shorter form than the complete list of all M codewords of n bits each, even though both n and M are very large)!
Error-correcting codes avoid waiting for a very long message and then encoding and decoding it through very large codebooks, because they dispense with codebooks altogether and use ordinary computation instead. In simple theory, such codes tend to give up the ability to get close to channel capacity while still keeping the error rate low, but when a code corrects a large number of errors, it performs well. In other words, if you allocate some channel capacity to error correction, then you must use that error-correcting capacity most of the time, i.e., a large number of errors must be corrected in each message sent, otherwise you waste this capacity.
At the same time, the theorem proved above is by no means meaningless! It shows that efficient transmission systems must use clever coding schemes for very long bit strings. An example is space probes that have flown beyond the outer planets; as they recede from the Earth and the Sun, they are forced to correct more and more errors in each data block: some use solar panels providing about 5 watts, others use nuclear power sources providing about the same power. The low power of the supply, the small size of the transmitter dishes and the limited size of the receiver dishes on Earth, the enormous distance the signal must travel - all this requires codes with a high level of error correction to build an effective communication system.
Let's go back to the n-dimensional space we used in the proof above. In discussing it, we showed that almost the entire volume of a sphere is concentrated near its outer surface; thus, the sent signal will almost certainly lie near the surface of the sphere built around the received signal, even for a relatively small radius of that sphere. It is therefore not surprising that the received signal, after correction of an arbitrarily large number of errors, nQ, turns out to be arbitrarily close to a signal without errors. The channel capacity we discussed earlier is the key to understanding this phenomenon. Note that the analogous spheres built for error-correcting Hamming codes do not overlap each other. The large number of nearly orthogonal dimensions in n-dimensional space shows why we can fit M spheres into the space with little overlap. If we allow a small, arbitrarily small overlap, which can lead to only a small number of decoding errors, we get a dense packing of spheres in the space. Hamming guaranteed a certain level of error correction; Shannon guaranteed a low probability of error while keeping the actual throughput arbitrarily close to the channel capacity, which Hamming codes cannot do.
Information theory does not tell us how to design an efficient system, but it points the way towards efficient communication systems. It is a valuable tool for building communication systems between machines, but, as noted earlier, it has little to do with how people communicate with each other. The extent to which biological inheritance resembles technical communication systems is simply unknown, so it is not currently clear how far information theory applies to genes. We have no choice but to try: success will show us the machine-like nature of this phenomenon, while failure will point to other significant aspects of the nature of information.
Let's not digress too much. We have seen that all original definitions should, to a greater or lesser extent, express the essence of our original beliefs, but they always carry some degree of distortion, and therefore they never apply exactly. It is traditionally accepted that, in the end, the definition we use actually defines the essence; but a definition only tells us how to proceed with things, and in no way carries meaning by itself. The postulational approach, so strongly favored in mathematical circles, leaves much to be desired in practice.
Now we'll look at the example of IQ tests, where the definition is as circular as you like and misleads you as a result. A test is created that is supposed to measure intelligence. It is then revised to be as internally consistent as possible, and then published and calibrated by a simple method so that the measured "intelligence" turns out to be normally distributed (on the calibration curve, of course). All definitions must be re-examined, not only when they are first proposed, but much later, when they are used in the conclusions drawn. To what extent are the boundaries of the definitions appropriate for the problem at hand? How often are definitions given under one set of conditions applied under substantially different conditions? This happens quite often! In the humanities, which you will inevitably encounter in your life, it happens even more often.
So one of the purposes of this presentation of information theory, in addition to demonstrating its usefulness, was to warn you of this danger, or to show you exactly how to use it to get the desired result. It has long been noted that the initial definitions determine what you end up with to a much greater extent than it seems. Initial definitions require you to pay close attention, not only in any new situation, but also in areas that you have been working with for a long time. This will allow you to understand to what extent the results obtained are a tautology, and not something useful.
The famous story of Eddington tells of people who were fishing in the sea with a net. By studying the size of the fish they caught, they determined the smallest size of fish that can be found in the sea! Their conclusion was due to the instrument used, not to reality.
To be continued ...
If you want to help with the translation, layout and publication of the book, write a private message or email [email protected]
We are especially looking for someone to help translate the bonus chapter, which exists only on video. (We translate it in 10-minute chunks; the first 20 minutes are already taken.)