Richard Hamming: Chapter 13. Information Theory

We did it!

"The goal of this course is to prepare you for your technical future."

Hey Habr! Remember the awesome article "You and Your Research" (+219, 2588 bookmarks, 429k reads)?

So, Hamming (yes, yes, that of the self-checking and self-correcting Hamming codes) has a whole book based on his lectures. We are translating it, because the man speaks his mind.

This book is not just about IT, it is a book about the way of thinking of incredibly cool people. "It's not just a charge of positive thinking; it describes the conditions that increase the chances of doing great work."

Thanks to Andrey Pakhomov for the translation.

Information Theory was developed by C. E. Shannon in the late 1940s. Bell Labs management insisted that he call it "Communication Theory", since that is a far more accurate name. For obvious reasons, the name "Information Theory" has a much greater impact on the public, which is why Shannon chose it, and it is the name we know to this day. The very name suggests that the theory deals with information, which makes it important as we move deeper into the information age. In this chapter I will touch on several main conclusions of this theory and give not rigorous but rather intuitive proofs of some of its individual results, so that you understand what "Information Theory" really is, where you can apply it, and where you cannot.

First of all, what is "information"? Shannon equates information with uncertainty. He chose the negative logarithm of the probability of an event as the quantitative measure of the information you get when an event with probability p occurs. For example, if I tell you that it is foggy in Los Angeles, then p is close to 1, and that really does not give us much information. But if I say that it is raining in Monterey in June, there is uncertainty in the message, and it contains more information. A certain event contains no information at all, since log 1 = 0.

Let's dwell on this in more detail. Shannon believed that the quantitative measure of information should be a continuous function of the event probability p, and that for independent events it should be additive: the amount of information obtained from the joint occurrence of two independent events should equal the sum of the amounts of information from each of them. For example, the outcome of a die roll and the outcome of a coin toss are usually treated as independent events. Let's translate this into the language of mathematics. If I(p) is the amount of information contained in an event of probability p, then for a joint event consisting of two independent events x with probability p1 and y with probability p2 we get

$$I(p_1 p_2) = I(p_1) + I(p_2) \qquad (x \text{ and } y \text{ independent events}).$$

This is Cauchy's functional equation, true for all $p_1$ and $p_2$. To solve this functional equation, suppose that

$$p_1 = p_2 = p;$$

this gives

$$I(p^2) = 2\,I(p).$$

If $p_1 = p^2$ and $p_2 = p$, then

$$I(p^3) = 3\,I(p),$$

and so on. Extending this process by the standard method used for exponents, we find that for all rational numbers $m/n$

$$I\!\left(p^{m/n}\right) = \frac{m}{n}\,I(p).$$

From the assumed continuity of the information measure, it follows that the logarithm is the only continuous solution of Cauchy's functional equation.

In information theory, it is customary to take the base of the logarithm equal to 2, so a binary choice contains exactly 1 bit of information. Therefore, information is measured by the formula

$$I(p) = \log_2 \frac{1}{p} = -\log_2 p.$$
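As a quick illustration (not part of Hamming's text), here is a minimal Python sketch of this measure; the function name and the example probabilities are my own.

```python
from math import log2

def information_bits(p: float) -> float:
    """Shannon's measure of the information in an event of probability p, in bits: I(p) = -log2(p)."""
    return -log2(p)

# A near-certain event carries almost no information; a rare event carries a lot.
print(information_bits(0.99))  # ~0.014 bits (fog in Los Angeles)
print(information_bits(0.01))  # ~6.64 bits  (rain in Monterey in June)

# Additivity for independent events: I(p1 * p2) = I(p1) + I(p2).
p1, p2 = 0.5, 0.25
assert abs(information_bits(p1 * p2) - (information_bits(p1) + information_bits(p2))) < 1e-12
```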

Let's pause and see what has happened so far. First of all, we have not given a definition of the concept of "information"; we have simply defined a formula for its quantitative measure.

Secondly, this measure depends on uncertainty, and although it is quite suitable for machines - for example, telephone systems, radio, television, computers, etc. - it does not reflect the normal human attitude to information.

Thirdly, it is a relative measure: it depends on the current state of your knowledge. If you look at a stream of "random numbers" from a random number generator, you assume that each next number is uncertain, but if you know the formula for computing the "random numbers", the next number is known and therefore carries no information.

Thus, the definition Shannon gave for information is in many cases suitable for machines, but it does not seem to match the human understanding of the word. It is for this reason that "Information Theory" should have been called "Communication Theory". However, it is too late to change the definitions (which gave the theory its initial popularity, and which still make people think that it deals with "information"), so we have to live with them; but at the same time you must clearly understand how far Shannon's definition of information is from its commonly used meaning. Shannon's information deals with something quite different, namely uncertainty.

Here is something to think about whenever you propose any terminology. How well does a proposed definition, such as Shannon's definition of information, agree with your original idea, and how much does it differ? Almost no term exactly reflects your earlier vision of a concept; yet in the end it is the terminology used that fixes the meaning of the concept, so formalizing something through clear definitions always introduces some noise.

Consider a system whose alphabet consists of q symbols with probabilities $p_i$. In this case the average amount of information in the system (its expected value) is:

$$H = \sum_{i=1}^{q} p_i \log_2 \frac{1}{p_i}.$$

This is called the entropy of the system with the probability distribution {pi}. We use the term "entropy" because the same mathematical form occurs in thermodynamics and statistical mechanics. That is why the term "entropy" creates around itself a certain aura of importance, which, in the end, is not justified. The same mathematical notation does not imply the same interpretation of symbols!
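Purely for illustration, here is a minimal Python sketch of this expected value; the function name and the example distributions are my own.

```python
from math import log2

def entropy_bits(probs):
    """Entropy H = sum_i p_i * log2(1/p_i), in bits; terms with p_i = 0 contribute nothing."""
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))   # 1.0 bit   -- a fair binary choice
print(entropy_bits([0.9, 0.1]))   # ~0.469    -- a heavily biased choice carries less
print(entropy_bits([0.25] * 4))   # 2.0 bits  -- four equally likely symbols
```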

The entropy of a probability distribution plays a major role in coding theory. One of the important results is the Gibbs inequality for two different probability distributions $p_i$ and $q_i$: we have to prove that

$$\sum_{i=1}^{q} p_i \ln \frac{q_i}{p_i} \le 0.$$

The proof is based on the obvious graph, Figure 13.I, which shows that

$$\ln x \le x - 1,$$

with equality only at x = 1. Applying the inequality to each term of the sum on the left-hand side:

$$\sum_{i=1}^{q} p_i \ln \frac{q_i}{p_i} \le \sum_{i=1}^{q} p_i \left(\frac{q_i}{p_i} - 1\right) = \sum_{i=1}^{q} q_i - \sum_{i=1}^{q} p_i = 1 - 1 = 0.$$

If the alphabet of the communication system consists of q symbols, then taking the probability of transmitting each symbol to be $q_i = 1/q$ and substituting, we obtain from the Gibbs inequality

$$\sum_{i=1}^{q} p_i \ln \frac{1/q}{p_i} \le 0, \qquad\text{that is,}\qquad \sum_{i=1}^{q} p_i \ln \frac{1}{p_i} \le \ln q.$$

Figure 13.I ($\ln x \le x - 1$, with equality only at $x = 1$)

This says that if the probability of transmitting each of the q symbols is the same and equals 1/q, then the entropy attains its maximum value ln q; otherwise the inequality is strict.
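Here is a short numerical check of the Gibbs inequality and of this maximum-entropy statement, given only as a sanity test; the random distributions and the function names are my own.

```python
from math import log
import random

def gibbs_lhs(p, q):
    """Left-hand side of the Gibbs inequality, sum_i p_i * ln(q_i / p_i); it should never exceed 0."""
    return sum(pi * log(qi / pi) for pi, qi in zip(p, q) if pi > 0)

def entropy_nats(p):
    """Entropy in natural-log units, sum_i p_i * ln(1/p_i)."""
    return sum(pi * log(1.0 / pi) for pi in p if pi > 0)

def random_distribution(q):
    weights = [random.random() for _ in range(q)]
    total = sum(weights)
    return [w / total for w in weights]

q = 5
for _ in range(1000):
    p1, p2 = random_distribution(q), random_distribution(q)
    assert gibbs_lhs(p1, p2) <= 1e-12           # Gibbs inequality
    assert entropy_nats(p1) <= log(q) + 1e-12   # entropy never exceeds ln q
print("Gibbs inequality and the ln q bound held in every trial.")
```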

In the case of a uniquely decodable code, we have Kraft's inequality

$$K = \sum_{i} 2^{-l_i} \le 1,$$

where $l_i$ are the lengths of the codewords.

Now if we define pseudoprobabilities

$$Q_i = \frac{2^{-l_i}}{K},$$

where of course $\sum_i Q_i = 1$, then from the Gibbs inequality we have

$$\sum_{i} p_i \log_2 \frac{Q_i}{p_i} \le 0,$$

and after a little algebra (remember that $K \le 1$, so we can drop the $\log_2 K$ term, possibly strengthening the inequality), we get

$$H = \sum_{i} p_i \log_2 \frac{1}{p_i} \le \sum_{i} p_i\, l_i = L,$$

where L is the average code length.

Thus, the entropy is a lower bound on the average codeword length L of any symbol-by-symbol encoding. This is Shannon's theorem for a noiseless channel.
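The following sketch checks the Kraft inequality for a concrete prefix code and compares the average codeword length L with the entropy H; the code and the probabilities are a made-up example of mine, not one from the text.

```python
from math import log2

# A made-up prefix code for four symbols: 0, 10, 110, 111.
probs   = [0.5, 0.25, 0.125, 0.125]
lengths = [1,   2,    3,     3]

K = sum(2 ** (-l) for l in lengths)
H = sum(p * log2(1.0 / p) for p in probs)
L = sum(p * l for p, l in zip(probs, lengths))

print(f"Kraft sum K = {K}        (<= 1 for any uniquely decodable code)")
print(f"Entropy   H = {H} bits")
print(f"Average   L = {L} bits   (the theorem guarantees H <= L)")
# Here H equals L because every probability is a power of 1/2; in general L is strictly larger.
```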

Now consider the main theorem on the limitations of communication systems in which information is transmitted as a stream of independent bits and there is noise. It is assumed that the probability of correctly transmitting one bit is P > 1 / 2, and the probability that the value of a bit will be inverted during transmission (an error will occur) is equal to Q = 1 - P. For convenience, we assume that the errors are independent and the error probability is the same for each sent bit - that is, there is "white noise" in the communication channel.

The way we send a long stream of n bits as a single message is an n-dimensional extension of the one-bit code. We will determine the value of n later. Consider a message of n bits as a point in n-dimensional space. Since we have an n-dimensional space, and for simplicity we assume that every message is equally likely to occur, there are M possible messages (M will also be determined later); therefore, the probability of sending any particular message is

$$P(a_i) = \frac{1}{M}.$$

Figure 13.II (the sender's view)

Next, consider the idea of channel capacity. Without going into details, channel capacity is defined as the maximum amount of information that can be reliably transmitted over a communication channel, assuming the most efficient coding is used. It hardly needs arguing that more information cannot be pushed through a channel than its capacity allows; this can be proved for the binary symmetric channel, which we use in our case. When sending bit by bit, the channel capacity is given by

$$C = 1 + P \log_2 P + Q \log_2 Q,$$

where, as before, P is the probability of no error in any bit sent. When sending n independent bits, the channel capacity is defined as

$$nC = n\,(1 + P \log_2 P + Q \log_2 Q).$$
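A minimal Python sketch of this capacity formula for the binary symmetric channel (the function name is my own):

```python
from math import log2

def bsc_capacity(P: float) -> float:
    """Capacity per bit of a binary symmetric channel: C = 1 + P*log2(P) + Q*log2(Q), Q = 1 - P."""
    Q = 1.0 - P
    if P in (0.0, 1.0):
        return 1.0  # a channel that never (or always) flips a bit still carries one full bit
    return 1.0 + P * log2(P) + Q * log2(Q)

print(bsc_capacity(0.99))  # ~0.919 bits per transmitted bit
print(bsc_capacity(0.90))  # ~0.531
print(bsc_capacity(0.50))  # 0.0 -- pure noise, nothing gets through
```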

If we are to operate near channel capacity, then we must send almost that amount of information with each of the symbols $a_i$, $i = 1, \ldots, M$. Considering that the probability of each symbol $a_i$ is 1/M, we get

$$I(a_i) = \log_2 M;$$

when we send any of the M equiprobable messages $a_i$, we must therefore have

$$\log_2 M = n(C - e_1), \qquad\text{that is,}\qquad M = 2^{\,n(C - e_1)}$$

for some small $e_1 > 0$.

When sending n bits, we expect nQ errors to occur. In practice, a received message of n bits will contain approximately nQ errors. For large n, the relative spread (spread = width of the distribution) of the number of errors becomes ever narrower as n grows.
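A quick simulation of this narrowing, assuming the white-noise channel model described above; the parameter values are arbitrary choices of mine.

```python
import random

def relative_spread(n: int, Q: float, trials: int = 1000) -> float:
    """Simulate `trials` transmissions of n bits with bit-error probability Q and
    return (standard deviation of the error count) / (expected error count nQ)."""
    counts = [sum(random.random() < Q for _ in range(n)) for _ in range(trials)]
    mean = sum(counts) / trials
    variance = sum((c - mean) ** 2 for c in counts) / trials
    return variance ** 0.5 / (n * Q)

Q = 0.1
for n in (100, 1000, 5000):
    print(n, round(relative_spread(n, Q), 4))  # the ratio shrinks roughly like 1/sqrt(n)
```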

So, from the sender side, I take the message ai to send and draw a sphere around it with radius

$$r = n\,(Q + e_2),$$

which is larger than the expected number of errors nQ by an amount $n e_2$ (Figure 13.II). If n is large enough, the probability that the received message point $b_j$ falls outside this sphere is arbitrarily small. Let me sketch the situation as seen from the transmitter: the number of errors separating the transmitted message $a_i$ from the received message $b_j$ has an (almost) normal distribution peaking at nQ. For any given $e_2$ there is an n so large that the probability of the received point $b_j$ falling outside my sphere is as small as you please.

Now consider the same situation from your side (Figure 13.III). On the receiver side there is a sphere S(r) of the same radius r around the received point $b_j$ in n-dimensional space, such that if the received message $b_j$ lies inside my sphere, then the message $a_i$ I sent lies inside your sphere.

How can an error occur? An error can occur in the cases described in the table below:

Sent point $a_i$ inside the receiver's sphere $S(b_j)$? | Another code point inside $S(b_j)$? | Result
yes | no | decoded correctly
yes | yes | error
no | yes or no | error

Figure 13.III (the receiver's sphere around the received point $b_j$)

Here we see that if the sphere drawn around the received point contains at least one other point corresponding to a possible sent message, then the transmission is in error, since you cannot determine which of these messages was sent. The sent message is decoded without error only if the point corresponding to it lies inside the sphere and no other code point lies in the same sphere.

We therefore have an equation for the probability of error $P_e$ when the message $a_i$ was sent:

$$P_e(a_i) = P\{b_j \notin S(a_i)\} + P\{b_j \in S(a_i)\}\; P\{\text{some } a_k \in S(b_j),\ k \ne i\}.$$

We can drop the first factor in the second term, bounding it by 1. Thus we get the inequality

$$P_e(a_i) \le P\{b_j \notin S(a_i)\} + P\{\text{some } a_k \in S(b_j),\ k \ne i\}.$$

Obviously,

$$P\{\text{some } a_k \in S(b_j),\ k \ne i\} \le \sum_{k \ne i} P\{a_k \in S(b_j)\}.$$

Consequently

$$P_e(a_i) \le P\{b_j \notin S(a_i)\} + \sum_{k \ne i} P\{a_k \in S(b_j)\}.$$

The first term on the right is just the probability of more than $n(Q + e_2)$ errors among the n sent bits:

$$P\{b_j \notin S(a_i)\} = P\{\text{more than } n(Q + e_2) \text{ errors}\}.$$

For n large enough, this first term can be made as small as desired, say less than some number d. Therefore we have

$$P_e(a_i) \le d + \sum_{k \ne i} P\{a_k \in S(b_j)\}.$$

Now consider how to construct a simple substitution code (a codebook) for encoding M messages of n bits each. Having no idea how to build such a code (error-correcting codes had not yet been invented), Shannon chose random coding. Flip a coin for each of the n bits of each message and repeat for all M messages. There are nM coin tosses in total, so there are

$$2^{nM}$$

possible codebooks, each occurring with the same probability $2^{-nM}$. Of course, the random process of creating a codebook means there is a chance of duplicates, as well as of code points lying close to one another and therefore being a source of probable errors. We have to prove that this does not happen with probability above any small chosen error level, provided n is large enough.
The crucial point is that Shannon averaged over all possible codebooks to find the average error! We will use the symbol Av[.] to denote the average over the set of all possible random codebooks. Averaging the constant d of course gives the constant d, and since under the averaging each term of the sum is the same as every other term, we get

$$\mathrm{Av}[P_e(a_i)] \le d + (M - 1)\,\mathrm{Av}\!\left[P\{a_k \in S(b_j)\}\right],$$

which can only be increased by letting M − 1 go to M:

$$\mathrm{Av}[P_e(a_i)] \le d + M\,\mathrm{Av}\!\left[P\{a_k \in S(b_j)\}\right].$$

For any particular message, averaging over all codebooks runs the encoding through all possible values, so the average probability that a code point lies in the sphere is the ratio of the volume of the sphere to the total volume of the space. The volume of the sphere is

$$\sum_{k=0}^{ns} \binom{n}{k},$$

where $s = Q + e_2 < 1/2$ and ns must be an integer.

The last term of this sum is the largest. First we estimate its size using Stirling's formula for factorials. We then look at the ratio by which each term is reduced relative to the one after it, note that this reduction only gets stronger as we move to the left, and so we can: (1) bound the sum by the sum of a geometric series with this initial ratio, (2) extend the geometric series from ns terms to infinitely many, (3) sum the infinite geometric series (standard algebra, nothing significant), and finally obtain the limiting estimate (for sufficiently large n):

$$\sum_{k=0}^{ns} \binom{n}{k} \le 2^{\,nH(s)}, \qquad\text{where } H(s) = -s\log_2 s - (1 - s)\log_2 (1 - s).$$

Notice how the entropy H(s) appeared in this estimate of the binomial sum. Note that in the Taylor expansion of H(s) = H(Q + e_2) we keep only the term with the first derivative and ignore all the others.
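The appearance of H(s) can be checked numerically: the sketch below compares the exact sphere volume (the sum of binomial coefficients) with the $2^{nH(s)}$ estimate; the values of n and s are arbitrary choices of mine.

```python
from math import comb, log2

def H(s: float) -> float:
    """Binary entropy function H(s) = -s*log2(s) - (1-s)*log2(1-s)."""
    return -s * log2(s) - (1 - s) * log2(1 - s)

def sphere_volume(n: int, s: float) -> int:
    """Number of points of n-dimensional binary space within Hamming distance n*s of a given point."""
    return sum(comb(n, k) for k in range(int(n * s) + 1))

s = 0.1
for n in (100, 500, 2000):
    exact_exponent = log2(sphere_volume(n, s)) / n      # (1/n) * log2 of the true volume
    print(n, round(exact_exponent, 4), round(H(s), 4))  # the two columns converge as n grows
```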

Now let's assemble the final expression:

$$\mathrm{Av}[P_e] \le d + M\,\frac{2^{\,nH(Q + e_2)}}{2^{n}} = d + 2^{\,n(C - e_1)}\, 2^{\,nH(Q + e_2)}\, 2^{-n} = d + 2^{-n(e_1 - e_3)},$$

where

$$e_3 = H(Q + e_2) - H(Q) \approx e_2\, H'(Q).$$

All we have to do is choose $e_2$ so that $e_3 < e_1$, and then the last term can be made arbitrarily small for n large enough. Therefore the average error $P_e$ can be made arbitrarily small while the rate stays arbitrarily close to the channel capacity C.

If the average over all codebooks has a sufficiently small error, then at least one codebook must be at least as good, hence there exists at least one suitable coding system. This is the important result obtained by Shannon, Shannon's noisy-channel coding theorem, though it should be noted that he proved it for a much more general case than the simple binary symmetric channel I have used. For the general case the mathematics is much more complicated, but the ideas are not so different, so very often the particular case reveals the true meaning of the theorem.
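To make the random-coding idea concrete, here is a small Monte Carlo sketch in the spirit of the argument above (a fresh random codebook per trial, a binary symmetric channel, nearest-codeword decoding); the parameters n, M and Q are toy values of mine, far too small to show behaviour near capacity.

```python
import random

def random_codebook(M: int, n: int):
    """M random codewords of n bits each -- Shannon's 'flip a coin for every bit'."""
    return [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]

def send_through_bsc(word, Q: float):
    """Flip each bit independently with probability Q (white noise)."""
    return [bit ^ (random.random() < Q) for bit in word]

def decode(received, codebook):
    """Nearest-codeword decoding by Hamming distance."""
    return min(range(len(codebook)),
               key=lambda i: sum(a != b for a, b in zip(codebook[i], received)))

def average_error_rate(n: int, M: int, Q: float, trials: int = 200) -> float:
    errors = 0
    for _ in range(trials):
        code = random_codebook(M, n)   # average over codebooks, as in Shannon's argument
        i = random.randrange(M)        # one of M equally likely messages
        if decode(send_through_bsc(code[i], Q), code) != i:
            errors += 1
    return errors / trials

Q = 0.05  # capacity is about 0.71 bits per bit; the rates below stay well under it
for n, M in ((10, 4), (50, 32), (100, 256)):
    rate = (M.bit_length() - 1) / n
    print(f"n={n:4d}  rate={rate:.2f}  average error rate ~ {average_error_rate(n, M, Q):.3f}")
```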

Let's criticize this result. We have repeated over and over: "for sufficiently large n". But how large is n? Very, very large, if you want to be both close to channel capacity and sure of correct data transmission! So large, in fact, that you would have to wait a very long time to accumulate a message of that many bits before encoding it, and the random codebook would be simply enormous (such a codebook cannot be represented in any form shorter than the complete list of all Mn bits, and both n and M are very large)!

Error-correcting codes avoid waiting for a very long message and then encoding and decoding it through very large codebooks, because they avoid codebooks altogether and use ordinary computation instead. In simple theory such codes tend to lose the ability to approach channel capacity while still keeping the error rate low, but when a code corrects a large number of errors it performs well. In other words, if you allocate some channel capacity to error correction, then you must use that error-correcting capability most of the time, i.e. a large number of errors must be corrected in each message sent, otherwise you waste that capacity.

At the same time, the theorem proved above is by no means meaningless! It shows that efficient transmission systems must use clever coding schemes for very long strings of bits. An example is the satellites that have flown beyond the outer planets; as they recede from the Earth and the Sun, they are forced to correct more and more errors in each data block: some satellites use solar panels providing about 5 watts, others use nuclear power sources providing about the same power. The weakness of the power supply, the small size of the transmitter dishes, the limited size of the receiver dishes on Earth, and the enormous distance the signal must travel all demand codes with a high level of error correction in order to build an effective communication system.

Let's return to the n-dimensional space we used in the proof above. In discussing it, we showed that almost the entire volume of a sphere is concentrated near its outer surface; thus the sent signal will almost certainly lie near the surface of the sphere built around the received signal, even for a relatively small radius of that sphere. It is therefore not surprising that the received signal, after the correction of an arbitrarily large number of errors, nQ, turns out to be arbitrarily close to a signal without errors. The channel capacity we discussed earlier is the key to understanding this phenomenon. Note that the analogous spheres built for error-correcting Hamming codes do not overlap one another. The large number of nearly orthogonal dimensions in n-dimensional space shows why we can fit M spheres into the space with little overlap. If we allow a small, arbitrarily small, overlap, which may lead to only a small number of decoding errors, we can obtain a dense packing of spheres in the space. Hamming guaranteed a certain level of error correction; Shannon guaranteed a low probability of error while keeping the actual rate arbitrarily close to the capacity of the communication channel, something Hamming codes cannot do.

Information theory does not tell us how to design an efficient system, but it points the way toward efficient communication systems. It is a valuable tool for building machine-to-machine communication systems, but, as noted earlier, it has little bearing on how people communicate with each other. The extent to which biological inheritance resembles a technical communication system is simply unknown, so it is not currently clear how far information theory applies to genes. We have no choice but to try, and if success shows us the machine-like nature of the phenomenon, then failure will point to other significant aspects of the nature of information.

Let's not digress too much. We have seen that all initial definitions should, to a greater or lesser extent, express the essence of our original beliefs, but they always carry some degree of distortion and are therefore never exactly applicable. It is traditionally accepted that, in the end, the definition defines the thing; but a definition only tells us how to handle it and does not convey any meaning. The postulational approach, so strongly favored in mathematical circles, leaves much to be desired in practice.

Now let's look at the example of IQ tests, where the definition is as circular as you like and misleads you as a result. A test is created that is supposed to measure intelligence. It is then revised to be as internally consistent as possible, and then it is published and calibrated in a simple way so that the measured "intelligence" turns out to be normally distributed (on the calibration curve, of course). All definitions need to be re-examined, not only when they are first proposed but much later, when they are used in the conclusions drawn. To what extent are the boundaries of a definition appropriate for the problem at hand? How often are definitions given under one set of conditions applied under quite different conditions? This happens rather often! In the humanities, which you will inevitably encounter in your life, it happens even more often.

So one of the purposes of this presentation of information theory, besides demonstrating its usefulness, was to warn you of this danger, or to show you exactly how to use it to obtain the desired result. It has long been noted that the initial definitions determine what you find in the end to a much greater extent than it seems. Initial definitions deserve close attention, not only in any new situation, but also in areas you have been working with for a long time. That will let you see to what extent the results obtained are a tautology and not something useful.

The famous story of Eddington tells of people who were fishing in the sea with a net. By studying the size of the fish they caught, they determined the smallest size of fish that can be found in the sea! Their conclusion was due to the instrument used, not to reality.

To be continued ...
