Table of Contents
This post is not about sharing solutions to those problems in the book. As the title indicates, this post is rather about sharing intuitions or interpretations of some results mentioned or alluded in some problems listed at the end of Chapter 2 of the book “Elements of Information Theory” second edition. The chapter deals with the notion of entropy and mutual information.
Some of my colleagues and I were reading some rather huge paper which draws from many basic ideas from many fields, and one of them was the notion of mutual information. We only had a grasp on its special case, the notion of entropy of a random variable and since the general notion was new to us, I decided to read at least first few chapters of “Elements of Information Theory”.
1. meaning of conditional mutual information
Given three (discrete) random variables X, Y, Z, the conditional mutual information I(X; Y | Z) intuitively means how much X says about Y (or equivalently how much Y says about X) under the assumption that the value of Z is already known. Try to use this intuition to make sense of the chain rule for mutual information and the data-processing inequality.
As for its special case, the intuitions for conditional entropy, see (already linked previously) Shannon’s entropy of random variables and partitions.
2. comments on some selected problems from Chapter 2
2.1. Problem 2.6
The problem asks to find examples where mutual information increases or decreases when conditioned to a third random variable. This points out that I(X;Y | Z) is not monotone in Z. One should contrast this with the following two facts:
- Conditional entropy H(X | Y) is monotone in the conditioning Y.
- Mutual information I(X;Y | Z) is monotone in X and Y.
2.2. Problem 2.14
Given two independent (discrete) random variables X and Y, the entropy of the sum X+Y is at least . This is perhaps related to why the central limit theorem works. If you keep adding i.i.d. random variables one by one, the entropy of the sum never decreases. So aside from special cases, we can expect that the entropy would get higher and higher. High entropy of the sum alone may not tell you much about the probability distribution of the sum, but the variance and the mean of the sum is also known, and IIRC the normal distribution maximizes entropy under the constraints on the mean and variance. It does not make a proof of the central limit theorem but I think it illustrates the relation between large degrees of freedom (in statistical mechanics) and probability distributions with maximal entropy.
Once you prove that the entropy of the independent sum X+Y is at least , you would also be able to see that the entropy of the independent modulo sum X+Y modulo something like 6 for example is also at least . If you have a biased dice, you may rightfully expect that you can simulate a less biased dice, by simply throwing the same dice several times and summing up the numbers modulo 6, and the entropy argument may be one way to justify this expectation.
2.3. Problem 2.23
The problem deals with binary random variables related by one linear equation . The given joint probability distribution is in some sense the most fair distribution under the constraint given by the linear equation.
The results are consistent with our intuition in that:
- knowing tells you nothing about
- and given that you already know , knowing tells you nothing about
- and at the last step when you know to , knowing tells you everything about (because of the linear equation).
The results can be different for other probability distributions while satisfying the same linear equation. For example, if we give some biased joint probability distribution, then it’s possible that knowing might tell you something about in the sense that . In fact, you can even choose a distribution so that knowing tells you everything about in the sense that . That is the case when we choose the most fair distribution under the constraint .
2.4. Problem 2.27
It asks to prove the grouping rule for entropy. It can be verified with a simple calculation, but maybe it is instructive to solve this instead by using auxiliary random variables and using already established equalities thus far.
2.5. Problem 2.41
One wishes to identify a random object X. Two i.i.d. random questions about the object are asked, eliciting answers . The problem asks to show that the two questions are less valuable than twice a single question.
At first, one might expect incorrectly that the two questions are worth twice a single question, but what about asking millions of questions about the outcome of a single dice roll? One million questions about the outcome is worth almost the same as two million questions, since by the time you ask the million’th, you would be almost certain of what the outcome of the dice roll was and the next million questions would not contribute much. The amount of information to be extracted with questions cannot grow linearly with the number of questions indefinitely and in fact there is an upper bound, namely, .
Some more thoughts reveal that it is because of independence (rather than in spite of independence) that two i.i.d questions are less valuable than twice a single question. Suppose X is the random position of a specific chess piece. If the first (random) question asked which row, then the next optimal question to ask would be to ask which column. Making subsequent optimal questions require dependence to previous questions and answers and therefore independent questions are less efficient than dependent questions, as anyone who has played Twenty Questions can attest.