Shannon’s entropy of random variables and partitions

This post has two goals:

  • use a hypothetical town with curious residents to give some intuitive understanding of the notion of entropy (of finite partitions or of discrete random variables).
  • give some examples of using well-known intuitive properties of entropy, rather than the definition directly, to prove some less trivial facts about entropy.

Also, while this post is not about ergodic theory, and neither ergodic theory nor measure theory is a prerequisite for most of it, I include a section to help students of ergodic theory see how probabilistic intuitions can guide them.

This post is not about entropy in thermodynamics (in physics), but about entropy in information theory (in mathematics), though the two are related.

The notions of the metric entropy of a dynamical system, the differential entropy of a continuous probability distribution, and the entropy rate of a process are also briefly discussed.

1. entropy for dummies

1.1. definition of entropy of r.v. and joint entropy

You probably heard that the entropy of a discrete random variable X with n distinct outcomes \{x_1, \cdots, x_n\} is defined as

H(X) = - \sum_{i=1}^n p_i \log p_i

where p_i is the probability of the event X = x_i. The RHS is sometimes denoted as H_n(p_1, \cdots, p_n). You might be wondering how we can make sense of that entropy formula intuitively. You might have heard that it quantifies the amount of uncertainty about the outcome of X, and might be wondering why we quantify it in this specific way among other possible ways.

Also, given two discrete random variables X and Y, the joint entropy of X and Y together is defined similarly as

H(X, Y) = - \sum_{i=1}^n p_i \log p_i

where p_i is the probability of the event (X, Y) = z_i and \{z_1, \cdots, z_n\} is the set of values that (X, Y) can take (one may simply take it to be the Cartesian product of the range of X and the range of Y). In other words, the joint entropy of X and Y is just the entropy of the (joint) random variable (X, Y). The joint entropy of more than two random variables is defined similarly.
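
For concreteness, here is a minimal Python sketch (not from the original post; the joint distribution and the names are made up) that computes H(X) and H(X, Y) straight from these formulas, using the natural logarithm:

```python
import math

def entropy(probs):
    """Shannon entropy -sum p log p, with 0 log 0 treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A made-up joint distribution P(X = x, Y = y) on {0, 1} x {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal distribution of X, obtained by summing out Y.
marginal_x = {}
for (x, _), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + p

print(entropy(marginal_x.values()))  # H(X)
print(entropy(joint.values()))       # H(X, Y), the entropy of the pair (X, Y)
```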


1.2. the town

Let me give you a (possibly somewhat flawed) story that demonstrates the intuition behind entropy. It will also demonstrate the main properties of entropy; that is, after you read this story, the main properties will become intuitively plausible results rather than just results that follow from calculation.

Imagine a town where very curious residents live. Each month, the mayor of this town throws ten fair coins to generate a random binary sequence of length 10. So we have 10 binary random variables: X_1, \ldots, X_{10}. The value of X_i is 1 if the i-th coin landed heads, and 0 otherwise. We also have other interesting random variables like X_1 + X_2 or (X_2, X_3) or X_1 X_2 X_3, etc.

Each month, the mayor generates a random sequence of length 10 and then reveals the outcome to Bob, who is an employee of the town's local government. Alice, another employee, sells various tickets for accessing total or partial information about the sequence. Here is how these tickets are supposed to be used: if you are a resident of this town and you want to know the value of X_1 + X_2 this month (because you are curious), you can buy from Alice a ticket that has the following statement written on it: "If you give this ticket to Bob, Bob is required to tell you the value of X_1 + X_2." That is, you visit Bob's office and give him the ticket, and you get the information in return. Now what is a good price for this ticket? How much would you pay to know the value of X_1 + X_2?

Suppose the price of a ticket for X_1 is one dollar, i.e., you can pay one dollar (to Alice or others) to get a ticket that has the following statement on it: “If you give this ticket to Bob, Bob is required to tell you the value of X_1”. It’s pretty obvious that the price of a ticket for X_2 should be one dollar as well. It is also obvious that the price of a ticket for knowing (X_1, \ldots, X_{5}) should be five dollars.

Not interesting so far, but how about this random variable? We define the random variable Z to have value 5 if X_1 = 0, value 6 if (X_1, X_2) = (1,0), and value 7 if (X_1, X_2) = (1,1). So Z has probability 1/2 of being 5, probability 1/4 of being 6, and probability 1/4 of being 7. What is the right price of a ticket for Z? The right price is 1+\frac{1}{2} dollars. That is because there is this alternative way of knowing Z: you first buy a ticket for X_1 and give it to Bob to be informed of the value of X_1, and if you learn that X_1 is 1, you then buy a ticket for X_2 and visit Bob again. This alternative way of knowing Z costs 1 + \frac{1}{2} dollars on average. You might say "you convinced me that the price of a ticket for Z should be \le 1 + \frac{1}{2}. Now convince me that it should be \ge 1 + \frac{1}{2}." I have to say that right now I don't have any satisfying argument for that, except to say that this alternative way of knowing Z seems efficient and it feels like it cannot be improved further to reduce the average cost.
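
To spell out the average cost of that alternative way: you always pay one dollar for the X_1 ticket, and with probability \frac12 (namely when X_1 turns out to be 1) you pay one more dollar for the X_2 ticket, so the average cost is

1 + P(X_1 = 1) \cdot 1 = 1 + \frac{1}{2}

dollars. (Peeking ahead to the next subsection, where prices convert to entropy at \log 2 per dollar: the entropy formula gives the matching value H_3(\frac12, \frac14, \frac14) = \frac32 \log 2.)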

What about the cost of a ticket for (X_1 + X_2, X_2 + X_3)? If you buy a ticket for X_1 + X_2, which we will assume costs a dollars, and another ticket for X_2 + X_3 (also costing a dollars), then you get to learn the value of (X_1 + X_2, X_2 + X_3) after visiting Bob. Therefore the price of a ticket for (X_1 + X_2, X_2 + X_3) should be \le 2a. (Otherwise, you could create a business where each month you buy one ticket for X_1 + X_2 and one ticket for X_2 + X_3 from others, spend the two tickets at Bob's to learn the value of (X_1 + X_2, X_2 + X_3), sell that information to others at the price of a ticket for (X_1 + X_2, X_2 + X_3), and profit. To be more complete, maybe we should assume that the residents of the town are law-abiding citizens and there is a law that bans selling any information (about the binary sequence) twice. Maybe residents are required to erase from their memory any information they sell to or hand over to others. Maybe other patch-ups are needed, but let's stop for now. I told you this is a flawed story.)

1.3. intuition for entropy and its properties

One way of justifying the formula in the definition of entropy is to think of it as the only formula that guarantees many plausible properties of entropy that you expect.

For a (discrete) random variable X, think of H(X) as proportional to the ideal price for a ticket for X. The conversion factor is \log 2 per dollar. We normalize things so that the entropy of a fair coin is \log 2 (i.e. one dollar). I won’t go into a proof of this, but it is known that there is only one formula for H_n(p_1, \ldots, p_n) that guarantees the following properties:

(decomposition property) H_n(p_1, \ldots, p_n) + p_1 H_m(q_1, \ldots, q_m) = H_{n+m-1}(p_1 q_1, \ldots, p_1 q_m, \, p_2, \ldots, p_n)

(triangle inequality for entropy) H(X, Y) \le H(X) + H(Y) for any two discrete random variables X, Y

(positivity) H(X) \ge 0

(normalization) H_2(\frac{1}{2}, \frac{1}{2}) = \log 2

(continuity) each H_n is a continuous function

Remember the ticket which cost 1+ \frac12 dollars? The argument for its price gives an intuitive meaning to the decomposition property. Remember the ticket that cost \le 2a dollars? That gives some intuitive sense for the triangle inequality for entropy.
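
For a concrete instance of the decomposition property, take the random variable Z from the story: n = m = 2, (p_1, p_2) = (\frac12, \frac12) with p_1 the probability of the branch X_1 = 1 that gets split further, and (q_1, q_2) = (\frac12, \frac12). The property then reads

H_2(\frac12, \frac12) + \frac12 H_2(\frac12, \frac12) = H_3(\frac14, \frac14, \frac12)

which is exactly the statement that a ticket for Z is worth one dollar (for X_1) plus, half the time, one more dollar (for X_2).

The entropy formula also guarantees the following additional properties: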

(entropy of equidistribution) H_n(\frac1n, \ldots, \frac1n) = \log n

(upper bound) If X can have n outcomes, then H(X) \le \log n with equality if and only if X is equidistributed.

(w.r.t. independence) H(X, Y) = H(X) + H(Y) if and only if X and Y are independent.

(monotonicity) H(X) \ge H(Y) if the value of X determines the value of Y, i.e., if Y is a function of X. (Note that this is the same as saying H(X,Y) \ge H(X) for all discrete random variables X and Y.)

Now that we have listed all these properties, an understanding emerges: H(X) is best thought of as the additive amount of uncertainty in X.

1.4. intuition for the notion of surprisal

A more direct way to get some intuitive sense of the entropy formula for H(X) is to think of it as the average amount of information you gain upon hearing the value of X. For that, we need to quantify the amount of information you gain upon hearing, for example, that the value of X turned out to be 1, i.e., that the event X=1 occurred.

If A is an event, the surprisal of A is defined as I(A) = - \log P(A) where P(A) is the probability of the event A. This quantifies the amount of information (and the amount of surprise) in learning that A occurred. This quantity has a nice additive property, but to see it, we need to define conditional surprisal first. If A, B are events, the conditional surprisal of A given B is defined as I(A | B) = - \log P(A | B) where P(A | B) is the conditional probability of A given B. This quantifies the amount of information gained by someone who already knows that B happened upon hearing that A happened too.

The additive property is this: at first, you knew nothing about the value of the binary sequence (hence you had zero information as to what the value of the random binary sequence was), and then you heard that some event B occurred. For example, the event B could be something like X_1 + X_2 = 1. By hearing that news, you gained I(B) worth of information about the sequence. Then you heard another piece of news: some event A also occurred. Upon hearing this news, you gained I(A|B) more. In total, you must have gained I(A \cap B) worth of information. Does the following equality actually hold?

I(A \cap B) = I(B) + I(A|B)

Yes, it does. That in turn justifies the formula for surprisal.
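
Indeed, the equality is just the product rule P(A \cap B) = P(B) P(A | B) written in logarithmic form:

I(A \cap B) = -\log P(A \cap B) = -\log P(B) - \log P(A | B) = I(B) + I(A|B)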

Back to the definition of H(X): the entropy formula defines H(X) to be exactly the average amount of information you gain upon hearing the value of X. Working this out is an easy exercise (left to the readers).

2. intuition for conditional entropy and its definition and properties

The theory of entropy is not complete without the notion of conditional entropy. For two (discrete) random variables X and Y, the conditional entropy of X given Y is denoted by H(X|Y) and is defined to be the average conditional surprisal upon hearing the value of X when one already knows the value of Y. This quantifies the amount of uncertainty that someone knowing the value of Y has about X. Writing down the formula for conditional entropy is left to the readers. Conditional entropy satisfies the following properties:

(positivity) H(X|Y) \ge 0, with equality if and only if X is a function of Y

(upper bound from unconditioning) H(X|Y) \le H(X), with equality if and only if X and Y are independent (you can also think of this inequality as giving a lower bound for H(X))

(chain rule) H(X, Y) = H(X) + H(Y|X)

The intuition for the chain rule is that learning the value of (X,Y) can be done in two steps: learn the value of X, and then learn the value of Y. Compare this rule with the additive property of surprisal.

The chain rule can also be thought of as a way to express conditional entropy in terms of unconditional entropy: H(Y|X) = H(X, Y) - H(X). There is another way to do that: H(Y|X) = \sum P(X=x_i) H(Y| X=x_i) where the sum runs over the range of X and H(Y | X=x_i) (which is called specific conditional entropy) is the entropy for conditional probability distribution of Y given the event X=x_i. With this in mind, now note that the chain rule can be thought of as an encapsulation of the repeated application of the decomposition property of entropy.
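
Here is a small Python sketch (again with a made-up joint distribution and names of my own choosing) that checks the chain rule numerically, computing H(Y|X) as the weighted average of the specific conditional entropies H(Y | X=x_i):

```python
import math

def entropy(probs):
    # -sum p log p, with 0 log 0 treated as 0
    return -sum(p * math.log(p) for p in probs if p > 0)

# Made-up joint distribution P(X = x, Y = y).
joint = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}
xs, ys = (0, 1), (0, 1)

p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}  # marginal of X

h_xy = entropy(joint.values())  # H(X, Y)
h_x = entropy(p_x.values())     # H(X)
# H(Y|X) = sum_i P(X = x_i) H(Y | X = x_i)
h_y_given_x = sum(
    p_x[x] * entropy([joint[(x, y)] / p_x[x] for y in ys])
    for x in xs
)

print(abs(h_xy - (h_x + h_y_given_x)) < 1e-9)  # chain rule: prints True
```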

Other properties:

(chain rule 2) H(X, Y | Z) = H(X | Z) + H(Y|X, Z) (where H(Y|X, Z) is defined to be H(Y|(X, Z)))

(monotonicity) H(X, W | Y) \ge H(X | Y, Z) (Try to work out equality conditions for H(X, W | Y) \ge H(X | Y) and H(X | Y) \ge H(X | Y, Z) and you will get two more interesting properties.[1])

(monotonicity 2) H(X|Y) \ge H(X' | Y) when X' is a function of (X,Y).

3. entropy of continuous probability distribution

Visit the Wikipedia article on differential entropy to see its definition. Here is something that may hook your interest in differential entropy: many probability distributions you may know, such as the uniform, normal, exponential, and geometric distributions, can be characterized as distributions that maximize entropy under certain constraints. For the math-savvy, "Probability distributions and maximum entropy" by Keith Conrad is a good start. As for why and how maximum entropy distributions tend to show up in nature (for example, the Maxwell–Boltzmann distribution), perhaps statistical mechanics textbooks have some answers, but that is not my area of expertise.
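
To give one concrete instance (a standard fact, not specific to Conrad's notes): among all continuous distributions on the real line with a given variance \sigma^2, the normal distribution is the one maximizing differential entropy, and its differential entropy equals

\frac12 \log (2 \pi e \sigma^2)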

3.1. some intuition for differential entropy

Its translation invariance embodies the intuition that knowing that some value is between 1.0 and 1.1 is the same amount of information as knowing that some value is between 10.0 and 10.1.

The way differential entropy responds to scaling of continuous random variables embodies the intuition that knowing that some value is between 0 and 0.1 is one bit more information than knowing that some value is between 0 and 0.2.
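
In formulas, these two intuitions correspond to the translation and scaling rules for differential entropy h:

h(X + c) = h(X), \qquad h(aX) = h(X) + \log |a|

so scaling by a = \frac12 (halving interval lengths) lowers h by exactly \log 2, matching the "one bit more information" above.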

4. for students of ergodic theory (optional reading)

4.1. entropy of a partition

If you are learning ergodic theory, you will at some point learn (or have already learned) that when \alpha is a (measurable, finite) partition (of a probability space (\Omega, \mathcal F, \mu)), its entropy H(\alpha) is defined to be H_k(\mu(A_1), \ldots, \mu(A_k)) where A_i are the elements of \alpha. If we use \hat \alpha to denote the random variable defined by \hat \alpha(\omega) = i iff \omega \in A_i, then we have H(\alpha) = H(\hat \alpha). A trivial equality, but it really helps to think of the entropy of a partition as the entropy of its corresponding random variable, so that you are encouraged to use probabilistic intuitions when you are, for example, solving relevant exercises from your ergodic theory textbook.

Side note: another good way of defining \hat \alpha is to define \hat \alpha(\omega) to be the element of \alpha containing \omega. This has the benefit of not having to specify an ordering of the elements of \alpha. Another benefit is that you feel more confident using the useful abuse of notation of writing \alpha(\omega) in place of \hat\alpha(\omega).

Further observations on correspondence between partitions and random variables: If \alpha, \beta are two partitions, \alpha \vee \beta corresponds to the random variable (\hat\alpha, \hat\beta). The partition \alpha is finer than \beta if and only if \hat\beta is a function of \hat\alpha. Given a map T : \Omega \to \Omega and \alpha = \{A_1, \ldots, A_k\}, the shifted partition \{T^{-1}A_1, \ldots, T^{-1}A_k\} corresponds to the shifted random variable \hat\alpha \circ T. Entropy of the join of two partitions corresponds to joint entropy of random variables: H(\alpha \vee \beta) = H(\hat\alpha, \hat\beta).

These trivial observations combine to give the following non-trivial insight: the pairing of a dynamical system and a partition \alpha gives rise to the stationary process \hat\alpha, \hat\alpha\circ T, \hat\alpha\circ T^2, \ldots. Conversely, any (discrete-valued) stationary process gives rise to an invariant measure on a shift space in the obvious way. One could say that (discrete stationary) stochastic processes are to dynamical systems what matrices are to (finite-dimensional) linear operators, with a generating partition playing the role of a basis (in linear algebra). With this insight, one can also say that the notion of metric entropy in ergodic theory and the notion of entropy rate in probability theory are interchangeable, in that the metric entropy of a (measure-theoretic) dynamical system w.r.t. a partition corresponds to the entropy rate of its corresponding stochastic process.
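
To make the last correspondence concrete, both quantities are given by the same limit (the left-hand expression is the standard definition of the metric entropy of T w.r.t. \alpha):

h_\mu(T, \alpha) = \lim_{n \to \infty} \frac1n H_\mu(\alpha \vee T^{-1}\alpha \vee \cdots \vee T^{-(n-1)}\alpha) = \lim_{n \to \infty} \frac1n H(\hat\alpha, \hat\alpha \circ T, \ldots, \hat\alpha \circ T^{n-1})

and the right-hand side is precisely the entropy rate of the stationary process \hat\alpha, \hat\alpha\circ T, \hat\alpha\circ T^2, \ldots.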

4.2. comparison between metric entropy and differential entropy

For those who are learning about metric entropy or differential entropy, let's list some differences, by way of notes.

While differential entropy concerns continuous probability distributions (on the n-dimensional space), metric entropy concerns fractal-like probability distributions from stochastic processes or invariant measures on a dynamical system. Differential entropy can be negative while metric entropy cannot be. Metric entropy can be described as a conditional entropy, but I am not sure if differential entropy can be too.

As for how they react to forming a convex combination of two probability distributions, the map from continuous probability distributions to their differential entropy is a concave map, but the map from stationary processes or invariant measures (on a system) to their metric entropy is an affine map.

5. using properties of entropy in proofs

5.1. if two random variables are close to each other, then how close are their entropies?

Let's say X, Y are two random variables with range \{1, \ldots, k\} and that we have P(X \neq Y) < \delta. Now we want to find a good enough upper bound for | H(X) - H(Y) |, i.e., we want to prove an inequality of the form | H(X) - H(Y) | \le F(\delta, k). We will choose what F(\delta, k) is later, but for now notice that in order to prove such an inequality, we only need to show H(X) \le H(Y) + F(\delta, k) (E10) (and H(Y) \le H(X) + F(\delta, k)). As I said earlier, we would like to prove this without directly using the definition of entropy.

To prove E10, we only need to build a random variable Z with H(X) \le H(Y, Z) (E20) and H(Z) \le F(\delta, k). To establish E20, it is enough to build Z so that knowing Y and Z together enables one to know X. So we want Z to provide just enough missing information to recover X from Y. A way to build Z now emerges: define Z = X on the event X \neq Y, and Z = * on the event X = Y, where * is any fixed value distinct from the range of X. It is easy to see that the value of X can be determined from the values of Y and Z, hence E20 holds. Now it only remains to give a reasonable upper bound on H(Z) in terms of \delta and k. Let Z' be the binary random variable indicating whether X=Y or not. Then H(Z) = H(Z') + H(Z | Z'), which is at most H_2(\delta, 1- \delta) + \delta \log k (if \delta \le \frac12), and we now decide this to be our F(\delta, k).
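
Putting the steps together as a single chain of the listed properties (with \delta \le \frac12 as above):

H(X) \le H(Y, Z) \le H(Y) + H(Z) = H(Y) + H(Z') + H(Z|Z') \le H(Y) + H_2(\delta, 1-\delta) + \delta \log k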

5.2. entropy with respect to convex combination

(only for students of ergodic theory)

Given two invariant probability measures \mu_1, \mu_2 on \Omega and their convex combination p_1 \mu_1 + p_2 \mu_2 (where p_1 + p_2 = 1, p_1 \ge 0, p_2 \ge 0), we wish to prove the following two facts.

(Fact 1) H_{p_1 \mu_1 + p_2 \mu_2}(\alpha) \ge \sum_i p_i H_{\mu_i}(\alpha), i.e., H is concave with respect to measures.

(Fact 2) (if you know metric entropy) h_{p_1 \mu_1 + p_2 \mu_2}(\alpha) = \sum_i p_i h_{\mu_i}(\alpha), i.e., the metric entropy map is an affine map (with respect to measures).

The above two facts follow easily from the following Fact 3: the difference H_{p_1 \mu_1 + p_2 \mu_2}(\alpha) - \sum_i p_i H_{\mu_i}(\alpha) (E30) lies in [0, H_2(p_1, p_2)].

We want to prove Fact 3 using only basic properties of entropy. How can we employ probabilistic intuition to guide us to find such a proof? The expression p_1 \mu_1 + p_2 \mu_2 hints at conditioning on some event with probability p_1 to get to \mu_1 and on the complement event (with probability p_2) to get to \mu_2. The probability space where this conditioning happens will have to be bigger than \Omega because it does not make sense to equate a subset of \Omega with the event. The expression E30 involves three terms with three different measures and one shared partition. To rephrase this expression in a probabilistic context, remember that the final expression will have to consist of terms with different random variables (different partitions) living on some shared sample space (one shared measure).

The shared sample space will be a weighted disjoint union of (\Omega, \mu_1, \alpha) and (\Omega, \mu_2, \alpha) (with weights p_1, p_2). To form the disjoint union, we form a copy (\Omega'_i, \mu_i', \alpha'_i) (for each i=1,2) of (\Omega, \mu_i, \alpha), and then form the union (\bar{\Omega} = \cup \Omega'_i, \bar{\mu} = \sum p_i \mu_i', \bar\alpha), where the partition \bar\alpha is defined to be the one corresponding to the random variable (also denoted \bar\alpha, abusing notation) defined by \bar\alpha(x) = \alpha'_{\gamma(x)}(x), where \gamma is the partition of \bar \Omega corresponding to the binary random variable assigning i to \Omega'_i, and where each \alpha'_i as a random variable has the same range as \alpha.

E30 is then equal to H_{\bar\mu}(\bar\alpha) - \sum p_i H_{\mu_i'}(\alpha'_i) which is equal to H_{\bar\mu}(\bar\alpha) - H_{\bar\mu}(\bar\alpha | \gamma) which is in [0, H_{\bar\mu}(\gamma)] which is equal to [0, H_2(p_1, p_2)].
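
For the last step, note that H_{\bar\mu}(\bar\alpha | \gamma) = \sum p_i H_{\mu_i'}(\alpha'_i) is just the formula for conditional entropy as an average of specific conditional entropies, and the containment in [0, H_{\bar\mu}(\gamma)] follows from listed properties: the lower bound is the "upper bound from unconditioning" property, and the upper bound follows from monotonicity and the chain rule:

H_{\bar\mu}(\bar\alpha) - H_{\bar\mu}(\bar\alpha | \gamma) \le H_{\bar\mu}(\bar\alpha \vee \gamma) - H_{\bar\mu}(\bar\alpha | \gamma) = H_{\bar\mu}(\gamma)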

6. exercises

All of the exercises here can be and should be solved by just using properties of entropy listed in this post instead of directly invoking the entropy formula.

Exercise 10. (useful upper bound) Suppose you have a (discrete) random variable X for which you want a good upper bound on its entropy. Suppose you manage to find two random variables Y, Z that together determine X, i.e., X is a function of (Y,Z). This gives the upper bound H(X) \le H(Y) + H(Z) (prove it).

Exercise 15. (useful upper bound 2) Suppose you have a (discrete) random variable X for which you want a good upper bound on its entropy. Suppose you manage to find some other random variable Y such that H(Y) and H(X|Y) are easy to estimate. This gives the upper bound H(X) \le H(Y) + H(X|Y) (prove it).

Exercise 20. (useful lower bound) Prove H(X) \ge P(Y=y) H(X | Y=y) for any value y that Y can take.

Exercise 30. Prove H_2(st, 1-st) \ge s H_2(t, 1-t) for 0 \le s, t \le 1. (Introduce auxiliary random variables to prove it.)

Exercise 40. Show H_3(a, b, c) \le H_4(\frac{a}{2}, \frac{a+b}{2}, \frac{b+c}{2}, \frac{c}{2}).

Exercise 50. (only for students of ergodic theory) Given a measure preserving transformation on a Lebesgue space with positive entropy, and given a positive number less than the positive entropy, show that the system has a factor system with its entropy equal to the given positive number. Should we assume the system has no atoms?

Footnotes:

[1] Properties related to the notion of conditional independence.
