Introduction to Wasserstein metric (earth mover’s distance)

This is a short introduction to the notion of the Wasserstein metric, presented in a way that explicitly highlights the earth-mover intuition behind the definition.

1. the cost of moving one particle

Throughout the post, we fix a metric space (X,d). Suppose a particle of mass w \ge 0 is placed at position x \in X. The cost of moving this particle to a new position y \in X is defined to be w d(x,y). With this definition, you see:

  • The cost of moving a particle from place x to the same place x is zero.
  • The cost of moving a particle is proportional to its mass. (This makes sense: a particle of mass 2w at position x is and should be cost-wise equivalent to two particles each of mass w at the same position x.)
  • The cost of moving a particle of unit mass from x to y is d(x,y).

2. The cost of moving a finite collection of particles of total mass 1.

Imagine a configuration of three particles of total mass 1 placed in X. For each i = 1,2,3, particle i has mass a_i \ge 0 and is placed at x_i \in X. Since the total mass is 1, we require \sum_{i=1}^3 a_i = 1. The mass distribution of this configuration may be written as \mu := \sum_{i=1}^3 a_i \delta_{x_i} where \delta_{x_i} denotes the Dirac delta function centered at x_i. (Alternatively, you could interpret the sum \sum_{i=1}^3 a_i \delta_{x_i} as a formal sum, or use Dirac measures instead.)

Now consider another configuration but this time of four particles of total mass 1, placed in the same space X. In this second configuration, for each j=1,2,3,4, there is a particle of mass b_j \ge 0 at position y_j \in X. The mass distribution of this configuration may be written as \nu = \sum_{j=1}^4 b_j \delta_{y_j}.

Given the first configuration (three particles), we can split some or all of the three particles into smaller particles of various masses and move them in a way that results in the second configuration. There is more than one way to do it.

Start with the first configuration. Suppose, for each i=1,2,3, we split the particle of mass a_i at position x_i into four smaller particles of mass w_{i1}, w_{i2}, w_{i3}, w_{i4} \ge 0. The total number of split particles is now 12 = 3 \cdot 4 and a_i = \sum_{j=1}^4 w_{ij} for each i=1,2,3. Move the split particle of mass w_{ij} from the old position x_i to the new position y_j. The total mass of all particles at the new position y_j is \sum_{i=1}^3 w_{ij} and we require this to be equal to b_j, so that they merge to become a particle of mass b_j at position y_j.

Any 3 \times 4 matrix (w_{ij}) of nonnegative numbers such that a_i = \sum_{j=1}^4 w_{ij} for all i and b_j = \sum_{i=1}^3 w_{ij} for all j specifies a split-move-merge procedure that transforms the first configuration into the second configuration.

The cost of a split-move-merge procedure is defined to be its total cost of moving: moving the particle of mass w_{ij} from x_i to y_j costs w_{ij} d(x_i, y_j) and so the total cost is the sum \sum_{ij} w_{ij} d(x_i, y_j).

We define the cost of transforming the mass distribution \mu = \sum_{i=1}^3 a_i \delta_{x_i} into the second distribution \nu = \sum_{j=1}^4 b_j \delta_{y_j} to be the minimum possible cost over all split-move-merge procedures from the first configuration to the second configuration.
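In symbols, with the three old positions and the four new positions above, the cost of transforming \mu into \nu is

\min \Big\{ \sum_{i=1}^{3} \sum_{j=1}^{4} w_{ij} d(x_i, y_j) : w_{ij} \ge 0, \ \sum_{j=1}^{4} w_{ij} = a_i \text{ for each } i, \ \sum_{i=1}^{3} w_{ij} = b_j \text{ for each } j \Big\}.

This is a small linear program, and the minimum is attained because the set of admissible matrices (w_{ij}) is compact and the total cost is a continuous function of (w_{ij}).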

It is an easy check that this definition of distance between \mu and \nu does not depend on how you write \mu and \nu. (In other words, whether you see a particle at position x of mass \frac12 or see two particles at the same position x each of mass \frac14 does not affect costs.)

3. coupling

Each split-move-merge procedure (w_{ij}) defines a distribution \lambda = \sum_{ij} w_{ij} \delta_{(x_i,y_j)} on X^2. This distribution \lambda is a coupling of \mu and \nu in the sense that it projects to \mu and \nu. Indeed, if you project it to the first coordinate, you would get \sum_{ij} w_{ij} \delta_{x_i} = \sum_i \big( \sum_j w_{ij} \big) \delta_{x_i} = \sum_i a_i \delta_{x_i}.

You could think of the coupling \lambda as a distribution on the space X^2 of arrows, where you think of each (x,y) \in X^2 as an abstract arrow from x to y.

The integral of the function d(x,y) (whose domain is the space X^2 of arrows) with respect to the distribution \lambda works out to be the same as the cost \sum_{ij} w_{ij} d(x_i, y_j).

Conversely, each coupling \lambda of \mu and \nu gives rise to a split-move-merge procedure that transforms \mu into \nu whose cost is equal to the integration of d(x,y) w.r.t. \lambda.

So the cost of transforming \mu to \nu is the same as the minimum possible expected value of the length d(x,y) of arrows (x,y) over all possible couplings of \mu and \nu.

4. composing

The distance between \mu and \nu is a metric. The triangle inequality for d(\cdot,\cdot) transfers to the triangle inequality for the distance between distributions because you can compose a transformation from \mu to \nu with another from \nu to \rho to get a third transformation from \mu to \rho whose cost is at most the sum of the costs of the two former transformations. This property is more easily seen if you think of \mu as piles of dirt (of unit total mass). In fact, suppose \mu is three piles of dirt: pile i of mass a_i at position x_i. Suppose pile 1 is red dirt, pile 2 is yellow dirt, and pile 3 is green dirt. Move these three piles of dirt to form \nu using some transformation (w_{ij}). Each of the resulting piles may be a mixture of different colors. Now use some other transformation (w'_{ij}) to move the piles again to form \rho. The net effect is a transformation of \mu into \rho, and the corresponding composition (w''_{ij}) can be obtained by examining the percentages of the different colors of dirt in each of the final piles. (The output (w''_{ij}) is actually not unique here, but there is an obvious canonical choice.)

5. definition of the Wasserstein distance

What we have defined is known as the 1st Wasserstein distance between \mu and \nu and is usually denoted W_1(\mu,\nu). For general non-discrete probability measures \mu and \nu on X, the definition is done using couplings of \mu and \nu but you will have to take inf instead of min, and the distance can be infinite.
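Written out, the general definition is

W_1(\mu, \nu) = \inf_{\lambda} \int_{X \times X} d(x,y) \, d\lambda(x,y),

where the infimum runs over all couplings \lambda of \mu and \nu, i.e. probability measures on X \times X whose two marginals are \mu and \nu respectively.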

This distance can be thought of as a probabilistic analogue of the Hausdorff distance between two sets.

6. Moving a collection of particles of equal mass

Suppose we are to move a collection of n particles, each of mass \frac1n. In the first configuration, the n particles are at the n places x_1, \cdots, x_n (some of which may coincide). In the second configuration, the new n places are y_1, \cdots, y_n. It is known that in this case there is always a split-move-merge procedure (between the two configurations) that minimizes the cost while skipping the split and merge steps. The procedures without the split and merge steps are specified by permutations \pi of \{1,\cdots,n\}. If \pi is such a permutation, then it specifies a procedure in which you move the old particle of mass \frac1n at the old position x_i to the new position y_{\pi(i)} (at the cost of \frac1n d(x_i, y_{\pi(i)})) for i=1,2,\cdots,n.

So the cost of transforming \mu = \frac1n \sum_{i=1}^n \delta_{x_i} to \nu = \frac1n \sum_{j=1}^n \delta_{y_j} in this case is the same as the minimum of the average \frac1n \sum_{i=1}^n d(x_i, y_{\pi(i)}) over all permutations \pi on \{1,\cdots, n\}.

The fact that the split and merge steps can be discarded is known as the Birkhoff–von Neumann theorem, which actually states that any n\times n doubly stochastic matrix (in our case, n times the matrix (w_{ij}) of a split-move-merge procedure) can be written as a convex combination of permutation matrices.
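In symbols, if (w_{ij}) is such a procedure, then the matrix (n w_{ij}) is doubly stochastic and the theorem gives a decomposition

(n w_{ij}) = \sum_k \theta_k P_{\sigma_k}, \qquad \theta_k \ge 0, \quad \sum_k \theta_k = 1,

where each P_{\sigma_k} is the permutation matrix of a permutation \sigma_k. Since the cost \sum_{ij} w_{ij} d(x_i, y_j) is linear in (w_{ij}), the cost of (w_{ij}) is the same convex combination of the costs of the permutation procedures, and so at least one permutation \sigma_k costs no more than (w_{ij}) does.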

It is known that, in a metric space that is nice enough, any probability measure \mu on it can be approximated by discrete distributions of the form \frac1n \sum_{i=1}^n \delta_{x_i} in an appropriate sense.

7. Some properties

(some lower bounds) If two probability measures \mu and \nu are supported on subsets A, B respectively and if d(A,B) \ge c > 0, then the distance between \mu and \nu is also \ge c. If we relax this to \mu(A) = 1 - O(\epsilon) and \nu(B) = 1 - O(\epsilon), then we still have the approximate bound W_1(\mu,\nu) \ge c (1 - O(\epsilon)).
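A sketch of the relaxed bound: for any coupling \lambda of \mu and \nu we have \lambda(A \times B) \ge 1 - \lambda(A^c \times X) - \lambda(X \times B^c) = 1 - (1 - \mu(A)) - (1 - \nu(B)) = 1 - O(\epsilon), and d(x,y) \ge c on A \times B, so \int d \, d\lambda \ge c \, \lambda(A \times B) \ge c(1 - O(\epsilon)). Taking the infimum over all couplings gives the bound.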

(upper bounds) If \nu is the pushforward image of \mu under some f: X \to X and if f is such that d(x,f(x)) \le \epsilon for all x, then W_1(\mu,\nu) \le \epsilon. If we relax this to d(x, f(x)) = O(\epsilon) (uniformly) only for all x in some big set A with \mu(A) = 1 - O(\epsilon), and the diameter of X is O(1), then we still obtain the approximate conclusion W_1(\mu,\nu) = O(\epsilon).
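A sketch of the first claim: pushing \mu forward under the map x \mapsto (x, f(x)) gives a coupling \lambda of \mu and \nu, and \int d(x,y) \, d\lambda(x,y) = \int d(x, f(x)) \, d\mu(x) \le \epsilon, hence W_1(\mu,\nu) \le \epsilon. For the relaxed claim, split the last integral into the part over A, which is O(\epsilon), and the part over the complement of A, which is at most \mathrm{diam}(X) \cdot \mu(X \setminus A) = O(1) \cdot O(\epsilon).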


How to close the sudden Camera app from Windows 8

Readers are assumed to be familiar with the names of basic Windows 8 touch gestures (for which, see http://windows.microsoft.com/en-us/windows-8/touch-swipe-tap-beyond ).

1. introduction

With some Windows 8 tablets (or laptops), you may come across a sudden appearance of the Camera app, or the sudden activation of the webcam, when you are trying to turn on your laptop (or, more precisely, when you are trying to sign in to your PC from the lock screen). Since the screen makes it clear that the webcam is on, you know you have unintentionally launched some Camera app. Surprisingly, the usual ways of closing or exiting the app or switching to other apps do not work. How should one get out of this and get back to the usual lock screen (the sign-in screen)?

2. how to get out

It is actually quite simple to turn off or deactivate this thing. You simply tap or click Unlock, which should be found on the left side of the app commands at the bottom of the screen, and you are back to the lock screen.

3. how did I end up there

You usually swipe up on the lock screen to sign in to your PC, but when you swipe down on it instead, that launches the Camera app. This is a useful feature of Windows 8 that you can turn on or off. The Camera app launched in this way seems to show a different interface from the one launched in the usual way. Maybe it is not the same Camera app? I am not sure.

4. how to turn off the feature

This quick access to the camera from the lock screen is a useful feature, but if you want to turn it off for some reason, type the following in Search:

lock screen camera

and by navigating from there, you will arrive at an option that says

Swipe down on the lock screen to use the camera 

Intuitions on problems from Elements of Information Theory Chapter 2

This post is not about sharing solutions to the problems in the book. As the title indicates, it is rather about sharing intuitions or interpretations of some results mentioned or alluded to in some of the problems listed at the end of Chapter 2 of the book “Elements of Information Theory”, second edition. The chapter deals with the notions of entropy and mutual information.

Some of my colleagues and I were reading a rather huge paper that draws on basic ideas from many fields, one of them being the notion of mutual information. We only had a grasp of its special case, the notion of entropy of a random variable, and since the general notion was new to us, I decided to read at least the first few chapters of “Elements of Information Theory”.

1. meaning of conditional mutual information

Given three (discrete) random variables X, Y, Z, the conditional mutual information I(X; Y | Z) intuitively measures how much X says about Y (or equivalently how much Y says about X) under the assumption that the value of Z is already known. Try to use this intuition to make sense of the chain rule for mutual information and the data-processing inequality.
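In symbols, the two statements referred to are the chain rule

I(X; Y, Z) = I(X; Z) + I(X; Y | Z),

and the data-processing inequality: if X \to Y \to Z form a Markov chain (i.e. X and Z are conditionally independent given Y), then

I(X; Z) \le I(X; Y).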

As for its special case, the intuition for conditional entropy, see the (previously linked) post Shannon’s entropy of random variables and partitions.

2. comments on some selected problems from Chapter 2

2.1. Problem 2.6

The problem asks for examples where mutual information increases or decreases when conditioned on a third random variable. This points out that I(X;Y | Z) is not monotone in Z. One should contrast this with the following two facts (written out in symbols after the list):

  • Conditional entropy H(X | Y) is monotone in the conditioning Y.
  • Mutual information I(X;Y | Z) is monotone in X and Y.
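In symbols, the two facts are that conditioning never increases entropy,

H(X | Y, Z) \le H(X | Y),

and that enlarging either argument never decreases mutual information,

I(X; Y | Z) \le I(X, W; Y | Z)

(and similarly in the Y argument). Problem 2.6 shows that no inequality of this kind holds in general between I(X; Y) and I(X; Y | Z).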

2.2. Problem 2.14

Given two independent (discrete) random variables X and Y, the entropy of the sum X+Y is at least \max\{H(X),H(Y)\}. This is perhaps related to why the central limit theorem works. If you keep adding i.i.d. random variables one by one, the entropy of the sum never decreases. So aside from special cases, we can expect the entropy to get higher and higher. High entropy of the sum alone may not tell you much about the probability distribution of the sum, but the mean and the variance of the sum are also known, and IIRC the normal distribution maximizes entropy under constraints on the mean and variance. This does not amount to a proof of the central limit theorem, but I think it illustrates the relation between large degrees of freedom (in statistical mechanics) and probability distributions with maximal entropy.
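A quick sketch of the inequality: by independence of X and Y,

H(X+Y) \ge H(X+Y | Y) = H(X | Y) = H(X),

where the middle equality holds because, once the value of Y is known, X+Y and X determine each other. By symmetry, H(X+Y) \ge H(Y) as well, which gives the bound \max\{H(X),H(Y)\}.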

Once you prove that the entropy of the independent sum X+Y is at least \max\{H(X),H(Y)\}, you will also be able to see that the entropy of the independent sum taken modulo something like 6, for example, is also at least \max\{H(X),H(Y)\}. If you have a biased die, you may rightfully expect that you can simulate a less biased die by simply throwing the same die several times and summing up the numbers modulo 6, and the entropy argument may be one way to justify this expectation.

2.3. Problem 2.23

The problem deals with n binary random variables X_1, X_2, \dots, X_n related by one linear equation X_1 + \dots + X_n = 0 \pmod 2. The given joint probability distribution is, in some sense, the most fair distribution under the constraint given by the linear equation (namely, the uniform distribution over all solutions of the equation).

The results are consistent with our intuition in that (the corresponding identities are written out right after this list):

  • knowing X_1 tells you nothing about X_2
  • and given that you already know X_1, knowing X_2 tells you nothing about X_3
  • and at the last step when you know X_1 to X_{n-2}, knowing X_{n-1} tells you everything about X_n (because of the linear equation).
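For the most fair distribution (the uniform distribution over all 2^{n-1} solutions of the linear equation), these intuitions correspond to the identities

I(X_1; X_2) = 0, \quad I(X_2; X_3 | X_1) = 0, \quad I(X_{n-1}; X_n | X_1, \dots, X_{n-2}) = 1 \text{ bit},

which follow from the fact that, under this distribution, X_1, \dots, X_{n-1} are i.i.d. fair bits and X_n is determined by them.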

The results can be different for other probability distributions satisfying the same linear equation. For example, for a suitably biased joint probability distribution, it is possible that knowing X_1 tells you something about X_2 in the sense that I(X_1; X_2) > 0. In fact, you can even choose a distribution so that knowing X_1 tells you everything about X_2 in the sense that I(X_1 ; X_2) = H(X_2) = 1. That is the case when we choose the most fair distribution under the constraints X_1 + X_2 = 0 (i.e. X_1 = X_2) and X_3 = X_4 = \dots = X_n = 0, which still satisfy the original parity constraint.

2.4. Problem 2.27

It asks to prove the grouping rule for entropy. It can be verified with a simple calculation, but maybe it is instructive to solve it instead by introducing auxiliary random variables and using the identities established thus far.
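For reference, the grouping rule states that for a probability vector (p_1, p_2, \dots, p_m) with p_1 + p_2 > 0,

H(p_1, p_2, p_3, \dots, p_m) = H(p_1 + p_2, p_3, \dots, p_m) + (p_1 + p_2) H\left( \frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2} \right).

One way to use auxiliary random variables: if X has the distribution (p_1, \dots, p_m), let Y be obtained from X by merging the first two outcomes into one, so that H(Y) = H(p_1+p_2, p_3, \dots, p_m), and apply H(X) = H(X, Y) = H(Y) + H(X | Y).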

2.5. Problem 2.41

One wishes to identify a random object X. Two i.i.d. random questions Q_1, Q_2 about the object are asked, eliciting answers A_1 = A(X, Q_1) and A_2 = A(X, Q_2). The problem asks us to show that two such questions are less valuable than twice a single question.

At first, one might incorrectly expect the two questions to be worth exactly twice a single question, but consider asking millions of questions about the outcome of a single die roll. One million questions about the outcome are worth almost the same as two million questions: by the time you ask the millionth, you would be almost certain of what the outcome of the die roll was, and the next million questions would not contribute much. The amount of information to be extracted with questions cannot grow linearly with the number of questions indefinitely; in fact there is an upper bound, namely H(X).

Some more thought reveals that it is because of independence (rather than in spite of independence) that two i.i.d. questions are less valuable than twice a single question. Suppose X is the random position of a specific chess piece. If the first (random) question asked which row, then the next optimal question to ask would be which column. Making subsequent questions optimal requires dependence on previous questions and answers, and therefore independent questions are less efficient than dependent questions, as anyone who has played Twenty Questions can attest.


flood zones in Seoul

While looking for a place to live in Seoul, it occurred to me that I should check which parts of Seoul tend to get flooded, and here is what I found. Let’s hope that this post becomes obsolete in the future, as the local government is taking some measures.

Several news articles in 2014 reported that the Seoul government named five special regions to protect against floods, and these are the regions near:

  • 강남역 (Gangnam Station)
  • 사당역 (Sadang Station)
  • 광화문 (Gwanghwamun)
  • 도림천 (Dorimcheon)
  • 한강로 (Hangang-ro)

도림천 (a small river) and 한강로 (a road) can be located as red lines in Naver Map. Locating the rest (강남역, 사당역, 광화문) should be obvious.

A news article in 2013 named the top four regions that get flooded again and again, and these are the regions near:

  • 강남역 (Gangnam Station)
  • 광화문 (Gwanghwamun)
  • 안양천 (Anyangcheon)
  • 사당역 (Sadang Station)

There is also an interactive map showing the often-flooded regions in Seoul, but it requires you to select a district first: 풍수해 정보지도 (storm and flood damage information map). It can show affected areas in each of the years 2010, 2011, 2012, and 2013.


Putting a bar or a tilde over a letter in LaTeX

As you are aware, there are commands to put a bar or a tilde over a symbol in math mode in LaTeX. Sometimes, the output doesn’t come out the way some of us might expect or want.

Fortunately, there are alternative commands that accomplish the same task differently, and there are also other ways of using the same commands.

To put a tilde over a letter, we can use either \tilde or \widetilde. As for which one to use in which situation, compiling a document containing the following lines can help you compare.

$\tilde{A}$ vs $\widetilde{A}$

$\tilde{\mathcal A}$ vs $\widetilde{\mathcal A}$

$\tilde{ABC}$ vs $\widetilde{ABC}$

To put a bar over a letter, we can use either \bar or \overline. It seems that \bar is to \overline what \tilde is to \widetilde. There don’t seem to be \overtilde or \widebar.

$\bar{A}$ vs $\overline{A}$

$\bar{\mathcal A}$ vs $\overline{\mathcal A}$

$\bar{ABC}$ vs $\overline{ABC}$

Over a symbol with a subscript or superscript index, one can imagine four ways of putting a bar or tilde over it. Among the four, one or two may look right, but that may depend on the situation. I can’t comment on which ones look good; you just have to try each and decide with your colleagues.

$\tilde{A_2}$ vs $\widetilde{A_2}$ vs $\tilde{A}_2$ vs $\widetilde{A}_2$

$\bar{A_2}$ vs $\overline{A_2}$ vs $\bar{A}_2$ vs $\overline{A}_2$
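
If you want to compile all of these comparisons at once, a minimal document along the following lines should work (just a sketch; no extra packages are needed for these particular commands):

\documentclass{article}
\begin{document}
$\tilde{A}$ vs $\widetilde{A}$, and $\bar{A}$ vs $\overline{A}$.

$\tilde{\mathcal A}$ vs $\widetilde{\mathcal A}$, and $\bar{\mathcal A}$ vs $\overline{\mathcal A}$.

$\tilde{A_2}$ vs $\widetilde{A_2}$ vs $\tilde{A}_2$ vs $\widetilde{A}_2$.

$\bar{A_2}$ vs $\overline{A_2}$ vs $\bar{A}_2$ vs $\overline{A}_2$.
\end{document}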

why learn elisp (Emacs Lisp)

This post is the intro module of Living with Emacs Lisp.

The goal of this post is to convince you why you should learn elisp. Readers are assumed to be beginning users of Emacs. (Of course, those who do not use Emacs do not need to learn elisp.)

So you are a beginning user of Emacs, and let me raise a relevant question: suppose someone asked you “Why did you choose Emacs?” The longer version of the question would be “What is the point of using Emacs in particular when there are other great text editors out there that are also cross-platform, customizable and extensible?”

You probably have your answer to the question. If you want to hear my answer, I’d say “I choose Emacs because Emacs is customizable and extensible to an extreme degree. I believe that is the defining character of Emacs. That may not mean that everybody should use Emacs but that is certainly why I will continue to use Emacs.”

I think my customization of Emacs is only mild, and I may never actually customize it to an extreme degree. However, because I know that Emacs allows extreme customization, I can be sure that Emacs allows any kind of mild customization I might want later. It’s like how a columnist might feel sure that organizations like the ACLU will come to defend his right to publish his mildly controversial pieces if needed, after he has seen the ACLU defend those with extreme opinions.

The extent to which Emacs can be extended to suit your needs depends on how much elisp you are willing to learn. You can do a lot with Emacs without learning elisp: you can define keyboard macros, you can install and use Emacs packages for your needs, and you can use the Customize facility to adjust your Emacs and packages to some extent. That is, without learning elisp, you can already experience Emacs as one of those editors that are cross-platform, customizable and extensible. As you learn elisp and start to use hooks in interesting ways, or write your own small minor modes, or even make your own Emacs package, you start to experience Emacs as an editor that is customizable and extensible to an extreme degree. Think of that as a kind of premium feature of Emacs which you unlock by paying, and you pay not with dollars, but with time: time devoted to learning elisp.

On the other hand, there is an old saying: “Time is money”. Of course we can’t and shouldn’t devote all day to learning elisp. And we wouldn’t expect that one hour of learning elisp would unlock all the power. It’s a gradual thing. A learning curve is there, but not as steep as you might expect. Do you have something in mind that you always wanted to have in your text editor? Your mission, should you choose to accept it, is to learn enough elisp to make that something happen.


Type backslash easily in Emacs AUCTeX

Writing LaTeX documents seems to require many presses of the backslash key, which is located in an awkward place and is hard to type on some kinds of keyboards. For users of Emacs AUCTeX, there are some ways to ease this pain.

1. type slash to type backslash

With this elisp code added to your init file, whenever you type slash in a LaTeX buffer, backslash is inserted instead. The slash key is easier to reach.

(defun my-make-slash-backslash ()
  "Make the slash key insert a backslash in the current buffer."
  (local-set-key (kbd "/") "\\"))

;; turn this on in every (La)TeX buffer
(add-hook 'TeX-mode-hook 'my-make-slash-backslash)

When you have to enter a slash itself, which is not often, use C-q (quoted-insert) followed by /. When you have to enter many slashes, for example when typing a URL, I believe copy and paste is the simplest workaround. It is also possible to extend the code so that typing backslash inserts a slash, which is another workaround, but I find that disorienting, so I don’t use it; you may want to, though.

If you want to be able to turn on and off “slash turns into backslash” on the fly, you can write a minor mode for that.
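
For example, a minimal sketch of such a minor mode could look like the following (the name my-slash-backslash-mode is just made up for illustration):

(define-minor-mode my-slash-backslash-mode
  "When enabled, typing / inserts \\ in the current buffer."
  :lighter " /->\\"
  :keymap (let ((map (make-sparse-keymap)))
            (define-key map (kbd "/") "\\")
            map))

;; enable it in (La)TeX buffers; toggle it anywhere with M-x my-slash-backslash-mode
(add-hook 'TeX-mode-hook 'my-slash-backslash-mode)

Because the binding lives in the minor mode keymap, turning the mode off restores the normal behavior of the slash key.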

2. use the two commands

Another way to enter backslashes with ease is to use C-c C-m (TeX-insert-macro) to enter a TeX macro (things like \section{...} and \frac{..}{..}) and C-c C-e (LaTeX-environment) to enter a LaTeX environment (things like \begin{..} .. \end{..}) in LaTeX buffers.
