Introduction to Wasserstein metric (earth mover’s distance)

This is a short introduction to the notion of Wasserstein metric but presented in a way that explicitly highlights the earth mover intuition on the definition.

1. the cost of moving one particle

Throughout the post, we fix a metric space (X,d). Suppose a particle of mass w \ge 0 is placed at position x \in X. The cost of moving this particle to a new position y \in X is defined to be w d(x,y). With this definition, you see:

  • The cost of moving a particle from place x to the same place x is zero.
  • The cost of moving a particle is proportional to its mass. (This makes sense: a particle of mass 2w at position x is and should be cost-wise equivalent to two particles each of mass w at the same position x.)
  • The cost of moving a particle of unit mass from x to y is d(x,y).

2. The cost of moving a finite collection of particles of total mass 1.

Imagine a configuration of three particles of total mass 1 placed in X. For each i = 1,2,3, the particle i has mass a_i \ge 0 and is placed at x_i \in X. Since the total mass is 1, we require \sum_{i=1}^3 a_i = 1. The mass distribution of this configuration may be written as \mu := \sum_{i=1}^3 a_i \delta_{x_i} where \delta_{x_i} denotes the Dirac delta function centered at x_i. (Alternatively, you could interprete the sum \sum_{i=1}^3 a_i \delta_{x_i} to mean a formal sum, or use Dirac measures instead.)

Now consider another configuration but this time of four particles of total mass 1, placed in the same space X. In this second configuration, for each j=1,2,3,4, there is a particle of mass b_j \ge 0 at position y_j \in X. The mass distribution of this configuration may be written as \nu = \sum_{j=1}^4 b_j \delta_{y_j}.

Given the first configuration (three particles), we can split some or all of the three particles into smaller particles of various smaller mass and move them in a way that results in the second configuration. There is more than one way to do it.

Start with the first configuration. Suppose, for each i=1,2,3, we split the particle of mass a_i at position x_i into four smaller particles of mass w_{i1}, w_{i2}, w_{i3}, w_{i4} \ge 0. The total number of split particles is now 12 = 3 \cdot 4 and a_i = \sum_{j=1}^4 w_{ij} for each i=1,2,3. Move the split particle of mass w_{ij} from the old position x_i to the new position y_j. The total mass of all particles at the new position y_j is \sum_{i=1}^3 w_{ij} and we require this to be equal to b_j, so that they merge to become a particle of mass b_j at position y_j.

Any 3 \times 4 matrix (w_{ij}) of nonnegative numbers such that a_i = \sum_{j=1}^4 w_{ij} for all i and b_j = \sum_{i=1}^3 w_{ij} for all j specifies a split-move-merge procedure that transforms the first configuration into the second configuration.

The cost of a split-move-merge procedure is defined to be its total cost of moving: moving the particle of mass w_{ij} from x_i to y_j costs w_{ij} d(x_i, y_j) and so the total cost is the sum \sum_{ij} w_{ij} d(x_i, y_j).

We define the cost of transforming the mass distribution \mu = \sum_{i=1}^3 a_i \delta_{x_i} into the second \nu = \sum_{j=1}^4 b_j \delta_{y_j} to be the minimum possible cost for split-move-merge procedures from the first configuration to the second configuration.

It is an easy check that this definition of distance between \mu and \nu does not depend on how you write \mu and \nu. (In other words, whether you see a particle at position x of mass \frac12 or see two particles at the same position x each of mass \frac14 does not affect costs.)

3. coupling

Each split-move-merge procedure (w_{ij}) defines a distribution \lambda = \sum_{ij} w_{ij} \delta_{(x_i,y_j)} on X^2. This distribution \lambda is a coupling of \mu and \nu in the sense that it projects to \mu and \nu. Indeed, if you project it to the first coordinate, you would get \sum_{ij} w_{ij} \delta_{x_i} = \sum_i \delta_{x_i} \sum_j w_{ij} = \sum_i a_i \delta_{x_i}.

You could think of the couping \lambda as a distribution on the space X^2 of arrows, where you think of each (x,y) \in X^2 to be an abstract arrow from x to y.

The integration of the function d(x,y) (whose domain is the space X^2 of arrows) w.r.t. the distribution \lambda works out to be the same as the cost \sum_{ij} w_{ij} d(x_i, y_j).

Conversely, each coupling \lambda of \mu and \nu gives rise to a split-move-merge procedure that transforms \mu into \nu whose cost is equal to the integration of d(x,y) w.r.t. \lambda.

So the cost of transforming \mu to \nu is the same as the minimum possible expected value of the length d(x,y) of arrows (x,y) over all possible couplings of \mu and \nu.

4. composing

The distance between \mu and \nu is a metric. The triangle inequality on d(\cdot,\cdot) transfers to the triangle inequality for distance between \mu, \nu because you can compose a transformation from \mu to \nu and another from \nu to \rho to get a third transformation from \mu to \nu whose cost is at most the sum of the costs of the two former transformations. This property is more easily seen if you think of \mu as piles of dirt (of unit mass). In fact, suppose \mu is three piles of dirt: pile i of mass a_i at position x_i. Suppose pile 1 is red dirt, pile 2 is yellow dirt, and pile 3 is green dirt. Move this three piles of dirt to form \nu using some transformation (w_{ij}). Each of the resulting piles may be a mixture of different colors. Now use some other transformation (w'_{ij}) to move the piles again to form \rho.The net effect is a transformation of \mu into \rho and the corresponding composition (w''_{ij}) can be obtained by examining percentages of different colors of dirt in each of the final piles of dirt. (the output (w''_{ij}) is actually not unique here, but there is an obvious canonical choice.)

5. definition of the Wasserstein distance

What we have defined is known as the 1st Wasserstein distance between \mu and \nu and is usually denoted W_1(\mu,\nu). For general non-discrete probability measures \mu and \nu on X, the definition is done using couplings of \mu and \nu but you will have to take inf instead of min, and the distance can be infinite.

This distance can be thought of as a probabilistic analogue of the Hausdorff distance between two sets..

6. Moving a collection of particles of equal mass

Suppose we are to move a collection of n particles whose each mass is \frac1n. In the first configuration, the n particles are at the n places x_1, \cdots, x_n (some of which may be same places). In the second configuration, the new n places are y_1, \cdots, y_n. It is known that in this case there is always a split-move-merge procedure (between the two configurations) that minimizes the cost but also skips the split and the merge steps. The procedures without the split and merge steps are specified by functions \pi from \{1,\cdots,n\} to itself. If \pi is such a function, then this specifies a procedure in which you move the old particle of mass \frac1n at the old position x_i to the new position y_{\pi(i)} (at the cost of \frac1n d(x_i, y_{\pi(i)})) for i=1,2,\cdots,n.

So the cost of transforming \mu = \frac1n \sum_{i=1}^n \delta_{x_i} to \nu = \frac1n \sum_{j=1}^n \delta_{y_j} in this case is the same as the minimum of the average \frac1n \sum_{i=1}^n d(x_i, y_{\pi(i)}) over all permutations \pi on \{1,\cdots, n\}.

The fact that split and merge steps can be discarded is known as Birkhoff-von Neumann theorem, which actually states that any n\times n doubly stochastic matrix (i.e. a split-move-merge procedure in our case) can be written as a convex combination of permutations.

It is known that, in a metric space that is nice enough, any probability measure \mu on it can be approximated by discrete distributions of the form \frac1n \sum_{i=1}^n \delta_{x_i} in an appropriate sense.

7. Some properties

(some lower bounds) If two probability measures \mu and \nu are supported on subsets A, B respectively and if d(A,B) \ge c > 0, then the distance between \mu and \nu is also \ge c. If we relax to \mu(A) = 1 - O(\epsilon) and \nu(B) = 1 - O(\epsilon), then we still have the approximate W_1(\mu,\nu) \ge c (1 - O(\epsilon)).

(upper bounds) If \nu is the pushforward image of \mu under some f: X \to X and if f is such that d(x,f(x)) \le \epsilon for all x, then W_1(\mu,\nu) \le \epsilon. If we relax to just d(x, f(x)) = O(\epsilon) for all x \in A (uniformly) in some big set \mu(A) = 1 - O(\epsilon) and the diameter of X is O(1), then we still obtain an approximate conclusion W_1(\mu,\nu) = O(\epsilon).

This entry was posted in Mathematics and tagged . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s