## Introduction to the Wasserstein metric (earth mover's distance)

This is a short introduction to the Wasserstein metric, presented in a way that makes the earth-mover intuition behind the definition explicit.

## 1. The cost of moving one particle

Throughout the post, we fix a metric space (X,d). Suppose a particle of mass $w \ge 0$ is placed at position $x \in X$. The cost of moving this particle to a new position $y \in X$ is defined to be $w d(x,y)$. With this definition, you see:

• The cost of moving a particle from place x to the same place x is zero.
• The cost of moving a particle is proportional to its mass. (This makes sense: a particle of mass $2w$ at position x should be, and is, cost-wise equivalent to two particles of mass $w$ each at the same position x.)
• The cost of moving a particle of unit mass from x to y is $d(x,y)$.
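The three properties above can be checked in a minimal numerical sketch. Here X is taken to be the plane with the Euclidean metric, an illustrative choice; any metric space works:

```python
import numpy as np

def move_cost(w, x, y):
    """Cost of moving a particle of mass w from x to y: w * d(x, y).

    Here d is the Euclidean metric on the plane (an illustrative choice).
    """
    return w * np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))

# Moving a particle to the same place costs zero.
assert move_cost(2.0, (1, 1), (1, 1)) == 0.0
# The cost is proportional to the mass: mass 2w costs twice mass w.
assert move_cost(2.0, (0, 0), (3, 4)) == 2 * move_cost(1.0, (0, 0), (3, 4))
# A unit mass moved from x to y costs exactly d(x, y).
assert move_cost(1.0, (0, 0), (3, 4)) == 5.0
```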

## 2. The cost of moving a finite collection of particles of total mass 1

Imagine a configuration of three particles of total mass 1 placed in X. For each i = 1,2,3, the particle i has mass $a_i \ge 0$ and is placed at $x_i \in X$. Since the total mass is 1, we require $\sum_{i=1}^3 a_i = 1$. The mass distribution of this configuration may be written as $\mu := \sum_{i=1}^3 a_i \delta_{x_i}$ where $\delta_{x_i}$ denotes the Dirac delta function centered at $x_i$. (Alternatively, you could interpret the sum $\sum_{i=1}^3 a_i \delta_{x_i}$ as a formal sum, or use Dirac measures instead.)

Now consider another configuration but this time of four particles of total mass 1, placed in the same space X. In this second configuration, for each $j=1,2,3,4$, there is a particle of mass $b_j \ge 0$ at position $y_j \in X$. The mass distribution of this configuration may be written as $\nu = \sum_{j=1}^4 b_j \delta_{y_j}$.

Given the first configuration (three particles), we can split some or all of the three particles into smaller particles of various smaller mass and move them in a way that results in the second configuration. There is more than one way to do it.

Start with the first configuration. Suppose, for each $i=1,2,3$, we split the particle of mass $a_i$ at position $x_i$ into four smaller particles of mass $w_{i1}, w_{i2}, w_{i3}, w_{i4} \ge 0$. The total number of split particles is now $12 = 3 \cdot 4$ and $a_i = \sum_{j=1}^4 w_{ij}$ for each $i=1,2,3$. Move the split particle of mass $w_{ij}$ from the old position $x_i$ to the new position $y_j$. The total mass of all particles at the new position $y_j$ is $\sum_{i=1}^3 w_{ij}$ and we require this to be equal to $b_j$, so that they merge to become a particle of mass $b_j$ at position $y_j$.

Any $3 \times 4$ matrix $(w_{ij})$ of nonnegative numbers such that $a_i = \sum_{j=1}^4 w_{ij}$ for all i and $b_j = \sum_{i=1}^3 w_{ij}$ for all j specifies a split-move-merge procedure that transforms the first configuration into the second configuration.

The cost of a split-move-merge procedure is defined to be its total cost of moving: moving the particle of mass $w_{ij}$ from $x_i$ to $y_j$ costs $w_{ij} d(x_i, y_j)$ and so the total cost is the sum $\sum_{ij} w_{ij} d(x_i, y_j)$.
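A split-move-merge procedure and its cost can be sketched numerically. The positions and masses below are hypothetical illustrative values, with X taken to be the real line so that $d(x,y) = |x-y|$:

```python
import numpy as np

# Hypothetical configurations on the real line (d(x, y) = |x - y|).
a = np.array([0.5, 0.3, 0.2]); x = np.array([0.0, 1.0, 3.0])          # first: 3 particles
b = np.array([0.25, 0.25, 0.25, 0.25]); y = np.array([0.0, 1.0, 2.0, 3.0])  # second: 4 particles

# A 3x4 split-move-merge procedure: w[i, j] = mass moved from x_i to y_j.
w = np.array([
    [0.25, 0.25, 0.00, 0.00],
    [0.00, 0.00, 0.25, 0.05],
    [0.00, 0.00, 0.00, 0.20],
])
# Validity: row i sums to a_i (the split), column j sums to b_j (the merge).
assert np.allclose(w.sum(axis=1), a) and np.allclose(w.sum(axis=0), b)

# Total cost = sum_ij w_ij * d(x_i, y_j).
D = np.abs(x[:, None] - y[None, :])   # 3x4 matrix of pairwise distances
cost = (w * D).sum()
print(cost)
```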

We define the cost of transforming the mass distribution $\mu = \sum_{i=1}^3 a_i \delta_{x_i}$ into $\nu = \sum_{j=1}^4 b_j \delta_{y_j}$ to be the minimum possible cost over all split-move-merge procedures from the first configuration to the second configuration.
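Finding this minimum over all valid matrices $(w_{ij})$ is a linear program: the cost is linear in the $w_{ij}$, and the constraints are the row and column sums. A sketch using SciPy's generic LP solver, with the same kind of hypothetical real-line configurations as before:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical configurations on the real line (d(x, y) = |x - y|).
a = np.array([0.5, 0.3, 0.2]); x = np.array([0.0, 1.0, 3.0])
b = np.array([0.25, 0.25, 0.25, 0.25]); y = np.array([0.0, 1.0, 2.0, 3.0])
D = np.abs(x[:, None] - y[None, :])          # d(x_i, y_j)

m, n = D.shape
# Equality constraints over the flattened variable w_ij >= 0:
# rows of w sum to a_i, columns sum to b_j.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0         # sum_j w_ij = a_i
for j in range(n):
    A_eq[m + j, j::n] = 1.0                  # sum_i w_ij = b_j
res = linprog(D.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None))
print(res.fun)   # the minimum cost of transforming mu into nu
```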

It is an easy check that this definition of distance between $\mu$ and $\nu$ does not depend on how you write $\mu$ and $\nu$. (In other words, whether you see a particle at position x of mass $\frac12$ or see two particles at the same position x each of mass $\frac14$ does not affect costs.)
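This invariance can be checked numerically on the real line. The example below writes the same $\mu$ two ways, one atom of mass $\frac12$ versus two atoms of mass $\frac14$ at the same position, and uses SciPy's one-dimensional earth-mover routine (the atoms are hypothetical illustrative values):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# mu written two ways: {mass 1/2 at 0, mass 1/2 at 2}
# versus {mass 1/4 at 0, mass 1/4 at 0, mass 1/2 at 2}.
x1, a1 = [0.0, 2.0], [0.5, 0.5]
x2, a2 = [0.0, 0.0, 2.0], [0.25, 0.25, 0.5]
# nu: a single particle of mass 1 at position 1.
y, b = [1.0], [1.0]

d1 = wasserstein_distance(x1, y, u_weights=a1, v_weights=b)
d2 = wasserstein_distance(x2, y, u_weights=a2, v_weights=b)
# Both ways of writing mu give the same cost of transforming mu into nu.
assert np.isclose(d1, d2)
```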

## 3. Coupling

Each split-move-merge procedure $(w_{ij})$ defines a distribution $\lambda = \sum_{ij} w_{ij} \delta_{(x_i,y_j)}$ on $X^2$. This distribution $\lambda$ is a coupling of $\mu$ and $\nu$ in the sense that it projects to $\mu$ and $\nu$. Indeed, projecting it to the first coordinate gives $\sum_{ij} w_{ij} \delta_{x_i} = \sum_i \left(\sum_j w_{ij}\right) \delta_{x_i} = \sum_i a_i \delta_{x_i} = \mu$, and similarly the projection to the second coordinate is $\nu$.

You could think of the coupling $\lambda$ as a distribution on the space $X^2$ of arrows, where you think of each $(x,y) \in X^2$ as an abstract arrow from x to y.

Integrating the function $d(x,y)$ (whose domain is the space $X^2$ of arrows) with respect to the distribution $\lambda$ gives exactly the cost $\sum_{ij} w_{ij} d(x_i, y_j)$.
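Both facts, that the projections of $\lambda$ are $\mu$ and $\nu$ and that integrating $d$ against $\lambda$ recovers the cost, can be verified on the hypothetical real-line configurations used earlier:

```python
import numpy as np

# Hypothetical configurations on the real line, and a procedure w.
a = np.array([0.5, 0.3, 0.2]); x = np.array([0.0, 1.0, 3.0])
b = np.array([0.25, 0.25, 0.25, 0.25]); y = np.array([0.0, 1.0, 2.0, 3.0])
w = np.array([
    [0.25, 0.25, 0.00, 0.00],
    [0.00, 0.00, 0.25, 0.05],
    [0.00, 0.00, 0.00, 0.20],
])

# The coupling lambda puts mass w[i, j] on the arrow (x_i, y_j).
# Projecting lambda to the first / second coordinate recovers mu / nu:
assert np.allclose(w.sum(axis=1), a)   # first marginal  = (a_i)
assert np.allclose(w.sum(axis=0), b)   # second marginal = (b_j)

# Integrating d(x, y) over the arrows, weighted by lambda,
# equals the cost sum_ij w_ij d(x_i, y_j).
arrows = [((x[i], y[j]), w[i, j]) for i in range(3) for j in range(4)]
expected_length = sum(mass * abs(p - q) for (p, q), mass in arrows)
cost = (np.abs(x[:, None] - y[None, :]) * w).sum()
assert np.isclose(expected_length, cost)
```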

Conversely, each coupling $\lambda$ of $\mu$ and $\nu$ gives rise to a split-move-merge procedure transforming $\mu$ into $\nu$ whose cost equals the integral of $d(x,y)$ with respect to $\lambda$.

So the cost of transforming $\mu$ to $\nu$ is the same as the minimum possible expected value of the length $d(x,y)$ of arrows (x,y) over all possible couplings of $\mu$ and $\nu$.

## 4. Composing

The distance between $\mu$ and $\nu$ is a metric. The triangle inequality for $d(\cdot,\cdot)$ transfers to the triangle inequality for the distance between measures because you can compose a transformation from $\mu$ to $\nu$ with another from $\nu$ to $\rho$ to get a third transformation from $\mu$ to $\rho$ whose cost is at most the sum of the costs of the two former transformations.

This property is more easily seen if you think of $\mu$ as piles of dirt (of unit total mass). In fact, suppose $\mu$ is three piles of dirt: pile i of mass $a_i$ at position $x_i$. Suppose pile 1 is red dirt, pile 2 is yellow dirt, and pile 3 is green dirt. Move these three piles of dirt to form $\nu$ using some transformation $(w_{ij})$. Each of the resulting piles may be a mixture of different colors. Now use some other transformation $(w'_{jk})$ to move the piles again to form $\rho$. The net effect is a transformation of $\mu$ into $\rho$, and the corresponding composition $(w''_{ik})$ can be obtained by examining the percentages of the different colors of dirt in each of the final piles. (The output $(w''_{ik})$ is actually not unique here, but there is an obvious canonical choice.)
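The color bookkeeping above amounts to the canonical composition $w''_{ik} = \sum_j w_{ij} w'_{jk} / b_j$ (assuming every $b_j > 0$). A numerical sketch, with hypothetical real-line configurations and the independent couplings chosen purely for illustration:

```python
import numpy as np

# Hypothetical three-point configurations mu, nu, rho on the real line.
a = np.array([0.5, 0.3, 0.2]); x = np.array([0.0, 1.0, 3.0])
b = np.array([0.4, 0.4, 0.2]); y = np.array([0.5, 1.5, 3.0])
c = np.array([0.3, 0.3, 0.4]); z = np.array([0.0, 2.0, 3.0])

def d(p, q):            # matrix of pairwise distances |p_i - q_j|
    return np.abs(p[:, None] - q[None, :])

# Any transformations mu -> nu and nu -> rho (here: independent couplings).
w1 = np.outer(a, b)     # rows sum to a_i, columns to b_j
w2 = np.outer(b, c)

# Canonical composition: split the mass b_j at y_j in the proportions
# in which w2 sends it onward:  w''_ik = sum_j w1_ij * w2_jk / b_j.
w12 = w1 @ (w2 / b[:, None])
assert np.allclose(w12.sum(axis=1), a) and np.allclose(w12.sum(axis=0), c)

# Triangle inequality at the level of procedures:
cost = lambda w, D: (w * D).sum()
assert cost(w12, d(x, z)) <= cost(w1, d(x, y)) + cost(w2, d(y, z)) + 1e-12
```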

## 5. Definition of the Wasserstein distance

What we have defined is known as the 1st Wasserstein distance between $\mu$ and $\nu$, usually denoted $W_1(\mu,\nu)$. For general (non-discrete) probability measures $\mu$ and $\nu$ on $X$, the definition is given in terms of couplings of $\mu$ and $\nu$, but you have to take an infimum instead of a minimum, and the distance can be infinite.
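On the real line, SciPy computes $W_1$ directly from atoms and weights; on the hypothetical configurations used throughout, it agrees with the split-move-merge minimum:

```python
from scipy.stats import wasserstein_distance

# The same hypothetical real-line configurations as in the earlier sketches.
a = [0.5, 0.3, 0.2]; x = [0.0, 1.0, 3.0]
b = [0.25, 0.25, 0.25, 0.25]; y = [0.0, 1.0, 2.0, 3.0]

dist = wasserstein_distance(x, y, u_weights=a, v_weights=b)
print(dist)   # W_1(mu, nu)
```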

This distance can be thought of as a probabilistic analogue of the Hausdorff distance between two sets.

## 6. Moving a collection of particles of equal mass

Suppose we are to move a collection of n particles, each of mass $\frac1n$. In the first configuration, the n particles are at the n positions $x_1, \cdots, x_n$ (some of which may coincide). In the second configuration, the new n positions are $y_1, \cdots, y_n$. It is known that in this case there is always a split-move-merge procedure (between the two configurations) that minimizes the cost while also skipping the split and merge steps. The procedures without the split and merge steps are specified by permutations $\pi$ of $\{1,\cdots,n\}$ (only bijections conserve mass): such a $\pi$ specifies the procedure in which you move the old particle of mass $\frac1n$ at the old position $x_i$ to the new position $y_{\pi(i)}$ (at the cost of $\frac1n d(x_i, y_{\pi(i)})$) for $i=1,2,\cdots,n$.

So the cost of transforming $\mu = \frac1n \sum_{i=1}^n \delta_{x_i}$ to $\nu = \frac1n \sum_{j=1}^n \delta_{y_j}$ in this case is the same as the minimum of the average $\frac1n \sum_{i=1}^n d(x_i, y_{\pi(i)})$ over all permutations $\pi$ on $\{1,\cdots, n\}$.
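Minimizing over permutations is the classical assignment problem, which SciPy solves directly. A sketch with hypothetical random point clouds in the plane:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical data: n points each in the plane, every particle of mass 1/n.
rng = np.random.default_rng(0)
n = 5
x = rng.random((n, 2)); y = rng.random((n, 2))

# Cost matrix d(x_i, y_j) for the Euclidean metric.
D = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)

# Optimal permutation pi: move the particle at x_i to y_{pi(i)}.
row, col = linear_sum_assignment(D)
w1 = D[row, col].mean()           # (1/n) * sum_i d(x_i, y_{pi(i)})
print(w1)
```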

The fact that the split and merge steps can be discarded follows from the Birkhoff–von Neumann theorem, which states that any $n\times n$ doubly stochastic matrix (in our case, $n$ times the split-move-merge matrix $(w_{ij})$) can be written as a convex combination of permutation matrices. Since the cost is linear in the matrix, the minimum over all such matrices is attained at a permutation matrix.

It is known that, in a metric space that is nice enough, any probability measure $\mu$ on it can be approximated by discrete distributions of the form $\frac1n \sum_{i=1}^n \delta_{x_i}$ in an appropriate sense (for instance, in the $W_1$ metric itself).

## 7. Some properties

(Some lower bounds) If two probability measures $\mu$ and $\nu$ are supported on subsets $A, B$ respectively and $d(A,B) \ge c > 0$, then $W_1(\mu,\nu) \ge c$ as well. If we relax to $\mu(A) = 1 - O(\epsilon)$ and $\nu(B) = 1 - O(\epsilon)$, then any coupling still places mass $1 - O(\epsilon)$ on $A \times B$, so we get the approximate bound $W_1(\mu,\nu) \ge c (1 - O(\epsilon))$.

(Upper bounds) If $\nu$ is the pushforward image of $\mu$ under some $f: X \to X$ with $d(x,f(x)) \le \epsilon$ for all $x$, then $W_1(\mu,\nu) \le \epsilon$: the coupling $(\mathrm{id}, f)_* \mu$ moves every point by at most $\epsilon$. If we relax to just $d(x, f(x)) = O(\epsilon)$ uniformly on some big set $A$ with $\mu(A) = 1 - O(\epsilon)$, and the diameter of X is $O(1)$, then we still obtain the approximate conclusion $W_1(\mu,\nu) = O(\epsilon)$.
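The first upper bound can be sketched on the real line with a hypothetical $f$ that moves each point by at most $\epsilon$:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical mu on the real line, and a map f with |f(t) - t| <= eps.
eps = 0.1
x = np.array([0.0, 1.0, 2.0, 5.0]); a = np.array([0.1, 0.2, 0.3, 0.4])
f = lambda t: t + eps * np.sin(t)     # |f(t) - t| = eps*|sin t| <= eps
y = f(x)                              # nu = f_* mu has atoms at f(x_i)

# The coupling (id, f)_* mu moves each atom by at most eps, so W1 <= eps.
assert wasserstein_distance(x, y, u_weights=a, v_weights=a) <= eps + 1e-12
```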
