## How to make prettify-symbols-mode work with AUCTeX

prettify-symbols-mode is a recent feature of Emacs and it’s very nice. And it looks like it can replace TeX-fold-mode in the future. But, at the time of writing, prettify-symbols-mode doesn’t seem to work well with AUCTeX unless you enable two workarounds together.

Tested versions:

• GNU Emacs 25.1

## Usual way to enable pretty symbols

If you want to enable it for elisp buffers, you can add:

(add-hook 'emacs-lisp-mode-hook 'prettify-symbols-mode)

Then something like (lambda () (blah)) in elisp buffers should display as (λ () (blah)).

If you want to enable it also for other lisp buffers, scheme mode buffers etc, you can adjust the following code:

(dolist (mode '(scheme emacs-lisp lisp clojure))
(let ((here (intern (concat (symbol-name mode) "-mode-hook"))))
(add-hook here 'prettify-symbols-mode)))

If you want to enable for all buffers, you can add:

(global-prettify-symbols-mode 1)

And then for major modes of your interest, you may want to adjust the buffer-local prettify-symbols-alist accordingly, following the simple example code you can find from the documentation for prettify-symbols-mode.

## Expected way to use with AUCTeX

Following code may be expected to work:

(add-hook 'TeX-mode-hook 'prettify-symbols-mode)

If it works, then \alpha, \beta, \leftarrow and so on should display as α, β, ←, …  for TeX file buffers. I do not doubt that it will just work fine in future versions of AUCTeX, and if you are reading this as an old article, it is possible that just upgrading your AUCTeX package may be enough to make that line work as you expected. If it doesn’t work, then try making the following two changes.

## First change

(defun my-delayed-prettify ()
(run-with-idle-timer 0 nil (lambda () (prettify-symbols-mode 1))))
(add-hook 'TeX-mode-hook 'my-delayed-prettify)

This way, (prettify-symbols-mode 1) is guaranteed to run after the style hooks and not before. I don’t know what style hooks do, but it looks like they may reset/erase font-lock stuff you have set up.

If pretty symbols still don’t show up in AUCTeX buffers, then try adding the following change, in addition to the above change.

## Second change

(require 'tex)

tex.el will load anyway if you visit a TeX file with Emacs. This is a strange change to make, indeed.

You should also remove the following line that is commonly used with miktex users:

(require 'tex-mik)

tex-mik.el is a good small library but tex-mik.el loads tex.el. Feel free to copy parts of tex-mik.el and paste to your dotemacs if you want.

You can ensure you have removed every call of (require ‘tex) from your dotemacs by appending the following line to the end of dotemacs and then restarting Emacs to see if the warning message shows up:

(if (featurep 'tex) (warn "(require 'tex) is still somewhere!"))

If there was some code in dotemacs that relied on the fact that (require ‘tex) was called before, then you have to wrap that code with the with-eval-after-load macro, like this:

(with-eval-after-load 'tex
("SumatraPDF"
my--sumatrapdf)))

## When OneDrive is stuck at syncing

Purely a OneDrive problem, but it may also concern LaTeX users, Emacs users, etc.

## Problem description

OneDrive is a great feature of MS Windows that many people and I love and use, but you might have encountered some issues with syncing. This becomes more visible when you have OneDrive tray icon always appear on the taskbar, which you can have by first going to “Taskbar settings” and then from there entering the link “Select which icons appear on the taskbar”.

Sometimes OneDrive is stuck in the middle of syncing and never finishes syncing. You can spot it from the tray icon indefinitely staying at the “processing changes” or “uploading …” state. Maybe a short way to describe the problem is to say that OneDrive suddenly behaves like Internet access is off, despite the fact that you can browse the Internet fine at that time.

## What tends to cause it

There seems to be certain types of tasks that tend to mess with OneDrive temporarily:

• Compiling some code inside a folder in OneDrive (e.g. from working with a LaTeX document, writing a program)
• editing a text file in OneDrive using an editor that may generate or maintain auxiliary files alongside the text file (e.g. Emacs lock files)

Whether those tasks will actually cause it seems quite random. So it’s understandable that it might take a while for the folks at MS to fix this bug, if this is a bug.

## What to do when it happens

The shortest way I know is: Long press or right click on the OneDrive icon to open up its context menu and select “pause syncing”. Then open the context menu again and select “Resume syncing”. You might need to update Windows 10 if you don’t see the “pause syncing” button.

Another way is to exit OneDrive (again using its context menu) and then restart it.

The syncing will get unstuck, but when you do one of those aforementioned tasks again, the syncing may get stuck again.

There seems to be no permanent solution available yet and we will all have to wait until a fix is released.

## Introduction to Wasserstein metric (earth mover’s distance)

This is a short introduction to the notion of Wasserstein metric but presented in a way that explicitly highlights the earth mover intuition on the definition.

## 1. the cost of moving one particle

Throughout the post, we fix a metric space (X,d). Suppose a particle of mass $w \ge 0$ is placed at position $x \in X$. The cost of moving this particle to a new position $y \in X$ is defined to be $w d(x,y)$. With this definition, you see:

• The cost of moving a particle from place x to the same place x is zero.
• The cost of moving a particle is proportional to its mass. (This makes sense: a particle of mass $2w$ at position x is and should be cost-wise equivalent to two particles each of mass $w$ at the same position x.)
• The cost of moving a particle of unit mass from x to y is $d(x,y)$.

## 2. The cost of moving a finite collection of particles of total mass 1.

Imagine a configuration of three particles of total mass 1 placed in X. For each i = 1,2,3, the particle i has mass $a_i \ge 0$ and is placed at $x_i \in X$. Since the total mass is 1, we require $\sum_{i=1}^3 a_i = 1$. The mass distribution of this configuration may be written as $\mu := \sum_{i=1}^3 a_i \delta_{x_i}$ where $\delta_{x_i}$ denotes the Dirac delta function centered at $x_i$. (Alternatively, you could interprete the sum $\sum_{i=1}^3 a_i \delta_{x_i}$ to mean a formal sum, or use Dirac measures instead.)

Now consider another configuration but this time of four particles of total mass 1, placed in the same space X. In this second configuration, for each $j=1,2,3,4$, there is a particle of mass $b_j \ge 0$ at position $y_j \in X$. The mass distribution of this configuration may be written as $\nu = \sum_{j=1}^4 b_j \delta_{y_j}$.

Given the first configuration (three particles), we can split some or all of the three particles into smaller particles of various smaller mass and move them in a way that results in the second configuration. There is more than one way to do it.

Start with the first configuration. Suppose, for each $i=1,2,3$, we split the particle of mass $a_i$ at position $x_i$ into four smaller particles of mass $w_{i1}, w_{i2}, w_{i3}, w_{i4} \ge 0$. The total number of split particles is now $12 = 3 \cdot 4$ and $a_i = \sum_{j=1}^4 w_{ij}$ for each $i=1,2,3$. Move the split particle of mass $w_{ij}$ from the old position $x_i$ to the new position $y_j$. The total mass of all particles at the new position $y_j$ is $\sum_{i=1}^3 w_{ij}$ and we require this to be equal to $b_j$, so that they merge to become a particle of mass $b_j$ at position $y_j$.

Any $3 \times 4$ matrix $(w_{ij})$ of nonnegative numbers such that $a_i = \sum_{j=1}^4 w_{ij}$ for all i and $b_j = \sum_{i=1}^3 w_{ij}$ for all j specifies a split-move-merge procedure that transforms the first configuration into the second configuration.

The cost of a split-move-merge procedure is defined to be its total cost of moving: moving the particle of mass $w_{ij}$ from $x_i$ to $y_j$ costs $w_{ij} d(x_i, y_j)$ and so the total cost is the sum $\sum_{ij} w_{ij} d(x_i, y_j)$.

We define the cost of transforming the mass distribution $\mu = \sum_{i=1}^3 a_i \delta_{x_i}$ into the second $\nu = \sum_{j=1}^4 b_j \delta_{y_j}$ to be the minimum possible cost for split-move-merge procedures from the first configuration to the second configuration.

It is an easy check that this definition of distance between $\mu$ and $\nu$ does not depend on how you write $\mu$ and $\nu$. (In other words, whether you see a particle at position x of mass $\frac12$ or see two particles at the same position x each of mass $\frac14$ does not affect costs.)

## 3. coupling

Each split-move-merge procedure $(w_{ij})$ defines a distribution $\lambda = \sum_{ij} w_{ij} \delta_{(x_i,y_j)}$ on $X^2$. This distribution $\lambda$ is a coupling of $\mu$ and $\nu$ in the sense that it projects to $\mu$ and $\nu$. Indeed, if you project it to the first coordinate, you would get $\sum_{ij} w_{ij} \delta_{x_i} = \sum_i \delta_{x_i} \sum_j w_{ij} = \sum_i a_i \delta_{x_i}$.

You could think of the couping $\lambda$ as a distribution on the space $X^2$ of arrows, where you think of each $(x,y) \in X^2$ to be an abstract arrow from x to y.

The integration of the function d(x,y) (whose domain is the space $X^2$ of arrows) w.r.t. the distribution $\lambda$ works out to be the same as the cost $\sum_{ij} w_{ij} d(x_i, y_j)$.

Conversely, each coupling $\lambda$ of $\mu$ and $\nu$ gives rise to a split-move-merge procedure that transforms $\mu$ into $\nu$ whose cost is equal to the integration of d(x,y) w.r.t. $\lambda$.

So the cost of transforming $\mu$ to $\nu$ is the same as the minimum possible expected value of the length $d(x,y)$ of arrows (x,y) over all possible couplings of $\mu$ and $\nu$.

## 4. composing

The distance between $\mu$ and $\nu$ is a metric. The triangle inequality on $d(\cdot,\cdot)$ transfers to the triangle inequality for distance between $\mu, \nu$ because you can compose a transformation from $\mu$ to $\nu$ and another from $\nu$ to $\rho$ to get a third transformation from $\mu$ to $\nu$ whose cost is at most the sum of the costs of the two former transformations. This property is more easily seen if you think of $\mu$ as piles of dirt (of unit mass). In fact, suppose $\mu$ is three piles of dirt: pile i of mass $a_i$ at position $x_i$. Suppose pile 1 is red dirt, pile 2 is yellow dirt, and pile 3 is green dirt. Move this three piles of dirt to form $\nu$ using some transformation $(w_{ij})$. Each of the resulting piles may be a mixture of different colors. Now use some other transformation $(w'_{ij})$ to move the piles again to form $\rho$.The net effect is a transformation of $\mu$ into $\rho$ and the corresponding composition $(w''_{ij})$ can be obtained by examining percentages of different colors of dirt in each of the final piles of dirt. (the output $(w''_{ij})$ is actually not unique here, but there is an obvious canonical choice.)

## 5. definition of the Wasserstein distance

What we have defined is known as the 1st Wasserstein distance between $\mu$ and $\nu$ and is usually denoted $W_1(\mu,\nu)$. For general non-discrete probability measures $\mu$ and $\nu$ on $X$, the definition is done using couplings of $\mu$ and $\nu$ but you will have to take inf instead of min, and the distance can be infinite.

This distance can be thought of as a probabilistic analogue of the Hausdorff distance between two sets..

## 6. Moving a collection of particles of equal mass

Suppose we are to move a collection of n particles whose each mass is $\frac1n$. In the first configuration, the n particles are at the n places $x_1, \cdots, x_n$ (some of which may be same places). In the second configuration, the new n places are $y_1, \cdots, y_n$. It is known that in this case there is always a split-move-merge procedure (between the two configurations) that minimizes the cost but also skips the split and the merge steps. The procedures without the split and merge steps are specified by functions $\pi$ from $\{1,\cdots,n\}$ to itself. If $\pi$ is such a function, then this specifies a procedure in which you move the old particle of mass $\frac1n$ at the old position $x_i$ to the new position $y_{\pi(i)}$ (at the cost of $\frac1n d(x_i, y_{\pi(i)})$) for $i=1,2,\cdots,n$.

So the cost of transforming $\mu = \frac1n \sum_{i=1}^n \delta_{x_i}$ to $\nu = \frac1n \sum_{j=1}^n \delta_{y_j}$ in this case is the same as the minimum of the average $\frac1n \sum_{i=1}^n d(x_i, y_{\pi(i)})$ over all permutations $\pi$ on $\{1,\cdots, n\}$.

The fact that split and merge steps can be discarded is known as Birkhoff-von Neumann theorem, which actually states that any $n\times n$ doubly stochastic matrix (i.e. a split-move-merge procedure in our case) can be written as a convex combination of permutations.

It is known that, in a metric space that is nice enough, any probability measure $\mu$ on it can be approximated by discrete distributions of the form $\frac1n \sum_{i=1}^n \delta_{x_i}$ in an appropriate sense.

## 7. Some properties

(some lower bounds) If two probability measures $\mu$ and $\nu$ are supported on subsets $A, B$ respectively and if $d(A,B) \ge c > 0$, then the distance between $\mu$ and $\nu$ is also $\ge c$. If we relax to $\mu(A) = 1 - O(\epsilon)$ and $\nu(B) = 1 - O(\epsilon)$, then we still have the approximate $W_1(\mu,\nu) \ge c (1 - O(\epsilon))$.

(upper bounds) If $\nu$ is the pushforward image of $\mu$ under some $f: X \to X$ and if $f$ is such that $d(x,f(x)) \le \epsilon$ for all x, then $W_1(\mu,\nu) \le \epsilon$. If we relax to just $d(x, f(x)) = O(\epsilon)$ for all $x \in A$ (uniformly) in some big set $\mu(A) = 1 - O(\epsilon)$ and the diameter of X is O(1), then we still obtain an approximate conclusion $W_1(\mu,\nu) = O(\epsilon)$.

## How to close the sudden Camera app from Windows 8

The readers are assumed to be familiar with names of basic Windows 8 touch gestures (for which, see http://windows.microsoft.com/en-us/windows-8/touch-swipe-tap-beyond )

## 1. introduction

With some Windows 8 tablets (or laptops), you may come across a sudden appearance of the Camera app or the sudden activation of the webcam when you are trying to turn on your laptop (or maybe more precisely, when you are trying to sign in to your PC from the lock screen). Since the screen makes it clear that the webcam is on, you know you have unintentionally launched some Camera app. Surprisingly, the usual ways of closing or exiting the app or switching to other apps does not work. How should one get out of this and get back to the usual lock screen (the log in screen)?

## 2. how to get out

It is actually quite simple to turn off or deactivate this thing. You simply tap or click Unlock which should be found on the left side of the app commands at the bottom of the screen and you are back to the lock screen.

## 3. how did I end up there

You usually swipe up on the lock screen to sign in to your PC, but when you swipe down on it instead, that launches the Camera app. This is a useful feature of Windows 8 that you can turn on or off. The Camera app launched in this way seems to show a different interface from the one launched in the usual way. Maybe it is not the same Camera app? I am not sure.

## 4. how to turn off the feature

This quick access of camera from the lock screen is a useful feature but if you want to turn this feature off for some reason, type the following in Search:

lock screen camera


and by navigating from there, you will arrive at an option that says

Swipe down on the lock screen to use the camera


## Intuitions on problems from Elements of Information Theory Chapter 2

This post is not about sharing solutions to those problems in the book. As the title indicates, this post is rather about sharing intuitions or interpretations of some results mentioned or alluded in some problems listed at the end of Chapter 2 of the book “Elements of Information Theory” second edition. The chapter deals with the notion of entropy and mutual information.

Some of my colleagues and I were reading some rather huge paper which draws from many basic ideas from many fields, and one of them was the notion of mutual information. We only had a grasp on its special case, the notion of entropy of a random variable and since the general notion was new to us, I decided to read at least first few chapters of “Elements of Information Theory”.

## 1. meaning of conditional mutual information

Given three (discrete) random variables X, Y, Z, the conditional mutual information I(X; Y | Z) intuitively means how much X says about Y (or equivalently how much Y says about X) under the assumption that the value of Z is already known. Try to use this intuition to make sense of the chain rule for mutual information and the data-processing inequality.

As for its special case, the intuitions for conditional entropy, see (already linked previously) Shannon’s entropy of random variables and partitions.

## 2. comments on some selected problems from Chapter 2

### 2.1. Problem 2.6

The problem asks to find examples where mutual information increases or decreases when conditioned to a third random variable. This points out that I(X;Y | Z) is not monotone in Z. One should contrast this with the following two facts:

• Conditional entropy H(X | Y) is monotone in the conditioning Y.
• Mutual information I(X;Y | Z) is monotone in X and Y.

### 2.2. Problem 2.14

Given two independent (discrete) random variables X and Y, the entropy of the sum X+Y is at least $\max\{H(X),H(Y)\}$. This is perhaps related to why the central limit theorem works. If you keep adding i.i.d. random variables one by one, the entropy of the sum never decreases. So aside from special cases, we can expect that the entropy would get higher and higher. High entropy of the sum alone may not tell you much about the probability distribution of the sum, but the variance and the mean of the sum is also known, and IIRC the normal distribution maximizes entropy under the constraints on the mean and variance. It does not make a proof of the central limit theorem but I think it illustrates the relation between large degrees of freedom (in statistical mechanics) and probability distributions with maximal entropy.

Once you prove that the entropy of the independent sum X+Y is at least $\max\{H(X),H(Y)\}$, you would also be able to see that the entropy of the independent modulo sum X+Y modulo something like 6 for example is also at least $\max\{H(X),H(Y)\}$. If you have a biased dice, you may rightfully expect that you can simulate a less biased dice, by simply throwing the same dice several times and summing up the numbers modulo 6, and the entropy argument may be one way to justify this expectation.

### 2.3. Problem 2.23

The problem deals with $n$ binary random variables $X_1, X_2, \dots, X_n$ related by one linear equation $X_1 + \dots + X_n = 0 \pmod 2$. The given joint probability distribution is in some sense the most fair distribution under the constraint given by the linear equation.

The results are consistent with our intuition in that:

• knowing $X_1$ tells you nothing about $X_2$
• and given that you already know $X_1$, knowing $X_2$ tells you nothing about $X_3$
• and at the last step when you know $X_1$ to $X_{n-2}$, knowing $X_{n-1}$ tells you everything about $X_n$ (because of the linear equation).

The results can be different for other probability distributions while satisfying the same linear equation. For example, if we give some biased joint probability distribution, then it’s possible that knowing $X_1$ might tell you something about $X_2$ in the sense that $I(X_1; X_2) > 0$. In fact, you can even choose a distribution so that knowing $X_1$ tells you everything about $X_2$ in the sense that $I(X_1 ; X_2) = H(X_2) = 1$. That is the case when we choose the most fair distribution under the constraint $X_1 + X_2 = 1, X_3 = X_4 = \dots = X_n = 0$.

### 2.4. Problem 2.27

It asks to prove the grouping rule for entropy. It can be verified with a simple calculation, but maybe it is instructive to solve this instead by using auxiliary random variables and using already established equalities thus far.

### 2.5. Problem 2.41

One wishes to identify a random object X. Two i.i.d. random questions $Q_1, Q_2$ about the object are asked, eliciting answers $A_1 = A(X, Q_1), A_2 = A(X, Q_2)$. The problem asks to show that the two questions are less valuable than twice a single question.

At first, one might expect incorrectly that the two questions are worth twice a single question, but what about asking millions of questions about the outcome of a single dice roll? One million questions about the outcome is worth almost the same as two million questions, since by the time you ask the million’th, you would be almost certain of what the outcome of the dice roll was and the next million questions would not contribute much. The amount of information to be extracted with questions cannot grow linearly with the number of questions indefinitely and in fact there is an upper bound, namely, $H(X)$.

Some more thoughts reveal that it is because of independence (rather than in spite of independence) that two i.i.d questions are less valuable than twice a single question. Suppose X is the random position of a specific chess piece. If the first (random) question asked which row, then the next optimal question to ask would be to ask which column. Making subsequent optimal questions require dependence to previous questions and answers and therefore independent questions are less efficient than dependent questions, as anyone who has played Twenty Questions can attest.

## flood zones in Seoul

While looking for a place to live in Seoul, it came to my mind that I should check out which parts of Seoul tend to get flooded and here is what I found. Let’s hope that this post becomes obsolete in future as the local government is taking some measures.

Several news article in 2014 reported that the Seoul government named five special regions to take care of against floods and these are regions near:

• 강남역
• 사당역
• 광화문
• 도림천
• 한강로

도림천 (a small river) and 한강로 (a road) can be located as red lines in Naver Map. Locating the rest (강남역, 사당역, 광화문) should be obvious.

A news article in 2013 named top 4 regions that get flooded again and again and these are regions near:

• 강남역
• 광화문
• 안양천
• 사당역

There is also an interactive map showing the often flooded regions in Seoul, but it requires you to select a district first: 풍수해 정보지도. It can show affected areas in each of the years 2010, 2011, 2012, 2013.

## Putting a bar or a tilde over a letter in LaTeX

As you are aware, there are commands to put a bar or a tilde over a symbol in math mode in LaTeX. Sometimes, the output doesn’t come out the way some of us might expect or want.

Fortunately, there are alternative commands that do the same task differently that we can try and there are also other ways of using the same commands.

To put a tilde over a letter, we can use either \tilde or \widetilde. As for which one to use in which situation, compiling a document with the following as its part can help comparison.

$\tilde{A}$ vs $\widetilde{A}$

$\tilde{\mathcal A}$ vs $\widetilde{\mathcal A}$

$\tilde{ABC}$ vs $\widetilde{ABC}$


To put a bar over a letter, we can use either \bar or \overline. It seems that \bar is to \overline what \tilde is to \widetilde. There don’t seem to be \overtilde or \widebar.

$\bar{A}$ vs $\overline{A}$

$\bar{\mathcal A}$ vs $\overline{\mathcal A}$

$\bar{ABC}$ vs $\overline{ABC}$


Over a symbol with a subscript or superscript index, one can imagine four ways of putting a bar or tilde over it. Among the four ways, one can say that one or two of them looks right but that may depend. I can’t comment on which ones look good. You just have to try each and decide with your colleagues.

$\tilde{A_2}$ vs $\widetilde{A_2}$ vs $\tilde{A}_2$ vs $\widetilde{A}_2$

$\bar{A_2}$ vs $\overline{A_2}$ vs $\bar{A}_2$ vs $\overline{A}_2$


## why learn elisp (Emacs Lisp)

This post is the intro module of Living with Emacs Lisp.

The goal of this post is to hopefully convince you why you should learn elisp. Readers are assumed to be beginning users of Emacs. (Of course he who does not use Emacs does not need to learn elisp)

So you are a beginning user of Emacs and let me raise a relevant question: suppose someone asked you “Why did you choose Emacs?” The longer question would be “What is the point of using Emacs in particular when there are other great text editors out there that are also cross-platform, customizable and extensible?”

You probably have your answer to the question. If you want to hear my answer, I’d say “I choose Emacs because Emacs is customizable and extensible to an extreme degree. I believe that is the defining character of Emacs. That may not mean that everybody should use Emacs but that is certainly why I will continue to use Emacs.”

I think my customization of Emacs is only to a mild degree and I may never actually customize it to an extreme degree. However, because I know that Emacs allows extreme customization, I can be sure that Emacs allows any kind of mild customization I might want later. It’s like how some columnist would feel sure that organizations like ACLU will come defend his right to say his mildly controversial speeches if needed after he has seen ACLU defend those with extreme opinions.

The extent to which Emacs can be extended to suit your needs. You can do a lot with Emacs without learning elisp: you can define keyboard macros, you can install and use Emacs packages for your needs, you can use the Customize facility to customize your Emacs and packages to some extent. That is, without learning elisp, you can already experience Emacs to be one of those editors that are cross-platform, customizable and extensible. As you learn elisp and as you start to use hooks in interesting ways, or make your own small minor modes, or even make your own Emacs package, you start to experience Emacs to be an editor that is customizable and extensible to an extreme level. Think of that as a kind of premium feature of Emacs which you unlock by paying, and you pay not with dollars, but with time: time devoted to learning elisp.

On the other hand, there is an old saying: “Time is money”. Of course we can’t and shouldn’t devote all day to learning elisp. And we wouldn’t expect that one hour of learning elisp would unlock all the power. It’s a gradual thing. A learning curve is there, but not as steep as you might expect. Do you have something in mind that you always wanted to have in your text editor? Your mission, should you choose to accept it, is to learn enough elisp to make that something happen.

Posted in Emacs | Tagged | 2 Comments

## Type backslash easily in Emacs AUCTeX

Writing LaTeX documents seem to require many presses of backslashes and the backslash key is located at awkward place and hard to type in some types of keyboards. For users of Emacs AUCTeX, there are some ways to ease this pain.

## 1. type slash to type backslash

With this elisp code added to your init file, whenever you type slash in a LaTeX buffer, backslash is inserted instead. The slash key is easier to reach.

(defun my-make-slash-backslash ()
(local-set-key (kbd "/") "\\"))



When you have to enter slash itself, which is not often, use C-q. When you have to enter many slashes, for example you could be typing a url, in this case I believe copy and paste is the simplest workaround. It is possible to extend the code so that typing backslash results in insertion of slash as well, so that is another workaround, but I find that disorienting, so I don’t use that workaround, but you may want to use that.

If you want to be able to turn on and off “slash turns into backslash” on the fly, you can write a minor mode for that.

## 2. use the two commands

Another way to enter backslashes with ease is to use C-c C-m to enter a TeX macro (things like \section{...} and \frac{..}{..}) and C-c C-e to enter a LaTeX environment (things like \begin{..} .. \end{..}`) in LaTeX buffers.

## Doob-Dynkin Lemma for probability space

Let f, g be measurable functions from a probability space $(\Omega, \mathcal F, \mathbb P)$ to measurable spaces $(X, \mathcal A)$ and $(Y, \mathcal B)$ respectively. Consider the following three conditions:
(1) $\sigma(f) \supset \sigma(g) \bmod \mathbb P$ (in other words, each element in $\sigma(g)$ is equivalent (up to null sets) to some element in $\sigma(f)$. The notation $\bmod \mathbb P$ is used when an author wants to make explicit the practice of treating sub-sigma algebras ignoring null sets.)
(2) There is a measurable $H: (X, \mathcal A) \to (Y, \mathcal B)$ with $g = H \circ f$ a.e.
(3) There is a measure preserving map $H: (X, \mathcal A, \mu) \to (Y, \mathcal B, \nu)$ with $g = H \circ f$ a.e. where $\mu, \nu$ are the pushforward measures obtained from $\mathbb P$.

(2) implies (3) trivially. (3) implies (2) trivially (even if the measure preserving map is only a.e. defined.)

(2) implies (1) trivially. The Doob-Dynkin lemma for probability space is that (1) implies (2) if $(Y, \mathcal B)$ is a standard Borel space. In this post, we present some proofs of this lemma and some corollaries and others. Compare with the Doob-Dynkin lemma for measurable spaces (the previous article). As we’ll see, Doob-Dynkin lemma for probability space is a special case of that for measurable spaces. The one for probability spaces suffices for most applications in probability theory and ergodic theory.

## 1. Proof 1

In this proof, we simply show that the Doob-Dynkin lemma for probability space follows from that for measurable spaces. It is enough to show that the condition $\sigma(f) \supset \sigma(g) \bmod \mathbb P$ can be improved to $\sigma(f) \supset \sigma(g)$ after discarding all points in some null set from $\Omega$.

For each $B \in \mathcal B$, there is $A_B \in \mathcal A$ such that the symmetric difference $E_B := g^{-1}B \cap f^{-1}A_B$ is a $\mathbb P$-null set. It would be nice to be able to discard all points in the union $\bigcup_{B \in \mathcal B} E_B$ but is this a null set? Perhaps not.

Exercise 10. Show that it is enough to discard all points in the countable union $\bigcup_{n} E_{B_n}$ where $(B_n)_n$ is a sequence that generates $\mathcal B$ (Existence of such a sequence is guaranteed by the assumption that $(Y, \mathcal B)$ is a standard Borel space).

## 2. Proof 2

In this proof, we adapt an argument in the previous article to our setting of ignoring null sets. We divide into two cases.

Case I: when the image of g countable

In this case, there is $A_i \in \mathcal A$ such that $g^{-1}y_i = f^{-1}A_i$ a.e. for each i.

Notice that $f^{-1}A_i$ form a countable partition modulo $\mathbb P$, in other words, the intersection of $f^{-1}A_i$ and $f^{-1}A_j$ is a null set when $i \ne j$ and their union over all i has measure 1, or equivalently that for a.e. $\omega$, there exists unique i for which $\omega \in f^{-1}A_i$.

Exercise 20. Show that $A_i$ form a countable partition mod $\mu$. (the general idea is that the measure algebra homomorphism $A \mapsto f^{-1}A$ behaves like an inclusion map)

Let H be any (everywhere-defined) measurable function from X to Y such that for $\mu$-a.e. x and for each i we have $x \in A_i \implies H(x) = y_i$. It is easy to define such an H.

Now it only remains to show $g = H \circ f$ a.e. For a.e. $\omega \in \Omega$, we have:
there is i for which $f(\omega)$ is in $A_i$;
for that i, $H(f(\omega)) = y_i$ (by property of H);
but that $y_i = g(\omega)$ (by how $A_i$ chosen);
so $H\circ f = g$ on this $\omega$.

Case II: when the image of g is not countable.

WLOG assume $(Y, \mathcal B)$ is the unit interval.

Let $g_n := \lfloor 2^n g \rfloor / 2^n$. There is a measurable $H_n: (X, \mathcal A) \to (Y, \mathcal B)$ with $g_n = H_n \circ f$ a.e..

Notice that $g_n \nearrow g$ holds everywhere. It would be nice to take $H = \lim H_n$ but we only know $H_n$ converges at least on $f(\Omega_0)$ for some subset $\Omega_0$ of measure 1, and we don’t know if this image is measurable.

A work around is that we let H be any (everywhere-defined) measurable such that $H(x) = \lim H_n(x)$ for all x for which the limit exists. Such H exists and with this H we can proceed to show $g = H \circ f$ a.e..

Another work around is to notice that the subset $[H_n \nearrow]$ (alternatively also $[\exists \lim H_n]$) is measurable and so one can show that its measure is 1. Then we set H to be a $\mu$-a.e. limit of $H_n$. Then $H \circ f$ is a P-a.e. limit of $H_n \circ f$. So we can proceed with this H too.

## 3. Proof 3

In this proof, we show that (1) implies (3) using the fact that $(Y, \mathcal B, \nu)$ is a Lebesgue space (which follows because $(Y, \mathcal B)$ is a standard Borel space) and the fact that a Lebesgue space can be approximated by a sequence of partitions. This proof is probably equivalent to the previous one, but let’s do it for illustration.

The idea in this proof is that points in X (resp. Y) more or less should correspond to appropriate nested sequence of elements in $\mathcal A$ (resp. $\mathcal B$), so in order to build H, we only need establish an appropriate procedure to transform an appropriate nested sequence of elements in $\mathcal A$ to that of $\mathcal B$, but that is done for free by observing that with abuse of notation we may safely pretend that $\mathcal B \subset \mathcal A \subset \mathcal F \bmod \mathbb P$ (if we identify $\sigma(f), \sigma(g)$ with $\mathcal A, \mathcal B$). We’ll proceed without abuse of notation.

Notice that WLOG we may replace $(X, \mathcal A, \mu)$ and $(Y, \mathcal B, \nu)$ with any other almost-isomorphic probability spaces at the minor cost of losing everywhere definedness of f, g.

WLOG there is a non-decreasing sequence of countable partitions $\beta_n \subset B$ such that for any non-increasing choice $B_n \in \beta_n$ the intersection $\cap_n B_n$ is a single point and that $(\beta_n)_n$ generates $\mathcal B$. This is possible because we may assume that the probability space Y is a disjoint union of the Cantor space and atoms, which follows from the result that any Lebesgue space is almost-isomorphic to a disjoint union of an interval and atoms).

Now we want to pick a good sequence $\alpha_n$ such that for all n we have $g^{-1}(\beta_n) = f^{-1}(\alpha_n) \bmod \mathbb P$. We can assume that for each n all elements of $\beta_n$ (and hence also those of $g^{-1}(\beta_n)$ too) have positive measure. So we can ensure that all elements of $\alpha_n$ have positive measure. For each n, $\alpha_n$ is a countable partition modulo $\mu$ and the sequence $(\alpha_n)_n$ is non-decreasing modulo $\mu$. After discarding only countably many $\mu$-null sets from X, we can ensure that $\alpha_n$ form a non-decreasing sequence of countable partitions. The reason we ensure elements of $\alpha_n$ and $\beta_n$ only have elements of positive measure is because we want to ensure the natural bijection between $\alpha_n$ and $\beta_n$.

For each $x \in X$, let $\alpha_n(x)$ be the unique element in $\alpha_n$ containing x, and let $\beta_n(x)$ be the unique element in $\beta_n$ which corresponds to $\alpha_n(x)$ in the sense that $g^{-1}(\beta_n(x)) = f^{-1}(\alpha_n(x)) \mod \mathbb P$. Define H by $\{H(x)\} = \cap_n \beta_n(x)$ which is a well defined map from X to Y (which is because $\beta_n(x)$ is non-decreasing (for a fixed x) because its counterpart in $\Omega$ is non-decreasing up to null sets because its counterpart in Y is.).

(Measurability of H) For each $B_n \in \beta_n$, it is easy to show that $H^{-1}(B_n)$ is measurable, moreover, equal to $A_n$ (where $A_n \in \alpha_n$ is the element corresponding to $B_n$).

(On $g = H \circ f$ a.e.) Notice that H is measurable and everywhere defined, so $g' := H\circ f$ is an almost-everywhere defined measurable function. So we are comparing two a.e. defined measurable functions from $(\Omega, \mathcal F, \mathbb P)$ to $(Y, \mathcal B)$. Inverse images of $B_n \in \beta_n$ under the two functions are equal up to null sets because of $H^{-1}(B_n) = A_n$. Now notice that the set $[g \neq g']$ is a subset of $\bigcup_{n, B, B': B, B' \in \beta_n, B \neq B'} (g^{-1}B \cap g'^{-1}B')$ where the terms in the latter are null sets because of the previous sentence. We have shown $g = g'$ a.e..

(On measure preserving) It is easy to show that H is measure preserving from the fact that $g = g'$ a.e.. Alternatively, it also follows from $H^{-1}(B_n) = A_n$.

## 4. Uniqueness of H

If H and H’ satisfy (2) , then $H = H'$ holds $\mu$-a.e.. This follows from the following two observations:

• $H \circ f = H' \circ f$ holds $\mathbb P$-a.e.
• $[H = H']$ is measurable (because $(Y, \mathcal B)$ is a standard Borel space).

## 5. applications

Exercise 25. Let f, g be measurable functions from a probability space $(\Omega, \mathcal F, \mathbb P)$ to measurable spaces $(X, \mathcal A)$ and $(Y, \mathcal B)$ respectively. Show that $\sigma(f) = \sigma(g) \bmod \mathbb P$ if and only if there are $X_0 \in \mathcal A$ with $\mu(X_0) = 1$ and $Y_0 \in \mathcal B$ with $\nu(Y_0) = 1$ and a bi-measurable bijection $H: X_0 \to Y_0$ such that $g = H \circ f$ a.e..

Exercise 30. Given two measure-theoretic dynamical systems $(X, \mathcal A, \mu, T)$ and $(Y, \mathcal B, \nu, S)$ (measure preserving transformations on Lebesgue spaces), show that the latter is a factor of the former if and only if there is a joining $\lambda$ (a $T\times S$-invariant probability measure on $(X \times Y)$) such that $\mathcal B \subset \mathcal A$ mod $\lambda$ (with abuse of notation).