How to use Unicode in LaTeX (by LuaTeX or XeTeX)

The goal of this post is to gradually build up minimal examples for making Unicode text work in LaTeX documents by using LuaTeX or XeTeX. By the end, we will have produced PDF files containing CJK text (Chinese, Japanese, and Korean, that is, hangul, text).

1. before we begin

To make things simple, I will assume your goal is to be able to write LaTeX documents with mainly Korean text in them.

Before we begin, make sure of the following four things, in this order:

  1. Make sure that your web browser is able to display Korean text.
  2. Make sure you know or learn a way to type Korean text on your computer.
  3. Make sure your text editor can display Korean text.
  4. Make sure you know how to save your TeX document as a UTF-8 text file.

For 1 and 2, googling will help you. For 3 and 4, see the manual for your editor. Whatever you did to make sure of 1 may have a side effect of automatically making sure of 3.

2. LuaLaTeX vs XeLaTeX and what are they?

LuaTeX and XeTeX are alternative TeX engines, and both are designed to work with Unicode text. Invoking LuaLaTeX or XeLaTeX just means invoking LuaTeX or XeTeX with the LaTeX format. This post contains examples to be tried in both, but if you want to be persuaded to choose just one, see Why choose LuaLaTeX over XeLaTeX.
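
Since the examples below are meant to be tried with both engines, it can help to confirm which one actually ran. The following is a minimal sketch, assuming the iftex package is installed (it is not used elsewhere in this post). It prints a different sentence depending on the engine.

```latex
\documentclass{article}
\usepackage{iftex} % assumption: provides \ifLuaTeX and \ifXeTeX

\begin{document}
\ifLuaTeX
  This document was compiled with LuaLaTeX.
\else\ifXeTeX
  This document was compiled with XeLaTeX.
\else
  This document was compiled with some other engine.
\fi\fi
\end{document}
```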

3. first example

Write the following LaTeX document:

\documentclass{article}

\begin{document}
\section{ASCII English}
Hello world.
\section{European}
¡Hola!, Grüß Gott, Hyvää päivää, Tere õhtust, Bonġu
          Cześć!, Dobrý den, Здравствуйте!, Γειά σας, გამარჯობა
\section{CJK}
(Chinese) 你好, 早晨, (Japanese)こんにちは, (Korean, hangul) 안녕하세요

\end{document}

Save it as first-example.tex and make sure it is saved in UTF-8 encoding. Then compile it with XeLaTeX or LuaLaTeX. The document contains many ways of saying hello in different languages which I copied from the HELLO file that is shipped with GNU Emacs.

Compilation should finish fine without errors and it should produce a PDF file. If you are using TeXworks & MiKTeX on MS Windows, compiling a document with XeLaTeX is as simple as choosing XeLaTeX from the engine list (next to the compile button on the TeXworks window) and then pressing the compile button. Compiling with LuaLaTeX is similar. If TeXworks does not remember your choice next time you open the same file, add the following line to the beginning of the document file.

% !TEX program = LuaLaTeX

If you are using Emacs AUCTeX, start by setting the file local variable TeX-engine to luatex.
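
As a sketch of what that file local variable can look like, Emacs reads a block of this form when it sits near the end of the .tex file. The mode line is optional and is included here only as an assumption about a typical setup. Use xetex instead of luatex to select XeTeX.

```latex
\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-engine: luatex
%%% End:
```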

The produced PDF output may have some missing letters. In particular, the CJK text is not displayed in the output.

Now add the following line to the preamble, that is, the part between \documentclass and \begin{document}:

\usepackage{fontspec}

Then save the document and compile with XeLaTeX or LuaLaTeX again. Compilation should finish without errors. Now the output will have fewer missing letters, but the CJK text is still not shown. According to my test, with XeLaTeX the missing letters are displayed as spaces, but with LuaLaTeX they are simply gone.
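
Because the two engines render missing glyphs differently, the log file is a more dependable place to look. A minimal sketch, assuming a reasonably recent TeX distribution: setting \tracinglostchars in the preamble asks the engine to report each character that the current font lacks as a "Missing character" line. With the value 2 the message should also appear on the terminal, though the exact behavior may vary with the age of the installation.

```latex
\documentclass{article}
\usepackage{fontspec}
% Report characters the current font cannot supply ("Missing character" lines).
\tracinglostchars=2

\begin{document}
안녕하세요
\end{document}
```

After compiling, searching the .log file for "Missing character" should list the characters that were dropped.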

Now add the following line to the preamble and try again (we are specifying a font now):

\setmainfont{Times New Roman}

The produced output will have even fewer missing letters, but the CJK text is still not shown. According to my test (done on MS Windows), with XeLaTeX the missing letters are now displayed as boxes. The Times New Roman font does not support CJK text.

Now we change that line to the following (assuming that you have the Batang font on your computer, which will already be the case if you have enabled Korean language support on MS Windows):

\setmainfont{Batang}

Now the output shows most of the CJK characters. While the Chinese hello text is still missing one character, the Japanese and Korean hellos are now displayed completely.

The final document:

\documentclass{article}
\usepackage{fontspec}
\setmainfont{Batang}

\begin{document}
\section{ASCII English}
Hello world.
\section{European}
¡Hola!, Grüß Gott, Hyvää päivää, Tere õhtust, Bonġu
          Cześć!, Dobrý den, Здравствуйте!, Γειά σας, გამარჯობა
\section{CJK}
(Chinese) 你好, 早晨, (Japanese)こんにちは, (Korean, hangul) 안녕하세요

\end{document}

We loaded the fontspec package and then used the \setmainfont command (from the fontspec package) to choose Batang as the main font.

4. how to experiment with adjusting font features

Now you know how to include Korean text in LaTeX documents. There are still two problems left to solve, but before I mention them, let’s see how to change the font for just part of the text. For example, you can write this:

{\fontspec{Batang} hello world}

and that applies the Batang font to just the text “hello world”. That means that if you write:

one {\fontspec{Batang} two three} four

“two three” will show in Batang font, while “one” and “four” will show in whatever is the main font that is set by the \setmainfont command in the preamble. Now I can demonstrate the first of the two problems: what would the following do?

{\fontspec{Batang} When he goes---``How are you, Alice'', she replies-``I am fine and you?''}

You might expect to see an em dash and curly double quotes in the output, but that does not happen: without further instruction, the input sequences --- and `` are typeset literally as hyphens and backquotes. To make that happen, you can either type the Unicode em dash and Unicode double quotes directly, or add a font feature like this:

{\fontspec[Ligatures=TeX]{Batang} When he goes---``How are you, Alice'', she replies-``I am fine and you?''}

If you want to apply the font feature Ligatures=TeX to the whole document, you can use

\setmainfont[Ligatures=TeX]{Batang}

instead of

\setmainfont{Batang}
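
If several fonts are declared, repeating the feature on each one becomes tedious. As a sketch of an alternative, fontspec's \defaultfontfeatures command sets features once for every font selected after it. The preamble below is only an illustration of that idea.

```latex
\usepackage{fontspec}
% Apply TeX-style ligatures (---, ``, '') to every font selected afterwards.
\defaultfontfeatures{Ligatures=TeX}
\setmainfont{Batang}
```

With this in place, a later {\fontspec{Batang} ...} switch should pick up the same feature without naming it again.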

The second problem: what would the following do?

{\fontspec{Batang} 월남쌈 (goi cuon) \emph{월남쌈 (goi cuon)}}

You might expect the emph part to be displayed in some italic font, but that does not happen. The emph part is not even displayed in a different shape, let alone an italic shape. Even the English ASCII portion of the emph part is not in italic. Bold face is not working either:

{\fontspec{Batang} 월남쌈 (goi cuon) \emph{월남쌈 (goi cuon)} \textbf{월남쌈 (goi cuon)}}

How can we solve this problem? Let’s experiment further with the \fontspec command. What if we try a different Korean font? How about this:

{\fontspec{Malgun Gothic} 월남쌈 (goi cuon) \emph{월남쌈 (goi cuon)} \textbf{월남쌈 (goi cuon)}}

With Malgun Gothic, now bold face works but emph is still not distinguished. To make emph look distinguished, you can add a font feature like this:

{\fontspec[ItalicFont={Malgun Gothic Bold}]{Malgun Gothic} 월남쌈 (goi cuon) \emph{월남쌈 (goi cuon)} \textbf{월남쌈 (goi cuon)}}

That makes both the emph part and the textbf part display in bold face, so there is no visual distinction between the two. To fix that, one can add a font feature specifying that the italic face be displayed with a different Korean font, for example, the Gungsuh font:

{\fontspec[ItalicFont={Gungsuh}]{Malgun Gothic} 월남쌈 (goi cuon) \emph{월남쌈 (goi cuon)} \textbf{월남쌈 (goi cuon)}}
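
Another thing one could try, if a second typeface feels too heavy-handed, is to synthesize a slanted face from the same font. This is only a sketch, assuming fontspec's FakeSlant feature and an arbitrary slant factor of 0.2. Artificially slanted hangul may not be to everyone's taste.

```latex
{\fontspec[ItalicFont={Malgun Gothic},ItalicFeatures={FakeSlant=0.2}]{Malgun Gothic} 월남쌈 (goi cuon) \emph{월남쌈 (goi cuon)} \textbf{월남쌈 (goi cuon)}}
```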

So we now have ideas on what font features to add because of our experiments using the \fontspec command. To add these font features together and apply them to the whole document, you can use

\setmainfont[ItalicFont={Gungsuh},Ligatures=TeX]{Malgun Gothic}
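
The ad hoc \fontspec switches used in these experiments can also be given a name. For a font that is used repeatedly, fontspec offers \newfontfamily, which defines a named switch once in the preamble. A minimal sketch, where the command name \gungsuhfont is a hypothetical choice made for this example:

```latex
\documentclass{article}
\usepackage{fontspec}
\setmainfont[ItalicFont={Gungsuh},Ligatures=TeX]{Malgun Gothic}
% \gungsuhfont is a name invented for this example.
\newfontfamily\gungsuhfont[Ligatures=TeX]{Gungsuh}

\begin{document}
one {\gungsuhfont two three} four
\end{document}
```

Here “two three” should appear in Gungsuh, while “one” and “four” stay in the main font.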

5. in summary

Assuming you are writing a document with mostly Korean text, we found that something like the following template is a good start:

\documentclass{article}
\usepackage{fontspec}
\setmainfont[ItalicFont={Gungsuh},Ligatures=TeX]{Malgun Gothic}

\begin{document}

\section{월남쌈}
월남쌈 (goi cuon) \emph{월남쌈 (goi cuon)} \textbf{월남쌈 (goi cuon)}

\section{English conversation}
When he goes---``How are you, Alice'', she replies-``I am fine and you?''

\section{돌침대}
\begin{verse}
이게 바로 내가 찾던 남자 돌침대\\
이게 바로 내가 찾던 남자 돌침대\\
이게 바로 내가 찾던 남자 돌침대\\
\end{verse}

\section{수식}
피타고라스
\[ a^2 + b^2 = c^2 \]

\end{document}
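
The template assumes that Malgun Gothic and Gungsuh are installed. As a hedged sketch, a recent fontspec provides \IfFontExistsTF, which can be used to fall back to another font when the preferred one is missing. The choice of Batang as the fallback is just an assumption for illustration.

```latex
\documentclass{article}
\usepackage{fontspec}
% Use Malgun Gothic when it is installed; otherwise fall back to Batang.
\IfFontExistsTF{Malgun Gothic}
  {\setmainfont[ItalicFont={Gungsuh},Ligatures=TeX]{Malgun Gothic}}
  {\setmainfont[Ligatures=TeX]{Batang}}

\begin{document}
월남쌈 (goi cuon) \emph{월남쌈 (goi cuon)}
\end{document}
```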

6. how to change names

When you compile:

\begin{document}
\maketitle
\tableofcontents
...

you will see the name “Contents” displayed. If you want to change the name to something like “목차”, you can add the following line in the preamble:

\renewcommand*{\contentsname}{목차}

To change other names too, see the relevant table in l2tabu for the list of commands for names like References, Abstract, Bibliography, Figure, Table, …
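
As a sketch of what a few more of these redefinitions might look like in the article class, the lines below rename the figure, table, abstract and references labels. The Korean wording is my own suggestion rather than a fixed convention, and \refname applies to article, whereas report and book use \bibname instead.

```latex
\renewcommand*{\contentsname}{목차}   % Contents
\renewcommand*{\figurename}{그림}     % Figure
\renewcommand*{\tablename}{표}        % Table
\renewcommand*{\abstractname}{초록}   % Abstract
\renewcommand*{\refname}{참고문헌}    % References
```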

7. some differences with LuaLaTeX

I noticed two things about LuaLaTeX. First, compiling the above document with LuaLaTeX takes about twice as long as compiling it with XeLaTeX. Second, only XeLaTeX accepts font names written in Korean; for example, \fontspec{맑은 고딕} will not work with LuaLaTeX. So it is good to know the English names of Korean fonts: 맑은 고딕 = Malgun Gothic, 바탕 = Batang, 궁서 = Gungsuh.
