The uneasy digital life of CJK Chinese characters

The font problem

When I constructed this site in 2017, the first problem I needed to solve was how to correctly display CJK characters from different CJK languages on webpages. CJK languages here means all kinds of Chinese, Japanese, and Korean languages and dialects. I was not able to find a good solution back then. Now that I have more time after finishing coursework and have become more technologically literate, I have finally found a solution to this problem.

Demonstration of problem

CJK characters like 判, 所, and 画 have different forms in different CJK languages but each occupy a single code point in Unicode, so they need the right font to display the difference correctly. For example, if a Japanese font is applied to a segment of Chinese text, which is quite likely when a Japanese reader browses a Chinese website, these characters will appear in their Japanese form, not the intended Chinese one. This happens because the reader’s computer probably doesn’t have a Chinese font installed, or because the system-default Chinese font (e.g. SimSun) has a lower fallback priority than Japanese fonts on that computer.

判 (“judge”) is a CJK unified ideograph (U+5224) in Unicode. It corresponds to three variations (or forms) in Chinese, Japanese, and Korean:

Left-to-right: Simplified Chinese, Japanese, Korean.
Font: Noto Sans CJK

What we want to see: If we write some text in one of the CJK languages, the characters in that text should be displayed by a font of that language and therefore have the correct form. What the reader can see should be under the author’s control.

What usually happens: If we just type a character without specifying its language and font, since there is no way for the reader’s web browser to know which language this character belongs to, it will be displayed in one of these three forms depending on the font setting of the reader’s computer and browser, which is not what we want. We want the character to be correctly displayed regardless of the reader’s platform. What the reader can see is out of the author’s control.

A solution

Now let’s try to show the 判 character on this webpage in simplified Chinese, Japanese, and Korean, using Google’s Noto Sans CJK font. I chose this font because it is open-source, free to use, and has a unified style for all CJK languages. Google claims that Noto (from “no more tofu”, □) “aims to support all languages with a harmonious look and feel.” (Source) Unicode plus a unified font, very promising! This is a small test to see if this goal is achieved for the CJK languages.

My solution consists of these three steps:

  1. Font-embedding: embed fonts on web pages to avoid the influence of local font environment.
  2. CSS configuration: use the lang attribute of HTML elements (or the :lang pseudo-class) to specify the desired language and use CSS selectors to map this element to the desired font families.
  3. Display CJK characters correctly: add a specific lang attribute to an HTML element; the text within this element will then be displayed in the correct form required by that language.

This site uses the following HTML and CSS settings to realize the above steps.

Step 1 (in HTML file):

<head>
    ...
    <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Noto+Sans+SC|Noto+Sans+TC|Noto+Sans+JP|Noto+Sans+KR">
    ...
</head>

Step 2 (in CSS/SCSS file):

// Font stack
$font-serif: "PT Serif", Georgia, "Times New Roman", "Noto Sans SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Zen Hei", "MS Pゴシック", serif;
$font-sans:  "PT Sans", Helvetica, Arial, "Noto Sans SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Zen Hei", sans-serif;
$font-font-awesome: FontAwesome;
// For Chinese and Japanese characters
$font-Chinese-sans: "PT Sans", Helvetica, Arial, "Noto Sans SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Zen Hei", sans-serif;
$font-Japanese-sans: "PT Sans", Helvetica, Arial, "Noto Sans JP", "MS Pゴシック", sans-serif;
$font-Korean-sans: "PT Sans", Helvetica, Arial, "Noto Sans KR", "돋움", dotum,"산돌 고딕", sans-serif;

// Selectors
*[lang|="zh"] {
  font-family: $font-Chinese-sans;
}

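// `ja` and `ko` are the BCP 47 language subtags for Japanese and Korean;
// `jp` and `kr` are country codes, not language codes.
*[lang|="ja"] {
  font-family: $font-Japanese-sans;
}

*[lang|="ko"] {
  font-family: $font-Korean-sans;
}

Step 3 (the result):

...
    <div class="pic-column" lang="zh-cn"></div>
    <div class="pic-column" lang="ja"></div>
    <div class="pic-column" lang="ko"></div>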
...

The result is:

Left-to-right: Simplified Chinese, Japanese, Korean.

I am not 100% sure what you will see. What I saw on my screen is:

Left-to-right: Simplified Chinese (Noto Sans SC), Japanese (Noto Sans JP), Korean (Dotum).

Not perfect. The Korean hanja character used the Dotum font instead of Noto Sans KR, but at least the forms are linguistically correct. The author’s control over the display of CJK characters is now on the same level as for English or other Latin-script text, which may not appear in the desired font either. The hanja is not displayed in Noto Sans KR because the embedded font downloaded from the Google Fonts API is a reduced Noto Sans KR that does not include hanja. This problem can be solved by hosting your own Noto Sans KR font file on the server instead of downloading it from Google.
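
A self-hosted font could be declared roughly as follows. This is only a sketch under assumptions: the file path and the family name “Noto Sans KR Full” are hypothetical placeholders, and it assumes the Korean text is tagged with lang="ko". A separate family name is used so that the declaration does not collide with the reduced “Noto Sans KR” served by Google Fonts.

```css
/* Hypothetical: a complete Noto Sans KR file (including hanja) hosted on your own server. */
@font-face {
  font-family: "Noto Sans KR Full";  /* hypothetical family name */
  src: url("/assets/fonts/NotoSansKR-Regular.woff2") format("woff2");  /* hypothetical path */
  font-weight: 400;
  font-style: normal;
  font-display: swap;  /* show fallback text while the font file loads */
}

/* Put the self-hosted family ahead of the system Korean fonts. */
[lang|="ko"] {
  font-family: "PT Sans", Helvetica, Arial, "Noto Sans KR Full", "Noto Sans KR", sans-serif;
}
```

Complete CJK font files tend to be large, so it may be worth subsetting the file to the characters the site actually uses.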

You may have noticed that the lang attribute is now tied to the sans-serif font. How about using the serif font? There is a way to manually set a serif or sans-serif font for CJK text, or to let the CJK font follow the parent element’s choice of sans or serif, but that is beyond the scope of this article.
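
Returning to Step 2, the same mapping could also be written with the :lang() pseudo-class mentioned there. The following is a minimal sketch, not this site’s actual stylesheet: it assumes the text is tagged with the standard language tags zh, ja, and ko, and it uses shortened font stacks for brevity. The rule for zh-TW and zh-Hant is an illustrative addition that makes use of the Noto Sans TC font loaded in Step 1.

```css
/* :lang(zh) matches zh as well as zh-CN, zh-TW, and so on. */
:lang(zh) {
  font-family: "PT Sans", "Noto Sans SC", sans-serif;
}

/* Placed after the general zh rule so that it wins for Traditional Chinese. */
:lang(zh-TW),
:lang(zh-Hant) {
  font-family: "PT Sans", "Noto Sans TC", sans-serif;
}

:lang(ja) {
  font-family: "PT Sans", "Noto Sans JP", sans-serif;
}

:lang(ko) {
  font-family: "PT Sans", "Noto Sans KR", sans-serif;
}
```

Unlike an attribute selector, :lang() matches every element whose content language is the given one, including descendants that inherit the language from an ancestor’s lang attribute. The rule therefore still competes directly when a nested element has its own font-family declaration.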

Where did the uneasiness come from?

Now the problem seems to have been (at least partially) solved. But is a technical solution the final answer? Do we need to be technologically savvy in order to write and read correctly online?

Many people used pictures to ensure the correct display of CJK characters when the presence of the CJK languages in the digital world was even thinner than it is today. An illustrative example is how Donald E. Knuth shows his Chinese name on his personal website. Knuth, the author of TeX, the typesetting system that became the foundation of today’s widely used LaTeX document preparation system, had to use pictures instead of typeset characters to ensure that every visitor to his homepage would see the characters correctly. And he lamented this situation right after he used the pictures.

Knuth's way of showing his Chinese name.
Snapshots from his site's homepage and FAQ page on May 26th, 1997.

Thanks to the Internet Archive’s Wayback Machine, we are able to peek at what the digital world looked like in the 1990s. As Knuth wrote in 1997, HTML had yet to support non-Latin characters back then. Around October 2005, Knuth was finally able to include the typeset form of his Chinese name on his site’s FAQ page. As of now (September 2020), although most computers can display these characters, he has left the pictures on the homepage untouched, probably out of concern for compatibility.

This is a perfect reflection of the CJK languages’ difficulty with the digital world, a difficulty that is inherited from the printing age.

Printed and handwritten forms of Chinese characters coexisted in books produced by type printing. When Joseph Needham’s Science and Civilisation in China was first published in 1954, it used movable type to print Chinese characters. But handwritten forms did not disappear from printed books throughout the age of type printing. For example, Toshio George Tsukahira’s Feudal Control in Tokugawa Japan: The Sankin Kōtai System used handwritten characters.

This phenomenon is not unique to CJK characters. For some languages, such as Jurchen, using handwritten forms was the only possible solution until very recently. An example is Jin Qicong’s (金啓孮) Jurchen dictionary 女真文辞典 (1984).

Tsukahira

Jin

First there was no solution, then some inconvenient and costly solutions; at last the solution became common and low-cost, but problems remained. It seems that printing or displaying CJK characters has never been made as easy as printing or displaying the English alphabet.

The consequence of unicode

The idea of having a unified character set for Chinese, Japanese, and Korean was meant to solve the problem of encoding and displaying characters. But it became the origin of the font display problem identified in this article. This was probably an unexpected result for those who participated in the creation of Unicode. Maybe some of them expected it but were willing to sacrifice some accuracy and correctness, a minor loss compared to the bigger problem of not being able to encode and display CJK characters together.

Standardization of border and techno-linguistic infrastructure?

Standardization and encoding.

Encoding means a process of de-picturization. Also an action of separating style from content?

Technological universalism

The standardization and encoding of characters should be examined in the context of border formation in modern East Asia. Borders are not only physical. Identifying these borders would be an indispensable step for understanding what modernity means for East Asia and the rest of the world.

There were no clear borderlines between the Chinese characters used by different East Asian societies before modern times.

The future (and conclusion)

LaTeX still has problems with CJK characters. The unease won’t disappear in the future. The problem brought by Unicode is comparable to that of the “return” of China in an increasingly interconnected East and Southeast Asia. Reintegration after borderline drawing would be painful, just like globalization in the age of the sovereign nation-state.
