Adding hand-drawn font for Chinese, Japanese and Korean

Published November 7, 2024
by Mrazator

A hand-drawn font for Chinese, Japanese, and Korean (CJK) has long been on our radar as one of the most requested features. In the meantime, many have tried forking or extending Excalidraw with Chinese fonts, but the experience never really felt right.

These attempts usually ran into major issues, to name a few:

Rendering inconsistencies between browsers and operating systems, due to relying on system fonts
Vertical layout shift, due to different baselines for CJK characters
Horizontal layout shift, due to different wrapping rules for CJK text
Unloaded fonts in exported SVGs, due to CORS restrictions
General performance issues, due to the large sizes of CJK fonts

We've realized this isn't merely about adding a new font. Instead, we aimed to provide a base CJK font that complements Excalifont and feels like a first-class citizen.

Meet Xiaolai「小赖字体」

We’ve decided to add the CJK font as a fallback to Excalifont so that Latin and CJK characters can be combined in one sentence. As a fallback, it also inherits Excalifont's baseline, which solves the "Vertical layout shift" issue mentioned above.
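
To make the fallback idea concrete: both CSS and the Canvas API accept an ordered list of font families, and for each character the first family that provides a glyph is used. The sketch below builds such a font string. The helper name `buildFontString` and the registered family name "Xiaolai" are assumptions for illustration, not necessarily what Excalidraw uses internally.

```typescript
// Hypothetical helper: builds a CSS/canvas font shorthand with ordered fallbacks.
function buildFontString(fontSizePx: number, families: string[]): string {
  const list = families
    // Family names containing whitespace are safest when quoted.
    .map((family) => (/\s/.test(family) ? `"${family}"` : family))
    .join(", ");
  return `${fontSizePx}px ${list}`;
}

// Latin glyphs resolve to the first family; CJK glyphs fall through to the second.
const exampleFont = buildFontString(20, ["Excalifont", "Xiaolai"]);
// In a browser, this string could be assigned to `ctx.font`
// of a CanvasRenderingContext2D before measuring or drawing text.
```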

The font we've chosen is Xiaolai, which defines glyphs for over 40,000 codepoints, with an uncompressed file size (TTF) of about 22 MB. It includes characters for Simplified & Traditional Chinese (Han), Japanese (Katakana, Hiragana), and Korean (Hangul). Interestingly, this enormous coverage is also thanks to some characters being AI-generated. In contrast, Excalifont supports about 100 alphabetic languages and is about 5% of this size. Clearly, we couldn’t just preload this font for everyone.

Font splitting and lazy loading

The solution lies in font splitting, which allows us to break the huge font into multiple font faces. The challenging part is splitting the font into chunks of a similar size without losing any advanced data while grouping the most often-used codepoints together. Happily, there is a brilliant open-source project that does just that: cn-font-split.

It comes with heuristics for splitting the font into meaningful chunks and instructs the open-source text shaping library harfbuzzjs to perform glyph subsetting based on the established codepoint ranges. The resulting chunks are converted into the WOFF2 format, whose lossless compression reduces the size by around 50%. Essentially, this is what Google Fonts does with fonts like Noto Sans CJK.

Google is likely relying on the very same subsetting logic, as the subsetting was integrated into Harfbuzz in collaboration with the Google Fonts project. Harfbuzz is also the only library we've found that can effectively subset advanced font data inside the browser, including complex kerning and ligatures defined in the GPOS and GSUB font tables.

This results in 209 chunks, each around 50-70 kB in size. The chunks are then fed into the FontFace API (see below) and added to window.document.fonts, which essentially allows us to register any font dynamically, possibly even custom fonts in the future.

import _0 from "./Xiaolai-Regular-09850c4077f3fffe707905872e0e2460.woff2";
import _1 from "./Xiaolai-Regular-7eb9fffd1aa890d07d0f88cc82e6cfe4.woff2";
// ...
import _208 from "./Xiaolai-Regular-2b7441d46298788ac94e610ffcc709b6.woff2";

// Each FontFace is described by an url and a unicode range of contained codepoints
export const XiaolaiFontFaces: ExcalidrawFontFaceDescriptor[] = [
  {
    uri: _0,
    descriptors: {
      unicodeRange:
        "U+f9b8-fa6d,U+fe32,U+fe45-fe4f,U+ff02-ff0b,U+ff0d-ff1e,U+ff20-ff2a",
    },
  },
  {
    uri: _1,
    descriptors: {
      unicodeRange:
        "U+20dd-20de,U+25ef,U+ff2b-ffbe,U+ffc2-ffc7,U+ffca-ffcf,U+ffd2-ffd7,U+ffda-ffdc,U+ffe0-ffe6,U+ffe8-ffee",
    },
  },
  // ...
  {
    uri: _208,
    descriptors: {
      unicodeRange:
        "U+7e2b-7e3a,U+7e3c-7e40,U+7e42-7e46,U+7e48-7e81,U+7e83-7e9a,U+7e9c-7e9e,U+7eae,U+7eb4,U+7ebb-7ebc,U+7ed6,U+7ee4,U+7eec,U+7ef9,U+7f0a,U+7f10,U+7f1e,U+7f37,U+7f39,U+7f3b",
    },
  },
];

Each such font face can then be uploaded to our CDN and lazy-loaded based on the scene content. Browsers automatically lazy-load registered font faces based on the characters that need to be rendered in HTML or through the Canvas API. However, some (looking at you, Safari) need a little push. Because of Safari, we manually lazy-load the necessary font faces based on the text elements during scene initialization or when pasting text.
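
The manual lazy-loading step can be sketched as follows: given the text of a scene, find the font faces whose unicodeRange covers at least one of its characters, and load only those. This is a minimal sketch rather than Excalidraw's actual implementation. `parseUnicodeRange` and `fontFacesForText` are hypothetical names, the descriptor shape mirrors the snippet above, and wildcard ranges such as U+4?? are not handled.

```typescript
interface FontFaceDescriptor {
  uri: string;
  descriptors: { unicodeRange: string };
}

type CodepointRange = [number, number];

// Parses e.g. "U+20dd-20de,U+25ef" into [[0x20dd, 0x20de], [0x25ef, 0x25ef]].
function parseUnicodeRange(unicodeRange: string): CodepointRange[] {
  return unicodeRange.split(",").map((part): CodepointRange => {
    const [start, end = start] = part.trim().replace(/^U\+/i, "").split("-");
    return [parseInt(start, 16), parseInt(end, 16)];
  });
}

// Returns the faces that cover at least one codepoint of `text`.
function fontFacesForText(
  text: string,
  faces: FontFaceDescriptor[],
): FontFaceDescriptor[] {
  // Array.from iterates by codepoint, so surrogate pairs stay intact.
  const codepoints = Array.from(
    new Set(Array.from(text, (char) => char.codePointAt(0)!)),
  );
  return faces.filter((face) => {
    const ranges = parseUnicodeRange(face.descriptors.unicodeRange);
    return codepoints.some((cp) =>
      ranges.some(([start, end]) => cp >= start && cp <= end),
    );
  });
}

// In a browser, each returned face could then be registered and loaded, e.g.:
//   const fontFace = new FontFace("Xiaolai", `url(${face.uri})`, face.descriptors);
//   document.fonts.add(fontFace);
//   await fontFace.load();
// (The family name "Xiaolai" is an assumption here.)
```

Parsing the ranges on every call keeps the sketch short; a real implementation would likely parse each range once and cache the result.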

Concurrent glyph subsetting during SVG export

Treating the CJK font as a first-class citizen means it should work similarly to other fonts during export. Recently we’ve started embedding the font faces into the exported SVGs so that they could be loaded in CORS-restricted environments like Mail, PowerPoint, GitHub, Notion, etc.

Moreover, we’ve essentially shipped client-side glyph subsetting of the embedded font faces (you've guessed it, based on Harfbuzz), so that they contain only glyphs for characters that are used in the exported content.

This helped us reduce the size of each font face by up to 95%!

Since recently we've started inlining fonts in SVGs for better embedding support. The downside was larger file sizes.

We've now addressed this by encoding only the glyphs you're exporting! Via @mrazator

— Excalidraw (@excalidraw) September 2, 2024

However, for large CJK scenes, subsetting more than 200 chunks at a time keeps the main thread pretty busy, as each font face needs to be:

fetched from the Service Worker cache (usually fast, unless it would have to go to the CDN)
decompressed from WOFF2 to TTF or OTF buffer (depending on the font)
subsetted based on the used codepoints with Harfbuzz
compressed back into WOFF2

Surprisingly, the bottleneck here is the WOFF2 decompression and compression, rather than the complex glyph subsetting, taking up to 80% of the computation.

Therefore we’ve decided to offload this whole process to a pool of three concurrently running Web Workers, off the main thread. Font buffers are a natural fit for workers, as they can be transferred between threads without creating a deep copy. The workers introduced two lazy-loadable chunks: one for the tiny worker logic and a shared one for the whole subsetting logic, including the embedded woff2 and harfbuzz WebAssembly modules and their respective JavaScript bindings.

As a result, the whole process is non-blocking and results in up to 3x faster export times!
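
The pooling idea itself is independent of fonts and can be sketched in a few lines. The following is a simplified illustration, not Excalidraw's worker pool: `mapWithPool` is a hypothetical helper, and the `task` callback stands in for sending one font face to a Web Worker and awaiting the subsetted result. At most `poolSize` tasks are in flight at any time.

```typescript
// Runs `task` over all items with at most `poolSize` tasks in flight,
// returning the results in the original item order.
async function mapWithPool<T, R>(
  items: T[],
  poolSize: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner plays the role of one worker: it keeps pulling
  // the next unprocessed item until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }

  const runners = Array.from(
    { length: Math.min(poolSize, items.length) },
    () => runner(),
  );
  await Promise.all(runners);
  return results;
}

// With real Web Workers, the task would hand the font buffer over with
//   worker.postMessage(message, [buffer]);
// which transfers the ArrayBuffer to the worker instead of copying it.
```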

New text wrapping algorithm for CJK and multi-codepoint emojis

Our text editor (WYSIWYG) is a regular <textarea> element with the following CSS properties:

word-break: normal; (break "words" based on the language-specific rules)
white-space: pre; (preserve whitespaces)
overflow-wrap: break-word; (break long overflowing words)

This means that while the browser handles wrapping during editing, we need to mimic the very same algorithm when rendering to the canvas. Up until now, our algorithm only broke "words" on whitespace and hyphens, which was enough for alphabetic languages, but it didn't consider any other language-specific rules.

Therefore, we've had to completely rewrite our text-wrapping algorithm to account for the various wrapping rules specific to the CJK languages. Under the hood, it identifies break opportunities based on rules defined with lookbehind (?<=...) (?<!...) and lookahead (?=...) (?!...) assertions, which greatly simplified our previous imperative algorithm.
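
As a rough illustration of the idea (a simplified sketch, not the actual Excalidraw rules), break opportunities can be expressed as zero-width assertions and passed to String.prototype.split, which then cuts the text exactly at those positions without consuming any characters. The character classes below are small, assumed subsets chosen for the example, and emojis are left out.

```typescript
// Assumed, deliberately incomplete character classes for illustration only.
const WHITESPACE = "\\s";
const HYPHEN = "\\-";
const CJK_CHAR =
  "\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Hangul}";
const OPENING = "(\\[{「『《（";
const CLOSING = ")\\]}」』》）。，、！？…";

const BREAK_OPPORTUNITY = new RegExp(
  [
    // Break before whitespace.
    `(?=[${WHITESPACE}])`,
    // Break after whitespace or a hyphen.
    `(?<=[${WHITESPACE}${HYPHEN}])`,
    // Break before a CJK character, unless preceded by opening punctuation.
    `(?=[${CJK_CHAR}])(?<![${OPENING}])`,
    // Break after a CJK character, unless followed by a hyphen or closing punctuation.
    `(?<=[${CJK_CHAR}])(?![${HYPHEN}${CLOSING}])`,
  ].join("|"),
  "u",
);

// Splits text at every break opportunity; no characters are dropped.
function splitIntoBreakableParts(text: string): string[] {
  return text.split(BREAK_OPPORTUNITY);
}

splitIntoBreakableParts("Hello-world"); // ["Hello-", "world"]
splitIntoBreakableParts("Hello 「世界。」"); // ["Hello", " ", "「世", "界。」"]
```

Because every rule is zero-width, joining the parts back together always reproduces the original text, including all whitespace.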

As of now, all our language-specific rules are defined in the following human-readable structure:

/**
 * Specifies the line breaking rules for alphabetic-based languages,
 * Chinese, Japanese, Korean and Emojis.
 *
 * "Hello-world" → ["Hello-", "world"]
 * "Hello 「世界。」🌎🗺" → ["Hello", " ", "「世", "界。」", "🌎", "🗺"]
 */
Regex.or(
  // Unicode-defined regex for (multi-codepoint) Emojis
  getEmojiRegex(),
  // Rules for whitespace and hyphen
  Break.Before(COMMON.WHITESPACE).Build(),
  Break.After(COMMON.WHITESPACE, COMMON.HYPHEN).Build(),
  // Rules for CJK (chars, symbols, currency)
  Break.Before(CJK.CHAR, CJK.CURRENCY)
    .NotPrecededBy(COMMON.OPENING, CJK.OPENING)
    .Build(),
  Break.After(CJK.CHAR)
    .NotFollowedBy(COMMON.HYPHEN, COMMON.CLOSING, CJK.CLOSING)
    .Build(),
  // Rules for opening and closing punctuation
  Break.BeforeMany(CJK.OPENING).NotPrecededBy(COMMON.OPENING).Build(),
  Break.AfterMany(CJK.CLOSING).NotFollowedBy(COMMON.CLOSING).Build(),
  Break.AfterMany(COMMON.CLOSING).FollowedBy(COMMON.OPENING).Build(),
);

The Regex-based wrapping also allowed us to integrate the Regex for multi-codepoint emojis defined by Unicode.
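
To show why a dedicated emoji rule matters, the sketch below uses a much-simplified pattern that treats flags and ZWJ sequences as single units instead of splitting them into individual codepoints. The pattern is an assumption for illustration; the Unicode-defined emoji regex referenced above covers far more cases. `splitKeepingEmojis` is a hypothetical helper name.

```typescript
// One pictograph, optionally followed by a skin-tone modifier and a variation selector.
const PICTOGRAPH = "\\p{Extended_Pictographic}\\p{Emoji_Modifier}?\\u{FE0F}?";

// Simplified emoji pattern (assumption, far from complete).
const SIMPLE_EMOJI = new RegExp(
  [
    // Flags are pairs of regional indicator symbols.
    "\\p{Regional_Indicator}{2}",
    // Pictographs, optionally chained with zero-width joiners (U+200D).
    `${PICTOGRAPH}(?:\\u{200D}${PICTOGRAPH})*`,
  ].join("|"),
  "gu",
);

// Splits text into units: each emoji stays whole, everything else
// is split into individual codepoints.
function splitKeepingEmojis(text: string): string[] {
  const units: string[] = [];
  let last = 0;
  let match: RegExpExecArray | null;
  SIMPLE_EMOJI.lastIndex = 0; // the regex is global, so reset its state
  while ((match = SIMPLE_EMOJI.exec(text)) !== null) {
    units.push(...Array.from(text.slice(last, match.index)));
    units.push(match[0]);
    last = match.index + match[0].length;
  }
  units.push(...Array.from(text.slice(last)));
  return units;
}
```

A naive split by codepoint would turn a single flag into two regional indicator symbols, which is the kind of mangling the emoji rule prevents.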

We've also improved text wrapping. It's now more stable and consistent across browsers/text editor, and also doesn't mangle up complex emojis!

— Excalidraw (@excalidraw) October 23, 2024

With the new algorithm, we've also made sure to preserve all the whitespaces, eliminating the previous horizontal layout shift often visible in the centered text containers.

Finally, the various wrapping conditions specific to Chinese, Japanese, and Korean made us extend our rules considerably, as these languages often keep specific symbols and punctuation marks together with the preceding or following characters. However, we believe the result is worth the effort - see for yourselves.

Notice how, in the same text element, Latin breaks by words, multi-codepoint emojis (flags) break as one unit, and CJK breaks by characters, with the special rules of keeping some characters and symbols together (e.g. 界。 界…」 계! or 好》， は」、 요』).

What's next: shared codepoints between Chinese and Japanese

Unicode has one major flaw: it uses the same codepoints for similar-looking Chinese (Hanzi) and Japanese (Kanji) characters. This means a font needs to favor one representation of a character over the other. In our case, characters in Xiaolai should mostly resemble their Chinese version, which might feel unnatural in some Japanese contexts. Currently, the only apparent solution with the Canvas API is to add a separate Japanese font containing the Japanese versions of these characters. However, it might not end up being that simple either, so for now we are leaving this task for the future.

For more details about the CJK support, feel free to check out the following pull request: "feat: add first-class support for CJK".