Apocalypse Font (aka Guess they must have picked the wrong eight characters.)

by Michael S. Kaplan, published on 2008/11/19 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/11/19/9116067.aspx

The title of this blog is an allusion to Coppola's Apocalypse Now, and eventually I'll be quoting a bit of the Herr-provided narration (those are the pieces Martin Sheen read)...

It all started with a seemingly innocent question the other day. It went something like this (product and component names removed to protect whatever might deserve protecting):

We are hitting an issue where surrogate pair characters do not display correctly on localized builds, but display correctly on English builds. This appears to be because the MS UI Gothic font used in the localized builds doesn’t “automatically” do the correct font linking. (This can be verified by e.g. opening Wordpad, setting the font to MS UI Gothic, and typing some surrogate pair characters—you just get squares. If the font is something else, e.g. Arial, the font linking works correctly.)

Is this a known issue with the MS UI Gothic font face? We are currently using one function to obtain the desired font face. Should we be calling a different function instead of this, or in addition to this?

Now as it turns out, there were several different issues going on here.

It starts with the involvement of GDI font linking and Uniscribe font fallback, discussed previously in blogs like Font Linking vs. Font Fallback.

First and foremost was the fact that this was what they call a tester scenario. Because of this, the actual supplementary CJK Extension B characters in question were not ones that are in any version of JIS (including the latest JIS X 213), which is why they were seeing notdef glyphs (aka square boxes).

Uniscribe largely stays out of the world of CJK (Chinese, Japanese, and Korean) text, allowing GDI font linking to do most of the work here. Usually this will guarantee that some ideograph will make an appearance, because as long as it is in one of those core CJK fonts, it will be on the screen.

But there is one time when Uniscribe is completely involved and GDI font linking is not -- and that is supplementary characters.

And Uniscribe is not quite as sophisticated in its efforts here -- it will see if the current font claims to support the Unicode supplementary ideographic plane (which contains e.g. CJK Extension B). If it does then the font will be used, even if there turn out to be some missing characters.
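To make the terms concrete, here is a minimal Python sketch of how one might tell which characters in a string are supplementary (and so fall under the Uniscribe path described above) and which of those sit in the CJK Extension B block. The helper names are made up for illustration; this is not how Uniscribe itself is implemented.

```python
# CJK Unified Ideographs Extension B block, which lives in Plane 2
# (the Supplementary Ideographic Plane).
EXT_B_FIRST, EXT_B_LAST = 0x20000, 0x2A6DF


def plane_of(ch: str) -> int:
    """Return the Unicode plane of one character (0 = BMP, 2 = SIP)."""
    return ord(ch) >> 16


def is_supplementary(ch: str) -> bool:
    """True for any character outside the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF


def is_extension_b(ch: str) -> bool:
    """True if the character falls inside the Extension B block range."""
    return EXT_B_FIRST <= ord(ch) <= EXT_B_LAST


def classify(text: str):
    """List (code point label, plane, in Extension B?) for each character."""
    return [(f"U+{ord(ch):04X}", plane_of(ch), is_extension_b(ch)) for ch in text]


# One BMP ideograph followed by one Extension B ideograph.
print(classify("\u6F22\U00020087"))
# [('U+6F22', 0, False), ('U+20087', 2, True)]
```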

For the Japanese fonts, such as MS Gothic, MS PGothic, MS Mincho, MS PMincho, and Meiryo, each font is actually pretty much limited to the 300-some CJK Extension B characters in JIS X 213.

If you pick one of these fonts to display any other random Extension B ideograph, then you will get a square box.

And if you pick a font with no Extension B support at all, then it will pick one font to look in, based on its algorithm and system locale settings -- thus if you choose Arial or Tahoma or Microsoft Sans Serif or Segoe UI, then you will possibly also get an ideograph!
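The behavior described so far can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the `Font` class, the `claims_sip` flag (how the real "claim" is detected is not covered in this post), the single fallback font passed in by hand, and the invented coverage sets. It is only meant to show why a font with partial coverage can do worse than a font with none.

```python
from dataclasses import dataclass

NOTDEF = "\u25A1"  # white square, standing in for the notdef glyph


@dataclass
class Font:
    name: str
    claims_sip: bool = False        # hypothetical "claims Plane 2 support" flag
    glyphs: frozenset = frozenset() # code points the font can actually draw


def pick_font(current: Font, fallback: Font) -> Font:
    """A font that claims the plane is used as-is; otherwise one fallback is consulted."""
    return current if current.claims_sip else fallback


def render_supplementary(text: str, current: Font, fallback: Font) -> str:
    """Show each supplementary character, or a box if the chosen font lacks it."""
    font = pick_font(current, fallback)
    return "".join(ch if ord(ch) in font.glyphs else NOTDEF for ch in text)


# Invented coverage data, for illustration only.
partial = Font("partial-coverage font", claims_sip=True, glyphs=frozenset({0x20087}))
latin_only = Font("font with no Extension B")
broad = Font("broad Extension B font", claims_sip=True,
             glyphs=frozenset({0x20000, 0x20087}))

sample = "\U00020000\U00020087"
print(render_supplementary(sample, partial, broad))     # first character becomes a box
print(render_supplementary(sample, latin_only, broad))  # both characters show up
```

In this model the font that claims support but lacks a glyph never reaches the fallback, which is the square-box case; the font with no claim at all hands the text to the fallback and the ideograph appears.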

Korean does not have Extension B in any of its fonts. Given the general tendency toward de-emphasis of Hanja in South Korea and the virtual illegality of it in North Korea, this is hardly a surprise (though this could change in the future if customer demand drives change here).

And for the most part Chinese has the widest support: whether one uses the Simplified Chinese SimSun-ExtB font, the Taiwanese style Traditional Chinese fonts MingLiU-ExtB or PMingLiU-ExtB, or the Hong Kong style Traditional Chinese font MingLiU_HKSCS-ExtB, one has a much larger number of ideographs to choose from.

The ranges are of course based on preferred glyphs in the PRC GB18030, Taiwan CNS11643, and Hong Kong HKSCS standards, respectively -- kind of the ultimate exercise of using a code page as a repertoire fence (something I have discussed before).

But the bug did not quite end there.

You see, it seems that the application had its own custom font choosing behavior, which in this case happened to be preferring the newer ClearType Simplified Chinese Microsoft YaHei font.

A font that also has some Extension B in it.

Eight CJK Extension B ideographs, in fact.

These eight ideographs are:

𠂇 (U+20087)
𠂉 (U+20089)
𠃌 (U+200cc)
𠦝 (U+2099d)
𡗗 (U+215d7)
𢦏 (U+2298f)
𤇾 (U+241fe)
𧾷 (U+27fb7)
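Since the original question called these "surrogate pair characters", a short Python sketch of the standard UTF-16 arithmetic may help show what each of the eight looks like as a surrogate pair. The function name is made up for this example.

```python
def to_surrogates(cp: int):
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# The eight Extension B code points listed above.
EIGHT = [0x20087, 0x20089, 0x200CC, 0x2099D, 0x215D7, 0x2298F, 0x241FE, 0x27FB7]

for cp in EIGHT:
    high, low = to_surrogates(cp)
    print(f"U+{cp:05X} -> {high:04X} {low:04X}")
# The first line printed is: U+20087 -> D840 DC87
```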

So far, these eight characters as a set seem to have no special relationship in China, Taiwan, Hong Kong, Macao, Singapore, Japan, Korea, or Vietnam, those being the major places where ideographs either are in use or have been within the last 1000 years.

If the characters spelled something special, I'd assume it was some kind of Easter Egg in the font (imagine the challenge of coming up with such an egg that relied on eight Unicode characters displayed in code point order -- talk about a fun word challenge in any language!).

I am reminded of a bit from Apocalypse Now where Martin Sheen describes a report about Col Kurtz. Specially modified for the current situation, for the conspiracy theory minded:

Late Summer-Fall 2008:
The proper glyphs for ideographic text in the supplementary
planes show up fine in Vista. Then in November in one font
is noted the presence of eight specific ideographs. Two of
them are in JIS X 213, three are from a list of Hong Kong
Cantonese, one is from some from China. The number of
Extension B ideographs visible in the application in China
drops off to nothing. Guess they must have picked the
wrong eight characters.

Kind of a stretch, obviously. But still fun to write (had I time to really draw this one out, it would have been as much fun in my opinion as that Matrix one!).

Whatever the reasons, their presence (due to the Uniscribe design here) can really break Extension B display support if someone is using the cool font with ClearType support.

If I had to guess, I'd wonder whether they were in there as part of an experimental effort at looking at ClearType Extension B support that just never got taken out (why would they? It's not like they are wrong, except in the meta sense of their effect!). But again that is just a guess. Probably more likely than my Apocalypse Font scenario above! ;-)

An interesting situation, in any case....

This blog brought to you by 𠂇𠂉𠃌𠦝𡗗𢦏𤇾𧾷 (U+20087, U+20089, U+200cc, U+2099d, U+215d7, U+2298f, U+241fe, and U+27fb7)

# Andrew West on 19 Nov 2008 6:37 AM:

There is something settish about these eight characters. Firstly, they are all character components rather than standalone characters in their own right, and, more significantly, six of the eight (U+20087, U+20089, U+200CC, U+215D7, U+2298F and U+241FE) are listed as a set in http://std.dkuug.dk/jtc1/sc2/WG2/docs/n2808.pdf as "characters which are already encoded [that] needs a new source reference". As to U+2099D, it is a wide form of U+9FBA (proposed for encoding in N2808), which may have some significance. U+27FB7 is the odd one out: whereas the other seven are all non-radical character components, U+27FB7 is a radical (equivalent to U+2ECA "CJK RADICAL FOOT").

# Michael S. Kaplan on 19 Nov 2008 9:03 AM:

I guess I could have had title fun here -- like "It's totally radical to trip Uniscribe, dude!".

What do you think? I still like the title already there better, but this would be a fun alternate! :-)

# Kaenneth on 21 Nov 2008 8:55 PM:

Possibly they are specific 'patches' to cover flawed versions of those characters in a different font...

# Michael S. Kaplan on 22 Nov 2008 12:50 PM:

Perhaps that was the goal, though obviously the goal would pretty much fail in all Microsoft software, so if that was what they were trying to do then they kind of didn't think it all through. :-)

# Ken Lunde, Adobe Systems on 24 Nov 2008 5:51 PM:

These are for GB 18030 support. At least, six of them are. Below is a mapping from GB 18030 code points (two-byte) to Unicode:

0xFE51 U+20087
0xFE52 U+20089
0xFE53 U+200CC
0xFE6C U+215D7
0xFE76 U+2298F
0xFE91 U+241FE

The other two appear to be included because they represent components.

In any case, it is absolutely clear that they are included because of GB 18030. Trust me. ;-)

# Ken Lunde, Adobe Systems on 24 Nov 2008 5:56 PM:

I should also point out that those same six ideographs originally came from GBK, and GB 18030 is a superset thereof. In GBK, they mapped to PUA code points. Earlier versions of GB 18030 continued to map them to PUA code points, but they now have valid (and more appropriate) homes in Extension B.

# Michael S. Kaplan on 24 Nov 2008 6:36 PM:

Are these the only ones with PUA mappings from GBK, though?

# Ken Lunde, Adobe Systems on 24 Nov 2008 6:40 PM:

Nope, there are others, but as of Unicode Version 4.1, everything that was handled via PUA code points can now be handled with non-PUA code points. Getting away from PUA usage is a "good thing."

# Michael S. Kaplan on 24 Nov 2008 6:52 PM:

Given the Uniscribe design issue, do you think that the makers of the new cool ClearType enabled PRC-friendly font would have added just these eight and not the rest if they knew it meant that none of the others would show up? :-(

I do agree that the Unicode change to add them is a good thing, but the Microsoft-specific behavior and the font specific trigger of that behavior? Not so much....

# Michael S. Kaplan on 24 Nov 2008 6:55 PM:

Though I admit it is nice to see the mystery understood a little better. Any thoughts on why these eight in particular were singled out above the others?

# Ken Lunde, Adobe Systems on 24 Nov 2008 7:01 PM:

I don't think that they were singled out, at least in the sense that you're thinking. They are unique among the other characters that were previously handled via PUA code points in that they now map to non-BMP code points. The others are handled via BMP code points. For some other examples, check out U+9FB4 through U+9FBB. I have a table in Chapter 3 of "CJKV Information Processing" Second Edition, due out in a month, that details these mapping changes, from PUA to non-PUA.

# Michael S. Kaplan on 24 Nov 2008 7:07 PM:

Ah, that makes sense. They were probably taken right from the former PUA mappings, then -- as a conscious "look at the bug we have fixed" kind of move. If only they realized the bug they introduced in the process!
