Johab to be kidding me!

by Michael S. Kaplan, published on 2008/09/14 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/09/14/8950506.aspx

From the list of bugs from that cool presentation from the folks over in Intel localization....

The bug? Well, it seemed that Korean was "randomly" not working!

By randomly I mean it was not working on some machines but everything was just fine on others.

No real discernible pattern at first glance, but they tracked it down eventually -- having a particular font installed was causing it.

Wow, talk about the power of typography, huh? :-)

The font they found behind the problem was Arial Unicode MS, which I have mentioned before (in blogs like this one and this one and especially this one) as not being the best possible choice of font.

Though to be fair, it takes more than just having Arial Unicode MS installed to cause trouble. In fact, a unique constellation of attributes is required to cause problems!

First, it requires code in the application that is setting the LOGFONT.lfCharSet to JOHAB_CHARSET;
Second, it requires that you have a font on your machine that claims support for code page 1361 (Korean Johab);
Third, it requires that the machine has the appropriate support for code page 1361 installed.

It turns out that Arial Unicode MS is just such a font:

The way that a font gets this is by setting bit 21 in the Code Page Bitfields, and if a font has this set and the code specifically requests the JOHAB_CHARSET then it is almost unfair to blame the Font Mapper in GDI for finding a font that matches...
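For the curious, the bit test itself is trivial. Here is a minimal Python sketch of it, assuming you already have the first 32-bit word of the Code Page Bitfields (ulCodePageRange1 in the font's OS/2 table) as an integer. The function name and the sample values are made up for illustration, and none of this is meant to describe how the GDI Font Mapper does its own matching.

```python
# Bit 21 of the first Code Page Bitfields word (ulCodePageRange1 in the
# OS/2 table) is the flag for Korean Johab, i.e. code page 1361.
JOHAB_CODE_PAGE_BIT = 21


def claims_johab(code_page_range1):
    """Return True if the bitfield value claims code page 1361 support."""
    return bool(code_page_range1 & (1 << JOHAB_CODE_PAGE_BIT))


# Hypothetical sample values, not read from any real font:
only_wansung = 1 << 19                  # bit 19: Korean Wansung (code page 949)
paint_sprayer = (1 << 19) | (1 << 21)   # Wansung plus the Johab bit

print(claims_johab(only_wansung))       # False
print(claims_johab(paint_sprayer))      # True
```

Getting the value out of a real font file is a separate job (a font library or a small OS/2 table parser would do it), which this sketch deliberately leaves out.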

Of course there are probably other fonts out there that have this bit set, though note that none of the Korean fonts that ship in any version of Windows do this.

In fact out of the 863 fonts on this machine, only one font other than Arial Unicode MS has this bit set: Code2000 from James Kass:

I don't know why mega fonts would do this specifically, though I have a guess.

For mega fonts, setting bits in the Code Page Bitfields and the Unicode Subset Bitfields with a paint sprayer seems to be their way of saying "we support a lot of stuff!" even though at this point nothing whatsoever is encoded in this specific weird code page....

To be fair to these two fonts, they are just being promiscuous, and that is not a sin in and of itself.

For sin to take place, you have to request the JOHAB_CHARSET in your code that is loading the font, which I suppose (to continue the less than appropriate metaphor) requires your code to put the $100 between its teeth looking for a promiscuous typographical partner, which the GDI Font Mapper then facilitates -- it is only doing its best to see that provider and customer are both satisfied, after all. :-)

And there are machines out there that have that code page included on them (I guess this would be an available back seat somewhere or something?):

Interestingly, this can happen with Unicode applications as well.

That is because JOHAB_CHARSET processing involves more than just a code page or a charset, though please feel free to add 1361 to the list of code pages that suck:

20269 (also here)
1258
21027
42 (aka CP_SYMBOL)
864

But like the ISO 6937-based code page 20269, the Johab code page actually works under a different character encoding philosophy, as described in several places, from Ken Lunde's CJKV Information Processing to Richard Gillam's Unicode Demystified. It basically has the intent of breaking down Hangul into its constituent Jamo, in ways that don't really match the way Jamo work in Unicode. (Here I speak of the code page as it exists on Windows, which is conveniently left out of the lists on both Microsoft's and Unicode's sites, except for one link under obsolete code pages, here.)
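To make the "constituent Jamo" idea a bit more concrete, here is a small Python sketch. It leans on Python's built-in "johab" codec (which may differ in corner cases from what Windows does with code page 1361) and on the usual description of a Johab Hangul code as a set high bit followed by three 5-bit fields for the initial, medial and final Jamo. The helper name johab_fields is made up for this example.

```python
import unicodedata


def johab_fields(syllable):
    """Split the 2-byte Johab code of one precomposed Hangul syllable into
    its three 5-bit fields: (initial, medial, final)."""
    if len(syllable) != 1 or not "\uac00" <= syllable <= "\ud7a3":
        raise ValueError("expected one precomposed Hangul syllable")
    raw = syllable.encode("johab")      # Python's built-in Johab codec
    code = int.from_bytes(raw, "big")   # 1 flag bit + 3 fields of 5 bits
    return (code >> 10) & 0x1F, (code >> 5) & 0x1F, code & 0x1F


syllable = "\uac09"  # HANGUL SYLLABLE KIYEOK A RIEULKIYEOK

# Johab: one 16-bit code whose fields are Johab's own Jamo indices.
print(syllable.encode("johab").hex(), johab_fields(syllable))

# Unicode NFD: three separate conjoining Jamo code points.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", syllable)])
# ['U+1100', 'U+1161', 'U+11B0']
```

The numbers in the Johab fields are indices private to that encoding, packed into a single code, while NFD gives you a sequence of real U+11xx code points. Both "break down" the syllable, but they are not interchangeable.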

Suffice to say that anything that GDI does here, it is only doing because the application has specifically requested it.

At this point Johab is widely deprecated, though one can suppose that some text editor might still be using it, which makes it harder to just remove the code page from Windows (either on the NLS side or the GDI side -- since the latter can impact Unicode applications, too!).

But at a minimum, you should never specify it, either in your dialog resources (via the FONT statement) or in code (using the LOGFONT structure), unless you are specifically expecting things to be processed Johab-style (which is not the same as Unicode Normalization Forms D or KD).
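If you want to check your own resources for this, something like the following Python sketch could be a starting point. It assumes the DIALOGEX form of the FONT statement (pointsize, "typeface", weight, italic, charset), with the whole statement on one line and the charset written either as a number or as the symbolic name. JOHAB_CHARSET is 130 (0x82) in wingdi.h; the function name and the sample dialog are hypothetical, and a real .rc file with macros or preprocessor tricks would need more than a regular expression.

```python
import re

JOHAB_CHARSET = 130  # value of JOHAB_CHARSET in wingdi.h

# DIALOGEX form: FONT pointsize, "typeface", weight, italic, charset
FONT_STATEMENT = re.compile(
    r'^[ \t]*FONT[ \t]+\d+[ \t]*,[ \t]*"[^"]*"[ \t]*,'
    r'[ \t]*\w+[ \t]*,[ \t]*\w+[ \t]*,[ \t]*(\w+)',
    re.IGNORECASE | re.MULTILINE,
)


def find_johab_fonts(rc_text):
    """Return 1-based line numbers of FONT statements that ask for Johab."""
    hits = []
    for match in FONT_STATEMENT.finditer(rc_text):
        charset = match.group(1)
        try:
            value = int(charset, 0)   # accepts 130 as well as 0x82
        except ValueError:
            value = None              # a symbolic name, not a number
        if value == JOHAB_CHARSET or charset.upper() == "JOHAB_CHARSET":
            hits.append(rc_text.count("\n", 0, match.start()) + 1)
    return hits


# Hypothetical dialog resource, for illustration only:
sample = '''IDD_ABOUT DIALOGEX 0, 0, 200, 100
FONT 9, "MS Shell Dlg", 400, 0, 0x82
BEGIN
END
'''
print(find_johab_fonts(sample))  # [2]
```

The same idea applies to code: anything that assigns JOHAB_CHARSET to LOGFONT.lfCharSet deserves a second look unless Johab-style processing is really what you want.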

In any case, a great obscure globalization issue, made harder to track down due to the lack of good documentation describing the behavior of the seldom-used Johab support on Windows....

This blog brought to you by 갉 (U+ac09, aka HANGUL SYLLABLE KIYEOK A RIEULKIYEOK)
