Theory vs. practice for Korean text collation

by Michael S. Kaplan, published on 2005/02/25 04:44 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/25/380266.aspx

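Korean is encoded two different ways in Unicode. There are the Jamo (the pieces of characters that are used to make Hangul syllables), found in the Hangul Jamo block (U+1100 to U+11FF), and there are the precomposed Hangul syllables, found in the Hangul Syllables block (U+AC00 to U+D7A3).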

Now everything that is modern Hangul can be found in that Hangul syllables block, which was largely based on existing standards from Korea. Each of those can (in theory) also be encoded as the underlying Jamo, though in practice people do not tend to use them that way, in part because IMEs do not use this method and in part because the typography is mostly not there yet.
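The relationship between a modern syllable and its underlying Jamo is purely arithmetic, following the Hangul syllable decomposition rules in the Unicode Standard. A minimal sketch of that arithmetic in Python; the function name `decompose_syllable` is made up for illustration and is not part of any library:

```python
# Constants from the Unicode Standard's Hangul syllable decomposition rules.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed modern syllables


def decompose_syllable(ch):
    """Return the conjoining Jamo for one precomposed Hangul syllable.

    Hypothetical helper: any character outside the Hangul Syllables
    block is returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    leading = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trailing = T_BASE + s_index % T_COUNT
    jamo = chr(leading) + chr(vowel)
    if trailing != T_BASE:    # T_BASE itself means "no trailing consonant"
        jamo += chr(trailing)
    return jamo
```

For example, `decompose_syllable("\uD55C")` yields the three Jamo U+1112 U+1161 U+11AB.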

But Unicode normalization considers the Hangul syllables to be "Form C" (composite/composed) and the underlying Jamo to be "Form D" (decomposed). Thus e.g. the string "Korean" has these two different forms:

Even if you do not know anything about Korean at all, if you can see the text you can probably see how the former is made up of the pieces of the latter.
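If the text itself is not visible, the two forms can still be inspected as code points. The sketch below uses Python's standard `unicodedata` module; the sample word 한국어 ("Korean language") is my own choice for illustration and may not be the string used in the original post:

```python
import unicodedata


def code_points(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Assumed sample word (not necessarily the original post's example).
composed = "\uD55C\uAD6D\uC5B4"                       # Form C: three syllables
decomposed = unicodedata.normalize("NFD", composed)   # Form D: eight Jamo

print(code_points(composed))
print(code_points(decomposed))
```

Normalizing the decomposed string back with `"NFC"` returns the original three syllables.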

But the Jamo serve another purpose -- to create Hangul Syllables that are not representable in the modern block. This is usually referred to as Old Hangul. The easiest way to detect whether a string is Old Hangul is to try to normalize it to Form C; if that operation does nothing, then it usually means that it is not modern Hangul. One example is a string like "ᄇᄉᄐ" (U+1107 U+1109 U+1110).
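That detection trick might be sketched as follows in Python. The name `looks_like_old_hangul` is hypothetical, and the extra check that the string actually contains conjoining Jamo is my own assumption (without it, a string that is already all precomposed syllables would also come back unchanged from Form C). As the text says, this is a heuristic that is usually right, not a definitive test:

```python
import unicodedata


def looks_like_old_hangul(s):
    """Heuristic: the string has conjoining Jamo that NFC leaves untouched."""
    # Assumption: only strings containing Hangul Jamo (U+1100..U+11FF) qualify.
    has_jamo = any(0x1100 <= ord(c) <= 0x11FF for c in s)
    return has_jamo and unicodedata.normalize("NFC", s) == s


old = "\u1107\u1109\u1110"      # three leading consonants, no modern syllable
modern = "\u1112\u1161\u11AB"   # Jamo that compose to a single modern syllable

print(looks_like_old_hangul(old))
print(looks_like_old_hangul(modern))
```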

Now there are times, even in Modern Hangul, when one might have individual Jamo in with the syllables without it being considered "Old Hangul", something that products like Windows 2000 and SQL Server 2000 support.

And then starting in Windows XP, collation of Old Hangul as a scenario is supported, which is interesting since of course (as I said before) we do not yet have a typography story to fully support it. But there is a great deal of data that already exists for Old Hangul and being able to get that into computers is a good thing. And the strings are intelligible, at least.

As a matter of practical implementation, the work to equate the decomposed and composite forms of Korean in collation did not happen. While such an equivalence is interesting in theory, in practice Modern Hangul usually is the composed Hangul syllables. So while it may make for a useful operation in theory, the need to equate the two in collation is less important in practice. It is generally better to have data in Unicode Normalization Form C anyway when one is doing operations like collation in Microsoft products (and most data entered by MS keyboards is in Form C already).

So does Microsoft do the right thing here? Well, I have asked different people this question over the past few years (the last person I asked just days ago). And every time I talk to people who are native speakers of or experts in Korean I am more convinced that we do. Which is not to say that making the equivalence is wrong -- it's not (and no one has said that it is). But leaving the equivalence out of collation simply does not seem to be something that impacts actual usage. Those who do need it can of course use Unicode Normalization to have Form C data prior to comparisons, with no need to force everyone to go through this process. This way, everybody wins. :-)
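For those who do need the equivalence, normalizing first might look something like the sketch below. It uses Python's `unicodedata` module purely as an illustration and says nothing about how Windows or SQL Server compare strings internally; `nfc_equal` is a hypothetical helper name, and the comparison is a plain code point comparison rather than linguistic collation:

```python
import unicodedata


def nfc_equal(a, b):
    """Compare two strings after bringing both to Normalization Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


syllable = "\uD55C"            # precomposed syllable (Form C)
jamo = "\u1112\u1161\u11AB"    # the same syllable as conjoining Jamo (Form D)

print(syllable == jamo)            # raw comparison sees different code points
print(nfc_equal(syllable, jamo))   # normalized comparison treats them as equal
```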

And in my mind, user expectations beat the benefit of following a theoretical practice, every time....

I guess no one can call the cops if you do not follow the rules. I think you said you do not follow the UCA rules anyway!

Well, I don't think Unicode is messed up here, at all. It just did not make sense for Microsoft to do this equivalence. Perhaps other platforms have other situations that drive the need to make this equivalence?



2010/04/20 You can't get this particular bit of proverbial toothpaste back into the tube

2008/03/03 On reversing the irreversible (grabbing the data, part II: the weirdness not so related to locales)

2006/07/22 We're off on the road to Korea! We certainly do get around...

2005/09/14 One more thing about Korean....

2005/09/12 Theory vs. practice for Korean text collation, Redux

2005/07/20 More on sort elements

2005/04/30 Normalization vs. .NET text elements