Theory vs. practice for Korean text collation

by Michael S. Kaplan, published on 2005/02/25 04:44 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/25/380266.aspx


Let's take a look at Korean for a moment.

Korean is encoded two different ways in Unicode. There are the Jamo (the pieces of characters that are used to make Hangul syllables), in the Hangul Jamo block (U+1100 to U+11FF), and the actual Hangul syllables, in the Hangul Syllables block (U+AC00 to U+D7AF).

I'll talk more about the specifics of Jamo another day.

Now everything that is modern Hangul can be found in that Hangul syllables block, which was largely based on existing standards from Korea. Each of those can (in theory) also be encoded as the underlying Jamo, though in practice people do not tend to use them that way, in part because IMEs do not use this method and in part because the typography is mostly not there yet.
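To make the "can also be encoded as the underlying Jamo" part concrete, here is a minimal sketch of the arithmetic mapping between a modern Hangul syllable and its Jamo, using the constants from the Hangul decomposition algorithm in the Unicode Standard. The function name decompose_syllable is just an illustrative name, not anything from a Microsoft API.

```python
# Constants from the Unicode Standard's Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 modern syllables in total


def decompose_syllable(ch: str) -> str:
    """Return the Jamo sequence for one modern Hangul syllable.

    Hypothetical helper for illustration; anything that is not a
    precomposed Hangul syllable is returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    leading = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trailing = T_BASE + s_index % T_COUNT
    jamo = chr(leading) + chr(vowel)
    if trailing != T_BASE:          # T_BASE itself means "no trailing consonant"
        jamo += chr(trailing)
    return jamo


# U+D55C decomposes to U+1112 U+1161 U+11AB
print(" ".join(f"{ord(c):04x}" for c in decompose_syllable("\ud55c")))
```

Because the mapping is pure arithmetic, every one of the 11,172 modern syllables has exactly one Jamo spelling, with no lookup table needed.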

But Unicode normalization considers the Hangul syllables to be "Form C" (composite/composed) and the underlying Jamo to be "Form D" (decomposed). Thus, for example, the Korean word for "Korean" has these two different forms:

한국어                          (d55c ad6d c5b4)

한국어            (1112 1161 11ab 1100 116e 11a8 110b 1165)

Even if you do not know anything about Korean at all, if you can see the text you can probably see how the former is made up of the pieces of the latter.
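If you cannot see the text, the two forms can be checked with a few lines of code. This is a small sketch using Python's standard unicodedata module (my choice for illustration; the post itself is not tied to any particular API), and code_points is a hypothetical helper name.

```python
import unicodedata

# The composed (Form C) spelling, written with escapes so the form is unambiguous.
composed = "\ud55c\uad6d\uc5b4"
decomposed = unicodedata.normalize("NFD", composed)


def code_points(s: str) -> str:
    """Format a string as space-separated hex code points."""
    return " ".join(f"{ord(c):04x}" for c in s)


print(code_points(composed))    # d55c ad6d c5b4
print(code_points(decomposed))  # 1112 1161 11ab 1100 116e 11a8 110b 1165

# Normalizing back to Form C round-trips to the original syllables.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```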

Now in actual practice, Modern Korean uses the Hangul Syllables.

But the Jamo serve another purpose -- to create Hangul Syllables that are not representable in the modern block. This is usually referred to as Old Hangul. The easiest way to detect whether a string of Jamo is Old Hangul is to try to normalize it to Form C; if that operation does nothing, then it usually means that it is not modern Hangul. An example is a string like "ᄇᄉᄐ" (1107 1109 1110), which is three leading consonants in a row and has no composed form.
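The "try to normalize it to Form C" check can be sketched roughly as follows. The function names are hypothetical, and the range test for "contains Jamo" (U+1100 to U+11FF) is my own assumption about how to scope the check. As the "usually" above suggests, this is only a heuristic, not a definitive classifier.

```python
import unicodedata


def contains_conjoining_jamo(s: str) -> bool:
    # Assumption: "Jamo" means the Hangul Jamo block, U+1100..U+11FF.
    return any(0x1100 <= ord(c) <= 0x11FF for c in s)


def looks_like_old_hangul(s: str) -> bool:
    """Heuristic: the string has Jamo, and normalizing to Form C changes nothing."""
    return contains_conjoining_jamo(s) and unicodedata.normalize("NFC", s) == s


print(looks_like_old_hangul("\u1107\u1109\u1110"))  # True: nothing composes
print(looks_like_old_hangul("\u1112\u1161\u11ab"))  # False: composes to U+D55C
print(looks_like_old_hangul("\ud55c"))              # False: no Jamo at all
```

A stray Jamo inside otherwise modern text would also pass this test, so the result is best treated as a hint rather than an answer.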

Now there are times, even in Modern Hangul, when one might have individual Jamo in with the syllables without it being considered "Old Hangul", something that products like Windows 2000 and SQL Server 2000 support.

And then starting in Windows XP, collation of Old Hangul as a scenario is supported, which is interesting since of course (as I said before) we do not yet have a typography story to fully support it. But there is a great deal of data that already exists for Old Hangul and being able to get that into computers is a good thing. And the strings are intelligible, at least.

As a matter of practical implementation, the work to equate the decomposed and composite forms of Korean in collation did not happen. While such an equivalence is interesting in theory, in practice Modern Hangul usually is the composed Hangul syllables. So while it may make for a useful operation in theory, the need to equate the two in collation is less important in practice. It is generally better to have data in Unicode Normalization Form C anyway when one is doing operations like collation in Microsoft products (and most data entered by MS keyboards is in Form C already).

So does Microsoft do the right thing here? Well, I have asked different people this question over the past few years (the last person I asked just days ago). And every time I talk to people who are native speakers of or experts in Korean I am more convinced that we do. Which is not to say that making the equivalence is wrong -- it's not (and no one has said that it is). But it simply does not seem to impact actual usage if the equivalence is not made in collation. Those who do need it can of course use Unicode Normalization to have Form C data prior to comparisons, with no need to force everyone to go through this process. This way, everybody wins. :-)
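For those who do need the equivalence, normalizing to Form C before comparing might look something like the sketch below. It uses Python's unicodedata module and plain code point comparison as a stand-in for a real collation call, so the ordering shown is not what Windows collation would produce; nfc_equal and nfc_key are hypothetical helper names.

```python
import unicodedata


def nfc_key(s: str) -> str:
    """Sort/compare key: the Form C version of the string."""
    return unicodedata.normalize("NFC", s)


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to Form C."""
    return nfc_key(a) == nfc_key(b)


composed = "\ud55c\uad6d\uc5b4"
decomposed = "\u1112\u1161\u11ab\u1100\u116e\u11a8\u110b\u1165"

print(composed == decomposed)           # False: different code points
print(nfc_equal(composed, decomposed))  # True: same text once in Form C

# Mixed-form data: a decomposed U+D55C and a composed U+AC00.
words = ["\u1112\u1161\u11ab", "\uac00"]
print(sorted(words)[0] == "\u1112\u1161\u11ab")      # True: raw Jamo sorts first
print(sorted(words, key=nfc_key)[0] == "\uac00")     # True: U+AC00 sorts first
```

Only callers with decomposed data pay for the extra normalization step; everyone else compares their Form C data directly.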

And in my mind, user expectations beat the benefit of following a theoretical practice, every time....

 

This post brought to you by "ᄐ" (U+1110, a.k.a. HANGUL CHOSEONG THIEUTH)


# AC on 25 Feb 2005 7:50 AM:

I guess no one can call the cops if you do not follow the rules. I think you said you do not follow the UCA rules anyway!

# Michael Kaplan on 27 Feb 2005 5:49 AM:

That is correct -- we do not use the UCA. :-)

# AC on 2 Mar 2005 12:35 PM:

Everything you are saying here makes sense. So why is Unicode so messed up here?

# Michael Kaplan on 8 Mar 2005 2:48 AM:

Well, I don't think Unicode is messed up here, at all. It just did not make sense for Microsoft to do this equivalence. Perhaps other platforms have other situations that drive the need to make this equivalence?

# Ricardo S. San on 14 Sep 2005 10:16 AM:

Last February and then again the day before yesterday I talked a bit about how Korean was encoded...

referenced by

2010/04/20 You can't get this particular bit of proverbial toothpaste back into the tube

2008/03/03 On reversing the irreversible (grabbing the data, part II: the weirdness not so related to locales)

2006/07/22 We're off on the road to Korea! We certainly do get around...

2005/09/14 One more thing about Korean....

2005/09/12 Theory vs. practice for Korean text collation, Redux

2005/07/20 More on sort elements

2005/04/30 Normalization vs. .NET text elements
