Theory vs. practice for Korean text collation, Redux

by Michael S. Kaplan, published on 2005/09/12 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/12/463560.aspx


Back in February, I talked about theory vs. practice for Korean text collation, comparing Microsoft's implementation of old Hangul versus the one specified by Unicode (which uses normalization to move between the Korean Jamo and the Korean Hangul syllables).

I thought I should maybe say a little more about what is happening here.

After all, I spent some time explaining that no effort is made to give the two different Unicode normalization forms identical weights. But I did not mention that the relative weight difference between any two strings is preserved from one normalization form to the other, so the fact that the Form "C" character order is

가 < 각 < 힢 < 힣

ac00 < ac01 < d7a2 < d7a3

means that the Form "D" character order will also be the same.

가 < 각 < 힢 < 힣

1100 1161 < 1100 1161 11a8 < 1112 1175 11c1 < 1112 1175 11c2
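
The code point sequences above can be checked with a short sketch. It uses Python's standard unicodedata module rather than anything from the NLS API, and the helper name code_points is made up for this illustration. The ordering check at the end is a plain code point comparison, which is only a stand-in: it says nothing about the actual weights that Windows collation assigns, only that the four example strings keep their relative order when decomposed.

```python
import unicodedata

# The four precomposed syllables from the post (Form C code points).
syllables = ["\uac00", "\uac01", "\ud7a2", "\ud7a3"]


def code_points(text):
    """Hypothetical helper: code points of text as lowercase hex."""
    return " ".join(f"{ord(ch):04x}" for ch in text)


# Canonical decomposition (Form D) turns each syllable into conjoining jamo.
decomposed = [unicodedata.normalize("NFD", s) for s in syllables]

for composed_form, decomposed_form in zip(syllables, decomposed):
    print(code_points(composed_form), "->", code_points(decomposed_form))

# Assumption: ordinal (code point) comparison is used here in place of
# real collation. Both lists are already in ascending order.
same_order = sorted(syllables) == syllables and sorted(decomposed) == decomposed
print(same_order)
```
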

Now, as I said, the actual absolute weights will be different if you are looking at the sort keys, but this is okay since in practice the two forms are not really compared against each other. The fact that the relative weights will be the same is good enough for the real-world data that would exist for Korean customers.

(The typographic support for Korean Jamo is not yet there anyway!)

By the way, for those who need it to work for them, the Unicode solution will work as well! As I discussed in the Mitigation tools for IDN security problems post, the Unicode normalization functions that have been added to the NLS API are also provided for customers running versions of Windows downlevel of Vista. Simply install the tools and both the NormalizeString and IsNormalizedString functions will be available to use. And since they are completely conformant with Unicode 4.1, you can use them to convert strings if you truly need to do so.

For those of you using managed code, the .NET Framework 2.0 (a.k.a. Whidbey) also has the Unicode Normalization functionality, in the System.String.Normalize and System.String.IsNormalized methods. They are also completely conformant to the Unicode 4.1 definition.
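
If the two forms ever do have to be compared with each other, the usual approach is to convert both sides to one form first. Here is a minimal sketch of that idea, again using Python's unicodedata module as a stand-in for NormalizeString / IsNormalizedString or System.String.Normalize / System.String.IsNormalized. The helpers is_normalized and to_form are hypothetical names for this example, and the Unicode version in use depends on the Python build rather than being 4.1 (the Hangul mapping is algorithmic, so that does not matter here).

```python
import unicodedata


def is_normalized(form, text):
    """Hypothetical helper: True if text is already in the given form."""
    return unicodedata.normalize(form, text) == text


def to_form(form, text):
    """Hypothetical helper: convert text only if it is not already in form."""
    return text if is_normalized(form, text) else unicodedata.normalize(form, text)


composed = "\ud7a3"                # Form C: one precomposed syllable
decomposed = "\u1112\u1175\u11c2"  # Form D: leading, vowel, trailing jamo

# The raw strings differ, even though they are canonically equivalent.
print(composed == decomposed)

# After converting to a common form, they compare equal in either direction.
print(to_form("NFC", decomposed) == composed)
print(to_form("NFD", composed) == decomposed)
```
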

 

This post brought to you by "ᇂ" (U+11c2, a.k.a. HANGUL JONGSEONG HIEUH)
(a character that looks a lot like HANGUL JUNGSEONG I (U+1175), though since one is a trailing jamo and the other is a vowel jamo, they will never sort as exactly the same!)


# kurakuraninja on 14 Sep 2005 10:16 AM:

Last February and then again the day before yesterday I talked a bit about how Korean was encoded...

referenced by

2010/04/20 You can't get this particular bit of proverbial toothpaste back into the tube

2008/03/03 On reversing the irreversible (grabbing the data, part II: the weirdness not so related to locales)

2005/09/14 One more thing about Korean....
