by Michael S. Kaplan, published on 2005/12/03 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/12/03/499419.aspx
Back in the July 2005 post entitled Getting Intermediate Forms, I talked about these middle forms that are not quite Unicode normalization Form C and not quite Form D.
By the rules of Unicode it should not matter -- all three have a canonical equivalence, which means that all of them must be treated as if they were the same string.
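To make "canonical equivalence" concrete, here is a minimal sketch using Python's standard `unicodedata` module. The sample character (ệ) and the helper name `canonically_equivalent` are my own choices for illustration and do not come from the earlier post. The three strings are the Form C spelling, the Form D spelling, and one of those in-between spellings.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "\u1EC7"            # LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW (Form C)
decomposed = "e\u0323\u0302"   # e + COMBINING DOT BELOW + COMBINING CIRCUMFLEX (Form D)
intermediate = "\u1EB9\u0302"  # e-with-dot-below + COMBINING CIRCUMFLEX (neither C nor D)

# All three spellings compare equal once normalized.
print(canonically_equivalent(composed, decomposed))    # True
print(canonically_equivalent(composed, intermediate))  # True

# Normalizing the intermediate spelling lands on the real forms.
print(unicodedata.normalize("NFC", intermediate) == composed)    # True
print(unicodedata.normalize("NFD", intermediate) == decomposed)  # True
```

Code that compares or searches the raw code points instead will treat these as three different strings, which is exactly the search engine problem.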
Of course as I pointed out in my Sorting It All Out to Search Engines post in November 2005, not everyone seems to be following those rules just yet -- I'd say that Google is probably the worst offender given that it is the de facto web search engine in so many people's minds, but Microsoft is a good runner-up as the worst offender since it is trying to make a name for itself here, too.
Note that they are both members of the Unicode Consortium, though currently Google is only an Associate member so perhaps they do not feel as invested in using the Unicode Standard just yet. :-)
(But anyway, that is a topic for another day. And believe me when I say it is for the Microsoft search folks -- their bug is not the cool one, here!)
Now Microsoft supports Unicode normalization in Whidbey (as I pointed out in January 2005 in the FoldString.NET post). And it is also supported in unmanaged code, both in Vista and in the Internationalized Domain Names Mitigation APIs package.
(So the tools exist for the MS Search folks to support this requirement!)
And that support fully respects the Unicode rules about canonically equivalent strings: it does the complete conversion between normalization forms C and D, never stopping at one of the intermediate forms.
Though a minor hiccup was reported with the compatibility equivalences that affects normalization form KC.... and that minor hiccup comes into play when you think about the four forms of Korean encoded into the Unicode Standard.
So let us take for example the two Unicode characters ㄱㅏ (U+3131 U+314f).
U+3131 (ㄱ, HANGUL LETTER KIYEOK) has a compatibility equivalence to U+1100 (ᄀ, HANGUL CHOSEONG KIYEOK).
U+314f (ㅏ, HANGUL LETTER A) has a compatibility equivalence to U+1161 (ᅡ, HANGUL JUNGSEONG A).
And 가 (U+1100 U+1161), if converted to normalization form C, becomes 가 (U+ac00, the first precomposed Hangul syllable).
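The jump from U+1100 U+1161 to U+ac00 is plain arithmetic in the Unicode Standard's Hangul syllable composition algorithm. The sketch below is only an illustration of that arithmetic: the function name `compose_hangul` is made up, and it is not meant to reflect how any Microsoft implementation is written.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose_hangul(lead: str, vowel: str, trail: str = "") -> str:
    """Hypothetical helper: compose conjoining jamo into one precomposed syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    # T_BASE itself is not a jamo, so a real trailing consonant has index 1..27.
    t_index = ord(trail) - T_BASE if trail else 0
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("not a composable leading consonant + vowel pair")
    if trail and not (1 <= t_index < T_COUNT):
        raise ValueError("not a composable trailing consonant")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

syllable = compose_hangul("\u1100", "\u1161")
print(f"U+{ord(syllable):04X}")  # U+AC00

# The arithmetic agrees with the library's own Form C conversion.
print(syllable == unicodedata.normalize("NFC", "\u1100\u1161"))  # True
```

Since U+1100 and U+1161 are the first leading consonant and the first vowel, both indexes are zero and the result is the very first syllable in the block.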
Now if you convert U+3131 U+314f to normalization form KD, you will get U+1100 U+1161 as expected.
And now here comes the bug -- if you convert U+3131 U+314f to normalization form KC, you also get U+1100 U+1161, rather than U+ac00.
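For comparison, here is what a conformant implementation is expected to return for the same input. This sketch uses Python's `unicodedata` module purely as a stand-in reference; it is not the Whidbey, Vista, or IDN Mitigation APIs code, and the `code_points` helper is just a hypothetical convenience for printing.

```python
import unicodedata

def code_points(text: str) -> str:
    """Hypothetical helper: show a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

compat_jamo = "\u3131\u314F"  # HANGUL LETTER KIYEOK + HANGUL LETTER A

nfkd = unicodedata.normalize("NFKD", compat_jamo)
nfkc = unicodedata.normalize("NFKC", compat_jamo)

print(code_points(nfkd))  # U+1100 U+1161
print(code_points(nfkc))  # U+AC00

# The canonical forms leave the compatibility jamo alone, since the
# mapping to conjoining jamo is a compatibility one, not a canonical one.
print(code_points(unicodedata.normalize("NFC", compat_jamo)))  # U+3131 U+314F
```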
Oops!
Of course if you convert again (either to KC or C) you will get U+ac00 as expected. And to be honest you probably shouldn't be using those compatibility forms anyway. Plus it is not like U+1100 U+1161 is actually an intermediate form or anything.
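One way to see why U+1100 U+1161 is the wrong answer for form KC is that normalizing an already normalized string must change nothing. A rough sketch of that check, again with Python's `unicodedata` standing in as the reference and with `is_normalized` as a hypothetical helper name of my own:

```python
import unicodedata

def is_normalized(form: str, text: str) -> bool:
    """Hypothetical helper: a string is in a form if normalizing it is a no-op."""
    return unicodedata.normalize(form, text) == text

buggy_output = "\u1100\u1161"  # what the buggy KC conversion returned

# The buggy result is not actually in form KC (or in form C).
print(is_normalized("NFKC", buggy_output))  # False
print(is_normalized("NFC", buggy_output))   # False

# A second pass, through either KC or C, reaches the precomposed syllable.
second_pass_kc = unicodedata.normalize("NFKC", buggy_output)
second_pass_c = unicodedata.normalize("NFC", buggy_output)
print(second_pass_kc == "\uAC00" and second_pass_c == "\uAC00")  # True

# And that result is stable under further conversion.
print(is_normalized("NFKC", second_pass_kc))  # True
```

A check like this makes a handy regression test for any normalization routine: run the conversion once, then confirm that running it again is a no-op.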
It just struck me as kind of a cool bug when it was discovered, one where the developer who owned the code and a Unicode geek (me) were both able to understand what was going on before even looking at the code to see what the reason might be (I think the code owner probably even knew of the underlying problem before he looked at it -- this is just one of those kinds of bugs).
Anyway, it will be fixed in Vista and probably in the next Whidbey and IDN Tools updates. And I guess it was pretty obscure anyway. I still find this kind of bug to be cool for reasons I cannot completely define....
Kind of a fun job when even the bugs are cool. :-)
This post brought to you by "ㄱ" (U+3131, a.k.a. HANGUL LETTER KIYEOK)