When even the bugs seem cool

by Michael S. Kaplan, published on 2005/12/03 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/12/03/499419.aspx


Back in the July 2005 post entitled Getting Intermediate Forms, I talked about these middle forms that are not quite Unicode normalization Form C and not quite Form D.

By the rules of Unicode it should not matter -- all three have a canonical equivalence, which means that all of them must be treated as if they were the same string.
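To make the "all three are equivalent" point concrete, here is a minimal sketch using Python's standard `unicodedata` module rather than any Microsoft API. The letter ệ is just an illustrative pick (it is not taken from the earlier post), and `canonically_equal` is a hypothetical helper name.

```python
import unicodedata

# Three spellings of the same letter (e with circumflex and dot below).
nfc_form = "\u1ec7"            # fully composed (Form C)
nfd_form = "e\u0323\u0302"     # fully decomposed, marks in canonical order (Form D)
intermediate = "\u00ea\u0323"  # e-circumflex + combining dot below: neither C nor D

def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to Form D."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The raw code point sequences differ, but the strings are canonically equivalent.
print(nfc_form == intermediate)                   # raw comparison
print(canonically_equal(nfc_form, intermediate))  # normalized comparison
print(canonically_equal(nfd_form, intermediate))
```

Normalizing both sides to the same form before comparing is the usual way to honor canonical equivalence; Form C would work just as well as Form D here.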

Of course as I pointed out in my Sorting It All Out to Search Engines post in November 2005, not everyone seems to be following those rules just yet -- I'd say that Google is probably the worst offender given that it is the de facto web search engine in so many people's minds, but Microsoft is a good runner-up as the worst offender since it is trying to make a name for itself here, too.

Note that they are both members of the Unicode Consortium, though currently Google is only an Associate member so perhaps they do not feel as invested in using the Unicode Standard just yet. :-)

(But anyway, that is a topic for another day. And believe me when I say it is for the Microsoft search folks -- their bug is not the cool one, here!)

Now Microsoft supports Unicode normalization in Whidbey (as I pointed out in January 2005 in the FoldString.NET post). And it is also supported in unmanaged code, both in Vista and in the Internationalized Domain Names Mitigation APIs package.

(So the tools exist for the MS Search folks to support this requirement!)

And that implementation has 100% respect for the Unicode rules about canonically equivalent strings, doing the complete conversion between normalization forms C and D without stopping at any of the intermediate forms.

Though a minor hiccup was reported with the compatibility equivalences that affects normalization form KC.... and that minor hiccup comes into play when you think about the four forms of Korean encoded into the Unicode Standard.

So let us take for example the two Unicode characters ㄱㅏ (U+3131 U+314f).

U+3131 (ㄱ, HANGUL LETTER KIYEOK) has a compatibility equivalence to U+1100 (ᄀ, HANGUL CHOSEONG KIYEOK).

U+314f (ㅏ, HANGUL LETTER A) has a compatibility equivalence to U+1161 (ᅡ, HANGUL JUNGSEONG A).

And 가 (U+1100 U+1161), if converted to normalization form C, becomes 가 (U+ac00, the first precomposed Hangul syllable).
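The composition of a leading consonant and a vowel into a precomposed syllable is pure arithmetic in the Unicode Standard. The sketch below uses the constants from the standard's Hangul syllable composition algorithm; `compose_lv` is a hypothetical helper name, and it is cross-checked against Python's `unicodedata`, not against the Microsoft code discussed here.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE = 0xAC00, 0x1100, 0x1161
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose_lv(lead: str, vowel: str) -> str:
    """Hypothetical helper: compose one leading consonant and one vowel jamo."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("expected a conjoining lead jamo followed by a vowel jamo")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)

syllable = compose_lv("\u1100", "\u1161")
print(f"U+{ord(syllable):04X}")
print(syllable == unicodedata.normalize("NFC", "\u1100\u1161"))
```

With both indexes at zero, the result is simply `S_BASE`, which is why U+1100 U+1161 lands on the very first precomposed syllable.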

Now if you convert U+3131 U+314f to normalization form KD, you will get U+1100 U+1161 as expected.

And now here comes the bug -- if you convert U+3131 U+314f to normalization form KC, you also get U+1100 U+1161, rather than U+ac00.

Oops!

Of course if you convert again (either to KC or C) you will get U+ac00 as expected. And to be honest you probably shouldn't be using those compatibility forms anyway. Plus it is not like U+1100 U+1161 is actually an intermediate form or anything.
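For comparison, here is what the conversions look like in an implementation without the hiccup. This is only a sketch using Python's `unicodedata` module as a stand-in; it says nothing about how the Whidbey or Vista code is written, and `code_points` is a hypothetical helper for readable output.

```python
import unicodedata

def code_points(text: str) -> str:
    """Hypothetical helper: render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

compat_jamo = "\u3131\u314f"  # HANGUL LETTER KIYEOK + HANGUL LETTER A

nfkd = unicodedata.normalize("NFKD", compat_jamo)
nfkc = unicodedata.normalize("NFKC", compat_jamo)

print("NFKD:", code_points(nfkd))  # the conjoining jamo
print("NFKC:", code_points(nfkc))  # the precomposed syllable

# The "convert again" workaround: composing the conjoining jamo a second time.
second_pass = unicodedata.normalize("NFC", nfkd)
print("NFC of NFKD result:", code_points(second_pass))
```

Form KC is defined as compatibility decomposition followed by canonical composition, so a single call should already reach U+AC00; the second pass is only needed when the composition step was skipped for the Hangul case.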

It just struck me as kind of a cool bug when it was discovered -- one where the developer who owned the code and a Unicode geek (me) were both able to understand what was going on before even looking at the code to see what the reason might be (I think the code owner probably even knew of the underlying problem before he looked at it -- this is just one of those kinds of bugs).

Anyway, it will be fixed in Vista and probably in the next Whidbey and IDN Tools updates. And I guess it was pretty obscure anyway. I still find this kind of bug to be cool for reasons I cannot completely define....

Kind of a fun job when even the bugs are cool. :-)

 

This post brought to you by "ㄱ" (U+3131, a.k.a. HANGUL LETTER KIYEOK)


# Syrian on 3 Dec 2005 12:29 PM:

Ha ha ha. Check out abecedaria.blogspot.com

Suzanne is like your biggest fan or something and even she has to use Firefox instead of Internet Exploder.

Serves as an excellent example to motivate the question: When will Microsoft take international issues seriously inside their products instead of just pontificating at great length about them on blogs?

# Michael S. Kaplan on 3 Dec 2005 1:27 PM:

Hard to say, I am running XP SP2, Server 2003 SP1, and random daily builds of Vista, and I almost never have problems with things working well in IE.

(note that the search bug is a problem with SEARCH and the other bug is ours -- so none of them are IE bugs)

So I think I will stick with IE until I actually have a reason to switch....

# Michael S. Kaplan on 3 Dec 2005 1:29 PM:

But beyond that, Syrian -- I guess I am the target of the criticism since I am the primary pontificator on international issues. But I blog about actual features and bugs in MS products along with all that pontification, and am on a team with hundreds of people on it -- so what makes you think MS does not take international issues seriously?
