When even the bugs seem cool

by Michael S. Kaplan, published on 2005/12/03 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/12/03/499419.aspx

Back in the July 2005 post entitled Getting Intermediate Forms, I talked about these middle forms that are not quite Unicode normalization Form C and not quite Form D.

By the rules of Unicode it should not matter -- all three have a canonical equivalence, which means that all of them must be treated as if they were the same string.

Of course as I pointed out in my Sorting It All Out to Search Engines post in November 2005, not everyone seems to be following those rules just yet -- I'd say that Google is probably the worst offender given that it is the de facto web search engine in so many people's minds, but Microsoft is a good runner-up as the worst offender since it is trying to make a name for itself here, too.

Note that they are both members of the Unicode Consortium, though currently Google is only an Associate member so perhaps they do not feel as invested in using the Unicode Standard just yet. :-)

(But anyway, that is a topic for another day. And believe me when I say it is for the Microsoft search folks -- their bug is not the cool one, here!)

And it has 100% respect for the Unicode rules about canonically equivalent strings and doing that complete conversion without the intermediate forms between normalization forms C and D.

U+3131 (ㄱ, HANGUL LETTER KIYEOK) has a compatibility equivalence to U+1100 (ᄀ, HANGUL CHOSEONG KIYEOK).

U+314f (ㅏ, HANGUL LETTER A) has a compatibility equivalence to U+1161 (ᅡ, HANGUL JUNGSEONG A).

And 가 (U+1100 U+1161), if converted to normalization form C, becomes 가 (U+ac00, the first precomposed Hangul syllable).
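For anyone who wants to see where U+ac00 comes from, here is a minimal sketch of the arithmetic that the Unicode Standard's Hangul syllable composition algorithm uses for a leading consonant plus vowel pair. The constants come from the standard; the function name compose_lv is made up for this illustration and is not from any Microsoft code.

```python
# Constants from the Unicode Standard's Hangul syllable composition algorithm.
S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first leading consonant (choseong)
V_BASE = 0x1161   # first vowel (jungseong)
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_lv(lead: str, vowel: str) -> str:
    """Compose one leading consonant and one vowel jamo into an LV syllable.

    compose_lv is a hypothetical helper name used only for this sketch.
    """
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("expected a modern choseong followed by a jungseong")
    # Each (lead, vowel) pair owns a run of T_COUNT code points:
    # one for "no trailing consonant" plus one per trailing consonant.
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)


syllable = compose_lv("\u1100", "\u1161")
print(f"U+{ord(syllable):04X}")  # prints U+AC00
```

Since U+1100 and U+1161 are both the first entries in their ranges, both indexes are zero and the result is simply the base of the syllable block.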

Now if you convert U+3131 U+314f to normalization form KD, you will get U+1100 U+1161 as expected.

And now here comes the bug -- if you convert U+3131 U+314f to normalization form KC, you also get U+1100 U+1161, rather than U+ac00.

Of course if you convert again (either to KC or C) you will get U+ac00 as expected. And to be honest you probably shouldn't be using those compatibility forms anyway. Plus it is not like U+1100 U+1161 is actually an intermediate form or anything.
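To make the expected results concrete, here is a small sketch using Python's standard unicodedata module. This is not the Microsoft normalization code discussed here, just an independent implementation that can be used to check what the four forms ought to produce; the helper name code_points is made up for the example.

```python
import unicodedata


def code_points(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


compat = "\u3131\u314f"  # HANGUL LETTER KIYEOK + HANGUL LETTER A

nfkd = unicodedata.normalize("NFKD", compat)
nfkc = unicodedata.normalize("NFKC", compat)
nfc_of_nfkd = unicodedata.normalize("NFC", nfkd)

print(code_points(nfkd))         # prints U+1100 U+1161
print(code_points(nfkc))         # prints U+AC00
print(code_points(nfc_of_nfkd))  # prints U+AC00
```

The second line is the interesting one: a conformant NFKC goes all the way to U+ac00 in a single pass, which is exactly what the buggy code needed a second conversion to reach.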

It just struck me as kind of a cool bug when it was discovered, one where the developer who owned the code and a Unicode geek (me) were both able to understand what was going on before even looking at the code to see what the reason might be (I think the code owner probably even knew of the underlying problem before he looked at it -- this is just one of those kinds of bugs).

Anyway, it will be fixed in Vista and probably in the next Whidbey and IDN Tools updates. And I guess it was pretty obscure anyway. I still find this kind of bug to be cool for reasons I cannot completely define....

Ha ha ha. Check out abecedaria.blogspot.com

Suzanne is like your biggest fan or something and even she has to use Firefox instead of Internet Exploder.

Serves as an excellent example to motivate the question: When will Microsoft take international issues seriously inside their products instead of just pontificating at great length about them on blogs?

Hard to say, I am running XP SP2, Server 2003 SP1, and random daily builds of Vista, and I almost never have problems with things working well in IE.

(note that the search bug is a problem with SEARCH and the other bug is ours -- so none of them are IE bugs)

So I think I will stick with IE until I actually have a reason to switch....

But beyond that, Syrian -- I guess I am the target of the criticism since I am the primary pontificator on international issues. But I blog about actual features and bugs in MS products along with all that pontification, and am on a team with hundreds of people on it -- so what makes you think MS does not take international issues seriously?
