Microsoft does not use the Unicode Collation Algorithm

by Michael S. Kaplan, published on 2004/11/28 04:10 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/11/28/271121.aspx


Robert A. Heinlein told a story in his book Expanded Universe back in 1980 (bear with me, I promise I'll be making a point eventually):

A few years ago, I was visited by an astronomer, quite young and brilliant. He claimed to be a long-time reader of my fiction and his conversation proved it. I was telling him about a time I needed a synergistic orbit from Earth to a 24-hour station; I told him the story it was in, he was familiar with the scene, mentioned having read the book in grammar school.

This orbit is similar in appearance to a cometary interplanet transfer but is in fact a series of compromises in order to arrive in step with the space station; elapsed time is an unsmooth integral not to be found in Hudson's Manual but it can be solved by the methods used on the Siacci empiricals for atmosphere ballistic: numerical integration.

I'm married to a woman who knows more math, history, and languages than I do. This should teach me humility (and sometimes does, for a few minutes). Her brain is a great help to me professionally. I was telling this young scientist how we obtained yards of butcher paper, then each of us worked three days, independently, solved the problem and checked each other -- then the answer disappeared into *one* line of *one* paragraph (SPACE CADET) but the effort had been worthwhile since it controlled what I could do dramatically in that sequence.

Dr Whoosis said "But *why* didn't you just shove it through a computer?"

I blinked at him. Then said slowly, gently, "My dear boy--" (I don't usually call PH.D.'s in hardcore sciences "My dear boy"--they impress me. But this was a special case.) "My dear boy... this was *1947*."

It took him some moments to get it, then he blushed....

It's a story that comes into my head every time I get a question these days that proves the person asking is not thinking about the fact that the passage of events has an influence on what is possible. Nowhere is this more true than with the subject of this posting -- people who wonder why Microsoft does not support the Unicode Collation Algorithm. People notice that Windows seems to have a similar framework and they assume that both of them use the same "default table" that works as a basis for all collations (in other words, they assume that Microsoft's collation is based on the Unicode sort weight tables).

The truth is quite different. Unicode's weights have been a part of the UCA, which was first a DRAFT Unicode Technical Report in March of 1997. It did not lose its DRAFT status until November of 1999 and was not a Unicode Technical Standard until August of 1999.

Windows, on the other hand, has had its architecture and its default table in place since NT 3.1 shipped, over a decade ago. How could it be based on the Unicode sort weight tables, which did not exist at that time even in draft form? The temptation to respond to the person asking with a "My dear boy..." (or "My dear girl...") is at times overwhelming!

As to the extra functionality, I'll just say that the past 15 years have seen a lot of language support added to Windows, and the expertise that has been applied to its collation support is truly amazing. It's a daunting piece of functionality to work on at times, given how well it has performed over the years. :-)

From a philosophical perspective, collation in Windows has always been based primarily on the linguistic data that is at its core -- the technical issues have always been driven by the data, not the other way around. I think this is a unique strength of the implementation that allows it to outperform others across a range of languages that is also (in my opinion) far superior. The tables were certainly built up with an entirely different linguistic and development philosophy, and ignoring my opinions about which is better, the data of either one would really be a poor fit for the other.

It is of note (well, to me at least!) that at the last two Unicode Technical Committee meetings, several decisions were made which will cause future versions of the UCA's default table to behave more like Microsoft's. This is not because it's Microsoft's way (we give advice about principles for the UCA but really do not innovate for it since we are not using it to come up with innovations) but because one of the authors of the UCA suggested tweaks to the UCA behavior based on expert advice and user feedback. I guess that means we had the right idea, huh? :-)


# Capt. Jean-Luc Pikachu on 28 Nov 2004 1:57 PM:

Thanks for the anecdote...

referenced by

2012/07/16 if you see a ZWNBSP in the Release Preview, don't be insensitive and comment it hasn't been eating enough lately!

2011/06/21 The downside of managing to go native...

2010/12/16 You can't ignore crap and hope it won't cause problems...

2010/11/09 I [will have] told you so! Well, perhaps too late (all things considered)...

2010/08/17 It would be like spelling it Anerica or something.

2010/05/06 Dude! Not so Lao'd!

2009/02/04 The road to hell is paved with attempts at being compatible

2008/02/10 Microsoft still does not use the UCA; the converse is also true

2007/10/29 Microsoft is a Form 'C' shop, Part 1

2005/12/23 What Unicode version do you support?

2005/11/03 My own personal thoughts about collation in the Mono project

2005/10/17 Comparing Unicode file names the right way

2005/07/18 MSLU isn't perfect

2005/05/24 Encoding scheme, encoding form, or other

2005/02/12 Why/how MSLU came to be, and more

2004/12/29 Comparison confusion: INVARIANT vs. ORDINAL

2004/12/08 Where is the locale? "Its Invariant." In where?
