Microsoft does not use the Unicode Collation Algorithm

by Michael S. Kaplan, published on 2004/11/28 04:10 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/11/28/271121.aspx

Robert A. Heinlein told a story in his book Expanded Universe back in 1980 (bear with me, I promise I'll be making a point eventually):

It's a story that comes into my head every time I get a question these days that proves the person asking is not thinking about the fact that the order of events has an influence on what is possible. Nowhere is this more true than in the subject of this posting -- people who wonder why Microsoft does not support the Unicode Collation Algorithm. People notice that Windows seems to have a similar framework, and they assume that both use the same "default table" that works as a basis for all collations (in other words, they assume that Microsoft's collation is based on the Unicode sort weight tables).

The truth is quite different. Unicode's weights have been a part of the UCA, which was first a DRAFT Unicode Technical Report in March of 1997. It did not lose its DRAFT status until November of 1999 and was not a Unicode Technical Standard until August of 1999.

Windows, on the other hand, has had its architecture and its default table in place since NT 3.1 shipped, over a decade ago. How could it be based on the Unicode sort weight tables, which did not exist at that time even in draft form? The temptation to respond to the person asking with a "My dear boy..." (or "My dear girl...") is at times overwhelming!

As to the extra functionality, I'll just say that the past 15 years have seen a lot of language support added to Windows, and the expertise that has been applied to its collation support is truly amazing. It's a daunting piece of functionality to work on at times, given how well it has performed over the years. :-)

From a philosophical perspective, collation in Windows has always been based primarily on the linguistic data that is at its core -- the technical issues have always been driven by the data, not the other way around. I think this is a unique strength of the implementation, one that allows it to outperform others across a range of languages that is also (in my opinion) far superior. The two sets of tables were certainly built up with entirely different linguistic and development philosophies, and setting aside my opinions about which is better, the data of either one would really be a poor fit for the other.

It is of note (well, to me at least!) that at the last two Unicode Technical Committee meetings, several decisions were made that will cause future versions of the UCA's default table to behave more like Microsoft's. This is not because it's Microsoft's way (we give advice about principles for the UCA but really do not innovate for it, since we are not using it to come up with innovations) but because one of the authors of the UCA suggested tweaks to the UCA behavior based on expert advice and user feedback. I guess that means we had the right idea, huh? :-)

2012/07/16 if you see a ZWNBSP in the Release Preview, don't be insensitive and comment it hasn't been eating enough lately!

2011/06/21 The downside of managing to go native...

2010/12/16 You can't ignore crap and hope it won't cause problems...

2010/11/09 I [will have] told you so! Well, perhaps too late (all things considered)...

2010/08/17 It would be like spelling it Anerica or something.

2010/05/06 Dude! Not so Lao'd!

2009/02/04 The road to hell is paved with attempts at being compatible

2008/02/10 Microsoft still does not use the UCA; the converse is also true

2007/10/29 Microsoft is a Form 'C' shop, Part 1

2005/12/23 What Unicode version do you support?

2005/11/03 My own personal thoughts about collation in the Mono project

2005/10/17 Comparing Unicode file names the right way

2005/07/18 MSLU isn't perfect

2005/05/24 Encoding scheme, encoding form, or other

2005/02/12 Why/how MSLU came to be, and more

2004/12/29 Comparison confusion: INVARIANT vs. ORDINAL

2004/12/08 Where is the locale? "Its Invariant." In where?