Extending collation support in SQL Server and Jet, Part 0 (HISTORY)

by Michael S. Kaplan, published on 2005/09/13 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/13/463962.aspx

Folks on the Microsoft Jet [red] team were really worried about the Jet reliance on the OS collation functions for CJK sorts, since they would give different results across several of the different supported versions of Windows.

To solve this problem, a Program Manager on the Microsoft Jet [red] team staged a raid on the collation data used by Windows. The data was used to create a solution that would give consistent results across all platforms. It would fold together all of the collations that gave identical results (thus no need for separate entries for both Norwegian and Danish, or for Swedish and Finnish, etc.). The project code name was Unicorn, and it first shipped in Access 2000 as a DLL named MSWSTR10.DLL.
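To give a rough idea of what "folding together" means, here is a minimal sketch in Python. It is not the Unicorn code (which I am not reproducing here); the fold_collations helper and the weight tables are made up purely for illustration, on the assumption that each collation can be reduced to a table of per-character weights and that two collations with identical tables can share one entry.

```python
from collections import defaultdict


def fold_collations(tables):
    """Group locale names whose (hypothetical) weight tables are identical."""
    groups = defaultdict(list)
    for locale, weights in tables.items():
        # Freeze the table so that it can be used as a dictionary key.
        key = tuple(sorted(weights.items()))
        groups[key].append(locale)
    # Sort for a stable, readable result.
    return sorted(sorted(names) for names in groups.values())


# Made-up weights, for illustration only; not real Windows collation data.
tables = {
    "Norwegian": {"a": 1, "z": 26, "æ": 27, "ø": 28, "å": 29},
    "Danish": {"a": 1, "z": 26, "æ": 27, "ø": 28, "å": 29},
    "Swedish": {"a": 1, "z": 26, "å": 27, "ä": 28, "ö": 29},
    "Finnish": {"a": 1, "z": 26, "å": 27, "ä": 28, "ö": 29},
}

folded = fold_collations(tables)
print(folded)
```

With these made-up tables, Norwegian and Danish land in one group and Swedish and Finnish in another, so four locale names need only two collation entries.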

The folks in SQL Server, who were facing the same problem, took the Jet [Red] Unicorn solution and, in SQL Server 2000, shipped something they called SQLSORT.DLL.

In the fancy tradition of the Jet [red] API, the exports for both of these DLLs had their names stripped. And both versions of the solution made their way out into the world, in Office 2000 and SQL Server 2000.

First of all, when I said that the PM staged a raid, I was not in any way exaggerating the point. It was done without our knowledge. I know that because he did it just after Windows 2000 Beta 1 and before Windows 2000 Beta 2. It was during that weird period just after then-SDE Julie Bennett had once again tried to fix the Turkic 'I' problem and just before she was forced to take the fix back out due to backcompat breaks. And it was also just after all of the DEFAULT TABLE work for Indic and other languages was added but just before most of the necessary EXCEPTION and COMPRESSION data was added for those languages. But since we were not told that the data was being borrowed, we could not warn them to pick up the update to the data that would make the proper results available.

Because high speed of the underlying sorting functions is essential to the efficient operation of Database products, the Windows NT code was substantially optimized when it was ported. For most cases the MSWSTR10.DLL functions are about 50% faster than the Windows NT equivalent functions, but for some languages such as Thai the speed improvement is much, much higher.

I am sure that folks who wanted a 50-300% speed improvement in languages that use compressions (which is where most of the optimization was done) would have appreciated having the issue communicated back to the team that provided the code and the data. However, when you swipe a wallet you probably don't warn the victim that their fly was open.... :-)

In other words, the whole project can probably serve as a textbook example of why teams need to work together, in collaboration with each other. Because if they do not, then in the end everyone suffers....

Well, FWIW, that situation has long since been fixed on the SQL Server side -- we now do work in collaboration with folks to provide proper solutions, and in part because of that cooperative spirit SQL Server 2005 will ship with many updated collations based on the Windows Server 2003 data (including the proper support for all of those languages that were missed the first time around in Windows 2000).

Now it does not help with (for example) all of the new ELK language support that has been added (as discussed here and here) -- none of that is in Yukon (more on that in a second).

For the Jet side things are not as good as that, since there is no Jet [Red] update (even Access 2003 still ships with Jet 4.0, just like Access 2000 and 2002 did) to pick up fixes to those problems. So Access/Jet basically has those older tables, missing support for at least 40 languages as of those two ELK releases.

(ASIDE: I do keep calling it Jet [Red] to distinguish it from the Jet [Blue] engine that actually still does call our collation functions and never went in for that snapshot stuff -- they ship with the OS and need to support every language the OS does. And I promise that if Brett Shirley ever starts blogging that I will be reading it!)

In the meantime, people have noticed this problem. We claim that Hindi has been supported since Windows 2000, but a Hindi speaker tries to use Access or SQL Server and sees that there is no good collation support for it. Or they get excited about the Quechua or Mapudungun or Maltese support we added in ELKs but again neither Access nor SQL Server seems to show that such support exists. And there is no way to look at the upcoming language list for Longhorn and the new locales being added and not get downright depressed about this whole issue, and the fact that as we get more agile in Windows we are starting to make these other products look worse thereby.

Anyway, I have had several people talk to me lately, since these new ELK languages have been coming out and since even Vista Beta 1 has an impressive list of languages added (covered at the recent Internationalization and Unicode Conference by Ning Jin-Grisaffi and Kieran Snyder, including lots of detail on the Tibetan, Mongolian, Uighur, and Yi support!). They want to know how to get support for these languages in either Jet or SQL Server (or both), when running on Vista.

Now most of the above was written over the last few months as I worked to provide the answer to that question -- which was going to be that there was no answer, unfortunately. Sorry, go complain to those products, it is their mess.

However, I then figured out a solution (well, several possible solutions) that would actually be able to provide assistance in these scenarios. And thus this post series (Extending collation support in SQL Server and Jet) was born. You can consider this post to be -- as the title indicates -- Part 0, the historical aspects. As I am sure you can imagine, since I am promising solutions (and particularly considering how bleakly the historical picture has been painted) there is nowhere to go but up.... and the going back up part is going to be a lot of fun.

So stay tuned, and if you care about being able to extend the language support of these database products then prepare to have your socks knocked off!

This comment brought to you by "ᕣ" (U+1563, a.k.a. CANADIAN SYLLABICS N-CREE THII) which wants to remind readers everywhere that just because it says the right thing on a link doesn't mean the author actually set the destination of the link correctly.

I'm posting this link because I want to tell you the Access team has forked the Jet 4.0 engine and modified it for Access 12 so you can complain to the Access team.

Aha, yes -- of course the collation code/data is probably not well known to them. An uphill battle, to say the least.... :-(

BTW, doing this is like taking a CVS snapshot of a library, and linking it to your program.


2007/11/06 The more you understand, the more cynical you may become

2007/10/08 A&P of Sort Keys, part 12 (aka Han sorts first!)

2007/05/01 Not everyone does the right thing for Romanian

2006/08/26 The myth of cross-product compatibility

2005/10/21 Extending collation support in SQL Server and Jet, Part 4 (What about Jet?)

2005/10/09 Extending collation support in SQL Server and Jet, Part 3 (THAT CLASS)

2005/09/25 Extending collation support in SQL Server and Jet, Part 2.1 (is this on?)

2005/09/18 Extending collation support in SQL Server and Jet, Part 2 (generating sort keys)

2005/09/14 Extending collation support in SQL Server and Jet, Part 1 (the broad strokes)