The jury will give this string no weight

by Michael S. Kaplan, published on 2005/01/18 07:54 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2005/01/18/355210.aspx

(the title was inspired by a decade and a half of Law & Order on NBC, then A&E, and now TNT!)

I don't want to knock collation on Windows, because I think it rocks. It covers a lot of territory, and it gets the job done (and done well) in a lot of the world. But every once in a while you may find yourself on the bleeding edge of what is supported, and it is important when you are on the bleeding edge to keep from wounding yourself. Thus I am going to talk about one of those edge cases now....

It starts with the NLS APIs that handle collation. When you use the CompareString API to compare two strings or the LCMapString API with the MAP_SORTKEY flag to get the sortkey of one string, an important bit of NLS architecture is involved. That bit is the weight tables that these NLS APIs use to make linguistic comparisons.

The weights are something I discussed a bit in a previous post entitled "How do sort keys work?", and this post is going to talk a little bit more about those weights.

The main problem is that although the weight tables that are used by Windows and the .NET Framework are great for all of the languages and scripts that Windows supports, they are not nearly as useful for code points that have no weight in those tables.

There are many reasons for a code point to have no weight. It may actually not be a valid encoded Unicode code point, in which case it would be expected to have no weight.

Or it may be a code point that was not encoded in Unicode until after a given version of the operating system shipped (in which case it will have no weight in that version, since we do not have clairvoyants on staff!).

Or finally (and this is the one that kind of sucks a bit) it may not have been added to our tables yet. So....

If you try to compare strings containing (for example) Tibetan script on any shipping version of Windows, they will all be considered equal to each other. If you try to get sort keys for them, you will see that they have no weight. Therefore any kind of linguistic comparison will not return useful results; all strings will be equal. And this will happen even though the strings may not be the same length!
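To make that behavior concrete, here is a minimal sketch of a toy collation model in Python. It is not the Windows implementation: the table `TOY_WEIGHTS`, its values, and the helpers `toy_sort_key` and `toy_compare` are all hypothetical names invented for illustration. The only thing it tries to show is that when a comparison is built from weights, characters with no weight contribute nothing, so strings of different lengths can still come out equal.

```python
# Toy model only: the weights below are made up for illustration and are
# not the values found in the real Windows weight tables.
TOY_WEIGHTS = {"a": 2, "b": 3, "c": 4}  # hypothetical; every other character is weightless


def toy_sort_key(text):
    """Build a sort key from only the characters that have a weight."""
    return tuple(TOY_WEIGHTS[ch] for ch in text if ch in TOY_WEIGHTS)


def toy_compare(left, right):
    """Return -1, 0 or 1 by comparing sort keys rather than raw code points."""
    key_left, key_right = toy_sort_key(left), toy_sort_key(right)
    return (key_left > key_right) - (key_left < key_right)


# Two Tibetan strings of different lengths; neither has an entry in the toy table.
short_text = "\u0f40\u0f42"        # 2 code points
long_text = "\u0f56\u0f7c\u0f51"   # 3 code points

print(toy_sort_key(short_text))             # an empty key: no weights at all
print(toy_compare(short_text, long_text))   # 0, i.e. "equal" despite the lengths
print(toy_compare("ab", "ac"))              # weighted characters still order normally
```

In this sketch a weightless character is simply skipped, which is why mixing one into a weighted string does not change the result either.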

There are probably some developers around right now who are objecting to that last point, but I'll give a counterpoint. Let us say that you are comparing "hello" (U+0068 U+0065 U+006c U+006c U+006f) and "hëllô" (U+0068 U+0065 U+0308 U+006c U+006c U+006f U+0302) using CompareString with the NORM_IGNORENONSPACE flag. You would expect them to be considered equal since you are ignoring diacritics, which means "give the diacritics no weight", even though the length of the two strings is different. So the length is not important -- what is important is that the weights on the two strings are the same.
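The same idea can be sketched outside of Windows. The Python snippet below is only a rough analogue of ignoring nonspacing characters, not a reproduction of what CompareString does internally: it assumes that "ignoring diacritics" can be approximated by dropping every character in the Unicode category Mn (nonspacing mark) before comparing. The helper names `strip_nonspacing` and `equal_ignoring_nonspace` are hypothetical.

```python
import unicodedata


def strip_nonspacing(text):
    """Decompose the text, then drop every nonspacing mark (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def equal_ignoring_nonspace(left, right):
    """Compare two strings after giving nonspacing marks 'no weight'."""
    return strip_nonspacing(left) == strip_nonspacing(right)


plain = "hello"
# "hello" with U+0308 COMBINING DIAERESIS after the e and
# U+0302 COMBINING CIRCUMFLEX ACCENT after the o.
accented = "he\u0308llo\u0302"

print(len(plain), len(accented))                 # the lengths differ
print(plain == accented)                         # a binary comparison sees a difference
print(equal_ignoring_nonspace(plain, accented))  # the marks are ignored
```

The lengths differ by two code points, yet the comparison that ignores the marks treats the strings as the same, which is the point being made about weights versus length.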

You'll get the same results if you try to compare strings in other scripts that do not yet have weight (such as Yi Syllables or Khmer).

And in Longhorn we plan to give everything that is defined some type of default weight, at least.

On a side note, the original version of the post included a bunch of Tibetan strings in it, but .Text actually fails to post when that text is there (it probably has trouble with those "weightless" strings in its parsing logic?). This only affected the initial post; I was able to edit after the post and add characters (like the sponsor line). Weird bug....

Because with (a) MSKLC available, (b) a publicly defined OpenType spec, and (c) custom cultures coming in the "Whidbey" release of Visual Studio and the .NET Framework, Microsoft is clearly working to try and "get out of the way" of those who do not want to wait for us to support their language. Such people are right; we should get out of their way. And this is yet another step in that process to help enable them.

And yes, there will be more on these plans in future posts, especially as Beta 2 of VS 2005 and Beta 3 of SQL Server 2005 make it out into the world, and then especially as more gets said about the "Longhorn" release of Windows. Stay tuned... because it's gonna keep being interesting. :-)

SQL Server 2002? I must have missed that one ;-)

I take it that SQL Server 2000 on Windows XP or Server 2003 when configured for Windows collation _will_ use the weights for the CJK Unified Ideographs extensions. I'm surprised that SQL Server collations are being extended for SQL Server 2005 - I thought they were deprecated in favour of using the OS support, remaining only for backwards compatibility reasons.

Collations can be something of a nightmare on SQL Server at times - there have been many occasions where we've developed a system and then discovered in deployment that the end customer has a different default collation. If you've not explicitly specified the collation for columns in temporary tables (which I think we now always do - I've been trying to discourage use of temporary tables), you can get collation mismatch errors. This seems to be a particular problem if one site has a SQL collation selected and the other a Windows collation - even if one is SQL_Latin1_General_CP1_CI_AS and the other Latin1_General.

I've even seen problems where Setup will select one collation if you use the Default setup options, but offer you a different default if you select a Custom install. I forget which way round it is.