The jury will give this string no weight

by Michael S. Kaplan, published on 2005/01/18 07:54 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2005/01/18/355210.aspx


(the title was inspired by a decade and a half of Law & Order on NBC, then A&E, and now TNT!)

I don't want to knock collation on Windows, because I think it rocks. It covers a lot of territory, and it gets the job done (and done well) in a lot of the world. But every once in a while you may find yourself on the bleeding edge of what is supported, and it is important when you are on the bleeding edge to keep from wounding yourself. Thus I am going to talk about one of those edge cases now....

It starts with the NLS APIs that handle collation. When you use the CompareString API to compare two strings or the LCMapString API with the MAP_SORTKEY flag to get the sortkey of one string, an important bit of NLS architecture is involved. That bit is the weight tables that these NLS APIs use to make linguistic comparisons.

The weights are something I discussed a bit in a previous post entitled How do sort keys work? and this post is going talk a little bit more about those weights.

The main problem is that although the weight tables that are used by Windows and the .NET Framework are great for all of the languages and scripts that Windows support, they are not quite as useful when the weights are not present.

There are many reasons for a code point to have no weight. It may actually not be a valid encoded Unicode code point, in which case it would be expected to have no weight.

Or it may be a code point that was not encoded in Unicode until after the operating system shipped a version (in which case it will have no weight since we do not have clairvoyants on staff!).

Or finally (and this is the one that kind of sucks a bit) it may not have been added to our tables yet. So....

If you try to compare strings containing (for example) Tibetan script on any shipping version of Windows, they will all be considered equal to each other. If you tried to get sort keys for them then you will see that they have no weight. Therefore any kind of linguistic comparison will not return useful results; all strings will be equal. And this will happen even though the strings may not be the same length!

There are probably some developers around right now who are objecting to that last point, but I'll give a counterpoint. Let us say that you are comparing  "hello" (U+0068 U+0065 U+006c U+006c U+006f) and "hëllô" (U+0068 U+0065 U+0302 U+006c U+006c U+006f U+030a) using CompareString with the NORM_IGNORENONSPACE flag. You would expect them to be considered equal since you are ignoring diacritics, which means "give the diacritics no weight", even though the length of the two strings is different. So the length is not important -- what is important is that the weights on the two strings are the same.

You'll get the same results if you try to compare strings in other scripts that do not yet have weight (such as Yi Syllables or Khmer).

It will happen in Windows 2000 and in SQL Server 2000 with CJK Unified Ideographs Extension A (1.5MB) or CJK Unified Ideographs Extension B (13MB) -- though we addressed this in Windows XP and Windows Server 2003, and there are new collations in SQL Server 2005 that give these ideographs some weight as well.

And in Longhorn we plan to give everything that is defined some type of default weight, at least.

On a side note, the original version of the post included a bunch of Tibetan strings in it, but .Text actually fails to post when that text is there (it probably has trouble with those "weightless" strings in its parsing logic?). This only affected the initial post; I was able to edit after the post and add characters (like the sponsor line). Weird bug....

Because with (a) MSKLC available, (b) a publicly defined OpenType spec, and (c) custom cultures coming in the "Whidbey" release of the Visual Studio  and the .NET Framework, Microsoft is clearly working to try and "get out of the way" of those who do not want to wait for us to support their language. Such people are right; we should get out of their way, And this is yet another step in that process to help enable them.

And yes, there will be more on these plans in future posts, especially as Beta 2 VS 2005 and Beta 3 of SQL Server 2005 make it out into the world, and then especially as more gets said about the "Longhorn" release of Windows. Stay tuned... because it's gonna keep being interesting. :-)

This post sponsored by "ག" (U+0f42, a.k.a. TIBETAN LETTER GA)


# Mike Dimmick on Tuesday, January 18, 2005 9:08 AM:

SQL Server 2002? I must have missed that one ;-)

I take it that SQL Server 2000 on Windows XP or Server 2003 when configured for Windows collation _will_ use the weights for the CJK Unified Ideographs extensions. I'm surprised that SQL Server collations are being extended for SQL Server 2005 - I thought they were deprecated in favour of using the OS support, remaining only for backwards compatibility reasons.

Collations can be something of a nightmare on SQL Server at times - there have been many occasions where we've developed a system and then discovered in deployment that the end customer has a different default collation. If you've not explicitly specified the collation for columns in temporary tables (which I think we now always do - I've been trying to discourage use of temporary tables), you can get collation mismatch errors. This seems to be a particular problem if one site has a SQL collation selected and the other a Windows collation - even if one is SQL_Latin1_General_CP1_CI_AS and the other Latin1_General.

I've even seen problems where Setup will select one collation if you use the Default setup options, but offer you a different default if you select a Custom install. I forget which way round it is.

# Michael Kaplan on Tuesday, January 18, 2005 9:12 AM:

Actually, SQL Server uses a snapshot of data prior to Windows 2000 RTM, even for Windows collations; they do not currentl use the OS.

So SQL Server 2000 will not ever give these characters weight, unless you use one of the binary collations.

I'll post more about this and the reasons for it another day.

# joe on Tuesday, January 18, 2005 9:20 AM:

Could you say more about getting out of the way?

I see it is in your bio as one of the aspects of your job and I always wondered what it meant. Now you hint here about it, but it seems worth some more description?

# Michael Kaplan on Tuesday, January 18, 2005 9:21 AM:

That is a great idea for another day, joe!

# Michael Kaplan on Tuesday, January 18, 2005 9:23 AM:

I was going to include them in the original post as I said, but .Text did not like them. Here they are, just so you have them.

ཀ་ཅོག་ཞང་གསུམ།
ཀ་གཅིག་སྒོ་གཅིག་མ།
ཀ་གཅིག་ལྕམ་གང་མ།
ཀ་གཅིག་མ།
ཀ་ཆ།
ཀ་ཆུག།
ཀ་ཆེན་བཞི།
ཀ་འཇའ་ལ།
ཀ་གཉིས་པ།

:-)

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2013/11/18 The price, and more importantly the *cost*, of Social Media can be distracting...

2011/12/12 SharePoint and CJK Extensions A, B, C, D, and even E?

2010/11/01 The consequences of being unintuitive and nonconformant

2010/06/11 Call it Reversible Error, aka Yes it has no weight; it was supposed to have no weight!

2009/02/04 The road to hell is paved with attempts at being compatible

2008/04/09 Fight the Future? (#11 of ??), aka Microsoft is giving this character nada weight but lotsa importance

2008/03/16 On changing the world, or at least the way people order things in it

2007/12/07 If it isn't really Tibetan, could it pinch hit for Burmese?

2007/10/08 The roadꂸ to the solution starts with identifying the actual problem.NET

2007/09/24 A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

2007/09/14 A&P of Sort Keys, part 4 (aka It isn't a race but let's make an EXCEPTION and cross the Finnish line)

2007/08/28 Every character has a story #29: U+1000^H^H^H^H0f40, (TIBETAN or MYANMAR LETTER KA, depending on when you ask)

2007/06/07 Putting the camel's nose in Building 24

2007/06/02 Objection, managed code! That zero is leading!

2007/05/12 The exception that proves the rule that was the exception that proves another rule (aka On the variability of the Invariant)

2007/04/30 Giving Yi the weight it deserves

2007/04/19 Search and ye shall find, SIAO style!

2007/04/14 Rhymes with Amharic #4 (a.k.a. we're all [sub]set so turning out the lights and going to [em]bed!)

2007/04/14 Rhymes with Amharic (a.k.a. How about a little breakfast embed, dear?)

2007/03/04 String Indexing?

2007/02/18 He had the strength of an OX[IA], I tell you

2007/02/04 So how does that Naqittaut keyboard work, exactly?

2007/02/03 lo lo lo lo lo-LA

2006/12/27 If you decompose those city elders, you might be able to sort them out!

2006/12/26 The city elders won't give this string weight, either (aka On being consistently dead wrong, aka Ordinal or bust?)

2006/11/19 Even the characters with no weight can be given weight in their own special way

2006/09/25 Why don't all the half forms sort right?

2006/09/06 IsSortable() == false? Well, sometimes it may be lying....

2006/08/15 Did software developers ever learn their ABC's?

2006/05/24 Invariant vs. Ordinal, the third

2006/03/25 I need my SPACE, symbolically speaking

2006/03/09 Don't forget to test the demos

2006/01/18 What do U+0223 and U+0657 have in common?

2005/11/03 My own personal thoughts about collation in the Mono project

2005/09/14 Extending collation support in SQL Server and Jet, Part 1 (the broad strokes)

2005/08/07 New in Vista Beta 1: giving more strings weight

2005/07/20 More on sort elements

2005/06/28 The 'grammar' of identifiers

2005/05/05 A few of the gotchas of CompareString

2005/04/26 Intelligent unmanaged string comparison

2005/04/03 TechEd Bloggers does not work for this site?

2005/02/03 What makes a string meaningful?

2005/02/01 Why that is positively Ethiopic!

2005/01/19 Not all characters are created equal: take SYMBOLS, for example

go to newer or older post, or back to index or month or day