by Michael S. Kaplan, published on 2005/01/18 07:54 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2005/01/18/355210.aspx
(the title was inspired by a decade and a half of Law & Order on NBC, then A&E, and now TNT!)
I don't want to knock collation on Windows, because I think it rocks. It covers a lot of territory, and it gets the job done (and done well) in a lot of the world. But every once in a while you may find yourself on the bleeding edge of what is supported, and it is important when you are on the bleeding edge to keep from wounding yourself. Thus I am going to talk about one of those edge cases now....
It starts with the NLS APIs that handle collation. When you use the CompareString API to compare two strings or the LCMapString API with the MAP_SORTKEY flag to get the sortkey of one string, an important bit of NLS architecture is involved. That bit is the weight tables that these NLS APIs use to make linguistic comparisons.
The weights are something I discussed a bit in a previous post entitled How do sort keys work? and this post is going talk a little bit more about those weights.
The main problem is that although the weight tables that are used by Windows and the .NET Framework are great for all of the languages and scripts that Windows support, they are not quite as useful when the weights are not present.
There are many reasons for a code point to have no weight. It may actually not be a valid encoded Unicode code point, in which case it would be expected to have no weight.
Or it may be a code point that was not encoded in Unicode until after the operating system shipped a version (in which case it will have no weight since we do not have clairvoyants on staff!).
Or finally (and this is the one that kind of sucks a bit) it may not have been added to our tables yet. So....
If you try to compare strings containing (for example) Tibetan script on any shipping version of Windows, they will all be considered equal to each other. If you tried to get sort keys for them then you will see that they have no weight. Therefore any kind of linguistic comparison will not return useful results; all strings will be equal. And this will happen even though the strings may not be the same length!
There are probably some developers around right now who are objecting to that last point, but I'll give a counterpoint. Let us say that you are comparing "hello" (U+0068 U+0065 U+006c U+006c U+006f) and "hëllô" (U+0068 U+0065 U+0302 U+006c U+006c U+006f U+030a) using CompareString with the NORM_IGNORENONSPACE flag. You would expect them to be considered equal since you are ignoring diacritics, which means "give the diacritics no weight", even though the length of the two strings is different. So the length is not important -- what is important is that the weights on the two strings are the same.
You'll get the same results if you try to compare strings in other scripts that do not yet have weight (such as Yi Syllables or Khmer).
It will happen in Windows 2000 and in SQL Server 2000 with CJK Unified Ideographs Extension A (1.5MB) or CJK Unified Ideographs Extension B (13MB) -- though we addressed this in Windows XP and Windows Server 2003, and there are new collations in SQL Server 2005 that give these ideographs some weight as well.
And in Longhorn we plan to give everything that is defined some type of default weight, at least.
On a side note, the original version of the post included a bunch of Tibetan strings in it, but .Text actually fails to post when that text is there (it probably has trouble with those "weightless" strings in its parsing logic?). This only affected the initial post; I was able to edit after the post and add characters (like the sponsor line). Weird bug....
Because with (a) MSKLC available, (b) a publicly defined OpenType spec, and (c) custom cultures coming in the "Whidbey" release of the Visual Studio and the .NET Framework, Microsoft is clearly working to try and "get out of the way" of those who do not want to wait for us to support their language. Such people are right; we should get out of their way, And this is yet another step in that process to help enable them.
And yes, there will be more on these plans in future posts, especially as Beta 2 VS 2005 and Beta 3 of SQL Server 2005 make it out into the world, and then especially as more gets said about the "Longhorn" release of Windows. Stay tuned... because it's gonna keep being interesting. :-)
This post sponsored by "ག" (U+0f42, a.k.a. TIBETAN LETTER GA)
# Mike Dimmick on Tuesday, January 18, 2005 9:08 AM:
# Michael Kaplan on Tuesday, January 18, 2005 9:12 AM:
# joe on Tuesday, January 18, 2005 9:20 AM:
# Michael Kaplan on Tuesday, January 18, 2005 9:21 AM:
# Michael Kaplan on Tuesday, January 18, 2005 9:23 AM:
2013/11/18 The price, and more importantly the *cost*, of Social Media can be distracting...
2011/12/12 SharePoint and CJK Extensions A, B, C, D, and even E?
2010/11/01 The consequences of being unintuitive and nonconformant
2010/06/11 Call it Reversible Error, aka Yes it has no weight; it was supposed to have no weight!
2009/02/04 The road to hell is paved with attempts at being compatible
2008/04/09 Fight the Future? (#11 of ??), aka Microsoft is giving this character nada weight but lotsa importance
2008/03/16 On changing the world, or at least the way people order things in it
2007/12/07 If it isn't really Tibetan, could it pinch hit for Burmese?
2007/10/08 The roadꂸ to the solution starts with identifying the actual problem.NET
2007/09/24 A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)
2007/09/14 A&P of Sort Keys, part 4 (aka It isn't a race but let's make an EXCEPTION and cross the Finnish line)
2007/08/28 Every character has a story #29: U+1000^H^H^H^H0f40, (TIBETAN or MYANMAR LETTER KA, depending on when you ask)
2007/06/07 Putting the camel's nose in Building 24
2007/06/02 Objection, managed code! That zero is leading!
2007/05/12 The exception that proves the rule that was the exception that proves another rule (aka On the variability of the Invariant)
2007/04/30 Giving Yi the weight it deserves
2007/04/19 Search and ye shall find, SIAO style!
2007/04/14 Rhymes with Amharic #4 (a.k.a. we're all [sub]set so turning out the lights and going to [em]bed!)
2007/04/14 Rhymes with Amharic (a.k.a. How about a little breakfast embed, dear?)
2007/03/04 String Indexing?
2007/02/18 He had the strength of an OX[IA], I tell you
2007/02/04 So how does that Naqittaut keyboard work, exactly?
2007/02/03 lo lo lo lo lo-LA
2006/12/27 If you decompose those city elders, you might be able to sort them out!
2006/12/26 The city elders won't give this string weight, either (aka On being consistently dead wrong, aka Ordinal or bust?)
2006/11/19 Even the characters with no weight can be given weight in their own special way
2006/09/25 Why don't all the half forms sort right?
2006/09/06 IsSortable() == false? Well, sometimes it may be lying....
2006/08/15 Did software developers ever learn their ABC's?
2006/05/24 Invariant vs. Ordinal, the third
2006/03/25 I need my SPACE, symbolically speaking
2006/03/09 Don't forget to test the demos
2006/01/18 What do U+0223 and U+0657 have in common?
2005/11/03 My own personal thoughts about collation in the Mono project
2005/09/14 Extending collation support in SQL Server and Jet, Part 1 (the broad strokes)
2005/08/07 New in Vista Beta 1: giving more strings weight
2005/07/20 More on sort elements
2005/06/28 The 'grammar' of identifiers
2005/05/05 A few of the gotchas of CompareString
2005/04/26 Intelligent unmanaged string comparison
2005/04/03 TechEd Bloggers does not work for this site?
2005/02/03 What makes a string meaningful?
2005/02/01 Why that is positively Ethiopic!
2005/01/19 Not all characters are created equal: take SYMBOLS, for example
go to newer or older post, or back to index or month or day