A&P of Sort Keys, part 8 (aka You can often think of ignoring weights as a form of ignorance)

by Michael S. Kaplan, published on 2007/09/18 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/18/4971376.aspx

Previous posts in this series:

You should think of Part 8 as kind of a seventh inning stretch in this series, where I sit back and you sit back and I impart some of the wisdom I have acquired over the years I have used, worked on, owned, and then assisted with as a "developer emeritus" the collation functionality in Windows.

I'm also going to take potshots at some product designs, mainly pointing at a single Microsoft product when in reality it applies to many MS and non-MS products....

Feel free to disregard it

Feel free to disagree with it.

But don't skip reading it, because I am right. :-)

The issue is one that causes at least half of the possible collation operations done in products like SQL Server to be wrong.

It is one I have discussed previously, in posts like 'Which comes first?' vs. 'Are they equal?'.

The central problem is that IGNORING WEIGHTS in ordering decisions is just plain ignorant.

Just like I pointed out how If you don't always preserve case, you don't always preserve meaning, distinctions in the data should not be ignored, or folded away, or treated as un-needed. They are distinctions -- in case, in width, in diacritics, in Kana. When you order the data, you should keep those distinctions in mind, as ignoring them is requesting that the items be put in random, non-deterministic order.

In short, it is ignorant.

Of course, when doing identity checks, it can make sense to ignore distinctions.

Even when querying for a subset of the data, it can also make sense as an option.

Neither of those two uses is ignorant.

But ordering data (even the data that comes from that subset query) and ignoring distinctions is ignorant.

So why I was on picking on SQL Server a minute ago?

Well, I guess I am picking on most databases here, not just SQLS. Since most of them:

Let's ignore the IGNORANT nature of the first and second of those bullet points, since the reasons should pretty obvious.

And also they don't apply to SQL Server except perhaps in some earlier, lamer versions that came along when it was less of a product that it has managed to become.

Instead let's think about that third bullet point.

This one an my answer don't (strictly speaking) apply to SQL Server since they don't use our sort keys for their indexes (as I pointed out in When good SQL queries have trouble...), they build their own using their borrowed CompareString call. Though you could always hold them responsible too if their solution does not allow what I am about to suggest. :-)

Now if you look back on that very first post, Part 0, where I gave how the sort key looked:

[all Unicode sort weights] 01 [all Diacritic weights] 01 [all Case weights] 01 [all Special weights] 01 [Punctuation weights] 00

Passing NORM_IGNORENONSPACE or NORM_IGNOREWIDTH/NORM_IGNORECASE or NORM_IGNOREKANA is telling LCMapString/LCMapStringEx with the LCMAP_SORTKEY flag to basically take the diacritic or case or special weights and just skipping them.

In other words, when you pass those flags, everything relevant between the appropriate 01 sentinels is removed.

Now if in SQL Server you use the method I talked about in Making SQL Server index usage a bit more deterministic, which is essentially the only way around the problem of supporting two different methods of indexing, you are STILL required to have separate indexes, even if one of the indexes is a literal and complete subset of another (i.e. an index that ignores case and diacritics vs. one that denies neither).

When there is a clear way to use the same index value and not create the huge space burden of mainly duplicated indexes via a non-intuitive and not very well documented syntax with no good user interface support?

Ignoring the multi-language issue for a moment (since those indexes are usually not subsets), providing an intuitive way to support the different views of the data for identity/subset vs. ordering operations via a single index is actually more work to decide how to expose the feature to customers then it would be to provide the technical solution....

Hmmm.... some features to think about for future versions, huh? :-)

On the other hand, they are still at least half a hundred collations behind Windows on the language side with some terrible defaults in the Unicode collation side, let alone providing proper results for scenarios that are only well-understood by regular readers of collation/case posts in this blog. So I don't think we're talking about next version here.

But one day, ignorance of collation can perhaps be cured in products....


This post brought to you by 8 (U+0038, a.k.a. DIGIT EIGHT)

no comments

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2008/08/21 A&P of Sort Keys, part 14: The Hangul is really getting OLD

2007/10/09 A&P of Sort Keys, part 13 (About the function that is too lazy to get it right every time)

2007/10/08 Wanna take a look at the collation bug Kim pointed out? You're in for a hell of a Tripp!

2007/10/08 A&P of Sort Keys, part 12 (aka Han sorts first!)

2007/09/24 A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

2007/09/21 A&P of Sort Keys, part 10 (aka I've kana wanted to start talking about Japanese)

2007/09/20 A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating)

go to newer or older post, or back to index or month or day