The creation of sort keys does not always make sense

by Michael S. Kaplan, published on 2006/01/12 03:03 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/12/511884.aspx

I have talked a lot about sort keys in the past, like in this post and this one.

They are pretty cool from a conceptual standpoint, but they do have one flaw that I was just reminded of the other day.

They can actually take up a hell of a lot of space.

As I pointed out in The String is Freaking Empty!, the minimally sized weight for a string that has no meaningful content is

01 01 01 01 00

So there is a minimum five byte overhead, plus a minimum of two bytes per sort element and a maximum per sort element that is much higher once you add case and diacritic and width and special differences, you have one basic guarantee about a sort key -- that is will often be larger than the string.

It kind of makes the signature for LCMapString look a little bit silly, right? Attend me for a moment....

int LCMapString(
LCID Locale,       // locale identifier
DWORD dwMapFlags, // mapping transformation type
LPCTSTR lpSrcStr, // source string
int cchSrc,        // number of characters in source string
LPTSTR lpDestStr, // destination buffer
int cchDest        // size of destination buffer
);

And then of course there is that very important note about cchDest:

If the function is being used to generate a sort key, the size is a byte count. This byte count must include space for the sort key 0x00 terminator.

That one guarantee you have, that the sort key will often be larger than the original string, makes a cch for the source and a cb for the dest look silly, when the source takes up two bytes per character. A string that is even one gigabyte in size is likely to need more than a 2 gigabyte buffer, and unless the string is entirely lowercase text with no diacritics in it, it may need 3 gigabytes or even more.

Kind of makes you wonder why we bothered to confuse everyone and make cchDest mean a byte count for sort keys when we decided to confuse everyone in the opposite direction in GetLocaleInfo/SetLocaleInfo when you pass in larger INT and FONTSIGNATURE values. So why not be confusing in the same way when we could get some benefit out of it?

Of course, it is mildly immaterial since the notion of any kind of database that has rows that are over a gigabyte in size with index values that are 1.5-3 gigabytes or more per row in size is fairly unlikely.

But if you are looking at limits, the ones laid out for the function that creates sort keys are fairly absurd, since a sort key that is anywhere near that size is absurd. There will come a point where the overhead of creating, updating, and deleting such key values will be much greater than any performance gain you get over using binary values rather than CompareString calls.

Where is that point, exactly? Well, the strings would have to be very large.

For a hint, think about limitations in databases on indexes on TEXT columns, which have this very tradeoff in mind. There are many such times that sort keys are completely absurd to use, not including what's wrong with the Sortkey class in the .NET Framework.

I will talk more about that tradeoff and how different implementations handle it another day. In the meantime, keep in mind that it does not make sense in all scenarios to use sort keys.

This post brought to you by "ﬃ" (U+fb03, a.k.a. LATIN SMALL LIGATURE FFI)
(A character that is thinking if only more letters followed its example, sort keys could fit three times as many characters!)

# Mihai on 12 Jan 2006 12:46 PM:

Hi MisKa

Is there a chance to document the sort key?
At least a little bit?
Not in the official documentation, and not complete, because having ppl. start using/messing with it is a problem.
But at least enought for the readers of your blog to understand what you are talking about.

Yes, one might spend some hours with LCMapString and "reverse engineer" at least parts of the key (which I did a while ago), but why?

# Michael S. Kaplan on 12 Jan 2006 1:05 PM:

Sounds like something for the suggestion box if ever there was something. :-)

# Michael S. Kaplan on 12 Jan 2006 6:02 PM:

Also, who is the 'mishka' chap everyone keeps talking about? :-)

# Maurits [MSFT] on 28 Mar 2006 12:00 PM:

Mischka is the Russian equivalent of "Mike" (Mischa + dimunitive ka)

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

go to newer or older post, or back to index or month or day