by Michael S. Kaplan, published on 2004/12/30 12:04 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/30/344389.aspx
A Sort key is basically an array of bytes. The intention of the sort key is to make for faster comparisons of strings, so that if you compare the sort key values for two strings you will get the same results as comparing the strings themselves. They abstract out all of the irrelevant data (for example if you use NORM_IGNORECASE or CompareOptions.IgnoreCase) then the binary sort key for "AAAA", "AaAa", and "aaaa" will all be identical. As such, sort keys make a great basis for an index of string values, like you would have in a database engine.
But how are they structured to allow this to happen?
They have the same architecture in both managed code (via the SortKey class) and unmanaged code (via LCMapString with the LCMAP_SORTKEY flag). The structure is described in the LCMapString topic in the Platform SDK:
[all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 [all Special weights] 0x00
Note that the sort key is null-terminated. This is true regardless of the value of cchSrc. Also note that, even if some of the sort weights are absent from the sort key, due to the presence of one or more ignore flags in dwMapFlags, the 0x01 separators and the 0x00 terminator are still present.
The reason for this structure is that the primary weights (called the Unicode weights, above) need to take priority over secondary weights (called Diacritic weights, above), which themselves have to take priority over the tertiary weights (called Case weights, above), and so forth. In this way, all of the following examples are true when using the invariant locale/culture, as described in the last post:
AAAA < AAAB (primary difference)
AAAA < AÃAA (secondary difference)
aaaa < AAAA (tertiary difference)
AÃAA < AAAB (primary difference, secondary difference ignored)
AAAA < aaaB (primary difference, tertiary difference ignored)
aaaaab < aaab (primary difference, length and tertiary difference ignored)
AAA < aaà (secondary difference, special width and tertiary difference ignored)
And so forth. For that to work, the four different categories need to be kept separate and each one needs to be put in the sort key in its entirety, and if any type of weight is ignored then that whole section will be empty.
You can take the sort key, this structured array of bytes, and use it as an index for the string. Comparisons of two byte arrays will always be faster than comparing the string themselves.
This of course assumes that the sort keys are pre-calculated, like in an index. If they are not and you are looking at the difference between caculating then comparing the sort key values for two strings versus comparing the strings themselves, the string comparison will almost certainly be faster. The reason for that is that the sort key calculation involves analyzing the information of the entire string (and still does not include the actual comparison) whereas string comparisons will exit as soon as they can come up with an answer to the question of which one comes first.
I was doing a presentation a few years ago and it occurred to me that looking at direct string comparison versus sort key calculation/comparison was like looking at the "retail" version of collation vs. the "Wholesale" one. Only some people in the crowd felt it was an illuminating analogy, and I once again learned that I should not blurt out "good ideas thst suddenly occur to me" when I am in the middle of a presentation. :-)
One last thought -- no, there is not an Ordinal type of sort key. Because Ordinal comparisons are already done in a binary manner!
This post brought to you by "р" (U+0440, a.k.a. CYRILLIC SMALL LETTER ER)
# Norman Diamond on 4 Jan 2005 5:25 PM:
# Michael Kaplan on 4 Jan 2005 5:36 PM:
# Norman Diamond on 4 Jan 2005 11:00 PM:
# Michael Kaplan on 5 Jan 2005 12:09 AM:
# Arjen Poutsma on 5 Jan 2005 1:42 PM:
# Norman Diamond on 5 Jan 2005 4:34 PM:
# Michael Kaplan on 5 Jan 2005 5:06 PM:
bhushan on 5 Apr 2011 2:40 AM:
thanks a lot ...
Nice explaination....
Alexander Savin on 1 Dec 2011 5:06 PM:
If I need to generate a sort key for the string I have to call LCMapString twice:
bufferSize = LCMapStringW(..., 0);
...Allocate buffer...
LCMapStringW(..., bufferSize);
Is there a performance flaw in this approach? Doesn't this imply that LCMapString does a double work? Does LCMapString need to scan an input string to determine the sort key size?
If it does, I would very much prefer having an ability to pass, in a single call, the pre-allocated buffer size and having LCMapString return the required buffer size if the input one appears to be not large enough.
What about the following approach (if I don't need a sort key itself but only its hash code)?
...Allocate 512 bytes on stack...
sortKeySize = LCMapStringW(..., 512);
if (sortKeySize == 0)
{
sortKeySize = LCMapStringW(..., 0);
...Allocate buffer...
LCMapStringW(..., sortKeySize);
}
...Calc hash...
It does a triple work in the worst case but should work faster in most cases. Shouldn't it?
Also, having to estimate the worst sort key size, considering an input string has the length of N, is it true that its sort key is no longer than (2*N + 1)*4?
And if IgnoreCase is specified, for example, will the [all Case weights] section be empty?
Michael, it would be very helpful if you could clarify all this? Thanks.
Michael S. Kaplan on 8 Dec 2011 7:20 AM:
Answers here....
referenced by
2011/12/08 Sort keys - answers to several questions
2007/09/10 A&P of Sort Keys, part 0 (aka The empty string sorts the same in every language)
2006/11/25 Punctuation... now, isn't that SPECIAL [weights] ?
2006/06/02 What the @!#$% is the TERTIARY_WEIGHTS() function for?
2006/06/02 It is only of SECONDARY importance
2006/01/12 The creation of sort keys does not always make sense
2005/03/11 When good SQL queries have trouble....
2005/01/18 The jury will give this string no weight
2005/01/05 What is up with number sorting?
2005/01/01 That is not actually a bug in sort keys....