How do sort keys work?

by Michael S. Kaplan, published on 2004/12/30 12:04 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/30/344389.aspx


A Sort key is basically an array of bytes. The intention of the sort key is to make for faster comparisons of strings, so that if you compare the sort key values for two strings you will get the same results as comparing the strings themselves. They abstract out all of the irrelevant data (for example if you use NORM_IGNORECASE or CompareOptions.IgnoreCase) then the binary sort key for "AAAA", "AaAa", and  "aaaa" will all be identical. As such, sort keys make a great basis for an index of string values, like you would have in a database engine.

But how are they structured to allow this to happen?

They have the same architecture in both managed code (via the SortKey class) and unmanaged code (via LCMapString with the LCMAP_SORTKEY flag). The structure is described in the LCMapString topic in the Platform SDK:

[all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 [all Special weights] 0x00

Note that the sort key is null-terminated. This is true regardless of the value of cchSrc. Also note that, even if some of the sort weights are absent from the sort key, due to the presence of one or more ignore flags in dwMapFlags, the 0x01 separators and the 0x00 terminator are still present.

The reason for this structure is that the primary weights (called the Unicode weights, above) need to take priority over secondary weights (called Diacritic weights, above), which themselves have to take priority over the tertiary weights (called Case weights, above), and so forth. In this way, all of the following examples are true when using the invariant locale/culture, as described in the last post:

AAAA   <   AAAB      (primary difference)
AAAA   <   AÃAA      (secondary difference)
aaaa   <   AAAA      (tertiary difference)
AÃAA   <   AAAB      (primary difference, secondary difference ignored)
AAAA   <   aaaB      (primary difference, tertiary difference ignored)
aaaaab <   aaab      (primary difference, length and tertiary difference ignored)
AAA  <   aaà      (secondary difference, special width and tertiary difference ignored)

And so forth. For that to work, the four different categories need to be kept separate and each one needs to be put in the sort key in its entirety, and if any type of weight is ignored then that whole section will be empty.

You can take the sort key, this structured array of bytes, and use it as an index for the string. Comparisons of two byte arrays will always be faster than comparing the string themselves.

This of course assumes that the sort keys are pre-calculated, like in an index. If they are not and you are looking at the difference between caculating then comparing the sort key values for two strings versus comparing the strings themselves, the string comparison will almost certainly be faster. The reason for that is that the sort key calculation involves analyzing the information of the entire string (and still does not include the actual comparison) whereas string comparisons will exit as soon as they can come up with an answer to the question of which one comes first.

I was doing a presentation a few years ago and it occurred to me that looking at direct string comparison versus sort key calculation/comparison was like looking at the "retail" version of collation vs. the "Wholesale" one. Only some people in the crowd felt it was an illuminating analogy, and I once again learned that I should not blurt out "good ideas thst suddenly occur to me" when I am in the middle of a presentation. :-)

One last thought -- no, there is not an Ordinal type of sort key. Because Ordinal comparisons are already done in a binary manner!

 

This post brought to you by "р" (U+0440, a.k.a. CYRILLIC SMALL LETTER ER)


# Norman Diamond on 4 Jan 2005 5:25 PM:

I don't know of any language with an "A with tilde" letter, so would like to ask about languages with an "n with tilde" letter. Notice that I misdescribed the letter, it is more appropriate to call it "enye". (As an analogy, in languages that have an "O with bottom tilde" letter, it is appropriate to call that letter "queue".)

To the best of my understanding, in languages that have an enye, it is a different letter from an en. It is not a secondary difference in the way that you described for a diacritic on an A, it is a primary difference (just as queue is a different letter from oh). Does Windows consider the locale in deciding whether to recognize a difference as primary or as secondary?

I'm not sure whether to ask further about how the array of bytes is treated under different locales. For example in some languages "ng" is a single letter even though it looks exactly the same as the pair of letters "n" and "g". In some languages "ch" is a letter.

In the Library of Congress locale, I'm not sure if a leading "Mc" or "Mac" on a name is treated as a single letter or not, but have read that they are treated as identical ("McDonald" comes before "MacHenry" because both start with that identical prefix, one followed by a D and one followed by an H). OK, the Library of Congress locale probably doesn't have sort keys treated as arrays of bytes, it's probably somewhere on the road to being a phonebook sort. But my other questions still hold.

# Michael Kaplan on 4 Jan 2005 5:36 PM:

The point is that not all languages DO have such a letter -- ENGLISH does not. If I sort data using the English LCID, it is an N with a squiggle on it, and nothing more.

Think of it as "all of Unicode, according to a particular language" and then you will see what is being done here.

Given that fact, it becomes clear that the sort keys will vary depending on the locale that is passed to the API. It has to, since the meaning and the way things sort changes with each locale.

On languages that treat 'ch' as a unique sort element that sorts have 'c' but before 'd', there will obviously be differences in the sort key -- there has to be.

There is no "library of congress" locale on Windows and I do not work for the library of congress so I cannot comment on how they would implement such a sort. I have ideas how we might, but we do not have it as a request right now.

# Norman Diamond on 4 Jan 2005 11:00 PM:

> If I sort data using the English LCID,

Sure, but I was asking a more general question. To me the base note appeared to be more general, but was that just me? I guess that it's necessary to read your postings with an assumed context of "Unicode US English" unless you say otherwise?

> There is no "library of congress" locale on
> Windows

Yeah I know, and I doubt that there is in most OSes. Nonetheless their sort ordering has enough arcane rules that there used to be printed instructions near card catalogs in libraries. In ancient history, card catalogs were made from bits of dead trees with physical markings on them. In order to obtain full search results, it was not sufficient to depend on the card catalog being consistent with its own ordering, users also had to make themselves consistent with it. In modern times it wouldn't even matter if the invariant locale were used, because it doesn't matter what sort ordering the computer uses internally in its database, as long as searches find all matches.

# Michael Kaplan on 5 Jan 2005 12:09 AM:

Hardly, dude. What I said also applies to Thai. Or to Russian. Or to any number of languages that do not recognize as a unique letter with prfimary weight N with a tilde.

To them, it is a N with a smudge on it.

Their opinion is valid and measningful, and we work to make it happen for them as they expect.

Lets drop the card catalog stuff, it really has very little to do with the topic here.

# Arjen Poutsma on 5 Jan 2005 1:42 PM:

I don't know if you have anything to do with it, but I consider .NET class SortKey as one ugliest kludges of the entire .NET framework, especially if you compare it with the Java CollationKey class. Why?

First, there is the static Compare method, but the SortKey does not implement IComparable, nor IComparer. This makes it pretty hard to store them in sortable collections as the ArrayList.

Second, the Equals and GetHashCode() of SortKey operate on both the byte array _and_ the original string. Imagine a scenario where you have a set of localized strings, and you want to check whether the set contains a certain word in various spellings. So the list might contain the German "weiß", which means it also contains "weiss".
One would think that the fastest ways to implement this using .NET is to create a Hashtable with SortKeys as keys, and to check whether the table contains a certain key. However, since the Equals and GetHashcode also operate on the original string, it will not work. The byte arrays might be equal for weiß and weiss, but the original strings are not.

Both of these issues can be resolved (the first by writing a custom IComparer which uses SortKey.Compare; the second by a custom IHashCodeProvider which only relies on the byte array), but kludges they remain.

# Norman Diamond on 5 Jan 2005 4:34 PM:

> What I said also applies to Thai. Or to
> Russian. Or to any number of languages that
> do not recognize as a unique letter with
> prfimary weight N with a tilde.

OK, this time wording "implies" (in a human manner, not mathematical manner, of implying) that for languages that do have enye as a unique letter, Microsoft's sort keys will recognize their primary weight. Let's hope they do.

(Learn to do the same for the yen sign in Japanese and maybe you'll get fewer complaints from Japanese customers.)

# Michael Kaplan on 5 Jan 2005 5:06 PM:

I do not need to hope -- I know it works.

As for the Yen, you are the only one complaining. Everyone else has expressed either nothing lor joy that we take two things that look the same and say they are equivalent. :-)

bhushan on 5 Apr 2011 2:40 AM:

thanks a lot ...

Nice explaination....

Alexander Savin on 1 Dec 2011 5:06 PM:

If I need to generate a sort key for the string I have to call LCMapString twice:

bufferSize = LCMapStringW(..., 0);

...Allocate buffer...

LCMapStringW(..., bufferSize);

Is there a performance flaw in this approach? Doesn't this imply that LCMapString does a double work? Does LCMapString need to scan an input string to determine the sort key size?

If it does, I would very much prefer having an ability to pass, in a single call, the pre-allocated buffer size and having LCMapString return the required buffer size if the input one appears to be not large enough.

What about the following approach (if I don't need a sort key itself but only its hash code)?

...Allocate 512 bytes on stack...

sortKeySize = LCMapStringW(..., 512);

if (sortKeySize == 0)

{

   sortKeySize = LCMapStringW(..., 0);

   ...Allocate buffer...

   LCMapStringW(..., sortKeySize);

}

...Calc hash...

It does a triple work in the worst case but should work faster in most cases. Shouldn't it?

Also, having to estimate the worst sort key size, considering an input string has the length of N, is it true that its sort key is no longer than (2*N + 1)*4?

And if IgnoreCase is specified, for example, will the [all Case weights] section be empty?

Michael, it would be very helpful if you could clarify all this? Thanks.

Michael S. Kaplan on 8 Dec 2011 7:20 AM:

Answers here....


referenced by

2011/12/08 Sort keys - answers to several questions

2007/09/10 A&P of Sort Keys, part 0 (aka The empty string sorts the same in every language)

2006/11/25 Punctuation... now, isn't that SPECIAL [weights] ?

2006/06/02 What the @!#$% is the TERTIARY_WEIGHTS() function for?

2006/06/02 It is only of SECONDARY importance

2006/01/12 The creation of sort keys does not always make sense

2005/03/11 When good SQL queries have trouble....

2005/01/18 The jury will give this string no weight

2005/01/05 What is up with number sorting?

2005/01/01 That is not actually a bug in sort keys....

go to newer or older post, or back to index or month or day