A&P of Sort Keys, part 2 (aka The string that won? Didn't have a mark on him!)

by Michael S. Kaplan, published on 2007/09/12 03:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/12/4875560.aspx


Previous posts in this series:

Because the nature of comparisons is such that the essentially question one is asking is which comes first?, it is very natural to think of it as a competition, almost a fight.

Of course the order is deterministic (minus the odd bug or two) so one could claim that to the extent it is a competition, the game is somewhat rigged. :-)

I could make the claim that the occasional a < b, b < c, c < a scenario is just an attempt to be fair and make everyone a winner, but I am pretty sure that neither Brandon non Jun (both devs on the SQL team) would buy it....

Anyway, given that it can be thought of as a fight, it is only natural that you might expect the loser of the fight to be a bit bruised, to have some marks on himself.

Which brings us to our test strings today:

aaá (or aaá)

aáa (or aáa)

áaa (or áaa)

aaa

The strings are basically various random combinations of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT) on the left and the canonically equivalent versions made with U+0061 (LATIN SMALL LETTER A) and U+00e1 (LATIN SMALL LETTER A WITH ACUTE) on the right. The two strings in each row will have identical sortkey values (if they don't it is a bug), and the sort keys look like this (secondary a.k.a. diacritic weights are red):

0e 02 0e 02 0e 02 01 02 02 0e 01 01 01 00

0e 02 0e 02 0e 02 01 02 0e 01 01 01 00

0e 02 0e 02 0e 02 01 0e 01 01 01 00

0e 02 0e 02 0e 02 01 01 01 01 00

Okay, so since I did it all with the default table, the primary weights are identical for all four strings. In order to provide the proper ordering of the strings, the secondary weight for the ACUTE is that 0e (actually 02 + 0c, the minimal weight is always added in there), and when a letter in the string does not have a diacritic weight, a minimal weight of 02 has to be put in as a placeholder so that the proper comparison can be made depending on where the acute actually is.

When there are no additional weights, the extra padding is not needed -- which saves some space.

Any time a binary comparison (e.g. memcmp) of two of the byte arrays is done, it will return the same results as if the full string comparison is happening.

And if you have multiple diacritics, the weights are simply added together -- thus when looking at:

s    (LATIN SMALL LETTER S)

ș    (LATIN SMALL LETTER S + COMBINING COMMA BELOW)

ś    (LATIN SMALL LETTER S + COMBINING ACUTE ACCENT)

ș́    (LATIN SMALL LETTER S + COMBINING COMMA BELOW + COMBINING ACUTE ACCENT)

ș́    (LATIN SMALL LETTER S + COMBINING ACUTE ACCENT + COMBINING COMMA BELOW)

The weights:

0e 91 01 01 01 01 00

0e 91 01 5b 01 01 01 00

0e 91 01 0e 01 01 01 00

0e 91 01 67 01 01 01 00

0e 91 01 67 01 01 01 00

are figured out with easy addition operations:

Easy, right?

Now, there is the occasional bug like:

But for normal strings (in normalization form C pre Vista, any form Vista and beyond), the method works quite well.

Looking back to Brett's original question from Part 0, we are starting to build up some interesting rules related to sort key size:

So far, that last point is probably the first one that seems potentially concerning -- imagine a string 2mb in size with the last character only having a diacritic -- your sort key would be 2mb bigger!

Luckily such sort keys with mostly duplicated values can be readily compressed.... :-)

So how are we doing? Is it getting interesting yet?

If not, then just stay tuned for Part 3, coming up tomorrow!

 

This post brought to you by 2 (U+0032, a.k.a. DIGIT TWO)


no comments

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2008/08/21 A&P of Sort Keys, part 14: The Hangul is really getting OLD

2007/10/09 A&P of Sort Keys, part 13 (About the function that is too lazy to get it right every time)

2007/10/08 A&P of Sort Keys, part 12 (aka Han sorts first!)

2007/09/24 A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

2007/09/21 A&P of Sort Keys, part 10 (aka I've kana wanted to start talking about Japanese)

2007/09/20 A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating)

2007/09/18 A&P of Sort Keys, part 8 (aka You can often think of ignoring weights as a form of ignorance)

2007/09/17 A&P of Sort Keys, part 7 (aka You're very thin now, but I can still recognize you)

2007/09/16 A&P of Sort Keys, part 6 (aka Relax, be calm, and deCOMPRESS if you are feeling out of sorts)

2007/09/15 A&P of Sort Keys, part 5 (aka EXPANSIONing your horizons)

2007/09/14 A&P of Sort Keys, part 4 (aka It isn't a race but let's make an EXCEPTION and cross the Finnish line)

2007/09/13 A&P of Sort Keys, part 3 (aka Should you let a string make it's case? If so, Y?)

go to newer or older post, or back to index or month or day