A&P of Sort Keys, part 2 (aka The string that won? Didn't have a mark on him!)

by Michael S. Kaplan, published on 2007/09/12 03:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/12/4875560.aspx

Previous posts in this series:

Part 0: The empty string sorts the same in every language
Part 1: The law of the letter -- e.g. Latin < Greek < Cyrillic

Because the nature of comparisons is such that the essentially question one is asking is which comes first?, it is very natural to think of it as a competition, almost a fight.

Of course the order is deterministic (minus the odd bug or two) so one could claim that to the extent it is a competition, the game is somewhat rigged. :-)

I could make the claim that the occasional a < b, b < c, c < a scenario is just an attempt to be fair and make everyone a winner, but I am pretty sure that neither Brandon non Jun (both devs on the SQL team) would buy it....

Anyway, given that it can be thought of as a fight, it is only natural that you might expect the loser of the fight to be a bit bruised, to have some marks on himself.

Which brings us to our test strings today:

aaá (or aaá)

aáa (or aáa)

áaa (or áaa)

aaa

The strings are basically various random combinations of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT) on the left and the canonically equivalent versions made with U+0061 (LATIN SMALL LETTER A) and U+00e1 (LATIN SMALL LETTER A WITH ACUTE) on the right. The two strings in each row will have identical sortkey values (if they don't it is a bug), and the sort keys look like this (secondary a.k.a. diacritic weights are red):

0e 02 0e 02 0e 02 01 02 02 0e 01 01 01 00

0e 02 0e 02 0e 02 01 02 0e 01 01 01 00

0e 02 0e 02 0e 02 01 0e 01 01 01 00

0e 02 0e 02 0e 02 01 01 01 01 00

Okay, so since I did it all with the default table, the primary weights are identical for all four strings. In order to provide the proper ordering of the strings, the secondary weight for the ACUTE is that 0e (actually 02 + 0c, the minimal weight is always added in there), and when a letter in the string does not have a diacritic weight, a minimal weight of 02 has to be put in as a placeholder so that the proper comparison can be made depending on where the acute actually is.

When there are no additional weights, the extra padding is not needed -- which saves some space.

Any time a binary comparison (e.g. memcmp) of two of the byte arrays is done, it will return the same results as if the full string comparison is happening.

And if you have multiple diacritics, the weights are simply added together -- thus when looking at:

s    (LATIN SMALL LETTER S)

ș    (LATIN SMALL LETTER S + COMBINING COMMA BELOW)

ś    (LATIN SMALL LETTER S + COMBINING ACUTE ACCENT)

ș́    (LATIN SMALL LETTER S + COMBINING COMMA BELOW + COMBINING ACUTE ACCENT)

ș́    (LATIN SMALL LETTER S + COMBINING ACUTE ACCENT + COMBINING COMMA BELOW)

The weights:

0e 91 01 01 01 01 00

0e 91 01 5b 01 01 01 00

0e 91 01 0e 01 01 01 00

0e 91 01 67 01 01 01 00

0e 91 01 67 01 01 01 00

are figured out with easy addition operations:

the comma below has a weight of 59 (that plus the minimal 02 weight makes 5b)
the acute has a weight of 0c (that plus the minimal 02 weight makes 0e)
the two diacritics combined have a weight of 65 (0c plus 59 plus the minimal 02 weight makes 67)
no matter what order the diacritics are in, they have the same weight -- note that in this case only one of the last two strings is valid, due to canonical reordering rules in Unicode. Can you guess which one is valid and why?

Easy, right?

Now, there is the occasional bug like:

prior to Vista, there were many times that the canonically equivalent strings did not have the same weight values in locales that were not using the default table -- fixed in Vista
as I described in Everybody's doing the wraparound, you can combine diacritics in ways to make strings that are not the same appear equal, or to have invalid 00 or 01 sentinel-like values

But for normal strings (in normalization form C pre Vista, any form Vista and beyond), the method works quite well.

Looking back to Brett's original question from Part 0, we are starting to build up some interesting rules related to sort key size:

As described in Part 0, every sort key has a fixed overhead of five bytes per sortkey;
According to Part 1, most of Unicode will give you two bytes of primary weight per character;
As described above, no matter how many diacritics you pile on a character, it will never add more than one byte per character;
As also described above, the need to pad in order to have the diacritic weights line up can actually add one additional byte per character preceding one with a diacritic.

So far, that last point is probably the first one that seems potentially concerning -- imagine a string 2mb in size with the last character only having a diacritic -- your sort key would be 2mb bigger!

Luckily such sort keys with mostly duplicated values can be readily compressed.... :-)

So how are we doing? Is it getting interesting yet?

If not, then just stay tuned for Part 3, coming up tomorrow!

This post brought to you by 2 (U+0032, a.k.a. DIGIT TWO)

no comments

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2008/08/21 A&P of Sort Keys, part 14: The Hangul is really getting OLD

2007/10/09 A&P of Sort Keys, part 13 (About the function that is too lazy to get it right every time)

2007/10/08 A&P of Sort Keys, part 12 (aka Han sorts first!)

2007/09/24 A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

2007/09/21 A&P of Sort Keys, part 10 (aka I've kana wanted to start talking about Japanese)

2007/09/20 A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating)

2007/09/18 A&P of Sort Keys, part 8 (aka You can often think of ignoring weights as a form of ignorance)

2007/09/17 A&P of Sort Keys, part 7 (aka You're very thin now, but I can still recognize you)

2007/09/16 A&P of Sort Keys, part 6 (aka Relax, be calm, and deCOMPRESS if you are feeling out of sorts)

2007/09/15 A&P of Sort Keys, part 5 (aka EXPANSIONing your horizons)

2007/09/14 A&P of Sort Keys, part 4 (aka It isn't a race but let's make an EXCEPTION and cross the Finnish line)

2007/09/13 A&P of Sort Keys, part 3 (aka Should you let a string make it's case? If so, Y?)

go to newer or older post, or back to index or month or day