by Michael S. Kaplan, published on 2007/09/12 03:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/12/4875560.aspx
Previous posts in this series:
Because the nature of comparisons is such that the essentially question one is asking is which comes first?, it is very natural to think of it as a competition, almost a fight.
Of course the order is deterministic (minus the odd bug or two) so one could claim that to the extent it is a competition, the game is somewhat rigged. :-)
I could make the claim that the occasional a < b, b < c, c < a scenario is just an attempt to be fair and make everyone a winner, but I am pretty sure that neither Brandon non Jun (both devs on the SQL team) would buy it....
Anyway, given that it can be thought of as a fight, it is only natural that you might expect the loser of the fight to be a bit bruised, to have some marks on himself.
Which brings us to our test strings today:
aaá (or aaá)
aáa (or aáa)
áaa (or áaa)
The strings are basically various random combinations of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT) on the left and the canonically equivalent versions made with U+0061 (LATIN SMALL LETTER A) and U+00e1 (LATIN SMALL LETTER A WITH ACUTE) on the right. The two strings in each row will have identical sortkey values (if they don't it is a bug), and the sort keys look like this (secondary a.k.a. diacritic weights are red):
0e 02 0e 02 0e 02 01 02 02 0e 01 01 01 00
0e 02 0e 02 0e 02 01 02 0e 01 01 01 00
0e 02 0e 02 0e 02 01 0e 01 01 01 00
0e 02 0e 02 0e 02 01 01 01 01 00
Okay, so since I did it all with the default table, the primary weights are identical for all four strings. In order to provide the proper ordering of the strings, the secondary weight for the ACUTE is that 0e (actually 02 + 0c, the minimal weight is always added in there), and when a letter in the string does not have a diacritic weight, a minimal weight of 02 has to be put in as a placeholder so that the proper comparison can be made depending on where the acute actually is.
When there are no additional weights, the extra padding is not needed -- which saves some space.
Any time a binary comparison (e.g. memcmp) of two of the byte arrays is done, it will return the same results as if the full string comparison is happening.
And if you have multiple diacritics, the weights are simply added together -- thus when looking at:
s (LATIN SMALL LETTER S)
ș (LATIN SMALL LETTER S + COMBINING COMMA BELOW)
ś (LATIN SMALL LETTER S + COMBINING ACUTE ACCENT)
ș́ (LATIN SMALL LETTER S + COMBINING COMMA BELOW + COMBINING ACUTE ACCENT)
ș́ (LATIN SMALL LETTER S + COMBINING ACUTE ACCENT + COMBINING COMMA BELOW)
0e 91 01 01 01 01 00
0e 91 01 5b 01 01 01 00
0e 91 01 0e 01 01 01 00
0e 91 01 67 01 01 01 00
0e 91 01 67 01 01 01 00
are figured out with easy addition operations:
Now, there is the occasional bug like:
But for normal strings (in normalization form C pre Vista, any form Vista and beyond), the method works quite well.
Looking back to Brett's original question from Part 0, we are starting to build up some interesting rules related to sort key size:
So far, that last point is probably the first one that seems potentially concerning -- imagine a string 2mb in size with the last character only having a diacritic -- your sort key would be 2mb bigger!
Luckily such sort keys with mostly duplicated values can be readily compressed.... :-)
So how are we doing? Is it getting interesting yet?
If not, then just stay tuned for Part 3, coming up tomorrow!
This post brought to you by 2 (U+0032, a.k.a. DIGIT TWO)
go to newer or older post, or back to index or month or day