by Michael S. Kaplan, published on 2007/09/11 03:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/11/4861436.aspx
Previous posts in this series:
Okay, we'll start with something simple, basically a bunch of simple lowercase letters with no diacritics on them.
I'll take a string that grabs some of those look-alike characters from the Latin, Greek, and Cyrillic scripts. Our test string will be:
aokαοκаок
Which is U+0061 U+006f U+006b U+03b1 U+03bf U+03ba U+0430 U+043e U+043a, or:
LATIN SMALL LETTER A, LATIN SMALL LETTER O, LATIN SMALL LETTER K,
GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER OMICRON, GREEK SMALL LETTER KAPPA,
CYRILLIC SMALL LETTER A, CYRILLIC SMALL LETTER O, CYRILLIC SMALL LETTER KA
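Since the three groups of letters look nearly identical on screen, it can help to print the code points and names rather than trust your eyes. A minimal sketch in Python follows; the names `test_string` and `describe` are mine, not anything from the Windows API.

```python
import unicodedata

# The nine code points listed above, written as escapes so the
# look-alike letters stay distinguishable in the source.
test_string = "\u0061\u006f\u006b\u03b1\u03bf\u03ba\u0430\u043e\u043a"

def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(test_string):
    print(code_point, name)
```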
Ok, let's look at the sort key, with a simple call to LCMapString with the LCMAP_SORTKEY flag, using 0x0409 (US English), although many other LCIDs will give us that same default table....
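For anyone who wants to reproduce the call, here is a rough sketch using Python's ctypes on Windows. The wrapper name `sort_key` and the two-call pattern (ask for the size first, then fill the buffer) are my choices for illustration; the bytes you get back depend on the Windows version, so they may not match the ones in this post.

```python
import ctypes
import sys

LCMAP_SORTKEY = 0x00000400  # dwMapFlags value for sort key generation
LCID_EN_US = 0x0409         # US English, as used in this post

def sort_key(text, lcid=LCID_EN_US):
    """Return the sort key bytes for text (hypothetical helper, Windows only)."""
    if sys.platform != "win32":
        raise OSError("LCMapStringW is only available on Windows")
    fn = ctypes.windll.kernel32.LCMapStringW
    fn.argtypes = [ctypes.c_uint32, ctypes.c_uint32, ctypes.c_wchar_p,
                   ctypes.c_int, ctypes.c_void_p, ctypes.c_int]
    fn.restype = ctypes.c_int
    # First call: a destination size of 0 asks for the required byte count.
    # A source length of -1 means the string is null-terminated.
    size = fn(lcid, LCMAP_SORTKEY, text, -1, None, 0)
    if size == 0:
        raise ctypes.WinError()
    # Second call: with LCMAP_SORTKEY the destination is a byte buffer.
    buf = ctypes.create_string_buffer(size)
    written = fn(lcid, LCMAP_SORTKEY, text, -1, buf, size)
    if written == 0:
        raise ctypes.WinError()
    return buf.raw[:written]
```

On a Windows machine, `sort_key("aok").hex(" ")` would print the key in the same space-separated form used below.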
The sort key is the following stream of bytes. As in the first post, the sentinels are the 01 and 00 bytes at the end; the Unicode weight (UW) values are everything before them.
0e 02 0e 7c 0e 36 0f 02 0f 10 0f 0b 10 06 10 48 10 36 01 01 01 01 00
Ok, so we have a nice easy read here -- 9 characters, 18 bytes.
Notice how the first byte in this two-byte Unicode weight is the same for each script? That is the Script Member (SM) value:
0e for Latin
0f for Greek
10 for Cyrillic
and so on. The assignments here are arbitrary and of course are not guaranteed if a major version change occurs.
Thus where I generated this key, quite arbitrarily, Latin < Greek < Cyrillic. So if one is comparing the first letter to the fourth, one will get the same result as comparing the first byte of 0e 02 with the first byte of 0f 02.
The second byte is a value used to distinguish primary differences between letters of the same script. Comparing the first letter to the second, for example, comes down to comparing the second byte of 0e 02 with the second byte of 0e 7c.
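To make the two-byte layout concrete, here is a small sketch that splits the key from this post into its Unicode weight pairs. It assumes what holds for this simple key: the UW section ends at the first 01 byte and every character contributes exactly two bytes. The `SCRIPTS` table only covers the three values seen here, and `unicode_weights` is a name I made up.

```python
# The sort key shown in this post.
KEY = bytes.fromhex(
    "0e 02 0e 7c 0e 36 0f 02 0f 10 0f 0b 10 06 10 48 10 36 01 01 01 01 00"
)

# Script Member values as they appear in this particular key only.
SCRIPTS = {0x0E: "Latin", 0x0F: "Greek", 0x10: "Cyrillic"}

def unicode_weights(key):
    """Return (first byte, second byte) pairs from the UW section of key."""
    end = key.index(0x01)  # the first sentinel closes the UW section
    uw = key[:end]
    return [(uw[i], uw[i + 1]) for i in range(0, len(uw), 2)]

for sm, second in unicode_weights(KEY):
    print(f"{sm:02x} {second:02x}  {SCRIPTS[sm]}")
```

Sorting the pairs as tuples gives the same order as comparing the raw bytes: script first, then the letter within the script.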
Any time a string contains only simple alphabetic characters with no additional weights on them, this same kind of key will be generated.
Since a shorter string always comes before an otherwise equal but longer one, comparing the Latin strings ao and aok would be comparing
0e 02 0e 7c 01 01 01 01 00
0e 02 0e 7c 0e 36 01 01 01 01 00
and suddenly the value of those sentinels becomes clear - since they will (bugs aside) be less than any legitimate weight, the shorter string comes before the longer.
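The same comparison can be checked mechanically. Python compares `bytes` objects byte by byte, much like memcmp, so this sketch uses that as a stand-in for a binary comparison of the two keys above; `first_difference` is a hypothetical helper for showing where they part ways.

```python
short_key = bytes.fromhex("0e 02 0e 7c 01 01 01 01 00")
long_key = bytes.fromhex("0e 02 0e 7c 0e 36 01 01 01 01 00")

def first_difference(a, b):
    """Return (index, byte from a, byte from b) at the first mismatch, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    return None

index, short_byte, long_byte = first_difference(short_key, long_key)
print(f"keys differ at byte {index}: {short_byte:02x} vs {long_byte:02x}")
print("shorter sorts first:", short_key < long_key)
```

The mismatch lands on the first sentinel of the shorter key, which is compared against the weight of the extra letter in the longer one.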
Now at this point there is not much more to say about this simple case, since it is (after all) simple. So tomorrow we will try to make things a little bit more complicated....
This post brought to you by 1 (U+0031, a.k.a. DIGIT ONE)