A&P of Sort Keys, part 1 (aka The law of the letter -- e.g. Latin < Greek < Cyrillic)

by Michael S. Kaplan, published on 2007/09/11 03:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/11/4861436.aspx


Previous posts in this series:

Okay, we'll start with something simple, basically a bunch of simple lowercase letters with no diacritics on them.

I'll take a string that grabs some of those look-alike characters from the Latin, Cyrillic, and Greek scripts. Our test string will be:

aokαοκаок

Which is U+0061 U+006f U+006b U+03b1 U+03bf U+03ba U+0430 U+043e U+043a, or:

LATIN SMALL LETTER A, LATIN SMALL LETTER O, LATIN SMALL LETTER K,
GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER OMICRON, GREEK SMALL LETTER KAPPA,
CYRILLIC SMALL LETTER A, CYRILLIC SMALL LETTER O, CYRILLIC SMALL LETTER KA
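The listing above can be double-checked with a few lines of code. This is a minimal sketch in Python (chosen only for illustration, it is not what the post used), relying on the standard `unicodedata` module. The names `TEST` and `describe` are made up for this example, and the string is written with escapes so the look-alike letters cannot be confused.

```python
import unicodedata

# The test string, written with escapes so the look-alike letters stay distinct.
TEST = "\u0061\u006f\u006b\u03b1\u03bf\u03ba\u0430\u043e\u043a"

def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04x}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(TEST):
    print(code_point, name)
```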

Good enough?

Ok, let's look at the sort key, with a simple call to LCMapString with the LCMAP_SORTKEY flag, using 0x0409 (US English), although many other LCIDs will give us that same default table....
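For readers who want to try the call themselves, here is a rough sketch of it through Python's `ctypes`, assuming a Windows machine. `LCMapStringW` and `LCMAP_SORTKEY` (0x00000400) are the real Win32 names; the wrapper `sort_key` and the constant `LCID_EN_US` are hypothetical names for this example. The bytes you get back on a current Windows version may not match the ones in this post.

```python
import ctypes
import sys

LCMAP_SORTKEY = 0x00000400  # dwMapFlags value from winnls.h
LCID_EN_US = 0x0409         # US English, as used in the post

def sort_key(text, lcid=LCID_EN_US):
    """Return the LCMapStringW sort key for text as bytes (Windows only)."""
    if sys.platform != "win32":
        raise OSError("LCMapStringW is only available on Windows")
    fn = ctypes.windll.kernel32.LCMapStringW
    fn.argtypes = [ctypes.c_uint32, ctypes.c_uint32, ctypes.c_wchar_p,
                   ctypes.c_int, ctypes.c_void_p, ctypes.c_int]
    fn.restype = ctypes.c_int
    # Source length is counted in UTF-16 code units, not code points.
    cch_src = len(text.encode("utf-16-le")) // 2
    # A zero-length destination asks for the required size in bytes.
    size = fn(lcid, LCMAP_SORTKEY, text, cch_src, None, 0)
    if size == 0:
        raise ctypes.WinError()
    buf = ctypes.create_string_buffer(size)
    if fn(lcid, LCMAP_SORTKEY, text, cch_src, buf, size) == 0:
        raise ctypes.WinError()
    return buf.raw

if sys.platform == "win32":
    print(sort_key("\u0061\u006f\u006b\u03b1\u03bf\u03ba\u0430\u043e\u043a").hex(" "))
```

The two-call pattern (size query, then the real call) is the usual way to size the buffer, since the key length is not a simple function of the string length.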

The sort key is the following stream of bytes. As in the first post, the sentinels (the 01 bytes and the terminating 00 at the end) are in black. The Unicode weight (UW) values (the 18 bytes before them) are in green.

0e 02 0e 7c 0e 36 0f 02 0f 10 0f 0b 10 06 10 48 10 36 01 01 01 01 00

Ok, so we have a nice easy read here -- 9 characters, 18 bytes.
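To make the "9 characters, 18 bytes" reading concrete, the sketch below (Python, for illustration only) splits the key into two-byte weights and the trailing sentinels. The helper `split_key` is a made-up name, and it assumes no weight byte is ever 0x01, which holds for this key.

```python
# The sort key from the post, copied byte for byte.
KEY = bytes.fromhex(
    "0e 02 0e 7c 0e 36 0f 02 0f 10 0f 0b 10 06 10 48 10 36 01 01 01 01 00"
)

def split_key(key):
    """Split a key into its 2-byte Unicode weights and the trailing sentinels.

    Assumption: no weight byte equals 0x01, so the first 0x01 starts the sentinels.
    """
    end = key.index(0x01)
    weights = [key[i:i + 2] for i in range(0, end, 2)]
    return weights, key[end:]

weights, sentinels = split_key(KEY)
print(len(weights), "weights:", [w.hex(" ") for w in weights])
print("sentinels:", sentinels.hex(" "))
```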

Notice how the first byte in this two-byte Unicode weight is the same for each script? That is the Script Member (SM) value. In the key above:

0e - Latin
0f - Greek
10 - Cyrillic

and so on. The assignments here are arbitrary and of course are not guaranteed if a major version change occurs.

Thus where I generated this key, quite arbitrarily, Latin < Greek < Cyrillic. So if one is comparing the first letter to the fourth, one will get the same results as comparing the first byte of 0e 02 with the first byte of 0f 02.

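The second byte is a value used to distinguish primary differences between letters. If one were comparing the first letter to the second, one would get the same results as comparing the second byte of 0e 02 with the second byte of 0e 7c, since the first bytes are equal.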
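Both comparisons can be sketched with plain byte strings. This is an illustration in Python, not the comparison code Windows runs; the weights are taken from the key above and `first_difference` is a hypothetical helper that reports which byte decides the comparison.

```python
def first_difference(a, b):
    """Return the index of the first byte where a and b differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None

latin_a = bytes.fromhex("0e 02")      # first letter of the test string
latin_o = bytes.fromhex("0e 7c")      # second letter
greek_alpha = bytes.fromhex("0f 02")  # fourth letter

# Different scripts: byte 0, the Script Member, decides.
print(first_difference(latin_a, greek_alpha), latin_a < greek_alpha)
# Same script: byte 0 ties, so byte 1 decides.
print(first_difference(latin_a, latin_o), latin_a < latin_o)
```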

Any time a string contains only simple alphabetic characters with no additional weights on them, this same kind of key will be generated.

Since a shorter string always comes before an otherwise equal but longer one, comparing ao to aok (the Latin letters) would be comparing

0e 02 0e 7c 01 01 01 01 00

 to

0e 02 0e 7c 0e 36 01 01 01 01 00

and suddenly the value of those sentinels becomes clear - since they will (bugs aside) be less than any legitimate weight, the shorter string comes before the longer.
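The effect is easy to see by comparing the two keys shown above byte by byte. A small Python sketch, assuming the keys are compared as raw bytes (Python's `bytes` ordering is lexicographic by byte value); `SHORT`, `LONG`, and `first_difference` are names invented for this example.

```python
# The two keys from the post, copied byte for byte.
SHORT = bytes.fromhex("0e 02 0e 7c 01 01 01 01 00")
LONG = bytes.fromhex("0e 02 0e 7c 0e 36 01 01 01 01 00")

def first_difference(a, b):
    """Return the index of the first differing byte, or None if none differ."""
    return next((i for i, (x, y) in enumerate(zip(a, b)) if x != y), None)

i = first_difference(SHORT, LONG)
# At that position the shorter key has a sentinel and the longer one a weight.
print(i, f"{SHORT[i]:02x}", f"{LONG[i]:02x}")
print(sorted([LONG, SHORT]) == [SHORT, LONG])
```

The first four bytes tie, and at index 4 the sentinel 01 in the shorter key meets the weight byte 0e in the longer one, so the shorter key sorts first.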

Now at this point there is not much more to say about this simple case, since it is (after all) simple. So tomorrow we will try to make things a little bit more complicated....

 

This post brought to you by 1 (U+0031, a.k.a. DIGIT ONE)



