A&P of Sort Keys, part 1 (aka The law of the letter -- e.g. Latin < Greek < Cyrillic)

by Michael S. Kaplan, published on 2007/09/11 03:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/11/4861436.aspx

Okay, we'll start with something simple, basically a bunch of simple lowercase letters with no diacritics on them.

I'll take a string that grabs some of those look-alike characters from the Latin, Cyrillic, and Greek scripts. Our test string will be:

Ok, let's look at the sort key, with a simple call to LCMapString with the LCMAP_SORTKEY flag, using 0x0409 (US English), although many other LCIDs will give us that same default table....
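For anyone who wants to generate such a key themselves, here is a minimal sketch of that call made from Python through ctypes. The helper name `sort_key` is an assumption of this sketch, not something from the API. It only does real work on Windows and simply returns None anywhere else.

```python
import ctypes
import sys

LCMAP_SORTKEY = 0x00000400  # dwMapFlags value that asks for a sort key
LCID_EN_US = 0x0409         # US English, as used in this post


def sort_key(text, lcid=LCID_EN_US):
    """Return the sort key bytes for text, or None when not on Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    lc_map_string = kernel32.LCMapStringW
    lc_map_string.argtypes = [
        ctypes.c_uint32,   # Locale (LCID)
        ctypes.c_uint32,   # dwMapFlags
        ctypes.c_wchar_p,  # lpSrcStr
        ctypes.c_int,      # cchSrc (-1 means null-terminated)
        ctypes.c_void_p,   # lpDestStr (a byte buffer when LCMAP_SORTKEY is set)
        ctypes.c_int,      # cchDest (a byte count when LCMAP_SORTKEY is set)
    ]
    lc_map_string.restype = ctypes.c_int

    # First call with a zero-sized destination to learn the required size.
    size = lc_map_string(lcid, LCMAP_SORTKEY, text, -1, None, 0)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())

    # Second call actually fills the buffer with the sort key.
    buf = ctypes.create_string_buffer(size)
    written = lc_map_string(lcid, LCMAP_SORTKEY, text, -1, buf, size)
    if written == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buf.raw[:written]


if __name__ == "__main__":
    key = sort_key("a")
    print(key.hex(" ") if key is not None else "not running on Windows")
```

The bytes you get back depend on the Windows version, so treat any specific values as belonging to the machine that produced them.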

The sort key is the following stream of bytes. As in the first post, the sentinels are in black. The Unicode weight (UW) values will be in green.

Notice how the first byte in this two byte Unicode weight is the same for each script? That is the Script Member (SM) value:

and so on. The assignments here are arbitrary and of course are not guaranteed if a major version change occurs.

Thus where I generated this key, quite arbitrarily, Latin < Greek < Cyrillic. So if one is comparing the first letter to the fourth, one will get the same result as comparing the first byte of 0e 02 with the first byte of 0f 02.

The second byte is a value used to distinguish primary differences between letters, as would happen if one were comparing the second byte of 0e 02 with the second byte of 03 7c.
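To make the role of the two bytes concrete, here is a small Python sketch. The weights 0e 02 and 0f 02 are the ones quoted above for the first and fourth letters. The third weight, 0e 7c, and the helper name `split_uw` are assumptions added purely for illustration.

```python
def split_uw(uw):
    """Split a two-byte Unicode weight into (script member, second byte)."""
    if len(uw) != 2:
        raise ValueError("a Unicode weight here is exactly two bytes")
    return uw[0], uw[1]


first_letter = bytes.fromhex("0e02")   # quoted in the text
fourth_letter = bytes.fromhex("0f02")  # quoted in the text
same_script = bytes.fromhex("0e7c")    # hypothetical weight sharing SM 0e

# Different scripts: the script member byte alone decides the order,
# even though the second bytes happen to be identical.
sm_decides = split_uw(first_letter)[0] < split_uw(fourth_letter)[0]

# Same script: the script members tie, so the second byte decides.
sm_first, second_first = split_uw(first_letter)
sm_same, second_same = split_uw(same_script)
second_byte_decides = sm_first == sm_same and second_first < second_same

print(sm_decides, second_byte_decides)
```

Comparing the raw two-byte values directly (`first_letter < fourth_letter`) gives the same answers, since a byte-by-byte comparison looks at the script member first.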

Any time a string contains only simple alphabetic characters with no additional weights on them, this same kind of key will be generated.

Since a shorter string always comes before an otherwise equal but longer one, comparing αο to αοκ would be comparing

and suddenly the value of those sentinels becomes clear - since they will (bugs aside) be less than any legitimate weight, the shorter string comes before the longer.
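Here is a rough Python sketch of that shorter-versus-longer comparison. It assumes a simple key is just the Unicode weights followed by four 01 sentinel bytes and a 00 terminator. Apart from 0f 02, the weights are hypothetical placeholders rather than real table values, and the helper names are made up for the sketch.

```python
# Assumed tail of a simple key: four 01 sentinels and a 00 terminator.
SENTINELS = bytes([0x01, 0x01, 0x01, 0x01, 0x00])


def simple_key(weights):
    """Build a key from two-byte weights when no other weights apply."""
    return b"".join(weights) + SENTINELS


def first_difference(a, b):
    """Return (index, byte_a, byte_b) at the first mismatch, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    return None


w1 = bytes.fromhex("0f02")  # quoted in the text
w2 = bytes.fromhex("0f7c")  # hypothetical second letter
w3 = bytes.fromhex("0f42")  # hypothetical third letter

short_key = simple_key([w1, w2])
long_key = simple_key([w1, w2, w3])

# The keys agree until the shorter one runs out of letters. At that point
# its 01 sentinel is compared with the script member byte of the longer
# key's third letter, and the sentinel is the smaller value.
print(first_difference(short_key, long_key))
print(short_key < long_key)
```

Python compares bytes objects one byte at a time, which is the same kind of comparison a memcmp over the two keys would make.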

Now at this point there is not much more to say about this simple case, since it is (after all) simple. So tomorrow we will try to make things a little bit more complicated....


2008/08/21 A&P of Sort Keys, part 14: The Hangul is really getting OLD

2007/10/09 A&P of Sort Keys, part 13 (About the function that is too lazy to get it right every time)

2007/10/08 A&P of Sort Keys, part 12 (aka Han sorts first!)

2007/09/24 A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

2007/09/21 A&P of Sort Keys, part 10 (aka I've kana wanted to start talking about Japanese)

2007/09/20 A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating)

2007/09/18 A&P of Sort Keys, part 8 (aka You can often think of ignoring weights as a form of ignorance)

2007/09/17 A&P of Sort Keys, part 7 (aka You're very thin now, but I can still recognize you)

2007/09/16 A&P of Sort Keys, part 6 (aka Relax, be calm, and deCOMPRESS if you are feeling out of sorts)

2007/09/15 A&P of Sort Keys, part 5 (aka EXPANSIONing your horizons)

2007/09/14 A&P of Sort Keys, part 4 (aka It isn't a race but let's make an EXCEPTION and cross the Finnish line)

2007/09/13 A&P of Sort Keys, part 3 (aka Should you let a string make it's case? If so, Y?)

2007/09/12 A&P of Sort Keys, part 2 (aka The string that won? Didn't have a mark on him!)

A&P of Sort Keys, part 1 (aka The law of the letter -- e.g. Latin < Greek < Cyrillic)