A&P of Sort Keys, part 0 (aka The empty string sorts the same in every language)

by Michael S. Kaplan, published on 2007/09/10 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/10/4847780.aspx


So I was talking with Brett the other day (yes, that Brett, the one whose blog is only occasionally written to!), and I can't remember what the original purpose of the conversation was.

Not that I minded, because our conversations, and really anything he feels like talking or writing about, are always interesting (even when he is talking about the size of someone's DIT, a blog post title after my own heart!).

I did take the opportunity while we were talking anyway to congratulate him on his promotion (he was surprised I knew; I guess not everyone is used to that whole address book title transparency thing as it gets rolled out to the company bit by bit!), and we talked about that whole SP/Server issue that is on its way to being solved.

Now I remember what the call was about. He had a question for me, one that he had been trying to ask in mail, but I was having trouble understanding his question and then he was having trouble with my answer, too. Surprisingly, it turns out we were both clearer than we realized....

He was trying to figure out two things, essentially:

The more I thought about it, the more I realized that both of these things might be of more general interest to readers here who like that whole "peek behind the curtain" that SiaO enjoys providing so much.

So think of this post as the introduction to an exciting new series of posts, one per day!

Now I am talking about going beyond that old "How do sort keys work?" post and the later ones that point to it, and getting into the real nitty-gritty.

Of course the best place to start is at the theoretical beginning string -- the empty string.

I actually have talked about this before (see "The string is freaking empty!" for details), but the theoretical sort key for the zero length string would be:

01 01 01 01 00

 

The reason I call this a theoretical sort key is that in practice (as "The string is freaking empty!" pointed out) you will get an error trying to get a sort key from an empty string. But you can easily fake it: just pass a string that contains only symbols and also include the NORM_IGNORESYMBOLS flag.
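For anyone who wants to try that trick, here is a minimal sketch of it in Python via ctypes. It assumes a Windows machine and calls the Win32 LCMapStringW function with LCMAP_SORTKEY plus NORM_IGNORESYMBOLS. The helper name fake_empty_sort_key, the choice of "!?" as the symbols-only string, and the use of the invariant locale are all my own assumptions for illustration, not anything the post specifies. Off Windows the helper simply returns None.

```python
import ctypes
import sys

# Win32 constants from the Windows SDK headers.
LCMAP_SORTKEY = 0x00000400
NORM_IGNORESYMBOLS = 0x00000004
LOCALE_INVARIANT = 0x007F  # assumption: any installed locale should do here


def fake_empty_sort_key(symbols="!?"):
    """Ask Windows for the sort key of a symbols-only string while
    ignoring symbols. Returns the key bytes, or None off Windows.

    `symbols` is a hypothetical sample; any symbols-only string is
    meant to behave the same way.
    """
    if sys.platform != "win32":
        return None

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    lcmap = kernel32.LCMapStringW
    lcmap.argtypes = [ctypes.c_uint32, ctypes.c_uint32, ctypes.c_wchar_p,
                      ctypes.c_int, ctypes.c_void_p, ctypes.c_int]
    lcmap.restype = ctypes.c_int

    flags = LCMAP_SORTKEY | NORM_IGNORESYMBOLS
    # Source length is counted in UTF-16 code units.
    src_len = len(symbols.encode("utf-16-le")) // 2

    # First call: destination size 0 asks for the required byte count.
    size = lcmap(LOCALE_INVARIANT, flags, symbols, src_len, None, 0)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())

    # Second call: fill a byte buffer with the sort key.
    buf = ctypes.create_string_buffer(size)
    written = lcmap(LOCALE_INVARIANT, flags, symbols, src_len, buf, size)
    if written == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buf.raw[:written]


key = fake_empty_sort_key()
if key is None:
    print("LCMapStringW is only available on Windows")
else:
    print(" ".join(f"{b:02x}" for b in key))
```

Going by the description above, the bytes printed on Windows should be the theoretical empty-string key, though I have left the expected output out of the code since it depends on the platform behaving as described.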

The numbers in 01 01 01 01 00 are bytes, and the 0x01 byte values are sentinels that split up the various pieces of the key, as follows:

[all Unicode sort weights] 01 [all Diacritic weights] 01 [all Case weights] 01 [all Special weights] 01 [Punctuation weights] 00
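The layout above can be sketched as a small parser. This is only an illustration under the assumption that no weight byte is itself 0x00 or 0x01; the function name split_sort_key and the field labels are hypothetical names of mine, chosen to mirror the five pieces listed in the layout.

```python
# Labels for the five pieces, in the order they appear in the key.
FIELD_NAMES = ("unicode", "diacritic", "case", "special", "punctuation")


def split_sort_key(key: bytes) -> dict:
    """Split a sort key into its five weight fields.

    Assumes weights never contain 0x00 or 0x01, so every 0x01 is a
    sentinel and the final 0x00 is the terminator.
    """
    if not key.endswith(b"\x00"):
        raise ValueError("sort key must end with the 0x00 terminator")
    # Drop the terminator, then cut on the four 0x01 sentinels.
    parts = key[:-1].split(b"\x01")
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected 5 fields, found {len(parts)}")
    return dict(zip(FIELD_NAMES, parts))


# The theoretical sort key for the empty string: every field is empty.
empty_key = bytes.fromhex("01 01 01 01 00")
fields = split_sort_key(empty_key)
print(fields)
```

For the empty-string key, each of the five fields comes back as an empty byte string, which is exactly what "nothing but sentinels" means.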

The numbers will always be hexadecimal (but I will often skip the 0x prefix to save space), and I will make the sentinel values black for the rest of the series (other pieces of weights will be different colors).

Now ideally no weight will ever use either 0x00 or 0x01, but there are a few cases where these values do end up in sort keys, due either to bugs (e.g. this one) or to design flaws (e.g. this other one). If you ignore these cases (one of which has limited use linguistically and the other of which is fixed in Vista), then it actually makes a great framework for everyone's future.

When I am all done, I will give one of the possible answers to the questions raised in "I am not a nudist, but I do support stripping when it is appropriate, part 1", with a full explanation of how the sort key can help there....

This particular post has very little of linguistic value (because the empty string is the same in every language!), but many other posts in the series will.

I hope the series sounds interesting to you if you are a regular reader; otherwise half the posts over the next while will bore the living snot out of you. :-)

 

This post brought to you by 0 (U+0030, a.k.a. DIGIT ZERO)



