Reversing sort keys?

by Michael S. Kaplan, published on 2005/03/16 06:25 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2005/03/16/396704.aspx

An interesting question! The kind of question that would almost make a good interview question (I say almost because there would likely not enough time to set up the complexity of the sort keys themselves and that would be important if someone was to be thinking about reversing them).

For admittedly geeky reasons I wish I could say something like, "well yes, just use LCMapString with the LCMAP_RECOVERSTRING flag or maybe use the .NET Framework's sister class to SortKey, the SortKeyReversi class." But obviously the documentation will not find such a class. :-(

To see why, let's take a closer look at a sort key. The (hideously contrived) string we will use is: "Mötley ¡Crüe! Chugß". using the Traditional Spanish LCID (0x040a), the sort key on Server 2003 is:

The Unicode Weight portion of the string is 38 bytes long, the diacritic weight is 12 bytes long, the case weight is 15 bytes long, and there are no special weights. So given the problem of reconstructing the string of the weight meets its first challenge -- there seems to be no pattern!

Well, wait a minute. The string is 19 characters long and there are 38 bytes in the Unicode weight -- that is a sort of a pattern, right? Kind of. Except in Traditional Spanish an instance of the letters "Ch" is treated as a single character, so that would mean we expect that it would be only 36 bytes. Ah, but wait -- the Sharp S at the end there, as we learned from earlier posts, gets treated like an "ss", so that would mean we are back up to 38. So we are back in business. Right?

Well, not so much -- as the diacritic and case weights do not seem to line up with those Unicode weights in an intuitive way.

Clearly we would also need the exact LCID and the flags -- after all, the same string with NORM_IGNORECASE | NORM_IGNORE | NORM_IGNORESYMBOLS, you will get back:

and obviously trying to pass some of the flags but not all of them will return some amount of sort key between these two extremes.

So every single call would require behind it a reverse lookup table with every weight and the character it maps to.

And then there are things like Old Hangul, Japanese Kana, Extension A ideographs, double compressions (used in Hungarian), and more that are not entirely table based, requiring code to reverse the operations on demand.

There are the obvious problems of trying to appropriately combine all of the weights together, and knowing how they fit.

Lets look at a string that will use some of these other weights, with the hideously contrived "Ḡrēăşėď いなずま" (the Hiragana word is for lightning). The sort key:

Trying to map the diacritic and case weight are the same hard problem as before, and the special weights are in this case very hard to place other than by knowing where in the alphabetic weights they might come from. In this case note that the sort key does not have to know the position in the string of the Japanese Hiragana since the UW values will properly sort the string against times that you put the same Hiragana string in a different position. Thus "いなずま Ḡrēăşėď" is:

Identical special weights, but you have to know what the Hiragana looks like. This is in contrast to the diacritic weights, which are sort of positional yet seemingingly not entirely so (wait until you add reverse diacritics in French or double compressions in Hungarian!). Or even the positional notion of case.

So, a proper GetStringFromSortKey would have to be able to handle all of these cases, and properly return the string with less imformation when flags were passed to filter out weights. It ould have to be able to properly hook up these different sections of vaying sizes, with enough specific knowledge of the various Unicode weights to know what script each belonged to so any special rules can be handled. Finally, it must be able to handle all of those ligature expansions and contractions of characters. And code to handle all of the cases where code was used to get the strings there in the first place.

A very non-trivial operation, to be sure! Without even a clear need (for example in a database, if you need the string you have it right there, so trying to reconstruct it from the index will never be as efficient or lossless as just using it directly).

So it does appeal on the level of the inner "geek" child inside me, but not so much on the practical "what is the user scenario we solve?" adult....

Oh well. It was just a thought. I did not really have a user scenario in mind. Maybe I am the same kind of geek.

It is an interesting problem, FWIW. It is one of those kinds of problems that starts off really easy and then just keeps getting harder as you notice other issues start piling up....

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

2008/01/13 On reversing the irreversible (the introduction)

2007/07/08 UnLCMapString or LCUnMapString?

2005/06/15 Once more into the palindrome

2005/03/17 The offline address book (OAB) in Exchange....