On reversing the irreversible (The Set-Up)

by Michael S. Kaplan, published on 2008/01/14 10:16 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/01/14/7102809.aspx

The first blog in this series was On reversing the irreversible (the introduction).

First we'll lay out some requirements....

This is not a "call it once and then never again" kind of functionality, since building up the data needed to interpret sort keys is non-trivial and carries a high fixed setup cost. So what we really need is just a few functions:

InitializeSortkeyReverseData -- takes an LCID and the flags to be used, and returns an opaque handle to be passed on future calls.
InitializeSortkeyReverseDataEx -- same as InitializeSortkeyReverseData, but takes a locale name rather than an LCID; it also returns an opaque handle to be passed on future calls.
ReverseSortKey -- takes a handle from InitializeSortkeyReverseData[Ex] and a byte array returned from a prior sort-key-grabbing LCMapString/LCMapStringEx call, and returns a string that would have produced that sort key, or as close as can be managed.
UnInitializeSortkeyReverseData -- frees up the data associated with a prior InitializeSortkeyReverseData[Ex] call.

Now there are some truths relating to collation that are kind of implicit in the contract that these functions will be promising -- such as:

Certain locales return identical results to certain other locales, and if you know which ones they are you can save yourself the hassle of allocating the same identical data multiple times.
Most locale differences amount to nothing but subtle table alterations, but there are specific exceptions that are a much bigger deal, and on the whole it is best not to interpret the sortkey value from one locale using another locale's data (more on this in an upcoming blog).
Any time two strings return identical sortkeys, only one string will ever be returned by ReverseSortKey -- the string will be chosen deterministically, if arbitrarily.
All of the NORM_IGNORE* flags are essentially supersets of not calling with them, and thus it is probably better to be less restrictive and not include these flags in the InitializeSortkeyReverseData[Ex] call even if you plan to use the flags later, because otherwise those deterministic returns I just mentioned might seem a lot more random (I'll explain why in an upcoming blog).
The other flags (SORT_STRINGSORT and LCMAP_BYTEREV) both cause very different results to be returned and can legitimately be considered to be entirely different data sets, in the latter for the whole sort key and in the former for certain punctuation, as described in A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating).
We're only doing Unicode strings. If you're not using Unicode then you're simply out of scope for me here, and when I say out of scope I mean under scope....
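Two of the points above (one deterministic winner per sort key, and ignore flags widening the set of strings that collide) can be illustrated with a small Python sketch. It rests on assumptions: `toy_sort_key` is an invented stand-in for the real sort key call, `build_reverse_table` is a hypothetical helper, and the tie-break rule (shortest string, then lowest code points) is just one arbitrary choice. It shows only the collision effect, not the fuller reasoning promised for a later post.

```python
from collections import defaultdict

NORM_IGNORECASE = 0x00000001

def toy_sort_key(text: str, flags: int = 0) -> bytes:
    # Toy stand-in: a real sort key is not simply encoded text.
    if flags & NORM_IGNORECASE:
        text = text.lower()
    return text.encode("utf-8")

def build_reverse_table(strings, flags: int = 0) -> dict:
    # Group every candidate string under the sort key it produces.
    groups = defaultdict(list)
    for s in strings:
        groups[toy_sort_key(s, flags)].append(s)
    # Arbitrary but deterministic: shortest candidate, then lowest code points.
    return {key: min(group, key=lambda s: (len(s), s))
            for key, group in groups.items()}

words = ["resume", "Resume", "RESUME"]
strict = build_reverse_table(words)                  # three distinct keys
loose = build_reverse_table(words, NORM_IGNORECASE)  # one shared key
```

With no flags, each of the three spellings keeps its own key and comes back unchanged. With the ignore-case flag they all share one key, and only a single representative can ever be returned for it.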

Anyway, if you're reading the series you can think about the kind of data you will bring to the party for the future blogs in the series.

Next up: initializing the data, after deciding what we want the data to look like....

What we really want is a good hashtable algorithm that we can give the byte arrays and strings to (the former as index entries, the latter as content). Let me know if you have any in particular you like best....
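As one possible answer, and not necessarily what this series ends up using, a language's built-in hash map may already be enough. The Python sketch below keys a `dict` on immutable `bytes`; the `add_entry` helper is hypothetical and the byte values are made up for illustration rather than taken from real LCMapString output.

```python
def add_entry(table: dict, sort_key, text: str) -> None:
    # bytearray is mutable and therefore unhashable, so freeze it to bytes.
    # setdefault keeps the first string registered for a given key.
    table.setdefault(bytes(sort_key), text)

table = {}
add_entry(table, bytearray(b"\x0e\x02\x01\x01\x01\x01\x00"), "a")
add_entry(table, b"\x0e\x09\x01\x01\x01\x01\x00", "b")

# Lookup works with any bytes-like input once it is frozen the same way.
found = table.get(bytes([0x0E, 0x02, 0x01, 0x01, 0x01, 0x01, 0x00]))
missing = table.get(b"\x00")
```

The one design point worth noting is the key type: the index entries have to be immutable (or copied) so that a caller reusing its buffer cannot corrupt the table.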

This post brought to you by 𐄭 (U+1012d, aka AEGEAN NUMBER THIRTY THOUSAND)


referenced by

2008/03/03 On reversing the irreversible (grabbing the data, part II: the weirdness not so related to locales)

2008/01/25 On reversing the irreversible (grabbing the data, part I)
