On reversing the irreversible (grabbing the data, part I)

by Michael S. Kaplan, published on 2008/01/25 08:16 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/01/25/7228941.aspx

The first blog in this series was On reversing the irreversible (the introduction) and the second was On reversing the irreversible (The Set-Up).

The real question that comes up is how to make InitializeSortkeyReverseData[Ex] do the job here -- starting with how to store the information?

Looking at our prototypical empty sort key value, discussed in A&P of Sort Keys, part 0 (aka The empty string sorts the same in every language):

01 01 01 01 00

or with the placeholders for actual sort key data included:

[all Unicode sort weights] 01 [all Diacritic weights] 01 [all Case weights] 01 [all Special weights] 01 [Punctuation weights] 00

For each sort key value obtained, we need to separately store these five chunks, hashing them by the original string used to get the sort key.

And how to get the big bunch of sort key values?

Well, you may recall some of my prior discussion on sort elements, such as Sort element vs. text element and More on sort elements.

(if not then you can read them now!)

Well, the goal is to isolate each call to retrieve only one single sort element. There are two important consequences of this need:

1) Any time you have an expansion -- discussed in that second post above and more rigorously in A&P of Sort Keys, part 5 (aka EXPANSIONing your horizons) -- you do not need to store the information. Because you are really getting two or three sort elements there, your later attempts to do lookups of single sort keys are never going to find them (the individual sort elements will be found instead).

The consequence? You will in most cases not be able to retrieve Æ (U+00c6, LATIN CAPITAL LETTER AE) since it will for most locales only ever be treated like AE.

It may be tempting to remove the items (perhaps detecting them via a call to FoldString with the MAP_EXPAND_LIGATURES flag), but as I point out in Why doesn't FoldString take an LCID? there is no way to get the information for anything but the default table. So on the whole storing this data is a harmless thing, and nothing you really need to worry about (though if you wanted to remove them to have fewer entries you could try that FoldString route).

2) Any time you have a compression in a locale -- more on these all over the place in the blog like A Microsoft convention for compressions in sorting and the definitive A&P of Sort Keys, part 6 (aka Relax, be calm, and deCOMPRESS if you are feeling out of sorts) -- you will get a single sort element from multiple letters.

So reversing them is easy, but in order to store them you have to know what they are either because you know the language and its rules or because you had the actual data (or you have to do huge brute force calls to get every single letter combination to look for sort key values that are of different length than they are in the default table which has no compressions).

This is actually the easiest part for people to skip (and most of the attempts at solving this problem, people do skip), even though it means you miss out on a lot of what Windows has to offer in the way of collation support. But in many ways being willing to be flexible and work a little harder to build the data cache up will reap huge rewards and is worth the extra effort....

So how to proceed? Well, that part is easy to start -- naively you can scroll through every single code point from U+0001 to U+10fffd and store the sort key value. Less naively you can generally skip lots of different code points, like via a GetStringTypeW call skipping anything that is C1_CONTROL or for some purposes consider whether you want to be throwing out C1_BLANK or C1_PUNCT or C3_SYMBOL values.

You don't have to do this of course, but you have to ask yourself whether it is worth retaining all of that data or whether it is extraneous -- like you may want the punctuation saved so that the word don't can be retained, but do you really care about control characters? And if you are going to put in a SPACE for all of the characters you skip then maybe you could take that smallest symbolic weight (ref: I need my SPACE, symbolically speaking) and plug it in for many of these non-letter type characters.

And when you find a duplicate sort key, you can do one of two things -- you can throw away the new one, or you can replace the old one with it. And you probably want to not pass a bunch of NORM_IGNORE* flags as I mentioned before in (The Set-Up) -- I'll discuss how to get those same results later, but for now let's keep the data.

By default we'll assume that we should get all of the language specific stuff so we'll figure out a huge naive way of doing it -- we'll optimize it later, after we figure out what we have.

Now I have skipped stuff that is seemingly quite important that has come up in this blog before, like Japanese Kana, Korean Jamo, and more. But for now just trust me when I say that they are either nothing to worry about (or at worst no more awful than compressions, above), and in an upcoming blog I will explain why.

I will also explain in an upcoming blog about how the actual reversal operation will use this data that has been conceptually built in up, so that the combination of how we get the data and how we are gonna use it can help drive how to implement the actual storage....

Think of this series like a bunch of interview questions, but ones that I am going to answer more than I ask (which is not to say I won't do some quizzing of you readers!).

Stay tuned -- this series is only gonna get cooler! :-)

This post brought to you by Ặ (U+1eb6, aka LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW)

no comments

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2008/03/03 On reversing the irreversible (grabbing the data, part II: the weirdness not so related to locales)

go to newer or older post, or back to index or month or day