On reversing the irreversible (grabbing the data, part II: the weirdness not so related to locales)

by Michael S. Kaplan, published on 2008/03/03 10:16 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/03/03/7984960.aspx

Please read the disclaimer; content not Microsoft approved!

Previous blogs in this series:

In Part 3, I did make a particular claim:

Now I have skipped stuff that is seemingly quite important that has come up in this blog before, like Japanese Kana, Korean Jamo, and more. But for now just trust me when I say that they are either nothing to worry about (or at worst no more awful than compressions, above), and in an upcoming blog I will explain why.

You see, while much of the collation support is table-based, there are particular characters that aren't.

For example there is Japanese Kana -- discussed at length in A&P of Sort Keys, part 10 (aka I've Kana wanted to start talking about Japanese) -- which is basically sorted via a state machine of sorts (pun intended). This state machine makes assumptions about what amounts to particular clusters of characters that will produce different sort keys than the individual characters would if their sort key values were combined -- and in any case with sort weights that contain extra values that go in the Special weight section of the sort key:

[all Unicode sort weights] 01 [all Diacritic weights] 01 [all Case weights] 01 [all Special weights] 01 [Punctuation weights] 00
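To make that layout concrete, here is a minimal Python sketch that splits a sort key into its five sections. It assumes exactly the layout quoted above: sections separated by a single 01 byte, a single 00 terminator, and no 01 byte inside any section. The function name, the section names, and the sample bytes are made up for illustration and are not real sort key data.

```python
SECTION_NAMES = ("unicode", "diacritic", "case", "special", "punctuation")


def split_sort_key(key: bytes) -> dict:
    """Split a sort key into its five weight sections.

    Assumption: sections are separated by 0x01, the key ends with one
    0x00 terminator, and 0x01 never appears inside a section.
    """
    if not key.endswith(b"\x00"):
        raise ValueError("sort key must end with a 0x00 terminator")
    parts = key[:-1].split(b"\x01")
    if len(parts) != len(SECTION_NAMES):
        raise ValueError(f"expected 5 sections, found {len(parts)}")
    return dict(zip(SECTION_NAMES, parts))


# Made-up bytes for illustration only, not a real sort key.
example = bytes.fromhex("0e020e09" "01" "12" "01" "01" "ffff" "01" "00")
sections = split_sort_key(example)
```

In this made-up example the Case and Punctuation sections come back empty, while the Special section holds the two extra bytes.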

And for another example there are the conjoining Jamo used for Old Hangul -- discussed in theory vs. practice for Korean text collation and the redux thereof, as well as a not yet blogged blog on this Blog -- entitled A&P of Sort Keys, part 14 (Old Hangul: the good, the bad, and the ugly) -- which talks about the state machine used to build up the sort keys for Old Hangul syllables, in which up to nine conjoining Jamo can produce one single sort element that contains an additional six bytes in the Unicode weight section in the form FF ?? FF ?? FF ?? where ?? would be replaced by specific byte values that amount to a combination of the appropriate lead, vowel, and trailing Jamo, each of which will be 1-3 characters (I talked about this a bit here, with a huge contrived yet legal nine-Jamo example).
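If you wanted to spot that FF ?? FF ?? FF ?? shape while grabbing data, a rough Python sketch might look like the following. It only pattern-matches the six-byte form described above. It does not know where weight boundaries fall, so a match could in principle be a coincidence, and the function name and sample bytes are hypothetical.

```python
import re

# FF ?? FF ?? FF ?? : three (0xFF, value) byte pairs in a row.
OLD_HANGUL_EXTRA = re.compile(rb"\xff(.)\xff(.)\xff(.)", re.DOTALL)


def find_old_hangul_extras(unicode_weights: bytes) -> list:
    """Return (offset, (b1, b2, b3)) for each FF ?? FF ?? FF ?? run found.

    This is a byte-pattern scan only; it does not validate that the
    three values form a legal lead/vowel/trail combination.
    """
    return [
        (m.start(), tuple(group[0] for group in m.groups()))
        for m in OLD_HANGUL_EXTRA.finditer(unicode_weights)
    ]


# Made-up Unicode weight section: two ordinary bytes, then the six extra bytes.
sample = bytes.fromhex("8022ff03ff10ff05")
hits = find_old_hangul_extras(sample)
```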

A third example is the way CJK Extension A is handled pre-Vista and how all non-Extension B ideographic characters that are not included in the linguistic data of a given locale's sort are handled in Vista and beyond -- they are given a special large valued two byte prefix in their Unicode Weight, followed by another two bytes that put the characters in code point order.

And a fourth example is the way CJK Extension B is handled pre-Vista and how all ideographic characters that are not included in the linguistic data of a given locale's sort are handled in Vista and beyond -- their Unicode Weight is given a large two byte value for the high surrogate followed by a not-so-large two byte value for the low surrogate.

And a fifth example is the way every other supplementary character that is not in Extension B or the Supplementary Ideographic Plane is weighed -- the same technique as is used for Extension B but with a significantly lower two byte value for the high surrogate's Unicode Weight.
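Since the fourth and fifth examples both weigh a supplementary character through its two surrogates, it may help to see how a code point splits into that pair. This is just the standard UTF-16 arithmetic; it says nothing about the actual weight values assigned to each surrogate, and the function name is mine.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# U+20000 is the first code point in CJK Extension B.
ext_b_first = to_surrogates(0x20000)   # (0xD840, 0xDC00)
# U+1D11E (MUSICAL SYMBOL G CLEF) is supplementary but not ideographic.
g_clef = to_surrogates(0x1D11E)        # (0xD834, 0xDD1E)
```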

One interesting factor that all of the above things have in common is that they are really not all that locale specific -- you will get the same results across all locales.

Now a sixth example is one that does not yet exist but that I was specifically asked to have people think about if they move into the area of reverse engineering sort keys -- which is that enhancements happen.

Yes, that is important to keep in mind -- things improve and sometimes in order for that improvement to happen new algorithms might be created, and those new algorithms might create different arrangements for sort key values.

I am avoiding using the really popular I-word that people from Microsoft like to use for these cases -- INNOVATION -- since I hesitate to call every feature or change or improvement an actual innovation. But since I am not a VP there is no rule that I have to use the I-word here so let's just say that some of these changes might be those, and some might not be. Let's just not judge either positively or negatively here. :-)

In any case, it is possible to imagine such changes producing new patterns for sort keys that would presumably not be mistaken for existing characters but which would need to be understood well enough to be sure that the data-grabbing stage was able to capture the sort keys....

In every one of these cases, the challenge is to either produce intelligent data gathering code based on knowledge of the possible expected combinations -- e.g. in the case of conjoining Jamo, to take the known legal Old Hangul syllables, made up of sequences that are legal according to the table from the OpenType site (which lists the legal Old Hangul sequences), and in theory use this information to construct sort keys for all of Old Hangul.
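As a sketch of what the knowledge-based approach could look like in Python: given lists of legal lead, vowel, and trailing sequences, enumerate every combination and hand each one to whatever sort key API you are probing. The three lists below are tiny placeholders I made up from real conjoining Jamo code points. They are NOT the table of legal Old Hangul sequences, which you would have to load yourself.

```python
from itertools import product

# Placeholder sequences only, NOT the real table of legal Old Hangul sequences.
LEADS = ["\u1100", "\u1100\u1100"]     # choseong (lead) sequences
VOWELS = ["\u1161", "\u1169\u1161"]    # jungseong (vowel) sequences
TRAILS = ["\u11a8", "\u11a8\u11ba"]    # jongseong (trailing) sequences


def candidate_syllables(leads, vowels, trails):
    """Yield every lead + vowel + trail concatenation to feed to a sort key API."""
    for lead, vowel, trail in product(leads, vowels, trails):
        yield lead + vowel + trail


candidates = list(candidate_syllables(LEADS, VOWELS, TRAILS))
```

The number of candidates is the product of the three list lengths, which is the reason to start from the legal sequences rather than from every possible run of Jamo.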

Or you can use the naive algorithm you might just use for potential compressions (keeping in mind that in Vista there is at least one locale with 8-to-1 compressions in it).
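To give a feel for the naive approach, here is a minimal Python sketch: try every sequence of characters up to some length and flag the ones whose sort key differs from what you get by combining the keys of the individual characters. Everything here is an assumption for illustration. `toy_sort_key` is a stand-in with one weight byte per element and a single made-up "ch" compression, not a real sort key API, and real keys would need a smarter `combine` than plain concatenation.

```python
from itertools import product


def find_compressions(alphabet, sort_key, combine, max_len=2):
    """Return sequences whose key differs from the combined keys of their parts."""
    found = []
    for n in range(2, max_len + 1):
        for chars in product(alphabet, repeat=n):
            text = "".join(chars)
            expected = combine([sort_key(c) for c in chars])
            if sort_key(text) != expected:
                found.append(text)
    return found


# Toy stand-in for a real sort key API (hypothetical weights).
TOY_WEIGHTS = {"a": 0x10, "c": 0x20, "ch": 0x21, "h": 0x30}


def toy_sort_key(text):
    out, i = bytearray(), 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in TOY_WEIGHTS:
            out.append(TOY_WEIGHTS[pair])   # two characters, one sort element
            i += 2
        else:
            out.append(TOY_WEIGHTS[text[i]])
            i += 1
    return bytes(out)


def toy_combine(keys):
    return b"".join(keys)


result = find_compressions("ach", toy_sort_key, toy_combine)   # ["ch"]
```

The loop tries len(alphabet) ** n sequences for each length n, so pushing max_len toward 8 over any realistic alphabet gets out of hand quickly.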

Both approaches can be hideously complex depending on how much work you want to do, and that is even before you get to the locale-specific stuff that I will be talking about in the next part of the series....

Which is why you might want to stop worrying about the full support story some time prior to solving all of these problems? :-)

This blog brought to you by ừ (U+1eeb, aka LATIN SMALL LETTER U WITH HORN AND GRAVE)
