A&P of Sort Keys, part 5 (aka EXPANSIONing your horizons)

by Michael S. Kaplan, published on 2007/09/15 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/15/4924008.aspx

Previous posts in this series:

Part 0: The empty string sorts the same in every language
Part 1: The law of the letter -- e.g. Latin < Greek < Cyrillic
Part 2: The string that won? Didn't have a mark on him!
Part 3: Should you let a string make its case? If so, Y?
Part 4: It isn't a race but let's make an EXCEPTION and cross the Finnish line

In older posts in the blog here (such as these two) I have talked about SORT ELEMENTS, even going so far as to define them:

A sort element is a code point or combination of code points that a user thinks of as a character.

Now in this series the definition has not yet been all that relevant, since with the exception of Part 2 each WCHAR in the string was really a letter in the user's mind. Even in Part 2 I was clearly talking about combining characters that were diacritic marks, and I showed how they were equal to some precomposed characters anyway.

Well now we are going to change all that, and talk about EXPANSIONS, where a single code point can map to up to three different letters (in the underlying implementation it is only two, but some of them are nested so the net effect for collation is that it can be three)¹.

In theory the code could support even bigger nested expansions, however the nesting rules in the code are:

Each step must also be a ligature, thus ﬄ (U+fb04, a.k.a. LATIN SMALL LIGATURE FFL) can be expanded to f + ﬂ (U+fb02, a.k.a. LATIN SMALL LIGATURE FL), which can be expanded to ffl;
Currently, LCMapString/LCMapStringEx with the LCMAP_SORTKEY flag only allocate the space assuming one level of nesting, so if bigger ligatures were added, someone would have to modify some code there;
EXPANSION entries cannot overlap with COMPRESSION entries (which I'll be talking about tomorrow).
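
To picture the nesting rule, here is a minimal sketch in Python. It is a toy model and not the actual NLS code: the table `EXPANSIONS` and the function `expand` are hypothetical names made up for illustration, and the table holds only the two ligatures mentioned in the list above. Each entry maps one code point to exactly two, and the three-letter result only appears because the second half is itself a ligature.

```python
# Toy model of nested EXPANSION entries (hypothetical table, not the NLS data).
# Every entry maps one code point to exactly two code points.
EXPANSIONS = {
    "\ufb04": ("f", "\ufb02"),  # LATIN SMALL LIGATURE FFL -> f + LATIN SMALL LIGATURE FL
    "\ufb02": ("f", "l"),       # LATIN SMALL LIGATURE FL  -> f + l
}


def expand(text, max_depth=2):
    """Expand ligatures, making at most max_depth passes over the string.

    Two passes cover one expansion plus one nested expansion, which is
    the deepest case described in the list above.
    """
    for _ in range(max_depth):
        expanded = "".join("".join(EXPANSIONS.get(ch, (ch,))) for ch in text)
        if expanded == text:  # nothing left to expand
            break
        text = expanded
    return text


print(expand("\ufb04"))       # the fully expanded letters
print(len(expand("\ufb04")))  # how many sort elements came out
```

Calling `expand` with `max_depth=1` stops after the first step and leaves the inner ligature in place, which is one way to see why deeper nesting would need more room than a single two-way expansion.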

Okay, so let's dig in now. Here we go....

Starting with ﬃ (U+fb03, a.k.a. LATIN SMALL LIGATURE FFI), compared against the string ffi using the default table:

en-US ﬃ   U+fb03               0e 23 0e 23 0e 32 01 01 01 01 00
en-US ffi U+0066 U+0066 U+0069 0e 23 0e 23 0e 32 01 01 01 01 00

See how they are identical? And how the UNICODE WEIGHT piece (discussed in Part 1) is six bytes in size, meaning that we are looking at three sort elements?
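
A minimal sketch of that comparison, in Python, with everything outside the post's own numbers labeled as an assumption: the names `UNICODE_WEIGHTS`, `EXPANSIONS`, `TRAILER`, `sort_key` and `dump` are hypothetical, the weights for f and i are copied from the keys above, and the trailing bytes are simply reproduced from those keys rather than computed. The ligature is mapped straight to three letters here for brevity, whereas the real tables get there through nested two-way entries.

```python
# Toy sort-key builder (not the NLS implementation). Only the UNICODE WEIGHT
# piece is modeled, using the two-byte weights visible in the keys above.
UNICODE_WEIGHTS = {
    "f": bytes([0x0E, 0x23]),
    "i": bytes([0x0E, 0x32]),
}

# LATIN SMALL LIGATURE FFI, flattened to its three letters for brevity.
EXPANSIONS = {"\ufb03": "ffi"}

# Bytes that follow the UNICODE WEIGHT piece in the keys shown above,
# copied verbatim (assumed constant for these plain lowercase letters).
TRAILER = bytes([0x01, 0x01, 0x01, 0x01, 0x00])


def sort_key(text):
    """Expand ligatures, then concatenate each letter's Unicode weight."""
    expanded = "".join(EXPANSIONS.get(ch, ch) for ch in text)
    weights = b"".join(UNICODE_WEIGHTS[ch] for ch in expanded)
    return weights + TRAILER


def dump(key):
    """Format a key as space-separated hex bytes, like the listings above."""
    return " ".join(f"{b:02x}" for b in key)


print(dump(sort_key("\ufb03")))
print(dump(sort_key("ffi")))
```

Both calls go through the same three weight lookups, so the six weight bytes (three sort elements at two bytes each) come out the same for the ligature and for the spelled-out string.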

At some binary level you may not want LATIN SMALL LIGATURE FFI to be considered the same as ffi, but any normal user looking at them would expect them to be treated like they were kind of the same....

Now let's muddy the waters a bit, and look at æ (U+00e6, a.k.a. LATIN SMALL LETTER AE). Now in many languages it makes sense to treat it like ae so that words like Cæsars and Caesars can be treated the same, thus in the default table:

en-US a U+0061 0e 02 01 01 01 01 00
en-US ae U+0061 U+0065 0e 02 0e 21 01 01 01 01 00
en-US æ U+00e6 0e 02 0e 21 01 01 01 01 00

This is all well and good, and the folks running Windows over at Cæsars Palace probably appreciate that (though as far as I know they have not offered to fly anyone from Windows International down there yet!), but in the beautiful country of Iceland, this is not acceptable.

Because in Icelandic, æ is a little different, and thusly the weights look a little different:

is-IS a U+0061        0e 02 01 01 01 01 00
is-IS ae U+0061 U+0065 0e 02 0e 21 01 01 01 01 00
is-IS z U+007a        0e a9 01 01 01 01 00
is-IS æ U+00e6        0e ac 01 01 01 01 00

As you can see, æ is its own letter that comes right after z and is just one sort element -- and is thus not an EXPANSION there (although in most locales it is).
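
The difference between the two locales can be sketched as a default table plus a per-locale override. This is only an illustration in Python under stated assumptions: `DEFAULT_WEIGHTS`, `DEFAULT_EXPANSIONS`, `LOCALE_WEIGHTS` and `unicode_weights` are hypothetical names, the weights are the ones from the listings above, and the real tables are of course organized differently.

```python
# Toy model of a default table with a locale-specific exception
# (hypothetical structure; weights copied from the listings above).
DEFAULT_WEIGHTS = {
    "a": (0x0E, 0x02),
    "e": (0x0E, 0x21),
    "z": (0x0E, 0xA9),
}
DEFAULT_EXPANSIONS = {"\u00e6": "ae"}  # LATIN SMALL LETTER AE expands by default

# Per-locale overrides: in Icelandic, ae-ligature is one letter with its own weight.
LOCALE_WEIGHTS = {
    "is-IS": {"\u00e6": (0x0E, 0xAC)},
}


def unicode_weights(text, locale):
    """Return only the UNICODE WEIGHT bytes for text in the given locale."""
    overrides = LOCALE_WEIGHTS.get(locale, {})
    out = []
    for ch in text:
        if ch in overrides:
            # The locale treats this code point as a single sort element.
            out.extend(overrides[ch])
        else:
            # Otherwise fall back to the default table, expanding if needed.
            for letter in DEFAULT_EXPANSIONS.get(ch, ch):
                out.extend(DEFAULT_WEIGHTS[letter])
    return bytes(out)


words = ["\u00e6", "z", "ae", "a"]
for locale in ("en-US", "is-IS"):
    ordered = sorted(words, key=lambda s: unicode_weights(s, locale))
    print(locale, ordered)
```

In the sketch the override is checked before the default expansion, so the Icelandic entry wins and æ yields two weight bytes there instead of four.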

Now as I pointed out in these three posts:

It is also possible to add EXPANSION entries for a specific language only;
Everyone forgot about that fact for a long time;
The problem that was thereby caused is one that will be fixed in Windows Server 2008.

But in any case, that kind of explains the EXPANSION functionality in collation².

1 - I even mentioned once (in Why doesn't FoldString take an LCID?) how you can use FoldString with the MAP_EXPAND_LIGATURES flag to access the default table's take on these EXPANSION entries (while bemoaning the fact that the locale-specific entries were unavailable, since FoldString itself doesn't accept any kind of locale parameter, name or LCID)³.
2 - And even in ligature expansion, via FoldString, though there are limitations there.
3 - I did recommend that a FoldStringEx that would take a locale name be added to the next version of Windows before I moved from NLS to the International Fundamentals group, but I have no idea what the plans are here for the future....

This post brought to you by 5 (U+0035, a.k.a. DIGIT FIVE)
