A&P of Sort Keys, part 5 (aka EXPANSIONing your horizons)

by Michael S. Kaplan, published on 2007/09/15 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/15/4924008.aspx


Previous posts in this series:

In older posts in the blog here (such as these two) I have talked about SORT ELEMENTS, even going so far as to define them:

A sort element is a code point or combination of code points that a user thinks of as a character.

Now in this series the definition has not yet been all that relevant, since with the exception of Part 2 each WCHAR in the string was really a letter in the user's mind. Even in Part 2 I was clearly talking about combining characters that were diacritic marks, and showed how they were equal to some precomposed characters anyway.

Well, now we are going to change all that and talk about EXPANSIONS, where a single code point can map to up to three different letters (in the underlying implementation it is only two, but some of them are nested, so the net effect for collation is that it can be three)1.

In theory the code could support even bigger nested expansions, however the nesting rules in the code are:

Okay, so let's dig in now. Here we go....

Starting with ﬃ (U+fb03, a.k.a. LATIN SMALL LIGATURE FFI), compared against the string ffi using the default table:

en-US U+fb03               0e 23 0e 23 0e 32 01 01 01 01 00
en-US U+0066 U+0066 U+0069 0e 23 0e 23 0e 32 01 01 01 01 00

See how they are identical? And how the UNICODE WEIGHT piece (discussed in Part 1) is six bytes in size, meaning that we are looking at three sort elements?
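To make the byte counting concrete, here is a minimal Python sketch that works only on the hex dumps shown above. It assumes the layout described in Part 1: the UNICODE WEIGHT bytes run up to the first 01 separator, and each sort element contributes two bytes there. The function names are hypothetical helpers for illustration, not part of any Windows API.

```python
def unicode_weight_bytes(sort_key_hex: str) -> bytes:
    """Return the bytes that precede the first 0x01 separator in a sort key dump."""
    key = bytes.fromhex(sort_key_hex)
    # Assumption: the first 0x01 byte ends the UNICODE WEIGHT section.
    return key[:key.index(0x01)]


def count_sort_elements(sort_key_hex: str) -> int:
    """Count sort elements, assuming two UNICODE WEIGHT bytes per element."""
    return len(unicode_weight_bytes(sort_key_hex)) // 2


# The two sort keys from the dump above.
LIGATURE_FFI = "0e 23 0e 23 0e 32 01 01 01 01 00"  # U+fb03
PLAIN_FFI = "0e 23 0e 23 0e 32 01 01 01 01 00"     # U+0066 U+0066 U+0069

if __name__ == "__main__":
    print(count_sort_elements(LIGATURE_FFI))                        # 3
    print(bytes.fromhex(LIGATURE_FFI) == bytes.fromhex(PLAIN_FFI))  # True
```

One code point in, three sort elements out, which is exactly what makes it an EXPANSION.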

At some binary level you may not want LATIN SMALL LIGATURE FFI to be considered the same as ffi, but any normal user looking at them would expect them to be treated like they were kind of the same....

Now let's muddy the waters a bit, and look at æ (U+00e6, a.k.a. LATIN SMALL LETTER AE). Now in many languages it makes sense to treat it like ae so that words like Cæsars and Caesars can be treated the same, thus in the default table:

en-US a  U+0061        0e 02 01 01 01 01 00
en-US ae U+0061 U+0065 0e 02 0e 21 01 01 01 01 00
en-US æ  U+00e6        0e 02 0e 21 01 01 01 01 00

This is all well and good, and the folks running Windows over at Cæsars Palace probably appreciate that (though as far as I know they have not offered to fly anyone from Windows International down there yet!), but in the beautiful country of Iceland, this is not acceptable.

Because in Icelandic, æ is a little different, and thusly the weights look a little different:

is-IS a  U+0061        0e 02 01 01 01 01 00
is-IS ae U+0061 U+0065 0e 02 0e 21 01 01 01 01 00
is-IS z  U+007a        0e a9 01 01 01 01 00
is-IS æ  U+00e6        0e ac 01 01 01 01 00

As you can see, æ is its own letter that comes right after z and is just one sort element -- and is thus not an EXPANSION there (although in most locales it is).
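The default-versus-Icelandic behavior can be modeled with a toy lookup. This is only a sketch of the idea, not how the actual Windows tables are laid out: the weights are copied from the dumps in this post, the table and function names are hypothetical, the trailing 01 01 01 01 00 is hard-coded (which only holds for plain lowercase input like the examples here), and nested expansions are not modeled at all.

```python
# Two-byte UNICODE WEIGHT values copied from the sort key dumps in this post.
DEFAULT_WEIGHTS = {
    "a": bytes([0x0E, 0x02]),
    "e": bytes([0x0E, 0x21]),
    "f": bytes([0x0E, 0x23]),
    "i": bytes([0x0E, 0x32]),
    "z": bytes([0x0E, 0xA9]),
}

# Default-table EXPANSIONS: one code point stands for several letters.
DEFAULT_EXPANSIONS = {
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\u00e6": "ae",   # LATIN SMALL LETTER AE
}

# Hypothetical per-locale exceptions: in is-IS, U+00e6 is a letter of its own.
LOCALE_WEIGHTS = {
    "is-IS": {"\u00e6": bytes([0x0E, 0xAC])},
}

# Assumption: empty diacritic, case, and special sections, as in the dumps above.
TRAILER = bytes([0x01, 0x01, 0x01, 0x01, 0x00])


def toy_sort_key(text: str, locale: str = "en-US") -> bytes:
    """Build a toy sort key; locale exceptions win over default expansions."""
    overrides = LOCALE_WEIGHTS.get(locale, {})
    out = bytearray()
    for ch in text:
        if ch in overrides:
            out += overrides[ch]              # single sort element in this locale
        elif ch in DEFAULT_EXPANSIONS:
            for letter in DEFAULT_EXPANSIONS[ch]:
                out += DEFAULT_WEIGHTS[letter]  # one sort element per expanded letter
        else:
            out += DEFAULT_WEIGHTS[ch]
    return bytes(out) + TRAILER


if __name__ == "__main__":
    print(toy_sort_key("\u00e6") == toy_sort_key("ae"))                    # True
    print(toy_sort_key("\u00e6", "is-IS") == toy_sort_key("ae", "is-IS"))  # False
    print(toy_sort_key("z", "is-IS") < toy_sort_key("\u00e6", "is-IS"))    # True
```

Checking the locale exception before the default expansion is a design choice of this sketch; it is simply the easiest way to get æ to stop expanding in is-IS while ae itself keeps the same weights in both locales.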

Now as I pointed out in these three posts:

But in any case, that kind of explains the EXPANSION functionality in collation2.

 

1 - I even mentioned once (in Why doesn't FoldString take an LCID?) how you can use FoldString with the MAP_EXPAND_LIGATURES flag to access the default table's take on these EXPANSION entries, while bemoaning the fact that the locale-specific entries were unavailable since FoldString itself doesn't accept any kind of locale parameter (name or LCID)3.
2 - And even in ligature expansion, via FoldString, though there are limitations there.
3 - I did recommend that a FoldStringEx that would take a locale name be added to the next version of Windows before I moved from NLS to the International Fundamentals group, but I have no idea what the plans are here for the future....

 

This post brought to you by 5 (U+0035, a.k.a. DIGIT FIVE)

