All right, mistakes were made #1 (a.k.a. Expanding the EXPANSION table)

by Michael S. Kaplan, published on 2007/05/05 12:36 -04:00, original URI:

(Apologies for the Dogma/Carlin allusion in the title)

It was funny how it all happened.

Not funny Ha-Ha, more like funny interesting.

It is about a bug.

Maybe I should explain.

I have mentioned EXPANSIONS in the past, and how you can access them directly, at least the ones in the DEFAULT collation table (ref: Why doesn't FoldString take an LCID?).

The way it works? Well, the DEFAULT collation table entry for these characters has a pointer to the special EXPANSIONS table. So the DEFAULT table would look something like this:

0x00c6   2   0   0   0
0x00e6   2   0   0   1
0x0152   2   0   0   2
0x0153   2   0   0   3
0x01c4   2   0   0   4
0x01c5   2   0   0   5
0x01c6   2   0   0   6

and so on, and the EXPANSION table would look something like this (note that the order of the entries below corresponds to that last number in each entry above):

0x00c6    0x0041    0x0045    ; Æ --> A + E
0x00e6    0x0061    0x0065    ; æ --> a + e
0x0152    0x004f    0x0045    ; Œ --> O + E
0x0153    0x006f    0x0065    ; œ --> o + e
0x01c4    0x0044    0x017d    ; DŽ --> D + Ž
0x01c5    0x0044    0x017e    ; Dž --> D + ž
0x01c6    0x0064    0x017e    ; dž --> d + ž

and so on. Easy, right?
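To make the pointer relationship concrete, here is a minimal sketch in Python. It is not the actual Windows data format or code: the dictionary and list layout and the names `expand` and `expand_string` are made up for illustration. The only thing it takes from the listings above is that the last number in each DEFAULT entry is an index into the EXPANSION table; the other three fields are copied as-is and not interpreted.

```python
# Hypothetical in-memory model of the two listings above.
# Only the last field of each DEFAULT entry is interpreted here:
# it is the index of the matching row in EXPANSION.
DEFAULT = {
    0x00C6: (2, 0, 0, 0),
    0x00E6: (2, 0, 0, 1),
    0x0152: (2, 0, 0, 2),
    0x0153: (2, 0, 0, 3),
    0x01C4: (2, 0, 0, 4),
    0x01C5: (2, 0, 0, 5),
    0x01C6: (2, 0, 0, 6),
}

# (source code point, first code point, second code point)
EXPANSION = [
    (0x00C6, 0x0041, 0x0045),  # Æ --> A + E
    (0x00E6, 0x0061, 0x0065),  # æ --> a + e
    (0x0152, 0x004F, 0x0045),  # Œ --> O + E
    (0x0153, 0x006F, 0x0065),  # œ --> o + e
    (0x01C4, 0x0044, 0x017D),  # DŽ --> D + Ž
    (0x01C5, 0x0044, 0x017E),  # Dž --> D + ž
    (0x01C6, 0x0064, 0x017E),  # dž --> d + ž
]


def expand(code_point):
    """Return the code points that code_point expands to, or [code_point]."""
    entry = DEFAULT.get(code_point)
    if entry is None:
        # This toy DEFAULT table only holds expansion pointers,
        # so anything else is returned unchanged.
        return [code_point]
    source, first, second = EXPANSION[entry[-1]]
    if source != code_point:
        raise ValueError("DEFAULT pointer and EXPANSION row are out of sync")
    return [first, second]


def expand_string(text):
    """Apply expand() to every character of text."""
    return "".join(chr(cp) for ch in text for cp in expand(ord(ch)))
```

With this model, `expand_string("Œuvre")` follows the pointer for U+0152 to row 2 and yields `"OEuvre"`, while characters without an entry pass through untouched.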

Now turning it off in particular locales is quite important -- after all, in Icelandic, Æ/æ do not expand to AE/ae. At all. One may as well claim that the letter b in English should expand to l and o or something!

Luckily it is also easy -- just add an EXCEPTION entry for the locale that replaces the pointer to the EXPANSION table with the appropriate weight for the code point.
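Here is a rough sketch of that override idea, again in Python and again with an invented layout. The tags `"expansion"` and `"weight"`, the `EXCEPTIONS` dictionary, the function names, and above all the weight tuples are placeholders for illustration; they are not the real Icelandic weights or the real table format.

```python
# Hypothetical, trimmed-down tables: each entry is tagged so that a pointer
# into EXPANSION can be told apart from an ordinary weight.
EXPANSION = [
    (0x00C6, 0x0041, 0x0045),  # Æ --> A + E
    (0x00E6, 0x0061, 0x0065),  # æ --> a + e
]

DEFAULT = {
    0x00C6: ("expansion", 0),
    0x00E6: ("expansion", 1),
}

# Per-locale EXCEPTION entries. The weight tuples below are made-up
# placeholders, NOT the weights any real locale uses.
EXCEPTIONS = {
    "is-IS": {
        0x00C6: ("weight", (14, 200, 2, 18)),
        0x00E6: ("weight", (14, 200, 2, 2)),
    },
}


def lookup(code_point, locale):
    """Return the locale's EXCEPTION entry if present, else the DEFAULT one."""
    overrides = EXCEPTIONS.get(locale, {})
    return overrides.get(code_point, DEFAULT.get(code_point))


def expands(code_point, locale):
    """True if code_point still points into the EXPANSION table for locale."""
    entry = lookup(code_point, locale)
    return entry is not None and entry[0] == "expansion"
```

In this sketch `expands(0x00E6, "en-US")` is true because nothing overrides the DEFAULT pointer, while `expands(0x00E6, "is-IS")` is false because the exception swapped the pointer for a weight of its own.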

Maybe six months before that post, I was trying to figure out whether future versions might require the ability to do more than just remove EXPANSIONS, and maybe even add them for a particular locale.

Julie didn't remember it coming up, and Cathy didn't either. Kieran thought it might be needed some day (though she couldn't think of any cases, offhand). So the idea was filed away in one of those various explanatory documents that had been popping up, in case it was needed.

(Intuitive folks might see at this point, especially with the hint that there is something to see here, what Julie and Cathy had pretty much forgotten about and what didn't occur to Kieran or me or anyone else who was thinking about the issue.)

Now in the meantime, as a feature the EXPANSION table had grown a lot in Vista.

In prior versions it had only 37 entries in it, but there were a whole lot of languages using characters across Unicode that could benefit, so many were added to the table (711 in all). Since no one really wanted to have to maintain both this bigger EXPANSION table and the 711 pointers to the items in it from the DEFAULT table, some work was done to take this derived data and simply derive it -- generate those 711 pointers when the data is being built.
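The generation step might look roughly like the following Python sketch. This is a guess at the shape of the idea, not the actual build tooling: the function name `derive_pointers` is invented, and the `(2, 0, 0, index)` shape is simply copied from the DEFAULT listing earlier in this post without interpreting the first three fields.

```python
# First rows of the EXPANSION table, as listed earlier in the post.
EXPANSION = [
    (0x00C6, 0x0041, 0x0045),  # Æ --> A + E
    (0x00E6, 0x0061, 0x0065),  # æ --> a + e
    (0x0152, 0x004F, 0x0045),  # Œ --> O + E
    (0x0153, 0x006F, 0x0065),  # œ --> o + e
    (0x01C4, 0x0044, 0x017D),  # DŽ --> D + Ž
    (0x01C5, 0x0044, 0x017E),  # Dž --> D + ž
    (0x01C6, 0x0064, 0x017E),  # dž --> d + ž
]


def derive_pointers(expansion_table):
    """Build the DEFAULT-style pointer entries from the EXPANSION rows.

    Each row's position in the table becomes the last field of its entry,
    so the pointers never have to be maintained by hand.
    """
    pointers = {}
    for index, (source, _first, _second) in enumerate(expansion_table):
        if source in pointers:
            raise ValueError(f"duplicate EXPANSION row for U+{source:04X}")
        pointers[source] = (2, 0, 0, index)
    return pointers


if __name__ == "__main__":
    # Print the derived entries in the same layout as the DEFAULT listing.
    for code_point, entry in derive_pointers(EXPANSION).items():
        print(f"0x{code_point:04x}   " + "   ".join(str(n) for n in entry))
```

Run on the seven rows above, this reproduces the seven DEFAULT entries shown at the start of the post, which is the whole appeal: one table to maintain by hand instead of two.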

I recall being a bit nervous about the idea, but after talking to the former development and PM owners (who actually remembered times that this kind of derived data had fallen out of sync in the past) and some others from the team, the decision was made to proceed with the plan -- and the 711 pointers to the EXPANSION table were now added at build time, and removed from the source. They weren't needed anyway, right?

Coming soon in post #2: how mistaken we all were (but especially how mistaken I was).


This post brought to you by œ (U+0153, a.k.a. LATIN SMALL LIGATURE OE)


referenced by

2007/09/15 A&P of Sort Keys, part 5 (aka EXPANSIONing your horizons)

2007/09/08 2001, a Correctness Odyssey (aka What's the matter with Ü?)

2007/05/05 All right, mistakes were made #2 (What the %#$* is wrong with German Phonebook sorting?)
