What is impossible for Microsoft can be simply undesirable for Unicode

by Michael S. Kaplan, published on 2009/09/14 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2009/09/14/9894607.aspx

Sometimes an implementation makes a certain feature impossible.

Take the way Microsoft does collation. The way its DEFAULT table is implemented (a flat DWORD table covering everything from 0x0000 to 0xFFFF) means that you can't ever have compressions in the default table.

Could the implementation be expanded to allow for this feature, so that more languages could be a part of the default table?


But the current implementation has no solution here to the problem.
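
To make the limitation concrete, here is a minimal sketch of why a flat table with one slot per code unit has nowhere to put a compression. Everything in it is an assumption made for illustration: the weights are invented, the function names are hypothetical, and this is not Microsoft's actual table format. It only shows that a per-code-unit lookup can never give the sequence "ch" a weight of its own, while a lookup that peeks at the next character can.

```python
# Illustration only: invented weights, not Microsoft's real table layout.

def make_flat_table():
    """One weight slot per UTF-16 code unit, 0x0000 through 0xFFFF."""
    table = [0] * 0x10000
    for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz"):
        table[ord(ch)] = (i + 1) * 0x100  # made-up weights with gaps between them
    return table


def flat_sort_key(text, table):
    """Each code unit is looked up on its own; no entry can span two units."""
    return [table[ord(ch)] for ch in text]


def contraction_sort_key(text, table, contractions):
    """Try a two-character sequence first, then fall back to the flat table."""
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in contractions:
            key.append(contractions[pair])
            i += 2
        else:
            key.append(table[ord(text[i])])
            i += 1
    return key


TABLE = make_flat_table()
# Hypothetical entry: "ch" as a unit between "c" and "d" (traditional Spanish order).
CONTRACTIONS = {"ch": TABLE[ord("c")] + 0x80}

words = ["dama", "chico", "cuna"]
flat_order = sorted(words, key=lambda w: flat_sort_key(w, TABLE))
contraction_order = sorted(
    words, key=lambda w: contraction_sort_key(w, TABLE, CONTRACTIONS)
)
```

With the flat lookup, "chico" sorts before "cuna" because "h" comes before "u". With the contraction entry, "chico" moves after "cuna". The second function also has to look ahead on every character, which is the kind of cost that comes up again below.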

Now the Unicode Collation Algorithm does not define such a limitation; it allows compressions (they call 'em contractions) in the DUCET (what they call their default table).

Thus questions like Doug Ewell's are obvious ones to ask:

The announcement of the Public Review issue stated:

1. The data files contain weights for all new assigned characters.
      b. The ordering for Tamil and Malayalam has been improved,
         but would still need tailoring for the Tamil and Malayalam

I guess I'm puzzled why the default order for these two scripts wouldn't match the overwhelmingly dominant language written in those scripts.  It's often stated that the default ordering for Latin also isn't appropriate for any language, but that's more understandable since so many languages are written in Latin.

I don't claim to be an expert in either Tamil or Malayalam.

So why don't they just put everything in the default table to make it better for languages that have no need of the "dumber" version for these letters?

Why not, indeed!

Well, this is described in the UCA in section 3.2 Default Unicode Collation Element Table:

The Default Unicode Collation Element Table does not aim to provide precisely correct ordering for each language and script; tailoring is required for correct language handling in almost all cases. The goal is instead to have all the other characters, those that are not tailored, show up in a reasonable order. In particular, this is true for contractions, because the use of contractions can result in larger tables and significant performance degradation. While contractions are required in tailorings, in the Default Unicode Collation Element Table their use is kept to the bare minimum to avoid such problems.

In the Default Unicode Collation Element Table, contractions are required in those instances where a canonically decomposable character requires a distinct primary weight in the table, so that the canonically equivalent character sequences are also given the same weights. For example, Indic two-part vowels have primary weights as units, and their canonically equivalent sequence of vowel parts must be given the same primary weight by means of a contraction entry in the table. The same applies to a number of precomposed Cyrillic characters with diacritic marks and to a small number of Arabic letters with madda or hamza marks.
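
As a rough illustration of the paragraph above, the sketch below takes one Indic two-part vowel, U+0BCA TAMIL VOWEL SIGN O, whose canonical decomposition is U+0BC6 followed by U+0BBE. The weights and the names `SINGLE_WEIGHTS`, `CONTRACTIONS`, and `primary_key` are invented for the example and are not taken from the DUCET. The point is only that without a contraction entry for the decomposed sequence, two canonically equivalent strings get different keys.

```python
import unicodedata

COMPOSED = "\u0BCA"  # TAMIL VOWEL SIGN O
DECOMPOSED = unicodedata.normalize("NFD", COMPOSED)  # U+0BC6 followed by U+0BBE

# Invented primary weights, for illustration only.
SINGLE_WEIGHTS = {
    "\u0B95": 0x4000,  # TAMIL LETTER KA
    "\u0BBE": 0x4F00,  # TAMIL VOWEL SIGN AA
    "\u0BC6": 0x5000,  # TAMIL VOWEL SIGN E
    COMPOSED: 0x5100,  # the two-part vowel, weighted as a unit
}

# The contraction maps the decomposed sequence to the unit's weight.
CONTRACTIONS = {DECOMPOSED: SINGLE_WEIGHTS[COMPOSED]}


def primary_key(text, contractions):
    """Build a list of primary weights, matching two-character contractions first."""
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in contractions:
            key.append(contractions[pair])
            i += 2
        else:
            key.append(SINGLE_WEIGHTS[text[i]])
            i += 1
    return key


word_composed = "\u0B95" + COMPOSED
word_decomposed = "\u0B95" + DECOMPOSED

keys_match_without = primary_key(word_composed, {}) == primary_key(word_decomposed, {})
keys_match_with = (
    primary_key(word_composed, CONTRACTIONS)
    == primary_key(word_decomposed, CONTRACTIONS)
)
```

Here `keys_match_without` comes out false and `keys_match_with` comes out true, which is the reason this class of contraction is treated as required rather than optional.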

Contractions are also entered in the table for Thai and Lao logical order exception vowels. Because both Thai and Lao both have five vowels that are represented in strings in visual order, instead of logical order, they cannot simply be weighted by their representation order in strings. One option is to require preprocessing of Thai and Lao strings, to identify and reorder all logical order exception vowels around the following consonant. That approach was used in Version 4.0 (and earlier) of the UCA. Starting with Version 4.1 of the UCA, contractions for the relevant combinations of Thai and Lao vowel+consonant have been entered in the Default Unicode Collation Element Table instead.

Those are the only two classes of contractions allowed in the Default Unicode Collation Element Table. Generic contractions of the sort needed, for example, to handle digraphs such as "ch" in Spanish or Czech sorting, should be dealt with instead in tailorings to the default table -- in part because they often vary in ordering from language to language, and in part because every contraction entered into the default table has a significant implementation cost for all applications of the default table, even those which may not be particularly concerned with the affected script. See the Unicode Common Locale Data Repository (CLDR) for extensive tailorings of the DUCET for various languages, including those requiring contractions.

Kind of says it all. There is a strong desire not to slow down everyone's results just to help specific languages. A tailoring for those languages just ends up being a better option overall, from the point of view of the people who write the spec for the algorithm.

Microsoft takes it a step further by not even allowing these exceptional cases in the default table; the only one that is really fascinating is the Thai case as it has an interesting story that I'll talk about another day (tomorrow, maybe?).

Now with all that said, there are times that I simply do not buy either Microsoft's or Unicode's argument, mainly when doing the design for a language that big companies are unlikely to ever provide tailorings for in their software implementations. In such cases, putting the entries in the default table, if it were possible (for Microsoft) or desirable (for Unicode), would mean no extra support would be required to make these languages work in a LOT of places. And it would be nice for there to be a way to provide optimal support for as many people as possible.

Say, if Microsoft had a "bonus default table" that one could opt into, containing all the compressions that would go into the default table if that were possible.

Unicode could solve the problem the same way, with a general-purpose tailoring designed for everyone, to be left out only when the performance gained by its absence was essential (if Unicode had this they might even be able to pull out some of the ones they have in there now!)....
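
One way to picture the opt-in idea is sketched below. It is purely hypothetical: neither Microsoft nor Unicode ships anything called a "bonus table", and `build_sort_key`, its parameters, and the weights are made up for the example. The assumption is that the choice is made once, when the sort key function is built, so that callers who do not opt in keep the plain per-character lookup and pay nothing for the lookahead.

```python
def build_sort_key(single_weights, bonus_contractions=None):
    """Return a sort key function; the bonus table is strictly opt-in."""
    if not bonus_contractions:
        # Default path: one lookup per character, no lookahead at all.
        def default_key(text):
            return [single_weights.get(ch, 0xFFFF) for ch in text]
        return default_key

    max_len = max(len(seq) for seq in bonus_contractions)

    def bonus_key(text):
        key = []
        i = 0
        while i < len(text):
            # Try the longest candidate sequence first.
            for n in range(max_len, 1, -1):
                chunk = text[i:i + n]
                if len(chunk) == n and chunk in bonus_contractions:
                    key.append(bonus_contractions[chunk])
                    i += n
                    break
            else:
                key.append(single_weights.get(text[i], 0xFFFF))
                i += 1
        return key

    return bonus_key


# Invented weights: a=0x100, b=0x200, and so on.
WEIGHTS = {ch: (i + 1) * 0x100 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
# Hypothetical bonus entry: "ch" as a unit right after "h" (Czech order).
BONUS = {"ch": WEIGHTS["h"] + 0x80}

words = ["chata", "jaro", "cena", "hrad"]
default_order = sorted(words, key=build_sort_key(WEIGHTS))
bonus_order = sorted(words, key=build_sort_key(WEIGHTS, BONUS))
```

With the default key, "chata" lands between "cena" and "hrad". With the bonus entries it lands after "hrad", and nobody who skipped the opt-in had to pay for that.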
