Microsoft still does not use the UCA; the converse is also true

by Michael S. Kaplan, published on 2008/02/10 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/02/10/7579576.aspx


In recent conversations about the atomic Malayalam chillu on Unicode's Indic list, we find that the fact that they have been encoded has not stopped widespread argument about them from several people, even now.

In the midst of all that, several comments about the DUCET (the Default Unicode Collation Element Table) came up, with people arguing that whether or not the chillu are encoded atomically impacts whether the DUCET can support Malayalam without tailoring.

This is of course not true; the Unicode Collation Algorithm has no technical barrier to supporting any kind of collation needed.

With that said, there is a performance impact to contractions, such that the proposed update text to Unicode 5.1 actually includes the following text. Here is the old text, slated to be removed:

Contractions are provided for those instances where a canonical decomposable character needed to be given a distinct primary weight in the main weight table, which implied that the canonically equivalent character sequences should also be given the same weights. These currently include Indic two-part vowels and with some Cyrillic accented characters, to match the expected collating behavior for those scripts. Contractions are also provided for Thai/Lao reordering.

And here is new text in the latest version under public review:

The Default Unicode Collation Element Table does not aim to provide precisely correct ordering for each language and script; tailoring is required for correct language handling in almost all cases. The goal is instead to have all the other characters, those that are not tailored, show up in a reasonable order. In particular, this is true for contractions, because the use of contractions can result in larger tables and significant performance degradation. While contractions are required in tailorings, in the Default Unicode Collation Element Table their use is kept to the bare minimum to avoid such problems.

In the Default Unicode Collation Element Table, contractions are required in those instances where a canonically decomposable character requires a distinct primary weight in the table, so that the canonically equivalent character sequences are also given the same weights. For example, Indic two-part vowels have primary weights as units, and their canonically equivalent sequence of vowel parts must be given the same primary weight by means of a contraction entry in the table. The same applies to a number of precomposed Cyrillic characters with diacritic marks and to a small number of Arabic letters with madda or hamza marks.

Contractions are also entered in the table for Thai and Lao logical order exception vowels. Because both Thai and Lao both have five vowels that are represented in strings in visual order, instead of logical order, they cannot simply be weighted by their representation order in strings. One option is to require preprocessing of Thai and Lao strings, to identify and reorder all logical order exception vowels around the following consonant. That approach was used in Version 4.0 (and earlier) of the UCA. Starting with Version 4.1 of the UCA, contractions for the relevant combinations of Thai and Lao vowel+consonant have been entered in the Default Unicode Collation Element Table instead.

Those are the only two classes of contractions allowed in the Default Unicode Collation Element Table. Generic contractions of the sort needed, for example, to handle digraphs such as "ch" in Spanish or Czech sorting, should be dealt with instead in tailorings to the default table -- in part because they often vary in ordering from language to language, and in part because every contraction entered into the default table has a significant implementation cost for all applications of the default table, even those which may not be particularly concerned with the affected script. See the Unicode Common Locale Data Repository (CLDR) for extensive tailorings of the DUCET for various languages, including those requiring contractions.

The upshot of this is that while it may be true that there is no technical limitation blocking the support of any language's needs within the DUCET, in practice there is a policy to limit contractions within the DUCET, due to the performance cost on all implementations to have such an addition.

People who need language-specific support, therefore, should turn to CLDR to get tailoring, and not expect that the DUCET will support all aspects of collation.
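To make the two classes of contraction in the quoted text a little more concrete, here is a minimal Python sketch. The weights in `TOY_TABLE` are invented purely for illustration (they are not DUCET values), and `primary_weights` and `reorder_thai_prevowels` are hypothetical helper names, not part of any real collation library. The first half shows why a canonically decomposable character (here the Malayalam two-part vowel sign O, U+0D4A) forces a contraction entry for its decomposed sequence; the second half sketches the preprocessing alternative that the quoted text says was used for Thai in UCA 4.0 and earlier.

```python
import unicodedata

# Class 1: a canonically decomposable character and its decomposed form
# are canonically equivalent, so they must end up with the same weights.
composed = "\u0D4A"                                   # MALAYALAM VOWEL SIGN O
decomposed = unicodedata.normalize("NFD", composed)   # U+0D46 U+0D3E

# Hypothetical primary weights, invented for this sketch (NOT real DUCET data).
TOY_TABLE = {
    "\u0D46": 0x100,         # MALAYALAM VOWEL SIGN E
    "\u0D3E": 0x101,         # MALAYALAM VOWEL SIGN AA
    "\u0D4A": 0x102,         # precomposed two-part vowel, weighted as a unit
    "\u0D46\u0D3E": 0x102,   # contraction: decomposed form gets the same weight
}


def primary_weights(text, table=TOY_TABLE):
    """Greedy longest-match lookup, so a contraction beats its single parts."""
    max_len = max(len(key) for key in table)
    weights = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in table:
                weights.append(table[chunk])
                i += length
                break
        else:
            raise KeyError(f"no weight for U+{ord(text[i]):04X}")
    return weights


# Class 2 (alternative approach): Thai vowels U+0E40..U+0E44 are stored
# before the consonant they logically follow. Preprocessing swaps them back.
THAI_PREVOWELS = {chr(cp) for cp in range(0x0E40, 0x0E45)}


def is_thai_consonant(ch):
    return "\u0E01" <= ch <= "\u0E2E"


def reorder_thai_prevowels(text):
    """Swap each Thai prevowel with the consonant that immediately follows it."""
    chars = list(text)
    i = 0
    while i + 1 < len(chars):
        if chars[i] in THAI_PREVOWELS and is_thai_consonant(chars[i + 1]):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)
```

The multi-character key in the toy table is what makes the lookup more expensive than a plain per-character index, which is the cost the quoted text is concerned about. A table that holds vowel+consonant contractions instead of relying on `reorder_thai_prevowels` trades that preprocessing pass for many more such keys.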

Now as I pointed out over three years ago, Microsoft does not use the Unicode Collation Algorithm. We just couldn't wait for it, given when we wanted to add collation support to Windows, especially since there was no way to know there would be something to wait for.

I have had it suggested to me in the past, by people both inside and outside of Microsoft, that details of the collation implementation and its data tables were requested over a decade ago and that the request was in fact turned down, which led to the eventual creation of the UCA (back in early 1997) without the work that Microsoft had done as any kind of source or basis. Though the two can often return the same results, since they are trying to model the same thing, the sometimes arbitrary nature of collation can definitely lead to substantial differences between them.

Anyway, although Microsoft does not use the UCA, its own equivalent of the DUCET -- its default collation table -- is currently implemented as a flat table covering the Unicode code values from 0x0000 to 0xFFFF -- thus there is no room within it for contractions (what we call compressions, a term that would have been confusing to use that way in the UCA due to the important discussion of sort key compression).

There is also no support for supplementary characters -- anything from 0x10000 to 0x10FFFF -- other than as surrogate pairs, which is how they are currently implemented -- this issue led directly to the way by which the mathematical sort (discussed in What is SORT_INVARIANT_MATH for?) was implemented, as that article discusses.
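As a rough illustration of what a flat table indexed by 16-bit code values implies, here is a minimal Python sketch. It is not Microsoft's actual implementation or data layout; `FLAT_TABLE`, `utf16_code_units`, and `lookup_weights` are hypothetical names, and the single weight stored below is made up. The point is only that each lookup sees one UTF-16 code unit at a time, so a supplementary character arrives as two surrogate code units and a multi-character contraction has no slot of its own.

```python
import array

# Hypothetical flat table: one 16-bit weight slot per UTF-16 code unit
# (0x0000..0xFFFF). All values here are placeholders, not real weights.
FLAT_TABLE = array.array("H", [0] * 0x10000)
FLAT_TABLE[ord("A")] = 0x0E02  # invented weight, for illustration only


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def lookup_weights(text):
    """One table slot per code unit: no entry can span two code units."""
    return [FLAT_TABLE[unit] for unit in utf16_code_units(text)]


# U+1D7CE MATHEMATICAL BOLD DIGIT ZERO is outside 0x0000..0xFFFF, so the
# table only ever sees its high and low surrogate as separate indexes.
bold_zero_units = utf16_code_units("\U0001D7CE")
```

For U+1D7CE the two indexes are 0xD835 and 0xDFCE. Any weighting of that character therefore has to be expressed through whatever is stored in the surrogate slots, which is the constraint the mathematical sort had to work around.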

I'm not really directly involved with any of those things anymore, though all things being equal if it were up to me I'd probably be inclined to just extend the flat table to be bigger than it currently is, though of course the size increase would make one want to rethink the flat table idea a bit and would lead to some mildly interesting decisions when it comes to high surrogates and low surrogates for data largely designed to support UTF-16-based functions....

For Microsoft, which can manage to avoid the public scrutiny that Unicode has to deal with for some of these issues, the decision of what to do with newly added scripts is a very real one (I talk about some of the issues in How does Microsoft assign new collation weights?).

But it is interesting how over the past few years these two completely different implementations have been moving closer to each other, for example in

and so on. They are getting closer to each other conceptually.

Now all that really remains that is different is the arbitrary nature of weight assignment, a difference that is unfortunate since it makes it impossible to (for example) use the tailoring defined in one implementation with the other, since they are both modifications of entirely different starting points (default tables).

Because of this, there is not much that Microsoft can do for Unicode other than give technical advice when UCA drafts come up, and not much that Unicode can do for Microsoft, since we can't use any of the CLDR-defined tailorings directly (we also kind of rely on our current model of working with linguists, native speakers, and language experts, which essentially amounts to a procedural difference in how data is added between the two platforms).

I wonder now, on the far side of that long-ago decision not to share data, whether it would have been better to share it, though to be honest I doubt the two would have been likely to match, given the many differences Microsoft has had over the years, like Korean re-ordering and not entirely compatible ideas of how to handle the default table and multiple language/script scenarios.

Would the situation be easier to unravel now? The world may never know....

 

 


# Esperance on 10 Feb 2008 8:51 AM:

Nice Work !

We hope that every language (encoding) would have the chance to be perfectly displayed.

# John Cowan on 10 Feb 2008 1:51 PM:

A similar problem is the fact that Windows time zone support goes its own way from the rest of the computing world.  The Olson time zone package isn't supported by a hairy industrial consortium as CLDR is, but it has over time become universal everywhere *but* Windows.

An interesting proposal on the leapsecs mailing list just crashed to a halt when I pointed out that there are two sources of timezone data, and that changing Olson wouldn't do squat for the zillions of Windows boxen out there.

# Michael S. Kaplan on 10 Feb 2008 2:54 PM:

I've been a lot less involved with time zones than I have been with collation and locales in the past, so it is harder for me to comment on what would be involved. I do know from having people describe the differences between our locales and the CLDR "competitive assessment" version of them that their image of MS locales has some flaws in it, and I know from talking to old IBM folks who had been claiming to have reverse engineered collation support that they are missing some things too. So it is easy to imagine real problems trying to plug the Olson data into Windows....

# Pavanaja U B on 11 Feb 2008 11:44 AM:

Thank God that Kannada sorting is perfect in MS. Maybe my relentless follow-up also has helped :-) (blowing my own trumpet :-))

Regards,

Pavanaja


