Dude! Not so Lao'd!

by Michael S. Kaplan, published on 2010/05/06 07:01 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2010/05/06/10006797.aspx

There are times when one is hunting for information about a somewhat errant behavior, spelunking through code and data tables and emails and standards documents that one has not looked at for many years, and one can feel a bit off.

Almost the same way one feels when one is hungover after a wild night involving too many T&Ts (Tanqueray and tonics) for an adult male to consider healthy.

Even when one is completely sober at the time and in no way hungover.

So looking through these sources, one may channel that hungover guy when people talk to him and, instead of complaining:

Dude! Not so loud!

can take the fact that the research relates to Laotian and say:

Dude! Not so Lao'd!

So, after I blogged Sing. Sing a song. Sing it Lao'd (just in case the sort's still wrong!), I was not entirely certain that John Durdin would respond, but I was pretty sure.

And I was not disappointed. :-)

He responded with an amazing catalog of issues that have all been forwarded on to the right people, including a little bit about the sorting that I will be talking about today....

First, you will probably want to read the bottom half of that previous blog and especially the footnotes from John and Marc.

Then you can read the part of John's response that related to Lao sorting:

Sorting of tables in Office 2010 by Lao text keys is not consistent with any of the alphabetization rules normally used for sorting Lao. It roughly follows the (orthographic) sorting conventions used for Thai, but does not handle prefix vowels correctly, so that all words with prefix vowels are grouped at the end, instead of being grouped by the initial consonant. It has many of the same problems that beset Thai sorting - for example, consonants and vowels are ordered in a single sequence, so words with characters that can have either use often appear in unexpected places in the list. It is not clear what is being done with tone marks - they appear to result in the marked consonant being ordered before the unmarked consonant.

The most accepted method of sorting Lao (c.f. Kerr's 1972 Lao-English Dictionary) orders words according to syllables, then the sorting of each syllable is by initial consonant (or cluster), final consonant, vowel, tone mark (if any). This makes the Lao dictionary much easier to use than a Thai dictionary when the pronunciation of the word is known approximately, but not the exact spelling.
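To make that comparison order concrete, here is a minimal sketch. It assumes the words have already been split into syllables (the hard part, which no code here attempts), and it uses romanized placeholder letters rather than real Lao. The `Syllable` type, its field names, and the choice to sort an empty final or tone first are all assumptions made for illustration, not anything taken from Kerr's dictionary.

```python
from typing import NamedTuple


class Syllable(NamedTuple):
    # Hypothetical, pre-segmented syllable using romanized placeholders.
    initial: str      # initial consonant or cluster
    vowel: str
    final: str = ""   # empty when the syllable has no final consonant
    tone: str = ""    # empty when the syllable has no tone mark


def syllable_key(s: Syllable) -> tuple:
    # Compare by initial, then final, then vowel, then tone mark.
    return (s.initial, s.final, s.vowel, s.tone)


def word_key(word: list) -> tuple:
    # A word is ordered syllable by syllable.
    return tuple(syllable_key(s) for s in word)


ka = [Syllable("k", "a")]
ki = [Syllable("k", "i")]
kan = [Syllable("k", "a", "n")]

ordered = sorted([kan, ki, ka], key=word_key)
print(ordered == [ka, ki, kan])
```

Because the final consonant is compared before the vowel, the placeholder syllable "ki" lands ahead of "kan" even though "a" precedes "i".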

After I got his response I dug into the support added in Windows 7 a little bit.

And John is completely correct here.

Perhaps I should give a little background, though.

From almost the very beginning of the Unicode Collation Algorithm up through version TR10-11, released in January of 2004 around the time of Unicode 4.0, the specification contained text like the following:

3.1.3 Rearrangement

Certain characters are not coded in logical order, such as the Thai vowels เ through ไ and the Lao vowels ເ through ໄ (this list is indicated by the Logical_Order_Exception property). For collation, they are rearranged by swapping with the following character before further processing, since logically they belong afterwards. For example, here is a string processed by rearrangement:

input string: 0E01 0E40 0E02 0E03

normalized string: 0E01 0E02 0E40 0E03
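The rearrangement step quoted above can be sketched in a few lines. This is only an illustration of the swap as described, assuming the affected characters are exactly the two ranges the text names (Thai U+0E40 through U+0E44 and Lao U+0EC0 through U+0EC4); the function names are made up for this example.

```python
def is_prefix_vowel(ch: str) -> bool:
    """True for the Thai (U+0E40..U+0E44) and Lao (U+0EC0..U+0EC4) vowels
    named in the quoted text."""
    cp = ord(ch)
    return 0x0E40 <= cp <= 0x0E44 or 0x0EC0 <= cp <= 0x0EC4


def rearrange(text: str) -> str:
    """Swap each prefix vowel with the character that follows it."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if is_prefix_vowel(chars[i]):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def from_hex(s: str) -> str:
    return "".join(chr(int(h, 16)) for h in s.split())


def to_hex(s: str) -> str:
    return " ".join(f"{ord(c):04X}" for c in s)


print(to_hex(rearrange(from_hex("0E01 0E40 0E02 0E03"))))
# prints: 0E01 0E02 0E40 0E03
```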

Now Microsoft, in its own collation support, which does not implement the UCA, had three very important differences in this area:

1. We really didn't support Lao at all, and
2. Thai support was done by means of compressions (which the UCA calls contractions) rather than re-ordering, and
3. Since Thai support used compressions (which due to Microsoft's implementation cannot be in its default table), Thai sorting only worked when the Thai locale was specified.

Then some time before the next release, TR10-14 in May of 2005, there was a meeting (involving, if memory serves, Mark Davis, Ken Whistler, Cathy Wissink, and me, and maybe some others I don't remember) where Mark suggested that, rather than all of that reordering stuff, which was quite less than ideal from a processing standpoint, the UCA should instead define contractions that would produce the same results as the Thai and Lao reordering did.

Both Cathy and I pointed out that this was something we were essentially already doing and had been for years in Microsoft Windows, so we could hardly fault Unicode for making such a technical decision. :-)

And in that next release, the TR10-14 (May of 2005) version of the Unicode Collation Algorithm, the following text replaced the text that had been there before about Thai/Lao reordering:

3.1.3 Rearrangement

Certain characters are not coded in logical order, such as the Thai vowels เ through ไ and the Lao vowels ເ through ໄ (this list is indicated by the Logical_Order_Exception property in the Unicode Character Database [UCD]). For collation, they are rearranged by swapping with the following character before further processing, because logically they belong afterwards. This is done by providing these sequences as contractions in the Collation Element Table.

with of course the indicated changes in the Unicode Character Database.
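A rough sketch of how contractions can give the same result as the swap: every "prefix vowel + consonant" pair becomes a single table entry whose weights are already in logical order (consonant first, then vowel). The weights below are hypothetical values invented for this example, not real UCA or Windows data, and the table covers only four Thai characters; anything outside it is simply not handled.

```python
# Hypothetical primary weights, for illustration only.
PRIMARY = {
    "\u0E01": 10,  # THAI CHARACTER KO KAI
    "\u0E02": 11,  # THAI CHARACTER KHO KHAI
    "\u0E03": 12,  # THAI CHARACTER KHO KHUAT
    "\u0E40": 50,  # THAI CHARACTER SARA E (a prefix vowel)
}
PREFIX_VOWELS = {"\u0E40"}


def build_table(primary: dict, prefix_vowels: set) -> dict:
    """Single characters plus one contraction per vowel+consonant pair."""
    table = {ch: (w,) for ch, w in primary.items()}
    for v in prefix_vowels:
        for c, w in primary.items():
            if c not in prefix_vowels:
                # Contraction: the consonant's weight comes first.
                table[v + c] = (w, primary[v])
    return table


def sort_key(text: str, table: dict) -> tuple:
    """Prefer a two-character contraction, else fall back to one character.
    Characters missing from the toy table raise KeyError."""
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in table:
            key.extend(table[pair])
            i += 2
        else:
            key.extend(table[text[i]])
            i += 1
    return tuple(key)


def naive_key(text: str) -> tuple:
    """No contractions: weights in stored (visual) order."""
    return tuple(PRIMARY[ch] for ch in text)


TABLE = build_table(PRIMARY, PREFIX_VOWELS)
words = ["\u0E02", "\u0E40\u0E01"]
print(sorted(words, key=naive_key) == ["\u0E02", "\u0E40\u0E01"])
print(sorted(words, key=lambda w: sort_key(w, TABLE)) == ["\u0E40\u0E01", "\u0E02"])
```

Without the contractions, the word that starts with a prefix vowel sorts after everything that starts with a consonant; with them, it is grouped under its initial consonant.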

Of course Microsoft still had the remaining two differences when compared to the UCA:

1. We really didn't support Lao at all, and
2. Since Thai support used compressions, which due to Microsoft's implementation cannot be in its default table, Thai sorting only worked when the Thai locale was specified.

But since Microsoft was shipping no Lao fonts and had no Lao locale, this lack was not considered to be too terribly high by way of priority.

The Vista and Windows 7 story can be summed up in Despite progression, the bug calls out to me quite LAOdly, the aforementioned Sing. Sing a song. Sing it Lao'd (just in case the sort's still wrong!), and John Durdin's comment, quoted earlier in this blog.

After looking at everything, the following conclusions can be stated authoritatively:

1. Microsoft did not add the appropriate support for "Lao preceding vowel weight reordering". This support, which Microsoft added for Thai before the UCA was even originally written, was apparently skipped for Lao entirely (the 2-to-1 compressions added in Windows 7 were only consonant-vowel combinations). Try to imagine if over 1/3 of the Thai collation support were simply missing.

2. Microsoft improperly weighs the Lao tone marks as having Alphabetic Weight (AW) before all of the other letters. In contrast to Thai, which adds tone marks at the end of appropriate compressions to slightly alter the overall weight of Thai sort elements, the Lao tone marks will generally not cause affected syllables to be placed appropriately since primary weight differences are being added to the end of those syllables. Try to imagine if U+0301 (COMBINING ACUTE ACCENT) were treated as a full letter that sorted right before A.

3. Neither Microsoft nor Unicode supports the "most accepted method of Lao sorting" (cf. Kerr's 1972 Lao-English Dictionary). Given the table-based methods that both algorithms use, what this dictionary does for Lao (which would require real analysis to determine syllable boundaries) is pretty much impossible to support without either insanely big tables and lots of "look ahead" and "jump back" logic, or else the kind of pre-processing step that Microsoft has never had and Unicode specifically worked to drop.

So #3 is really out of scope for general purpose implementations on both sides, but #1 and #2 will tend to baffle people looking at the Lao collation support in Windows 7, which matches neither dictionaries nor the full orthographic order supported by Thai.
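The U+0301 analogy in #2 can be played out in a toy example. This is a generic sketch of two weighting schemes, not the actual Windows or UCA implementation: the letter weights are made up, and the "secondary level" key is just one simple way to let a mark break ties without outranking letters.

```python
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# Hypothetical primary weights for three letters.
PRIMARY = {"a": 1, "b": 2, "e": 5}


def key_mark_as_letter(text: str) -> tuple:
    """The mark is a full letter that sorts right before 'a'."""
    return tuple(0 if ch == ACUTE else PRIMARY[ch] for ch in text)


def key_mark_as_secondary(text: str) -> tuple:
    """Letters decide first; the mark only breaks ties afterwards."""
    primary, secondary = [], []
    for ch in text:
        if ch == ACUTE:
            if secondary:
                secondary[-1] = 1  # attach the mark to the preceding letter
        else:
            primary.append(PRIMARY[ch])
            secondary.append(0)
    return (tuple(primary), tuple(secondary))


words = ["ea", "e" + ACUTE + "b", "e" + ACUTE + "a"]
as_letter = sorted(words, key=key_mark_as_letter)
as_secondary = sorted(words, key=key_mark_as_secondary)
print(as_letter == ["e" + ACUTE + "a", "e" + ACUTE + "b", "ea"])
print(as_secondary == ["ea", "e" + ACUTE + "a", "e" + ACUTE + "b"])
```

When the mark carries a primary weight below every letter, both marked words jump ahead of plain "ea", which is the same shape as the marked consonant sorting before the unmarked one.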

I'll admit that even years out and conceptually even further away from the code, things like this pain me. Like I want to apologize to a country or something.

Irrational, I suppose.

Yet still....

# John Cowan on Thursday, May 06, 2010 10:19 AM:

*sigh*

Of course it's INCONCEIVABLE that anyone with Thai or Lao data might want to DO The Right Thing with it, even in a non-Thai (or non-Lao) locale.

# Michael S. Kaplan on Thursday, May 06, 2010 2:20 PM:

I call shenanigans on the premise of your statement, John; this happens for EVERY OTHER LANGUAGE; the only reason Unicode takes the hit here is their guilt over the encoding model. If you aren't going to fix Hindi or every other language that would benefit, then dinging Microsoft for being unable to do something architecturally that for 99% of all cases is a non-goal for Unicode is not realistic....

