Despite progression, the bug calls out to me quite LAOdly

by Michael S. Kaplan, published on 2008/02/23 15:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/02/23/7861817.aspx

In the early days of my ownership of the collation functionality, I did have a bit of an inferiority complex about the more linguistic aspects of the work.

So I would talk about my delusions of linguistic aptitude and how I was the architect of all the collations that required no linguistic knowledge (being algorithmically derived), while looking on at the process by which the reverse engineering of dictionaries in fact took place and being amazed at the results. And, to be frank, amazed by the people who were doing the work, who held some kind of secret knowledge that I didn't have and which (unlike most technologies and functionalities I run across in my work) I probably never would (blogs like Some sort of order to collation and Collation can actually be linguistic aside, of course -- they did tend to prove my fascination more than my abilities -- that I had become a sortophile of sorts).

And of course, as some of the people who had been doing this work for years were moving on to new challenges, I got to see firsthand that I was not the only one who didn't fully understand how to do it; we even had a false start or two as some people tried to do the work but were simply unable to handle the linguistic side that I (recognizing that I couldn't -- or didn't think I could -- do it) kept myself from trying to do.

Eventually others came in to fill those shoes, people who still had those seemingly intrinsic abilities, even as they learned the new abilities to bring the feature to Windows with new languages.

And in those early days that I usually think of as B.R. (Before Ryan), when Ryan was not yet the collation tester, we were kind of between testers for the area. So the automation that was there kept running, but we did not have someone looking at new things with their tester eyes just yet....

Somewhere in all of that, between the outgoing and the incoming who didn't quite get it and the incoming who did get it but was still learning how to use it and not having a great tester in place and the one who knew he didn't get it but was the one charged with checking in the final results, a bug or two could (and did) slip in, as I am sure you might be able to imagine....

In particular, the collation for Lao is kind of broken for both vowels and tone marks.

And as anyone who knows Lao can tell you, sorting Lao without the expected results for vowels and tone marks is a bit like trying to wash your foot with your shoes and socks on -- you may get things wet but it won't be very effective and later on quite uncomfortable!

Since I am the one who checked it in, I actually take full responsibility, though as I have come to expect from working with competent people and bugs like this, the program manager who used to work on the data but never saw the data in this case, the program manager who created the data while learning how the data was supposed to be created, and even the tester who was not even working in the area at the time could just as easily try to make the same claim and, to be honest, often do....

I get to win since it is my name in the checkin log, a dubious pleasure at best.

Kind of puts blogs like Not so Lao[d], at least not until Vista in perspective, though....

The various vowels and tone marks I mean are:

U+0ec8 ່ LAO TONE MAI EK
U+0ec9 ້ LAO TONE MAI THO
U+0eca ໊ LAO TONE MAI TI
U+0ecb ໋ LAO TONE MAI CATAWA
U+0ec6 ໆ LAO KO LA
U+0eb0 ະ LAO VOWEL SIGN A
U+0eb2 າ LAO VOWEL SIGN AA
U+0eb4 ິ LAO VOWEL SIGN I
U+0eb5 ີ LAO VOWEL SIGN II
U+0eb6 ຶ LAO VOWEL SIGN Y
U+0eb7 ື LAO VOWEL SIGN YY
U+0eb8 ຸ LAO VOWEL SIGN U
U+0eb9 ູ LAO VOWEL SIGN UU
U+0ec0 ເ LAO VOWEL SIGN E
U+0ec1 ແ LAO VOWEL SIGN EI
U+0ec2 ໂ LAO VOWEL SIGN O
U+0ebb ົ LAO VOWEL SIGN MAI KON
U+0ec4 ໄ LAO VOWEL SIGN AI
U+0ec3 ໃ LAO VOWEL SIGN AY
U+0eb3 ຳ LAO VOWEL SIGN AM
U+0eb1 ັ LAO VOWEL SIGN MAI KAN
U+0ebc ຼ LAO SEMIVOWEL SIGN LO
U+0ebd ຽ LAO SEMIVOWEL SIGN NYO

The bug is an interesting one, since the results are much worse when comparing sort keys generated via LCMapString calls than when comparing the strings directly via CompareString calls: the former produces embedded 0x00 byte values in the middle of the sort key, while the latter simply produces results that are a bit off (most noticeable in huge sorted lists).
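To see why an embedded 0x00 is so much worse than an ordering that is merely a bit off, here is a rough sketch in Python. It assumes the sort keys are compared the way C strings are (byte by byte, stopping at the first 0x00), which is a common way to compare them. The key bytes below are invented purely for illustration and are not real Lao weights; strcmp_style and memcmp_style are hypothetical helper names, not Windows APIs.

```python
def strcmp_style(a: bytes, b: bytes) -> int:
    """Compare two keys the way C's strcmp would: everything after the
    first 0x00 byte in either key is ignored."""
    a = a.split(b"\x00", 1)[0]
    b = b.split(b"\x00", 1)[0]
    return (a > b) - (a < b)


def memcmp_style(a: bytes, b: bytes) -> int:
    """Compare the full byte arrays, embedded 0x00 bytes included."""
    return (a > b) - (a < b)


# Hypothetical keys (made-up values): identical up to a stray 0x00 in the
# middle, with the only distinguishing weight sitting right after it.
key_x = bytes([0x0E, 0x21, 0x00, 0x05, 0x01, 0x01, 0x01, 0x00])
key_y = bytes([0x0E, 0x21, 0x00, 0x09, 0x01, 0x01, 0x01, 0x00])

print(strcmp_style(key_x, key_y))  # the difference after the 0x00 is never seen
print(memcmp_style(key_x, key_y))  # the full-length comparison still sees it
```

Under that assumption, anything in the key after the stray 0x00 never takes part in the comparison at all, so two strings that ought to sort differently come out as equal.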

Now since those early days everyone involved has progressed:

I did manage to get over my inferiority complex (upgrading my delusions of linguistic aptitude to notions thereof);
the program manager who did this work in prior versions has proven to be simply amazing in newer roles;
the program manager who didn't quite get it left the company (presumably for other reasons, no matter how far-reaching the effects of collation may be!);
the other program manager who didn't quite get it found a job they did get where they are much happier;
the tester proved to be a freaking Rembrandt;
the later tester who picked up the area proved to also be quite the artist (and is the one who first reported this bug, over a year after Vista had shipped);
the program manager who produced the data proved more than able to do many of the later collations with both skill and inspiration.

But the legacy of the not quite correct vowels and tones in Lao calls out to me quite LAOdly....

This blog sponsored by all of the above cited characters, not only in gratitude for the chance to be properly noticed in this new world but also in the hope of future adjustments

John Durdin on 25 Feb 2008 2:45 AM:

Conventions for sorting are probably still not fully accepted in Lao PDR, but sorting according to the rules given in Kerr's 1972 Lao-English Dictionary is widely followed. The algorithm is (primarily) phonetic, unlike Thai, which uses an orthographic sorting approach. From a user's perspective, it is much easier, since you can find a word in a dictionary without knowing how it is spelled. Most Thai university students do not use a dictionary effectively - if you don't know how the word is spelled, it can be quite hard to find it (and Thai, like English, has very irregular spelling). The problem with the Lao approach is that words (or text) *must* be split at syllable boundaries (reasonably well) before determining the sorting key for each syllable, which adds computational complexity, but can be done.

Michael S. Kaplan on 25 Feb 2008 11:22 AM:

We actually had the sorting figured out okay, just without the vowels there to act as secondary distinctions and the tones to act as tertiary distinctions, it doesn't work....

Marc Durdin on 25 Feb 2008 5:39 PM:

I'm not quite sure I can see how the sorting can even kinda work without taking each syllable as a whole. Do you work on a syllable-by-syllable basis?

Unless you take each syllable as a whole (initial consonant, final consonant, vowel, tone), the sorting just won't work. And because there can be ambiguity with final consonants and open syllables, you really need to split each syllable before sorting.

Michael S. Kaplan on 25 Feb 2008 6:22 PM:

Well, like I said it turns out it does not work, for the above stated reasons. But had those mistakes not been made, it would have.

And there is hope for the future, at least....


referenced by

2010/05/06 Dude! Not so Lao'd!

2010/04/17 Sing. Sing a song. Sing it Lao'd (just in case the sort's still wrong!)
