Sing. Sing a song. Sing it Lao'd (just in case the sort's still wrong!)

by Michael S. Kaplan, published on 2010/04/17 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/04/17/9997539.aspx

If you can believe the stats, then this Blog ranks #32 of all the blogs on the server, and if you take off the team blogs for a moment (like IE and Excel and Outlook and such), then it is like #14.

If you don't count Raymond's bump late last month that awoke a March 31st reddit burp and ycombinator belch, then the numbers are more accurately #34 and #16, for what it's worth. That'll sort itself out soon enough when the UNIX heads realize I'm not teasing them with every blog!

Like when the girl I'm going out with says she thinks I'm hot. I never know what to do to that, since I don't think I am. They certainly didn't all say it.

My self image is a bit like that Cerebus picture in the corner; I think they may eventually see me that way.

But the looking hot thing is weird, like these numbers. And the numbers are even weirder than the girlfriends since the numbers don't seem to be breaking up with me quite as readily even when I abandon them for months for no good reason, or have an online nervous breakdown over the death of a friend, and no matter how outrageous I've been over these last few years¹.

Maybe the Blog fills some unique niche; I had a reader point out to me that if you look at the Official Google Blog they have 27 posts on accessibility, 2 on Africa, 8 on Asia, 24 on Europe, 8 on Geo, 11 on India, 7 on Latin America, and 0 on Unicode, 0 on Globalization, 0 on internationalization, 0 on localization, 0 on translation, 0 on localizability, 0 on language, and 0 on linguistics.

I pointed out that this might be unfair since some of those regional tags might point to relevant content. And they did have the blog about the goats, which was kind of cool and I'll give them props for.

Anyway, if you compare the number of comments per blog on this Blog, the numbers are much lower than most of the others on the server -- even ones lower in the stats.

I can't quite figure it -- maybe people pop in and leave when they realize I'll be talking about Mongolian or uppercase or song lyrics or iBots or Unicode or whatever, so they never make it to the bottom of the page where the Comments box is.

Actually, I know that probably isn't what it is -- the issue is that the majority of the traffic is people coming off of Bing or Google searches. More and more Bing, by the way -- and not just from the search box here. Like Bing from outside the server.

So when I talk about "regular readers" as if there are many of them, the number is probably not as huge as the "rank" might indicate; people just show up here searching for answers to some random thing, and I happen to write about all kinds of random things. So Presto, people get their answer and go.

I'll keep the myth going and talk about regular readers as if you've all been here for the last half decade or so, like me.

Anyway, if you've been around here for a few years at least, you may remember a little over two years ago, when I reported (in Despite progression, the bug calls out to me quite LAOdly) how the Laotian sorting in Windows did well on the consonants but not so much on vowels and tone marks. And that it was essentially broken.

In comments, John Durdin³ and Marc Durdin⁴ expressed concern that even if the bug did not exist it is likely that more would need to be done and that Lao sorting would not have looked right if it was just a matter of adding the proper weights to these code points; some compressions (what the UCA calls contractions) would be needed.

We do that with Thai (299 2-to-1 compressions and 230 3-to-1 compressions, which when applied put us within conformance range with the Royal Thai sort).
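For anyone who has not run into compressions before, here is a minimal sketch of the idea in Python. It is not how the Windows tables are built, and every weight in it is made up for illustration: the toy table treats "ch" as a single sorting unit (a 2-to-1 compression) with its own weight between "c" and "d", and the hypothetical `sort_key` function does a greedy longest-match lookup.

```python
# Toy weight table. All values are hypothetical, chosen only for this example;
# they do not come from any real collation table.
WEIGHTS = {"a": 1, "b": 2, "c": 3, "ch": 4, "d": 5, "h": 6, "o": 7}


def sort_key(word, weights=WEIGHTS, max_len=2):
    """Build a sort key by greedy longest-match against the weight table.

    max_len=2 allows 2-to-1 compressions such as "ch";
    max_len=1 ignores them and weights every character on its own.
    """
    key = []
    i = 0
    while i < len(word):
        # Try the longest candidate sequence first, then shorter ones.
        for n in range(max_len, 0, -1):
            chunk = word[i:i + n]
            if len(chunk) == n and chunk in weights:
                key.append(weights[chunk])
                i += n
                break
        else:
            raise KeyError(f"no weight for {word[i]!r}")
    return tuple(key)


words = ["dab", "cha", "coda", "cab"]
with_compression = sorted(words, key=sort_key)
without_compression = sorted(words, key=lambda w: sort_key(w, max_len=1))
```

With the compression in play, "cha" sorts after "coda" because "ch" outweighs plain "c"; with it switched off, "cha" lands before "coda". A 3-to-1 compression would be the same lookup with `max_len=3` and three-character entries in the table.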

So obviously they, knowing a hell of a lot more than I do about Laotian, would be likely to be correct.

Windows 7 added a 464-entry 2-to-1 Laotian compression table covering a huge array of letter and vowel combinations.

And the fact that the MAI EK, MAI THO, MAI TI, and MAI CATAWA tone marks were given alphabetic (primary) weight....

And the fact that there are no 3-to-1 or 4-to-1 compressions to allow for the clustering requirements they both suggested (which is pretty much how we do it for Thai)....

All of that makes me suspect that Lao sorting may well be closer to goal but is likely still off the mark in Windows 7.
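To make the tone mark concern concrete, here is a small sketch under loudly stated assumptions: the letters, the apostrophe standing in for a tone mark, the weights, and the two key functions are all hypothetical, and the "secondary" variant is only one possible alternative, not a description of what any Windows table does. It compares a key where the mark carries its own primary weight against a key where the mark is only a tie-breaker after the letters.

```python
# Hypothetical primary weights for three toy letters.
LETTERS = {"a": 1, "b": 2, "c": 3}
MARK = "'"  # toy stand-in for a tone mark


def key_mark_primary(word):
    """The mark is weighted like a letter (here: heavier than any letter)."""
    weights = dict(LETTERS)
    weights[MARK] = 9  # made-up value
    return tuple(weights[ch] for ch in word)


def key_mark_secondary(word):
    """Letters decide first; the mark only breaks ties between equal letters."""
    primary = tuple(LETTERS[ch] for ch in word if ch != MARK)
    # Crude secondary level: 1 where a mark occurs, 0 elsewhere.
    secondary = tuple(1 if ch == MARK else 0 for ch in word)
    return (primary, secondary)


words = ["a'b", "ac", "ab"]
primary_order = sorted(words, key=key_mark_primary)
secondary_order = sorted(words, key=key_mark_secondary)
```

In this toy setup the primary-weight key pushes "a'b" after "ac", while the tie-breaker key keeps "a'b" next to "ab". Which of those a Lao reader would expect is exactly the kind of thing I would need the experts to tell me.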

I will have to wait to hear from some of these very interested people on how close it ends up being -- it may also be right under the "hundreds of wrong answers that give you a right behavior" principle that table based collations can sometimes bring to the mix, which are unsatisfying to linguists since hundreds of wrongs shouldn't make a right, but I tend to be okay with due to being more results driven....

1 - With the arguable high point being when a VP asked why I wasn't being fired and one HR Generalist offered me an only mildly insulting RIF. Though he left and so did she, so I guess that movement kind of fizzled out.

2 - Or looking for people who think I'm hot.

3 - Spake John: "Conventions for sorting are probably still not fully accepted in Lao PDR, but sorting according to the rules given in Kerr's 1972 Lao-English Dictionary is widely followed. The algorithm is (primarily) phonetic, unlike Thai, which uses an orthographic sorting approach. From a user's perspective, it is much easier, since you can find a word in a dictionary without knowing how it is spelled. Most Thai university students do not use a dictionary effectively - if you don't know how the word is spelled, it can be quite hard to find it (and Thai, like English, has very irregular spelling). The problem with the Lao approach is that words (or text) *must* be split at syllable boundaries (reasonably well) before determining the sorting key for each syllable, which adds computational complexity, but can be done."

4 - Spake Marc: "I'm not quite sure I can see how the sorting can even kinda work without taking each syllable as a whole. Do you work on a syllable-by-syllable basis? Unless you take each syllable as a whole (initial consonant, final consonant, vowel, tone), the sorting just won't work. And because there can be ambiguity with final consonants and open syllables, you really need to split each syllable before sorting."

Well, from outside this server. :-)

I figure that searches coming from the one hosted on the page are mildly interesting but give no real indication of what people are using to search for stuff on the Internet!

For me your blog is a valuable source of information for someone who codes for non-English UI. Sometimes it contains information that enables me to evade potential errors in the code. And for some of that information, it is difficult to form effective queries in search engines.

Looking up a blog for information takes max. 30 seconds a day, but saves you potentially days of debugging. Doesn't sound like a bad deal, right?

That said, I have to say I found some blog entries quite interesting / thought provoking. Thank you for providing us so much stuff of dubious value.

Well, as I say in the about page, at times I myself tend to feel like a misanthropic anthropomorphic three-foot tall bipedal gray aardvark....
