What's the difference between Tiếng Việt, Tiếng Việt, and Tiếng Việt? (other than the obvious, I mean)

by Michael S. Kaplan, published on 2012/02/29 05:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/02/29/10274340.aspx


After I wrote Do you know what Հայերեն, தமிழ், اُردو, and Tiếng Việt have in common? yesterday, I had a colleague who read it on her phone ask me why the third letter in Tiếng Việt (Vietnamese) didn't look right even though the eighth letter looked just fine.

There are, for example, three ways you might see that string, and here is how each one encodes that third letter:

Tiếng Việt    U+0065 U+0302 U+0301
Tiếng Việt    U+00ea U+0301
Tiếng Việt    U+1ebf

The first one uses Unicode Normalization Form "D" (which Microsoft seldom uses).

On the other hand, the third one uses Unicode Normalization Form "C" (which Microsoft usually uses, except in a few cases -- including Vietnamese).
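For anyone who wants to check the three encodings themselves, here is a minimal sketch using Python's standard `unicodedata` module. The variable names and the `code_points` helper are made up for illustration; this is not how Windows or the Vietnamese keyboard does it, just a way to see how the forms relate.

```python
import unicodedata

# The three encodings of the third letter, as listed in the table above.
fully_decomposed = "\u0065\u0302\u0301"  # e + combining circumflex + combining acute
partially_composed = "\u00ea\u0301"      # e-circumflex + combining acute
fully_composed = "\u1ebf"                # single precomposed letter


def code_points(text):
    """Return the code points of text as a 'U+XXXX U+XXXX' string (hypothetical helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


samples = [
    ("first", fully_decomposed),
    ("second", partially_composed),
    ("third", fully_composed),
]

for label, text in samples:
    nfd = unicodedata.normalize("NFD", text)
    nfc = unicodedata.normalize("NFC", text)
    print(f"{label}: raw={code_points(text)}  "
          f"NFD={code_points(nfd)}  NFC={code_points(nfc)}")
```

All three normalize to the same Form D sequence and the same Form C code point, so the second encoding is the odd one out: as stored, it matches neither form.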

That second form is the one whose origins are shrouded in Microsoft's hasty cover-up of the VNI debacle, also known as Code Page 1258.

But that code page is so seldom used directly now that it cannot rightfully be considered the source.

The real source is that damn keyboard, the one I talked about in blogs like What to do with the Vietnamese keyboard on Windows? and On my "Vietnamese Plus" and "pseudo-Form V" constructs. Which is of course based on the code page....

If we want to escape it, it would require a task inspired by the single most complicated keyboard layout that has ever been in Windows (previously described in The evolving Story of Locale Support, part 6: Behind the Cherokee Phonetic layout in Windows 8).

And then, by using chained dead keys, create a genuine Unicode Normalization Form C version of the Vietnamese keyboard on Windows....

This cannot be done for Windows 8 even if it were approved, as it would take way too much research, picking up Vietnamese hardware, and writing so many dead key tables that I might run into some never-before-hit buffer problem.

But I'll put it on my list of things to do at some point. We've really screwed up more than our fair share of Vietnamese text -- in sorting, in keyboard, and in code page....


Simon Buchan on 29 Feb 2012 2:19 PM:

I'm guessing IMEing up Vietnamese isn't the way to go :). Though in more general terms, I'd like to see keyboard input get more consistent (and by the sound of it, easier on the developer!) The fact that the keyboard is deciding the normalization form is a leaky abstraction!

Tom Gewecke on 1 Mar 2012 11:07 AM:

@Simon  I think quite a few people using Macs use an IM instead of a normal keyboard layout.  OS X comes with an IM called Unikey which lets you choose among telex, vni, and viqr modes to produce Unicode Vietnamese.  I don't know how well they work.

ar_niz on 2 Mar 2012 1:24 AM:

Hi.

Today I upgraded Win7 to Win8 consumer preview. Glad to see my locale Sindhi, unfortunately without a keyboard layout. I will request to please include Sindhi keyboard layout, as only the locale is not enough for people to use Sindhi effectively on their computers. Sindhi is spoken by 21m people (at least), and it would be shame to see cherokee (spoken by few thousand) keyboard there but not Sindhi.

Further, I am trying to install Urdu keyboard, but it doesn't get installed. When I add a new language such as Arabic or Persian, their associated keyboard layout is automatically included. But not so with Urdu. I tried to manually add the already available layout for Urdu, but after clicking on Save, it doesn't appear in the main list of languages.

Is it a bug? What can I do now?

Another problem is that I have installed MB Sindhi 2010 keyboard layout for Sindhi, designed by Majid Bhurgri. But it doesn't appear in the list of layouts for Arabic script. And I cannot see any other way to browse all available layouts to choose from. Why is that?

Duy Nguyen on 3 Mar 2012 9:04 AM:

Most Vietnamese use an IM (the mentioned unikey is quite popular). It works like chained dead keys, but after the main letter. For example, to type the third letter above, with VNI, it's <e> + <6> + <1>, with Telex it's <e> + <e> + <s>.


referenced by

2012/10/03 The temptation to channel Grumpy Code Reviewer can be almost overwhelming!

2012/09/18 How would *you* define debacle?
