On my "Vietnamese Plus" and "pseudo-Form V" constructs

by Michael S. Kaplan, published on 2010/01/12 08:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/01/12/9946880.aspx


Developer Jason (an enthusiastic reader of the Blog) asked:

We need to be able to convert UCS-2/UTF-16 to a user-specified SBCS/DBCS/MBCS code page. Currently, we achieve this by simply taking the UCS-2 string and passing it on to WideCharToMultiByte with dwFlags set to zero. When converting to the Vietnamese code page 1258, this process can’t find a representation for the Vietnamese character U+1ec5 (Latin e with circumflex and tilde) even though one actually does exist (albeit with a combining diacritic from code page 1258: 0xea 0xde).

Converting Vietnamese glyphs from the Unicode BMP to the corresponding glyph representation in the Vietnamese code page seems like a reasonable thing for us to be doing. My question is, should I be expecting WideCharToMultiByte to know this and successfully convert the character? I can’t be the first person to hit this issue and I imagine the mapping tables have been reasonably static, so it seems like perhaps there is something more that I should be doing. Is there, for instance, an expectation that the input string is normalized into some canonical form before calling WCToMB? Presumably decomposed form?

An interesting question that will really draw on information from several different blogs from this Blog:

  1. A few of the gotchas of MultiByteToWideChar
  2. The MB_PRECOMPOSED flag is stupid, and the MB_COMPOSITE ain't no genius either
  3. Getting intermediate forms
  4. Harder intermediate forms of characters
  5. Frost's The Form Not Taken

There are several people who tend to be dismissive about this code page, calling it at best incomplete and at worst broken. From a Unicode standpoint it certainly is, and arbitrary, to boot!

But there is a reasoning behind the code page, a point to which regular reader John Cowan's comment to blog #5 above is particularly relevant:

There's a Vietnamese-specific logic to CP 1258 that transcends the arbitrary Unicode normalization rules.  The breve, circumflex, and horn accents, unlike the rest, affect vowel quality.  If you look at a Vietnamese alphabet like the one at Wikipedia, you'll see that A WITH BREVE, A WITH CIRCUMFLEX, E WITH CIRCUMFLEX, O WITH CIRCUMFLEX, O WITH HORN, and U WITH HORN (as well as D WITH STROKE, which isn't Unicode-decomposable) are considered separate letters from their unaccented correspondents. Consequently, in 1258 they are encoded using seven precomposed characters.

On the other hand, the grave, acute, hook above, tilde, and dot below accents are tone marks, conceptually not part of the letters they appear on.  They're encoded using combining characters, since encoding them using precomposed characters would create a combinatorial explosion of 12 x 6 x 2 = 144 distinct vowel characters.  (The VISCII encoding actually does that, at the expense of filling the whole 0x80-0xFF space with letters and even usurping six of the control characters!)

Unsurprisingly, Vietnamese conventions always place the tone mark outside any breve, circumflex, or horn diacritic (and therefore following it according to Unicode rules).

Thus it is incorrect to say that U+1ec5 is not supported by cp1258; it may be true that U+1ec5 (ễ, aka LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE) is not supported as a discrete single code point, but U+00ea U+0303 (ễ, aka LATIN SMALL LETTER E WITH CIRCUMFLEX + COMBINING TILDE) is - and according to Unicode those two things are to be treated as the same thing. Given that the tilde is a tone mark in this particular case the language-specific way in which the various components of the letter are used makes sense whether the form follows Unicode normalization rules or not.

Which in this case it doesn't.

Thus the wider question of which Unicode normalization form to use that was one of the main points of Jason's inquiry is in fact a trick question: the answer is neither!
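To make the "neither" concrete, here is a small sketch that uses Python's built-in cp1258 codec as a stand-in for the Windows conversion. That substitution is an assumption: Python's codec is not WideCharToMultiByte and may differ from it in details. The helper name encodes_to_cp1258 is made up for the illustration.

```python
import unicodedata

def encodes_to_cp1258(text):
    """Return the cp1258 bytes for text, or None if a code point has no mapping."""
    try:
        return text.encode("cp1258")
    except UnicodeEncodeError:
        return None

precomposed = "\u1ec5"                             # ễ as one code point
nfc = unicodedata.normalize("NFC", precomposed)    # stays U+1EC5
nfd = unicodedata.normalize("NFD", precomposed)    # e + U+0302 + U+0303
vietnamese = "\u00ea\u0303"                        # ê + combining tilde

# Show the code points of each form and whether the codec can encode them.
for label, s in [("NFC", nfc), ("NFD", nfd), ("ê + tilde", vietnamese)]:
    print(label, [f"U+{ord(c):04X}" for c in s], encodes_to_cp1258(s))
```

With this codec, only the last form, the one that is neither Form C nor Form D, comes back as the bytes 0xEA 0xDE that Jason mentioned.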

Instead, the Microsoft-specific normalization pseudo-Form V mentioned in #5 above is what would be needed here if one wanted to convert.

Now that is a big if in that last sentence.

Since Microsoft's Vietnamese keyboard layout produces text that will be perfectly represented on code page 1258, there are only three scenarios where one would need that "pseudo-Form V" to convert out of Unicode:

For the third point the quick answer is to just not do that, if it is possible.

But of course even that is not always possible, so if the sad truth is that some component that cannot be changed is putting the data in some other form, then some type of conversion from [probably] normalization Form C would be needed.

No such conversion exists, though the only thing it requires that a single-byte code page such as 1258 cannot handle is the case where one code point needs to be converted to two, e.g.

U+1EC5 (ễ) to U+00EA U+0303 (bytes 0xEA 0xDE in code page 1258)

and so on through all the other various letters covered by the code page.
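As a rough sketch of what such a one-to-two conversion might look like, the following Python follows the rule from John Cowan's comment quoted earlier: the vowel-quality marks (breve, circumflex, horn) stay composed with the letter, and the five tone marks are left as trailing combining characters. The function name to_pseudo_form_v and the exact rule are assumptions for illustration; this is not Microsoft's definition of pseudo-Form V, and Python's cp1258 codec stands in for the Windows code page.

```python
import re
import unicodedata

# The five tone marks: grave, acute, tilde, hook above, dot below.
TONE_MARKS = "\u0300\u0301\u0303\u0309\u0323"

# One code point followed by any combining marks from U+0300..U+036F.
_CLUSTER = re.compile(r".[\u0300-\u036f]*", re.DOTALL)

def to_pseudo_form_v(text):
    """Hypothetical helper: quality marks stay composed, tone marks stay combining."""
    out = []
    # Start from NFC so that each letter and its marks form one cluster.
    for cluster in _CLUSTER.findall(unicodedata.normalize("NFC", text)):
        decomposed = unicodedata.normalize("NFD", cluster)
        tones = "".join(c for c in decomposed if c in TONE_MARKS)
        rest = "".join(c for c in decomposed if c not in TONE_MARKS)
        # Recompose everything except the tone marks, then append the tones.
        out.append(unicodedata.normalize("NFC", rest) + tones)
    return "".join(out)

converted = to_pseudo_form_v("\u1ec5")            # ễ as a single code point
print([f"U+{ord(c):04X}" for c in converted])     # ['U+00EA', 'U+0303']
print(converted.encode("cp1258"))                 # b'\xea\xde'
```

One limitation worth noting: this splits off every tone mark, including on letters such as á or ñ where a single-byte form may also be available, so its output is convertible but not necessarily identical to what the keyboard layout would have produced.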

Unfortunately a simple, table-based double byte code page could not properly support such a custom "Vietnamese Plus" code page mapping.

EXTRA CREDIT: Can anyone here discern and/or explain why, exactly? :-)

Thus one could build a DLL-based mapping (as in Custom code pages? Redux) and just keep these tables around in code if one wanted to. But one would obviously have to have some vested interest in wanting to (e.g. a need to support cp1258 data alongside data in Unicode that isn't currently in pseudo-Form V).

I was most of the way to having this done (auto-generated) to post as a sample before it occurred to me that there might be very good reasons for a full-time Microsoft employee, even a pain in the ass one like me, not to post such a thing.

Though if anyone wants to do it, note that I was using cp 51258, for obvious reasons.

If you wanted to create such a DLL-based code page and there is any way to create a standard usage out of a non-standard/unsupported code page, I would encourage you to do the same! :-)

Now for the record let me say this is an area where I do not really tend to agree with the Microsoft party line completely. I mean, I truly believe that Unicode is the best answer here in the long run, but I am hardly naive enough to believe that everyone has made that change yet and surprisingly [to some] not obnoxious enough to think it is acceptable to do nothing further to assist customers. Especially when we expect people to migrate and we know we aren't the most popular non-Unicode solution, the fact that we provide no assistance here and aren't even remotely apologetic as we vote to make Unicode less and less compatible with our own solution even as we make it harder to use is really not my style. To be honest, the fact that we do not have a better solution for integrating with Unicode in the Vietnamese case is also pretty bad -- there is not even the excuse of backcompat; the only explanation is that no one wants to do the work because supporting Vietnamese correctly and more consistently with Unicode just doesn't hit anyone's radar. So no one wants to use us and the problem perpetuates itself.

When you consider in particular the history of Microsoft in regard to VNI, it just makes Microsoft look worse. Perhaps there are even legal reasons related to the VNI thing that we are required to suck here that no one has told me about?

But anyway and either way, that is how things are right now, so my dissenting opinion is unlikely to reach any higher level than the blog post you are reading....


# John Cowan on 12 Jan 2010 11:03 PM:

Answer: in a DBCS it has to be possible to distinguish lead bytes of a multi-byte sequence from stand-alone bytes.  Unfortunately, "a" is not always a lead byte: sometimes (when there is no tone mark, indicating the mid-level tone) it's a stand-alone byte.
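The point about lead bytes can be checked mechanically. The sketch below uses a tiny, hypothetical slice of a "Vietnamese Plus" table (the byte values are the cp1258 ones, but the table and the function name are illustrative only) and looks for bytes that would have to be both a complete stand-alone byte and the first byte of a two-byte sequence. It assumes the simple table-driven DBCS model in which the first byte alone decides whether a second byte follows.

```python
# Hypothetical slice of a "Vietnamese Plus" mapping: byte sequence -> code point.
TABLE = {
    b"a": "a",               # plain a, no tone mark
    b"a\xf2": "\u1ea1",      # a + dot below  -> ạ
    b"\xea": "\u00ea",       # ê, no tone mark
    b"\xea\xde": "\u1ec5",   # ê + tilde      -> ễ
}

def ambiguous_lead_bytes(table):
    """Bytes that are both a one-byte entry and the first byte of a two-byte entry."""
    singles = {seq[0] for seq in table if len(seq) == 1}
    leads = {seq[0] for seq in table if len(seq) == 2}
    return singles & leads

print(sorted(hex(b) for b in ambiguous_lead_bytes(TABLE)))
```

Even in this four-entry slice, both 0x61 and 0xEA come back as ambiguous, which is exactly the conflict a lead-byte table cannot express.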

The nasty question: Given Unicode text that isn't in normalization form V, can it still be converted to CP1258  by a Windows interface, and if so, which interface?

# Michael S. Kaplan on 13 Jan 2010 7:41 AM:

John - exactly!

Unfortunately, there is no way to convert the non form V text to cp1258, which is why I think this code page should, in fact, exist.

# Matthew Slyman on 16 Apr 2013 4:07 AM:

...This is why we need chained dead keys, arranged into a sort of directed-graph type system that allows the logical composition of characters (code-points) via various alternative key-sequences. Until we have that kind of functionality in a keyboard, these kinds of accents and diacritics are always going to be difficult to learn and manage!


referenced by

2012/10/03 The temptation to channel Grumpy Code Reviewer can be almost overwhelming!

2012/09/18 How would *you* define debacle?

2012/02/29 What's the difference between Tiếng Việt, Tiếng Việt, and Tiếng Việt? (other than the obvious, I mean)

2010/08/17 It would be like spelling it Anerica or something.

2010/06/04 Vietnam or Viet Nam or Việt Nam or ???
