Frost's The Form Not Taken

by Michael S. Kaplan, published on 2008/12/15 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/12/15/9215772.aspx

Now I won't go so far as to say that 1258 is a favorite code page, so I'll take that as sarcasm. :-)

When I read the questions here, I immediately thought of the Robert Frost poem The Road Not Taken.

Now the poem has been interpreted by others much greater than I, so I'll just go so far as to say that one popular interpretation is an ironic one -- that despite the speaker's bold proclamation at the end about how this different choice made all the difference, in truth it made very little difference, and the two paths were really pretty much equivalent.

I thought I'd twist that up a bit with Aaron's example that leads to a specific scenario where refusing to go down the road not taken can more or less kick your ass!

I'll start by pointing to a blog that comes close to helping here but in the end fails (Getting intermediate forms).

It's all well and good to talk about U+1eb7 (ặ), aka LATIN SMALL LETTER A WITH BREVE AND DOT BELOW.

If you decompose it once via the data in the Unicode Character Database, you get U+1ea1 U+0306, aka LATIN SMALL LETTER A WITH DOT BELOW + COMBINING BREVE.

And if you decompose it again, you get U+0061 U+0323 U+0306, aka LATIN SMALL LETTER A + COMBINING DOT BELOW + COMBINING BREVE.
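Those two steps can be reproduced from the Unicode Character Database as exposed by Python's standard unicodedata module. This is only a sketch for following along; decompose_once is a helper name made up for this illustration, not something that Windows or .NET provides.

```python
import unicodedata


def decompose_once(ch: str) -> str:
    """Apply a single canonical decomposition step to one character."""
    mapping = unicodedata.decomposition(ch)
    # An empty mapping means there is nothing to decompose; a leading "<tag>"
    # marks a compatibility mapping, which is left alone here.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(chr(int(cp, 16)) for cp in mapping.split())


def code_points(text: str) -> str:
    """Format a string as a list of U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


step1 = decompose_once("\u1eb7")
step2 = "".join(decompose_once(c) for c in step1)

print(code_points(step1))  # U+1EA1 U+0306
print(code_points(step2))  # U+0061 U+0323 U+0306
```

Note that neither step ever produces U+0103: the dot below is peeled off last, so the breve is always the first thing to go.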

And if you look at the code page table itself, you will see that it does have U+0103 (LATIN SMALL LETTER A WITH BREVE).

Now this is one of those harder intermediate forms -- U+0103 U+0323 (ặ), aka LATIN SMALL LETTER A WITH BREVE + COMBINING DOT BELOW. While it is definitely the road not taken by Unicode normalization, it is actually the de facto road taken by (for lack of an official term) Microsoft's "Normalization Form V", as used by its code page 1258.
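One way to see why this particular form matters is to try each spelling of the letter against the code page. The sketch below assumes that Python's built-in cp1258 codec mirrors the Windows table (it is a plain lookup table and does no normalizing of its own); encodable and CANDIDATES are names made up for this illustration.

```python
# Four spellings of the same letter, from fully composed to fully decomposed.
CANDIDATES = {
    "precomposed": "\u1eb7",                  # U+1EB7
    "dot below first": "\u1ea1\u0306",        # U+1EA1 U+0306
    "fully decomposed": "a\u0323\u0306",      # U+0061 U+0323 U+0306
    "breve first (Form V)": "\u0103\u0323",   # U+0103 U+0323
}


def encodable(text: str, codec: str = "cp1258") -> bool:
    """Return True if every character in text exists in the code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


results = {label: encodable(text) for label, text in CANDIDATES.items()}

for label, ok in results.items():
    print(f"{label}: {'ok' if ok else 'not representable'}")
```

Under that assumption, only the breve-first spelling survives the trip, because the code page has neither the precomposed letters with dot below nor a combining breve.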

Note that Unicode Normalization Forms C and KC both convert this sequence to U+1eb7, and Unicode Normalization Forms D and KD both convert it to U+0061 U+0323 U+0306.
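This is easy enough to check with Python's unicodedata.normalize, which implements the four standard forms. The variable and helper names here are just for illustration.

```python
import unicodedata


def code_points(text: str) -> str:
    """Format a string as a list of U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


form_v = "\u0103\u0323"  # LATIN SMALL LETTER A WITH BREVE + COMBINING DOT BELOW

results = {
    form: code_points(unicodedata.normalize(form, form_v))
    for form in ("NFC", "NFKC", "NFD", "NFKD")
}

for form, value in results.items():
    print(f"{form}: {value}")
# NFC: U+1EB7
# NFKC: U+1EB7
# NFD: U+0061 U+0323 U+0306
# NFKD: U+0061 U+0323 U+0306
```

In other words, every standard form moves the text away from the one spelling that code page 1258 can hold.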

Now there is no conversion built into Windows or .NET to get this form not taken that will look right using code page 1258. Though if I ever had an interview candidate who understood all about code pages, I suspect that writing a converter that could do such a job would make for a fascinating interview question!
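For what it's worth, here is a rough sketch of how such a converter might start out in Python. It is not Microsoft's algorithm, just one guess at the shape of the problem: it assumes that the five Vietnamese tone marks (grave, acute, tilde, hook above, dot below) should stay as combining characters and that every other mark should be composed onto its base letter. The names to_form_v and TONE_MARKS are made up for this illustration, and the round trip relies on Python's cp1258 codec matching the Windows table.

```python
import unicodedata

# Assumption: these are the marks to keep as separate combining characters.
TONE_MARKS = frozenset("\u0300\u0301\u0303\u0309\u0323")


def to_form_v(text: str) -> str:
    """Compose vowel-quality marks onto the base; leave tone marks combining."""
    clusters = []  # each entry is [base, other_marks, tone_marks]
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and clusters:
            slot = 2 if ch in TONE_MARKS else 1
            clusters[-1][slot] += ch
        else:
            clusters.append([ch, "", ""])
    return "".join(
        # Recompose base + breve/circumflex/horn, then append the tone marks.
        unicodedata.normalize("NFC", base + other) + tones
        for base, other, tones in clusters
    )


word = to_form_v("Vi\u1ec7t")        # input uses precomposed U+1EC7
encoded = word.encode("cp1258")      # would raise UnicodeEncodeError without the conversion
assert encoded.decode("cp1258") == word
```

A real converter would need more than this. For one thing, the sketch also splits letters like é into e + combining acute even though the code page has a precomposed é; the result is still encodable, but it may not be what other cp1258 software expects. It also makes no attempt to detect characters the code page simply cannot hold.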

So, getting back to Aaron's questions: I handled #1 above, and although there is no good specific way to do #2, I'd be very impressed by the person who wrote the code to do it!

For #3, code page 1258 is the only conventional ACP under Windows with this specific problem.

And as for #4, while technically UTF-8 (code page 65001) is unsupported by Windows Installer, as I pointed out in MSI Databases and Unicode, MSKLC was able to successfully use UTF-8 and support the setup packages for many Unicode-only languages such as Hindi and Lao and Tibetan. Which suggests that it can in fact be used for Vietnamese.

Note that there are some characters that can't be represented by the code page even if you do manage to create your own implementation of the so-called Microsoft Normalization Form V; thus UTF-8 is really the only option that can support the Vietnamese language itself, in the long run.

And although using UTF-8 here will make that conversion code unnecessary, I'd still want to hire the person who came up with an elegant code solution there. :-)

Thinking of all the work that fonts do to support the fictional Form V as well as the other, more valid and less valid forms, it is unfortunate that this support was never made more widespread so that it would be easier to support languages like Vietnamese while waiting for everyone like MSI and others to move to Unicode.

Now Windows code page 1258 is clearly an example where the road not taken may well look the same in fonts and rendering. But code pages, and components that do not use Unicode and/or do not normalize, will see the road not taken as one with very, very tall weeds blocking the way of folks like Aaron, whose actual work might not afford them the ironic detachment of Robert Frost when it turns out that the two paths aren't the same....

This blog brought to you by ặ (U+1eb7, aka LATIN SMALL LETTER A WITH BREVE AND DOT BELOW)

There's a Vietnamese-specific logic to CP 1258 that transcends the arbitrary Unicode normalization rules. The breve, circumflex, and horn accents, unlike the rest, affect vowel quality. If you look at a Vietnamese alphabet like the one at Wikipedia, you'll see that A WITH BREVE, A WITH CIRCUMFLEX, E WITH CIRCUMFLEX, O WITH CIRCUMFLEX, O WITH HORN, and U WITH HORN (as well as D WITH STROKE, which isn't Unicode-decomposable) are considered separate letters from their unaccented correspondents. Consequently, in 1258 they are encoded using seven precomposed letters (each in upper and lower case).

On the other hand, the grave, acute, hook above, tilde, and dot below accents are tone marks, conceptually not part of the letters they appear on. They're encoded using combining characters, since encoding them using precomposed characters would create a combinatorial explosion of 12 x 6 x 2 = 144 distinct vowel characters. (The VISCII encoding actually does that, at the expense of filling the whole 0x80-0xFF space with letters and even usurping six of the control characters!)
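The arithmetic can be spelled out with a short Python sketch. It assumes the 12 in that product is the twelve Vietnamese vowel letters, the 6 is the five tone marks plus the unmarked form, and the 2 is upper and lower case; the names VOWELS, TONES, and vietnamese_vowel_forms are made up for this illustration.

```python
import unicodedata
from itertools import product

# The twelve vowel letters: a, a-breve, a-circumflex, e, e-circumflex, i,
# o, o-circumflex, o-horn, u, u-horn, y.
VOWELS = "a\u0103\u00e2e\u00eaio\u00f4\u01a1u\u01b0y"

# No tone mark, grave, acute, hook above, tilde, dot below.
TONES = ["", "\u0300", "\u0301", "\u0309", "\u0303", "\u0323"]


def vietnamese_vowel_forms() -> set:
    """Build every vowel + tone combination in both cases, NFC-composed."""
    forms = set()
    for vowel, tone in product(VOWELS, TONES):
        lower = unicodedata.normalize("NFC", vowel + tone)
        forms.add(lower)
        forms.add(lower.upper())
    return forms


forms = vietnamese_vowel_forms()
print(len(forms))                        # 144
print(all(len(f) == 1 for f in forms))   # True: each has a precomposed code point
```

Unicode does have a single precomposed code point for every one of the 144, which is exactly the explosion that a 256-slot code page has no room for.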

Unsurprisingly, Vietnamese conventions always place the tone mark outside any breve, circumflex, or horn diacritic (and therefore following it according to Unicode rules). The only place in which this causes a problem is the dot below, because Unicode arbitrarily wants all diacritics below to come before all diacritics above.
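The ordering rule in question comes from each mark's canonical combining class: normalization sorts adjacent marks by class, lowest first. A small Python sketch using the standard unicodedata module shows the classes involved and the resulting reordering; the variable names are only for illustration.

```python
import unicodedata

marks = ["\u031b", "\u0323", "\u0306", "\u0302", "\u0301"]

# Map each mark's Unicode name to its canonical combining class.
classes = {unicodedata.name(m): unicodedata.combining(m) for m in marks}
for name, ccc in classes.items():
    print(f"{ccc:3d}  {name}")
# 216  COMBINING HORN
# 220  COMBINING DOT BELOW
# 230  COMBINING BREVE
# 230  COMBINING CIRCUMFLEX ACCENT
# 230  COMBINING ACUTE ACCENT

# Breve then dot below: the dot (220) is moved in front of the breve (230).
below = unicodedata.normalize("NFD", "a\u0306\u0323")
# Breve then acute: both are class 230, so the order is preserved.
above = unicodedata.normalize("NFD", "a\u0306\u0301")

print(below == "a\u0323\u0306")  # True
print(above == "a\u0306\u0301")  # True
```

So the four tone marks that sit above the letter keep their place after the breve or circumflex, and only the dot below gets shuffled ahead.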

(ObTooLateNow: IMHO the horn diacritic shouldn't have been encoded separately in Unicode. It's not used anywhere but in Vietnamese, can only appear on o and u, and (like ogonek and cedilla, but unlike most other combining diacritics) always touches the letter that it's associated with. Using undecomposable characters would have cost only 3 codepoints.)