Getting intermediate forms

by Michael S. Kaplan, published on 2005/07/09 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/09/437063.aspx

Let's take for example U+1ec5, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE. Here is what it looks like (how good will depend on your OS and browser support!):

Now obviously that is pretty fully precomposed (in Unicode Normalization Form C). If it is fully decomposed, we get U+0065 U+0302 U+0303, a.k.a. LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT + COMBINING TILDE. Here is what it looks like (again, how good will depend on your OS and browser support!):

And here is where the problems come in, because between these two extremes lies a third case: U+00ea U+0303, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX + COMBINING TILDE. Here is what it looks like (again, how good will depend on your OS and browser support!):

Now if you convert that third case to NFC you will get the first case, and to NFD you will get the second. How does that happen?
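The claim that all three spellings converge can be checked directly. The following is a minimal sketch in Python, using the standard library's unicodedata module rather than any Windows or .NET API; the variable names are purely illustrative.

```python
import unicodedata

# The three canonically equivalent spellings discussed above.
composed = "\u1ec5"          # LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE
partial = "\u00ea\u0303"     # E WITH CIRCUMFLEX + COMBINING TILDE
decomposed = "e\u0302\u0303" # E + COMBINING CIRCUMFLEX ACCENT + COMBINING TILDE


def code_points(text):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for label, text in (("composed", composed),
                    ("partial", partial),
                    ("decomposed", decomposed)):
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    print(f"{label:>10}: NFC = {code_points(nfc)}  NFD = {code_points(nfd)}")
```

Whichever spelling goes in, NFC comes back as the single code point and NFD as the three-code-point sequence.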

Well, the rules for normalization are that you have to keep on performing the composition or decomposition until you can't anymore.

Now this is not what I would call a perfect algorithm by any stretch of the imagination. But it is a quick and dirty way to get the information on a bunch of equal forms.
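One possible way to collect such a set of equal forms is sketched below in Python. This is not the author's code: the helper names canonical_decomposition and intermediate_forms are hypothetical, and the approach (start from NFC and apply one canonical decomposition step at a time) is just an assumption about what "quick and dirty" could look like. It never reorders combining marks, it skips Hangul syllables (whose decompositions are algorithmic and not listed in the character database), and it will miss partially composed spellings that cannot be reached by decomposing the NFC form step by step.

```python
import unicodedata
from collections import deque


def canonical_decomposition(ch):
    """Return the one-step canonical decomposition of ch, or None if it has none."""
    raw = unicodedata.decomposition(ch)
    # An empty string means no decomposition; a leading "<tag>" marks a
    # compatibility decomposition, which is not a canonical equivalent.
    if not raw or raw.startswith("<"):
        return None
    return "".join(chr(int(cp, 16)) for cp in raw.split())


def intermediate_forms(text):
    """Collect spellings reachable from NFC by single decomposition steps."""
    start = unicodedata.normalize("NFC", text)
    seen = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for i, ch in enumerate(current):
            step = canonical_decomposition(ch)
            if step is None:
                continue
            candidate = current[:i] + step + current[i + 1:]
            if candidate not in seen:
                seen.append(candidate)
                queue.append(candidate)
    return seen


forms = intermediate_forms("\u1ec5")
for form in forms:
    print(" ".join(f"U+{ord(ch):04X}" for ch in form))
```

For U+1ec5 this walks from the precomposed character through the U+00ea U+0303 case down to the fully decomposed sequence.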

But it certainly leaves open the question of whether the operating system and/or the .NET Framework should expose this information at some point....

This post brought to you by "ễ" (U+1ec5, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE)

I'm pleasantly surprised to say that my system displays two of the three forms correctly, considering how outlandish that character is.

I wonder about one thing: if you reverse the fully decomposed form, that is, put the tilde before the circumflex, should it be displayed below it? Does Unicode have rules for the ordering of these combiners or do they just stack up as they come?

Two points, CornedBee -- This character is used in Vietnamese. :-) If you change the order of the diacritics then you are actually changing the order of the stacking (if they have the same canonical combining class). However, normalization will never violate canonical combining classes, so the reorder would always be undone otherwise....
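The point about canonical combining classes can be illustrated with a short Python sketch using unicodedata.combining. The choice of COMBINING DOT BELOW (U+0323) as a mark from a different class is an example picked here, not something from the discussion above.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT, class 230 (above)
TILDE = "\u0303"       # COMBINING TILDE, class 230 (above)
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, class 220 (below)

for mark in (CIRCUMFLEX, TILDE, DOT_BELOW):
    print(f"U+{ord(mark):04X} has combining class {unicodedata.combining(mark)}")

# Same class: the typed order is significant, so normalization keeps it.
swapped = "e" + TILDE + CIRCUMFLEX
nfd_swapped = unicodedata.normalize("NFD", swapped)
nfc_swapped = unicodedata.normalize("NFC", swapped)

# Different classes: normalization sorts the marks by class, so the
# typed order is undone and the dot below moves in front.
mixed = "e" + CIRCUMFLEX + DOT_BELOW
nfd_mixed = unicodedata.normalize("NFD", mixed)

print(nfd_swapped == swapped)                    # order preserved
print(nfc_swapped == "\u1ec5")                   # not the same character
print(nfd_mixed == "e" + DOT_BELOW + CIRCUMFLEX) # reordered by class
```

So tilde-then-circumflex is a genuinely different sequence from circumflex-then-tilde, while marks from different classes end up in one canonical order no matter how they were typed.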

I have found to my chagrin that blogger fails on all three of the above. I knew this after posting on Vietnamese but now it has been rubbed in.

1. To answer your question, I think the OS and .NET should expose this info. Maybe not all the intermediate forms, but definitely the conversion to NFC/NFD.

2. All looks fine in Firefox too, although the NFD has the distance between the circumflex and the tilde a bit too big.

> Two of the three? IE does all three, flawlessly. :-)

Just out of curiosity, I've tried the page with a clean Win2000 SP4 installation (that is, IE 5.0): all the three characters are displayed identically.
I can't imagine how old a system must be to mess them up.

Oh, and BTW the middle character (3-part one) looks somewhat different than others in Notepad, regardless of the used font... What's wrong with that?

Yeah, it's Firefox's non-use of Uniscribe again. :)

It displays the first and (oddly) the last one right, but the second one has the circumflex off to the right a bit (though the tilde is right, for some reason).

Your three together in that last comment is completely off, though!

> Oh, and BTW the middle character (3-part one) looks somewhat different than others in Notepad, regardless of the used font... What's wrong with that?

It's because you need to use Uniscribe to get proper compositing of diacritical marks (and complex scripts in general). I believe in Windows XP (and possibly Win2k as well) ExtTextOut does support some of the features of a full Uniscribe implementation (which is why I assume #3 looks right), but for full support (as seen in Internet Explorer) you need to use Uniscribe.

Before reading Michael's blog, I'd never have known all that - yay Michael!!

This letter isn't esoteric at all, it's in the name Nguyễn, for example.
And this makes me think - Vietnamese must be a great testbed for stuff like Uniscribe, what with all the diacritics, some combining, some not (ễ is actually ê with a tilde tone marker - e and ê are different letters but e and ẽ are the same letter with different tone markers).

A. Skrobov: notepad works fine for me with Arial. In other fonts (Courier New, Times New Roman) the middle one doesn't exist, so Notepad falls back to Arial, thus the char looks different.

Oh, and IE shows all 3 chars identically.

(XP SP2, with Complex scripts and East-asian language enabled)

ễ <--- totally composed character displays fine in Safari
ễ <--- totally decomposed character displays correctly in the text entry box, but NOT in HTML in Safari, instead showing e-circumflex and then a wtf-box
ễ <--- partially composed character also displays correctly in text entry box but not in html in Safari, again showing e-circumflex and a wtf-box.

Weird.

Vorn


2011/04/23 Solution: The Dead Keys Conundrum: An Encyclopedia Brown Mystery

2010/01/12 On my "Vietnamese Plus" and "pseudo-Form V" constructs

2009/05/27 The whole truth about MB_PRECOMPOSED and MB_COMPOSITE

2008/12/15 Frost's The Form Not Taken

2008/03/26 Vietnamese still ain't quite right

2007/01/31 A year later, and the Vietnamese keyboard isn't any better

2006/05/14 Harder intermediate forms of characters

2005/12/03 When even the bugs seem cool

2005/11/11 What to do with the Vietnamese keyboard on Windows?