Getting intermediate forms
by Michael S. Kaplan, published on 2005/07/09 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/09/437063.aspx
Unicode has a certain complexity to it that can at times be challenging.
Let's take for example U+1ec5, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE. Here is what it looks like (how good will depend on your OS and browser support!):
ễ
Now obviously that is pretty fully precomposed (in Unicode Normalization Form C). If it is fully decomposed, we get U+0065 U+0302 U+0303, a.k.a. LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT + COMBINING TILDE. Here is what it looks like (again, how good will depend on your OS and browser support!):
ễ
And here is where the problems come in, because between these two extremes lies a third case: U+00ea U+0303, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX + COMBINING TILDE. Here is what it looks like (again, how good will depend on your OS and browser support!):
ễ
Now if you convert that third case to NFC you will get the first case, and to NFD you will get the second. How does that happen?
Well, the rules for normalization are that you have to keep on performing the composition or decomposition until you can't anymore.
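As a rough illustration of that "keep going until you can't" behavior, here is a minimal sketch in Python, using the standard library's unicodedata module. The variable names are made up for this example; the only real API used is unicodedata.normalize. It simply checks that all three spellings collapse to one NFC string and one NFD string.

```python
import unicodedata

# The three canonically equivalent spellings discussed above.
composed = "\u1ec5"                # fully precomposed
partial = "\u00ea\u0303"           # e-circumflex + combining tilde
decomposed = "\u0065\u0302\u0303"  # e + combining circumflex + combining tilde

forms = [composed, partial, decomposed]

# Normalizing each spelling should leave exactly one distinct result per form.
nfc_results = {unicodedata.normalize("NFC", s) for s in forms}
nfd_results = {unicodedata.normalize("NFD", s) for s in forms}

def code_points(s):
    """Render a string as a list of U+XXXX labels for display."""
    return " ".join(f"U+{ord(ch):04x}" for ch in s)

for result in nfc_results:
    print("NFC:", code_points(result))
for result in nfd_results:
    print("NFD:", code_points(result))
```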
So, there are two ways to get the information of that last case:
- You can cart around the decomposition info from the Unicode Character Database so you can get it all yourself.
- You can take the NFD string and start converting to NFC with one additional character at a time, thus:
Step 1: Convert the string to NFD; we now have: U+0065 U+0302 U+0303
Step 2: U+0065 + U+0302 to NFC == U+00ea; we now also have U+00ea U+0303
Step 3: U+00ea + U+0303 to NFC == U+1ec5; we now also have U+1ec5
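The steps above can be sketched in a few lines of Python. This is only a quick and dirty illustration: intermediate_forms is a hypothetical helper name, and it assumes the input is a single base letter plus its combining marks. It decomposes fully, then composes a prefix that grows by one character each time and keeps whatever new spelling falls out.

```python
import unicodedata

def intermediate_forms(text):
    """Hypothetical helper: collect equivalent spellings between NFD and NFC.

    Assumes `text` is one base letter with its combining marks.
    """
    # Step 1: start from the fully decomposed string.
    nfd = unicodedata.normalize("NFD", text)
    forms = [nfd]
    # Steps 2..n: compose a prefix one character longer each time,
    # leaving the rest of the decomposed string untouched.
    for end in range(2, len(nfd) + 1):
        candidate = unicodedata.normalize("NFC", nfd[:end]) + nfd[end:]
        if candidate != forms[-1]:
            forms.append(candidate)
    return forms

for form in intermediate_forms("\u1ec5"):
    print(" ".join(f"U+{ord(ch):04x}" for ch in form))
```

For U+1ec5 this walks through the same three spellings as the steps: the fully decomposed one, the partially composed one, and the fully composed one.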
Now this is not what I would call a perfect algorithm by any stretch of the imagination. But it is a quick and dirty way to get the information on a bunch of equal forms.
But it certainly leaves open the question of whether the operating system and/or the .NET Framework should expose this information at some point....
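For comparison, some platforms do hand out the raw data. The first option (carting around the decomposition info) can be sketched in Python, whose unicodedata.decomposition returns the Unicode Character Database mapping for one character. The helper name decomposition_chain is made up here, and the sketch makes two simplifying assumptions: only the leading character is ever expanded, and no canonical reordering is applied, so treat it as an illustration rather than a normalizer.

```python
import unicodedata

def decomposition_chain(ch):
    """Hypothetical helper: peel one canonical decomposition at a time.

    Assumption: only the first character of the current string is expanded,
    and canonical reordering of marks is not performed.
    """
    chain = [ch]
    current = ch
    while True:
        mapping = unicodedata.decomposition(current[0])
        # An empty mapping means nothing left to decompose; a mapping that
        # starts with "<" is a compatibility mapping, which is skipped here.
        if not mapping or mapping.startswith("<"):
            break
        expanded = "".join(chr(int(cp, 16)) for cp in mapping.split())
        current = expanded + current[1:]
        chain.append(current)
    return chain

for form in decomposition_chain("\u1ec5"):
    print(" ".join(f"U+{ord(c):04x}" for c in form))
```

This goes in the opposite direction from the step-by-step approach, starting at the precomposed character and ending at the fully decomposed string, with the middle case showing up in between.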
This post brought to you by "ễ" (U+1ec5, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE)
# concerned viewer on 9 Jul 2005 4:00 AM:
Just a note to say you've gone from U+1ec5 to U+1e35 - which is "ḵ" (LATIN SMALL LETTER K WITH LINE BELOW), viewable in Tahoma on this box.
# CornedBee on 9 Jul 2005 5:27 AM:
I'm pleasantly surprised to say that my system displays two of the three forms correctly, considering how outlandish that character is.
I wonder about one thing: if you reverse the fully decomposed form, that is, put the tilde before the circumflex, should it be displayed below it? Does Unicode have rules for the ordering of these combiners or do they just stack up as they come?
# Michael Kaplan on 9 Jul 2005 5:46 AM:
Two points, CornedBee -- This character is used in Vietnamese. :-) If you change the order of the diacritics then you are actually changing the order of the stacking (if they have the same canonical combining class). However, normalization will never violate canonical combining classes, so the reorder would always be undone otherwise....
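A small sketch of the combining class point, in Python with the standard unicodedata module (the constant and variable names are just for this example). The circumflex and tilde share a canonical combining class, so normalization keeps whichever order you typed; a dot below has a different class, so normalization puts it into one fixed position relative to the circumflex no matter how it was entered.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
TILDE = "\u0303"       # COMBINING TILDE
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW

# Canonical combining classes as reported by the Unicode Character Database.
classes = {
    "circumflex": unicodedata.combining(CIRCUMFLEX),
    "tilde": unicodedata.combining(TILDE),
    "dot below": unicodedata.combining(DOT_BELOW),
}
print(classes)

# Same class: the two orders stay distinct after normalization.
same_class_a = unicodedata.normalize("NFD", "e" + CIRCUMFLEX + TILDE)
same_class_b = unicodedata.normalize("NFD", "e" + TILDE + CIRCUMFLEX)
print(same_class_a == same_class_b)

# Different classes: both orders normalize to the same sequence.
diff_class_a = unicodedata.normalize("NFD", "e" + CIRCUMFLEX + DOT_BELOW)
diff_class_b = unicodedata.normalize("NFD", "e" + DOT_BELOW + CIRCUMFLEX)
print(diff_class_a == diff_class_b)
```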
# Michael S. Kaplan on 9 Jul 2005 5:51 AM:
Thanks, cv -- got that fixed. I don't know what I was thinking *there*. :-)
# Michael S. Kaplan on 9 Jul 2005 6:57 AM:
Two of the three? IE does all three, flawlessly. :-)
ễễễ
Heh...
# Suzanne McCarthy on 9 Jul 2005 2:58 PM:
I have found to my chagrin that blogger fails on all three of the above. I knew this after posting on Vietnamese but now it has been rubbed in.
# Mihai on 9 Jul 2005 4:32 PM:
1. To answer your question, I think the OS and .NET should expose this info. Maybe not all the intermediate forms, but definitely the conversion to NFC/NFD.
2. All looks fine in FireFox too, although the NFD has the distance between the circumflex and the tilde a bit too big.
# A. Skrobov on 9 Jul 2005 7:15 PM:
> Two of the three? IE does all three, flawlessly. :-)
Just out of curiosity, I've tried the page with a clean Win2000 SP4 installation (that is, IE 5.0): all the three characters are displayed identically.
I can't imagine how old a system must be to mess them up.
Oh, and BTW the middle character (3-part one) looks somewhat different than others in Notepad, regardless of the used font... What's wrong with that?
# Dean Harding on 10 Jul 2005 7:39 PM:
Yeah, it's Firefox's non-use of Uniscribe again. :)
It displays the first and (oddly) the last one right, but the second one has the circumflex off to the right a bit (though the tilde is right, for some reason).
Your three together in that last comment is completely off, though!
# Dean Harding on 10 Jul 2005 9:24 PM:
> Oh, and BTW the middle character (3-part one) looks somewhat different than others in Notepad, regardless of the used font... What's wrong with that?
It's because you need to use Uniscribe to get proper compositing of diacritical marks (and complex scripts in general). I believe in Windows XP (and possibly Win2k as well) ExtTextOut does support some of the features of a full Uniscribe implementation (which is why I assume #3 looks right), but for full support (as seen in Internet Explorer) you need to use Uniscribe.
Before reading Michael's blog, I'd never have known all that - yay Michael!!
# Mihai on 10 Jul 2005 9:45 PM:
I have checked the page again, this time on XP, but still FireFox. Now everything is fine.
I guess it is not FireFox, after all :-)
# Michael Dunn_ on 11 Jul 2005 3:07 AM:
This letter isn't esoteric at all, it's in the name Nguyễn, for example.
And this makes me think - Vietnamese must be a great testbed for stuff like Uniscribe, what with all the diacritics, some combining, some not (ễ is actually ê with a tilde tone marker - e and ê are different letters but e and ẽ are the same letter with different tone markers).
# Jonathan on 11 Jul 2005 3:52 AM:
A. Skrobov: notepad works fine for me with Arial. In other fonts (Courier New, Times New Roman) the middle one doesn't exist, so Notepad falls back to Arial, thus the char looks different.
Oh, and IE shows all 3 chars identically.
(XP SP2, with Complex scripts and East-asian language enabled)
# Vorn on 12 Jul 2005 1:54 PM:
ễ <--- totally composed character displays fine in Safari
ễ <--- totally decomposed character displays correctly in the text entry box, but NOT in HTML in Safari, instead showing e-circumflex and then a wtf-box
ễ <--- partially composed character also displays correctly in text entry box but not in html in Safari, again showing e-circumflex and a wtf-box.
Weird.
Vorn