Harder intermediate forms of characters

by Michael S. Kaplan, published on 2006/05/14 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/05/14/597198.aspx

In the post Getting intermediate forms, I gave an example of three character sequences that look the same and that are canonically equivalent according to Unicode:

ễ U+1ec5 LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE
ễ U+0065 U+0302 U+0303 LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT + COMBINING TILDE
ễ U+00ea U+0303 LATIN SMALL LETTER E WITH CIRCUMFLEX + COMBINING TILDE

In this case, it is easy to see that the first one is in normalization form C, the second is in normalization form D, and the third is somewhere in between.
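As a rough illustration of that claim, the following sketch uses Python's standard unicodedata module (a stand-in here, not the .NET or Win32 APIs) to check which normalization form each of the three sequences is already in. The helper name classify is made up for this example.

```python
import unicodedata

# The three canonically equivalent spellings listed above.
forms = {
    "precomposed": "\u1ec5",
    "fully decomposed": "\u0065\u0302\u0303",
    "intermediate": "\u00ea\u0303",
}

def classify(text):
    """Return the normalization forms (NFC/NFD) that text is already in."""
    labels = []
    if unicodedata.normalize("NFC", text) == text:
        labels.append("NFC")
    if unicodedata.normalize("NFD", text) == text:
        labels.append("NFD")
    return labels or ["neither"]

for name, text in forms.items():
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{name}: {points} -> {', '.join(classify(text))}")
```

The intermediate spelling is reported as "neither", since normalizing it in either direction changes the string.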

However, there are more complicated situations, such as the following:

ą́ U+0105 U+0301 LATIN SMALL LETTER A WITH OGONEK + COMBINING ACUTE ACCENT
ą́ U+00e1 U+0328 LATIN SMALL LETTER A WITH ACUTE + COMBINING OGONEK
ą́ U+0061 U+0328 U+0301 LATIN SMALL LETTER A + COMBINING OGONEK + COMBINING ACUTE ACCENT
ą́ U+0061 U+0301 U+0328 LATIN SMALL LETTER A + COMBINING ACUTE ACCENT + COMBINING OGONEK

Now it is important to note that there is no single precomposed character that captures this letter, and further the method used previously does not give hints as to which of the first two is considered normalization form C and which of the second two is normalization form D.

So short of calling String.Normalize(NormalizationForm) for both NormalizationForm.FormC and NormalizationForm.FormD, or the NormalizeString function in Win32, how do you find out? And how do these methods get their answer, anyway?
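For comparison, the "just call the normalizer" approach might look like the sketch below. It uses Python's unicodedata.normalize as an analogue of String.Normalize; the variable names are illustrative only.

```python
import unicodedata

# The four spellings of a-with-ogonek-and-acute from the list above.
sequences = [
    "\u0105\u0301",        # 1: a-ogonek + combining acute
    "\u00e1\u0328",        # 2: a-acute + combining ogonek
    "\u0061\u0328\u0301",  # 3: a + ogonek + acute
    "\u0061\u0301\u0328",  # 4: a + acute + ogonek
]

def hex_points(text):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Collect the distinct results of normalizing every spelling.
nfc_results = {hex_points(unicodedata.normalize("NFC", s)) for s in sequences}
nfd_results = {hex_points(unicodedata.normalize("NFD", s)) for s in sequences}

print("Form C:", nfc_results)
print("Form D:", nfd_results)
```

Each set ends up with a single member, which is what canonical equivalence means in practice: all four spellings collapse to one Form C string and one Form D string.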

The secret is in the canonical combining class of each Unicode code point, defined in the Unicode Character Database's UnicodeData.txt. This value is the fourth semicolon-delimited field in each of the entries below:

0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041

00E1;LATIN SMALL LETTER A WITH ACUTE;Ll;0;L;0061 0301;;;;N;LATIN SMALL LETTER A ACUTE;;00C1;;00C1

0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;LATIN SMALL LETTER A OGONEK;;0104;;0104

0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;Oxia, Tonos;;;

0328;COMBINING OGONEK;Mn;202;NSM;;;;;N;NON-SPACING OGONEK;;;;
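A minimal sketch of pulling that value out of the entries above, assuming the usual UnicodeData.txt layout in which the canonical combining class is the fourth semicolon-delimited field. The function name combining_class is hypothetical; the result is cross-checked against Python's built-in unicodedata.combining.

```python
import unicodedata

# The five UnicodeData.txt entries quoted above.
UNICODE_DATA_LINES = """\
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00E1;LATIN SMALL LETTER A WITH ACUTE;Ll;0;L;0061 0301;;;;N;LATIN SMALL LETTER A ACUTE;;00C1;;00C1
0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;LATIN SMALL LETTER A OGONEK;;0104;;0104
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;Oxia, Tonos;;;
0328;COMBINING OGONEK;Mn;202;NSM;;;;;N;NON-SPACING OGONEK;;;;
""".splitlines()

def combining_class(line):
    """Return (code point, canonical combining class) for one entry."""
    fields = line.split(";")
    return int(fields[0], 16), int(fields[3])

classes = dict(combining_class(line) for line in UNICODE_DATA_LINES)

for cp, ccc in classes.items():
    # unicodedata.combining() reports the same property from Python's own tables.
    assert unicodedata.combining(chr(cp)) == ccc
    print(f"U+{cp:04X}: combining class {ccc}")
```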

The meaning of the Canonical Combining Class values is:

Value	Description
0:	Spacing, split, enclosing, reordrant, and Tibetan subjoined
1:	Overlays and interior
7:	Nuktas
8:	Hiragana/Katakana voicing marks
9:	Viramas
10:	Start of fixed position classes
199:	End of fixed position classes
200:	Below left attached
202:	Below attached
204:	Below right attached
208:	Left attached (reordrant around single base character)
210:	Right attached
212:	Above left attached
214:	Above attached
216:	Above right attached
218:	Below left
220:	Below
222:	Below right
224:	Left (reordrant around single base character)
226:	Right
228:	Above left
230:	Above
232:	Above right
233:	Double below
234:	Double above
240:	Below (iota subscript)

From there, you can look at UAX #15: Unicode Normalization Forms to get the rules for Form C and Form D, both of which include the notion of fully decomposing the string and then re-ordering each run of characters with non-zero combining classes from lowest to highest.
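The decompose-then-reorder step can be sketched by hand as follows. This is a simplified illustration, not the full UAX #15 algorithm: it ignores Hangul syllables (which decompose algorithmically) and does not attempt the recomposition pass that Form C adds. The function names are invented for this example, and the decomposition data comes from Python's unicodedata module.

```python
import unicodedata

def full_canonical_decompose(text):
    """Recursively apply canonical decomposition mappings (no Hangul handling)."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        # Entries starting with "<" are compatibility mappings; skip those.
        if decomp and not decomp.startswith("<"):
            pieces = "".join(chr(int(cp, 16)) for cp in decomp.split())
            out.extend(full_canonical_decompose(pieces))
        else:
            out.append(ch)
    return out

def canonical_reorder(chars):
    """Stable-sort each run of non-zero combining classes, lowest first."""
    result, run = [], []
    for ch in chars:
        if unicodedata.combining(ch) == 0:
            result.extend(sorted(run, key=unicodedata.combining))
            run = []
            result.append(ch)
        else:
            run.append(ch)
    result.extend(sorted(run, key=unicodedata.combining))
    return "".join(result)

def sketch_nfd(text):
    """Hand-rolled approximation of Form D for simple cases."""
    return canonical_reorder(full_canonical_decompose(text))

for s in ["\u0105\u0301", "\u00e1\u0328", "\u0061\u0328\u0301", "\u0061\u0301\u0328"]:
    print(" ".join(f"U+{ord(c):04X}" for c in sketch_nfd(s)))
```

The sort has to be stable so that marks sharing a combining class (such as the circumflex and tilde in the earlier example, both 230) keep their original relative order.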

Since the ogonek (202) sorts before the acute (230), the fully decomposed order is a + ogonek + acute, and Form C then recomposes the a and the ogonek into U+0105. Thus looking back at the original list of four ways to express ą́, #1 is normalization form C and #3 is normalization form D, and this is true whether or not a language may have specific linguistic (e.g. phonemic or orthographic) reasons for expressing a particular preference for thinking of it as having the acute or the ogonek first.

This particular issue can cause model problems for anyone who is looking at the character from a linguistic standpoint, and it is important to not discount some of the strong feelings that people can have about how they believe the character is best represented, which can even influence decisions on keyboards to make them more accurately reflect the language.

Thankfully, both modern fonts and collation on Windows will treat all four strings as both appearing and being equal, so the fact that a user is expecting (or an input method is specifically designed for) one of the forms will not punish users of that input method, even if search engines currently will....
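For code that does plain binary comparison or substring search, one way to approximate that forgiving behavior is to normalize both sides first. The sketch below is not how Windows collation works internally; it is only a simple workaround, and the helper name contains_canonical and the sample text are made up for illustration.

```python
import unicodedata

def contains_canonical(haystack, needle):
    """Substring test that ignores differences between canonically equivalent spellings."""
    def nfc(s):
        return unicodedata.normalize("NFC", s)
    return nfc(needle) in nfc(haystack)

# Sample text stored with spelling #1, searched for using spelling #2.
stored = "s\u0105\u0301d"
typed = "\u00e1\u0328"

print(typed in stored)                    # naive code point search
print(contains_canonical(stored, typed))  # search after normalizing both sides
```

A caveat: a normalized substring match can still begin or end in the middle of a combining sequence, so a real search feature would likely need boundary checks on top of this.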

This post brought to you by " ̨" (U+0328, a.k.a. COMBINING OGONEK)

# Dan Manchester on 15 May 2006 1:16 PM:

Michael,

Thanks for the great blog. Your mention of intermediate forms reminded me of something I saw recently when working with Vietnamese text on Windows.

By way of background, I sometimes need to encode text via a legacy codepage. Word 2003's ability to do the needed conversion generally works out very well for me.

However, on the occasion in question, Word was unwilling to encode many of the accented characters--for example, an "e" with a circumflex and an acute accent--found in my Vietnamese text. I figured that these characters had to somehow be supported by codepage #1258, so I investigated further.

It turned out that the characters that Word wouldn't encode were generally pre-composed characters. However, after I manually decomposed them--for the aforementioned example, I swapped in an "e" with a circumflex and added a combining acute accent--Word produced a usable encoded version.

It seems like Word could do this decomposition itself without too much trouble. Is that a feature that has simply never been added? Or are there complexities here that I'm not considering?

# Igor Tandetnik on 15 May 2006 2:12 PM:

Re: modern fonts ... will treat all four strings as ... appearing ... equal

Actually, reading your post with IE6 on WinXP SP2, #1 and #3 look a little bit differently from #2 and #4. In the former the ogonek is correctly attached to the letter 'a'. In the latter, the ogonek is shifted one pixel to the right.

# Dean Harding on 15 May 2006 11:53 PM:

Igor: That's probably the font more than anything. I'd say that for #1 and #3, Uniscribe is finding a precomposed glyph 'a' with ogonek in the font, then "manually" putting the acute in place, while for #2 and #4 it's doing it the other way around.

And my guess is that ogonek on the precomposed glyph looks slightly different to the non-precomposed ogonek.

But that's just a guess...

