by Michael S. Kaplan, published on 2006/05/14 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/05/14/597198.aspx
In the post Getting intermediate forms, I gave an example of three character sequences that look the same and that are canonically equivalent according to Unicode:
In this case, it is easy to see that the first one is in normalization form C, the second is in normalization form D, and the third is somewhere in between.
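However, there are more complicated situations, such as the following four ways to express ą́:
#1: U+0105 U+0301 (LATIN SMALL LETTER A WITH OGONEK, COMBINING ACUTE ACCENT)
#2: U+00E1 U+0328 (LATIN SMALL LETTER A WITH ACUTE, COMBINING OGONEK)
#3: U+0061 U+0328 U+0301 (LATIN SMALL LETTER A, COMBINING OGONEK, COMBINING ACUTE ACCENT)
#4: U+0061 U+0301 U+0328 (LATIN SMALL LETTER A, COMBINING ACUTE ACCENT, COMBINING OGONEK)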
Now it is important to note that there is no single precomposed character that captures this letter, and further the method used previously does not give hints as to which of the first two is considered normalization form C and which of the second two is normalization form D.
So short of calling String.Normalize(NormalizationForm) for both NormalizationForm.FormC and NormalizationForm.FormD, or the NormalizeString function in Win32, how do you find out? And how do these methods get their answer, anyway?
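The brute-force check described above can be sketched in Python, whose standard library exposes unicodedata.normalize; it is used here only as a stand-in for String.Normalize and NormalizeString, not as a statement of how those are implemented. The candidates list, its order, and the code_points helper are assumptions made for this illustration.

```python
import unicodedata

# Four spellings of a-with-ogonek-and-acute. The order of this list is an
# assumption made for the sketch.
candidates = [
    "\u0105\u0301",   # a with ogonek, then combining acute
    "\u00E1\u0328",   # a with acute, then combining ogonek
    "a\u0328\u0301",  # a, combining ogonek, combining acute
    "a\u0301\u0328",  # a, combining acute, combining ogonek
]


def code_points(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


# Normalize every spelling to both forms and collect the distinct results.
nfc_forms = {unicodedata.normalize("NFC", s) for s in candidates}
nfd_forms = {unicodedata.normalize("NFD", s) for s in candidates}

for s in candidates:
    print(
        code_points(s),
        "| NFC:", code_points(unicodedata.normalize("NFC", s)),
        "| NFD:", code_points(unicodedata.normalize("NFD", s)),
    )
```

Because the four spellings are canonically equivalent, each set ends up with exactly one member. That tells you what the answer is, but not why.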
The secret is in the canonical combining class of each Unicode code point, defined in the Unicode Character Database's UnicodeData.txt. This value is the fourth semicolon-delimited field in each entry below (0, 0, 0, 230, and 202 respectively):
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00E1;LATIN SMALL LETTER A WITH ACUTE;Ll;0;L;0061 0301;;;;N;LATIN SMALL LETTER A ACUTE;;00C1;;00C1
0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;LATIN SMALL LETTER A OGONEK;;0104;;0104
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;Oxia, Tonos;;;
0328;COMBINING OGONEK;Mn;202;NSM;;;;;N;NON-SPACING OGONEK;;;;
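As a rough sketch of reading that field programmatically, the Python below splits each entry on semicolons and takes the fourth field. The names UNICODE_DATA_SAMPLE and parse_combining_classes are hypothetical, and the sample holds only the five entries quoted above rather than the full file. The final loop cross-checks the parsed values against Python's own unicodedata.combining.

```python
import unicodedata

# The five UnicodeData.txt entries quoted above (not the full file).
UNICODE_DATA_SAMPLE = """\
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00E1;LATIN SMALL LETTER A WITH ACUTE;Ll;0;L;0061 0301;;;;N;LATIN SMALL LETTER A ACUTE;;00C1;;00C1
0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;LATIN SMALL LETTER A OGONEK;;0104;;0104
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;Oxia, Tonos;;;
0328;COMBINING OGONEK;Mn;202;NSM;;;;;N;NON-SPACING OGONEK;;;;
"""


def parse_combining_classes(text):
    """Map each code point to (character name, canonical combining class)."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)   # field 0: code point in hex
        name = fields[1]                  # field 1: character name
        ccc = int(fields[3])              # field 3: canonical combining class
        result[code_point] = (name, ccc)
    return result


classes = parse_combining_classes(UNICODE_DATA_SAMPLE)
for cp, (name, ccc) in sorted(classes.items()):
    # Compare the parsed value with the one built into Python's database.
    builtin = unicodedata.combining(chr(cp))
    print(f"U+{cp:04X} ccc={ccc:3d} builtin={builtin:3d} {name}")
```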
The meaning of the Canonical Combining Class values is:
Value: Description
0: Spacing, split, enclosing, reordrant, and Tibetan subjoined
1: Overlays and interior
7: Nuktas
8: Hiragana/Katakana voicing marks
9: Viramas
10: Start of fixed position classes
199: End of fixed position classes
200: Below left attached
202: Below attached
204: Below right attached
208: Left attached (reordrant around single base character)
210: Right attached
212: Above left attached
214: Above attached
216: Above right attached
218: Below left
220: Below
222: Below right
224: Left (reordrant around single base character)
226: Right
228: Above left
230: Above
232: Above right
233: Double below
234: Double above
240: Below (iota subscript)
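From there, you can look at UAX #15: Unicode Normalization Forms to get the rules for Form C and Form D. Both start by fully decomposing the string and then reordering each run of characters with non-zero combining classes from lowest to highest; Form C then recomposes the result wherever a precomposed character exists. Here the ogonek (202) sorts ahead of the acute (230), so in Form D the ogonek comes first, and in Form C it is the ogonek that gets composed with the base letter.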
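The decompose-then-reorder step can be sketched as follows. This is a simplified illustration, not the full UAX #15 algorithm: it does not handle the algorithmic decomposition of Hangul syllables, and it omits the recomposition pass that Form C adds. The function names fully_decompose, canonical_order, and nfd_sketch are hypothetical.

```python
import unicodedata


def fully_decompose(s):
    """Recursively apply canonical decomposition mappings, without reordering."""
    out = []
    for ch in s:
        mapping = unicodedata.decomposition(ch)
        # Compatibility mappings carry a tag such as "<compat>"; leave those alone.
        if mapping and not mapping.startswith("<"):
            parts = "".join(chr(int(cp, 16)) for cp in mapping.split())
            out.append(fully_decompose(parts))
        else:
            out.append(ch)
    return "".join(out)


def canonical_order(s):
    """Sort each run of non-zero combining classes from lowest to highest."""
    result = []
    run = []  # pending combining marks since the last class-0 character
    for ch in s:
        if unicodedata.combining(ch) == 0:
            # A class-0 character ends the current run. sorted() is stable,
            # so marks with equal classes keep their relative order.
            result.extend(sorted(run, key=unicodedata.combining))
            run = []
            result.append(ch)
        else:
            run.append(ch)
    result.extend(sorted(run, key=unicodedata.combining))
    return "".join(result)


def nfd_sketch(s):
    """Approximate Form D: full canonical decomposition, then reordering."""
    return canonical_order(fully_decompose(s))


# a with acute followed by combining ogonek: the ogonek (202) moves ahead
# of the acute (230) once everything is decomposed.
sample = "\u00E1\u0328"
print(" ".join(f"U+{ord(ch):04X}" for ch in nfd_sketch(sample)))
print(nfd_sketch(sample) == unicodedata.normalize("NFD", sample))
```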
Thus looking back at the original list of four ways to express ą́, #1 is normalization form C and #3 is normalization form D, and this is true whether or not a language may have specific linguistic (e.g. phonemic or orthographic) reasons for expressing a particular preference for thinking of it as having the acute or the ogonek first.
This particular issue can cause model problems for anyone who is looking at the character from a linguistic standpoint, and it is important to not discount some of the strong feelings that people can have about how they believe the character is best represented, which can even influence decisions on keyboards to make them more accurately reflect the language.
Thankfully, both modern fonts and collation on Windows will treat all four strings as both appearing and being equal, so the fact that a user is expecting (or an input method is specifically designed for) one of the forms will not punish users of that input method, even if search engines currently will....
This post brought to you by " ̨" (U+0328, a.k.a. COMBINING OGONEK)