by Michael S. Kaplan, published on 2006/06/28 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/06/28/648940.aspx
Peter Constable asked some Unicode folks:
I’m just curious to know why 0f77 and 0f79 were given compatibility decompositions rather than canonical decompositions? (I don’t see any obvious reason why canonical decompositions would not have been feasible.)
(Yes, I know this can’t be changed – that’s not my objective.)
Peter Constable
And Ken Whistler stepped up with a good historical look at these two characters (which in my humble opinion deserves a more permanent location for others to see!):
0F71;TIBETAN VOWEL SIGN AA;Mn;129;NSM;;;;;N;;;;;
0F76;TIBETAN VOWEL SIGN VOCALIC R;Mn;0;NSM;0FB2 0F80;;;;N;;;;;
0F77;TIBETAN VOWEL SIGN VOCALIC RR;Mn;0;NSM;<compat> 0FB2 0F81;;;;N;;;;;
0F78;TIBETAN VOWEL SIGN VOCALIC L;Mn;0;NSM;0FB3 0F80;;;;N;;;;;
0F79;TIBETAN VOWEL SIGN VOCALIC LL;Mn;0;NSM;<compat> 0FB3 0F81;;;;N;;;;;
0F80;TIBETAN VOWEL SIGN REVERSED I;Mn;130;NSM;;;;;N;;;;;
0F81;TIBETAN VOWEL SIGN REVERSED II;Mn;0;NSM;0F71 0F80;;;;N;;;;;
0FB2;TIBETAN SUBJOINED LETTER RA;Mn;0;NSM;;;;;N;;*;;;
0FB3;TIBETAN SUBJOINED LETTER LA;Mn;0;NSM;;;;;N;;;;;
Source          NFD             NFC
0F76            0FB2 0F80       0FB2 0F80
0F77            0F77            0F77            <-- discouraged (strongly)
0FB2 0F71 0F80  0FB2 0F71 0F80  0FB2 0F71 0F80  <-- preferred
0F78            0FB3 0F80       0FB3 0F80
0F79            0F79            0F79            <-- discouraged (strongly)
0FB3 0F71 0F80  0FB3 0F71 0F80  0FB3 0F71 0F80  <-- preferred
0F80            0F80            0F80
0F81            0F71 0F80       0F71 0F80       <-- discouraged
0F71 0F80       0F71 0F80       0F71 0F80       <-- preferred
Note that the preferred forms appear in both NFD and NFC, with the decomposed form for 0F81 resulting from the non-starter exclusion and the decomposed forms for 0F76 and 0F78 resulting from explicit addition to the script-specific composition exclusions.
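As a quick way to check the rows above, here is a minimal sketch using Python's standard `unicodedata` module. The helper names `hexes` and `forms` are made up for this illustration, and the results depend on the Unicode data shipped with your Python build, although these particular mappings have been fixed since normalization was introduced.

```python
import unicodedata

def hexes(s):
    """Render a string as space-separated uppercase code points."""
    return " ".join(f"{ord(ch):04X}" for ch in s)

def forms(s):
    """Return the NFD, NFC, NFKD and NFKC forms of s as hex strings."""
    return {form: hexes(unicodedata.normalize(form, s))
            for form in ("NFD", "NFC", "NFKD", "NFKC")}

# Print the decomposition field and all four normalization forms
# for the precomposed Tibetan vowel signs discussed above.
for cp in (0x0F76, 0x0F77, 0x0F78, 0x0F79, 0x0F81):
    ch = chr(cp)
    print(f"{cp:04X}", unicodedata.decomposition(ch) or "(none)", forms(ch))
```

Only the compatibility forms (NFKD and NFKC) should touch 0F77 and 0F79; NFD and NFC should leave them alone, which is the point of the `<compat>` tag.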
If you gave 0F77 and 0F79 *canonical* decompositions, then:
0F77 --> <0FB2, 0F81> --> <0FB2, 0F71, 0F80>
0         0     0          0     129   130
0F79 --> <0FB3, 0F81> --> <0FB3, 0F71, 0F80>
0         0     0          0     129   130
Source          NFD             NFC
0F76            0FB2 0F80       0FB2 0F80
0F77            0FB2 0F71 0F80  ????            <-- discouraged (strongly)
0FB2 0F71 0F80  0FB2 0F71 0F80  0FB2 0F71 0F80  <-- preferred
0F78            0FB3 0F80       0FB3 0F80
0F79            0FB3 0F71 0F80  ????            <-- discouraged (strongly)
0FB3 0F71 0F80  0FB3 0F71 0F80  0FB3 0F71 0F80  <-- preferred
0F80            0F80            0F80
0F81            0F71 0F80       0F71 0F80       <-- discouraged
0F71 0F80       0F71 0F80       0F71 0F80       <-- preferred
Now you've made your life more difficult and normalization implementations maybe more complex. The decompositions <0FB2, 0F71, 0F80> have to be prevented from recomposing. They won't recompose partwise, because <0F71, 0F80> is blocked from recomposing, and <0FB2, 0F80> is also blocked from recomposing, but the sequence of 3 has, at least in principle, a target it should recompose to, unless blocked. Depending on how you set up your tables, you might or might not get this right, and in any case, you end up introducing the strongly discouraged characters as a source of valid sequences that you have to contend with in NFC and NFD, whereas under the current scheme you don't.
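To make the "depending on how you set up your tables" point concrete, here is a toy sketch of the pairwise canonical composition step. It is an illustration only, not how any real library builds its tables: `compose`, `CCC`, and `NAIVE_PAIRS` are hypothetical names, the combining classes are the current ones quoted above, and the naive pair table is what you might get in the hypothetical canonical-decomposition world if you forgot to apply the composition exclusions.

```python
# Current combining classes for the characters involved (from the data above).
CCC = {0x0FB2: 0, 0x0F71: 129, 0x0F80: 130,
       0x0F76: 0, 0x0F77: 0, 0x0F81: 0}

# Hypothetical: every canonical pair mapping reversed, with NO exclusions applied.
NAIVE_PAIRS = {(0x0FB2, 0x0F80): 0x0F76,
               (0x0F71, 0x0F80): 0x0F81,
               (0x0FB2, 0x0F81): 0x0F77}

def compose(cps, pairs, ccc=CCC):
    """Toy canonical composition over code points already in canonical order."""
    out = []
    starter_idx = None  # index in `out` of the most recent starter
    last_ccc = 0        # class of the last character kept after that starter
    for cp in cps:
        cls = ccc.get(cp, 0)
        if starter_idx is not None:
            adjacent = (len(out) - 1 == starter_idx)
            # Blocked if an intervening character has a class >= this one.
            blocked = (not adjacent) and last_ccc >= cls
            composite = pairs.get((out[starter_idx], cp))
            if composite is not None and not blocked:
                out[starter_idx] = composite  # fold cp into the starter
                continue
        out.append(cp)
        if cls == 0:
            starter_idx, last_ccc = len(out) - 1, 0
        else:
            last_ccc = cls

    return out

nfd = [0x0FB2, 0x0F71, 0x0F80]
print([f"{c:04X}" for c in compose(nfd, {})])           # all pairs excluded
print([f"{c:04X}" for c in compose(nfd, NAIVE_PAIRS)])  # exclusions forgotten
```

With every pair excluded, the preferred sequence survives untouched. With the naive table, this toy composer folds 0FB2 and 0F80 into 0F76 across the intervening 0F71, so a precomposed character leaks back into the output. That is one way a table setup could go wrong, under the assumptions stated here.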
Also, this was all part of a very head-breaking set of problems for Tibetan when decompositions and canonical combining classes were being reviewed for the introduction of normalization in the first place.
In Unicode 2.0, 0F77 and 0F79 *were* given canonical decompositions, but they were *different* decompositions, to wit:
0F77 = 0F76 + 0F71 = 0FB2 + 0F80 + 0F71
0F79 = 0F78 + 0F71 = 0FB3 + 0F80 + 0F71
*and* they had funky fixed position class assignments, as well:
0F77 = 0F76 + 0F71 = 0FB2 + 0F80 + 0F71 (not in canonical order)
135    134    129    6      143    129
0F79 = 0F78 + 0F71 = 0FB3 + 0F80 + 0F71 (not in canonical order)
137    136    129    6      143    129
That was clearly hosed, as it broke all kinds of rules that we were trying to establish for normalization, including ensuring that all decomposition mappings produced sequences in canonical order and ensuring, as much as was possible, given the constraints in place, that the resulting sequences would follow the logic of the script *and* that NFC forms would decompose if that was what the users of the script preferred (hence the introduction of script-specific composition exclusions for several scripts, including Tibetan).
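The "not in canonical order" remark can be checked mechanically. A sequence is in canonical order when no combining mark is immediately followed by a mark with a smaller non-zero combining class. The sketch below applies that test to the class values quoted in this post; `in_canonical_order` is a name made up for the illustration.

```python
def in_canonical_order(classes):
    """True if no class is immediately followed by a smaller non-zero class."""
    return all(not (a > b > 0) for a, b in zip(classes, classes[1:]))

# Unicode 2.0 classes quoted above for <0FB2, 0F80, 0F71>
unicode_2_0 = [6, 143, 129]
# Current classes for <0FB2, 0F71, 0F80>
current = [0, 129, 130]

print(in_canonical_order(unicode_2_0))  # 143 is followed by 129
print(in_canonical_order(current))
```

The 2.0 mapping fails the test because 0F80 (143) comes before 0F71 (129), while the current ordering passes.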
During that conversion from Unicode 2.0 to Unicode 3.0 with normalization, the UTC did the best it could with the mess for Tibetan. It was clear after the analysis that 0F77 and 0F79 should never have been encoded at all -- which was why they got those "strongly discouraged" labels -- but there was nothing to do about that mistake at that point. The compatibility decompositions were the best compromise to keep them from contaminating the normalization processing of the rest of the Tibetan vowels.
--Ken
Anyway, like I said, this seemed to me like good historical information to put out there, and certainly to help show that Every character has a story!
This post brought to you by ཷ and ཹ (U+0f77 and U+0f79, a.k.a. TIBETAN VOWEL SIGN VOCALIC RR and TIBETAN VOWEL SIGN VOCALIC LL)