Every character has a story #21: U+0f77 U+0f79 (TIBETAN VOWEL SIGN VOCALIC [RR|LL])

by Michael S. Kaplan, published on 2006/06/28 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/06/28/648940.aspx


Peter Constable asked some Unicode folks:

I’m just curious to know why 0f77 and 0f79 were given compatibility decompositions rather than canonical decompositions? (I don’t see any obvious reason why canonical decompositions would not have been feasible.)

(Yes, I know this can’t be changed – that’s not my objective.)

Peter Constable

And Ken Whistler stepped up with a good historical look at these two characters (which in my humble opinion deserves a more permanent location for others to see!):

0F71;TIBETAN VOWEL SIGN AA;Mn;129;NSM;;;;;N;;;;;

0F76;TIBETAN VOWEL SIGN VOCALIC R;Mn;0;NSM;0FB2 0F80;;;;N;;;;;
0F77;TIBETAN VOWEL SIGN VOCALIC RR;Mn;0;NSM;<compat> 0FB2 0F81;;;;N;;;;;
0F78;TIBETAN VOWEL SIGN VOCALIC L;Mn;0;NSM;0FB3 0F80;;;;N;;;;;
0F79;TIBETAN VOWEL SIGN VOCALIC LL;Mn;0;NSM;<compat> 0FB3 0F81;;;;N;;;;;
0F80;TIBETAN VOWEL SIGN REVERSED I;Mn;130;NSM;;;;;N;;;;;
0F81;TIBETAN VOWEL SIGN REVERSED II;Mn;0;NSM;0F71 0F80;;;;N;;;;;

0FB2;TIBETAN SUBJOINED LETTER RA;Mn;0;NSM;;;;;N;;*;;;
0FB3;TIBETAN SUBJOINED LETTER LA;Mn;0;NSM;;;;;N;;;;;
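
The two fields that matter in the lines above are the canonical combining class (the fourth field) and the decomposition mapping (the sixth). As a minimal sketch, assuming a Python interpreter whose bundled Unicode data includes these Tibetan characters, the same values can be read back with the standard `unicodedata` module. The helper name `describe` is made up for this illustration.

```python
import unicodedata

# The code points listed in the UnicodeData excerpt above.
CODE_POINTS = [0x0F71, 0x0F76, 0x0F77, 0x0F78, 0x0F79,
               0x0F80, 0x0F81, 0x0FB2, 0x0FB3]

def describe(cp):
    """Return (hex code, name, combining class, decomposition mapping)."""
    ch = chr(cp)
    return (f"{cp:04X}",
            unicodedata.name(ch),
            unicodedata.combining(ch),       # canonical combining class
            unicodedata.decomposition(ch))   # '' when there is no mapping

for cp in CODE_POINTS:
    print(describe(cp))
```

A mapping that starts with `<compat>` is a compatibility decomposition; one with no tag is canonical.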

 
                NFD             NFC
0F76            0FB2 0F80       0FB2 0F80
0F77            0F77            0F77            <-- discouraged (strongly)
0FB2 0F71 0F80  0FB2 0F71 0F80  0FB2 0F71 0F80  <-- preferred
0F78            0FB3 0F80       0FB3 0F80
0F79            0F79            0F79            <-- discouraged (strongly)
0FB3 0F71 0F80  0FB3 0F71 0F80  0FB3 0F71 0F80  <-- preferred
0F80            0F80            0F80
0F81            0F71 0F80       0F71 0F80       <-- discouraged
0F71 0F80       0F71 0F80       0F71 0F80       <-- preferred

Note that the preferred forms appear in both NFD and NFC, with the decomposed form for 0F81 resulting from the non-starter exclusion and the decomposed forms for 0F76 and 0F78 resulting from explicit addition to the script-specific composition exclusions.
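
The table above can be checked against a real normalizer. The following is a rough sketch using Python's standard `unicodedata.normalize`. The helper names `hexes` and `normalization_row` are invented for this example, and the result depends on the Unicode data shipped with the interpreter.

```python
import unicodedata

def hexes(text):
    """Format a string as space-separated 4-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

def normalization_row(source):
    """Return the (NFD, NFC) forms of source, formatted as hex."""
    return tuple(hexes(unicodedata.normalize(form, source))
                 for form in ("NFD", "NFC"))

# The source column of the table, in the same order.
SOURCES = ["\u0f76", "\u0f77", "\u0fb2\u0f71\u0f80",
           "\u0f78", "\u0f79", "\u0fb3\u0f71\u0f80",
           "\u0f80", "\u0f81", "\u0f71\u0f80"]

print(f"{'':<16}{'NFD':<16}NFC")
for source in SOURCES:
    nfd, nfc = normalization_row(source)
    print(f"{hexes(source):<16}{nfd:<16}{nfc}")
```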

If you gave 0F77 and 0F79 *canonical* decompositions, then:

0F77 --> <0FB2, 0F81> --> <0FB2, 0F71, 0F80>
  0         0     0          0    129   130
 
0F79 --> <0FB3, 0F81> --> <0FB3, 0F71, 0F80>
  0         0     0          0    129   130

                NFD             NFC
0F76            0FB2 0F80       0FB2 0F80
0F77            0FB2 0F71 0F80  ????            <-- discouraged (strongly)
0FB2 0F71 0F80  0FB2 0F71 0F80  0FB2 0F71 0F80  <-- preferred
0F78            0FB3 0F80       0FB3 0F80
0F79            0FB3 0F71 0F80  ????            <-- discouraged (strongly)
0FB3 0F71 0F80  0FB3 0F71 0F80  0FB3 0F71 0F80  <-- preferred
0F80            0F80            0F80
0F81            0F71 0F80       0F71 0F80       <-- discouraged
0F71 0F80       0F71 0F80       0F71 0F80       <-- preferred

Now you've made your life more difficult, and normalization implementations possibly more complex. The decomposed sequences <0FB2, 0F71, 0F80> and <0FB3, 0F71, 0F80> have to be prevented from recomposing. They won't recompose partwise, because <0F71, 0F80> is blocked from recomposing, and <0FB2, 0F80> is also blocked from recomposing, but the sequence of 3 has, at least in principle, a target it should recompose to, unless blocked. Depending on how you set up your tables, you might or might not get this right. In any case, you end up introducing the strongly discouraged characters as a source of valid sequences that you have to contend with in NFC and NFD, whereas under the current scheme you don't.
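
The "what if" expansion discussed above can be replayed with a toy decomposer. This is only a sketch under stated assumptions: the table `HYPOTHETICAL_CANONICAL` treats the 0F77 and 0F79 mappings as canonical, which the real data does not, and all the names here are invented for illustration. It covers only the decomposition side and does not try to answer the "????" in the NFC column.

```python
# Canonical combining classes taken from the UnicodeData lines quoted earlier;
# every other code point in this sketch is treated as class 0.
CCC = {0x0F71: 129, 0x0F80: 130}

# HYPOTHETICAL table: the 0F77 and 0F79 entries are <compat> in the real data.
# They are treated as canonical here only to replay the thought experiment.
HYPOTHETICAL_CANONICAL = {
    0x0F76: (0x0FB2, 0x0F80),
    0x0F77: (0x0FB2, 0x0F81),
    0x0F78: (0x0FB3, 0x0F80),
    0x0F79: (0x0FB3, 0x0F81),
    0x0F81: (0x0F71, 0x0F80),
}

def ccc(cp):
    return CCC.get(cp, 0)

def expand(cp):
    """Recursively apply the hypothetical canonical mappings."""
    parts = HYPOTHETICAL_CANONICAL.get(cp)
    if parts is None:
        return [cp]
    return [leaf for part in parts for leaf in expand(part)]

def canonical_order(cps):
    """Stable-sort each run of non-starters (class != 0) by combining class."""
    out = list(cps)
    start = 0
    while start < len(out):
        if ccc(out[start]) == 0:
            start += 1
            continue
        end = start
        while end < len(out) and ccc(out[end]) != 0:
            end += 1
        out[start:end] = sorted(out[start:end], key=ccc)  # sorted() is stable
        start = end
    return out

def hypothetical_nfd(cps):
    return canonical_order([leaf for cp in cps for leaf in expand(cp)])

for cp in (0x0F77, 0x0F79):
    result = hypothetical_nfd([cp])
    print(f"{cp:04X} -> " + " ".join(f"{c:04X}({ccc(c)})" for c in result))
```

Under these assumptions each discouraged character expands to exactly the preferred three-character sequence, so the two would become canonically equivalent, and a recomposition step would then have to be kept from turning the preferred sequence back into the discouraged character.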

Also, this was all part of a very head-breaking set of problems for Tibetan when decompositions and canonical combining classes were being reviewed for the introduction of normalization in the first place.

In Unicode 2.0, 0F77 and 0F79 *were* given canonical decompositions, but they were *different* decompositions, to wit:

0F77 = 0F76 + 0F71 = 0FB2 + 0F80 + 0F71
0F79 = 0F78 + 0F71 = 0FB3 + 0F80 + 0F71

*and* they had funky fixed position class assignments, as well:

0F77 = 0F76 + 0F71 = 0FB2 + 0F80 + 0F71  (not in canonical order)
 135    134    129     6     143    129
 
0F79 = 0F78 + 0F71 = 0FB3 + 0F80 + 0F71  (not in canonical order)
 137    136    129     6     143    129
 
That was clearly hosed, as it broke all kinds of rules that we were trying to establish for normalization, including ensuring that all decomposition mappings produced sequences in canonical order and ensuring, as much as was possible, given the constraints in place, that the resulting sequences would follow the logic of the script *and* that NFC forms would decompose if that was what the users of the script preferred (hence the introduction of script-specific composition exclusions for several scripts, including Tibetan).
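
The "not in canonical order" remark can be checked mechanically: a sequence is in canonical order when no character with a non-zero combining class directly follows one with a higher class. The sketch below applies that rule to the class values quoted above. The numbers are copied from this message rather than from any library, and the function name is made up for the example.

```python
def in_canonical_order(classes):
    """True when no adjacent pair (a, b) of combining classes has a > b > 0."""
    return not any(a > b > 0 for a, b in zip(classes, classes[1:]))

# Unicode 2.0 classes for 0FB2 (or 0FB3), 0F80, 0F71, as quoted above.
UNICODE_2_0_CLASSES = [6, 143, 129]

# Classes for 0FB2 (or 0FB3), 0F71, 0F80 from the UnicodeData lines quoted earlier.
LATER_CLASSES = [0, 129, 130]

print(in_canonical_order(UNICODE_2_0_CLASSES))  # 143 is followed by 129
print(in_canonical_order(LATER_CLASSES))
```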

During that conversion from Unicode 2.0 to Unicode 3.0 with normalization, the UTC did the best it could with the mess for Tibetan. It was clear after the analysis that 0F77 and 0F79 should never have been encoded at all -- which was why they got those "strongly discouraged" labels -- but there was nothing to do about that mistake at that point. The compatibility decompositions were the best compromise to keep them from contaminating the normalization processing of the rest of the Tibetan vowels.
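
The practical effect of that compromise shows up when the canonical and compatibility forms are compared side by side. As a small sketch, again assuming Python's standard `unicodedata` module and with helper names invented for the example:

```python
import unicodedata

def hexes(text):
    """Format a string as space-separated 4-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

def forms(source):
    """Map each normalization form name to the normalized hex sequence."""
    return {form: hexes(unicodedata.normalize(form, source))
            for form in ("NFD", "NFC", "NFKD", "NFKC")}

for source in ("\u0f77", "\u0f79"):
    print(hexes(source), forms(source))
```

NFD and NFC leave the two characters untouched, while NFKD and NFKC replace them with the preferred three-character sequences.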

--Ken

Anyway, like I said, this seemed to me like good historical information to put out there, and certainly to help show that Every character has a story!

 

This post brought to you by ཷ and ཹ (U+0f77 and U+0f79, a.k.a. TIBETAN VOWEL SIGN VOCALIC RR and TIBETAN VOWEL SIGN VOCALIC LL)

