by Michael S. Kaplan, published on 2011/04/03 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/04/03/10149190.aspx
I had somebody asking me about the PRI (Public Review Issue) that just came out from Unicode. It all just seemed kind of confusing to her.
So I thought I'd talk about it here a bit. :-)
It is issue #181, and here is the text on it:
180 Changing General Category of Twelve Characters 2011.05.02 Status: Open
Description of Issue:
The UTC has decided to change the general category of twelve characters. The characters in question are these:
- U+00AA FEMININE ORDINAL INDICATOR
- U+00BA MASCULINE ORDINAL INDICATOR
- U+1D62 LATIN SUBSCRIPT SMALL LETTER I
- U+1D63 LATIN SUBSCRIPT SMALL LETTER R
- U+1D64 LATIN SUBSCRIPT SMALL LETTER U
- U+1D65 LATIN SUBSCRIPT SMALL LETTER V
- U+1D66 GREEK SUBSCRIPT SMALL LETTER BETA
- U+1D67 GREEK SUBSCRIPT SMALL LETTER GAMMA
- U+1D68 GREEK SUBSCRIPT SMALL LETTER RHO
- U+1D69 GREEK SUBSCRIPT SMALL LETTER PHI
- U+1D6A GREEK SUBSCRIPT SMALL LETTER CHI
- U+2C7C LATIN SUBSCRIPT SMALL LETTER J

The UTC intends to change the general category of these characters from its current value of "Ll" to the value "Lm". The rationale is that superscript or subscript letters with decompositions to a single character should consistently have gc=Lm. Changing the general category for these twelve characters aligns them with the 122 other superscript or subscript letters whose General_Category is already "Lm".
This change for the General Category property implies some changes for dependent casing properties. In particular, in order to keep the derived Lowercase property values unchanged, each of the twelve characters will have the contributory property Other_Lowercase set to Yes. The property Case_Ignorable, which is a narrow-use property only relevant to some special casing boundary determination (see D136 and Table 3-15 in Chapter 3 of Unicode 6.0 for details), would change from No to Yes for these twelve characters. The changes are summarized in the following table:
Property | Old Value | New Value |
---|---|---|
General_Category | Ll | Lm |
Other_Lowercase | No | Yes |
Lowercase | Yes | Yes |
Case_Ignorable | No | Yes |

The behavior of software may change for these twelve characters if it is dependent on a distinction between gc=Ll versus gc=Lm, or on the value of the Case_Ignorable property.
Feedback is being requested on the positive and negative effects, if any, these changes would have on existing implementations. A change in behavior may be considered positive, for example, if it results in a more uniform treatment of compatibility super/subscript characters and modifier letters. It may be considered negative if the change in properties produces an unexpected result or forces an unwanted change to software to compensate for the change.
Bug #1 is of course that the issue at http://www.unicode.org/review/pri181/ claims to be issue 180 -- though PRI #180 is actually the one I talked a little about in Address formats are hard, let's go shopping!, revisited (aka To me, 'good enough' just isn't good enough).
But that is probably a little copy/paste bug that they'll undoubtedly fix soon. :-)
Anyway the questions I was being asked weren't about that....
They were about the nature of the four properties being discussed:
Property | Old Value | New Value | Meaning |
---|---|---|---|
General_Category | Ll | Lm | This is a useful breakdown into various character types which can be used as a default categorization in implementations. |
Other_Lowercase | No | Yes | Used in deriving the Lowercase property. |
Lowercase | Yes | Yes | Characters with the Lowercase property. Generated from: Ll + Other_Lowercase |
Case_Ignorable | No | Yes | Characters which are ignored for casing purposes. Generated from: Mn + Me + Cf + Lm + Sk + Word_Break=MidLetter + Word_Break=MidNumLet |
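To make the two "Generated from" rows concrete, here is a minimal Python sketch of the derivations exactly as the table states them. The function and variable names are made up for illustration, and the sketch assumes that none of the twelve characters has Word_Break=MidLetter or Word_Break=MidNumLet, so only the General_Category term matters for them.

```python
# Sketch of the derivations described in the table above.
# All names here are illustrative, not part of any Unicode library.

# General_Category values that contribute to Case_Ignorable (per the table).
CASE_IGNORABLE_GCS = {"Mn", "Me", "Cf", "Lm", "Sk"}
# Word_Break values that contribute to Case_Ignorable (per the table).
CASE_IGNORABLE_WORD_BREAKS = {"MidLetter", "MidNumLet"}


def derive_lowercase(gc, other_lowercase):
    """Lowercase = Ll + Other_Lowercase."""
    return gc == "Ll" or other_lowercase


def derive_case_ignorable(gc, word_break="Other"):
    """Case_Ignorable = Mn + Me + Cf + Lm + Sk + two Word_Break values."""
    return gc in CASE_IGNORABLE_GCS or word_break in CASE_IGNORABLE_WORD_BREAKS


def summarize(props):
    """Compute both derived properties from the contributory ones."""
    return {
        "Lowercase": derive_lowercase(props["gc"], props["other_lowercase"]),
        "Case_Ignorable": derive_case_ignorable(props["gc"]),
    }


# Contributory values for the twelve characters before and after the change.
old = {"gc": "Ll", "other_lowercase": False}
new = {"gc": "Lm", "other_lowercase": True}

print("old:", summarize(old))
print("new:", summarize(new))
```

The sketch shows why Other_Lowercase has to flip to Yes: without it, moving the characters out of Ll would silently drop them from Lowercase. Case_Ignorable, on the other hand, changes as a direct side effect of the new Lm value.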
Of these four properties:
There is a fundamental identity issue here, of course. These twelve characters each have an obvious analogue in Unicode already:
Character | Code point | Name | Similar character | Similar code point | Similar name |
---|---|---|---|---|---|
ª | U+00AA | FEMININE ORDINAL INDICATOR | a | U+0061 | LATIN SMALL LETTER A |
º | U+00BA | MASCULINE ORDINAL INDICATOR | o | U+006f | LATIN SMALL LETTER O |
ᵢ | U+1D62 | LATIN SUBSCRIPT SMALL LETTER I | i | U+0069 | LATIN SMALL LETTER I |
ᵣ | U+1D63 | LATIN SUBSCRIPT SMALL LETTER R | r | U+0072 | LATIN SMALL LETTER R |
ᵤ | U+1D64 | LATIN SUBSCRIPT SMALL LETTER U | u | U+0075 | LATIN SMALL LETTER U |
ᵥ | U+1D65 | LATIN SUBSCRIPT SMALL LETTER V | v | U+0076 | LATIN SMALL LETTER V |
ᵦ | U+1D66 | GREEK SUBSCRIPT SMALL LETTER BETA | β | U+03b2 | GREEK SMALL LETTER BETA |
ᵧ | U+1D67 | GREEK SUBSCRIPT SMALL LETTER GAMMA | γ | U+03b3 | GREEK SMALL LETTER GAMMA |
ᵨ | U+1D68 | GREEK SUBSCRIPT SMALL LETTER RHO | ρ | U+03c1 | GREEK SMALL LETTER RHO |
ᵩ | U+1D69 | GREEK SUBSCRIPT SMALL LETTER PHI | φ | U+03c6 | GREEK SMALL LETTER PHI |
ᵪ | U+1D6A | GREEK SUBSCRIPT SMALL LETTER CHI | χ | U+03c7 | GREEK SMALL LETTER CHI |
ⱼ | U+2C7C | LATIN SUBSCRIPT SMALL LETTER J | j | U+006a | LATIN SMALL LETTER J |
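The pairing in this table can be checked against the Unicode Character Database itself, since each of the twelve has a compatibility decomposition to a single character. Here is a small sketch using Python's standard `unicodedata` module; the `TWELVE` mapping and the `compat_target` helper are names made up for this example. The General_Category it prints depends on the Unicode version bundled with your Python, so you may see either Ll or Lm.

```python
import unicodedata

# Code point -> "similar" code point, copied from the table above.
TWELVE = {
    0x00AA: 0x0061, 0x00BA: 0x006F,
    0x1D62: 0x0069, 0x1D63: 0x0072, 0x1D64: 0x0075, 0x1D65: 0x0076,
    0x1D66: 0x03B2, 0x1D67: 0x03B3, 0x1D68: 0x03C1, 0x1D69: 0x03C6,
    0x1D6A: 0x03C7, 0x2C7C: 0x006A,
}


def compat_target(cp):
    """Return the single code point a tagged decomposition maps to, else None."""
    # unicodedata.decomposition returns e.g. '<super> 0061' or '<sub> 0069'.
    fields = unicodedata.decomposition(chr(cp)).split()
    if len(fields) == 2 and fields[0].startswith("<"):
        return int(fields[1], 16)
    return None


for cp, similar in TWELVE.items():
    target = compat_target(cp)
    shown = "none" if target is None else f"U+{target:04X}"
    print(
        f"U+{cp:04X} gc={unicodedata.category(chr(cp))} "
        f"decomposes to {shown} (matches table: {target == similar})"
    )
```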
Pretty much no one would argue with the clear relationship between the two characters, or the fact that in rich text one could actually make one look like the other -- e.g. in HTML with <SUB></SUB> tags surrounding the letter.
Obviously they are lowercase letters, so why wouldn't they be Ll?
However, and this is where things get a little weird, people don't all feel as strongly about what an uppercasing operation ought to do.
And thus a formal way to make a character "lowercase, but not" was needed, and one was created. A caste system of sorts was designed, so that there was now a lower class of lowercase -- one that did not have all of the rights and privileges of normal ordinary lowercase letters.
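One place where the "lowercase, but not" status is easy to see is uppercasing: none of the twelve has an uppercase mapping, so an uppercasing operation leaves them alone. A quick sketch in Python (the variable names are just for illustration, and the result reflects whatever Unicode data your Python ships with):

```python
# The twelve characters from the PRI, as escapes so the source stays ASCII.
TWELVE = "\u00aa\u00ba\u1d62\u1d63\u1d64\u1d65\u1d66\u1d67\u1d68\u1d69\u1d6a\u2c7c"

# Characters that str.upper() returns unchanged.
unchanged = [ch for ch in TWELVE if ch.upper() == ch]

print(len(unchanged), "of", len(TWELVE), "are unchanged by str.upper()")
# For contrast, an ordinary lowercase letter does change.
print("a".upper() != "a")
```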
The goal was to bring enough rigor to this "now complicated though it used to be simple" stuff that anyone using the data of the Unicode Character Database in its entirety would not get unexpected results on characters such as these, characters they had never heard of before.
On a somewhat regular basis, some automated process notes an anomaly like this one -- characters that seem to not be set up to return correct results. And this suggested change just came when a member company reported such a discontinuity....
The PRI is there so that anyone who would be majorly broken by "fixing" these characters can speak up now....