It's like a lower class of Lowercase...

by Michael S. Kaplan, published on 2011/04/03 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/04/03/10149190.aspx


I had somebody asking me about the PRI (Public Review Issue) that just came out from Unicode. It all just seemed kind of confusing to her.

So I thought I'd talk about it here a bit. :-)

It is issue #181, and here is the text on it:


180 Changing General Category of Twelve Characters 2011.05.02
Status: Open
 

Description of Issue:

The UTC has decided to change the general category of twelve characters. The characters in question are these:

	U+00AA FEMININE ORDINAL INDICATOR
	U+00BA MASCULINE ORDINAL INDICATOR
	U+1D62 LATIN SUBSCRIPT SMALL LETTER I
	U+1D63 LATIN SUBSCRIPT SMALL LETTER R
	U+1D64 LATIN SUBSCRIPT SMALL LETTER U
	U+1D65 LATIN SUBSCRIPT SMALL LETTER V
	U+1D66 GREEK SUBSCRIPT SMALL LETTER BETA
	U+1D67 GREEK SUBSCRIPT SMALL LETTER GAMMA
	U+1D68 GREEK SUBSCRIPT SMALL LETTER RHO
	U+1D69 GREEK SUBSCRIPT SMALL LETTER PHI
	U+1D6A GREEK SUBSCRIPT SMALL LETTER CHI
	U+2C7C LATIN SUBSCRIPT SMALL LETTER J

The UTC intends to change the general category of these characters from its current value of "Ll" to the value "Lm". The rationale is that superscript or subscript letters with decompositions to a single character should consistently have gc=Lm. Changing the general category for these twelve characters aligns them with the 122 other superscript or subscript letters whose General_Category is already "Lm".

This change for the General Category property implies some changes for dependent casing properties. In particular, in order to keep the derived Lowercase property values unchanged, each of the twelve characters will have the contributory property Other_Lowercase set to Yes. The property Case_Ignorable, which is a narrow-use property only relevant to some special casing boundary determination (see D136 and Table 3-15 in Chapter 3 of Unicode 6.0 for details), would change from No to Yes for these twelve characters. The changes are summarized in the following table:

Property          Old Value  New Value
General_Category  Ll         Lm
Other_Lowercase   No         Yes
Lowercase         Yes        Yes
Case_Ignorable    No         Yes

The behavior of software may change for these twelve characters if it is dependent on a distinction between gc=Ll versus gc=Lm, or on the value of the Case_Ignorable property.

Feedback is being requested on the positive and negative effects, if any, these changes would have on existing implementations. A change in behavior may be considered positive, for example, if it results in a more uniform treatment of compatibility super/subscript characters and modifier letters. It may be considered negative if the change in properties produces an unexpected result or forces an unwanted change to software to compensate for the change.


Bug #1 is of course that the issue at http://www.unicode.org/review/pri181/ is claiming to be issue 180 -- though PRI #180 is actually the one I talked a little about in Address formats are hard, let's go shopping!, revisited (aka To me, 'good enough' just isn't good enough).

But that is probably a little copy/paste bug that they'll undoubtedly fix soon. :-)

Anyway the questions I was being asked weren't about that....

They were about the nature of the four properties being discussed:

Property          Old Value  New Value  Meaning
General_Category  Ll         Lm         This is a useful breakdown into various character types which can be used as a default categorization in implementations.
Other_Lowercase   No         Yes        Used in deriving the Lowercase property.
Lowercase         Yes        Yes        Characters with the Lowercase property. Generated from: Ll + Other_Lowercase
Case_Ignorable    No         Yes        Characters which are ignored for casing purposes. Generated from: Mn + Me + Cf + Lm + Sk + Word_Break=MidLetter + Word_Break=MidNumLet
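
To make the two "Generated from" rules a bit more concrete, here is a minimal Python sketch of how the derived values fall out of the contributory ones. The function names are made up for illustration, and the Word_Break default of ALetter is an assumption (all that matters here is that it is neither MidLetter nor MidNumLet); this is not how any particular library computes these properties.

```python
# Illustrative sketch only: function names and the Word_Break default
# are assumptions, not part of the Unicode data files or any library.

CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}
CASE_IGNORABLE_WORD_BREAKS = {"MidLetter", "MidNumLet"}


def derived_lowercase(general_category, other_lowercase):
    """Lowercase is generated from: Ll + Other_Lowercase."""
    return general_category == "Ll" or other_lowercase


def derived_case_ignorable(general_category, word_break="ALetter"):
    """Case_Ignorable is generated from:
    Mn + Me + Cf + Lm + Sk + Word_Break=MidLetter + Word_Break=MidNumLet."""
    return (general_category in CASE_IGNORABLE_CATEGORIES
            or word_break in CASE_IGNORABLE_WORD_BREAKS)


# Property values for the twelve characters, before and after the change.
old = {"gc": "Ll", "other_lowercase": False}
new = {"gc": "Lm", "other_lowercase": True}

for label, props in (("old", old), ("new", new)):
    print(label,
          "Lowercase =", derived_lowercase(props["gc"], props["other_lowercase"]),
          "| Case_Ignorable =", derived_case_ignorable(props["gc"]))
```

Moving from Ll to Lm would by itself drop the characters out of Lowercase, which is why Other_Lowercase has to flip to Yes to compensate. Case_Ignorable has no such compensating switch, so it changes along with the general category.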

 Of these four properties:

There is a fundamental identity issue here, of course. These twelve characters each have an obvious analogue in Unicode already:

Character  Code point  Name                                Similar character  Similar code point  Similar name
ª          U+00AA      FEMININE ORDINAL INDICATOR          a                  U+0061              LATIN SMALL LETTER A
º          U+00BA      MASCULINE ORDINAL INDICATOR         o                  U+006F              LATIN SMALL LETTER O
ᵢ          U+1D62      LATIN SUBSCRIPT SMALL LETTER I      i                  U+0069              LATIN SMALL LETTER I
ᵣ          U+1D63      LATIN SUBSCRIPT SMALL LETTER R      r                  U+0072              LATIN SMALL LETTER R
ᵤ          U+1D64      LATIN SUBSCRIPT SMALL LETTER U      u                  U+0075              LATIN SMALL LETTER U
ᵥ          U+1D65      LATIN SUBSCRIPT SMALL LETTER V      v                  U+0076              LATIN SMALL LETTER V
ᵦ          U+1D66      GREEK SUBSCRIPT SMALL LETTER BETA   β                  U+03B2              GREEK SMALL LETTER BETA
ᵧ          U+1D67      GREEK SUBSCRIPT SMALL LETTER GAMMA  γ                  U+03B3              GREEK SMALL LETTER GAMMA
ᵨ          U+1D68      GREEK SUBSCRIPT SMALL LETTER RHO    ρ                  U+03C1              GREEK SMALL LETTER RHO
ᵩ          U+1D69      GREEK SUBSCRIPT SMALL LETTER PHI    φ                  U+03C6              GREEK SMALL LETTER PHI
ᵪ          U+1D6A      GREEK SUBSCRIPT SMALL LETTER CHI    χ                  U+03C7              GREEK SMALL LETTER CHI
ⱼ          U+2C7C      LATIN SUBSCRIPT SMALL LETTER J      j                  U+006A              LATIN SMALL LETTER J

Pretty much no one would dispute the clear relationship between the two characters, or the fact that in rich text one could actually make one look like the other -- e.g. in HTML with <SUB></SUB> tags surrounding the letter.
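
For anyone who wants to poke at that relationship themselves, here is a small Python sketch using the standard unicodedata module. The PAIRS list is copied from the table above and compat_base is a helper name made up for this example. The general category it prints depends on which Unicode version your Python was built with, so treat that column as informational only.

```python
import unicodedata

# (special character, plain analogue) pairs, taken from the table above.
PAIRS = [
    ("\u00aa", "a"), ("\u00ba", "o"),
    ("\u1d62", "i"), ("\u1d63", "r"), ("\u1d64", "u"), ("\u1d65", "v"),
    ("\u1d66", "\u03b2"), ("\u1d67", "\u03b3"), ("\u1d68", "\u03c1"),
    ("\u1d69", "\u03c6"), ("\u1d6a", "\u03c7"),
    ("\u2c7c", "j"),
]


def compat_base(ch):
    """Return the compatibility decomposition (NFKD) of a single character."""
    return unicodedata.normalize("NFKD", ch)


print("Unicode data version:", unicodedata.unidata_version)
for special, plain in PAIRS:
    print("U+%04X" % ord(special),
          "gc=" + unicodedata.category(special),       # varies by Unicode version
          "decomposes to U+%04X" % ord(compat_base(special)),
          "| has uppercase:", special.upper() != special,
          "| analogue has uppercase:", plain.upper() != plain)
```

Each of the twelve decomposes to its plain analogue, but none of them has an uppercase mapping of its own, while every one of the analogues does.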

Obviously they are lowercase letters, so why wouldn't they be Ll?

However, and this is where things get a little weird, people don't all feel as strongly about what an uppercasing operation ought to do.

And thus it was decided that a formal way to make them "lowercase, but not" was needed, and one was created. A caste system of sorts was designed, so that there was now a lower class of lowercase -- one that did not have all of the rights and privileges of normal ordinary lowercase letters.

The point was to bring enough rigor to this "now complicated though it used to be simple" stuff that anyone using the data of the Unicode Character Database in its entirety would not get unexpected results on characters such as these, characters they had never heard of before.

On a somewhat regular basis, some automated process notes an anomaly like this one -- characters that seem not to be set up to return correct results. And this suggested change came about when a member company reported just such a discontinuity....

The PRI is there so that anyone who will be majorly broken by "fixing" these characters can speak up now....

