It's like a lower class of Lowercase...

by Michael S. Kaplan, published on 2011/04/03 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/04/03/10149190.aspx

I had somebody asking me about the PRI (Public Review Issue) that just came out from Unicode. It all just seemed kind of confusing to her.

So I thought I'd talk about it here a bit. :-)

It is issue #181, and here is the text on it:

180 Changing General Category of Twelve Characters 2011.05.02

Status: Open

Description of Issue:

The UTC has decided to change the general category of twelve characters. The characters in question are these:
	U+00AA FEMININE ORDINAL INDICATOR
	U+00BA MASCULINE ORDINAL INDICATOR
	U+1D62 LATIN SUBSCRIPT SMALL LETTER I
	U+1D63 LATIN SUBSCRIPT SMALL LETTER R
	U+1D64 LATIN SUBSCRIPT SMALL LETTER U
	U+1D65 LATIN SUBSCRIPT SMALL LETTER V
	U+1D66 GREEK SUBSCRIPT SMALL LETTER BETA
	U+1D67 GREEK SUBSCRIPT SMALL LETTER GAMMA
	U+1D68 GREEK SUBSCRIPT SMALL LETTER RHO
	U+1D69 GREEK SUBSCRIPT SMALL LETTER PHI
	U+1D6A GREEK SUBSCRIPT SMALL LETTER CHI
	U+2C7C LATIN SUBSCRIPT SMALL LETTER J
The UTC intends to change the general category of these characters from its current value of "Ll" to the value "Lm". The rationale is that superscript or subscript letters with decompositions to a single character should consistently have gc=Lm. Changing the general category for these twelve characters aligns them with the 122 other superscript or subscript letters whose General_Category is already "Lm".

This change for the General Category property implies some changes for dependent casing properties. In particular, in order to keep the derived Lowercase property values unchanged, each of the twelve characters will have the contributory property Other_Lowercase set to Yes. The property Case_Ignorable, which is a narrow-use property only relevant to some special casing boundary determination (see D136 and Table 3-15 in Chapter 3 of Unicode 6.0 for details), would change from No to Yes for these twelve characters. The changes are summarized in the following table:

Property	Old Value	New Value
General_Category	Ll	Lm
Other_Lowercase	No	Yes
Lowercase	Yes	Yes
Case_Ignorable	No	Yes

The behavior of software may change for these twelve characters if it is dependent on a distinction between gc=Ll versus gc=Lm, or on the value of the Case_Ignorable property.

Feedback is being requested on the positive and negative effects, if any, these changes would have on existing implementations. A change in behavior may be considered positive, for example, if it results in a more uniform treatment of compatibility super/subscript characters and modifier letters. It may be considered negative if the change in properties produces an unexpected result or forces an unwanted change to software to compensate for the change.

Bug #1 is of course that the issue at http://www.unicode.org/review/pri181/ is claiming to be issue 180 -- though PRI #180 is actually the one I talked a little about in Address formats are hard, let's go shopping!, revisited (aka To me, 'good enough' just isn't good enough).

But that is probably a little copy/paste bug that they'll undoubtedly fix soon. :-)

Anyway the questions I was being asked weren't about that....

They were about the nature of the four properties being discussed:

Property	Old Value	New Value	Meaning
General_Category	Ll	Lm	This is a useful breakdown into various character types which can be used as a default categorization in implementations.
Other_Lowercase	No	Yes	Used in deriving the Lowercase property.
Lowercase	Yes	Yes	Characters with the Lowercase property. Generated from: Ll + Other_Lowercase
Case_Ignorable	No	Yes	Characters which are ignored for casing purposes. Generated from: Mn + Me + Cf + Lm + Sk + Word_Break=MidLetter + Word_Break=MidNumLet

Of these four properties:

The first is a part of the core identity of the character;
The second is a special property used to modulate the basic results of the General_Category so that code that is using this data to determine how to behave will generally behave in the way users expect;
The third and fourth are calculated properties, derived from the first two in order to make it easier for people to make use of the various properties.
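
To make the two "Generated from" rules in the table a bit more concrete, here is a minimal Python sketch of them. The CharProps record and the function names are made up for illustration, and the Word_Break value is just a placeholder: all that matters for these twelve characters is that it is neither MidLetter nor MidNumLet, which follows from Case_Ignorable having been No while they were gc=Ll.

```python
from dataclasses import dataclass

# Inputs to Case_Ignorable, as listed in the table above.
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}
CASE_IGNORABLE_WORD_BREAKS = {"MidLetter", "MidNumLet"}


@dataclass(frozen=True)
class CharProps:
    """Hypothetical record holding only the properties discussed here."""
    general_category: str
    other_lowercase: bool
    word_break: str = "Other"  # placeholder, not taken from the UCD


def is_lowercase(p: CharProps) -> bool:
    # Lowercase is generated from: Ll + Other_Lowercase
    return p.general_category == "Ll" or p.other_lowercase


def is_case_ignorable(p: CharProps) -> bool:
    # Case_Ignorable is generated from the categories and
    # Word_Break values listed above.
    return (p.general_category in CASE_IGNORABLE_CATEGORIES
            or p.word_break in CASE_IGNORABLE_WORD_BREAKS)


# One of the twelve characters, before and after the proposed change.
old = CharProps(general_category="Ll", other_lowercase=False)
new = CharProps(general_category="Lm", other_lowercase=True)

for label, props in (("old", old), ("new", new)):
    print(label, is_lowercase(props), is_case_ignorable(props))
```

With the old values the sketch reports Lowercase as true and Case_Ignorable as false; with the new values both come out true, which matches the Old Value and New Value columns.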

There is a fundamental identity issue here, of course. These twelve characters each have an obvious analogue in Unicode already:

Character	Code point	Name	Similar character	Similar code point	Similar name
ª	U+00AA	FEMININE ORDINAL INDICATOR	a	U+0061	LATIN SMALL LETTER A
º	U+00BA	MASCULINE ORDINAL INDICATOR	o	U+006F	LATIN SMALL LETTER O
ᵢ	U+1D62	LATIN SUBSCRIPT SMALL LETTER I	i	U+0069	LATIN SMALL LETTER I
ᵣ	U+1D63	LATIN SUBSCRIPT SMALL LETTER R	r	U+0072	LATIN SMALL LETTER R
ᵤ	U+1D64	LATIN SUBSCRIPT SMALL LETTER U	u	U+0075	LATIN SMALL LETTER U
ᵥ	U+1D65	LATIN SUBSCRIPT SMALL LETTER V	v	U+0076	LATIN SMALL LETTER V
ᵦ	U+1D66	GREEK SUBSCRIPT SMALL LETTER BETA	β	U+03B2	GREEK SMALL LETTER BETA
ᵧ	U+1D67	GREEK SUBSCRIPT SMALL LETTER GAMMA	γ	U+03B3	GREEK SMALL LETTER GAMMA
ᵨ	U+1D68	GREEK SUBSCRIPT SMALL LETTER RHO	ρ	U+03C1	GREEK SMALL LETTER RHO
ᵩ	U+1D69	GREEK SUBSCRIPT SMALL LETTER PHI	φ	U+03C6	GREEK SMALL LETTER PHI
ᵪ	U+1D6A	GREEK SUBSCRIPT SMALL LETTER CHI	χ	U+03C7	GREEK SMALL LETTER CHI
ⱼ	U+2C7C	LATIN SUBSCRIPT SMALL LETTER J	j	U+006A	LATIN SMALL LETTER J

Pretty much no one would dispute the clear relationship between the two characters in each row, or the fact that in rich text one could actually make one look like the other -- e.g. in HTML, with <SUB></SUB> tags surrounding the letter.
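
The relationship is also recorded in the data itself, as the single-character decompositions that the PRI text mentions. A rough sketch using Python's standard unicodedata module shows it for a few of the twelve; compat_mapping is a hypothetical helper name, and the results reflect whatever Unicode version a given Python build ships with, though decomposition mappings are meant to be stable.

```python
import unicodedata


def compat_mapping(ch: str) -> tuple[str, str]:
    """Return (decomposition tag, NFKC form) for a single character."""
    # The decomposition field looks like "<sub> 0069"; keep only the tag.
    tag = unicodedata.decomposition(ch).split()[0]
    return tag, unicodedata.normalize("NFKC", ch)


# A sample of the twelve characters from the table above.
for ch in "\u00aa\u00ba\u1d62\u1d66\u2c7c":
    tag, plain = compat_mapping(ch)
    print(f"U+{ord(ch):04X} {tag} -> U+{ord(plain):04X}")
```

The ordinal indicators carry a <super> tag and the rest carry <sub>, and in each case the compatibility form is the plain letter from the right-hand side of the table.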

Obviously they are lowercase letters, so why wouldn't they be Ll?

However, and this is where things get a little weird, people don't all feel as strongly about what an uppercasing operation ought to do.

And thus a formal way to make a character "lowercase, but not" was needed, and one was created. A caste system of sorts was designed, so that there was now a lower class of lowercase -- one that did not have all of the rights and privileges of normal ordinary lowercase letters.

The idea was to bring enough rigor to this "now complicated though it used to be simple" stuff that any software using the data of the Unicode Character Database in its entirety would not get unexpected results on characters such as these, characters it had never heard of before.

On a somewhat regular basis, some automated process notes an anomaly like this one -- characters that seem to not be set up to return correct results. And this suggested change just came when a member company reported such a discontinuity....

The PRI is there so that anyone who would be majorly broken by "fixing" these characters can speak up now....
