by Michael S. Kaplan, published on 2006/08/10 05:12 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/08/10/694191.aspx
Earlier this year, I talked about the stability of the Unicode Character Database, and about how there are really not all that many changes that happen to items like the general category that the CharUnicodeInfo class depends on. The same is true of other property values.
But in that battle of consistency vs. correctness, sometimes it is correctness that will win.
Yesterday at the Unicode Technical Committee meeting (being held at Adobe in Seattle), one of those occasional changes that from time to time will happen, happened.
It started because of something that Mark Davis (well, one of his colleagues) noticed:
We have a bug in 5.0 in that a case pair is split across two different script values. We should fix this in U5+, and add an invariant test to ensure that this doesn't recur.
2183;ROMAN NUMERAL REVERSED ONE HUNDRED;Lu;0;L;;;;;N;;;;2184;
2183 ; Common # L& ROMAN NUMERAL REVERSED ONE HUNDRED
2184;LATIN SMALL LETTER REVERSED C;Ll;0;L;;;;;N;;;2183;;2183
2184 ; Latin # L& LATIN SMALL LETTER REVERSED C
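The invariant mentioned in that report (no case pair split across two script values) can be sketched in a few lines of Python. This is only an illustration and not the test the UTC actually adopted: the function names are made up for this example, the sample data is just the four lines quoted above, and a real test would read the full UnicodeData.txt and Scripts.txt files and might need to allow for exceptions that are not covered here.

```python
# Sample records copied from the Unicode 5.0 excerpts quoted above.
UNICODE_DATA = """\
2183;ROMAN NUMERAL REVERSED ONE HUNDRED;Lu;0;L;;;;;N;;;;2184;
2184;LATIN SMALL LETTER REVERSED C;Ll;0;L;;;;;N;;;2183;;2183
"""

SCRIPTS = """\
2183 ; Common # L& ROMAN NUMERAL REVERSED ONE HUNDRED
2184 ; Latin # L& LATIN SMALL LETTER REVERSED C
"""


def parse_scripts(text):
    """Map each code point to its script name (Scripts.txt-style lines)."""
    scripts = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        cp_range, script = (part.strip() for part in data.split(";"))
        first, _, last = cp_range.partition("..")  # "XXXX" or "XXXX..YYYY"
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            scripts[cp] = script
    return scripts


def parse_case_pairs(text):
    """Yield (code point, case-mapped code point) from UnicodeData.txt-style lines."""
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        # Fields 12, 13 and 14 hold the simple upper/lower/title case mappings.
        for mapped in fields[12:15]:
            if mapped:
                yield cp, int(mapped, 16)


def find_split_case_pairs(unicode_data, scripts_text):
    """Return case pairs whose two members have different script values."""
    scripts = parse_scripts(scripts_text)
    violations = set()
    for cp, mapped in parse_case_pairs(unicode_data):
        low, high = sorted((cp, mapped))
        script_low = scripts.get(low, "Unknown")
        script_high = scripts.get(high, "Unknown")
        if script_low != script_high:
            violations.add((low, high, script_low, script_high))
    return sorted(violations)


for low, high, s_low, s_high in find_split_case_pairs(UNICODE_DATA, SCRIPTS):
    print(f"U+{low:04X} ({s_low}) / U+{high:04X} ({s_high})")
```

Run against the 5.0 data quoted above, the check flags the U+2183/U+2184 pair. Once both characters carry the same script value, the list of violations comes back empty.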
Roman Numerals are something I have talked about before (like in this post).
Now the truth is that almost all of the Roman Numerals in Unicode have two things in common:
In fact, the first point is true of all but the two characters above, and the second point is true of all but one of them.
But the honest truth is that putting them in the script value of Common never really made very much sense, because even if they are used within other scripts, they never really stop being Latin letters. And having these two characters as a case pair, with Lu/Ll general categories but different scripts, is really weird.
And having these two differ from all of the rest, when they all have the same issue, is also bad.
So the plan? To update the script membership of the Roman numerals (all of them) from Common to Latin, which is what they are. But since those two characters happen to have a case relationship, the difference in general category of the two characters mentioned above will be left alone.
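For anyone who wants to look at the general categories and the case relationship involved, here is a small Python sketch using the standard `unicodedata` module. The `describe` helper is a made-up name for this example. The module does not expose the script property, so only the general category, the name, and the case mapping are shown, and the results reflect whatever version of the Unicode data the Python build ships with.

```python
import unicodedata

ROMAN_NUMERAL_ONE = "\u2160"     # an ordinary Roman numeral, for comparison
REVERSED_ONE_HUNDRED = "\u2183"  # ROMAN NUMERAL REVERSED ONE HUNDRED
REVERSED_C = "\u2184"            # LATIN SMALL LETTER REVERSED C


def describe(ch):
    """Return (code point label, general category, character name)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


for ch in (ROMAN_NUMERAL_ONE, REVERSED_ONE_HUNDRED, REVERSED_C):
    print(*describe(ch))

# The two characters discussed above map to each other under simple casing.
print(REVERSED_ONE_HUNDRED.lower() == REVERSED_C)
print(REVERSED_C.upper() == REVERSED_ONE_HUNDRED)
```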
This is a change that will happen in a future version of the standard. It does allow for a slightly more correct result, albeit one that might cause some people to be concerned.
In fact, I anticipate at least one bug being reported based on someone noticing the change, and my prediction is that they will not really even be using the characters in question! I will keep you posted on this, in any case. :-)
I'll talk more about other such changes in another post....
This post brought to you by ↄ (U+2184, a.k.a. LATIN SMALL LETTER REVERSED C)