by Michael S. Kaplan, published on 2006/08/14 03:21 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/08/14/698983.aspx
The CARON has a long and unhappy history, one that is tied up with that whole Sk/Lm general category thing I talked about in this post.
Ken Whistler laid it out for the CARON just recently, starting with the meandering path through UnicodeData.txt:
02C7;CARON;Lm;0;L;0020 030C;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third
02C7;CARON;Sk;0;L;;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;
02C7;CARON;Sk;0;ON;;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;
02C7;CARON;Lm;0;ON;;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;
So: gc=Lm --> Sk --> Lm
The difference for U+02C7 being that Unicode 1.1 mistakenly indicated that it was a spacing clone of a diacritic (by that 0020 030C decomposition), which was corrected in Unicode 2.0.
The name anomaly comes from the fact that U+02C7 was mapped to the 8859-2 0xB7 CARON, and the 10646 merger and SC2 rules made us change the name from MODIFIER LETTER HACEK as a result.
But the intent of U+02C6 and U+02C7 as encoded characters in Unicode, since the days of Unicode 1.0, has always been completely parallel -- which is why their General Category history is equally sorry.
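The gc=Lm --> Sk --> Lm path can be read straight off the third field of the four records quoted above. Here is a minimal Python sketch, assuming only the usual semicolon-separated UnicodeData.txt layout (code point, name, General Category, combining class, bidi class, decomposition, and so on). The function names `parse_record` and `gc_history` are made up for illustration, and the first record is left truncated exactly as quoted:

```python
# The four U+02C7 records quoted above, oldest first, copied verbatim.
RECORDS = [
    "02C7;CARON;Lm;0;L;0020 030C;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third",
    "02C7;CARON;Sk;0;L;;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;",
    "02C7;CARON;Sk;0;ON;;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;",
    "02C7;CARON;Lm;0;ON;;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;",
]


def parse_record(record):
    """Split one UnicodeData.txt line and keep the fields discussed here."""
    fields = record.split(";")
    return {
        "code": fields[0],
        "name": fields[1],
        "gc": fields[2],             # General Category
        "bidi": fields[4],           # bidirectional class
        "decomposition": fields[5],  # empty string when there is none
    }


def gc_history(records):
    """List the General Category values in order, dropping consecutive repeats."""
    history = []
    for record in records:
        gc = parse_record(record)["gc"]
        if not history or history[-1] != gc:
            history.append(gc)
    return history


print(" --> ".join(gc_history(RECORDS)))  # Lm --> Sk --> Lm
```

Only the first record carries the `0020 030C` decomposition, which is the "spacing clone" mistake Ken mentions.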
It becomes a great example of making things look less stable than they [usually] are. But as I said in this post, the difference between the two general categories is that one is meant to be usable in identifiers, and the other is not. As Ken pointed out:
This whole problem with modifier letters and gc=Sk versus gc=Lm is like that proverbial pebble in the shoe, I'm afraid. Every few years it becomes a "problem" to sort out again, and ends up with a few more characters jiggered one way or another across that boundary.
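To see what the Lm/Sk boundary means in practice for identifiers, here is a small sketch using Python's `unicodedata` module. Two assumptions are worth stating: the categories reported are those of whatever Unicode version your Python build ships with (not the historical ones), and `str.isidentifier()` reflects Python's own identifier rules, which are built on XID_Start/XID_Continue rather than on any of the older definitions discussed in this post. The helper name `describe` is just for illustration.

```python
import unicodedata


def describe(ch):
    """Return (code point, name, General Category, accepted inside an identifier?)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        # Prefix a plain letter so we test the "continue" position of an identifier.
        ("x" + ch).isidentifier(),
    )


# U+02C7 CARON versus U+02C2, which is also *named* MODIFIER LETTER something.
for ch in ("\u02C7", "\u02C2"):
    print(describe(ch))
```

On a current Python, CARON should come back as Lm and be accepted, while U+02C2 MODIFIER LETTER LEFT ARROWHEAD should come back as Sk and be rejected, despite the name.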
He then talked about the history of Sk:
For those who care about the history here, gc=Sk didn't exist at all in the original set of General Category values invented by Mark. All the characters *named* MODIFIER LETTER WHATEVER in UnicodeData-1.1.5.txt got the gc=Lm value.
Mark introduced gc=Sk in Unicode 2.0 to solve a different problem, which was the UTC then groping towards an identifier syntax that would do the right thing for Unicode strings based on Unicode character properties. See Section 5.14 Identifiers, pp. 5-25 to 5-27 in Unicode 2.0 if you can find a copy. In Unicode 2.0, identifiers were constructed on the [alphabetic] property, plus a number of additions and exceptions. But [alphabetic] itself, whose values were printed in the book, by the way, at pp. 4-14 to 4-15, claimed to include "modifier letters". That was problematical, and some of the modifier letters that clearly didn't look like they belonged in identifiers, were drained from [alphabetic] by inventing the new General Category Sk (symbol modifier), so they got classed with the other symbols, rather than getting lumped with the letters and such under [alphabetic].
Incidentally, the only place in the Unicode 2.0 standard, other than
where General Category values are explicitly enumerated was in the discussion of locating text element boundaries (Section 5.13), where the addition of gc=Sk got overlooked and was not yet properly accounted for. The boundary specification assumed that "MODIFIER LETTERS" were, well, modifier letters, and the description even explicitly says:
Lm = Modifier Letter (includes spacing versions of non-spacing marks)
So that part of the 2.0 standard was inconsistent with the changes that had been made to deal with identifier syntax.
It was Unicode 3.0 that revised the identifier syntax to make the classes specifically be based on General Category values, rather than [alphabetic] with exceptions. And there were a significant number of General Category changes which were driven by this. See Appendix D of Unicode 3.0, which notes the then-significant issue of trying to establish convergence between the Unicode definition of identifiers and the ISO TR 10176 definition of identifiers, which was being bandied about at that point as essential for formal programming languages. Page 979 of TUS 4.0:
General Category. A series of General Category changes were made to assist the convergence of the Unicode definition of identifier with ISO TR 10176.
Post Unicode 3.0 was when Mark staked out more territory in character properties, took over PropList.txt and started producing derived properties, using sets of tools for checking consistency, and introducing more properties of the Other_XYZ type to enable more robust derivation rules. The period between Unicode 3.0 and Unicode 4.0 saw all kinds of jiggering of General Category values that resulted from this, including the long list of proposed changes in L2/02-267.
Most of the revisions that resulted were undoubtedly improvements, but in the area of "MODIFIER LETTERS" things have just gotten more confused, in my opinion. In part this has resulted from an essential disconnect between the people proposing new characters for new scripts and additions of miscellaneous abstruse and oddball stuff for Latin, and the people maintaining and extending the Unicode Character Database.
And then at the end of this description came the most amusing summary:
To make this perhaps too pointedly ad hominem, but nevertheless fairly accurate, Michael Everson does not fully understand character properties or their interactions as demonstrated by Mark Davis' manifold property tools, and Mark Davis does not fully understand the functioning of modifier letters in newly encoded scripts and the numerous technical extensions for the Latin script. This tends to leave both of them, and the UTC as well, scratching their heads over the "Is it Lm? Or is it Sk?" decision that inevitably has to be made for all of these additions.
I guess you could say that not only does every character have a story; the truth is that some of them inspire monologues! :-)
This post brought to you by ˇ (U+02C7, a.k.a. CARON, f.k.a. MODIFIER LETTER HACEK)