by Michael S. Kaplan, published on 2011/07/25 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/07/25/10187633.aspx
The other day Karl Williamson asked a great question about Unicode stuff:
Subject: Why are the shorter cjk names second in PropertyAliases.txt?
The comments say that the first field is the short name, and the 2nd field is the long name. But in the case of the cjk properties, the short name is longer than the long name. Why?
And Ken Whistler was around to give a classic answer to this "drive on a parkway, park on a driveway" question for Unicode geeks.
So, in the fine tradition of:
I thought maybe I'd immortalize it here for all of those like me who delight in these sorts of tales.
The response was:
a) To make Karl ask questions? ;-)
b) Because there always has to be an exception to any rule (including this rule)?
c) Because the UTC said to do so?
Well, the short answer is "c". See Consensus 120-C23 in the minutes from UTC #120, L2/09-225.
But wait, you will then ask why the PropertyAliases.txt entries for the CJK properties aren't actually like what was agreed in Consensus 120-C23? For that you need the long answer.
To bring the normative CJK properties into the PropertyAliases.txt context, where property identifiers appropriate for regex and such are defined, the UTC originally decided that it would make sense to just prepend "cj" to the front of the existing Unihan tags from the Unihan Database, so that they would be more self-identifying as "CJK" properties to the uninitiated, and then make the resulting labels (cjkIRG_GSource, cjkRSUnicode, etc.) be both the long and the short identifier in PropertyAliases.txt. Then, the Unihan tag itself (without the "cj" prefix) would be added as a (third) alias, because people might well be using the exact Unihan tag for matching, and it wouldn't make sense to disable that. And the UTC wanted the labels with the "cj" on them in the short (abbreviated) field, because everybody agreed that there was no utility to attempting to shorten the Unihan tags further; they would just turn into unmnemonic gobbledygook.
Then when the consensus was actually acted upon, and PropertyAliases.txt with the changes started to see review, it dawned on everybody that, rather than have the official Unihan tag be a third alias, it should be the "long" name in the PropertyAliases.txt. You only needed two values: "cjkIRG_GSource" and the original tag value "kIRG_GSource". So why have two identical labels and also have to add the original tag as a third alias?
But then the conflict of two contradictory purposes kicked in. The first field is supposed to be "short(er)". But the second field is supposed to be "official(er)". See UAX #44, Section 5.8.1:
The long symbolic name alias is self-descriptive, and is treated as the official name of a Unicode character property. For clarity it is used whenever possible when referring to that property in this annex and elsewhere in the Unicode Standard.
Because the Unihan tags in the Unihan Database have longstanding status, predating by at least a decade the decision to tack "cj" onto the front of them for PropertyAliases.txt, and because the Unihan tags are used everywhere in the Unihan Database and its documentation, it became clear that those had to be the official name of those properties.
In this case, the UTC tossed up the dice, and they came down as "official(er)" trumps "short(er)" for the CJK properties.
So there is the long answer. Now I suppose people will want a short(er) version of that answer added to Section 5.8.1 of UAX #44, so this strange aberration will be seen for the Solomonic judgement it actually was. ;-)
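For anyone who wants to see the oddity for themselves, here is a minimal Python sketch that reads lines in the "short ; long ; further aliases" layout and reports every entry whose first field is longer than its second. The SAMPLE text is a hand-written, simplified excerpt using labels mentioned above plus two well-known non-CJK entries; it is not the full file, and the real entries may carry additional aliases. The helper names parse_aliases and inverted_entries are made up for this illustration.

```python
# SAMPLE is a hand-written, simplified excerpt in the PropertyAliases.txt
# layout (short ; long ; optional further aliases). It is an assumption for
# illustration only; the real file has many more entries and aliases.
SAMPLE = """\
# short name   ; long name
gc             ; General_Category
sc             ; Script
cjkIRG_GSource ; kIRG_GSource
cjkRSUnicode   ; kRSUnicode
"""


def parse_aliases(text):
    """Yield (short, long, extras) for each data line, skipping comments."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # not a property alias line
        yield fields[0], fields[1], fields[2:]


def inverted_entries(text):
    """Return (short, long) pairs whose 'short' name is the longer one."""
    return [(s, l) for s, l, _ in parse_aliases(text) if len(s) > len(l)]


if __name__ == "__main__":
    for short, long_name in inverted_entries(SAMPLE):
        print(f"{short} (first field) is longer than {long_name} (second field)")
```

On the sample above, only the two cjk lines are reported, which is exactly the pattern Karl noticed: the field that is supposed to be "short(er)" holds the prefixed label, and the field that is supposed to be "official(er)" holds the original Unihan tag.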
Personally, I would have chosen "e", which is "all of the above", once the response from Ken is made "d".
Or maybe, to keep the same question from being asked again and again, I might change the original text so that its descriptions are true "except for CJK". Because "Unicode except for CJK" is like saying "quarters liveable except for no oxygen"....