Every character has a story #32: There are CJK Compatibility Ideographs, and then there are CJK Compatibility Ideographs

by Michael S. Kaplan, published on 2010/11/17 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/11/17/10092336.aspx

Now the line from Finding Forrester is what ran through my head. The one near the end. Earlier, Jamal had missed two free throws despite having a nearly perfect record for free throws, mainly to prove that he was not going to "dance" for the bigwigs on the board. After the climactic scene, Forrester asks him "So, those free throws....did you miss them, or did you miss them?"

When I read the line that became the title of this blog, I knew I had to write this blog....

You see, Unicode has a history.

The whole Every Character Has a Story category is all about the colorful histories of some of the characters in Unicode. And what real characters they can be.

The experts in Unicode are also "characters" in another sense, and perhaps they deserve a new series of biographies called Every Unicode Expert is a Character. But maybe that can be a subject for another day.

Anyway, the other day a complaint, made while looking at the current state of a particular bit of the standard without the benefit of history, prompted Ken Whistler to give us the real story of compatibility characters in Unicode. It was some great info, and I asked for (and received) permission to extract this information and write about it here.

So without further ado....

> >> FA47 is a "compatibility character", and would have a compatibility
> >> mapping.
> > Faulty syllogism.
> Formally correct answer but only because of something of a design
> flaw in Unicode. When the type of mapping was decided on, people
> didn't fully expect that NFC might become widely used/enforced,
> making these distinctions appear wherever text is normalized in a
> distributed architecture.

O.k., I'm gonna have to intervene again. *hehe* Yes, there is a design flaw here, but Asmus' explanation is also somewhat faulty, because it flattens out the history in a way that is liable to be misunderstood.

There is a *reason* why "when the type of mapping was decided on" that "people didn't fully expect that NFC might become widely used/enforced" -- but it wasn't that they were goofing up in understanding the implications of normalization. Rather, at that point in Unicode history NFC didn't *exist* yet, nor had the normalization algorithm been designed.

Here, for the benefit of the standards geeks out there, are the relevant highlights of the historical timeline involved.

June, 1992

  The canonical mappings for the CJK Compatibility characters were *printed* (with off-by-one errors for some of them!) in Unicode 1.0, volume 2 (= Unicode 1.0.1).
  Actually, at the time, we didn't know they were "canonical" mappings, because that concept hadn't formally been invented yet, but the intention was clear. They were the mappings from the "CJK compatibility ideographs" to the "real" unified Han ideographs in the standard. The CJK compatibility characters were all considered to be duplicates in the source standards that didn't follow the unification rules.
July, 1996

  The formal definitions of "canonical decomposition" and "compatibility decomposition" were first published in Unicode 2.0. There wasn't a data file for the CJK Compatibility Ideographs block, but the canonical mappings were *printed* (correctly, this time) on pp. 7-470 to 7-472 of the standard.
August 4, 1998

  The first published version of UnicodeData.txt that contained the canonical mappings for the CJK Compatibility Ideographs was UnicodeData-2.1.5.txt for Unicode 2.1.5. (Actually, they got into UnicodeData-2.1.4.txt on July 9, 1998, but that wasn't a published version of the data file.)
July 23, 1999

  This was the publication date of the first approved version of UAX #15 (Revision 15), and so is the first published definition of NFC. (Of course UAX #15 had been in draft for some time earlier than that, so the term "NFC" can be tracked back in the drafts to mid-1998.)
September, 1999

  Release of Unicode 3.0 -- the first release of Unicode formally tied to the Unicode Normalization Algorithm. (The revision of UAX #15 for the release was actually Revision 18, dated November 11, 1999.)
March 23, 2001

  UAX #15, Version 3.1.0. This was the version of the Unicode Normalization Algorithm that specified the composition version to be Version 3.1.0 and locked down normalization forever more.
  So essentially, there was a 9-year period between when the first mappings were defined for the CJK Compatibility Ideographs and the date beyond which it became impossible to reinterpret or change a canonical mapping because of the lockdown of normalization.

  The problems resulting from the normalization for CJK Compatibility Ideographs only started to become visible to people *after* the lockdown, and when Unicode normalization started to become a regular feature of actual processing.

  And it wasn't because "people didn't fully expect that NFC might become widely used/enforced" -- or at least not the people in the UTC. The UAX #15 text published with Unicode 3.0 already stated: "The W3C Character Model for the World Wide Web requires the use of Normalization Form C for XML and related standards..."

  And it wasn't because of some oversight about the canonical mappings involving the CJK Compatibility Ideographs per se. That same UAX #15 for Unicode 3.0 also stated: "With *all* normalization forms singleton characters (those with singleton canonical mappings) are replaced." So the ground facts for the FA10 --> (NFC/NFD/NFKC/NFKD) 585C normalization pattern were well-established and explicitly stated in 1999.

> > FA47 is a CJK Compatibility character, which means it was encoded
> > for compatibility purposes -- in this case to cover the round-trip
> > mapping needed for JIS X 0213.
> > However, it has a *canonical* decomposition mapping to U+6F22.
> And that, of course, destroys the desired "round-trip" behavior if it is
> inadvertently applied while the data are encoded in Unicode. Hence the
> need to recreate a solution to the issue of variant forms with a different
> mechanism, the ideographic variation sequence (and corresponding
> database).

  Yes, that is basically correct. But, this architectural "design flaw" actually results from two additional requirements that accrued to the Unicode Standard well after its initial design:

1. The requirement to be able to carry "round-trip" behavior through distributed environments.

  In the original design, the notion of how one would deal with legacy data was conceived of primarily as a controlled and contained conversion issue. An application/system would convert legacy data to Unicode, and if it needed to convert back, it could use compatibility characters for round-trip conversion. The system would know how and when it could normalize, because it controlled the data and the conversion.

2. The requirement to be able to maintain CJK variant glyph distinctions in plain text data.

  Again, that was not at all a part of the original Unicode Standard design.

  So the essential nature of the problem is that these new requirements have mostly accrued to Unicode implementations *after* 2001, more or less at the point when the lockdown of Unicode normalization made it impossible for normalization to be adjusted in any way to account for them.

  Hence the need to construct an *alternative* approach involving variation selectors, which would be robust and invariant under normalization transformations.
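The difference in robustness can be sketched in a few lines of Python with the standard `unicodedata` module. This is only an illustration: the helper name `survives_normalization` is made up here, and U+E0100 (VARIATION SELECTOR-17) is picked as a sample selector without any claim that this exact sequence is registered in the IVD.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def survives_normalization(text: str) -> bool:
    """Return True if text is unchanged by all four normalization forms."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)

# The compatibility ideograph is replaced by its singleton canonical mapping.
compat = "\uFA47"

# Hypothetical sample sequence: unified ideograph plus a variation selector.
# (Illustrative only; not asserted to be a registered IVD sequence.)
sequence = "\u6F22\U000E0100"

print("U+FA47 survives:", survives_normalization(compat))
print("U+6F22 + VS17 survives:", survives_normalization(sequence))
```

The compatibility ideograph does not come through intact, while the base character plus selector does, because variation selectors have no decomposition and do not combine with the base under composition.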

> > The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47.
> > Easily verified, for example, by checking the FA47 entry in
> > NormalizationTest.txt in the UCD.
> While correct, it's something that remains a bit of a gotcha.

  Yeah, well, the basic gotcha is that no matter how many times I say it or what the Unicode Standard says, people will continue to just assume "compatibility character" implies "compatibility decomposition". For everybody on the list, I recommend frequent re-reading of Section 2.3, Compatibility Characters, of the standard:


whenever somebody mentions "compatibility" in discussion of Unicode. Yes, I suspect that people will find their heads hurting -- but this subject *is* complex, and generalizations that people make about "compatibility characters" are often wrong when they don't pay attention to the details.
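One quick way to check the assumption rather than trust it is to look at the decomposition field itself. The sketch below uses Python's standard `unicodedata.decomposition`, which returns the raw mapping string, with a tag such as `<compat>` in front when the mapping is a compatibility one. The helper name `decomposition_kind` and the choice of U+FB01 as a contrasting example are assumptions made for illustration.

```python
import unicodedata

def decomposition_kind(ch: str) -> str:
    """Classify a character's decomposition as none, canonical, or compatibility."""
    raw = unicodedata.decomposition(ch)
    if not raw:
        return "none"
    # Compatibility mappings carry a leading tag like "<compat>" or "<font>".
    return "compatibility" if raw.startswith("<") else "canonical"

samples = {
    "\uFA47": "CJK compatibility ideograph",
    "\uFB01": "Latin small ligature fi",
    "\u6F22": "unified ideograph",
}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} ({label}): {decomposition_kind(ch)}")
```

The CJK compatibility ideograph reports a canonical mapping, while the ligature, which is also informally a "compatibility character", reports a compatibility one.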

> Especially now that Unicode has charts that go to great
> length showing the different glyphs for these characters,

  Well, even there the issue is complicated, because there are CJK Compatibility Ideographs, and then there are CJK Compatibility Ideographs. They fall into at least 3 important classes:

  1. Ones which really are *unified* ideographs, despite their names.
  2. Ones which are *pronunciation* variants from KS X 1001, and which are *not* intended to show different glyphs.
  3. Ones which are *graphical* variants from other legacy standards, and which *are* intended to show different glyphs.

  And even class 3 has subtypes, because some show variants that are distinguished only in one legacy standard, whereas some are themselves cross-mapped between more than one legacy standard -- putatively because each legacy standard shows the same variant glyph.

  It is class 3 that may be adversely affected *visually* by the application of normalization in a distributed environment.
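Class 1 is the only one of the three that can be picked out mechanically from the character database, since those characters have no decomposition at all. A rough sketch with Python's standard `unicodedata` module follows; the block range U+F900..U+FAFF and the variable names are assumptions for illustration, and telling class 2 from class 3 would need the source-standard information, which this sketch does not attempt.

```python
import unicodedata

# CJK Compatibility Ideographs block (assumed range U+F900..U+FAFF).
BLOCK = range(0xF900, 0xFB00)

# Assigned characters in the block that have no decomposition mapping:
# these behave like unified ideographs despite their names (class 1).
unified_despite_name = [
    cp for cp in BLOCK
    if unicodedata.category(chr(cp)) != "Cn"       # skip unassigned code points
    and not unicodedata.decomposition(chr(cp))     # no canonical mapping at all
]

print([f"U+{cp:04X}" for cp in unified_despite_name])
```

Everything else assigned in the block carries a singleton canonical mapping and so is replaced under normalization, whichever of the other two classes it belongs to.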

> I would suggest adding a note to the charts that make clear that these
> distinctions are *removed* anytime the text is normalized, which, in a
> distributed architecture may happen anytime.

  The CJK Compatibility Ideographs already have warnings attached to them in the standard. They are repeatedly documented as "only for round-trip compatibility with XYZ" and "They should not be used for any other purpose."

  However, I think your point is a valid one. Now that the clear answer for maintaining legacy CJK glyph variant distinctions in a distributed environment is via ideographic variation sequences as registered in the IVD, it would make sense to beef up the CJK Compatibility Ideograph documentation with better pointers (and with accompanying rationale text) to UTS #37 and the IVD, and to post stronger warning labels in the code charts for CJK Compatibility Ideographs.

  Perhaps someone would like to make a detailed proposal to the UTC for how to fix the text and charts? ;-)

Well, I won't go that far.

But I will capture the conversation so people can learn something about the meaning of compatibility characters in Unicode.

Andrew West on 18 Nov 2010 1:44 AM:

This great information didn't happen to come from a certain list that "is really best to avoid" by any chance?

Michael S. Kaplan on 18 Nov 2010 6:13 AM:

Yes -- but that doesn't change the signal-to-noise ratio there. :-p
