by Michael S. Kaplan, published on 2005/03/18 04:36 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/03/18/398489.aspx
Richard Cook, while talking about identical-looking ideographs that were added to CJK Unified Ideographs Extension B (13MB) as a part of Unicode 3.1, said:
For an example of even closer graphical variants (some might even say *exactly* identical forms), compare [U+20a37] and [U+200ae] ... which I mentioned to Mr. Jenkins a few weeks ago. As he pointed out, they both have T-source numbers, and were perhaps deunified because they're separate in CNS 11643 ...
They do indeed have separate T-source numbers. The term "T-source" refers to the kIRG_TSource field in Unihan.txt, which is described as follows:
################################################################################
#
# kIRG_TSource
# Status Normative
# Category IRG Mappings
# CompletionLevel Complete
# Separator N/A
# Syntax [1-7F]-[0-9A-F]{4}
#
# The IRG "T" source mapping for this character in hex. The IRG "T"
# source consists of data from the following national standards
# and lists from Taiwan.
#
# T1 CNS 11643-1992, plane 1
# T2 CNS 11643-1992, plane 2
# T3 CNS 11643-1992, plane 3 (with some additional characters)
# T4 CNS 11643-1992, plane 4
# T5 CNS 11643-1992, plane 5
# T6 CNS 11643-1992, plane 6
# T7 CNS 11643-1992, plane 7
# TF CNS 11643-1992, plane 15
#
################################################################################
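As a rough illustration of how a value in this field could be read, here is a minimal Python sketch. The regular expression is copied from the Syntax line above, and the plane numbers come from the table (with "F" standing for plane 15). The function name `parse_t_source` and the sample value are made up for illustration; they are not taken from Unihan.txt.

```python
import re

# Pattern copied from the "Syntax" line of the kIRG_TSource description.
T_SOURCE = re.compile(r"([1-7F])-([0-9A-F]{4})")


def parse_t_source(value):
    """Return (plane, code) for a kIRG_TSource value, or None if malformed.

    parse_t_source is a hypothetical helper, not part of any Unicode tool.
    """
    match = T_SOURCE.fullmatch(value)
    if match is None:
        return None
    prefix, code = match.groups()
    # "F" denotes plane 15 in the table; digits 1-7 are the plane number itself.
    plane = 15 if prefix == "F" else int(prefix)
    return plane, int(code, 16)


# Illustrative value only; not the real T-source of any particular character.
print(parse_t_source("3-2121"))
```

Two characters with different results from a function like this have separate T-source numbers, which is the situation Richard describes.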
Ken Whistler jumped in to explain the choice of characters in Richard Cook's example:
For those who find themselves mystified by Richard's well-picked example of U+20A37 and U+200AE, here is the explanation.
The U+200AE variant is ordered by the slash radical (cf. U+2F03 KANGXI RADICAL SLASH), which is a downwards brush stroke, from upper right to lower left.
The U+20A37 variant is ordered by the cliff radical (cf. U+2F1A KANGXI RADICAL CLIFF), the first (topmost) stroke of which is a horizontal brush stroke, from left to right.
This distinction is a very fine point of Chinese character structure, and is probably never the basis for a meaningful distinction between characters. However, because of the radical assignments and the way dictionaries are ordered, you can end up with variants showing this distinction, in different locations in dictionaries or character encodings.
Examples showing the ordinary cliff radical can be seen at 5382..53B5, 3542..3554, 20A2D..20AD2.
Examples showing the slash variant of the cliff radical can be seen at 4E55, 20086, 2008B, 20098, 200A2, 200AC, 200AE, 200B0, 200C3..200C5.
But the variant form may also be seen among characters filed under the cliff radical itself: 20A2C, 20ABB. Or the same two strokes may be seen in the upper-left corner of a character filed under a different radical altogether: 8CAD, 8D28.
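To look at the examples Ken lists, it can help to expand his notation into individual code points. The following is a small Python sketch that assumes the "XXXX, YYYY..ZZZZ" notation used above; `expand_code_points` is a hypothetical helper name.

```python
def expand_code_points(spec):
    """Expand a list such as '4E55, 200C3..200C5' into integer code points."""
    result = []
    for item in spec.split(","):
        # "A..B" is an inclusive range; a bare value is a range of one.
        first, _, last = item.strip().partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.extend(range(start, end + 1))
    return result


# The slash-variant examples from Ken's list above.
slash_variants = expand_code_points(
    "4E55, 20086, 2008B, 20098, 200A2, 200AC, 200AE, 200B0, 200C3..200C5"
)
print([f"U+{cp:04X}" for cp in slash_variants])
```

Calling `chr(cp)` on any of these values gives the character itself, though seeing the Extension B ones requires a font that covers that block.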
As Ken indicates, this is an interesting case, which actually helps show how hard it is to catalog ideographs solely on the basis of the radical upon which an ideograph is based (without the context of the meaning or etymology of the ideograph). You can end up with two or more ideographs that by all appearances seem to be completely identical yet have different sources, both within Unicode and within the original national/regional standards that contained the ideograph.
From a purely visual standpoint there is no distinction, and giving them different code points makes as much sense as giving different mappings to the letter "M" as used in the word metropolitan and the word Michael. However, when mapping from CNS 11643, mapping them to two different characters in Unicode is essential, and mapping them to the same character would make as much sense as mapping two entirely different letters together.
It gets to the heart of the importance of interoperability with the various national/regional standards in use by various countries/regions in the world.
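The round-trip concern can be sketched in a few lines of Python. The two source keys below are hypothetical stand-ins for two distinct CNS 11643 positions (the real T-source values are not given here), and which one pairs with which Unicode code point is likewise assumed for illustration.

```python
# Hypothetical stand-ins for two distinct CNS 11643 code positions.
CNS_A = "cns-position-A"
CNS_B = "cns-position-B"

# Separate code points, as Unicode actually has them (pairing is assumed).
separate = {CNS_A: 0x200AE, CNS_B: 0x20A37}
# What a purely visual unification would look like instead.
unified = {CNS_A: 0x200AE, CNS_B: 0x200AE}


def round_trips(to_unicode):
    """Return True if every source position can be recovered from Unicode."""
    back = {}
    for source, code_point in to_unicode.items():
        if code_point in back:
            # Two source positions collided on one code point,
            # so the reverse mapping cannot tell them apart.
            return False
        back[code_point] = source
    return True


print(round_trips(separate))
print(round_trips(unified))
```

With separate code points, text converted from the source standard to Unicode and back comes out unchanged; with the unified table, one of the two source positions is lost on the way back.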
This post brought to you by "⼃" and "⼚" (U+2f03 and U+2f1a, a.k.a. KANGXI RADICAL SLASH and KANGXI RADICAL CLIFF)