Can you name that TUNE?

by Michael S. Kaplan, published on 2006/10/02 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/10/02/779974.aspx


The IRG (Ideographic Rapporteur Group) is an organization that I have spoken about previously in the post For every expert.... As its own homepage states, "It focuses on the development of ideographic characters (Han characters used in China, Japan, Korea and other parts of Asia) in the ISO 10646 standard. Its mission is to submit ideographic characters for inclusion in the ISO 10646 standard."

The most important goals of the IRG are discussed in the Unicode FAQ in answer to the question "Who is responsible for future CJK characters?":

The development and extension of the CJK characters is being done by the Ideographic Rapporteur Group (IRG), which includes official representatives of China, Hong Kong (SAR), Macao (SAR), Singapore, Japan, South Korea, North Korea, Taiwan and Vietnam, plus a representative from the Unicode consortium. For more information, see the IRG home page.

The IRG is very carefully cataloging, reviewing, and assessing CJK characters for inclusion into the standard. The only real limitation on the number of CJK characters in the standard is the ability of this group to process them, because the characters are increasingly obscure (no person — living or deceased — knows more than a fraction of the set already encoded).

What is also underscored both in the FAQ and in my previous post are the complex issues related to both the huge number of characters and the intricate rules governing principles such as Han unification and source separation, which led to the need for a group such as the Ideographic Rapporteur Group in the first place.

We are talking about a script used by multiple languages and by hundreds of millions of people across several continents, one that calls for not only the over 70,000 ideographs currently encoded but also additional ideographs contained in many historical documents.

Han ideographs are therefore quite complicated in terms of both the principles used to determine what to encode and the huge number of actual Han characters that need review to decide what must be encoded.

Which, believe it or not, brings me back to Tamil....

There are some who are working on TANE (discussed previously) who believe that treating Tamil as a complex script is incorrect, and that therefore the encoding of Tamil in Unicode suffers from an important design flaw.

Of course, "complex script" is not really a definition in Unicode; it is a term used by Microsoft that relates to text processing, as I discussed in Keeping it simple, with complex scripts. The definition that most TANE supporters object to is actually the linguistic definition of Tamil as an abugida, or more importantly such a definition guiding the encoding (as discussed several times in the past). Though as that last point notes, the encoding only needs to be based on principles that are descriptive of the script; one does not have to believe that the encoding is based on the linguistic principles to which one subscribes.

In other words, the notion that the Tamil encoding is insufficient on such grounds is incorrect.
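To make concrete what an abugida-style encoding of Tamil looks like in practice, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` and the two sample syllables are my own choices for illustration, not anything taken from the TUNE/TANE proposals. It shows that a vowel sign is stored after its consonant even when it is drawn to the consonant's left, and that a two-part vowel sign canonically decomposes into a left part and a right part.

```python
import unicodedata

def describe(text):
    """Return (code point, character name) pairs in stored (logical) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The syllable "ke": the vowel sign is drawn to the LEFT of the consonant,
# but it is stored AFTER the consonant in the encoded text.
ke = "\u0b95\u0bc6"

# The syllable "ko": one vowel sign with a piece on each side of the consonant.
ko = "\u0b95\u0bca"

# Canonical decomposition splits that vowel sign into its two parts.
ko_decomposed = unicodedata.normalize("NFD", ko)

for label, text in (("ke", ke), ("ko", ko), ("ko (NFD)", ko_decomposed)):
    print(label, describe(text))
```

A visual-order encoding would instead store the left-hand vowel piece before the consonant, which is the kind of difference the disagreement is about.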

It gets slightly more interesting when the additional claim is made that Tamil needs its own IRG (I suppose they mean a TRG). Such a group would never be needed anyway, given that there are not thousands of unencoded Tamil letters waiting to be encoded in Unicode. It certainly would not be needed for a script that is being called a simple script, one that the TANE supporters would rather see encoded in visual order than as an abugida.

Though I suspect that the desire for a TRG is inspired by the belief that if one were formed from the various cited industrial/NB interests (TVU/INFITT/KTS/TSCII/etc.), its encoding requests would simply be granted without comment from Unicode or ISO/IEC JTC1/SC2/WG2 (the organization responsible for ISO 10646).

Which is of course not true, and not how the IRG works. Were there ever a need for a TRG that was communicated, it would work on specific principles and not on a global "re-encode as desired" platform....

There is an additional point made by Andrew West in a comment to the TUNE/TANE issue that is particularly relevant to the arguments that have been made by them:

"That Chinese could get 27000 characters when there govt. put there foot down"

What these people may not realise is that in recent years the Chinese government has tried very hard to get nearly a thousand precomposed Tibetan "brdarten" syllables encoded into ISO/IEC 10646 (see N2558, N2621 and N2624), in order to change the encoding model of Tibetan (this is exactly analogous to the Tamil situation); but Unicode and other national bodies stood firm, and they failed. The Chinese government has since been forced to implement their alternative syllabic encoding model in the PUA on Planes 0 and 15 (actually it is more complicated than that, as the government specifies two implementation levels -- Level 1 supporting the PUA precomposed syllables only, and Level 2 supporting PUA precomposed syllables and standard combining Tibetan). I believe that the Tibetan case provides a strong precedent for not accepting the TUNE/TANE re-encoding of Tamil.
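For readers unfamiliar with where the PUA (Private Use Area) lives, the following small Python sketch classifies a character by the private use range it falls in. The ranges are the ones defined by Unicode; the function name `pua_plane` and the labels are hypothetical names chosen for this illustration, and nothing here reflects the actual code point assignments of the Chinese Tibetan standard.

```python
# Private use ranges defined by Unicode (inclusive bounds).
PUA_RANGES = {
    "Plane 0 (BMP)": (0xE000, 0xF8FF),
    "Plane 15": (0xF0000, 0xFFFFD),
    "Plane 16": (0x100000, 0x10FFFD),
}

def pua_plane(ch):
    """Return the label of the private use range containing ch, or None."""
    cp = ord(ch)
    for label, (low, high) in PUA_RANGES.items():
        if low <= cp <= high:
            return label
    return None

print(pua_plane("\ue000"))       # a BMP private use code point
print(pua_plane("\U000f0000"))   # a Plane 15 private use code point
print(pua_plane("\u0f40"))       # TIBETAN LETTER KA, a standard character
```

Text stored in these ranges has no meaning assigned by the standard, so it only interoperates between parties that share the same private agreement.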

One would hope that this message would be heard by those working on TUNE/TANE. Neither Korea nor China (nor anyone else) gets what it wants if what it wants does not follow Unicode/ISO 10646 stability guidelines. Neither Tamil Nadu nor India will be given preferential treatment in this regard....

 

This post brought to you by ர (U+0bb0, a.k.a. TAMIL LETTER RA)



referenced by

2006/11/04 On Thokks who don't give a Frigg, under the mistletoe
