For every expert...

by Michael S. Kaplan, published on 2005/05/22 19:09 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/22/420879.aspx


How does the old saying go? For every expert there is an equal and opposite expert.

Never did that anonymous quote seem more accurate to me than at my first UTC (Unicode Technical Committee) meeting. Peter Vogel and I were having a conversation about it once (in Amsterdam, I think -- we were both speaking at CttM, if memory serves). He listened to me talk about how Unicode had quarterly meetings lasting four days each and he was astounded -- "they're just characters, right? It's not like this is nuclear physics," he said. Oh Peter, if only it were so simple....

(And incidentally, at least one of the principal participants has a PhD in nuclear physics, ironically enough given Peter's words!)

Anyway, yesterday I was talking about CYRILLIC LETTER UK in the never-ending series to help prove that Every Unicode Character Has a Story, and Larry Osterman commented:

A theme I've noticed in these "Every character has a story" posts:

None of the writers seem to be native speakers of the language (based solely on the unscientific basis of looking at their apparently western names). But all of them seem to be able to expound authoritatively about the experience that the native speakers of the language expect.

How is it that the participants in the Unicode Consortium's forums can be so confident that they know what a native speaker of a language expects to find?

I know that Mark Crispin and I have had some interesting discussions about this exact issue - Mark used to feel rather strongly about the values in some of the codepoints for the Chinese and Japanese character sets that were collapsed because he was told that native speakers were quite incensed about the collapse. And then he actually went and asked the native speakers (he speaks Japanese fluently (and I believe Mandarin as well)) and found some rather different opinions.

A fair observation, Larry -- one that I think is worth a few words. And your "unscientific analysis" is accurate for many of the posts I have done (though some of them did indeed include comments from native speakers). :-)

But of the rest, many of them (e.g. Michael Everson) actually work extensively with native speakers and experts, translating their knowledge and expertise into the character proposal form that has to be filled out for Unicode and ISO 10646's WG2. And many others are experts on the technical aspects of character encoding, whether their original expertise is in linguistics (e.g. Ken Whistler) or typography (e.g. Kamal Mansour) or really just about any aspect of character encoding that comes into play. Still others are experts in specific modern scripts, or in the usage of scripts in historical contexts. And anyone can join the Unicode List itself -- which occasionally leads to the kinds of silliness that I have posted about here in the past. :-)

The proposals themselves do come from native speakers of languages that use the script to be encoded (or, for dead languages, from the academic experts looking to see them encoded), either directly or through representatives like Michael Everson. A lot of the time, the quotes I am giving in these threads are due directly to comments from native speakers on usage, or, in the case of the old-timers, drawn from actual memory of the reasoning behind decisions that were made in meetings over the last decade....

So it really is not a case of a bunch of California tech companies dictating language policy, as some would like to paint the picture; earnest work to discover how best to encode, given the needs of native speakers, is a crucial piece of the mix.

The specific case that Mark Crispin was likely referring to is what is known as Han unification. The bulk of the actual work is done by the Ideographic Rapporteur Group, a group that works under ISO 10646 and has members from many places, including mainland China, Hong Kong, Macao, Taiwan (via the Taipei Computer Association), Singapore, Japan, South Korea, North Korea, Vietnam, and the US. Unicode also likes to send someone to help with the coordination between ISO 10646 and Unicode, usually John Jenkins (who, as his speaking bios indicate, in addition to being involved with the development of the Unihan database since 1991, is also incidentally one of the world's few experts on the Deseret alphabet).

The principles of Han unification are based on the fact that Unicode is designed to encode characters, not glyphs. A good discussion can be found in the Wikipedia article on Han unification, which I will quote a little bit from:

The process of Han Unification was controversial, with most of the opposition coming from Japan. Opponents of Han unification state that it steamrolls over thousands of years of cultural tradition, misses many of the subtleties that are one of the most important features of these languages, and renders serious literature and academic research in these languages impossible. Proponents of Han unification point out that the unification process is in the hands of specialists from China, Korea, and Japan, and that the objections to unification of specific characters are made without regard to their histories. Characters which some Japanese today consider completely distinct were historically the same, and were taught as the same in Japanese schools until the 1950s. As for historical research, Unicode now encodes far more characters than any other standard, and far more than were listed in any dictionary, with many more being processed for inclusion as fast as the scholars can agree on their identities.

Some characters used only in names are not included in Unicode. This is not a form of cultural imperialism, as is sometimes feared. These characters are generally not included in their national character sets either.

It is also worth noting the Source Separation Rule, which was definitely in force during the period when people were most vocal in their objections to Han unification. The rule basically states that graphemes kept distinct in national character code standards are added to Unicode explicitly, even where they could be composed of characters already available.
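The "characters, not glyphs" principle behind all of this can be seen directly from Python's standard library. A minimal sketch: a unified ideograph is one codepoint with one name no matter which language's typography renders it, while the CJK Compatibility Ideographs (kept distinct only for round-tripping with legacy national standards, per the rule above) fold back to the unified character under canonical normalization.

```python
import unicodedata

# U+4E2D is a single *character*, even though Chinese and Japanese
# typography may draw it with subtly different *glyphs*. The
# language-specific appearance is a font concern, not an encoding one.
ch = "\u4E2D"
print(unicodedata.name(ch))   # CJK UNIFIED IDEOGRAPH-4E2D
print(f"U+{ord(ch):04X}")     # U+4E2D

# By contrast, a CJK Compatibility Ideograph exists only for legacy
# round-trip fidelity; canonical normalization (NFC) maps it back to
# the unified character it duplicates.
compat = "\uF900"  # CJK COMPATIBILITY IDEOGRAPH-F900
unified = unicodedata.normalize("NFC", compat)
print(f"U+{ord(unified):04X}")  # U+8C48
```

The normalization behavior is exactly why compatibility ideographs are unsafe as "distinct" characters in practice: any NFC-normalizing pipeline silently replaces them.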

The Wikipedia article also discusses much of the contrary argument, but you can look at it there if you like. :-)

The full data on the ideographs that are in Unicode is provided in Unihan.txt, a file in the Unicode Character Database, which can be queried online from the following page. A full description of the fields can be found in Unihan.html, including (among other things) the source information from which each character's inclusion was derived: the kIRG_*Source fields (replace the * with G, H, J, KP, K, T, U, or V for the PRC/Singapore, Hong Kong, Japan, North Korea, South Korea, Taiwan, Unicode compatibility, or Vietnam standards, respectively).
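Unihan.txt is just tab-separated records (codepoint, field name, value; `#` lines are comments), so pulling out the kIRG_*Source fields takes only a few lines. A minimal sketch, using sample records that illustrate the file's shape (the field values here are illustrative, not quoted from the database):

```python
# Sample lines in the Unihan.txt format: U+XXXX <tab> field <tab> value.
SAMPLE = """\
# comment lines start with '#'
U+4E2D\tkIRG_GSource\tG0-5650
U+4E2D\tkIRG_JSource\tJ0-4366
U+4E2D\tkIRG_KSource\tK0-7169
U+4E2D\tkMandarin\tzh\u014dng
"""

def irg_sources(text):
    """Map each codepoint to a {source-letter: value} dict,
    keeping only the kIRG_*Source fields."""
    result = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cp, field, value = line.split("\t", 2)
        if field.startswith("kIRG_") and field.endswith("Source"):
            letter = field[len("kIRG_"):-len("Source")]  # G, J, KP, ...
            result.setdefault(cp, {})[letter] = value
    return result

print(irg_sources(SAMPLE)["U+4E2D"])
# {'G': 'G0-5650', 'J': 'J0-4366', 'K': 'K0-7169'}
```

Each letter keyed in the result tells you which national standard(s) contributed that ideograph, which is exactly the provenance information Unihan.html documents.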

I myself act as a liaison between Unicode and INFITT (the International Forum for Information Technology in Tamil), although this role is really just a miniature version of what Michael Everson does for languages and scripts all over the world -- I help add the information needed to produce solid proposals, so that the native Tamil speakers in Tamil Nadu, Malaysia, Singapore, Sri Lanka, and elsewhere can bring their expertise and have it translated into working proposals that can be accepted by Unicode.

Interestingly, on at least one occasion I was accused by a native speaker of Tamil of being a "shill" or plant provided by Unicode to change the agenda of INFITT and to sabotage its efforts. Luckily no one else supported this view (and it was repudiated by the INFITT leadership). In truth, thus far we are "batting 1.000," in that every proposal I have been asked to present by INFITT's Executive Committee that has been supported by the majority of INFITT's WG02 has been accepted by Unicode. So although I would never think of myself as a "native speaker," I have, I think, been able to act as a good liaison to and from them, and a good advocate for them. Which is definitely my intent. :-)


# Larry Osterman [MSFT] on 23 May 2005 1:36 AM:

You're right, Mark's problems were with the Han Unification issue. As I understand it, he used to spend a fair amount of time railing against it, until he spent time speaking to a wider group of Japanese speakers, when he realized that it wasn't as dreadful as it had been portrayed.

# Michael S. Kaplan on 23 May 2005 1:45 AM:

Well, I must admit that Han unification was pretty controversial. The original ISO plan for an international character standard was along the lines of an ISO-2022 style "shift in/shift out" plan to move between the scripts. I am *very* glad that such a plan never made it to completion!

referenced by

2006/10/02 Can you name that TUNE?

2006/04/13 Unicode isn't advanced mathematics

2005/06/18 Font substitution and linking #3
