A syllabary does not need to be encoded as one

by Michael S. Kaplan, published on 2005/06/13 22:41 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/06/13/428755.aspx

In abecedaria, Suzanne had a post last week entitled The Tamil Syllabary chez Diderot. It was a very interesting post that talked about how a syllabarium is the way that (for example) young people might well learn to read and write Tamil. At the end of it she mentions:

Unicode, however, is oblivious to Diderot and Taylor. Tamil is encoded much like the other Indic scripts, as an abugida, where the primary form of the consonant is considered to be the form which includes the inherent a vowel. It should not be surprising that there have been requests to reencode Tamil, this time as either an alphabet or a syllabary but not an abugida.
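
To make the abugida model in that quote concrete, here is a minimal Python sketch (not anything from Suzanne's post or from the Unicode Standard itself) that lists the code points behind three forms of the Tamil letter KA and then groups a word into syllable-sized units. The helper names `code_points` and `syllable_units` are made up for illustration, and the grouping rule is a deliberate simplification: it attaches every combining mark to the character before it, which is not full grapheme cluster segmentation.

```python
import unicodedata


def code_points(text):
    """Return (U+XXXX, character name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def syllable_units(text):
    """Group each base character with the combining marks that follow it.

    Simplification (an assumption of this sketch): any code point whose
    general category starts with 'M' (vowel signs, the pulli) is attached
    to the preceding character.
    """
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


ka = "\u0B95"          # consonant KA, carrying the inherent "a" vowel
kaa = "\u0B95\u0BBE"   # KA followed by the vowel sign AA
k = "\u0B95\u0BCD"     # KA followed by the pulli, which removes the vowel

for s in (ka, kaa, k):
    print(s, code_points(s))

# The word "Tamil": five code points, grouped into three units
tamil = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"
print(syllable_units(tamil))
```

The syllables a reader learns can be recovered on top of the existing encoding, which is the sense in which a syllabary does not need to be encoded as one.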

I wanted to talk about some of the problems with what some people might want here. :-)

There are many good reasons for so many of the Indic scripts to be encoded along the lines they were, especially when one considers the way that Sanskrit can at times need to be represented with any of them. It is not surprising that people want to re-encode Tamil, any more than it should be surprising that a Swedish user might wish to have U+00e5 (å) moved to be after the letter "Z" to make collation easier. But that is because in both cases the user is naively assuming that the encoding philosophy and order should be modified to suit a particular scenario: a scenario that Unicode was not designed to handle, and where making such changes would violate one or more of the fundamental stability guarantees that make Unicode a standard that so many implementers can rely on.
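
The Swedish case can be sketched in a few lines of Python: the sort order lives in a collation key, not in the code point values, so nothing has to move. This is only an illustration under loud assumptions. `SWEDISH_ALPHABET` and `swedish_key` are hypothetical names, and the key handles just the 29 basic lowercase letters, which is far less than a real Swedish collation tailoring does.

```python
# Assumed, simplified alphabet order: a-z followed by å, ä, ö
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word):
    """Map each letter to its position in the assumed alphabet order.

    Characters outside the alphabet sort after it, by code point.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]


words = ["zebra", "\u00e5l", "apa", "\u00f6ra", "\u00e4pple"]  # ål, öra, äpple

by_code_point = sorted(words)               # ä (U+00E4) lands before å (U+00E5)
by_tailoring = sorted(words, key=swedish_key)  # å before ä, as Swedish expects

print(by_code_point)
print(by_tailoring)
```

The encoded text is identical in both sorts; only the comparison changes.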

All of the Dravidian scripts are in a similar bind, where a syllabarium might seem like a superior encoding strategy. But the truth is that superior in this case would mean that it would take longer for platforms to implement it (look how much longer it took to support Sinhala, and how much longer it is taking to see good implementations of Mongolian, Tibetan, and Khmer). One of the many strengths for Malayalam, Telugu, and Tamil is that, because they use a model similar to so many other scripts of South Asia, support was much easier to make happen. And it was much quicker to market.

Look at it another way -- it is much harder to implement Thai and Lao with their 'visual' encoding scheme, especially when it comes to operations like collation. A logical ordering would have made it much easier for everyone to write implementations.
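
Here is a rough Python sketch of the extra work that visual order pushes onto collation. Thai and Lao store five vowels in front of the consonant they are pronounced after, so a sort has to swap each such pair before comparing. The names `LEADING_VOWELS` and `logical_key` are invented for this example, and comparing the rearranged strings by code point is a stand-in for real collation weights, not an implementation of any standard algorithm.

```python
# Vowels that are stored before the consonant they follow in speech
THAI_LEADING_VOWELS = set("\u0E40\u0E41\u0E42\u0E43\u0E44")
LAO_LEADING_VOWELS = set("\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4")
LEADING_VOWELS = THAI_LEADING_VOWELS | LAO_LEADING_VOWELS


def logical_key(word):
    """Swap each leading vowel with the character after it.

    The result is compared by code point, which is a simplification
    made for this sketch.
    """
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


word_under_ko_kai = "\u0E40\u0E01"     # SARA E + KO KAI, filed under KO KAI
word_under_kho_khai = "\u0E02\u0E32"   # KHO KHAI + SARA AA
words = [word_under_kho_khai, word_under_ko_kai]

naive = sorted(words)                     # the leading vowel decides the order
rearranged = sorted(words, key=logical_key)  # the consonant decides the order

print(naive)
print(rearranged)
```

With a logical encoding the swap step would not exist at all, which is the implementation cost being described here.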

Coulda, woulda, shoulda -- honestly the fewer innovations in this space, the easier it is to see implementations appear!

And I know whereof I speak here. I have seen the impact on a native speaker of a language to see that language supported on Windows. If you tell that user that they must wait for a year or five or even ten years, then the impact is precisely the opposite. Sometimes it can be devastating.

Today, the Tamil community is split into two basic groups -- those who are okay with how Tamil is encoded in Unicode, and those who want to see it re-encoded as a syllabary. The fact that such a re-encoding violates some of the fundamental stability guarantees in both Unicode and ISO 10646 does not change that opinion. And why would someone object to a scheme that may well match the way they learned to read and write Tamil? It is the most sensible thing in the world for a person to want! Especially if they do not have the burdens of all of the issues I talked about earlier in this post to worry about.

Luckily for all of us, the ones who make the actual decisions are a collection of standardization experts, of typographers, of linguists, of architects, and of developers. They get together and weigh all of the issues, as well as the benefits and costs of each proposal. It keeps Unicode, the fullest and most complex encoding standard ever created, a stable and workable solution for the problem of encoding all of the languages in the world....


This post brought to you by "স" (U+09b8, a.k.a. BENGALI LETTER SA)

# Suzanne McCarthy on 13 Jun 2005 11:12 PM:

Thanks, Mike, Once I saw that Tamil could be keyboarded as a syllabary I wasn't so worried. However, some Tamil wonder why Tamil wasn't encoded as an alphabet with the consonant pulli as the primary form of the consonant. Did you read the Tamil pulli post?

Anyway I know all this was already set up by the ISCII encoding scheme but the discussion by Ganesan on the Unicode mail list last month has been interesting.

# Michael S. Kaplan on 14 Jun 2005 12:01 AM:

Yep, I read the post -- I was going to comment on that one too, sometime soon. :-)

referenced by

2006/11/04 On Thokks who don't give a Frigg, under the mistletoe

2006/10/02 Can you name that TUNE?
