And if your language starts playing a different TUNE

by Michael S. Kaplan, published on 2006/08/31 03:46 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2006/08/31/733302.aspx


Warning to readers: this post is completely and totally my own opinions based on my efforts to assist with Tamil's representation in Unicode, and truly have nothing to do with Microsoft's opinions on the matter (whatever they are). If you quote anything from my words here as being 'According to Microsoft' then be aware that you are a complete moron whose only saving grace is that being a moron is a venial and not a mortal sin. You have been warned!

As I write this post, the lyrics of Roger Waters wash over me:

And if the cloud bursts, thunder in your ear
You shout and no one seems to hear.
And if the band you're in starts playing different tunes
I'll see you on the dark side of the moon.

It is quite ironic that these words, (from the song Brain Damage, on Pink Floyd's Dark Side of the Moon), seem to so easily link to the insanity that I have seen from afar related to Tamil Unicode - New Encoding (TUNE).

You can see the introduction page for Tamil Virtual University's Request For Comments here.

What this standard amounts to is an attempted re-encoding of the Tamil script using Unicode's PUA (Private Use Area) in an attempt to make Tamil into a simple script (rather than a complex one), to build collation support directly into the order of the code points in the encoding, and to encourage ISVs like Adobe to support Tamil.

The fact that this ignores the rules in Unicode related to the re-encoding of scripts that already exist, the fact that collation is never designed as a part of the order of code points in any language (even English!), the fact that INFITT (the INternational Forum for Information Technology in Tamil) and its 'WG02' Unicode working group (of which I am a member along with several native Tamil speakers from around the world) is on record as disagreeing with the bulk of the claims and assertions made by TUNE supporters, the fact that the Unicode Technical Committee is on record as considering many of the fundamental aspects of TUNE to be entirely unsupportable -- all of these things are ignored.
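To make the collation point concrete, here is a minimal Python sketch. It assumes a made-up word list and a toy sort key (`toy_collation_key` is a hypothetical helper, not a real collation algorithm) and only shows that even for English, raw code point order is not the order a reader expects.

```python
# Hypothetical sample data, chosen only to show the effect.
words = ["apple", "Zebra", "banana", "Apple"]

# Plain sorted() compares code points, so every uppercase letter
# (U+0041..U+005A) sorts before every lowercase one (U+0061..U+007A).
by_code_point = sorted(words)


def toy_collation_key(word):
    """Toy key (an assumption for illustration): compare case-insensitively
    first, then fall back to the original string to break ties."""
    return (word.casefold(), word)


# The ordering now comes from the key, not from where the letters
# happen to sit in the code charts.
by_toy_collation = sorted(words, key=toy_collation_key)

print(by_code_point)     # ['Apple', 'Zebra', 'apple', 'banana']
print(by_toy_collation)  # ['Apple', 'apple', 'banana', 'Zebra']
```

Real collation is far richer than this toy key, but the design point is the same: ordering rules live in a layer above the encoding, so they can be changed without moving a single code point.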

After WG02 made its feelings clear on the matter, the TUNE supporters had their own working group created (WG08), and although I am officially the liaison between INFITT and Unicode, I have never been given any communication related to TUNE to present to the Unicode Consortium (I have been told this is due to my obvious bias against TUNE, though no one from WG08 has communicated to Unicode through other means, either).

So yes it is a request for comments, but one in which if the comments are negative, the commenter can expect little more than to be ignored, or dismissed due to bias.

So there are two kinds of people here -- those who agree with TUNE, and those who are wrong.... :-(

Tamil Nadu has had a similar approach to 8-bit standards, where they rejected the TSCII standard that was widely used outside of Tamil Nadu and instead formulated their own TAB/TAM standards. Historically their recent efforts in areas such as encodings and keyboards have not been as well received by members of the Tamil Diaspora as other orthographic changes in the language and script in the last 30-35 years.

Anyway, back to being out of TUNE. :-)

For those who are in Tamil Nadu:

"...TVU is organizing a one day conference for obtaining the public opinion and to deliberate on the comments received. The proposed one day conference will have an inaugural session, a session for open discussion in the forenoon. The conference will be held in the Clive Hall at Taj Coromondal Hotel., Nungambakkam during 9.30 a.m. on 2nd September 2006."

If any of my readers are in Tamil Nadu and would like to attend this one day conference, please let me know what happens (and if you contribute anything be sure not to mention you agree with anything I say, given my bias and all!). Given the step backwards that I truly believe this whole effort represents, I am truly hoping that those in Tamil Nadu and TVU who are championing the new standard can be finally convinced that they are out of TUNE....

The lunatic is in their head
The lunatic is in their head
they raise the blade, they make the change
They rearrange it; it's insane
they lock the door
and throw away the key
there's someone in their head but it's not me

And if bad standards thunder in their ear 
We shout and TVU doesn't seem to hear.
And if the standard they're in starts implementing TUNE 
We'll see them on the dark side of the moon.

This post brought to you by க் (U+0b95 U+0bcd, a.k.a. TAMIL LETTER KA + TAMIL SIGN VIRAMA, a.k.a. TAMIL KA puLLi, a.k.a. TAMIL LETTER K)
(A letter that is separately encoded in TUNE, along with several hundred others)
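For readers who want to check the sign-off letter themselves, a small sketch using Python's standard `unicodedata` module. The variable names are illustrative only; the code points are the ones given above.

```python
import unicodedata

# What a reader sees as a single letter is stored as two code points.
ka_pulli = "\u0b95\u0bcd"

# Look up the formal Unicode character name of each code point.
names = [unicodedata.name(ch) for ch in ka_pulli]
code_points = [f"U+{ord(ch):04X}" for ch in ka_pulli]

for cp, name in zip(code_points, names):
    print(cp, name)
# U+0B95 TAMIL LETTER KA
# U+0BCD TAMIL SIGN VIRAMA
```

In current Unicode the rendering engine combines these two code points into one visible letter, which is the work TUNE proposes to avoid by giving each such combination its own code point.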


# Mike Dimmick on Thursday, August 31, 2006 12:22 PM:

You'd have to argue that this proposal is an artifact of the general poor support for complex scripts. If, say, French were encoded using only base letters and composing diacritics I suspect support would be greater, but of course for legacy reasons that isn't the case.

In some ways it appears - and certainly could appear to a native of Tamil Nadu - that Unicode's solution is idealistic, and this proposal reflects the reality that few developers are going to the trouble of making their software work correctly for south asian scripts. This may be because they're unaware of the issues, or even if they are, simply not interested in the market. The market may be large (Tamil Nadu has a larger population than the UK, where I am) but is not particularly wealthy (GDP of $56bn compared to the UK's $1,833bn, making GDP per capita of $901).

However, we have to invoke Raymond's 'what if everyone did this' rule - if all distinct glyphs in all scripts are encoded with no combining characters, then 16 bits is not enough. The BMP is probably not enough (correct my terminology - I mean UTF-16 with the high and low surrogates, which is enough to encode up to U+10FFFF IIRC).

# Mihai on Thursday, August 31, 2006 2:56 PM:

<<this proposal is an artifact of the general poor support for complex scripts>>
I think in most cases the lack of support is just “because ISVs don’t really care”.
If a script is “by nature” complex, then a new encoding standard only shifts the problem from the text layout engine to the keyboard (for instance). And this will mess up other things, like searching.
It is a bit like the addition of the fi ligature for Latin (U+FB01). If I type ‘f’ then ‘i’ some applications will show it as a ligature using the text shaping engine (Notepad), others will not (Word). But there is no keyboard producing U+FB01, and no application handles properly searching (search for ‘f’ and find “half of U+FB01”), or case conversion (Uppercase(U+FB01) = <U+0046 U+0049>).
So, is TUNE going to solve anything in the area of ISV support? I bet not!
Is current Unicode support for Tamil going to help? Maybe. The OSes are committed to Unicode, there are a lot of complex scripts already supported, and more and more are added every day. If an application does not properly support Hindi today (for instance), it is not because Hindi is complex, but because the application does not use the system API properly.
Just look back to DOS, or Win 3.x, or Win 9x. Each one has its own limitations. Long ago very few ISVs supported anything but Latin 1. Then some added support for other single byte encodings, but no DBCS. Now it is common and easy to do DBCS, and there are dents made in the complex script support. Does it take time? Yes. But is it easier to convince an American ISV to support Unicode, or some proprietary obscure Tamil encoding?
Call me in 30 years if I am wrong. I hope this is enough time :-)
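The U+FB01 example in the comment above can be reproduced with a short Python sketch. The sample word is an assumption chosen for illustration, and the behavior shown is that of Python's built-in string methods, not of any particular application mentioned in the thread.

```python
import unicodedata

# "file" spelled with U+FB01 LATIN SMALL LIGATURE FI instead of "f" + "i".
text = "\ufb01le"

# A naive search for "f" misses the ligature, because it is one code point.
found_raw = "f" in text

# Uppercasing expands the ligature into two letters, so the string grows.
upper = text.upper()

# Compatibility normalization (NFKC) replaces the ligature with plain "fi",
# after which the naive search succeeds.
folded = unicodedata.normalize("NFKC", text)
found_folded = "f" in folded

print(found_raw, upper, folded, found_folded)
# False FILE file True
```

Every consumer of the text has to perform this extra folding step once a presentation form is encoded as its own code point, which is the cost the comment is pointing at.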

# Shriram on Friday, September 01, 2006 2:43 PM:

First of all, this meeting is called for the purpose of a meeting of minds. A forum sanctioned by the highest levels of the Government to find a common ground.

The author is definitely wrong in saying that dissenting voices are ignored. I have been in meetings where TUNE has evoked a lot of emotions. Not all of them in support. All these have been heard and taken into account by the committee. I expect today's session to be stormy too, and it should be.

The TUNE effort is led by academics with well distinguished track records and have maintained the highest levels of transparency.

Why not listen to the deliberations before rushing to judgements?

I am sure that there is potential for the opponents and supporters of this proposal to sit and discuss the way ahead in the future provided we leave our egos and bias at the door.

Also reg the market for tamil, please take into account the Tamil diaspora which is well spread over the world.

Lastly if ever there is a time to go for the TUNE/Mobile keypad standardization route it is now. With a favorable and committed people in key ministries in state and centre the timing is perfect. It just needs a concerted effort

# Michael S. Kaplan on Friday, September 01, 2006 4:47 PM:

Hi Shriram,

Since Unicode's own opinions on the matter are being ignored in the conversation, as are its policies and procedures -- and since these facts, even though communicated directly, have not stopped the TUNE momentum -- I do not have to wait to see what they decide to know that they have no interest in the facts.

I do take into account Tamils around the world -- they are the ones who have rejected TAB/TAM and who have also rejected TUNE. As such, they are in a better position to ensure the future of Tamil....

# Baskaran on Thursday, September 07, 2006 5:43 AM:

Hi Shriram,

Tinkering with a standard (Unicode) is never a good idea. Remember, we (read Indian languages) never had anything called standard either for encoding or keyboard except for something that were in very limited use, such as TAB/TAM, ISCII etc.

One should not be trying to RE-define a new standard, when something has been already accepted worldwide. We are already suffering with lack of standards and the TUNE is just going to make the situation worse.

It is understandable that the code point order is not the same as natural order, but what people don't understand is that the collation order is independent of code chart order. This is because, people take these issues emotionally, while these should be approached with technical points.

Many of the people supporting TUNE are independent software developers and have several tools, including ones for word-processing and fonts. Developing a font for the Tamil Unicode block is difficult, as the rendering engine needs to be intelligent enough to adjust the glyph positioning. Thus, (I believe that) these people think of TUNE as the best alternative, as it eliminates the need for any intelligence, making it easier for them to develop fonts.

# CAPital on Sunday, September 24, 2006 1:46 AM:

When you say:

///This post brought to you by க் (U+0b95 U+0bcd, a.k.a. TAMIL LETTER KA + TAMIL SIGN VIRAMA, a.k.a. TAMIL KA puLLi, a.k.a. TAMIL LETTER K)///

I right away notice that there is NO Tamil sign called VIRAMA!

In fact the Tamil language doesn't add the dot to form the 'ka' sound. The form you have presented is the original form. The letter without the dot is NOT the basic letter. So you can see Unicode has already broken its rules of ONLY encoding the basic characters!

______
CAPital

# Michael S. Kaplan on Sunday, September 24, 2006 1:55 AM:

Actually, no -- that is not how abugidas work. But see this post and this one for more info on the approach that was taken in the encoding, and why it is okay for you to not agree with the approach and still be able to work in Unicode....

# Michael S. Kaplan on Sunday, September 24, 2006 2:01 AM:

See also this post from over a year ago, which makes some additional related points, ones that I wish those who want to set Tamil implementations back by years would pay more attention to....

# CAPital on Sunday, September 24, 2006 2:13 AM:

I am not trying to say that only the ka + dot should have been encoded and not the other.  Then it would be almost impossible to display the other.

All I'm saying is Unicode already broke its rules. So too for other European and East Asian languages.

IF it did NOT break its rules for ANY language, then I wouldn't even say a word.

So it looks like whoever had the power, broke the rules according to their "standardization" policies.

# Michael S. Kaplan on Sunday, September 24, 2006 2:21 AM:

Actually, it did not. It encoded an abugida.

The rules that you envisage are not actually rules of Unicode, which may be the problem here? You are expecting promises to be kept that were never made. :-(

# CAPital Z on Sunday, September 24, 2006 10:15 AM:

In the sets of Latin, there are encoded characters such as à, á, â, ã, å [for the European languages].  Each is encoded as a single character.  Meanwhile, the same glyphs are present separately, like a, ̀, ́, ˆ, ˜, ˚.

So only for Tamil like scripts [South Asian languages] similar encoding is denied.

East Asian languages did encode all their characters.  And you know the Chinese government's famous stand about encoding all Chinese characters.

Most recently, the new language to be included in Unicode, Balinese break the rules of Unicode.  As you said, it did not only encode the "basic abugida" but did others as well.  You may have already seen it in the new Unicode Charts.

Even in Tamil, your point is right that basic abugida is enough to display the most characters.  But the "ku கு, kuu கூ"  [and similar for all other characters] are almost totally different than the basic abugidas.

I'm not saying TUNE is the best thing, but Tamil does lack the efficiency of what other similar languages have in Unicode.


______
CAPital

# Michael S. Kaplan on Sunday, September 24, 2006 10:36 AM:

CAPital, this is hardly breaking the rules of Unicode.

In some cases, there were legacy standards that pre-dated Unicode which had to be represented. And other cases where you see rule breaking, the actual proposals give the justifications (as do the block descriptions in the Unicode book, in many cases).

Did you look at the links I put in? They explain many of the reasons why strategies like TUNE are simply too late and do involve a re-encoding of a script already encoded, and would set back Tamil computing by 5-10 years or more.

Even just looking at your blog (and the many other Tamil blogs out there), it would invalidate all of this existing data!
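The precomposed Latin letters raised in this exchange can be inspected with Python's standard `unicodedata` module. This is only a sketch, with illustrative variable names, of how Unicode normalization treats the two cases; it is not a statement of why any given character was encoded.

```python
import unicodedata

# Latin: a-grave exists both as one precomposed code point (kept for
# compatibility with older standards) and as a base letter plus a
# combining mark. Normalization treats the two spellings as equivalent.
precomposed = "\u00e0"   # LATIN SMALL LETTER A WITH GRAVE
decomposed = "a\u0300"   # "a" + COMBINING GRAVE ACCENT

to_nfc = unicodedata.normalize("NFC", decomposed)
to_nfd = unicodedata.normalize("NFD", precomposed)

# Tamil: "ku" is TAMIL LETTER KA + TAMIL VOWEL SIGN U. There is no
# precomposed code point for it, so NFC leaves the sequence unchanged.
ku = "\u0b95\u0bc1"
ku_nfc = unicodedata.normalize("NFC", ku)

print(to_nfc == precomposed, to_nfd == decomposed, ku_nfc == ku)
# True True True
```

Text that already exists in either Latin spelling stays interoperable through normalization. A re-encoding in the Private Use Area would have no such equivalence with existing Tamil data.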

# CAPital Z on Wednesday, September 27, 2006 9:32 PM:

The East Asian language encoding dilemma and Balinese are not pre-dated problems.

Anyhow, yes whatever written in current unicode will be unreadable.  But that's the evolution right?.  Latin encoding had so many perfections  during the course of Computer.  Tamil is just a baby in computing.  So now the Tamil doesn't get the chance to improve just because, what is already there has to be THE ONE.

So Tamil can never improve itself in the future.  Because Unicode Consortium will never accept any modification to what is already there!

# Michael S. Kaplan on Wednesday, September 27, 2006 10:24 PM:

Tamil improving itself has nothing to do with its encoding, because language is not just encoding.

But Unicode is on record as rejecting these schemes, so eventually the illogic of waste is the factor....

referenced by

2006/11/04 On Thokks who don't give a Frigg, under the mistletoe

2006/09/05 At the TONE, it will not be TUNE, but TANE
