At the TONE, it will not be TUNE, but TANE
by Michael S. Kaplan, published on 2006/09/05 00:01 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2006/09/05/740606.aspx
Since my initial post about TUNE, "And if your language starts playing a different TUNE", mentioned a meeting that would be happening in Tamil Nadu over the weekend, I figured I should post some follow-up information on what happened....
This Yahoo group has a thread or two with lots of summary info, excerpted below.
Due to a last-minute attempt to reach out to the Tamil community on the proposed Unicode encoding, a ONE day conference was organised by the Tamil Virtual University.
Here is a summary:
1. The name TUNE has been changed to TANE - TAmil New Encoding.
2. To make revision on the proposed code chart of TANE- Symbols and numerals
3. To obtain consensus from countries outside Tamil Nadu.
4. To seek support from software developers such as Microsoft and vendors.
5. To chart a migration time table from current encoding to TANE
6. TANE to be made available FREE for the community for easy access in and outside tamil nadu. ( the last tools offered free by GOI was hosted by a server from NOIDA and they were cumbersome) - Prof Balakrishnan comment.
There were few members of TU present at the event. Thiru Ramki, Naa Elangovan, Badri, Anbarasan, Logasundaram ( he is not in this list) members of KTS - Anto Peter and Ananth - KTS President and Sec respectively.
Here is an extract of an email sent by Elangovan of cadgraf to the Infitt.
To-days meeting was chaired by Dr.V.C.K and inaugurated by Mr.Dayanidhi Maran, Minister for IT and Communications. The Technical Committee meeting was headed by Dr.M.Anandhakrishnan. It was well attended by Microsoft, IBM, CDAC, IISC, IIT, Anna University, KTS, etc; The minister requested for a unanimous decision of all concerned for the most efficient and most appropriate 16bit encoding for Tamil as early as possible as the E-Governance project of the Indian Govt. may have to be rolled out from the beginning of 2007. There was good technical participation from all sectors. At the end, Dr.M.A. announced that there is a near consensus to proceed with the 16 bit encoding for Tamil which need not be the exact replica of TUNE. This will be called 16 bit Tamil encoding. However this will be publicised in all web sites to seek wider consultations world-wide and from all vendors for the next 3 months. The final encoding will be based on the feed backs and suggestions received. The official version of the recommendation will be submitted by Prof.MA to the TN Govt. for their further action
I am happy to see that now there is more time and scope for all technocrats all over the world to participate and offer their suggestions.
So, the name of the new encoding is now TANE rather than TUNE (isn't Tane the Māori god of birds? I may be misremembering something here!).
I heard from several sources there that "Unicode and MichKa were a Western monopoly," which is amusing to me given the fact that Unicode is 100% locked in with ISO 10646 and all - I wonder how the NBs in Japan and other countries would react to being called a Western monopoly.
It was also interesting to be named so directly and prominently in a meeting where a bunch of people don't like me very much. Thankfully several others pointed out the good work that INFITT WG02 has been doing that I have been helping with.
That's the secret -- if you are going to make enemies, be sure that you also make friends!
When it came down to an actual vote, only two voted to keep the current encoding; 43 voted to form a new 16-bit encoding.
And there are two interesting strategic points I'll put up without commenting on as I think they speak for themselves (I have heard them both many times before over the last few years):
- That Chinese could get 27000 characters when there govt. put there foot down; similarly Japanese got 500+ for a character encoding system. Similarly if GOI and TN Govt. put there pressure we will get this. If they don't give, we will use PUA and over few years it will become defacto standard.
- There is huge cost savings in E-Governance. TN Govt has to spend Rs.3000 crores for there E-Governance initiative where 1GB is needed for each citizen. Adopting TUNE over Unicode will result in saving of Rs.1500 Crores because of Storage savings alone.
Well, I suppose we are all living in interesting times. It will be interesting to see what happens next.
This post brought to you by ௧ (U+0be7, a.k.a. TAMIL DIGIT ONE)
# Andrew West on Tuesday, September 05, 2006 5:03 AM:
"That Chinese could get 27000 characters when there govt. put there foot down"
What these people may not realise is that in recent years the Chinese government has tried very hard to get nearly a thousand precomposed Tibetan "brdarten" syllables encoded into ISO/IEC 10646 (see N2558, N2621 and N2624), in order to change the encoding model of Tibetan (this is exactly analogous to the Tamil situation); but Unicode and other national bodies stood firm, and they failed. The Chinese government has since been forced to implement their alternative syllabic encoding model in the PUA on Planes 0 and 15 (actually it is more complicated than that, as the government specifies two implementation levels -- Level 1 supporting the PUA precomposed syllables only, and Level 2 supporting PUA precomposed syllables and standard combining Tibetan). I believe that the Tibetan case provides a strong precedent for not accepting the TUNE/TANE re-encoding of Tamil.
# Dean Harding on Tuesday, September 05, 2006 7:42 AM:
I don't understand their complaint anyway. I think it's a GOOD thing to have your script considered "complex". Just look at the cool stuff you can do with Segoe Script when English is being considered "complex"!
I've followed some of the "discussion" on the unicode list, and I gotta agree - they all seem to have their hands on their ears and saying "la la la"
# Mike Dimmick on Tuesday, September 05, 2006 9:11 AM:
Unicode is very much structured as a glyph-additive system, so far as I can tell. That is, the glyph produced by <(base character) (combining character)> is very similar graphically to superimposing the combining character glyph(s) on the base character glyph, with some adjustments for spacing. Glyph addition - where each glyph (or a couple of glyphs each side of a consonant for some Indic vowels) adds (or subtracts, in case of virama/pulla! - but this is still an added symbol, not removing part of a glyph) a recognisable sound or modifies the sound in a known way - works for many scripts.
This model does not work for Far East scripts where the same glyphs are used by multiple languages but pronounced differently, hence why the CJK Hanzi characters are simply listed as 'CJK UNIFIED IDEOGRAPH' in the catalogue - there's no 'sound' or concept that can be listed as there is for other scripts, making the diagrammatic reference chart the sole source for which character is which. The 27,000 characters are not solely for Chinese but for Japanese and Korean as well (possibly other languages that use Hanzi too) and you'll hear plenty of complaints from all three groups that they should never have been unified in this way. You can't use components of each glyph as building blocks to make a more complex glyph because the components don't have any real identity or value of their own.
The only thing then to explain is why the Latin alphabet is pretty consistently encoded with precomposed characters. The answer basically is history. Western and Far East encodings have an awful lot of history. ISO 646 goes way back (as ANSI X3.4 in 1968), the regional variants (mapping parts of the 7-bit set to precomposed modified base characters) date back to 1972. The ISO 8859 series started life in ECMA - I can't find a date for the first edition, but the second edition of ECMA 94 (which became parts 1 to 4 of ISO 8859) is dated 1986.
The preamble to ECMA 94 states that the reason for including the precomposed characters was because modified characters were typically encoded as <(base character) (Backspace) (modifier)> which causes really horrible processing problems. Ironically one of the standards listed in the preamble, ISO 6937-2, is considered 'difficult to use for processing as some graphic characters are represented by one and others by two [byte] combinations'. Here, however, the modifier preceded the base character and was therefore a bit odd.
It does feel a bit weird to be 'restricting' a script to a smaller number of characters when Unicode offers such a large range of code points, but there are simply so many scripts to encode that it pays, when not having to offer compatibility with an older standard, to be conservative with allocations rather than going overboard with largesse and running out of code points prematurely.
The idea of using the PUA invokes Raymond's common question, 'what if everyone did this?' If everyone did this, it would be impossible to write documents, or encode fonts, containing two scripts that both used the PUA. The word Private in Private Use Area means that it's for the end-user's private use; governments certainly should not be trying to use this area.
# Baskaran on Thursday, September 07, 2006 5:50 AM:
Huge cost savings of about 1500 crores (by having 512 slots in PUA) due to reduced storage space and each citizen requiring about 1 GB space - laughable and ridiculous.
Can they explain the basis for these numbers? Can someone tell them the decrease in the storage costs over these years? These government officials think that people are fools.
# Michael S. Kaplan on Thursday, September 07, 2006 11:16 AM:
I don't think this sort of issue is on the government officials as much as consultants who are paid to do a review and who cast the results in a particular light in order to please those who commissioned the study. If you know what I mean.
Though every version of the work of which I have seen the methodology has had some rather severe technical flaws in it....
# P.Chellappan on Tuesday, September 12, 2006 2:12 AM:
Just because Storage Costs are becoming cheaper, it does not mean that I should use 50% more space to store my data.
Just because CPU speeds are increasing, it does not mean that I should inefficiently process my data.
Processing speeds and storage space are constantly being increased, but the fact remains that at any given point in time, the present Tamil Unicode encoding will make sure that one uses 50% more space and is 50% less efficient in data processing. The latter is more critical as far as I am concerned, as it is very important for real time processing.
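For readers who want to see what is being counted in the storage argument, here is a minimal sketch, not a benchmark and not the methodology behind any of the figures quoted in this thread. It measures one Tamil word in UTF-8 and UTF-16, then compares that with a purely hypothetical scheme that spends one 16-bit unit per akshara. The helper `count_aksharas` and its rule (every code point that is not a combining mark starts a new unit) are assumptions made for illustration; real akshara segmentation is more involved, and the ratio depends entirely on the text being measured.

```python
import unicodedata


def count_aksharas(text: str) -> int:
    """Rough akshara count: each non-combining code point starts a new unit.

    This is an illustrative assumption, not a real segmentation algorithm.
    """
    return sum(1 for ch in text if unicodedata.category(ch) not in ("Mn", "Mc"))


# "Tamil" written in Tamil: TA, MA, VOWEL SIGN I, LLLA, VIRAMA
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

utf8_bytes = len(word.encode("utf-8"))       # 3 bytes per Tamil code point
utf16_bytes = len(word.encode("utf-16-le"))  # 2 bytes per Tamil code point

# Hypothetical: one 16-bit unit per akshara, as a TUNE/TANE-style scheme might use.
hypothetical_bytes = 2 * count_aksharas(word)

print(utf8_bytes, utf16_bytes, hypothetical_bytes)
```

A single word says nothing about a whole database, where markup, numbers, Latin text and compression all change the picture.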
# Michael S. Kaplan on Tuesday, September 12, 2006 2:21 AM:
Well, in truth the statistics given as part of the "TUNE proof" are very suspect, and the methodology of the testing was never published so that no one can reproduce it....
Not to mention the issues that Venkatarangan raises here, which give some of the actual results of what such an encoding would do to Tamil if it ever were approved (which it cannot be, due to the violation of Unicode stability policies involving the re-encoding of scripts).
# Richard Wordingham on Saturday, September 16, 2006 12:38 PM:
16-bit Tamil in the BMP (outside the PUA) might just be possible, even now. It could even be consistent with current Unicode - the characters would canonically decompose to their current Unicode encodings. These new codes would be scattered through the BMP, though. It would be interesting to see what this, plus support for the old encoding, would do to the processing time advantages claimed for encoding aksharas. (I would allow the timing tests to use only the new encoding, provided they gave the same results as using the old encoding.)
I appreciate the new encoding would not be allowed in NFC or NFD, but I am not a fan of compulsory normalisation, especially to something as quirky as NFC. (NFC gets really quirky when a character has two accents.)
# Michael S. Kaplan on Saturday, September 16, 2006 12:56 PM:
You do understand that this is not an opinion shared by the UTC, right? Since re-encoding Tamil is a violation of the stability guidelines?
# Richard Wordingham on Saturday, September 16, 2006 4:47 PM:
I've understood the argument against scrapping the current encoding. But what stability guideline is breached by adding TAMIL LETTER K with a *decomposition* to <TAMIL LETTER KA, TAMIL SIGN VIRAMA>? One would naturally add it to the composition exclusions. Thus one would be adding decompositions, not compositions, and these additions would be consistent with the stability guarantees on normalisation. We could also add TAMIL LETTER TANE KA, with a compatibility decomposition to TAMIL LETTER KA, if it were important for processing speed that all the aksharas in k- (excluding the KSSA set) have regularly related numerical values.
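As a point of reference for the sequence named above, this small sketch uses Python's standard `unicodedata` module to show how the pure consonant is spelled in Unicode today: as the two code points U+0B95 and U+0BCD, which both NFC and NFD leave untouched because no precomposed character exists for them. The variable names are illustrative only, and the sketch describes the current encoding, not any proposed one.

```python
import unicodedata

KA = "\u0b95"      # TAMIL LETTER KA
VIRAMA = "\u0bcd"  # TAMIL SIGN VIRAMA (the pulli)
pure_k = KA + VIRAMA

names = [unicodedata.name(ch) for ch in pure_k]

# With no precomposed form in Unicode, normalisation changes nothing.
nfc = unicodedata.normalize("NFC", pure_k)
nfd = unicodedata.normalize("NFD", pure_k)

print(names, nfc == pure_k, nfd == pure_k)
```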
Adding such characters in the BMP is as close as Unicode can come to adopting TANE. This is an option that should be explored.
The simple data storage solution, of course, is SCSU, though that is incompatible with TANE encoded in the BMP. Of course, an extended SCSU would be very relevant if TANE principles were accepted as above, but with the new code points in a supplementary plane rather than the BMP.
Extended SCSU would have 32768 character windows as well as 128 character windows, primarily for handling large character sets such as those of Egyptian and Tangut. However, only Doug Ewell and I showed any interest in it.
# Michael S. Kaplan on Saturday, September 16, 2006 5:17 PM:
If you do not see the problems that adding alternate duplicate encodings add to the stability, security, usability, and overall implementation of any script in Unicode, let alone the violation of the linguistic principles of the language and script, all due to an unproven and inaccurate set of claims about efficiency that have been refuted by experts both inside and outside of Unicode and inside and outside of Tamil Nadu, then I am not sure that I will be able to help you see it....
And that is ignoring what this re-encoding is (which IS a violation of principles that the UTC has officially stated -- Unicode will NOT re-encode scripts. Period.).
But I care too much about Unicode as a standard and Tamil as a language to injure either one in this way. I will be a part of any constructive solution, but what you are suggesting is a destructive one, even if it is less destructive than either TUNE or a TANE within Unicode.
# Richard Wordingham on Saturday, September 16, 2006 8:13 PM:
By 'the problems that adding alternate duplicate encodings add to the stability, security, usability, and overall implementation of any script in Unicode', are you saying that the principle of canonical equivalence does not cure the problems, e.g. that the typically three or five encodings of a Latin vowel with two diacritics (once as one character, once or twice as two characters and once or twice as three characters) still present problems for Unicode-based systems?
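The Latin case mentioned here can be checked directly. The sketch below takes one Vietnamese vowel, e with circumflex and acute, spelled as one, two and three code points, and shows that Python's standard `unicodedata.normalize` maps all three spellings to a single NFC form and a single NFD form. The choice of letter is just an example. Because both accents have the same combining class, this letter is the three-spelling case; letters whose two marks have different combining classes have more spellings.

```python
import unicodedata

spellings = [
    "\u1ebf",         # one code point: LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE
    "\u00ea\u0301",   # two: e-circumflex + COMBINING ACUTE ACCENT
    "e\u0302\u0301",  # three: e + COMBINING CIRCUMFLEX ACCENT + COMBINING ACUTE ACCENT
]

# Canonically equivalent strings collapse to one form under either normalisation.
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}

print(len(nfc_forms), len(nfd_forms))
```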
Where has the claim that encoding all the Tamil aksharas as 16 bits speeds grammatical analysis been refuted?
Encoding Tamil aksharas appears to be in accord with at least some native perceptions of Tamil, and would be justified by the precedent of the Ethiopic script. It's non-Tamil languages in the Tamil script that may make encoding solely as aksharas unviable.
Personally I don't like the idea of assigning codepoints to the Tamil aksharas, and putting them in the BMP strikes me as greed as well as bloat.
# Michael S. Kaplan on Saturday, September 16, 2006 8:31 PM:
Canonical equivalence does not help if the principal agent of canonical equivalence (normalization) cannot be used to equalize the different text that is canonically equivalent.
And re-encoding scripts is still not allowed by Unicode.
When you speak to people in Tamil Nadu, they understand the technical problems with their proposal and it becomes clear that this is more of a political step than a technical one.
But as to the refutations, I do not know of published ones offhand, but I do know that all technical reviews of the "proofs" have pointed out major flaws in the mechanisms used to prove the points that call their validity into serious question. Enough so that the proofs are no longer available for review?
I'll post more on Tamil in the future, though I will likely be a bit more constructive about it since if talking about Tamil is journalism, talking about TUNE/TANE is like muckraking....
# Richard Wordingham on Sunday, September 17, 2006 10:02 AM:
The non-standard NFCM (defined below) is a perfectly adequate normalisation for performing text comparison, and indeed NFC and NFD would also both work just as normal. By NFCM I mean NFC modified by ignoring the entries in the composition exclusions. It's a true normalisation, not a folding like NFKC and NFKD, but it is not stable.
The only re-encoding that would go on would be if there needed to be a fixed relationship between the encoding of the graphically minimal aksharas and the others. And that would be the introduction of canonically equivalent forms, much as one might suggest filling in the gaps in the superscript digits and mathematical letters so that minor mark-up could be converted to characters by calculation rather than by look-up. Apart from that, what is proposed is simply the addition of precomposed characters. (Unicode policy on that is, 'No. Just wait for a better renderer to come along. Don't bloat Unicode.')
Does the 'No Re-encoding' rule apply to scripts between Mumbai and Hong Kong? The meaning of <U+1000, U+1039, U+101B> is changing - one will have to replace the sequence by <U+1000, U+103C>. The ubiquitous (in its script) <U+1039, U+200C> will have to be replaced by <U+103A>. The New Tai Lue consonants will be re-encoded for use with the old vowel system and the full set of subscript consonants (N3121 - but the decision in principle predates the approval of the New Tai Lue script.) The effective removal of Vietnamese U+0340 and U+0341 by making them canonically equivalent to U+0300 and U+0301 might also be dubbed re-encoding.
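The Vietnamese tone mark case at the end of this comment is easy to observe. This is a minimal sketch with Python's standard `unicodedata` module, in which the dictionary name `expected` is just illustrative: U+0340 and U+0341 carry singleton canonical decompositions, so any normalisation form replaces them with U+0300 and U+0301.

```python
import unicodedata

# Tone mark -> the ordinary combining accent it is canonically equivalent to.
expected = {
    "\u0340": "\u0300",  # COMBINING GRAVE TONE MARK -> COMBINING GRAVE ACCENT
    "\u0341": "\u0301",  # COMBINING ACUTE TONE MARK -> COMBINING ACUTE ACCENT
}

nfc_results = {src: unicodedata.normalize("NFC", src) for src in expected}
nfd_results = {src: unicodedata.normalize("NFD", src) for src in expected}

print(nfc_results == expected, nfd_results == expected)
```

It says nothing, of course, about whether the same treatment would be appropriate for any other script.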
I see the lack of any improvement in processing speed as sufficient argument against bloating Unicode with Tamil aksharas.
# Michael S. Kaplan on Sunday, September 17, 2006 10:29 AM:
The standards that need stable and predictable results cannot use the non-existent "NFCM" which is not a part of normalization. And since it cannot be used and since normalization is defined by Unicode, it is not an answer.
Actually, most of what you are saying is either incorrect or misleading (mathematical letters have intrinsically different properties than letters and are not letters in any real sense, as an example, and the suggested Myanmar changes are an incredibly controversial issue that involves sequences that CANNOT be encoded properly in the current model, which does not apply to Tamil or any of the TUNE/TANE arguments).
And because of this, the whole comment is misleading and confusing in the context of TUNE/TANE. I'd rather avoid misleading people if possible, which is why keeping us on the actual topic here would make more sense than bringing up less than relevant examples of exceptional cases that do not establish precedents for Tamil to use.
Now as to processing speed, I also disagree with you.
English processing for many operations could be made significantly faster if we re-ordered the letters used in English to intersperse the uppercase and lowercase letters. That is NOT an argument to re-order ASCII that anyone in their right mind would accept. And it does not apply to Tamil either, even if proof did exist. It is an attempt to use an out of scope problem as an argument for an improper change.
The same applies to the need to capture the dozens of phonemes that exist in English by encoding characters rather than relying on the five that exist in the alphabet. Encoding such characters would make spellcheckers, thesauri, and such MUCH more efficient. Does this mean that we need to encode all those new characters, due to Unicode's vicious refusal to allow efficient processing in these tools? No, it does not -- for the same reason as the Tamil change is out of scope.
The fact that the Tamil arguments are all unproven (and that the original flawed proofs were withdrawn) is just a bonus in showing the underlying actions of those proposing the change; the fact is that the entire attempt is out of order.
# Richard Wordingham on Sunday, September 17, 2006 11:54 AM:
Two strings are canonically equivalent iff their NFD normalised forms are the same, iff their NFC normalised forms are the same, and iff their 'NFCM' normalised forms are the same. (NFCM normalised forms cannot be hardcoded without committing to a specific version of Unicode, because that normalisation is not stable.) NFCM is relevant because, under the computational efficiency argument, it, or something like it, would be the preferred text normalisation for Tamil language-sensitive processing.
Most of the proposed Myanmar changes are unnecessary. If Eric Muller had completed his action on Indic conjoining, and it had been remembered that Indic scripts occur in further India as well as India, it would have been obvious that the Burmese conjoining combinations could have been handled by the full panoply of <VIRAMA>, <ZWJ, VIRAMA> and <VIRAMA, ZWJ> and that the new, medial consonants were unnecessary. Unfortunately, the IETF objects to ZWJ. TALL AA scrapes through logically by the argument that the shaping rule is not stable, as with TALL S. And before you ask, the Karen have got subscript LA wrong. Karen spelling seems seriously weird.
I use italic mark-up for normal mathematical variables, so I type variable x as U+0078 with mark-up rather than U+1D465, and variable h as U+0068 rather than U+210E. (Perhaps I need to bite the bullet and fight for some mathematical keyboards to be installed, but would we have the fonts in a Windows 2000 + Word 2000 installation? At home I believe I only have the Mathematical Alphanumerical symbols in Code2001.) Now, would it really hurt to add U+1D455 as canonically equivalent to U+210E? Or should we expect totally confusing assignments, as with U+2071?
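The code points mentioned in this paragraph can be inspected with Python's standard `unicodedata` module. A small sketch with illustrative variable names: U+210E is PLANCK CONSTANT, with only a compatibility mapping to plain h, while U+1D455, the slot where an italic h would fall in the mathematical italic run, is unassigned in the Unicode data shipped with current Python versions (an observation about today's data files, not a prediction).

```python
import unicodedata

planck = "\u210e"      # PLANCK CONSTANT, used as mathematical italic h
math_x = "\U0001d465"  # MATHEMATICAL ITALIC SMALL X
gap = "\U0001d455"     # where MATHEMATICAL ITALIC SMALL H would sit

planck_name = unicodedata.name(planck)
x_name = unicodedata.name(math_x)

# Compatibility (not canonical) mappings: NFKC folds to ASCII, NFC does not.
planck_nfkc = unicodedata.normalize("NFKC", planck)
planck_nfc = unicodedata.normalize("NFC", planck)
x_nfkc = unicodedata.normalize("NFKC", math_x)

# An unassigned code point has no name and general category "Cn".
gap_name = unicodedata.name(gap, None)
gap_category = unicodedata.category(gap)

print(planck_name, x_name, planck_nfkc, x_nfkc, gap_name, gap_category)
```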
As to English spelling reform, I do wonder why there is not yet full support for i.t.a. Support for IPA is there of course, including a tie for the proper representation of the initial and final sounds of 'church' and 'judge' (U+035C). Two complete re-shapings of the alphabet for phonetic English spelling are of course already encoded - Shavian and Deseret. Or are you referring to the lack of a single character for the sequence <U+0054, U+035C, U+0283>?
# Michael S. Kaplan on Sunday, September 17, 2006 12:56 PM:
COMMENT TRACK MAINTENANCE:
NFCM DOES NOT EXIST. Please desist from discussing things that are not in Unicode as if they were documented concepts, with ennobled names that assume they are. Future comments may be moderated to enforce this if it continues, and will be marked as such.
Please discuss math when it is being discussed, or feel free to suggest a topic if you like (I have discussed math in the past and would be happy to do so in the future -- including discussion of fonts in Windows supporting math).
Since just about nothing you are talking about is about either TUNE or TANE, could you please desist? Topics in the suggestion box or comments on the Unicode list seem much more appropriate than "littering" comments in unrelated posts.
# CAPital on Sunday, September 24, 2006 1:14 AM:
When people argue that 'Z' comes before 'a', they intentionally hide the fact that 'A' to 'Z' are in order as well as 'a' to 'z'.
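The ordering being argued about is easy to see in a few lines of Python. This sketch sorts four letters by raw code point, which is what a plain binary comparison does, and then with a simple case-aware key. The key function is only an illustration and not a real collation algorithm.

```python
letters = ["a", "Z", "A", "z"]

# Binary order: all of A-Z (65-90) comes before all of a-z (97-122).
by_code_point = sorted(letters)

# Illustrative key: compare case-insensitively first, then break ties by code point.
by_letter = sorted(letters, key=lambda ch: (ch.lower(), ch))

print(by_code_point)  # ['A', 'Z', 'a', 'z']
print(by_letter)      # ['A', 'a', 'Z', 'z']
```

Both statements are true at once: each case is internally in order, and 'Z' still sorts before 'a' unless the comparison does extra work.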
Why is there a need to encode both cases [upper & lower] for English, when they could have simply made English a complex script and made 'Z' appear out of 'z' using a mapping, or vice versa?
Why are 'k' and 'c' both encoded, since they sound almost the same?
Why are 'M' and 'N' both encoded, since they look and sound almost the same?
Why were all characters, both without accents and with accents, encoded in Unicode for European languages? Couldn't those have been made complex scripts, with the accented forms appearing out of the mappings?
Why are there so many 'Extended' and 'Supplement' characters encoded for almost every language in Unicode's charts?
Tamil does not have thousands of glyphs.
Why is it that when it comes to Tamil, people don't see the advantage of it NOT being a complex script and of its characters being in the natural order?
# Michael S. Kaplan on Sunday, September 24, 2006 1:51 AM:
At the time that ASCII was encoded, the notion of encoding complex scripts did not exist, and neither did the technology exist to encode things in such a way. But times change, and we move forward....
# Peter Lund on Wednesday, October 11, 2006 9:24 PM:
CAPital: it would make more sense for you to complain about C/G, i/j, u/v where the difference lies more or less in diacritics. Or complain about the ligature w, for that matter.
Besides, the Latin alphabet is not just used for English but for many other languages, many of them having legacy texts that people would really like to be able to express in Unicode, preferably without too much conversion work from existing computerized representations. In some, c and k do not sound the same at all (one might always be an 's' and the other always a 'k', for example). Even in English, they can both represent many different sounds (the k in know is completely silent, the c in cut sounds just like a k, the c in ceiling sounds just like an s, and the c in church, together with the h, is different again).