by Michael S. Kaplan, published on 2011/01/23 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/01/23/10119125.aspx
Now just as a by-the-way, I am aware that the title of this blog might at first glance be considered specious by people who both know who Gallagher is and who know about the long history of the Srivasian "laugh" pun. I will explain why it isn't at the end of this blog.
For many years, Sinnathurai Srivas has been trying to convince whoever he can that Unicode is missing out on some fundamental issues by not encoding two different pronunciations of what visually is the same letter, and how this impacts so many future technological problems that Tamil would be hitting.
He said it differently than this -- for example talking about how Tamil uses a scientific system that Unicode ignores, and going down the rathole of pointing out how the way Unicode encodes things would cause actual meaning to change since only one of two possible options exists for encoding. His favorite example was that sometimes one means ksh, other times one means x, and by only encoding it one way we are unscientifically limiting the language.
Or he'd put it in a slightly different and more "native" way and point out that "In Tamil sri mean laugh. In Tamil sRi is a religious symbol" to imply that putting these two words and making them the same is an insult -- laughing at a religious symbol, essentially.
People would point out that this is not a problem that Unicode tries to solve, that we have both Polish as someone from Poland and polish as something we use to make something smooth/clean. And even the fact that the rules in English can cause the latter word to be capitalized (e.g. when it's at the beginning of a sentence) does not make anyone want to treat the two forms of "o" as different letters in Unicode.
If we want to say the sentence "Hitler didn't polish off the Polish" then we are okay doing that, and that is how the language works.
This example not only contrasts a more formal word with a less formal one, but by invoking Hitler there was hope it might end the conversation. Sadly, this was not the case....
If we were funnier we'd dig up an old Gallagher bit talking about how stupid English is, like this one:
But no one really thought of that, I guess. Or thought of mentioning it, at least.
This approach may lead (or may have already led -- or may one day lead) to a Tamil comedy routine along similar lines. If you are a Tamil comedian and choose to do this, make sure to thank me (or at least Gallagher) in the liner notes!
What he was trying to express was his own subconscious frustration over the fact that Tamil does not have (and has never had) a 100% correspondence between graphemes and phonemes -- i.e. one sound per letter shape. Few languages have this (Latvian comes close, though), and he is hardly the first person to express the frustration within their own language or the language of another. Examples like Vowel "harmony", enforced by political interests? show that people who do not fully understand this concept, but who nevertheless have the power to make changes to a script, sometimes will in fact make them.
Srivas never learned this lesson.
Anyway, none of that matters. Because just two days ago Srivas sent the following message to Unicode's Indic list (a list with less traffic than the general Unicode list but which makes up for it with per capita message silliness):
Subject: [indic] Scalable but simple Indic and Asian writing systems
English alphabet do not have some of the very basic alphabet and I'm proposing to
add new characters to English through Unicode Consortium because I find it difficult
to transliterate between English and other language.
Example, the "th" represented in English is not acceptable. It has to be a single
and fundamental character. Then, there are at least 3 different basic sounds
relating to this "th" such as 'thick', 'this', 'mother'. this means we need to add
three more characters to English alphabet. Similarly there are many other alphabet
require attention, with regard to English.
Further, the Asian, (primarily Indic) languages are very complex and random. These
need to be made logical and simple, but represent all that is required in
contemporary use in a simple way and also allow for expansion with simplicity in
mind. So I'm going to make proposals to UC on highly scalable and simplified Indic
and simplified Asian writing systems.
Please comment.
And now we have come full circle.
Unable to convince us that Unicode has been destroying Tamil, he is now pointing out how without more letters encoded to handle every different pronunciation in English, there will be severe problems trying to handle English too. Thus he is going to make proposals to stop that same problem in Asian/Indic languages.
But this ignores the different pronunciations in different parts of the world of the same words due to dialectal difference. It ignores periodic vowel shifts that occur. It ignores the fact that different languages using the same script can use the same letters pronounced differently (and never forget that Unicode encodes scripts and not languages).
The very first message I have from Srivas in my archives is from over eight years ago. It was about this same issue and the need to encode not only the ksh in riksha (ரிக்ஷா) but the x in Luxmi (லூக்ஷ்மி). The only progress we have made is that the old arguments would point out Tamil language puns (with examples that conflate formal and informal language that could cause offense) and the new arguments start from uses of the English language that are neither serious nor amusing.
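To make the "scripts and not languages" point concrete, here is a minimal Python sketch. The sample words are my own illustrative picks (not from any Unicode document), chosen because the letter "j" is pronounced differently in each language. It only uses the standard `unicodedata` module to look up the character name.

```python
import unicodedata

# Hypothetical sample words: the initial "j" sounds different in each language.
SAMPLES = {"English": "jam", "Spanish": "jamón", "German": "ja"}

def first_letter_identity(samples):
    """Map each language to (code point, Unicode name) of the word's first letter."""
    return {
        lang: (f"U+{ord(word[0]):04X}", unicodedata.name(word[0]))
        for lang, word in samples.items()
    }

identities = first_letter_identity(SAMPLES)
for lang, ident in identities.items():
    # Every language reports the same Latin-script code point.
    print(lang, ident)
```

The character name says LATIN, not English, Spanish, or German; the pronunciation lives in the language, not in the encoded character.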
He would have been better off finding videos like that Gallagher snippet. :-)
Though nothing would have been any different.
There are literally several dozen phonemes in the English language (exact number varying with dialect), and none of them are encoded separately in Unicode, which handles the Latin script and not the English language.
The things that look the same but are pronounced differently would, if added to Unicode, make all kinds of advanced natural language processing tasks easier, at the price of making simple input more difficult (and then you still have to deal with everything that wasn't input correctly like the tons and tons of existing data), all to solve a problem that Unicode never signed up to solve in the first place....
Of course people looking at the timeline will note that by citing a Srivas pun from 10 years ago and providing a Gallagher clip in today's blog, I appear to be implying a timeline of 10 years in the title on that basis, even though the YouTube clip is actually a collection of appearances that is much older. But actually, the clips are from HBO specials from slightly more than 10 years before the initial Srivas sri/sRi "laugh" puns. So really it was just falsely assuming that my blog title implied a timeline from point A to point B that would have led to this misinterpretation. In fact the humor of Gallagher predates the puns of Srivas by over a decade, something that I knew because I know my Gallagher and have the 3-disc Smashing Watermelon Collection to prove it!
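As a rough illustration of the "existing data" problem, the sketch below takes the three words from the quoted message ('thick', 'this', 'mother') and prints how the "th" in each is stored. The helper name `digraph_code_points` is made up for this example; nothing here is a Unicode API beyond Python's built-in `ord`.

```python
WORDS = ["thick", "this", "mother"]

def digraph_code_points(word, digraph="th"):
    """Return the code points of the first occurrence of `digraph` in `word`."""
    start = word.find(digraph)
    if start < 0:
        return None  # the digraph does not occur in this word
    return tuple(f"U+{ord(ch):04X}" for ch in word[start:start + len(digraph)])

encodings = {word: digraph_code_points(word) for word in WORDS}
for word, points in encodings.items():
    # Each word stores "th" as the same two Latin letters, whatever the sound.
    print(word, points)
```

All three come out as the same pair of code points, so any text already typed this way carries no record of which sound was meant; a new per-sound character would not retroactively fix it.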
carlos on 23 Jan 2011 8:36 AM:
The Icelandic alphabet is approximately English plus ð and þ for the "th" sounds. Problem solved!
Michael S. Kaplan on 23 Jan 2011 11:02 AM:
:-)
There are the other couple dozen phonemes to deal with, too!
Doug Ewell on 24 Jan 2011 7:52 AM:
> And even the fact that the rules in English can cause the latter word to be capitalized (e.g. when it's at the beginning of a sentence) does not make anyone want to treat the two forms of "o" as different letters in Unicode.
I suppose you might have heard that the same guy who tried to derail the RFC 4646 and 5646 efforts as "cultural imperialism" is now claiming the use of Unicode in IDNs is a U.S.-backed conspiracy to subjugate the French language, because Unicode does not distinguish uppercase letters used in proper names (majuscules) from those that simply come at the start of a sentence (capitales), preventing French from being written correctly in domain names.
John Cowan on 25 Jan 2011 11:41 AM:
Doug: Sigh. And here I had successfully forgotten him completely.
Carlos: See home.ccil.org/.../essential.html