And then there is the virama....

by Michael S. Kaplan, published on 2005/04/09 00:11 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/04/09/406765.aspx


The Virama is a fascinating sign. It has a simple job -- it suppresses the inherent vowel that the preceding Indic letter contains.
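
As a rough sketch of what that looks like in the data (using Python's standard `unicodedata` module, and borrowing the Tamil Ka example from further down), a letter with its vowel suppressed is just the letter followed by a separate combining mark. The `describe` helper is a made-up name for illustration only:

```python
import unicodedata

ka = "\u0b95"               # TAMIL LETTER KA, carries the inherent vowel
ka_virama = "\u0b95\u0bcd"  # the same letter followed by the virama sign

def describe(text):
    """Return (code point, Unicode name, combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]

# Two code points: the base letter, then the mark that suppresses its vowel.
for row in describe(ka_virama):
    print(row)
```

The virama reports canonical combining class 9, a class the Unicode Character Database reserves for viramas, while the letter itself is class 0.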

I was very pleased once I understood this concept (I was dealing with Tamil at the time). And the collation rules also seemed quite intuitive to me -- a letter with its inherent vowel suppressed comes before that same letter that still has the vowel. It seemed intuitive because if the vowel was suppressed then it would "weigh less" than if it was not, right?

And I went out in the world with an understanding that I thought would spread to a dozen other scripts that had Viramas in them.

If you know the actual truth you probably have some insight into why I consider my notions of having linguistic aptitude to be delusions....

Like I said, in the Tamil script, it is U+0bcd, and it is known as the Pulli.

And க் (U+0b95 U+0bcd, Tamil Ka + Pulli) sorts before க (U+0b95, Tamil Ka) alone, in the Tamil language.

But on the other hand, in the Devanagari script, it is U+094d, and it is known as the Halant.

And क् (U+0915 U+094d, Devanagari Ka + Halant) sorts after क (U+0915, Devanagari Ka) alone, in the Hindi language.

Ah, but in the Bengali script my insight worked again! It is U+09cd, and it is known as the Hasant.

And ক্ (U+0995 U+09cd, Bengali Ka + Hasant) sorts before ক (U+0995, Bengali Ka) alone, in both the Bengali and Assamese languages.

But my hopes are dashed in the Malayalam script, where it is U+0d4d, and it is known as the Chandrakkala.

And ക് (U+0d15 U+0d4d, Malayalam Ka + Chandrakkala) sorts after ക (U+0d15, Malayalam Ka) alone, in the Malayalam language.

And so on.
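
To make the two behaviors concrete, here is a toy sketch in Python. It is not how Windows or any real collation library weights these characters; `toy_key` and its `virama_first` flag are hypothetical names invented for this example, and the weights are arbitrary. It only shows that a plain code point sort always puts the bare letter first, so the per-language order has to come from a tailored key:

```python
# Virama code points named in this post, keyed by script.
VIRAMAS = {
    "Tamil": "\u0bcd",
    "Devanagari": "\u094d",
    "Bengali": "\u09cd",
    "Malayalam": "\u0d4d",
}

def toy_key(word, virama, virama_first):
    """Hypothetical sort key: one (letter, weight) pair per letter.

    virama_first=True sorts letter+virama before the bare letter;
    False gives the opposite order. Real collation tables are far richer.
    """
    key = []
    for ch in word:
        if ch == virama and key:
            letter, _ = key[-1]
            key[-1] = (letter, 0 if virama_first else 2)
        else:
            key.append((ch, 1))
    return key

tamil = ["\u0b95", "\u0b95\u0bcd"]  # Ka, Ka + Pulli
hindi = ["\u0915\u094d", "\u0915"]  # Ka + Halant, Ka

# Plain code point order: the bare letter is a prefix, so it comes first.
print(sorted(tamil))
# Tamil-style tailoring: Ka + Pulli ahead of Ka.
print(sorted(tamil, key=lambda w: toy_key(w, VIRAMAS["Tamil"], True)))
# Hindi-style tailoring: Ka ahead of Ka + Halant.
print(sorted(hindi, key=lambda w: toy_key(w, VIRAMAS["Devanagari"], False)))
```

The same flag would be set to `True` for Bengali and `False` for Malayalam, matching the orders described above.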

Any time I have talked to a native speaker of one of these languages, they have told me that the way that the language sorts simply feels natural to them. And I realize that the real problem was seeing what I thought was a technical reason for a set of principles that often do not have a logical reason that is so easily found.

It reminds me of a section of that Douglas Adams book Mostly Harmless:

   "I know that astrology isn't a science," said Gail. "Of course it isn't. It's just an arbitrary set of rules like chess or tennis or -- what's that strange thing you British play?"

   "Er, cricket? Self-loathing?"

   "Parliamentary democracy. The rules just kind of got there. They don't make any kind of sense except in terms of themselves. But when you start to exercise those rules, all sorts of processes start to happen and you start to find out all sorts of stuff about people. In astrology the rules happen to be about stars and planets, but they could be about ducks and drakes for all the difference it would make. It's just a way of thinking about a problem that lets the shape of the problem begin to emerge. The more rules, the tinier the rules, the more arbitrary they are, the better. It's like throwing a handful of fine graphite dust on a piece of paper to see where the indentations are. It lets you see the words that were written on the piece of paper above it that's now been taken away and hidden. The graphite's not important. It's just the means of revealing their indentations. So you see, astrology's nothing to do with astronomy. It's just to do with people thinking about people."

I think my attempt to find patterns in the chaos was an immature attempt to keep myself from feeling foolish for being fascinated by a subject that is no more based on scientific principles than astrology is. But it is an interesting "in" to learning about some aspects of language. Of which I have learned many.

This site isn't about science. It's just to do with a wanna-be linguist thinking about language.

And sorting it all out....

 

This post brought to you by U+0a4d, a.k.a. GURMUKHI SIGN VIRAMA


# Paul Bartrum on Saturday, April 09, 2005 3:02 AM:

I guess that some problems just don't have a correct answer. Often it doesn't even matter, as long as you are consistent. For example, it doesn't matter what side of the road you drive on, as long as you are driving on the same side as everyone else. The same applies to units of measure, traffic light colours, spelling, to name a few.

# Michael S. Kaplan on Saturday, April 09, 2005 5:56 AM:

Excellent analogy, Paul! Kinda hits the nail on the head. The rules just have to make sense in terms of themselves....

# Dean Harding on Sunday, April 10, 2005 9:40 PM:

Well, when you think about it, why does anything sort the way it does? I mean, why is "A" the first letter of our alphabet, and "Z" the last?

# Michael S. Kaplan on Sunday, April 10, 2005 10:03 PM:

Exactly! I think it was a subconscious fear that I may be fascinated by something that seemed to lack purpose that drove me to try to find meaning even when meaning did not exist! :-)

I moved past that, though. Since I realized that there is still meaning and purpose in the functionality....

# Dean Harding on Sunday, April 10, 2005 10:55 PM:

By the way, I noticed you've had a couple of quotes from The Hitchhiker's Guide to the Galaxy - I can't wait for the movie, either :) I started re-reading the "trilogy" again, just to be ready :)

# Ambarish Sridharanarayanan on Monday, April 11, 2005 7:25 PM:

I remember there being a controversy on the unicode lists with some people opining that consonants without the vowel sound, such as க் (U+0b95 U+0bcd) should have a single code-point, and that there should be vowel signs for 'A', similar to vowel signs corresponding to all other vowels. In other words, the controversy was whether the vowel 'A' was really inherent at all.

I think this relates to the "naturalness" of the sorting order - as a native Tamil speaker, I don't think the vowel is inherent at all; we're taught that the fundamental consonants are those with the Pulli (such as க்), and other consonants (such as க) are derived from these (by addition of vowels). I can understand that it's different in other Indic languages, but personally, I think Unicode may have got it wrong (at least for languages like Tamil), in determining that letters with the vowel sound (such as க, U+0b95) are basic and those without, such as (க், U+0b95 U+0bcd) are derived.

# Michael S. Kaplan on Monday, April 11, 2005 10:08 PM:

Ambarish -- indeed there was! Some of that was the firm belief that they are independent letters; other parts were the practical reason that people believed it was too great of a feat technologically to deal with them as composite characters made of multiple code points.

Luckily it was proven to the latter group of people that we were able to handle this issue from a technical standpoint....

# Michael S. Kaplan on Monday, April 11, 2005 10:20 PM:

To those in the first category, it is harder to give a satisfying answer, though it is fair to say that the cost of implementing support for all of the languages of India is made easier by using a generally consistent model across all of them than if each one had different models. And the ease of implementation has definitely worked to Tamil's favor, in this particular case, fwiw.

# Ambarish Sridharanarayanan on Tuesday, April 12, 2005 6:22 PM:

@michkap: Thanks for your response.
To address your second point re. how a consistent model makes it more feasible for implementors, considering the market for Indic languages is quite immature - that's from the point of view of corporate entities and pragmatism. I continue to believe that when one is designing a universal (and hopefully permanent) standard, one should aim more towards correctness and completeness than pragmatism. I'm sure a lot of decisions re. Unicode and i18n in general would have been taken differently if standards organisations stuck to ease of adoption.

# Michael S. Kaplan on Tuesday, April 12, 2005 6:39 PM:

That is possible, although after seeing the positive impact that Unicode Tamil support has had on people using it, it is hard to believe that Microsoft being able to support it so fully is truly a bad thing....

# Michael S. Kaplan on Tuesday, April 12, 2005 9:15 PM:

For example, see http://groups-beta.google.com/group/anbudan/about and tell me that is not incredibly cool!

# Ambarish Sridharanarayanan on Wednesday, April 13, 2005 7:32 PM:

Really cool link! There have been quite a few other websites using Unicode Tamil liberally, not to speak of Tamil bloggers. Infinitely better than dynamic fonts or the whole ISCII/TSCII/TAB/TAM mess.

BTW, in case it wasn't apparent, I couldn't agree with you more that Unicode is A Good Thing (TM) for Tamil, and thanks are due all companies including Microsoft for implementing and supporting languages where the revenue stream probably doesn't justify the decision. This topic has gotten old already, so ...

<wistful>... I just wish the Unicode consortium had addressed this subtle difference among the various Indic languages.</wistful>

# Michael S. Kaplan on Wednesday, April 13, 2005 8:58 PM:

Like I mention here:

http://blogs.msdn.com/michkap/archive/2005/04/12/407456.aspx

it's not just about revenue streams. :-)

# s.sureskumar on Thursday, December 20, 2007 5:53 AM:

you send basic english vocabulary meaning of tamil

s sureskumar


referenced by

2008/09/18 UCS-2 to UTF-16, Part 3: It starts with cursor movement (where MS simultaneously gets better and worse)

2007/12/16 Why my IUC31 talks were presented on Vista (even though running on a MacBook Pro)

2007/09/16 A&P of Sort Keys, part 6 (aka Relax, be calm, and deCOMPRESS if you are feeling out of sorts)

2006/09/20 The puLLi suppresses the inherent vowel. Or does it?

2005/06/05 Does Bengali sorting work?
