Learning to spell in Bengali (when one doesn't know the language)

by Michael S. Kaplan, published on 2007/12/02 10:31 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/12/02/6639141.aspx

Among native speakers of many Indic languages, Unicode has quite the struggle on its hands when it comes to acceptance.

I'm going to blather a bit about some of the reasons here, and tell a fun story or two as well.

It does actually start with ISCII (Indian Script Code for Information Interchange), which had a few "interesting" architectural features, including among other things:

You might wonder why I put the word interesting in quotes.

Well, I did that because both of these "features" have specific problems with them that make the resulting encoding less intuitive....

First of all there is the transliteration scheme for the names, which is hardly universally accepted even with Hindi, let alone with the attempt to stick that same transliteration scheme on every other Indic script. Meaning that the way one would transliterate one's name into English may or may not match the scheme that is being used for the names.

And I have already talked lots (e.g. on this post and some of this one) about the abugida issue and the fact that not everyone would look at their language that way.

Unicode picked up most of these aspects of ISCII for the original Indic proposals.

Okay, now I am going to downshift this into something more about language and less about Unicode, and a fun little project that actually took months to accomplish on the calendar (though significantly less time in terms of hours actually spent -- this was an occasional thing).

It all started when I was meeting with Goldie Chaudhuri from over in SQL Server several months back (we were talking about some language issues vis-a-vis Unicode).

Goldie is of course not really any sort of "Indian" name I had ever heard before, but she apparently grew up in Florida so that seemed normal and certainly less odd than other names like Dweezil or Moon Unit; in any case it did not remind me of this transliteration issue or even a Unicode or a language one (beyond the vague sense of her having a Bengali last name, which I don't think I said anything about at the time).

It was in the afternoon and she had accidentally dropped her cardkey (I found it in her chair after she had gone). It actually had the name Godhuli on it, not Goldie.

And then I vaguely recalled an article from a few months before entitled Words and Music that had the word godhuli in it. I tried to find the article online without luck. (I later found it, right here from The Washington Post; the quote was: Then think of the meaning of some of the words and phrases in these new languages. Think of the word for dusk in Hindi -- godhuli, which translates literally into "the dust kicked up by cows coming home from pasture.")

Instead, at the time, I found this random blog post that I kept the link for because of the story it told (the last name similarity with Goldie's was coincidence; I was on an article mission at that moment, looking for godhuli, not Godhuli!):

In A Strange and Sublime Address by Amit Chaudhuri, Sandeep, an only child living in a Bombay high-rise, spends a summer visiting his Uncle's house in Calcutta with his mother. On Sundays, his uncle sings aloud to himself during his leisurely preluncheon bath, the notes echoing in the enclosed space of the bathroom 'like rays of trapped light darting this way and that in a crystal'.

He usually sang old, half-remembered compositions that had been popular thirty or forty years ago in a Bengal where the radio and the windup gramophone were still new and incredible machines breaking the millennial silence of the towns and villages:

Godhulir chhaya pathe
Je gelo chini go tare.

Knocking on the bathroom door, Sandeep made a pest of himself by asking: "Chhotomama, what does godhuli mean?"

Lost in the general well-being of cleansing himself, his uncle replied patiently: "The word go means 'cow', and the word dhuli means 'dust'. In the villages, evening's the time the cowherds bring the cattle home. The herd returns, raising clouds from the road. Godhuli is that hour of cow dust. So it means 'dusk' or 'evening'."

As Chhotomama explained, his voice emerging from behind the steady sound of water, Sandeep saw it in his mind like a film being shown from a projector - the slow-moving, indolent cows, their nostrils and their shining eyes, the faint white outline of the cowherd, the sense of the expectant village (a group of scattered huts), and the dust, yes the dust, rising unwillingly from the cows' hooves and blurring everything. The mental picture was set in the greyish-red colour of twilight. It was strange how one word could contain a world within it.

Strange indeed! What is a word but a seemingly random arrangement of letters of the alphabet (which themselves are seemingly random shapes), or a seemingly random modulated sound? And yet, one single word can encompass whole Universes and more. The word godhuli does not just indicate a time of day, but conjures up a complete way of life.

Understanding words and phrases and the concepts they encapsulate brings us a long way toward understanding the people, and societies and cultures, who employ them. It's just amazing how, long after the dusk has given way to night, the dust from the cows' hooves has settled, and even the village itself has crumbled to dust, this word will remain, yielding its secrets to the deserving.

So in this weird and wacky language of mine (English) if I knew enough about etymology then I might have stories like that for words and names. And if I were a good enough writer, they would be very inspiring stories here.

But I don't, and so I don't, and I'm not, and so they're not.

Though I did order a copy of the book that was quoted!

Anyway, I started thinking about the fact that the name (Godhuli) was not really the name; it was instead a commonly accepted transliteration of a name, and of a word.

I decided I should see the name in Bengali.

Don't ask me why I decided this. I did the same thing with Tamil words almost two years ago trying to deal with the nearly insurmountable differences between transliterations. I am just weird this way.

Of course at this point I had no idea if Goldie even knew the written Bengali language at all (she grew up in Florida, remember?), and I decided my weird funky language "projects" like this were unlikely to be too interesting, so I did not start by asking her.

The Tamil version of this project was quite unscoped beyond a vague desire to build up a good strong comparison/contrast of different transliteration schemes, and in the end not very much fun because it even led to arguments and the occasional threat of violence between various people. So the smaller scope of this Bengali project appealed to my peace-loving nature.... :-)

I took the Bengali Unicode chart and treated the transliteration precisely and literally as if it were some sort of secular version of Upanishadesque gospel and worked backwards into the Unicode characters.

Keep in mind I know almost no Bengali whatsoever here as I decide to do this.

The first cut (knowing I had at least one mistake in there, maybe more):


Of course at this point I needed a native speaker to provide some corrective assistance; I could do no more just futzing with letters myself.

I sent it off to Goldie (explaining what I was trying to do in as few words as I could manage -- writing small is hard work!) for her opinion on this "guess".

I think she looked at what I was trying to do and decided it was perhaps a little weird, but it sounded like it would make a good blog post some day (and she likes the blog) so she decided to play along....

Her comment on the accuracy of the guess:

Wrong in three places. The first one you could probably figure out on your own, the second is because of the English transcription – you split one sound into two, and the third is a lack of emphasis. Though the third looks REALLY wrong, and I’d venture to guess that the last letter can’t actually be the last letter of a word. I don’t read enough to actually know.

After Goldie read this post, she realized that the computer on which she had been looking at my guess did not have complex script support installed, so two of the three problems did not actually exist -- what she saw and what you might be seeing are not the same thing....

I guess she decided to go by the "teach a man to fish" philosophy, not giving away all the answers. That's cool, I can work with that. :-) 

Okay, the split was wrong -- I did the DA Hasant HA U instead of the DHA U. But I was basing it on my naive sense of the fact that the pronunciation separated the two letters into two separate syllables -- god-hu-li. So I did have a basis for my choice, even though it turned out to be dead wrong.

To learn something new, you have to be willing to be wrong, they say. Right?
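The two spellings of the dhu syllable discussed above can be compared code point by code point. What follows is a minimal Python sketch, not anything from the original exchange. It assumes that the HASANT in the shorthand is the character whose formal Unicode name is BENGALI SIGN VIRAMA, and from_names is a hypothetical helper introduced only for illustration.

```python
import unicodedata


def from_names(*names):
    """Build a string from full Unicode character names (hypothetical helper)."""
    return "".join(unicodedata.lookup(name) for name in names)


# The split guess: DA + HASANT + HA + U.
# Assumption: HASANT is the character formally named BENGALI SIGN VIRAMA.
split_guess = from_names(
    "BENGALI LETTER DA",
    "BENGALI SIGN VIRAMA",
    "BENGALI LETTER HA",
    "BENGALI VOWEL SIGN U",
)

# The single aspirated consonant: DHA + U.
single_letter = from_names("BENGALI LETTER DHA", "BENGALI VOWEL SIGN U")

for label, text in (("split", split_guess), ("single", single_letter)):
    print(label, [f"U+{ord(ch):04X}" for ch in text])

# Normalization does not unify the two sequences: they are different spellings,
# not two encodings of the same text.
same = unicodedata.normalize("NFC", split_guess) == unicodedata.normalize(
    "NFC", single_letter
)
print("equal after NFC:", same)
```

The split version is four code points and the single-letter version is two, so a search for one will not find the other.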

I did have a bunch of work to do so my next guess was probably a week or three later (this sort of project by necessity runs in a low priority thread):


And her response (there was a pause here too -- this was like a "Chess by mail" game almost!):

Closer – you figured out the first vowel this time, and the consonants are all correct (at least the base character). The last two characters are now TOTALLY off – you were much closer before. The vowels are tricky though, I imagine the Unicode approach to learning language doesn’t cover the nuances of when to use...don’t know the English words, but when to use modifying kars vs full characters.

She was talking about when to use VOWEL II vs. LETTER II -- the dependent vs. the independent form of the vowel.

I thought that might be wrong, but she had originally been saying the last letter couldn't be the last letter in a word. I was improvising....
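The dependent vs. independent distinction shows up directly in the Unicode character names and properties. The sketch below is a hedged illustration in Python; vowel_forms is a hypothetical helper, and it relies only on the naming pattern BENGALI LETTER x / BENGALI VOWEL SIGN x, which holds for II but is not guaranteed for every short name.

```python
import unicodedata


def vowel_forms(short_name):
    """Return (independent, dependent) forms of a Bengali vowel.

    Hypothetical helper: assumes both 'BENGALI LETTER <name>' and
    'BENGALI VOWEL SIGN <name>' exist; lookup raises KeyError otherwise.
    """
    independent = unicodedata.lookup(f"BENGALI LETTER {short_name}")
    dependent = unicodedata.lookup(f"BENGALI VOWEL SIGN {short_name}")
    return independent, dependent


independent, dependent = vowel_forms("II")

for ch in (independent, dependent):
    # The independent form is a letter (category L*); the dependent form
    # is a combining mark (category M*) that attaches to a consonant.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# A consonant followed by each form: two distinct strings.
la = unicodedata.lookup("BENGALI LETTER LA")
with_letter = la + independent
with_sign = la + dependent
print(with_letter != with_sign)
```

A chart-driven guess has to pick one of the two forms for every vowel, and nothing in the romanized spelling says which.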

My next guess was much closer, and I think I remarked that this was getting more like that game Mastermind (anyone else remember that game?) with me guessing and her giving oblique hints:


She thought it was just about correct (just one vowel in the middle that was off), but wanted to consult with her parents, who had originally given her the name, based on a poem (or maybe a song?).

Her parents were initially interested in this strange game going on in the background of all of our lives but suddenly were more concerned about something that they realized while verifying their own opinions. 

They let her know that the way she had learned to spell her name all her life was wrong. Oops!

The real hazard of cultural assimilation. :-)

The details on the "mistake" in the name, as well as the mistakes in my guess at the last name:

The first phoneme is the same sound as in the Chinese PM “Chou En-lai.” I don’t know how else it would be transcribed, but the BENGALI LETTER CHA is an aspirated sound – you want the non aspirated version. I could point you to the IPA if that would be easier?

Turns out I spelled my first name all wrong though. Got both vowels wrong (the tricky ones), and have been taught it wrong my entire life. Both the dictionary and Rabindranath Tagore agree on the spelling, though the root words that my name is derived from are spelled the way I thought my name was actually spelled. I led you astray, my apologies. You get a chance first to figure out where I was wrong, before I try and nudge you towards the right spelling.

Apparently colloquial Bengali doesn’t differentiate much between long vowels and short vowels either.

The final string that was intended for the first name:


Now of course I never would have guessed the UU vs. U thing but since my first guess had used the II vs. I thing correctly I got some mileage out of her [unintentionally] leading me astray. She did have to deal with the fact that she had her name spelled wrong for decades, after all. So we were all learning. :-)

And of course she had not ever spelled her full name in English as Godhuuli, and she had actually been thinking it was Godhuulii which wasn't right at all and would have looked even worse using this transliteration scheme in Unicode.

Then the full name was easy, using the first name and the assonistic (her word, not mine) rhythmic/rhyming kind of thing that she thought the names shared, minus all of the parts that ended up being slightly different because the name had been spelled a little off all those years.

গোধূলি চৌধুরী (GA O DHA UU LA I   CA AU DHA U RA II)

Notice the lack of a CHA there. That transliteration thing again. What a native speaker thinks of as a CHA vs. CHAA was actually a CA and CHA thing.
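The shorthand in parentheses after the final string can be checked mechanically against the Unicode character names. This is a minimal Python sketch; short_names is a hypothetical helper, and the strings are written as escapes so that the exact code points are unambiguous.

```python
import unicodedata

PREFIXES = ("BENGALI LETTER ", "BENGALI VOWEL SIGN ", "BENGALI SIGN ")


def short_names(text):
    """List each character's Unicode name with the 'BENGALI ...' prefix removed.

    Hypothetical helper; spaces are skipped.
    """
    result = []
    for ch in text:
        if ch == " ":
            continue
        name = unicodedata.name(ch)
        for prefix in PREFIXES:
            if name.startswith(prefix):
                name = name[len(prefix):]
                break
        result.append(name)
    return result


# The final strings, code point by code point.
first_name = "\u0997\u09cb\u09a7\u09c2\u09b2\u09bf"
last_name = "\u099a\u09cc\u09a7\u09c1\u09b0\u09c0"

print(" ".join(short_names(first_name)))  # GA O DHA UU LA I
print(" ".join(short_names(last_name)))   # CA AU DHA U RA II

# CA and CHA are separate characters in the chart.
ca = unicodedata.lookup("BENGALI LETTER CA")
cha = unicodedata.lookup("BENGALI LETTER CHA")
print(f"CA=U+{ord(ca):04X} CHA=U+{ord(cha):04X}")
```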

As a side check, each of these words individually can find a bunch of web pages in Google whereas my earlier attempts cannot, which perhaps helps verify that they are right....

Now this same thing I then tried with the names of other folks.

Like former NLS Test/PM colleague Sushmita (means smiling, source Tamil).

Or NLS PM colleague Poornima (means full moon, source Sanskrit, I think?).

Though I won't publish my guesses on those two or others. Without the feedback loop to make sure I am not unintentionally butchering their names, the fact that my first guesses are certain to be wrong keeps me from being comfortable posting them, and you never know who would really enjoy this kind of thing (as I said, I ran into problems in the past in this regard -- not everyone finds my little language fetishes to be normal!).

The problem here with Unicode? The names are not the letters they learned; it is not how a native speaker would think of the language.

I try to imagine if I was required to spell Kaplan with component pieces like o + preceding vertical line below instead of p or something. I would not think the scheme had much to do with the English language.

So acceptance of Unicode is facing an uphill battle with the native speaking target of some of the languages that use these scripts. And it is not hard to imagine why they would feel that we do not understand their language -- and where is the motivation for them to be interested in our implementation that so clearly fails to understand them?

It is obviously a bit too late to change things in this space (where a bit is defined as 3.1 versions in length -- much more time than 12 parsecs!), but even today people are still trying.

Example -- I have just been talking with someone in Malaysia telling me about another trip to Chennai happening early next year to discuss the latest "add the pure consonants for Tamil -- i.e. all of the consonants with the built in puLLi" proposal. I was asked if I would be attending -- I don't know for sure, though, to be honest.

Man, this was a long post. Hope somebody feels this glimpse into my "neither work nor play" life was entertaining.... :-)


This post brought to you by ঊ and ূ (U+098a and U+09c2, aka BENGALI LETTER UU and BENGALI VOWEL SIGN UU)

# John Cowan on 2 Dec 2007 2:33 PM:

The names aren't really a transliteration scheme: they are just names, and are basically always derived from Sanskrit, whether the language is a descendant of Sanskrit (like Bengali) or not (like Tamil).  But if it comes to that, the way that Indic names *are* transliterated has always been rough and ready, not conforming to any real scheme.

# Michael S. Kaplan on 2 Dec 2007 3:17 PM:

The naming scheme may be a convenience for implementers, but it is hard for people who feel there is a bit too much Hindification going on in their lives to have it foisted upon them by Unicode. :-(

And it assumes that these are the names that would be used in all of these languages (which as far as I can tell is not true)....

# Suraj Barkale on 3 Dec 2007 10:55 AM:

The letter my name ends with (U+0934 DEVANAGARI LETTER LLLA and U+0947 DEVANAGARI VOWEL SIGN E) gives grief to all Marathi speakers.

Me & my friends transliterate U+0934 as L while chatting online and spell letter. It is really weird to read Devnagari transliterated in Roman script.

However, as I have used transliteration extensively (while chatting online), I am using http://sarasvati.sourceforge.net/ for directly entering Devnagari script using phonetic transliteration (I like this word :)

Most of the Indic names derive from Sanskrit (e.g. Poornima means night of the full moon at least in Marathi & Hindi) and don't transliterate well in Roman script. e.g. my first name should really be written Sooraj

# Mihai on 3 Dec 2007 1:45 PM:

Let's not forget that transcribing the sound of a word into English is more of an art than a science even if the word is English to begin with :-)

# John Cowan on 4 Dec 2007 1:32 PM:

Unicode character names are not intended to be localized names of letters.  Every Indic script is used to write more than one language, and there is no guarantee that the names of the letters are the same in every language.  Consequently, they are all Sanskritized (not Hindified) uniformly, since Sanskrit is one language that all these scripts have been used to write at one time and place or another.

# Michael S. Kaplan on 4 Dec 2007 2:57 PM:

I can kind of understand the approach myself, but when (as in this case) there are common conventions (in this case Bengali and Assamese use many of the same conventions), the approach just ends up looking like laziness, whether it is or not....

And it definitely contributes to how people on the subcontinent view 10646/Unicode, which to be honest is not the most flattering view. :-)

Part of moving on is the acknowledgment that not everything is perfect; at a minimum, people want their concerns to be acknowledged.

But I had a great time building the strings in the examples, in any case (with just enough examples of native conventions not matching to make me realize that the problem continues even among the educated!).


referenced by

2010/03/23 Learning to spell in Bengali (when one has a cool input method)

2008/10/01 Parents, to be perfectly blunt, suck at names, sometimes

2008/02/06 The utility of a feature like font fallback in Uniscribe can often be somewhat obviated by its design flaw

2007/12/05 A Strange and Sublime HASANT
