A script, by any other name

by Michael S. Kaplan, published on 2006/01/05 05:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/05/509100.aspx


Over the last week I have received two emails and had one colleague ask me about a particular issue, so I thought I would bump up the priority and try to cover it....

It was not too many years ago that Dr. Om Vikas and the standards people in India put together a presentation regarding issues in Indic scripts in Unicode. The report included a table whose source was reported as being from the 1991 census (presumably the census in India, of course):

Language    Script             Population      Percent
Hindi       Devanagari         33,72,72,114    41.6
Bangla      Bengali            6,95,95,738     8.6
Telugu      Telugu             6,60,17,615     8.1
Marathi     Devanagari         6,24,81,681     7.7
Tamil       Tamil              5,30,06,368     6.5
Urdu        Urdu               4,34,06,932     5.4
Gujarati    Gujarati           4,06,73,814     5.1
Kannada     Kannada            3,27,53,676     4.0
Malayalam   Malayalam          3,03,77,176     3.7
Oriya       Oriya              2,80,61,313     3.5
Punjabi     Gurumukhi          2,33,78,744     2.9
Assamese    Assamese           1,30,79,696     1.6
Kashmiri    Urdu/Devanagari    32,00,000       0.4
Sindhi      Urdu/Devanagari    21,22,848       0.3
Nepali      Devanagari         20,76,645       0.25
Konkani     Devanagari         17,60,607       0.20
Manipuri    Manipuri           12,70,216       0.15
Sanskrit    Devanagari         49,736          0.0006
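
For anyone not used to the Indian digit grouping in the Population column (the last three digits, then groups of two), here is a minimal Python sketch of converting between the two conventions. The function names are made up for illustration, and it assumes non-negative whole numbers:

```python
def parse_grouped(text: str) -> int:
    """Parse a comma-grouped number; the grouping style does not matter."""
    return int(text.replace(",", ""))


def format_western(n: int) -> str:
    """Format with groups of three digits, e.g. 337,272,114."""
    return f"{n:,}"


def format_indian(n: int) -> str:
    """Format with Indian grouping: last three digits, then groups of two."""
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two digits at a time off the right end of the leading part.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


hindi = parse_grouped("33,72,72,114")
print(format_western(hindi))  # 337,272,114
print(format_indian(hindi))   # 33,72,72,114
```

So the Hindi figure reads as roughly 337 million, and the Kashmiri figure of 32,00,000 as 3.2 million.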

Now the Script column is not really part of the census data (the census data can be found quoted in many other online sources!), but the use of this column does show some interesting patterns -- for example referring to the Arabic script as Urdu rather than Arabic.

Although the information was being presented to Unicode, it was also being put on the web and reported throughout India (especially to the various language speakers), which may explain such naming choices: much of the audience does not really think of the script as Arabic, ever....

But the most interesting item (in my opinion) is the way Assamese is known as using the Assamese script (according to Unicode, both Bengali and Assamese are written with the Bengali script).

This is not a solitary mention; the ISCII code pages produced by the Government of India actually have separate code pages for Bengali and Assamese, even though the sum total of the difference between Bengali and Assamese is a preferred glyph and an extra letter. The ISCII code pages are supported on Windows as of Windows 2000 (code page numbers 57006 for Assamese and 57003 for Bengali).
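
One way to see the Unicode side of this is to ask for the character names of the letters most associated with Assamese. This is a minimal Python sketch using the standard unicodedata module. The helper name is hypothetical, and the choice of which code points to list is an assumption for illustration, not something taken from the ISCII or Unicode documentation:

```python
import unicodedata

# Code points chosen for illustration: the ordinary Bengali RA plus the
# two RA variants that live in the same Unicode block.
SAMPLE_LETTERS = ["\u09B0", "\u09F0", "\u09F1"]


def script_word(ch: str) -> str:
    """Return the first word of a character's Unicode name."""
    return unicodedata.name(ch).split()[0]


for ch in SAMPLE_LETTERS:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+09B0 BENGALI LETTER RA
# U+09F0 BENGALI LETTER RA WITH MIDDLE DIAGONAL
# U+09F1 BENGALI LETTER RA WITH LOWER DIAGONAL
```

Every one of these names starts with BENGALI, which is the script name in Unicode terms, whichever language the letter is being used to write.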

I tried to imagine what would have happened in Unicode if the same thing had been done for the Latin or Cyrillic or Arabic scripts -- imagine if we had to re-encode every single letter that is shared between the dozens or even hundreds of languages that use those letters. Seventeen planes may not be enough by the time we end up with the huge numbers of letters we would need across every language. And phishing problems would be unmanageable!

Obviously politics enter into the situation here, along with that unfortunate situation where a script and a language share the same name and all of the confusion that causes. I mean, there are enough problems with the perception/reality of trying to describe one language in terms of another without adding the burden of giving them the same name, right?

I didn't think we were in politics, but every time I talk myself into believing that, I am definitely convinced otherwise when I see the way so many of these things are positioned....

 

This post brought to you by "ৱ" (U+09f1, a.k.a. BENGALI LETTER RA WITH LOWER DIAGONAL)


Abhinaba Basu [MSFT] on 5 Jan 2006 6:27 AM:

When I first saw the list I got confused because I knew for sure that Assamese and Bengali share the same script. However, I got the point when I saw who presented the data :)

For geo-political reasons Assamese is said to have Assamese script. I didn't know the exact difference though; now I've figured out the weird ৱ that I saw on the Bangla keyboard.

The funny part is that Bangla (Bengali) is associated with India. It's the national language of Bangladesh and to be geo-politically correct should have been associated with Bangladesh. When the united Bengal got divided into two, West Bengal (part of India) and Bangladesh, a lot of interesting things happened. One of them is that India and Bangladesh have national anthems written by the same Nobel Laureate poet, Rabindra Nath Tagore. I do not know of any other country to have this :)

Somebody on 6 Jan 2006 5:22 AM:

IIRC the Nasta'liq script is normally used in India, not the usual Arabic script?

referenced by

2010/04/02 We're off script now, brothers and sisters....Let me here you say YEH!

2008/03/11 Where's the Beef^H^Hngali?

2006/02/14 Every character has a story #18: U+06cc and U+064a (ARABIC LETTER FARSI YEH and ARABIC LETTER YEH)
