by Michael S. Kaplan, published on 2006/01/05 05:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/05/509100.aspx
Over the last week I have received two emails and had one colleague ask me about a particular issue, so I thought I would bump up the priority and try to cover it....
It was not too many years ago that Dr. Om Vikas and the standards people in India put together a presentation regarding issues in Indic scripts in Unicode. The report included a table whose source was reported as being from the 1991 census (presumably the census in India, of course):
Language | Script | Population | Percent |
---|---|---|---|
Hindi | Devanagari | 33,72,72,114 | 41.6 |
Bangla | Bengali | 6,95,95,738 | 8.6 |
Telugu | Telugu | 6,60,17,615 | 8.1 |
Marathi | Devanagari | 6,24,81,681 | 7.7 |
Tamil | Tamil | 5,30,06,368 | 6.5 |
Urdu | Urdu | 4,34,06,932 | 5.4 |
Gujarati | Gujarati | 4,06,73,814 | 5.1 |
Kannada | Kannada | 3,27,53,676 | 4.0 |
Malayalam | Malayalam | 3,03,77,176 | 3.7 |
Oriya | Oriya | 2,80,61,313 | 3.5 |
Punjabi | Gurumukhi | 2,33,78,744 | 2.9 |
Assamese | Assamese | 1,30,79,696 | 1.6 |
Kashmiri | Urdu/Devanagari | 32,00,000 | 0.4 |
Sindhi | Urdu/Devanagari | 21,22,848 | 0.3 |
Nepali | Devanagari | 20,76,645 | 0.25 |
Konkani | Devanagari | 17,60,607 | 0.20 |
Manipuri | Manipuri | 12,70,216 | 0.15 |
Sanskrit | Devanagari | 49,736 | 0.0006 |
Now the Script column is not really part of the census data (the census data can be found quoted in many other online sources!), but the use of this column does show some interesting patterns -- for example, referring to the Arabic script as Urdu rather than Arabic.
Although the information was being presented to Unicode, it was also being put on the web and reported throughout India (especially to the various language speakers), which can explain such differences: much of that audience does not really think of the script as Arabic, ever....
But the most interesting item (in my opinion) is the way Assamese is known as using the Assamese script (according to Unicode, both Bengali and Assamese are written with the Bengali script).
This is not a solitary mention; the ISCII code pages produced by the Government of India actually have separate code pages for Bengali and Assamese, even though the sum total of the difference between Bengali and Assamese is a preferred glyph and an extra letter. The ISCII code pages are supported on Windows as of Windows 2000 (code page numbers 57006 for Assamese and 57003 for Bengali).
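To make the "one script" point concrete, here is a minimal Python sketch using the standard `unicodedata` module. It assumes the two code points most commonly associated with Assamese are U+09F0 and U+09F1 (that choice, and the helper names `describe` and `in_bengali_block`, are mine for illustration), and it simply shows that Unicode names both as BENGALI letters inside the Bengali block rather than giving Assamese a block of its own.

```python
import unicodedata

# Assumption for this sketch: the code points usually associated with Assamese.
ASSAMESE_LETTERS = ["\u09F0", "\u09F1"]

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def in_bengali_block(ch: str) -> bool:
    """True if the code point falls in the Bengali block, U+0980..U+09FF."""
    return 0x0980 <= ord(ch) <= 0x09FF

for ch in ASSAMESE_LETTERS:
    # Prints the code point, its Unicode name, and whether it is in the Bengali block.
    print(describe(ch), in_bengali_block(ch))
```

Python has no built-in codec for the ISCII code pages, so this sketch only looks at the Unicode side of the comparison.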
I tried to imagine what would have happened in Unicode if the same thing had been done for the Latin or Cyrillic or Arabic scripts -- imagine if we had to re-encode every single letter that is shared between the dozens or even hundreds of languages that use those letters. Seventeen planes may not be enough by the time we end up with the huge numbers of letters we would need across every language. And phishing problems would be unmanageable!
Obviously politics enter into the situation here, along with that unfortunate situation where a script and a language share the same name and all of the confusion that causes. I mean, there are enough problems with the perception/reality of trying to describe one language in terms of another without adding the burden of giving them the same name, right?
I didn't think we were in politics, but every time I talk myself into believing that, I am definitely convinced otherwise when I see the way so many of these things are positioned....
This post brought to you by "ৱ" (U+09F1, a.k.a. BENGALI LETTER RA WITH LOWER DIAGONAL)