Where's the other Urdu?

by Michael S. Kaplan, published on 2010/07/18 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/07/18/10039492.aspx


"What are the languages of India?" is a rather loaded question.

Not in the "Have you stopped beating your wife yet?" sense. But perhaps to some the two questions have a similar order of magnitude.

In the constitution of India, it is clear that the official languages of the country are Hindi (in the Devanagari script) and English (in the Latin script). 

But a part of the constitution allows the recognition of official languages in individual states, and since the states had their borders largely decided based on language, it seemed best to leave it to the states to define their own official languages.

With that said, there is a list of languages that have a special significance, whose latest incarnation is described here in Wikipedia:

The Eighth Schedule to the Indian Constitution contains a list of 22 scheduled languages. At the time the constitution was enacted, inclusion in this list meant that the language was entitled to representation on the Official Languages Commission, and that the language would be one of the bases that would be drawn upon to enrich Hindi, the official language of the Union. The list has since, however, acquired further significance. The Government of India is now under an obligation to take measures for the development of these languages, such that "they grow rapidly in richness and become effective means of communicating modern knowledge." In addition, a candidate appearing in an examination conducted for public service at a higher level is entitled to use any of these languages as the medium in which he answers the paper.

There are obviously benefits to being on this rather exclusive list -- this number 22 is out of either nearly 500 or over 1500 languages in India (depending on whose count you accept).

The list (table modified from here) in order of population is:

| Language | Millions of speakers per last census | Locale within India defined in Windows? | State(s) giving the language official status |
|---|---|---|---|
| Hindi | 422 | Yes | Andaman and Nicobar Islands, Arunachal Pradesh, Bihar, Chandigarh, Chhattisgarh, the national capital territory of Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttar Pradesh and Uttarakhand |
| Bengali | 180 | Yes | Andaman & Nicobar Islands, Tripura, West Bengal |
| Telugu | 74 | Yes | Andaman & Nicobar Islands, Andhra Pradesh, Puducherry |
| Marathi | 72 | Yes | Maharashtra, Goa, Dadra & Nagar Haveli, Daman and Diu, Madhya Pradesh, Karnataka |
| Tamil | 61 | Yes | Tamil Nadu, Andaman & Nicobar Islands, Puducherry |
| Urdu | 52 | No | Jammu and Kashmir, Andhra Pradesh, Delhi, Bihar, Uttar Pradesh |
| Gujarati | 46 | Yes | Dadra and Nagar Haveli, Daman and Diu, Gujarat |
| Kannada | 38 | Yes | Karnataka |
| Malayalam | 33 | Yes | Kerala, Andaman and Nicobar Islands, Lakshadweep, Puducherry |
| Oriya | 33 | Yes | Orissa |
| Punjabi | 29 | Yes | Chandigarh, Delhi, Haryana, Punjab |
| Assamese | 13 | Yes | Assam |
| Maithili | 12 | No | Bihar |
| Santhali | 6.5 | No | Santhal tribals of the Chota Nagpur Plateau (comprising the states of Bihar, Chhattisgarh, Jharkhand, Orissa) |
| Kashmiri | 5.5 | No | Jammu and Kashmir |
| Konkani | 2.5 | Yes | Goa, Karnataka, Maharashtra, Kerala |
| Nepali | 2.5 | No | Sikkim, West Bengal, Assam |
| Sindhi | 2.5 | No | non-regional language |
| Manipuri | 1.5 | No | Manipur |
| Bodo | 1.2 | No | Assam |
| Dogri | 0.1 | No | Jammu and Kashmir |
| Sanskrit | 0.05 | Yes | non-regional language |

Now I threw that third column in to point out that not every decision made in regard to Windows has a pure population reason behind it. I could have used other list items, like the version of Windows where support was added, if I wanted to show even more interesting and/or strange trends, but I figure this one is enough for present purposes.

Now of all of these languages, the only one that cannot be displayed at all using the built-in fonts in Windows 7 is Santali, which is written with the Ol Chiki script. But I was told that the literacy rate among speakers is low, so perhaps that 6.5 million number shouldn't be thought of purely in terms of "theoretical potential customers". Though of course other numbers would change on this list as well, with that metric. :-)

Microsoft Windows and Office don't seem all that well aimed at the "silent majority" (~93%) in India who don't speak English, but we'll leave that interesting issue for another day....

There are only a few real anomalies on this list:

And the most unusual of the anomalies on this list? It can be seen in Urdu, which as I mentioned in Giving the people Urdu, we are! can really be thought of as the same underlying language as Hindi, with both of them grown in different directions.

Directions that have helped to fuel the differences between India and Pakistan for lo these many years, in fact.

Yet in Windows, where an Urdu - Pakistan locale exists, no Urdu - India one is to be found!

Though space has been reserved for it, as charts in both Locale IDs Assigned by Microsoft and Language Identifier Constants and Strings indicate (technically the same could be said for Manipuri - India and Nepali - India and Sindhi - India, now that I look at the lists!). I'm not sure whether that counts as transparency or some people publishing the wrong lists!
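For anyone wondering what a "reserved" value looks like, here is a minimal sketch of the arithmetic behind a language identifier: the sublanguage goes in the upper bits and the primary language in the lower ten. The constant values below are my recollection of the Windows headers (Urdu as primary language 0x20, Pakistan as sublanguage 0x01, India as sublanguage 0x02), so treat them as assumptions and check them against the two lists mentioned above. The class and method names are made up for illustration.

```csharp
using System;

static class LangIds {
    // Assumed values, recalled from the Windows headers; verify against
    // the published locale ID lists before relying on them.
    public const int LANG_URDU = 0x20;
    public const int SUBLANG_URDU_PAKISTAN = 0x01;
    public const int SUBLANG_URDU_INDIA = 0x02;

    // Same arithmetic as the MAKELANGID macro: sublanguage shifted left
    // by 10 bits, combined with the primary language in the low 10 bits.
    public static int MakeLangId(int primary, int sub) {
        return (sub << 10) | primary;
    }

    static void Main() {
        Console.WriteLine("ur-PK: 0x{0:X4}", MakeLangId(LANG_URDU, SUBLANG_URDU_PAKISTAN));
        Console.WriteLine("ur-IN: 0x{0:X4}", MakeLangId(LANG_URDU, SUBLANG_URDU_INDIA));
    }
}
```

With those assumed constants the two lines come out as 0x0420 and 0x0820, and with the default sort order the LCID has the same numeric value as the language identifier.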

I was asked by five different people while I was in India about what is holding up an Urdu - India locale, but to be honest I have no earthly clue. I was told that the folks in the subsidiary have asked for it, but I was unable to verify that bit of information at the time this blog was written.

The bulk of the data in the locale would be identical to Urdu - Pakistan, but there are incredibly good reasons to really want Urdu - India to be separate and not ask people to use "the wrong one".

So, ignoring everything else but the customer requirement for a moment, I am going to use the method described in Where are the other Tamils? and create a custom locale for ur-IN. :-)

Here is the code:

using System;
using System.Globalization;

namespace CustomLocales {
    class CustomLocales {
        [STAThread]
        static void Main() {
            CultureInfo ci = new CultureInfo("ur-PK", false);
            RegionInfo ri = new RegionInfo("en-IN");
            CultureAndRegionInfoBuilder carib = new CultureAndRegionInfoBuilder("ur-IN", CultureAndRegionModifiers.None);
            carib.LoadDataFromCultureInfo(ci);
            carib.LoadDataFromRegionInfo(ri);
            carib.CultureEnglishName = "Urdu (India)";
            carib.CultureNativeName = "اُردو (بھارت)"; // Ignore the way it looks, the string is right! :-)
            carib.CurrencyEnglishName = ri.CurrencyEnglishName;
            carib.CurrencyNativeName = "روپیہ";
            carib.RegionNativeName = "بھارت";
            carib.NumberFormat.CurrencySymbol = "Rs.";
            carib.ThreeLetterWindowsLanguageName = "URI"; // Instead of URD as ur-PK has
            carib.IetfLanguageTag = carib.CultureName;
            carib.Save("ur-IN.ldml");
            carib.Register();
        }
    }
}

In the course of putting all that together, someone pointed out an interesting issue in the Urdu (Pakistan) locale. Its native currency name in Windows 7 is

روپيه

which includes U+064a, ARABIC LETTER YEH. This seems like a bug, since U+06cc, ARABIC LETTER FARSI YEH, almost certainly seems like it would be preferred by Urdu-speaking people in either country.

But in any case the following slightly different string was recommended to me:

روپیہ

so I chose that one in the case of the above code; if you disagree then of course you can change the string, as well as the ThreeLetterWindowsLanguageName I used....

If I am right about the built-in ur-PK data, someone should put in a bug to get that fixed in some future version of Windows, by the way. Any former NLS testers reading this? :-)

If it exists then it is a subtle bug, since as I mentioned in Every character has a story #18: U+06cc and U+064a (ARABIC LETTER FARSI YEH and ARABIC LETTER YEH), in the initial and medial forms the two letters look identical (and this is obviously the medial form since it is the penultimate character in the string).
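Since the two letters cannot be told apart by eye in this position, the only reliable check is to look at the code points. Here is a minimal sketch that reads the native currency name for ur-PK from whatever data the machine has and dumps it one UTF-16 code unit at a time; the class and helper names are made up for illustration, and the result will naturally depend on the version of Windows and .NET it runs on.

```csharp
using System;
using System.Globalization;
using System.Text;

static class YehCheck {
    // Formats each UTF-16 code unit as U+XXXX. That is enough here because
    // every letter involved is in the Basic Multilingual Plane.
    public static string CodePoints(string s) {
        StringBuilder sb = new StringBuilder();
        foreach (char c in s) {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    static void Main() {
        // Read the currency name from the locale data on this machine.
        string name = new RegionInfo("ur-PK").CurrencyNativeName;
        Console.WriteLine(CodePoints(name));
        Console.WriteLine("Has U+064A ARABIC LETTER YEH: " + (name.IndexOf('\u064A') >= 0));
        Console.WriteLine("Has U+06CC ARABIC LETTER FARSI YEH: " + (name.IndexOf('\u06CC') >= 0));
    }
}
```

The same CodePoints helper can be pointed at carib.CurrencyNativeName in the custom locale code to confirm which yeh actually ended up in the string.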

Anyway, just take the code, save it to a file as ur-IN.cs, and then compile it from the command line with the following line of code:

csc /r:sysglobl.dll ur-IN.cs

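And once you do that and run the resulting ur-IN.exe (from an elevated command prompt, since registering the culture writes to a directory and registry keys that a limited user cannot touch), the landscape in Regional and Language Options will change a little bit: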

And there we go! :-)
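If you would rather confirm the registration from code than from the control panel, a minimal sketch along these lines should do it. The class and helper names are made up for illustration; the properties it prints are ordinary CultureInfo members. I would expect the LCID of a custom culture to come back as a generic "custom" value rather than anything specific to Urdu, but that is an assumption worth checking on your own machine.

```csharp
using System;
using System.Globalization;

static class UrInCheck {
    // Returns the named culture, or null if the system does not know it.
    // CultureNotFoundException in .NET 4 derives from ArgumentException,
    // which is what earlier framework versions throw directly.
    public static CultureInfo TryGetCulture(string name) {
        try {
            return new CultureInfo(name, false);
        } catch (ArgumentException) {
            return null;
        }
    }

    static void Main() {
        CultureInfo ci = TryGetCulture("ur-IN");
        if (ci == null) {
            Console.WriteLine("ur-IN is not registered on this machine.");
            return;
        }
        Console.WriteLine("English name: " + ci.EnglishName);
        Console.WriteLine("Windows three-letter name: " + ci.ThreeLetterWindowsLanguageName);
        Console.WriteLine("LCID: 0x" + ci.LCID.ToString("X4"));
        Console.WriteLine("Currency symbol: " + ci.NumberFormat.CurrencySymbol);
    }
}
```

This one only reads culture data, so it needs neither sysglobl.dll nor an elevated prompt.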

Now ideally one would be able to use the reserved LCID value mentioned in those other articles, but that is not an option in this case.

But no solution is perfect....

Sometimes it really still is about opening it all up and getting out of the way, as best as we can....


Pavanaja U B on 18 Jul 2010 11:01 AM:

I tried your code. It did not work for me. It did compile but I did not see the locale. I even tried re-booting. I am using Win7 Ultimate 64bit. Compiled using VS2010.

Michael S. Kaplan on 18 Jul 2010 1:18 PM:

You also have to run the EXE you compile. :-)

Pavanaja U B on 18 Jul 2010 9:17 PM:

I get this error -

Unhandled Exception: System.Globalization.CultureNotFoundException: Culture is not supported.
Parameter name: name
ur-IN is an invalid culture identifier.
   at System.Globalization.CultureInfo..ctor(String name, Boolean useUserOverride)
   at CustomLocales.CustomLocales.Main()

Michael S. Kaplan on 18 Jul 2010 9:32 PM:

Whoops, that was a typo in my code -- the CultureInfo line should be creating a ur-PK culture to grab all the data that is the same between the two cultures like day names and month names and such -- I can't create a ur-IN until it is done being created!

Pavanaja U B on 19 Jul 2010 6:32 AM:

Ok. It worked now. Let me add one observation: you must start the VS Command prompt in Administrator mode for this to work.

Michael S. Kaplan on 19 Jul 2010 7:25 AM:

Ah yes, that is true -- it is creating the custom culture file in a directory that the limited user has no permission to and creating reg entries that the limited user is not allowed to write....

Doug Ewell on 21 Jul 2010 4:16 PM:

Santali is not written exclusively in Ol Chiki; in fact, that is a minority script for that language (behind Bengali, Devanagari, and even Latin).

Urdu on 9 Nov 2010 11:00 PM:

Urdu is a national language of Pakistan and is not preferred in India because Indians think it emerged from Arabic, which is the language of the Quran. History and information on Urdu is available for you at http://www.urdureading.com/

Michael S. Kaplan on 10 Nov 2010 8:10 AM:

52 million speakers says quite a bit about preference....


