Where's the other Urdu?

by Michael S. Kaplan, published on 2010/07/18 08:20 -07:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/07/18/10039492.aspx

"What are the languages of India?" is a rather loaded question.

Not in the "Have you stopped beating your wife yet?" sense. But perhaps to some the two questions have a similar order of magnitude.

In the constitution of India, it is clear that the official languages of the country are Hindi (in the Devanagari script) and English (in the Latin script).

But a part of the constitution allows the recognition of official languages in individual states, and since the states had their borders largely decided based on language it seemed best to leave it to the states to work to define the official languages within the states.

With that said, there is a list of languages that have a special significance, whose latest incarnation is described here in Wikipedia:

The Eighth Schedule to the Indian Constitution contains a list of 22 scheduled languages. At the time the constitution was enacted, inclusion in this list meant that the language was entitled to representation on the Official Languages Commission, and that the language would be one of the bases that would be drawn upon to enrich Hindi, the official language of the Union. The list has since, however, acquired further significance. The Government of India is now under an obligation to take measures for the development of these languages, such that "they grow rapidly in richness and become effective means of communicating modern knowledge." In addition, a candidate appearing in an examination conducted for public service at a higher level is entitled to use any of these languages as the medium in which he answers the paper.

There are obviously benefits to being on this rather exclusive list -- this number 22 is out of either nearly 500 or over 1500 languages in India (depending on whose count you accept).

The list (table modified from here) in order of population is:

Language	Millions of speakers per last census	Locale within India defined in Windows?	State(s) giving the language official status
Hindi	422	Yes	Andaman and Nicobar Islands, Arunachal Pradesh, Bihar, Chandigarh, Chhattisgarh, the national capital territory of Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttar Pradesh and Uttarakhand
Bengali	180	Yes	Andaman & Nicobar Islands, Tripura, West Bengal
Telugu	74	Yes	Andaman & Nicobar Islands, Andhra Pradesh, Puducherry
Marathi	72	Yes	Maharashtra, Goa, Dadra & Nagar Haveli, Daman and Diu, Madhya Pradesh, Karnataka
Tamil	61	Yes	Tamil Nadu, Andaman & Nicobar Islands, Puducherry
Urdu	52	No	Jammu and Kashmir, Andhra Pradesh, Delhi, Bihar, Uttar Pradesh
Gujarati	46	Yes	Dadra and Nagar Haveli, Daman and Diu, Gujarat
Kannada	38	Yes	Karnataka
Malayalam	33	Yes	Kerala, Andaman and Nicobar Islands, Lakshadweep, Puducherry
Oriya	33	Yes	Orissa
Punjabi	29	Yes	Chandigarh, Delhi, Haryana, Punjab
Assamese	13	Yes	Assam
Maithili	12	No	Bihar
Santhali	6.5	No	Santhal tribals of the Chota Nagpur Plateau (comprising the states of Bihar, Chhattisgarh, Jharkhand, Orissa)
Kashmiri	5.5	No	Jammu and Kashmir
Konkani	2.5	Yes	Goa, Karnataka, Maharashtra, Kerala
Nepali	2.5	No	Sikkim, West Bengal, Assam
Sindhi	2.5	No	non-regional language
Manipuri	1.5	No	Manipur
Bodo	1.2	No	Assam
Dogri	0.1	No	Jammu and Kashmir
Sanskrit	0.05	Yes	non-regional language

Now I threw that third column in to point out that not every decision made in regard to Windows has a pure population reason behind it. I could have used other list items like version of Windows where support was added if I wanted to show even more interesting and/or strange trends, but I figure this one is enough for present purposes.

Now of all of these languages, the only one that cannot be displayed at all using the built-in fonts in Windows 7 is Santali, which is written with the Ol Chiki script. But I was told that literacy rates among its speakers are low, so perhaps that 6.5 million number shouldn't be thought of purely in terms of "theoretical potential customers". Though of course other numbers would change on this list as well, with that metric. :-)

Microsoft Windows and Office don't seem all that well aimed at the "silent majority" (~93%) in India who don't speak English, but we'll leave that interesting issue for another day....

There are only a few real anomalies on this list:

Kashmiri, whose font support is available for both the Arabic script and the Devanagari script, would really have to wait for built-in locale support until after the political situation is resolved in a less tense way;
Sanskrit having a locale is obviously also a very political thing, in the other direction;
Sindhi mostly uses the Devanagari script in India, and the extra characters it needs have been added both to Unicode and to Microsoft fonts. This is a small point of embarrassment for me personally (and for Microsoft): the characters went into Unicode before the relevant version of ISO/IEC 10646 had added them, leaving the two standards out of sync for a version, and that was done because Microsoft requested them (through the people in the UTC meeting at the time, which included me) for the sake of support in the next version of Windows and Office. Yet the locale was never officially added, either in the Devanagari script for India or the Arabic script for Pakistan.

And the most unusual of the anomalies on this list? It can be seen in Urdu, which as I mentioned in "Giving the people Urdu, we are!" can really be thought of as the same underlying language as Hindi, with both of them grown in different directions.

Directions that have helped to fuel the differences between India and Pakistan for lo these many years, in fact.

Yet in Windows, where an Urdu - Pakistan locale exists, no Urdu - India one is to be found!

Though space has been reserved for it, as the charts in both "Locale IDs Assigned by Microsoft" and "Language Identifier Constants and Strings" indicate (technically the same could be said for Manipuri - India, Nepali - India, and Sindhi - India, now that I look at the lists!). I'm not sure whether that counts as transparency or some people publishing the wrong lists!

I was asked by five different people while I was in India about what is holding up an Urdu - India locale, but to be honest I have no earthly clue. I was told that the folks in the subsidiary have asked for it, but I was unable to verify that bit of information at the time this blog was written.

The bulk of the data in the locale would be identical to Urdu - Pakistan, but there are incredibly good reasons to really want Urdu - India to be separate and not ask people to use "the wrong one".

So, ignoring everything else but the customer requirement for a moment, I am going to use the method described in "Where are the other Tamils?" and create a custom locale for ur-IN. :-)

Here is the code:

using System;
using System.Globalization;

namespace CustomLocales {
    class CustomLocales {
        [STAThread]
        static void Main() {
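            // ur-IN does not exist yet, so start from the built-in Urdu (Pakistan) culture data...
            CultureInfo ci = new CultureInfo("ur-PK", false);
            // ...and take the region data for India from an existing Indian locale.
            RegionInfo ri = new RegionInfo("en-IN");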
            CultureAndRegionInfoBuilder carib = new CultureAndRegionInfoBuilder("ur-IN", CultureAndRegionModifiers.None);
            carib.LoadDataFromCultureInfo(ci);
            carib.LoadDataFromRegionInfo(ri);
            carib.CultureEnglishName = "Urdu (India)";
            carib.CultureNativeName = "اُردو (بھارت)"; // Ignore the way it looks, the string is right! :-)
            carib.CurrencyEnglishName = ri.CurrencyEnglishName;
            carib.CurrencyNativeName = "روپیہ";
            carib.RegionNativeName = "بھارت";
            carib.NumberFormat.CurrencySymbol = "Rs.";
            carib.ThreeLetterWindowsLanguageName = "URI"; // Instead of URD as ur-PK has
            carib.IetfLanguageTag = carib.CultureName;
            carib.Save("ur-IN.ldml");
            carib.Register();
        }
    }
}

In the course of putting all that together, someone pointed out an interesting issue in the Urdu (Pakistan) locale. Its native currency name in Windows 7 is

روپيه

which includes U+064a, ARABIC LETTER YEH. This seems like a bug, since U+06cc, ARABIC LETTER FARSI YEH, would almost certainly be preferred by Urdu-speaking people in either country.

But in any case the following slightly different string was recommended to me:

روپیہ

so I chose that one in the case of the above code; if you disagree then of course you can change the string, as well as the ThreeLetterWindowsLanguageName I used....

If I am right about the built-in ur-PK data, someone should put in a bug to get that fixed in some future version of Windows, by the way. Any former NLS testers reading this? :-)

If it exists then it is a subtle bug, since as I mentioned in "Every character has a story #18: U+06cc and U+064a (ARABIC LETTER FARSI YEH and ARABIC LETTER YEH)", in the initial and medial forms the two letters look identical (and this is obviously the medial form since it is the penultimate character in the string).
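Since the two letters render identically in this position, the only reliable way to tell the strings apart is to look at the code points. Here is a minimal sketch of such a check; the class name YehCheck and its two helper methods are made up for illustration, and the two literals are simply the currency name strings quoted above (save the file as UTF-8 so the compiler reads them correctly):

```csharp
using System;
using System.Collections.Generic;

static class YehCheck {
    // Format each UTF-16 code unit as U+XXXX. Every character in these
    // strings is in the Basic Multilingual Plane, so one char is one code point.
    public static string CodePoints(string s) {
        List<string> parts = new List<string>();
        foreach (char c in s) {
            parts.Add("U+" + ((int)c).ToString("X4"));
        }
        return string.Join(" ", parts.ToArray());
    }

    // Return the indexes at which the two strings differ
    // (including any positions past the end of the shorter one).
    public static List<int> Differences(string a, string b) {
        List<int> result = new List<int>();
        int longest = Math.Max(a.Length, b.Length);
        for (int i = 0; i < longest; i++) {
            if (i >= a.Length || i >= b.Length || a[i] != b[i]) {
                result.Add(i);
            }
        }
        return result;
    }

    static void Main() {
        string builtIn = "روپيه";     // the ur-PK currency name quoted above
        string recommended = "روپیہ"; // the string recommended to me
        Console.WriteLine(CodePoints(builtIn));
        Console.WriteLine(CodePoints(recommended));
        foreach (int i in Differences(builtIn, recommended)) {
            Console.WriteLine("differs at index " + i);
        }
    }
}
```

This is only a side check on the strings; it has nothing to do with building the custom locale, and the ur-IN program above is unaffected by it.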

Anyway, just take the custom locale code, save it to a file as ur-IN.cs, and then compile it from a command prompt with the following command line:

csc /r:sysglobl.dll ur-IN.cs

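And once you do that and run the resulting ur-IN.exe (from an elevated prompt, since registering a custom culture needs administrator rights), the landscape in Regional and Language Options will change a little bit: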

And there we go! :-)
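If you would rather confirm the registration from code than from the control panel, something like the following sketch should do. The class name VerifyUrIn and the Describe helper are hypothetical names for illustration; the only framework pieces used are the CultureInfo constructor and properties. It asks for a culture by name and prints what it finds, or says so if the culture is not there:

```csharp
using System;
using System.Globalization;

static class VerifyUrIn {
    // Summarize a culture by name, or report that it is not available.
    public static string Describe(string name) {
        try {
            CultureInfo ci = new CultureInfo(name, false);
            return string.Format("{0} | {1} | LCID 0x{2:X4} | {3}",
                ci.EnglishName,
                ci.NativeName,
                ci.LCID,
                (1234.5m).ToString("C", ci)); // currency formatting with this culture
        } catch (ArgumentException) {
            // CultureNotFoundException derives from ArgumentException on newer
            // frameworks; older ones throw ArgumentException directly.
            return name + " is not available on this machine";
        }
    }

    static void Main() {
        Console.WriteLine(Describe("ur-IN")); // the custom locale, if registered
        Console.WriteLine(Describe("ur-PK")); // the built-in one, for comparison
    }
}
```

On a machine where the custom locale has been registered, I would expect the ur-IN line to show the "Urdu (India)" name set in the builder and, typically, the generic custom-locale LCID rather than a dedicated one. To undo the experiment, CultureAndRegionInfoBuilder.Unregister("ur-IN") in sysglobl.dll removes the registration again.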

Now ideally one would be able to use the reserved LCID value mentioned in those other articles, but that is not an option in this case.

But no solution is perfect....

Sometimes it really still is about opening it all up and getting out of the way, as best as we can....
