The evolving Story of Locale Support, part 8: [Finally] taking care of some [more] languages in Pakistan

by Michael S. Kaplan, published on 2011/11/15 07:01 -05:00, original URI:

Back in the middle of 2002, Abdul-Majid Bhurgri wrote a white paper for Microsoft entitled Enabling Pakistani Languages through Unicode.

It was not so much about Microsoft's own support of Pakistani languages, which if you go back to 2002 was fairly scant -- we supported an Urdu - Pakistan locale (added for Windows 2000) but with no specific sort order (meaning it had the sort order from the default table, intended primarily for the Arabic language), even though we knew there was a different collation. The white paper had a purer intent than that: a love of language, and a desire to see the right thing done with the languages of Pakistan. It was a great white paper, done as the site was just starting to think beyond the Middle East and look at languages in other parts of the world that shared the Arabic script.

Abdul-Majid's site can be found here.

The 35-page white paper contains the following table of languages and the number of people who speak them as their mother tongue:

Language Number of Speakers
Balochi 5,685,000
Balti 270,000
Brahui 2,000,000
Farsi 1,000,000
Hindko 2,500,000
Kashmiri 105,000
Khowar 223,000
Parkari 250,000
Pashto 11,100,000
Punjabi 30 to 45 m.
Saraiki 15 to 30 m.
Sindhi 17,000,000
Urdu 10,700,000

And some interesting text about written language in the country:

Major spoken languages of Pakistan are: Punjabi, Saraiki, Sindhi, Pashto, Urdu, Balochi, Hindko and Brahui. Of these, only Urdu, Sindhi, and Pashto have a standardized alphabet. There are very few written works available in these other languages. Speakers of these languages, if they ever need to write in their language, use the alphabet of some other major language (usually Urdu or Sindhi) in which they have been formally educated. For Punjabi, mostly Urdu alphabet and writing style is used because most of the Punjabis have received their schooling in Urdu. For Saraiki, Urdu as well as Sindhi alphabet is used because Saraiki is spoken in Punjab as well as Sindh. Balochi also does not have any standardized alphabet. Mostly Urdu, sometimes Farsi, and occasionally Sindhi alphabets are used for it. Situation of the remaining languages is not much different.

The paper explains some of the important distinctions about Arabic script vs. language to answer people who misunderstand that distinction, and perhaps more importantly provides some of the best text I have read explaining the Naskh/Nastaleeq difference, e.g.:

That Nastaleeq and not Naskh should be the writing style used for computers is also based on this misconceived “Nastaleeq or Naskh” notion – which in turn is an unfortunate legacy of Urdu word processing packages which supported one style or the other. So far as Unicode is concerned, for example word Pakistan would always comprise of characters Pay, Alef, Kaf, Seen, Tay, Alef and Noon.
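To make the point in that quote concrete, here is a minimal Python sketch that lists the code points behind the word. The exact spelling is my assumption, not something taken from the white paper: I am using the usual Urdu spelling, in which the "Kaf" is U+06A9 (which Unicode names KEHEH) rather than the Arabic KAF at U+0643. Whether the result is drawn in Naskh or Nastaleeq is purely up to the font; the stored text does not change.

```python
import unicodedata

# Assumed Urdu spelling of "Pakistan", written as escapes so that the
# code points are explicit. Using U+06A9 for "Kaf" is an assumption.
PAKISTAN = "\u067E\u0627\u06A9\u0633\u062A\u0627\u0646"

def describe(text):
    """Return (code point, Unicode character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(PAKISTAN):
    print(code_point, name)
```

Note that the Unicode character names (PEH, KEHEH, TEH) differ from the letter names used in the white paper (Pay, Kaf, Tay); they are just different labels for the same letters.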

The white paper also did some work to contrast the sort orders of Pashto, Sindhi, and Urdu.

Some time in 2004, the data in this paper, including its information about Urdu, was used as one of the sources for the Urdu collation that was finally added in Vista, many versions after the locale itself was added to Windows. It also served as supporting data for the different sort order for Pashto, which was being added for Afghanistan in Vista.

It was preferred over the document Michael Everson wrote about the languages of Afghanistan, because that document primarily used the Unicode Collation Algorithm tailoring syntax without word list examples (and we don't use the UCA). The sorts were comparable either way.
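For a rough feel of why a language-specific sort order matters here, consider a minimal Python sketch. It is not how Windows collation works, and the helper names (URDU_PARTIAL_ORDER, urdu_key) are made up for illustration. It relies on only one fact about the Urdu alphabet: the letter peh comes between beh and teh, even though its code point (U+067E) lies after the basic Arabic letters.

```python
# A deliberately tiny fragment of the Urdu alphabet order:
# alef, beh, peh, teh. A real collation covers far more than this.
URDU_PARTIAL_ORDER = ["\u0627", "\u0628", "\u067E", "\u062A"]
RANK = {ch: i for i, ch in enumerate(URDU_PARTIAL_ORDER)}

def urdu_key(word):
    """Hypothetical sort key: letters in the fragment sort by alphabet
    position; any other character sorts after them, by code point."""
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in word]

letters = ["\u062A", "\u067E", "\u0628", "\u0627"]  # teh, peh, beh, alef

by_code_point = sorted(letters)               # alef, beh, teh, peh
by_alphabet = sorted(letters, key=urdu_key)   # alef, beh, peh, teh

print([f"U+{ord(ch):04X}" for ch in by_code_point])
print([f"U+{ord(ch):04X}" for ch in by_alphabet])
```

A plain code point sort pushes peh to the end, which is the kind of mismatch you get when a locale has no sort order of its own.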

Anyway, fast forward to part 5 of the series you are reading now, which listed the locales being added to Windows 8. They include:

This is pretty exciting, since at one point Sindhi was being considered for Vista (but was ultimately not done).

I suspect that Abdul-Majid Bhurgri (who I was in contact with back in 2007 talking about Urdu and Sindhi) will be pleased to see Sindhi finally being added to Windows 8!

Interesting trivia about our support of Pakistan:

Our GEO location data for Pakistan includes the following data:

Neither name is wrong, and context varies, but of course locales don't have this notion of two kinds of names. If you look at the Developer Preview from //BUILD, we're still working out which name to use across these three different Pakistani locales. Don't worry, we'll figure it out (we had the same problem in Vista pre-release with whether to call China P.R.C. or People's Republic of China)....

A part of me wonders whether (with some 11,000,000 speakers) we will end up second-guessing the choice not to add a ps-PK (Pashto - Pakistan) locale, too? :-)

Stuart on 15 Nov 2011 8:28 AM:

"The sorts were comparable"? That's awfully meta!

Michael S. Kaplan on 15 Nov 2011 10:03 AM:

That's how I roll! :-)

Alex Cohn on 17 Nov 2011 9:37 AM:

I wonder how sort orders for these languages happened to diverge.

Michael S. Kaplan on 17 Nov 2011 10:56 AM:

Most often there is a phonetic/phonemic distinction, like in Lithuanian....

Fahim Mahmood Mir on 28 Nov 2011 12:29 AM:

Thank God if the Windows 8 have the Urdu Unicode Font or True Type Already Installed then it will Save us from a big headache of using and installing Third Party Software releases to Support Urdu.

Michael S. Kaplan on 28 Nov 2011 5:41 AM:

Check out Part 9 for the font!

