Subsets of subsets of subsets of subsets of subsets

by Michael S. Kaplan, published on 2006/11/30 03:26 -05:00, original URI:

The big master list of locales that Microsoft has assigned LCID values for is quite large and even includes ones like Yiddish (0x043d) that are unlikely to be added to the Windows locale list any time soon.

There is a subset of that big list, the official list of locales in Windows. I was going to post it till I saw that Kieran actually did already.

There is another not entirely matching subset of locales that are supported by Office for language support and document text tagging. Many locales in fact made it into the big list based on requests from Office.

Now there is a smaller subset of locales that Office supports proofing tools for. This to me is the coolest list since it is the one list with the power to help shape language in positive directions when things like standardized spellings are hard to come by.

There is another subset representing the locales supported by the .NET Framework (it is smaller now only because locales were not added to it to bring it up to the Vista list, but here is the list that it natively supports without Windows-only locales).

There is that weird subset/superset of locales supported by SQL Server for their collation support (subset because they folded many of them together and also because they did not update for Vista or even Server 2003, superset because they added a few collations to try to bring some up to the Server 2003 level), and then there is the subset supported by SQL Server's independent locale list.

Which brings me to Aldo Donetti's mail that he sent to me yesterday:

In SQL 2005 if you get the Collations and their LCIDs and group the latter, you’ll see there are approximately 46 LCIDs.

Now there’s a feature called Full Text index which also has some language support – but only for 16 LCIDs (+ a Neutral one)

So if anyone were to programmatically try to set the LCID on the Full Text index based on the DB/Table/Column collation, there’s a good 75% chance it will fail.

The list of languages supported for the Full text index is this:

  • Traditional Chinese
  • German
  • English
  • French
  • Italian
  • Japanese
  • Korean
  • Dutch
  • Swedish
  • Thai
  • Simplified Chinese
  • British English
  • Chinese (Hong Kong SAR, PRC)
  • Spanish
  • Chinese (Singapore)
  • Chinese (Macau SAR)
  • + the “Neutral” one

So maybe chances to fail are less than 75% given some of the most used languages are in there, but Hindi is missing and so is Hebrew, Arabic, Cyrillic, Turkish and a bunch of others (30 overall)
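To make Aldo's scenario concrete, here is a minimal Python sketch of the check a program would need before reusing a collation's LCID for a full-text index. The numeric LCIDs are the standard Windows values for the sixteen languages listed above; they are filled in here as an assumption, since the mail only gives names. The helper names are hypothetical, and falling back to the Neutral LCID (0) is just one possible policy, not something the mail prescribes. In SQL Server 2005 the real inputs would presumably come from `COLLATIONPROPERTY(name, 'LCID')` over `fn_helpcollations()` and from `sys.fulltext_languages`.

```python
# Full-text languages from the list above, keyed by LCID.
# Assumption: these are the standard Windows LCIDs for those names.
FULLTEXT_LCIDS = {
    1028: "Traditional Chinese",
    1031: "German",
    1033: "English",
    1036: "French",
    1040: "Italian",
    1041: "Japanese",
    1042: "Korean",
    1043: "Dutch",
    1053: "Swedish",
    1054: "Thai",
    2052: "Simplified Chinese",
    2057: "British English",
    3076: "Chinese (Hong Kong SAR, PRC)",
    3082: "Spanish",
    4100: "Chinese (Singapore)",
    5124: "Chinese (Macau SAR)",
}

NEUTRAL_LCID = 0  # the "Neutral" full-text language


def fulltext_lcid_for(collation_lcid: int) -> int:
    """Hypothetical helper: reuse the collation LCID when full-text
    supports it, otherwise fall back to the Neutral LCID."""
    if collation_lcid in FULLTEXT_LCIDS:
        return collation_lcid
    return NEUTRAL_LCID


def unsupported_share(collation_count: int, fulltext_count: int) -> float:
    """Fraction of collation LCIDs with no matching full-text language,
    assuming every full-text LCID is also a collation LCID."""
    return (collation_count - fulltext_count) / collation_count


if __name__ == "__main__":
    print(fulltext_lcid_for(1053))    # Swedish is on the list
    print(fulltext_lcid_for(1037))    # Hebrew is not, so Neutral is used
    print(unsupported_share(46, 16))  # 30 of roughly 46 LCIDs
```

With the approximate counts from the mail (about 46 collation LCIDs, 16 with full-text support), the unsupported share works out to 30/46, or roughly 65%, if every LCID is treated as equally likely.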

Not sure how many people in the SQL division knew about this, but I (and a bunch of people in DevDiv, including our VP) know about it now. :-( 

Notice how they weren't even upset that (for example) both Swedish and Finnish weren't on their list of SQL Server collations, but were unhappy about the lack of overlap between collation support and full text support? :-)

So the actual "missing" list is even bigger if you come at it from the .NET Framework point of view, especially in 2.0 and later where the .NET Framework picks up all that is in Windows even if they don't support it natively.

The languages for which Index Server/SQL Server Full Text Search have word breakers and stemmers are yet another subset, a theoretically open subset (since anyone can write one and the SDK tells people how) but in practice a mostly closed subset given how difficult it is to write a word breaker/stemmer.

I guess it is easy to look at this mess and wonder how interoperability works on these products at all, isn't it? :-)

Though the changes to bring Windows and the .NET Framework into sync with Vista and .NET 2.0 are a good first step. The next steps would be to try to bring Office and SQL Server into the fold if we can.

If we are lucky, the only ones that will always remain semi-open subsets are proofing tools and full text search indexes, since they are the two that are not just adding a compatibility layer but require unique and difficult work. :-)

I think the fact that there is apparently a VP who has been made aware of all this might help us though (any big cross-division effort needs a champion!), so maybe I need to follow up with Aldo on this!


This post brought to you by ऩ (U+0929, a.k.a. DEVANAGARI LETTER NNNA)

# Bertilo Wennergren on 30 Nov 2006 5:55 AM:

So when are you going to add an Esperanto locale? There are actually quite a few people using Esperanto in their computers. Probably many more than for some of the languages that are already in that list (Sorbian, Tamazight, Sami, Romansh, Occitan, Mapudungun, Sanskrit...). What those Esperanto people mostly need is actually a keyboard layout. Wouldn't be so hard, now, would it?

Most of the big Linux distros have Esperanto locales and keyboard layouts nowadays. Don't know about the Macs.

I promise we won't sue you (like the Mapudungun people did...).

