The weird, weird world of the SUBLANGID

by Michael S. Kaplan, published on 2005/05/17 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/17/418372.aspx

I think I may have said in the past that the SUBLANGID is an odd beast.

They are defined in the winnt.h header file in the SDK (and ntdef.h in the DDK). Here is an excerpt of the ones there (which mostly have the value of 1 or 2):

#define SUBLANG_DEFAULT                  0x01    // user default
#define SUBLANG_SYS_DEFAULT              0x02    // system default
#define SUBLANG_ARABIC_SAUDI_ARABIA      0x01    // Arabic (Saudi Arabia)
#define SUBLANG_ARABIC_IRAQ              0x02    // Arabic (Iraq)
#define SUBLANG_AZERI_LATIN              0x01    // Azeri (Latin)
#define SUBLANG_AZERI_CYRILLIC           0x02    // Azeri (Cyrillic)
#define SUBLANG_CHINESE_TRADITIONAL      0x01    // Chinese (Taiwan)
#define SUBLANG_CHINESE_SIMPLIFIED       0x02    // Chinese (PR China)
#define SUBLANG_CROATIAN_CROATIA         0x01    // Croatian (Croatia)
#define SUBLANG_DUTCH                    0x01    // Dutch
#define SUBLANG_DUTCH_BELGIAN            0x02    // Dutch (Belgian)
#define SUBLANG_ENGLISH_US               0x01    // English (USA)
#define SUBLANG_ENGLISH_UK               0x02    // English (UK)
#define SUBLANG_FRENCH                   0x01    // French
#define SUBLANG_FRENCH_BELGIAN           0x02    // French (Belgian)
#define SUBLANG_GERMAN                   0x01    // German
#define SUBLANG_GERMAN_SWISS             0x02    // German (Swiss)
#define SUBLANG_ITALIAN                  0x01    // Italian
#define SUBLANG_ITALIAN_SWISS            0x02    // Italian (Swiss)
#define SUBLANG_MALAY_MALAYSIA           0x01    // Malay (Malaysia)
#define SUBLANG_MALAY_BRUNEI_DARUSSALAM  0x02    // Malay (Brunei Darussalam)
#define SUBLANG_NORWEGIAN_BOKMAL         0x01    // Norwegian (Bokmal)
#define SUBLANG_NORWEGIAN_NYNORSK        0x02    // Norwegian (Nynorsk)
#define SUBLANG_PORTUGUESE               0x02    // Portuguese
#define SUBLANG_PORTUGUESE_BRAZILIAN     0x01    // Portuguese (Brazilian)
#define SUBLANG_SERBIAN_LATIN            0x02    // Serbian (Latin)
#define SUBLANG_SERBIAN_CYRILLIC         0x03    // Serbian (Cyrillic)
#define SUBLANG_SPANISH                  0x01    // Spanish (Castilian)
#define SUBLANG_SPANISH_MEXICAN          0x02    // Spanish (Mexican)
#define SUBLANG_SPANISH_MODERN           0x03    // Spanish (Modern)
#define SUBLANG_SWEDISH                  0x01    // Swedish
#define SUBLANG_SWEDISH_FINLAND          0x02    // Swedish (Finland)
#define SUBLANG_UZBEK_LATIN              0x01    // Uzbek (Latin)
#define SUBLANG_UZBEK_CYRILLIC           0x02    // Uzbek (Cyrillic)

Some of it boils down to that evil use of the word DEFAULT coming back to bite us you know where. After all, which SUBLANGID comes in what order is decided by an arbitrary combination of alphabetical order and historical assignment. If we did not define a named constant for each of the SUBLANGID==1 entries, the only name for that value would be SUBLANG_DEFAULT, which would imply that the first LCID in the series was somehow, you know, like the default, as opposed to the rest of the LCIDs in the series. In most cases, it isn't.

Of course if you ask me, the train already left the station for the ones that have no country in them. Which is to say that SUBLANG_SWEDISH, SUBLANG_PORTUGUESE, SUBLANG_ITALIAN, SUBLANG_GERMAN, and SUBLANG_FRENCH already sort of say something along those lines by not being SUBLANG_SWEDISH_SWEDEN, SUBLANG_PORTUGUESE_PORTUGAL, SUBLANG_ITALIAN_ITALY, SUBLANG_GERMAN_GERMANY, and SUBLANG_FRENCH_FRANCE, respectively. Don't they?

Might have saved a few lines in the header file, if nothing else....

Although sometimes the comments name a country (like Finland), other times they name a way of doing something in a country (like Mexican). So there is no need to read anything into the patterns; there are so many different ones that you can do almost anything and still be consistent with an entry that is already there.

I guess we could fix the comments to be more consistent -- there is no backcompat issue if we changed the one to Finnish or the other to Mexico; the code would still compile the same way.

Now they do have to all be in ASCII, since the C standard did not recognize the ability to have anything other than ASCII in header files. I guess that only hurts the comment on SUBLANG_NORWEGIAN_BOKMAL, but everyone can probably pretend it is actually "// Norwegian (Bokmål)" and call it a day. I tend to think of those comments as a lot of overhead to maintain, even if you ignore the occasional geopolitically sensitive issue....

Also, as far as I can tell, our Right Honourable Data Lady¹ never really dealt with these values directly over the years -- she always assigned LCIDs and some dev would do the math to fill in the header file (after getting the text to use as the comment, of course -- the best protection against the aforementioned geopolitical issues is involving people who understand them!).

Back in the Summer of 2001, I was asked by the Unicode Technical Committee to provide updated information for a comparative table of Language Codes and Country Codes. After consulting with various people at Microsoft it was decided that a list of SUBLANGID values as "country codes" was pretty much a useless idea, so we went with some country codes we had defined, instead.

Which are not useless in and of themselves -- they are mostly² just international dialing codes. Though I am guessing that if people are going to the Unicode site they are not looking for a number to dial. But they are at least more useful than a bunch of random numbers that, without the context of macros like MAKELANGID and information about the construction of LCIDs, serve no useful purpose ever. The country code is the same value as that returned by GetLocaleInfo with the LOCALE_ICOUNTRY LCTYPE. Though note the documentation for that constant gives us a cooler explanation:

LOCALE_ICOUNTRY Country/region code, based on international phone codes, also referred to as IBM country/region codes.

So maybe we can blame IBM, especially for that "mostly" part? :-)

Oh, never mind. It was just a thought....

Which reminds me that I should be having someone give us new CTRY_* constants for all of the new locales in countries that we have never had locales in before. Darn, I had forgotten about those. At least I know I was not the only one! :-)

Some of the codes are really not needed, like SUBLANG_LITHUANIAN and SUBLANG_KOREAN, since they are lone reeds and are really the only one we expect for the respective languages. But then again, we did not use SUBLANG_LITHUANIAN_LITHUANIA and SUBLANG_KOREAN_KOREA so someone knew that was where we were heading?

Though perhaps I am mistaken about that, who knows? It certainly means that the dozens of entries that are not included at all have a chance to feel slighted, though I hope they do not. I wish we could take some of them out, only we can't.

Anyway, we have a bunch of codes defined. We cannot ever undefine them (who knows where they may be used?). But we could certainly wait to define them until/unless we need them.

Now remember that for NLS, where there is no way to get data for neutrals, defining PRIMARYLANGID and SUBLANGID values and not LCID values is not entirely useful anyway. But those PRIMARYLANGID values are useful combined with SUBLANG_NEUTRAL for the sake of resource loading, and you can't just have those values dangling out there alone.

So in the end it all makes sense as to why they exist. But it took a long post like this one to give it enough context that one could say that. :-)

1 - There is a really funny story (IMHO) about how Cathy was "given" that title, but I will ask for her permission before telling it (I may ask her to tell it to me again to make sure I do not fumble it!)
2 - I think they are all international dialing codes, other than CTRY_CANADA, which is defined as 2 in winnls.h even though Canada's dialing code is 1.

This post brought to you by "ফ" (U+09ab, a.k.a. BENGALI LETTER PHA)

# Mihai on 17 May 2005 11:19 AM:

And still we cannot avoid the "evil DEFAULT" for LANG_JAPANESE :-)

# Michael S. Kaplan on 17 May 2005 5:10 PM:

Yes, but it is not evil there since it IS the default. :-)

# Michael S. Kaplan on 17 May 2005 5:15 PM:

Though I guess we could add the following:

#define SUBLANG_THEONEANDONLY 0x01 // sublang when there is only one sublang to choose

for use in such cases, huh? :-)
