CultureInfo subsetting attempts that suck

by Michael S. Kaplan, published on 2007/12/22 10:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/12/22/6833199.aspx

When you think about locales (or make that cultures for those of you who are into managed code), you are basically looking at, for lack of a better bit of terminology, buckets of information. Buckets that vary according to language or region or dialect or script or a combination of some or all of these things.

In Win32, people complained for years about how the locale "buckets" were simply too huge and that data should be cut up into smaller bundles. So two separate items were added, starting in the earliest version of the .NET Framework's System.Globalization namespace:

The RegionInfo class, whose intent is to take the data in locales that is specific to location and separate it out from the large bucket of locales/cultures, and
Neutral cultures, whose intent is to support only the properties within a CultureInfo object that are specific to language and separate them out from the large bucket of locales/cultures.

In the end, both of these intents met with both implementation flaws and an overarching fundamental design flaw that limits their usefulness in the globalization problem space.

First the technical flaws: many of the properties and methods supported by these objects (from CultureInfo.CompareInfo to RegionInfo.IsMetric to CultureInfo.Calendar to RegionInfo.NativeName to CultureInfo.TextInfo and more), objects that are intended to isolate language-specific and region-specific data, require the context of the language/region/script bundle known as a specific CultureInfo in order to provide appropriate data.

Basically they all return information from a "default" specific culture, and the results imply a full CultureInfo.
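
One way to see this for yourself is a small probe like the one below. It is only a sketch: the class name NeutralProbe and the helper DefaultSpecific are made up for illustration, and it assumes that CultureInfo.CreateSpecificCulture reports the same "default" specific culture described above. The values printed depend on the Framework and Windows version, so no output is shown.

```csharp
using System;
using System.Globalization;

public class NeutralProbe {
    // Hypothetical helper: the specific culture the framework associates
    // with a neutral culture name such as "sr".
    public static CultureInfo DefaultSpecific(string neutralName) {
        return CultureInfo.CreateSpecificCulture(neutralName);
    }

    public static void Main() {
        // "sr" names a language only, so it is a neutral culture.
        CultureInfo neutral = new CultureInfo("sr");
        Console.WriteLine("IsNeutralCulture: {0}", neutral.IsNeutralCulture);

        // A code page implies a script, which a bare language does not really pin down.
        Console.WriteLine("ANSI code page:   {0}", neutral.TextInfo.ANSICodePage);

        // The specific culture associated with the neutral name.
        Console.WriteLine("Default specific: {0}", DefaultSpecific("sr").Name);

        // RegionInfo has the mirror-image issue: a native name implies a language.
        RegionInfo region = new RegionInfo("US");
        Console.WriteLine("Region native name: {0}", region.NativeName);
    }
}
```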

You can look at these properties and methods and imagine the problems with assigning full CultureInfo specific information to them.

But rather than get into those, I'd like to avoid swinging at a pitch in the dirt.

We'll move on to the bigger problem, the conceptual one.

A locale has like maybe 150 pieces of information attached to it.

The exact number varies from version to version of Windows, but let's just call it 150 for the sake of argument.

Now there are maybe 10 of those tied specifically to language that do not need the context of a region or script to obtain the appropriate data.

And similarly, there are maybe 10 of those tied specifically to region/country that do not need the context of a language or script to obtain the appropriate data.

Now if the problem was that the one object was too big because it captured 150 pieces of info, then having it capture 130 pieces instead has not really solved the problem.

Add to that the fact that neutral cultures are in theory quite useful when it comes to resource loading, if you ignore their size, though the size is one of the reasons that it sucks for those purposes too (as I mentioned back in January of this year in Two things that suck about CurrentUICulture (Part 1, aka It's just too much!)).

Of course even for resource fallback it has limitations -- look how easily important issues like the underlying native name and code page change hugely as one follows the fallback chain with code like this:

using System;
using System.Globalization;

public class Test {
    public static void Main() {
        // Walk every specific culture known to the framework.
        foreach(CultureInfo ci in CultureInfo.GetCultures(CultureTypes.SpecificCultures)) {
            // Report only the cultures whose ANSI code page changes on the first fallback step.
            if(ci.TextInfo.ANSICodePage != ci.Parent.TextInfo.ANSICodePage) {
                Console.WriteLine("'{0}' is using '{1}', which is not the as its parent ({2}) is ({3}).",
                    ci.Name,
                    ci.TextInfo.ANSICodePage,
                    ci.Parent.Name,
                    ci.Parent.TextInfo.ANSICodePage);
            }
        }
    }
}

And the results make one sad that the code is written the way it is because of yet another problem that pops up:

'sr-Latn-CS' is using '1250', which is not the as its parent (sr) is (1251).
'az-Cyrl-AZ' is using '1251', which is not the as its parent (az) is (1254).
'uz-Cyrl-UZ' is using '1251', which is not the as its parent (uz) is (1254).
'bn-BD' is using '0', which is not the as its parent () is (1252).
'bs-Cyrl-BA' is using '1251', which is not the as its parent () is (1252).
'tg-Cyrl-TJ' is using '1251', which is not the as its parent () is (1252).
'mn-Mong-CN' is using '0', which is not the as its parent (mn) is (1251).
'prs-AF' is using '1256', which is not the as its parent () is (1252).
'sah-RU' is using '1251', which is not the as its parent () is (1252).
'mi-NZ' is using '0', which is not the as its parent () is (1252).
'ug-CN' is using '1256', which is not the as its parent () is (1252).
'ii-CN' is using '0', which is not the as its parent () is (1252).
'sr-Latn-BA' is using '1250', which is not the as its parent (sr) is (1251).
'ba-RU' is using '1251', which is not the as its parent () is (1252).
'ps-AF' is using '0', which is not the as its parent () is (1252).
'ne-NP' is using '0', which is not the as its parent () is (1252).
'am-ET' is using '0', which is not the as its parent () is (1252).
'iu-Cans-CA' is using '0', which is not the as its parent () is (1252).
'si-LK' is using '0', which is not the as its parent () is (1252).
'lo-LA' is using '0', which is not the as its parent () is (1252).
'km-KH' is using '0', which is not the as its parent () is (1252).
'bo-CN' is using '0', which is not the as its parent () is (1252).
'as-IN' is using '0', which is not the as its parent () is (1252).
'ml-IN' is using '0', which is not the as its parent () is (1252).
'or-IN' is using '0', which is not the as its parent () is (1252).
'bn-IN' is using '0', which is not the as its parent () is (1252).
'tk-TM' is using '1250', which is not the as its parent () is (1252).
'mt-MT' is using '0', which is not the as its parent () is (1252).
'bs-Latn-BA' is using '1250', which is not the as its parent () is (1252).

Now the ones marked in red (the five entries whose parent is a named neutral culture: sr, az, uz, or mn) were originally my main point, where the process of normal language resource fallback would massively change the identity of chosen resources, as entire code page shifts help to represent shifts in the underlying script -- Cyrillic to Latin, Mongolian to Cyrillic, Latin to Cyrillic.

But notice how the small number of these is overwhelmed by all of the Windows-only cultures that have no neutral culture to fall back on; they fall back directly to invariant, even though a neutral culture for many of these would make a lot of intuitive sense to use. And of course the code skips all of the Windows-only cultures that happen to be using code page 1252.
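
To catch the ones the code page test skips, a variation along the following lines might help. It is a sketch only: the class name InvariantParents and the method Find are hypothetical, and it treats "falls back directly to invariant" as "the parent's Name is empty", which is how the invariant culture shows up in the output above. The list it produces will differ from one Framework or Windows version to the next.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public class InvariantParents {
    // Hypothetical helper: names of specific cultures whose parent is the
    // invariant culture (empty name), regardless of code page.
    public static List<string> Find() {
        List<string> names = new List<string>();
        foreach (CultureInfo ci in CultureInfo.GetCultures(CultureTypes.SpecificCultures)) {
            if (ci.Parent.Name.Length == 0) {
                names.Add(ci.Name);
            }
        }
        return names;
    }

    public static void Main() {
        foreach (string name in Find()) {
            Console.WriteLine("'{0}' has no neutral culture to fall back on.", name);
        }
    }
}
```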

Looking at Windows-only cultures (discussed previously), this could be considered a very real limitation, one that I was not so down on previously but which I have changed my mind about as the limitations in resource fallback became clearer and clearer.

There is also that big name shift that Kieran mentioned in this post and Shawn mentioned in this post to think about.

It is all well and good to decide to change sr-SP-Cyrl to sr-Cyrl-CS in the name of consistency with international standards.

The reason that the standard wanted to do things differently was to embrace the fallback identity of locales like sr-Cyrl when one is using string parsing to figure out fallback. Trying to be consistent with their naming while being simultaneously inconsistent with the principal reasons for it, and occasionally even breaking people who would try and do it (like in NO doesn't mean maybe, and it certainly doesn't means NB!), kind of defeats some of the purpose behind being consistent when one takes the whole end-to-end scenario into mind, doesn't it?

It suggests that to rehabilitate neutral cultures, some changes would have to be made:

Add additional neutral cultures for the language-script cases when they exist, and
Come up with a solution for neutral fallbacks for Windows-only cultures -- by parsing and filling in from the parent if needed, but some solution so resource fallback works better.

Though there is no full rehabilitation given all the functionality in neutral cultures that will keep right on working even when it shouldn't, since it requires locale/culture-specific behavior.
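
The string parsing approach to fallback mentioned above can be sketched roughly as follows. This is an illustration, not how CultureInfo.Parent is actually implemented: NameFallback and Chain are hypothetical names, and the rule assumed here is simply "drop the last hyphen-separated piece until nothing is left". The point is that a language-script-region name passes through a language-script step, while the older language-region-script order passes through a language-region step and loses the script first.

```csharp
using System;
using System.Collections.Generic;

public class NameFallback {
    // Hypothetical helper: builds a fallback chain by trimming the last
    // hyphen-separated subtag, ending with "" for the invariant culture.
    public static List<string> Chain(string name) {
        List<string> chain = new List<string>();
        string current = name;
        while (current.Length > 0) {
            chain.Add(current);
            int cut = current.LastIndexOf('-');
            current = (cut < 0) ? "" : current.Substring(0, cut);
        }
        chain.Add("");
        return chain;
    }

    public static void Main() {
        // Script before region: the chain includes "sr-Latn".
        Console.WriteLine(string.Join(" -> ", Chain("sr-Latn-CS").ToArray()));
        // Region before script: the chain includes "sr-SP", with no script at all.
        Console.WriteLine(string.Join(" -> ", Chain("sr-SP-Cyrl").ToArray()));
    }
}
```

Each printed chain ends with an empty entry, which stands in for the invariant culture.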

All of the characters in Unicode have taken off for Grand Cayman for the Christmas holiday weekend
(they are staying at the Marriott Grand Cayman Beach Hotel in case you are there and are curious at all the characters hanging out by the pool!)

# Jeroen Ruigrok van der Werven on 23 Dec 2007 1:09 PM:

which is not the as its parent -> which is not the same as its parent, I reckon?

# Michael S. Kaplan on 23 Dec 2007 1:50 PM:

More or less. :-)

