Four exceptions to prove the rule

by Michael S. Kaplan, published on 2008/05/07 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/05/07/8464448.aspx

As a general rule, once a sort has been added to Windows, it cannot ever be removed.

But you have probably heard the expression that every rule has an exception, right?

Well this rule is so freaking important that it has four exceptions!

They are:

#1: Lithuanian Classic (0x0827)

This is a fun little order that you can in fact only see in Windows 98. Note how, just like Modern vs. Traditional Spanish, it was implemented with a new SUBLANGID rather than a SORTID (though by this time the SORTID did in fact exist).
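For anyone who wants to see the SUBLANGID vs. SORTID distinction in the numbers themselves, here is a minimal sketch that pulls the four LCIDs in this post apart. The bit layout mirrors the Win32 macros PRIMARYLANGID, SUBLANGID and SORTIDFROMLCID; the LcidParts class and its method names are made up for illustration and are not part of any API.

```csharp
using System;

// Hypothetical helper, named for this example only.
static class LcidParts
{
    // Low 10 bits of the LCID: the primary language (PRIMARYLANGID).
    public static int PrimaryLangId(int lcid) { return lcid & 0x3FF; }

    // Bits 10-15 of the LCID: the sublanguage (SUBLANGID).
    public static int SubLangId(int lcid) { return (lcid & 0xFFFF) >> 10; }

    // Bits 16-19 of the LCID: the sort ID (SORTIDFROMLCID).
    public static int SortId(int lcid) { return (lcid >> 16) & 0xF; }

    public static string Describe(int lcid)
    {
        return string.Format("0x{0:x5}: primary=0x{1:x2} sublang=0x{2:x2} sort=0x{3:x}",
            lcid, PrimaryLangId(lcid), SubLangId(lcid), SortId(lcid));
    }

    static void Main()
    {
        // The four LCIDs discussed in this post.
        int[] lcids = { 0x0827, 0x10411, 0x10412, 0x20c04 };
        foreach (int lcid in lcids)
            Console.WriteLine(Describe(lcid));
    }
}
```

Run against the four values, 0x0827 comes back with sort ID 0 and sublanguage 2, while the other three carry a nonzero sort ID on top of an ordinary sublanguage.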

The sort was removed at the request of either the Lithuanian subsidiary representative or the Lithuanian government (I have gotten different stories from different people). You can see it even in .NET 2.0 if you run it on Windows 98 and get at it through the Windows-only culture support.

The removal was complete, for sure, and for certain -- its constant (SUBLANGID_LITHUANIAN_CLASSIC) was even removed from the header files -- a nice exception to another (even stronger) rule!

No other sort has this distinction, mind you.

I have hinted at it in blogs like The three stages of grief^H^H^H^H^Hcollation without defining it, because to be perfectly honest I do not know what the difference is (though I am reasonably certain that it is not connected with the CafePress Lithuanian Girls are Best Classic Thong).

Though I could be wrong about that, I guess....

There are definitely differences, though -- more than any other of the collations on this page, I believe.

Even the person who did the original work once told me she couldn't remember what it was....

And I would guess that if I installed Windows 98 and spent some time in there I'd be able to reverse engineer it.

I may do that eventually, though I am hoping someone who remembers this sort might volunteer the information first and save me the trouble.... :-)

This collation is supported on Windows 98 as I said, and in .NET when you run on Windows 98, and nowhere else.

#2: Japanese Unicode (0x10411)

I first mentioned this one back in December of 2004 in my And what about the Japanese (Unicode) sort? blog.

It has a simple (if stupid) purpose -- take the default table and make two changes to it:

Put U+005c (\) aka REVERSE SOLIDUS in the same place as U+00a5 (¥) aka YEN SIGN;
Put U+2015 (―) aka HORIZONTAL BAR in the same place as U+30fc (ー) aka KATAKANA-HIRAGANA PROLONGED SOUND MARK;

And then it does nothing else.

This looks about as Japanese as David Carradine (the guy from Kung Fu, who wasn't even Chinese, let alone Japanese).

This one is supported in Windows <= 2000. It is also supported in SQL Server and .NET, because even though Windows >= XP did not have it listed in its supported collations in the registry, it was never taken out of the source file for the collation data, and both SQL Server and .NET took that list as gospel and said amen.

#3: Korean Unicode (0x10412)

Now I first mentioned this one just prior to the similarly misguided Japanese Unicode sort, back in December of 2004, in the Whats up with the Korean (Unicode) sort? blog.

It too has a simple (if stupid) purpose -- take the default table and make one change to it:

Put U+005c (\) aka REVERSE SOLIDUS in the same place as U+20a9 (₩) aka WON SIGN;

and then it does nothing else.

This looks about as Korean as my blond ex-fiancé does (which is to say, not at all).
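Both the Japanese and the Korean Unicode sorts boil down to the same trick, which can be sketched in a few lines. The following is only an approximation, assuming that "in the same place" means the two characters end up with identical weights: it folds one character onto the other and then hands the strings to the invariant comparison, rather than touching any real collation table. FoldedSort and its members are hypothetical names, not something Windows or .NET provides.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper that imitates the two tailorings by folding characters.
static class FoldedSort
{
    // Japanese Unicode: the two changes described in the post.
    public static readonly Dictionary<char, char> JapaneseUnicode = new Dictionary<char, char> {
        { '\\', '\u00a5' },      // U+005C REVERSE SOLIDUS -> U+00A5 YEN SIGN
        { '\u2015', '\u30fc' },  // U+2015 HORIZONTAL BAR -> U+30FC PROLONGED SOUND MARK
    };

    // Korean Unicode: the single change described in the post.
    public static readonly Dictionary<char, char> KoreanUnicode = new Dictionary<char, char> {
        { '\\', '\u20a9' },      // U+005C REVERSE SOLIDUS -> U+20A9 WON SIGN
    };

    // Replace every mapped character with its partner; leave the rest alone.
    static string Fold(string s, Dictionary<char, char> map)
    {
        StringBuilder sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            char mapped;
            sb.Append(map.TryGetValue(c, out mapped) ? mapped : c);
        }
        return sb.ToString();
    }

    // Compare the folded strings with the invariant culture (a stand-in for the default table).
    public static int Compare(string a, string b, Dictionary<char, char> map)
    {
        return CultureInfo.InvariantCulture.CompareInfo.Compare(
            Fold(a, map), Fold(b, map), CompareOptions.None);
    }

    static void Main()
    {
        Console.WriteLine(Compare("C:\\temp", "C:\u00a5temp", JapaneseUnicode));
        Console.WriteLine(Compare("C:\\temp", "C:\u20a9temp", KoreanUnicode));
    }
}
```

Under that assumption, a path written with a backslash and the same path written with the yen sign (or the won sign, for Korean) compare as equal, and nothing else about the ordering changes.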

#4: China Hong Kong - Stroke (0x20c04)

This one is the weirdest one of all, and not just because the default Hong Kong China sort you get with 0x0c04 is already a stroke count sort anyway -- the same sort that is provided for Taiwan (which is technically not a stroke count sort but a Big5/CNS11643 based sort, which is stroke based within planes; Big5 itself is sorted by use frequency/stroke count/radical, which makes for a "sorta stroke count" sort!).

But no, 0x20c04 is weird for an entirely different reason.

You see, it is indistinguishable from the PRC stroke count order that 0x20804 returns for Simplified Chinese in China.

Let me repeat that since it seems vaguely important.

0x20c04 is indistinguishable from the PRC stroke count order that 0x20804 returns for Simplified Chinese in China.

It is a stroke count based on Simplified Chinese, even though Hong Kong is primarily a Traditional Chinese kind of place.

Which means as far as expectations go we may see the same problems here as we did with Chinese - Macau that I described in my blog How bad does it need to be in order to be not good enough, anyway?, though hopefully not with newly introduced bugs inherited from standards, as described in Every character has a story #31: U+272f0 from CJK Extension B, an ideograph that proves that every rose has its thorn! (aka It wasn't my fault, but [from the Windows standpoint] it was because of me....).

This collation is also supported by SQL Server (7.0 and 8.0, removed in SQL Server 2005 as mentioned by SQL Server developer Jun here) and SQL Server CE, for reasons that do not have much in the way of a good excuse, and for the same reason as the previous two in relation to things sitting in the source file....

Now if you take code like the following in .NET 2.0 you can see how the last three are still supported there:

        using System;
        using System.Globalization;

        class Program {
            static void Main(string[] args) {
                // The three removed sorts that are still accepted as LCIDs.
                int[] lcids = { 0x20c04, 0x10411, 0x10412 };
                for (int i = 0; i < lcids.Length; i++) {
                    // Blank line between entries, none after the last one.
                    if (i > 0) Console.WriteLine();
                    CultureInfo ci = new CultureInfo(lcids[i]);
                    Console.WriteLine("CultureInfo.Name             --> {0}", ci.Name);
                    Console.WriteLine("CultureInfo.LCID             --> 0x{0}", ci.LCID.ToString("x4"));
                    Console.WriteLine("CultureInfo.CompareInfo.Name --> {0}", ci.CompareInfo.Name);
                    Console.WriteLine("CultureInfo.CompareInfo.LCID --> 0x{0}", ci.CompareInfo.LCID.ToString("x5"));
                }
            }
        }

As it returns the following:

CultureInfo.Name             --> zh-HK
CultureInfo.LCID             --> 0x20c04
CultureInfo.CompareInfo.Name --> zh-HK_stroke
CultureInfo.CompareInfo.LCID --> 0x20c04

CultureInfo.Name             --> ja-JP
CultureInfo.LCID             --> 0x10411
CultureInfo.CompareInfo.Name --> ja-JP_unicod
CultureInfo.CompareInfo.LCID --> 0x10411

CultureInfo.Name             --> ko-KR
CultureInfo.LCID             --> 0x10412
CultureInfo.CompareInfo.Name --> ko-KR_unicod
CultureInfo.CompareInfo.LCID --> 0x10412
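If you want to find out whether the runtime in front of you still accepts these values, a small probe along the following lines should do. This is a sketch under the assumption that an unsupported LCID makes the CultureInfo constructor throw an ArgumentException (or a type derived from it); CultureProbe and IsSupported are names invented for this example.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, named for this example only.
static class CultureProbe
{
    // Returns true when the runtime can construct a culture for the LCID.
    public static bool IsSupported(int lcid)
    {
        try
        {
            new CultureInfo(lcid);
            return true;
        }
        catch (ArgumentException)
        {
            // Assumption: unknown or invalid LCIDs surface as ArgumentException
            // or one of its subclasses.
            return false;
        }
    }

    static void Main()
    {
        // The four LCIDs discussed in this post.
        int[] lcids = { 0x0827, 0x10411, 0x10412, 0x20c04 };
        foreach (int lcid in lcids)
            Console.WriteLine("0x{0:x5} --> {1}", lcid,
                IsSupported(lcid) ? "supported" : "not supported");
    }
}
```

What it prints will depend on the operating system and the .NET version, which is rather the point.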

And on that note, I will leave you to ponder these powerful, terrible, weird, and hopefully intriguing and interesting exceptions. :-)

This post brought to you by ― (U+2015, aka HORIZONTAL BAR)
