Call it Reversible Error, aka Yes it has no weight; it was supposed to have no weight!

by Michael S. Kaplan, published on 2010/06/11 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/06/11/10022741.aspx


.Net globalization's collation lives in an unusual world.

It was originally architected by a developer in Windows, based on the detailed design doc written largely by another Windows developer whom he reported to.

It more or less has the same data as Windows Server 2003 (plus or minus some Turkic stuff) in its first version (1.0), as well as every subsequent version for many years (up to and including 3.5, which still carried the 2.0 code).

Then add the Windows-only cultures introduced in 2.0 and later, which grab their data from the operating system if the OS has a locale for which .Net has no matching culture.

Now take code like this, based on something in a test:

using System;
using System.Threading;
using System.Globalization;

public class Test {
    public static void Main() {
        Console.WriteLine("\r\nWith en-US:");
        Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
        testing();
        Console.WriteLine("\r\nWith en-IN:");
        Thread.CurrentThread.CurrentCulture = new CultureInfo("en-IN");
        testing();
    }

    public static void testing() {
        Console.WriteLine("\u00AD");
        Console.WriteLine("test".StartsWith("\u00AD"));
        Console.WriteLine("test".EndsWith("\u00AD"));
        Console.WriteLine("test".IndexOf("\u00AD"));
        Console.WriteLine("test".LastIndexOf("\u00AD"));
        Console.WriteLine("test"[0] == "\u00AD"[0]);
        Console.WriteLine(string.IsNullOrEmpty("\u00AD"));
        Console.WriteLine("\u00AD".Length);
    }
}

And let's run it to see what we get using Windows 7 and .Net 3.5:

With en-US:
-
False
False
-1
-1
False
False
1

With en-IN:
-
True
True
0
4
False
False
1

Wow, look at that! The English (United States) culture uses the old tables built into .Net, while the English (India) culture uses the Windows tables.

Now in .Net 4.0, they both use the more up-to-date tables. So both cultures give the results that only en-IN gave when the code above ran on earlier versions.
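For code that must not depend on which collation tables are in play, one option (a sketch, not something prescribed here) is to ask for ordinal comparison explicitly. The class name OrdinalDemo is hypothetical; the StringComparison overloads are the standard ones available since .Net 2.0. Ordinal comparison looks only at UTF-16 code units, so the soft hyphen is treated as an ordinary character that "test" does not contain, whatever the current culture.

```csharp
using System;

public class OrdinalDemo {
    public static void Main() {
        string soft = "\u00AD"; // U+00AD SOFT HYPHEN

        // Ordinal comparisons never consult culture collation tables,
        // so these results do not vary with CurrentCulture or .Net version.
        Console.WriteLine("test".StartsWith(soft, StringComparison.Ordinal));
        Console.WriteLine("test".EndsWith(soft, StringComparison.Ordinal));
        Console.WriteLine("test".IndexOf(soft, StringComparison.Ordinal));
        Console.WriteLine("test".LastIndexOf(soft, StringComparison.Ordinal));
    }
}
```

The culture-sensitive overloads used in the test above remain the right choice when linguistic behavior is what you actually want.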

The reason for the change?

Well, I have talked about U+00AD, the SOFT HYPHEN, previously in blogs like this one.

In prior versions this character was weighted a lot like a regular hyphen:

...
0x002d   6 130   2   2   ;Hyphen-Minus
...
0x00ad   6 131   2   2   ;Soft Hyphen
...

I explain the "6" weight there in A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating). This is why I'm saying it is being weighted in a hyphen-ey way.

While in Vista and beyond, it is given no weight (ref: The jury will give this string no weight), though it does so on purpose in this case, because you aren't supposed to see it from a linguistic standpoint.
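One way to see the difference for yourself is to dump the sort key bytes that the framework produces for each character. This is only a rough sketch: the class name SortKeyDemo and the helper Dump are hypothetical, and the exact bytes printed depend on the .Net version and the operating system, so no particular output is claimed.

```csharp
using System;
using System.Globalization;

public class SortKeyDemo {
    public static void Main() {
        CompareInfo ci = new CultureInfo("en-US").CompareInfo;

        Dump(ci, "-");       // U+002D HYPHEN-MINUS
        Dump(ci, "\u00AD");  // U+00AD SOFT HYPHEN

        // If the soft hyphen carries no weight, this comparison returns 0.
        Console.WriteLine(ci.Compare("co\u00ADop", "coop"));
    }

    // Prints the raw sort key bytes for s as hyphen-separated hex.
    static void Dump(CompareInfo ci, string s) {
        SortKey key = ci.GetSortKey(s);
        Console.WriteLine(BitConverter.ToString(key.KeyData));
    }
}
```

With tables that give the soft hyphen no weight, its sort key should contain nothing beyond the separator and terminator bytes, while the hyphen-minus key carries actual weight values.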

So the change is expected -- the old behavior was a bug, an error.

And the error was considered reversible error, to use a legal term, and reversed in a major version update....

