Refusing to ignore some particular character's width isn't [always] an act of discrimination…

by Michael S. Kaplan, published on 2010/09/07 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/09/07/10058112.aspx

Despite what the title may suggest to anyone, this blog is not about the Kevin Smith/Southwest Airlines debacle in any way. Though now that I've mentioned it and since the title may make an inappropriate suggestion of my opinion on that matter, I'll say that in my opinion Kevin was totally right and Southwest Airlines was totally wrong, and still is totally wrong. As someone who is regularly allowed to violate the baggage maximums on all airlines with neither fee nor penalty charged due to what can only be described as reverse discrimination, I'd say the treatment of Kevin was arbitrarily obnoxious and I wouldn't want to fly them again either.

But as I said, this blog is not about that.

In the past, in blogs such as A few of the gotchas of CompareString and Sort the words, sort the strings and The problem of string comparisons, WORD sorts, and the minus that is treated like the hyphen and especially A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating), I have taken the issue of string sort vs. word sort and beaten it to death, practically.

I felt like I had to, really. Because this feature (the WORD sort) is the default in the linguistic collation functions, but is not very well understood.

By anyone.

Now another point I have made in the past, as in #11 of How to track down collation bugs, is that as a general principle, any time behavior differs between CompareString/CompareStringEx and LCMapString/LCMapStringEx with LCMAP_SORTKEY, deciding which one is wrong can vary, but in practice it is almost never the sort keys. Thus they make the best baseline for us, functionally.

The few exceptions to that (e.g. Everybody's doing the wraparound.... and the bug Atsushi Enomoto reported) are quite rare; the underlying point of A&P of Sort Keys, part 13 (About the function that is too lazy to get it right every time) is that the core nature of these functions results in that particular truth.

And sort key bugs are just pretty damn rare.

Now as if to try and prove that there is a difference between NEVER happens and SELDOM happens, a sort key bug happened in Windows 7 and .NET 4.0.

You can see the public report of that bug on the Connect site, in SortKey.Compare returns different results on .NET 4 than .NET 2, does not match CompareOptions.Compare:

string a = "－"; // U+FF0D FULLWIDTH HYPHEN-MINUS
string b = "-"; // U+002D HYPHEN-MINUS

CompareInfo compareInfo = CultureInfo.GetCultureInfo(1033).CompareInfo;
CompareOptions compareOptions = CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth;
Console.WriteLine(compareInfo.Compare(a, b, compareOptions));

SortKey sortKeyA = compareInfo.GetSortKey(a, compareOptions);
SortKey sortKeyB = compareInfo.GetSortKey(b, compareOptions);
Console.WriteLine(System.Globalization.SortKey.Compare(sortKeyA, sortKeyB));

ACTUAL RESULTS:
0
1

EXPECTED RESULTS:
0
0

Same output as .NET Framework 2

The problem?

I can reproduce the reported issue in Windows 7 (passing NORM_IGNORECASE | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH); weirdly, it is a genuine sort key bug. Basically, each string gets the exact same sort key whether you pass the flags or not. It is not ignoring the width, essentially, even when you ask it to. You simply always get the width included now, in this case.

The sort keys:

U+ff0d 01 01 01 01 ff ff 82 13 00
U+002d 01 01 01 01 ff ff 82 12 00

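Each character's key stays the same whether or not you pass NORM_IGNOREWIDTH or CompareOptions.IgnoreWidth to the native function or managed method, respectively, so the two keys still differ from each other (13 vs. 12 in the next-to-last byte).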

Initially I had hoped that setting the compatibility flag on an EXE with the code would fix the behavior, but this does not seem to be the case. :-(

The minimal repro is to just pass NORM_IGNOREWIDTH and don’t worry so much about the string comparison; it is the sort key generation that is broken.

Now although "－" (U+ff0d) vs. "-" (U+002d) has this problem where NORM_IGNOREWIDTH doesn't ignore the width, you don't see the same results looking at "ＡＢＣＤ" vs. "ABCD" (U+ff21 U+ff22 U+ff23 U+ff24 vs. U+0041 U+0042 U+0043 U+0044).

It really is just in the word sorting -- those characters in that table in A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating).
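If "width" feels like a fuzzy notion here, the following small Python sketch may help (Python only because it runs anywhere; it calls no Windows API). The names fold_width and FULLWIDTH_OFFSET are hypothetical, made up for this illustration. The sketch handles only the fullwidth ASCII block (U+FF01 through U+FF5E), which sits exactly 0xFEE0 above the ordinary ASCII characters, so it is a picture of what "ignoring width" means for the two examples above and not a description of how NORM_IGNOREWIDTH is implemented.

```python
# Illustration only: fold fullwidth ASCII variants onto their ordinary
# counterparts. This is NOT how Windows implements NORM_IGNOREWIDTH; it
# covers just the U+FF01..U+FF5E block used in the examples above.

FULLWIDTH_OFFSET = 0xFEE0  # U+FF01..U+FF5E minus this gives U+0021..U+007E


def fold_width(text: str) -> str:
    """Return text with fullwidth ASCII variants replaced by plain ASCII."""
    return "".join(
        chr(ord(ch) - FULLWIDTH_OFFSET) if 0xFF01 <= ord(ch) <= 0xFF5E else ch
        for ch in text
    )


if __name__ == "__main__":
    # U+FF0D FULLWIDTH HYPHEN-MINUS folds to U+002D HYPHEN-MINUS.
    print(fold_width("\uff0d") == "-")
    # U+FF21..U+FF24 fold to "ABCD".
    print(fold_width("\uff21\uff22\uff23\uff24") == "ABCD")
```

Under this picture both pairs are "the same apart from width", which is why one would expect an ignore-width comparison to treat them alike, and why it is notable that only the hyphen pair misbehaves.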

Of course one can work around the problem -- you can just use STRING sorting (i.e. SORT_STRINGSORT and CompareOptions.StringSort), since it is only the special case code for WORD sorting that has broken sort keys being created.

Here are the sort keys in these various cases:

U+ff0d 01 01 01 01 ff ff 82 13 00   // (No flags)
U+002d 01 01 01 01 ff ff 82 12 00   // (No flags)
U+ff0d 01 01 01 01 ff ff 82 13 00   // NORM_IGNOREWIDTH
U+002d 01 01 01 01 ff ff 82 12 00   // NORM_IGNOREWIDTH
U+ff0d 06 82 01 01 03 01 01 00      // SORT_STRINGSORT
U+002d 06 82 01 01 01 01 00         // SORT_STRINGSORT
U+ff0d 06 82 01 01 01 01 00         // NORM_IGNOREWIDTH | SORT_STRINGSORT
U+002d 06 82 01 01 01 01 00         // NORM_IGNOREWIDTH | SORT_STRINGSORT
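Since sort keys are meant to be compared as plain byte strings, the table above can be checked mechanically. Here is a minimal Python sketch that does so, using only the key bytes listed above (it does not generate any sort keys itself and makes no Windows calls). The names KEYS, compare_keys and compare_under are hypothetical helpers for this illustration, and the label "none" simply stands for the "(No flags)" rows.

```python
# Sort keys copied verbatim from the table above, keyed by
# (character, flags). "none" stands for the "(No flags)" rows.
KEYS = {
    ("U+ff0d", "none"): "01 01 01 01 ff ff 82 13 00",
    ("U+002d", "none"): "01 01 01 01 ff ff 82 12 00",
    ("U+ff0d", "NORM_IGNOREWIDTH"): "01 01 01 01 ff ff 82 13 00",
    ("U+002d", "NORM_IGNOREWIDTH"): "01 01 01 01 ff ff 82 12 00",
    ("U+ff0d", "SORT_STRINGSORT"): "06 82 01 01 03 01 01 00",
    ("U+002d", "SORT_STRINGSORT"): "06 82 01 01 01 01 00",
    ("U+ff0d", "NORM_IGNOREWIDTH | SORT_STRINGSORT"): "06 82 01 01 01 01 00",
    ("U+002d", "NORM_IGNOREWIDTH | SORT_STRINGSORT"): "06 82 01 01 01 01 00",
}


def compare_keys(a: bytes, b: bytes) -> int:
    """Ordinal byte comparison of two sort keys: returns -1, 0 or 1."""
    return (a > b) - (a < b)


def compare_under(flags: str) -> int:
    """Compare the U+ff0d key against the U+002d key for the given flags."""
    wide = bytes.fromhex(KEYS[("U+ff0d", flags)])
    narrow = bytes.fromhex(KEYS[("U+002d", flags)])
    return compare_keys(wide, narrow)


if __name__ == "__main__":
    for flags in (
        "none",
        "NORM_IGNOREWIDTH",
        "SORT_STRINGSORT",
        "NORM_IGNOREWIDTH | SORT_STRINGSORT",
    ):
        print(f"{flags}: {compare_under(flags)}")
```

With the keys as listed, only the NORM_IGNOREWIDTH | SORT_STRINGSORT pair compares equal; the NORM_IGNOREWIDTH pair on its own still compares as 1, matching the second line of the ACTUAL RESULTS in the Connect report.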

Now this is a pretty core regression, in behavior that has existed for as long as the functionality has.

I wouldn't tend to be too excited about the workaround, myself (it works, but WORD SORTING is there for a reason). I would instead prefer to look forward to it being fixed in some future version, though....

But in any case, and just to leave no loose ends: although I am not a lawyer, I am reasonably certain that this bug, and the fact that in its default configuration Microsoft Windows 7 refuses to ignore the width of certain "wider" characters, does not constitute discrimination on its part, in any way. It's just a bug that in all likelihood came out of a huge re-factoring that took place, and no one caught it prior to ship. If it bothers you, then keep in mind that you hadn't found it/reported it yet, either. :-)

John Cowan on 7 Sep 2010 7:45 AM:

Refactoring without testing? Shame, shame.

Michael S. Kaplan on 7 Sep 2010 8:49 AM:

Agree. There is always a risk in refactoring old code, when every feature and the interactions between them aren't completely understood. The risk of those missing edge cases....
