The problem of string comparisons, WORD sorts, and the minus that is treated like the hyphen

by Michael S. Kaplan, published on 2006/11/16 07:14 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/11/16/1086283.aspx


William Hooper pointed me to an interesting bug report:

Help Strings in .NET not always transisitve!!!!!!

We need someone to help us!!!!!!

>>>>>>>>>>>>>>>>>>>>>>

https://connect.microsoft.com/feedback/viewfeedback.aspx?FeedbackID=236900&wa=wsignin1.0&siteid=210

>>>>>>>>>>>>>>>>>>>>>>

   string s1 = "-0.67:-0.33:0.33";
   string s2 = "0.67:-0.33:0.33";
   string s3 = "-0.67:0.33:-0.33";
   Console.WriteLine(s1.CompareTo(s2));
   Console.WriteLine(s2.CompareTo(s3));
   Console.WriteLine(s1.CompareTo(s3));

Bug in the hypen logic when there are two hyphens in string...

Now I have talked about string sort vs. word sort in the past. And this is a problem which appears in both Windows and the .NET Framework, in all versions -- and has for all the versions I checked.

Even more interestingly, if you reverse all three comparisons the bug still happens (it is common in these kinds of issues when they come up for there to be a difference!).

Let's look at the sort keys to see how things look for the three strings:

0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 07 06 82 80 1b 06 82 00
0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 1b 06 82 00
0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 07 06 82 80 2f 06 82 00 

 Now let's focus on that end piece that contains the hyphen info, since the main parts of the string are all identical:

80 07 06 82 80 1b 06 82 00
80 1b 06 82 00
80 07 06 82 80 2f 06 82 00 

 Now obviously from a sort key standpoint, s1 < s3 < s2, always. Which is obviously superior to s1 > s2 > s3 > s1 any day of the week, and which points to comparison that has the bug in it -- the comparison of s1 vs s2:

CompareInfo ci = CompareInfo.GetCompareInfo(0x409);
string st1 = "-0.67:-0.33:0.33";
string st2 = "0.67:-0.33:0.33";
string st3 = "-0.67:0.33:-0.33"; 
SortKey sk1 = ci.GetSortKey(st1);
SortKey sk2 = ci.GetSortKey(st2);
SortKey sk3 = ci.GetSortKey(st3);

Console.WriteLine(ci.Compare(st1, st2));
Console.WriteLine(ci.Compare(st2, st3));
Console.WriteLine(ci.Compare(st1, st3));
Console.WriteLine();
Console.WriteLine(SortKey.Compare(sk1, sk2));
Console.WriteLine(SortKey.Compare(sk2, sk3));
Console.WriteLine(SortKey.Compare(sk1, sk3));

The results?

1
1
-1

-1
1
-1 

Definitely a bug in the string comparison, one that has been around for a long time. Perhaps a better reason than ever to keep in mind the importance of linguistically meaningful strings when one is using the NLS/NLS+ collation support? Clearly the intent of the word sort is to handle an entirely different class of usage of U+002d -- basically it is meant to handle the hyphen, not the minus. A meaningful comparison of these strings would have to be based on the underlying source of the numeric data.

Do we have an NLS tester in the house? Let's make sure that Connect bug makes its way to me when it gets ported. And I suppose we need one for v.Next of Windows, too....

It might even be worth treating U+2212 (MINUS SIGN) as different than U+002d (HYPHEN-MINUS), though as a mitigation it won't help since most methods of converting a minus to a string come up with U+002d, not U+2212. Just as most hyphens do not end up as U+2010 (HYPHEN). Though of course the best fit mappings for both characters usully puts them as U+002d. Changing all of those string creating definitions would probably create even more problems with compatibility. :-(

 

This post brought to you by - (U+002d, a.k.a. HYPHEN-MINUS)


Bart on 16 Nov 2006 10:16 AM:

So how will you work around this ?

Add code to check if a x2d is followed by 0-9 ? and treat it as a x2212 ?

Michael S. Kaplan on 16 Nov 2006 9:53 PM:

At some level of the code you know that you are passing non-linguistic strings; the key is to know when to not be calling CompareString or its ilk.

Oleg on 23 Nov 2006 11:17 AM:

Look at the sort keys for the three strings again. There are four 01 bytes in front of the end piece that contains the hyphen info.

According to my local MSDN documentation the sort key is stored as an array of byte values in the following format:

[all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 [all Special weights] 0x00

Online documentation (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_5s2v.asp) specifies another format:

[all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 [all Special weights] 0x01 0x00

Note one more 0x01.

Could it be that both documentation sources are incorrect and the format is:

[all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 0x01 [all Special weights] 0x00

Can you make it clear?

Michael S. Kaplan on 23 Nov 2006 10:49 PM:

Actually, it is just the punctuation weights are separate outside of the other special weights....

Oleg on 24 Nov 2006 6:16 AM:

Then, do I understand it correctly that the format is:

[all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 ["Other" Special weights] 0x01 [Punctuation weights] 0x00 ?

Can you give an example of the string that has "other" special weights piece in the sort key? Does punctuation weights piece contain information only about hyphens and apostrophes?

Michael S. Kaplan on 24 Nov 2006 9:04 AM:

Yes, when I talked about "punctuation weights" I was referring to those hyphen/apostrophe characters affected by word sorts. Those are not "special weights".

Special weights are something you'll see when you deal with a lot of Kana....

Oleg on 24 Nov 2006 11:26 AM:

Then example string could be SO-DIMMソケット×2.

The sort key for this string is:

0e 91 0e 7c 0e 1a 0e 32 0e 51 0e 51 22 16 22 0d

22 1c 22 1e 08 1c 0c 33 01 01 12 12 12 12 12 12

01 c6 c6 c4 ff 02 c4 c4 c4 c4 ff ff 01 80 0f 06

82 00

In this key "Special weights" piece is:

c6 c6 c4 ff 02 c4 c4 c4 c4 ff ff

and the "punctuation weights" piece is:

80 0f 06 82

And the LCMapString with the LCMAP_SORTKEY flag stores a sort key in the buffer, as an array of byte values in the following format:

[all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 [all Special weights] 0x01 [Punctuation weights] 0x00

"Punctuation weights" piece is specific for WORD sort and contains

information about hyphen/apostrophe characters.

Is this correct description now?

Michael S. Kaplan on 24 Nov 2006 1:41 PM:

Yes, it is.

stefang on 12 Aug 2009 4:09 AM:

I have just spent a whole day chasing a problem where data from an SQL server database was sorted differently by SSIS Sort and SQL Server 2005.

In both cases I have used the same collation (finnish_swedish_cs_as) Since this is a windows collation the sorts produced by SQL server should (according to SQL Server documentation) be compatible with sorts produced by CompareString.

After lots of investigation I have identified that the reason for the problem is that SSIS uses CompareString with the above bug, but that SQL server uses something else that has fixed the bug.

If you ever get a chance to fix this problem in CompareString, please make sure that the result is compatible with the SQL server implementation.

The strings i have had problems with are:

a="MX97_5_OE-SMC-24/45"

b="MX97_5_OESMC-24/45"

SQL server thinks that a<b while CompareString thinks that a>b

Michael S. Kaplan on 12 Aug 2009 11:40 AM:

Not exactly ontopic here, but version of Windows where the SSIS code is running would also be needed to pin this down....

stefang on 18 Aug 2009 10:46 AM:

I have tested this on XP SP2 and Server 2003 R2. SQL Server 2005

Michael S. Kaplan on 19 Aug 2009 1:32 AM:

The differences here between Windows and SQL Server are more or less expected since the SQLS tables are based on data from Windows 2000 pre-Beta2 and the XP/Server 2003 is based on data from a half  decade later that was not based on  non-shipping Windows collation data....

The most comprehensively compatible results between Windows and SQLS are to be found in Windows Server 2008 and SQL Server 2008, as the latter was updated and put largely into sync with the former.

stefang on 19 Aug 2009 11:19 AM:

Does this mean that this bug has been fixed on windows server 2008 ?

Michael S. Kaplan on 19 Aug 2009 2:31 PM:

Depends on your definition of "fixed" to be honest.

stefang on 20 Aug 2009 3:24 AM:

Thank you for the answers. This really makes the situation much clearer.

Sorry about the slightly offtopic comments.


referenced by

2010/09/07 Refusing to ignore some particular character's width isn't [always] an act of discrimination…

2010/06/10 WORD SORT...Why'd it have to be...WORD SORT?

2007/09/20 A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating)

2007/05/06 One product's feature is another product's bug -- just ask 'em!

2006/11/25 Punctuation... now, isn't that SPECIAL [weights] ?

go to newer or older post, or back to index or month or day