Is it a bug?

by Michael S. Kaplan, published on 2006/03/27 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/27/561195.aspx

Regular readers, you can think of this as part of the Sorting It All Out mid-term.

Basically we are looking at two calls to CompareString. The first is:

CompareStringW(0x0409, 0, L"Hello-Bob", -1, L"Hello Bob", -1)

which returns CSTR_GREATER_THAN, and the second is:

CompareStringW(0x0409, 0, L"-", -1, L" ", -1)

which returns CSTR_LESS_THAN.

I promise there are no "spoofing" characters or anything else unexpected in the strings; it is literally:

* a comparison of two almost identical strings, and
* a comparison of two substrings that represent the only differences between those two almost identical strings

The question -- is the difference between the two calls a bug? And if so, then which one is incorrect? And if not, then why?

Answers will be graded for accuracy, or short of that for how convincing the provided expository bullshit is, in an otherwise inaccurate answer....

(All posts will be moderated unless they do not give away the answers!)

Phylyp on 27 Mar 2006 5:52 AM:

Accuracy: 0% Bullshit: probably 100%

Is it the fact that the " " and "-" are in different Unicode categories (punctuation vs. separator)?

nat on 27 Mar 2006 6:52 AM:

it's not a bug i guess...

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winui/winui/windowsuserinterface/resources/strings/stringreference/stringfunctions/comparestring.asp

Typically, strings are compared using what is called a "word sort" technique. In a word sort, all punctuation marks and other nonalphanumeric characters, except for the hyphen and the apostrophe, come before any alphanumeric character. The hyphen and the apostrophe are treated differently than the other nonalphanumeric symbols, in order to ensure that words such as "coop" and "co-op" stay together within a sorted list.

If the SORT_STRINGSORT flag is specified, strings are compared using what is called a "string sort" technique. In a string sort, the hyphen and apostrophe are treated just like any other nonalphanumeric symbols. Their positions in the collating sequence are before the alphanumeric symbols.

Nick Lamb on 27 Mar 2006 9:17 AM:

A self-consistent, reproducible result from a localised collation function is a bug if and only if it conflicts with the reasonable expectations of the users of that locale in the context of the collation. If all the users disagree, you have a bug. If a substantial population of users disagree then in fact you have one or more new groups who might need a new locale.

You'll need some native English speaking Americans. My guess is that they don't care either way, and so the current collation is not a bug.

Maurits [MSFT] on 27 Mar 2006 1:14 PM:

There's no bug. This is a documented feature called "word sort". To avoid it, use SORT_STRINGSORT in dwCmpFlags.

http://tinyurl.com/t4k4
Typically, strings are compared using what is called a "word sort" technique. In a word sort, all punctuation marks and other nonalphanumeric characters, except for the hyphen and the apostrophe, come before any alphanumeric character. The hyphen and the apostrophe are treated differently than the other nonalphanumeric symbols, in order to ensure that words such as "coop" and "co-op" stay together within a sorted list.

"Hello-Bob" is word-sorted as "HelloBob", which comes AFTER "Hello Bob".
"-" is word-sorted as "", which comes BEFORE " ".
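
The two lines above can be sketched as a toy comparison in portable C. This is not the CompareString algorithm and does not use its weight tables. The names toy_rank and toy_word_compare are hypothetical. The ranking (punctuation before letters and digits, case ignored) and the tie-break are assumptions made only for illustration.

```c
#include <ctype.h>
#include <string.h>

/* Toy model only: NOT the real CompareString algorithm or its weight tables. */

/* Hyphen and apostrophe are the two characters that word sort sets aside. */
static int is_special(char c) { return c == '-' || c == '\''; }

/* Hypothetical rank: non-alphanumerics before letters and digits, case ignored. */
static int toy_rank(char c) {
    unsigned char u = (unsigned char)c;
    if (isalnum(u)) return 256 + tolower(u);
    return u;
}

static int count_specials(const char *s) {
    int n = 0;
    for (; *s; s++) if (is_special(*s)) n++;
    return n;
}

/* Returns <0, 0 or >0, like strcmp. */
int toy_word_compare(const char *a, const char *b) {
    const char *p = a, *q = b;
    for (;;) {
        /* First pass: hyphens and apostrophes are skipped entirely. */
        while (*p && is_special(*p)) p++;
        while (*q && is_special(*q)) q++;
        if (!*p || !*q) break;
        int d = toy_rank(*p) - toy_rank(*q);
        if (d != 0) return d;
        p++; q++;
    }
    if (*p) return 1;   /* a has characters left over, so it sorts later */
    if (*q) return -1;  /* b has characters left over, so it sorts later */
    /* Tie on everything else: only now do the skipped characters count. */
    int sa = count_specials(a), sb = count_specials(b);
    if (sa != sb) return sa - sb;
    return strcmp(a, b);
}
```

With this sketch, toy_word_compare("Hello-Bob", "Hello Bob") is positive because "B" is compared against the space, and toy_word_compare("-", " ") is negative because the hyphen contributes nothing in the first pass.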

Maurits [MSFT] on 27 Mar 2006 2:28 PM:

Some sort key weights...

"": 1 1 1 1
" ": 7 2 1 1 1 1
"-": 1 1 1 1 128 7 6 130
"ab": 14 2 14 9 1 1 1 1
"a b": 14 2 7 2 14 9 1 1 1 1
"a-b": 14 2 14 9 1 1 1 1 128 11 6 130

"a" is 14 2
"b" is 14 9
" " is 7 2

Note "a-b" sorts as "ab" plus a little bit (128 11 6 130)
Note "-" sorts as "" plus that same little bit (128 11 6 130)
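
Sort keys are meant to be compared byte by byte, so the listing above is enough to reproduce both results from the post. The sketch below copies the key bytes exactly as listed (decimal, without any terminator). The helper name compare_sort_keys and the array names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Compare two sort keys byte by byte; a key that is a strict prefix
   of the other sorts first. Returns <0, 0 or >0. */
int compare_sort_keys(const unsigned char *a, size_t alen,
                      const unsigned char *b, size_t blen) {
    size_t n = alen < blen ? alen : blen;
    int d = memcmp(a, b, n);
    if (d != 0) return d;
    return (alen > blen) - (alen < blen);
}

/* Key bytes copied from the listing above. */
const unsigned char KEY_SPACE[]  = { 7, 2, 1, 1, 1, 1 };                       /* " "   */
const unsigned char KEY_HYPHEN[] = { 1, 1, 1, 1, 128, 7, 6, 130 };             /* "-"   */
const unsigned char KEY_A_SP_B[] = { 14, 2, 7, 2, 14, 9, 1, 1, 1, 1 };         /* "a b" */
const unsigned char KEY_A_HY_B[] = { 14, 2, 14, 9, 1, 1, 1, 1, 128, 11, 6, 130 }; /* "a-b" */
```

The key for "-" starts with 1 where the key for " " starts with 7, so "-" sorts first. The key for "a-b" has 14 in its third byte where the key for "a b" has 7, so "a-b" sorts after "a b".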

Michael S. Kaplan on 27 Mar 2006 11:36 PM:

Ok, Maurits is the one who had the full answer, and nat is the runner up. :-)

Michael S. Kaplan on 27 Mar 2006 11:41 PM:

The trick that causes the difference here is that due to the WORD SORT rules (which apply to all languages, by the way!), the hyphen gets kind of put at the end -- so in actuality the space is being compared to the "B" and of course it will be less than, so string1 will be greater than string2....

Maurits [MSFT] on 28 Mar 2006 11:22 AM:

The hyphen-and-apostrophe rule makes me think this is for sorting names... so Jo-Anne comes next to Joanne, and O'Brady comes next to OBrady.

Oops in my previous comment... the "little bit" for "-" is 128 7 6 130, which is not quite the same as the "little bit" for "a-b" (128 11 6 130) (perhaps because of where in the string the hyphen occurs?)

There are primary weights, secondary weights, and tertiary weights:

primary weights: a < b
secondary weights: a < A
tertiary weights: "" < ' < -

These are visible in the sort keys:
a: (0e 02) 01 01 [] 01 01 {}
a': (0e 02) 01 01 [] 01 01 {80 0b 06 80}
a-: (0e 02) 01 01 [] 01 01 {80 0b 06 82}
A: (0e 02) 01 01 [12] 01 01 {}
A': (0e 02) 01 01 [12] 01 01 {80 0b 06 80}
A-: (0e 02) 01 01 [12] 01 01 {80 0b 06 82}

(primary weight) 01 01 [secondary weight] 01 01 {tertiary weight}

The x0101 is used as a boundary between weights because (guess mode = ON)
* a x00 would terminate the byte array and
* x0101 is LESS than any weight -- which is necessary to make shorter strings sort as "less than" longer strings with the same prefix (a < a')
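
If the guess above holds, and assuming 0x01 never occurs inside a weight, a sort key can be cut into its fields by splitting at every 0x01 byte. The sketch below does that for the "A-" key from the listing. The names key_fields and split_sort_key are hypothetical, and the sketch makes no claim about what each field is called.

```c
#include <stddef.h>

#define MAX_FIELDS 8

/* Offsets and lengths of the fields found in a key. */
typedef struct {
    size_t start[MAX_FIELDS];
    size_t len[MAX_FIELDS];
    size_t count;
} key_fields;

/* Split key[0..n) at every 0x01 byte; stop early at a 0x00 terminator.
   Assumption: 0x01 and 0x00 never appear inside a weight. */
key_fields split_sort_key(const unsigned char *key, size_t n) {
    key_fields f = { {0}, {0}, 0 };
    size_t begin = 0;
    for (size_t i = 0; i <= n; i++) {
        int at_end = (i == n) || key[i] == 0x00;
        if (at_end || key[i] == 0x01) {
            if (f.count < MAX_FIELDS) {
                f.start[f.count] = begin;
                f.len[f.count] = i - begin;
                f.count++;
            }
            begin = i + 1;
            if (at_end) break;
        }
    }
    return f;
}

/* "A-" from the listing above: 0e 02 01 01 12 01 01 80 0b 06 82 */
const unsigned char KEY_UPPER_A_HYPHEN[] = {
    0x0e, 0x02, 0x01, 0x01, 0x12, 0x01, 0x01, 0x80, 0x0b, 0x06, 0x82
};
```

Splitting that key gives five fields: two bytes (0e 02), an empty field, one byte (12), another empty field, and the four trailing bytes (80 0b 06 82).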

Michael S. Kaplan on 28 Mar 2006 11:27 AM:

Actually, primary weights are often alphabetic, secondary weights are often diacritic, tertiary weights are often case, and punctuation in a WORD sort is quaternary, along with other "special weights."

Maurits [MSFT] on 28 Mar 2006 11:49 AM:

a: (0e 02) 01 [] 01 {} 01 ? 01 <>
a': (0e 02) 01 [] 01 {} 01 ? 01 <80 0b 06 80>
á: (0e 02) 01 [0e] 01 {} 01 ? 01 <>
A: (0e 02) 01 [] 01 {12} 01 ? 01 <>
b: (0e 09) 01 [] 01 {} 01 ? 01 <>

(primary weight)
01
[secondary weight]
01
{tertiary weight}
01
?missing quaternary weight?
01
<quaternary or quintary weight>

primary weight: a < b
secondary weight: a < A
tertiary weight: a < á
quaternary weight: a < a' (in a WORD sort)

Hmmm... what goes where the question mark is?

Maurits [MSFT] on 28 Mar 2006 11:51 AM:

Oops, I switched secondary and tertiary weight... in my previous comment: A < á.

Maurits [MSFT] on 28 Mar 2006 12:05 PM:

> what goes where the question mark is:
Ah, here we go:
http://blogs.msdn.com/michkap/archive/2005/06/15/429279.aspx

Primary weight is "UW"
Secondary weight is "DW"
Tertiary weight is "CW"
Quaternary weight is "SW"
Quintary weight is unnamed (the sort key in that post has nothing between the last 01 and the final 00)

Alas, SW happens to be empty.

UW... I got nothing.
DW... "diacritic weight", n'est-ce pas?
CW... "case weight", perhaps?
SW... "something weight" (but WHAT?)
(anonymous)... "word sort weight", if you will

Maurits [MSFT] on 28 Mar 2006 12:12 PM:

Heh, re-reading the comments to that post, I missed this gem:
"some implementations assume the byte array is NULL terminated since it is doc-ed as such"

Maurits [MSFT] on 28 Mar 2006 12:41 PM:

OK, found UW and SW:
http://download.microsoft.com/download/d/1/8/d18fc51c-1f18-436e-8e1e-312a47353a77/DBA319.ppt

Slide 24:
[all Unicode] 0x01 [all Diacritic] 0x01 [all Case] 0x01 [all Special] 0x01 0x0

So U is Unicode, and S is Special.

Slide 26 has examples of strings with SW... but I can't copy them :'(

Maurits [MSFT] on 28 Mar 2006 12:48 PM:

Here's a screenshot (from PowerPoint Viewer 2003 on Windows 2000)

http://www.geocities.com/mvaneerde/sw-example.gif
