A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating)

by Michael S. Kaplan, published on 2007/09/20 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/20/5008305.aspx

Previous posts in this series:

Part 0: The empty string sorts the same in every language
Part 1: The law of the letter -- e.g. Latin < Greek < Cyrillic
Part 2: The string that won? Didn't have a mark on him!
Part 3: Should you let a string make it's case? If so, Y?
Part 4: It isn't a race but let's make an EXCEPTION and cross the Finnish line
Part 5: EXPANSIONing your horizons
Part 6: Relax, be calm, and deCOMPRESS if you are feeling out of sorts
Part 7: You're very thin now, but I can still recognize you
Part 8: You can often think of ignoring weights as a form of ignorance

Today's contribution to this series is one that has been the source of several misunderstandings.

It has also led to more than a few bug reports, though in truth most of the bugs were duplicates of the very same issue. :-)

It is about certain types of punctuation, and the impact of WORDSORT (as opposed to STRINGSORT) in linguistic comparisons.

I first talked about the feature in A few of the gotchas of CompareString and I will quote the relevant bit here:

SORT_STRINGSORT - Treat punctuation the same as symbols. For example, a STRING sort treats co-op and co_op as strings that should sort together since the hyphen and the underscore are both treated as symbols. On the other hand, a WORD sort treats the hyphen and apostrophe differently, so that co-op and co_op would not sort together but co-op and coop would. The real documentation for this is built into the winnls.h header file:

//
// Sorting Flags.
//
//    WORD Sort:    culturally correct sort
//                  hyphen and apostrophe are special cased
//                  example: "coop" and "co-op" will sort together in a list
//
//                        co_op     <------- underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        co-op     <------- hyphen (punctuation)
//                        cork
//                        went
//                        were
//                        we're     <------- apostrophe (punctuation)
//
//
//    STRING Sort: hyphen and apostrophe will sort with all other symbols
//
//                        co-op     <------- hyphen (punctuation)
//                        co_op     <------- underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        cork
//                        we're     <------- apostrophe (punctuation)
//                        went
//                        were
//

The reasons for this feature are fairly clear -- in many contexts, the default word sorting is more useful and much more intuitive.

Of course among the reasons that it is not always expected:

When - (U+002d, a.k.a. HYPHEN-MINUS) is being less of a hyphen and more of a minus, the results are not only unintuitive, they also cause interesting problems such as I discussed in The problem of string comparisons, WORD sorts, and the minus that is treated like the hyphen;
Sometimes, for example in situations like the one discussed in Sort the words, sort the strings, even for nominal hyphens the behavior can be less than intuitive (mainly when they are a part of symbolic identifiers);
When ' (U+0027, a.k.a. APOSTROPHE) is used as a variation of " (U+0022, a.k.a. QUOTATION MARK) then the fact that the behavior is intuitive for contraction usage is sullid by how unintuitive it is for other uses.

The most recent report of the first issue (by far the most common) was just a week ago, and it led to ond of the developers over in MSN concluding that:

It just shows how evil String.Compare is (for being completely counter-intuitive.)

Hard to argue with that!

It almost makes one want to do something smarter in the function to try and detect the two cases and handle them differently -- perhaps not too hard since the actual cases are so very different?

Plus it would be cool to add a SORT_SMARTSORT constant here. :-)

Let's take a step back and see what the sort keys say.

Remmeber that sort keys are only run on a single string and are thus not subject to the "a < b but b < a" type bugs that can lead to real problems. Although as a general principle any time transitivity is not there or when behavior is different between CompareString/CompareStringEx and LCMapString/LCMapStringEx with LCMAP_SORTKEY deciding which one is wrong can vary, in practice it is usually not the sort keys -- thus they make the best baseline for us, functionally¹.

(WS) -0.67:-0.33:0.33    0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 07 06 82 80 1b 06 82 00
(WS) -0.67:0.33:-0.33    0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 07 06 82 80 2f 06 82 00
(WS) 0.67:-0.33:0.33     0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 1b 06 82 00

(SS) -0.67:-0.33:0.33    06 82 0c 03 07 33 0c 7d 0c 90 07 37 06 82 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 00
(SS) -0.67:0.33:-0.33    06 82 0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 06 82 0c 03 07 33 0c 46 0c 46 01 01 01 01 00
(SS) 0.67:-0.33:0.33     0c 03 07 33 0c 7d 0c 90 07 37 06 82 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 00

Clearly, there is a specific preferred ordering for the sort keys based on the earlier hyphen trumping the later one, so in choosing which CompareString/CompareStringEx to prefer, one has a way to go. This is thankfully much less controversial of a decision than when one has to choose between CompareString/CompareStringEx and LCMapString/LCMapStringEx with LCMAP_SORTKEY, since in those cases clients like SQL Server are already contending with index corruption due to the "not so very transitive" results, so as long as one is correct and the other is treated as a bug, they can fix it without waiting for new collations, etc.

At this point, many people thinking about the fillers one sees in the DW and CW values will wonder exactly how these punctuation weights are being defined. I'll give you a very big hint:

-012345     0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 07 06 82 00
0-12345     0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 0b 06 82 00
012-345     0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 13 06 82 00
0123-45     0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 17 06 82 00
01234-5     0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 1b 06 82 00
012345-     0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 1f 06 82 00

See what is going on? It still is acting as a filler -- it is just compressing it some. This is much less feasible to do for DW and CW portions of the weight given the additive nature of the values in them, but at least you know punctuation does not waste too much space in sort keys!

There are other characters that have this special weight value; to find them all, you could grab the sort key of each character from 0x0000 to 0xFFFF when the SORT_STRINGSORT flag is included -- these "special punctuation" values all have a first byte value of 06.

In Vista, these values (in weight order, with their other weights present as well) are:

CODEPOINT SM   AW   DW   CW   COMMENT
0x0001    6    3    2    2    ;Start Of Heading
0x0002    6    4    2    2    ;Start Of Text
0x0003    6    5    2    2    ;End Of Text
0x0004    6    6    2    2    ;End Of Transmission
0x0005    6    7    2    2    ;Enquiry
0x0006    6    8    2    2    ;Acknowledge
0x0007    6    9    2    2    ;Bell
0x0008    6    10   2    2    ;Backspace
0x000e    6    11   2    2    ;Shift Out
0x000f    6    12   2    2    ;Shift In
0x0010    6    13   2    2    ;Data Link Escape
0x0011    6    14   2    2    ;Device Control One
0x0012    6    15   2    2    ;Device Control Two
0x0013    6    16   2    2    ;Device Control Three
0x0014    6    17   2    2    ;Device Control Four
0x0015    6    18   2    2    ;Negative Acknowledge
0x0016    6    19   2    2    ;Synchronous Idle
0x0017    6    20   2    2    ;End Of Transmission Block
0x0018    6    21   2    2    ;Cancel
0x0019    6    22   2    2    ;End Of Medium
0x001a    6    23   2    2    ;Substitute
0x001b    6    24   2    2    ;Escape
0x001c    6    25   2    2    ;File Separator
0x001d    6    26   2    2    ;Group Separator
0x001e    6    27   2    2    ;Record Separator
0x001f    6    28   2    2    ;Unit Separator
0x007f    6    29   2    2    ;Delete
0x0027    6    128 2    2    ;Apostrophe-Quote
0xff07    6    128 2    3    ;Fullwidth Apostrophe-Quote
0x07F4    6    129 2    2    ;NKO HIGH TONE APOSTROPHE
0x07F5    6    129 20   2    ;NKO LOW TONE APOSTROPHE
0x002d    6    130 2    2    ;Hyphen-Minus
0xff0d    6    130 2    3    ;Fullwidth Hyphen-Minus
0xfe63    6    130 2    8    ;Small Hyphen-Minus
0x2212    6    131 2    2    ;Minus Sign
0x208b    6    131 2    4    ;Subscript Hyphen-Minus
0x207b    6    131 2    14    ;Superscript Hyphen-Minus
0x2010    6    132 2    2    ;Hyphen
0x058a    6    132 21   2    ;Armenian Hyphen
0x2011    6    133 2    2    ;Non-Breaking Hyphen
0x2027    6    134 2    2    ;Hyphenation Point
0x2043    6    135 2    2    ;Hyphen Bullet
0x2012    6    136 2    2    ;Figure Dash
0x2013    6    137 2    2    ;En Dash
0xfe32    6    144 2    12    ;Glyph For Vertical En Dash
0x2014    6    144 21   2    ;Em Dash
0xfe58    6    144 21   8    ;Small Em Dash
0xfe31    6    144 21   12    ;Glyph For Vertical Em Dash
0x2015    6    146 2    2    ;Quotation Dash
0x301c    6    147 2    2    ;Wave Dash
0x3030    6    148 2    2    ;Wavy Dash

The control characters are there for compatibility with prior versions and are used by some very low level pieces of Windows as sentinels (when we triedd to change the values we qwuickly made the system unbootable!).

The only other thing to note is that 08 case weight bug I mentioned in Part 3 is here too for a few characters. Did anyone from the NLS test team put that bug in yet? :-)

As a bonus, can anyone explain why it doesn't matter in the default case, and under what combination of circumstances it would matter? I'll put the answer in a comment eventually if no one else figures it out....

1 - The de facto decision to consider sort key results to be definitive is quite ironic given that SQL Server² uses CompareString to build its sort keys.
2 - It is fascinating though perhaps not unexpected to note that the single biggest reporter of CompareString anomalies is in fact SQL Server.

This post brought to you by 9 (U+0039, a.k.a. DIGIT NINE)

no comments

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2010/09/07 Refusing to ignore some particular character's width isn't [always] an act of discrimination…

2010/06/11 Call it Reversible Error, aka Yes it has no weight; it was supposed to have no weight!

2010/06/10 WORD SORT...Why'd it have to be...WORD SORT?

2008/08/21 A&P of Sort Keys, part 14: The Hangul is really getting OLD

2008/01/14 On reversing the irreversible (The Set-Up)

2007/10/30 EXPECTED is in the eye of the [non-expecting type of ]expectant

2007/10/09 A&P of Sort Keys, part 13 (About the function that is too lazy to get it right every time)

2007/10/08 A&P of Sort Keys, part 12 (aka Han sorts first!)

2007/09/24 A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

2007/09/21 A&P of Sort Keys, part 10 (aka I've kana wanted to start talking about Japanese)

go to newer or older post, or back to index or month or day