If you ask the average person "Which comes first, '=' or '_' ?" they will stare at you blankly. With good reason.

by Michael S. Kaplan, published on 2010/02/21 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/02/21/9966915.aspx


Some questions can really take you back, you know?

Like the other day, when someone asked the following question on a programming alias:

I am using string.CompareTo to compare two strings like “status=abc” and “status_includes=abc”. The result indicates “status=abc” is greater than “status_includes=abc”. However, on ASCII table ‘=’ is before ‘_’.  Did I misunderstand string.CompareTo?

Reminds me of the old days!

Now of course if you ask the average person on the street whether "A" comes before "B" you will get a reasonably consistent answer.

But most of them, when asked to give their opinion on whether "=" comes before "_", will simply give you a blank look.

And let's face it, they are right.

Clearly the kind of very technical people (aka Geeks or Nerds) who would even say things about the "ASCII table" and the order of things in it are a subset of all of the people in the world. Writing stuff to be intuitive for them rather than for everyone else in the world wouldn't make a lot of sense.

Being a geek who looks at sorting and believes the default behavior should match the ASCII table is obscure enough that you might almost class it as the professional equivalent of a fetish. :-)

Now by default, the method in question uses CurrentCulture for comparisons, but for the record these two characters will sort the same in every culture, including InvariantCulture, because none of them change the handling here.

You can actually go to the protocol docs to look at the source weights used (hint: see [MS-UCODEREF] in particular if you are one of those types of people!) but I'll save the normal type people among you some time and meaningfully subset the big table here:

;------------------------------------------------------------------------------------------------
;Windows NT 4.0 through Windows Server 2003 Sorting Weight Table
;This file contains detailed character weight specifications that permit consistent sorting and
;comparison of Unicode strings.  The data is not used by itself but is used as one of the
;inputs to the comparison algorithm.
;------------------------------------------------------------------------------------------------
...
...
...
0x0038 12 162 2 2 ;Digit Eight
0x0039 12 180 2 2 ;Digit Nine
0x003a 7 55 2 2 ;Colon
0x003b 7 58 2 2 ;Semicolon
0x003c 8 14 2 2 ;Less-Than Sign
0x003d 8 18 2 2 ;Equals Sign
0x003e 8 20 2 2 ;Greater-Than Sign
0x003f 7 60 2 2 ;Question Mark
0x0040 7 62 2 2 ;Commercial At
0x0041 14 2 2 18 ;Latin Capital Letter A
...
...
...
0x005a 14 169 2 18 ;Latin Capital Letter Z
0x005b 7 63 2 2 ;Opening Square Bracket
0x005c 7 65 2 2 ;Backslash
0x005d 7 66 2 2 ;Closing Square Bracket
0x005e 7 67 2 2 ;Spacing Circumflex
0x005f 7 68 2 2 ;Spacing Underscore
0x0060 7 72 2 2 ;Spacing Grave
0x0061 14 2 2 2 ;Latin Small Letter A
...
...
...

And there you have it.

Letters have a "SCRIPT MEMBER" of >= 14, while regular punctuation tends to be 7, and mathematical stuff tends to be 8.

And given those groupings, even a Nerd would treat U+003D (aka EQUALS SIGN) as a mathematical sign and U+005F (aka LOW LINE, aka Spacing Underscore) as being in general punctuation.
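To make the table lookup concrete, here is a minimal Python sketch (not the actual Windows or .NET implementation) that compares the two strings from the question at their first differing character. It assumes the first number in each table row is the script member and the second is the weight within that group; the dictionary holds only a few rows copied from the excerpt above, and the function names are made up for illustration.

```python
# A few rows copied from the table excerpt above.
# Assumption: column 1 is the "script member", column 2 the weight inside that group;
# the last two columns are carried along but not used in this sketch.
WEIGHTS = {
    "<": (8, 14, 2, 2),
    "=": (8, 18, 2, 2),
    ">": (8, 20, 2, 2),
    "^": (7, 67, 2, 2),
    "_": (7, 68, 2, 2),
    "`": (7, 72, 2, 2),
}


def first_difference(a, b):
    """Return the index of the first position where a and b differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


def table_compare(a, b):
    """Hypothetical helper: compare using the first two weight columns only.

    A simplification: it looks only at the first differing character, and that
    character must be one of the rows in WEIGHTS (otherwise a KeyError is raised).
    """
    i = first_difference(a, b)
    if i is None:
        # One string is a prefix of the other (or they are equal): shorter first.
        return (len(a) > len(b)) - (len(a) < len(b))
    wa, wb = WEIGHTS[a[i]][:2], WEIGHTS[b[i]][:2]
    return (wa > wb) - (wa < wb)


def ordinal_compare(a, b):
    """Compare by code point, which is what the 'ASCII table' intuition expects."""
    return (a > b) - (a < b)


if __name__ == "__main__":
    left, right = "status=abc", "status_includes=abc"
    print(table_compare(left, right))    # positive: '=' (8, 18) sorts after '_' (7, 68)
    print(ordinal_compare(left, right))  # negative: 0x3D comes before 0x5F
```

The two functions disagree for exactly the reason given above: the table puts the equals sign in group 8 and the underscore in group 7, while the code points run the other way.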

The decision here of how to group them, whether to group them, and in what order to group them (by choosing a number, one chooses an order) is to some degree arbitrary but now has been present for so long that it is not really going to change.

Until the Geek Culture is created; clearly that locale will use the Ordinal sort anyway, which is not only more intuitive but will actually tend to be faster!

For the original question, the developer was pointed to the overload that accepts a StringComparer and was able to pass StringComparer.Ordinal to solve the problem, though I personally would have recommended splitting these strings into name/value pairs (which, even without knowing what they are, they clearly are) and then sorting the names -- which would give the right answer for all cases, in both the existing Cultures and the faux one I posited!
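The name/value suggestion can be sketched in a few lines of Python. This is only an illustration of the idea, not the developer's actual code: `split_pair` is a hypothetical helper, and it assumes each string contains a single "=" separating the name from the value.

```python
def split_pair(item):
    """Split 'name=value' at the first '=' and return (name, value)."""
    name, _, value = item.partition("=")
    return name, value


if __name__ == "__main__":
    items = ["status_includes=abc", "status=abc"]
    # Sort on the (name, value) tuple so that '=' never gets compared with '_'.
    for item in sorted(items, key=split_pair):
        print(item)
```

Once the names are pulled out, "status" is simply a prefix of "status_includes", so it sorts first without the relative order of the two punctuation characters ever being consulted.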



