Sort the words, sort the strings

by Michael S. Kaplan, published on 2006/05/04 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/05/04/589656.aspx


Last night a colleague asked me via email:

Hi Michael,

I have the following line 

        Console.WriteLine("A > B ? = {0}", CultureInfo.CurrentCulture.CompareInfo.Compare("kok-in", "ko-kr", CompareOptions.None));

And current culture is en-US

This line prints

A > B ? = -1

Which means "kok-in" < "ko-kr"

I think this is wrong. Right?

I have actually talked about string sorts vs. word sorts in the past in posts like this one, though it raises an interesting question in terms of usability.

The easy argument? Well, in that CompareInfo.Compare call above, just use CompareOptions.StringSort rather than CompareOptions.None.
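As a rough sketch of that change, the two calls can be put side by side. The class name here is made up for illustration, and the culture is pinned to en-US to match the email rather than read from CurrentCulture. The actual numbers depend on the collation data the runtime uses (the -1 above is what my colleague saw for CompareOptions.None), so no output is claimed here.

```csharp
using System;
using System.Globalization;

class CompareDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        // Pin the culture so the result does not depend on the machine's settings.
        CompareInfo ci = CultureInfo.GetCultureInfo("en-US").CompareInfo;

        // Word sort (the default): the hyphen gets special, low-weight treatment.
        int wordSort = ci.Compare("kok-in", "ko-kr", CompareOptions.None);

        // String sort: the hyphen is treated like any other symbol.
        int stringSort = ci.Compare("kok-in", "ko-kr", CompareOptions.StringSort);

        Console.WriteLine("word sort   = {0}", wordSort);
        Console.WriteLine("string sort = {0}", stringSort);
    }
}
```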

However, in this particular case, the behavior was causing a problem when using SortedList with .NET Framework culture names as keys. The scenarios for which the word sort behavior is intuitive do not include anything like hyphenated tokens of 2-3 characters each. And although the string sort behavior works better here (since treating the hyphen as a symbol gives the behavior most would expect in this scenario), there is a problem with that approach: the SortedList class does not give an easy way to apply the flag. To use it, you have to do something like implement an IComparer with a Compare call that uses the flag, which is hardly the most common operation in the world....
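For what it is worth, a minimal sketch of such a comparer might look like the following. It assumes the generic SortedList<TKey, TValue> (the non-generic SortedList takes a non-generic IComparer in the same way); OptionsComparer and SortedListDemo are hypothetical names, and the values stored in the list are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: compares strings with a fixed culture and CompareOptions.
sealed class OptionsComparer : IComparer<string>
{
    private readonly CompareInfo _compareInfo;
    private readonly CompareOptions _options;

    public OptionsComparer(CultureInfo culture, CompareOptions options)
    {
        _compareInfo = culture.CompareInfo;
        _options = options;
    }

    public int Compare(string x, string y)
    {
        // Every comparison the list makes goes through the chosen flag.
        return _compareInfo.Compare(x, y, _options);
    }
}

class SortedListDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        var comparer = new OptionsComparer(
            CultureInfo.GetCultureInfo("en-US"), CompareOptions.StringSort);

        // The comparer is handed to the constructor; it cannot be changed later.
        var cultures = new SortedList<string, string>(comparer);
        cultures.Add("kok-in", "placeholder value 1");
        cultures.Add("ko-kr", "placeholder value 2");

        foreach (KeyValuePair<string, string> pair in cultures)
        {
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
        }
    }
}
```

The point of wrapping the flag in a class is that SortedList only ever sees an IComparer, so the sort type has to travel inside it.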

This is made more complicated by the problems I pointed out previously in the post On approaching international programming....

Taking a step back, the names of the two sorts (string sort vs. word sort) do not give good clues as to what each one does, anyway. So without docs, it would be hard to figure out which one to use in the first place!

The fact that the SortedList does not contain easy ways to get at these settings makes it (in my opinion) just a little less sorted, if you know what I mean. :-)

 

This post brought to you by "A" (U+0041, LATIN CAPITAL LETTER A)


Maurits [MSFT] on 4 May 2006 11:48 AM:

So a word-sort of a language list would mix major language groupings!

string sorted list:

cd-aa
cd-ee
cd-ss

cde-bb
cde-mm

word sorted list:

cd-aa
cde-bb
cd-ee
cde-mm
cd-ss

Michael S. Kaplan on 4 May 2006 12:38 PM:

Well, that is why I was pointing out that this particular list needs a string sort! :-)

Maurits [MSFT] on 4 May 2006 12:45 PM:

I suppose another way around it would be to map all the two-character major-language codes to their three-character ISO 639 equivalents prior to sorting
http://en.wikipedia.org/wiki/List_of_ISO_639_codes

Then the hyphens would all be in the fourth position and have no impact on the sort.

Michael S. Kaplan on 4 May 2006 1:07 PM:

Actually, this would have two fairly blocking problems:

1) the two letter is not a 100% subset of the three letter;
2) The three letter would not create valid cultures in the cases where the change is made;

Maurits [MSFT] on 4 May 2006 2:00 PM:

> the two letter is not a 100% subset of the three letter

I don't think anyone's going to stick up for Serbo-Croatian now that Slobodan Milošević has passed on.  As a legacy measure, it could map to hbs per ISO 639-3

> The three letter would not create valid cultures in the cases where the change is made

That sounds like a bug, actually.  Surely the culture code should recognize the equivalence of "en" and "eng"?

Michael S. Kaplan on 4 May 2006 2:16 PM:

Sorry Maurits -- the culture names are identifiers, like LCIDs. I'll be blogging about this soon though, never fear. ;-)

referenced by

2010/09/07 Refusing to ignore some particular character's width isn't [always] an act of discrimination…

2007/09/20 A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating)

2007/05/06 One product's feature is another product's bug -- just ask 'em!

2006/11/16 The problem of string comparisons, WORD sorts, and the minus that is treated like the hyphen
