Comparison confusion: INVARIANT vs. ORDINAL

by Michael S. Kaplan, published on 2004/12/30 01:58 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/29/344136.aspx


There is a great deal of confusion surrounding the meaning of these two different things in the .NET Framework, and when to use each. If you have suffered, are suffering, or think you may suffer in the future from such confusion, then read on!

(Otherwise, I guess you can go away and come back another time)

The invariant culture's direct ancestor is the invariant locale. Officially added to the Windows source tree at 10:23am on May 12, 2001, its intention was not to be used as an actual locale (which would explain why no locale data was added until a month later; until then no one was using it in GetLocaleInfo!).

Originally, LOCALE_INVARIANT had just one noble purpose -- to allow one to use CompareString (and LCMapString with the LCMAP_SORTKEY flag) in a way that would only use the "Default" Windows sorting table as mentioned a little bit here and especially here. The results, as that second article mentioned, would not vary when the user or system locale settings did; they would be invariant within that installation of Windows.

The data was added for this locale a month later, as I said, for obvious reasons -- if you have an LCID that one function considers to be valid, you must have a very good reason if another will not. And it cannot duplicate any other locale, either. Much weird data was added so that no one would be tempted to try to act like they spoke a language called "Invariant" and then all was good.

Note that these string comparisons still had much linguistic value -- half of the locales in Windows use that default table, so an invariant sort would not only avoid varying, it would also look right to a lot of the world.

The .NET Framework had similar requirements (with the additional need for invariant parsing/formatting support) and thus CultureInfo.InvariantCulture was created. As with the locale, any string comparisons made with InvariantCulture's CompareInfo object would have linguistic validity in a lot of places, and would not vary within that installation of the .NET Framework.

So everyone had what they needed, right?

Well, no.

A bunch of people wanted a method of doing a more binary type of comparison, instead of one that would be based on the "linguistically appropriate" approach given a particular culture[1].

The difference between what we had and what they wanted was akin to the difference between the C Runtime's strcoll/wcscoll versus strcmp/wcscmp (in the CRT documentation they refer to the difference as being locale based versus lexicographic).

The other advantage to such a "lexicographic" comparison is that it would be faster since a simple binary comparison of the code point values was being used.

To meet this need, the notion of an Ordinal sort was added and an Ordinal member was added to the CompareOptions enumeration. Selecting it would ignore all of those cultural collation features and give you a binary sort that would also, incidentally, not vary.
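A rough illustration of what "binary" means here, written in Python rather than .NET: the sketch below compares two strings purely by the numeric values of their UTF-16 code units, with no cultural rules involved. The helper names utf16_units and ordinal_compare are hypothetical and exist only for this example; it is a sketch of the idea, not the Framework's actual implementation.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def ordinal_compare(a, b):
    """Hypothetical stand-in for an ordinal comparison: returns -1, 0, or 1."""
    ua, ub = utf16_units(a), utf16_units(b)
    # Lists compare element by element, so the first differing unit decides.
    return (ua > ub) - (ua < ub)


# Only the numbers matter: capital omega (U+03A9) lands before
# small omega (U+03C9), and "a" (0x61) lands after "B" (0x42).
omegas = sorted(["\u03c9", "\u03a9"], key=utf16_units)
```

Note that nothing in this comparison knows that "a" and "B" are letters of the same alphabet, which is exactly why the result does not vary and also why it is not a good fit for lists that people will read.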

The only remaining problem at this point was that there were now two useful ways to do these different "niche" types of comparisons, but neither name really jumps out at the developers who are looking for such solutions.

That problem remains to this day, though every single time I speak at a conference or answer a question in a newsgroup or get someone to look at posts like this one, then there is at least one less developer who has this problem. Maybe this time it is you? :-)

Now the story does not end here; many people have wanted to do things in a case-insensitive way. Of course if you wanted a case-insensitive invariant comparison then you could have done that all along -- just use the InvariantCulture's CompareInfo methods with the CompareOptions.IgnoreCase flag passed in. Easy!

But some people wanted a case-insensitive ordinal comparison?!?

Now the closet linguist in me shudders at this concept since a casing operation is essentially a linguistic one while an ordinal one is specifically not -- it's lexicographic.

So people are asking for linguistic non-linguistic support, a request that for me brings to mind the comedian Steven Wright's dog[2].

However, the technical half of me understands the need and so I got over my linguistic fetish as one of my colleagues on the BCL team worked in Whidbey to add a new OrdinalIgnoreCase member to the CompareOptions enumeration.

The behavior is basically to do the casing operation using the default casing tables prior to doing the binary comparison. This feature has been in the "Whidbey" version of the .NET Framework for some time (first checked into the source code tree on February 7, 2003), so you can try it out today if you have just about any build of Whidbey underfoot.
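The "case first, then compare the binary values" idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the Whidbey code: the function names are hypothetical, the choice of uppercasing (rather than lowercasing) is an assumption, and Python's own str.upper() stands in for the default casing tables. Characters whose uppercase form is more than one character are left alone, so that the mapping stays one-to-one.

```python
def simple_upper(ch):
    """Uppercase one character, leaving it unchanged if the mapping is not 1:1."""
    up = ch.upper()
    return up if len(up) == 1 else ch


def ordinal_ignore_case_compare(a, b):
    """Hypothetical sketch: fold case per character, then compare code points."""
    fa = [ord(simple_upper(c)) for c in a]
    fb = [ord(simple_upper(c)) for c in b]
    # After folding, the comparison is purely numeric: returns -1, 0, or 1.
    return (fa > fb) - (fa < fb)
```

With this sketch, "README.txt" and "readme.TXT" compare as equal, as do capital and small omega, while everything else about the ordering is still decided by the raw numbers.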

Hopefully this post will help clear up some of the confusion about these two interesting comparison types.

 

1 - What can I say? Some people are Некультурные (uncultured) though not in the culturally offensive sense.
2 - Steven Wright claimed to have named his dog Stay so that he could call out "Come here, Stay! Come here, Stay!" and watch the dog walk toward him in a stuttery fashion.

 

This post brought to you by "Ω" (U+03a9, GREEK CAPITAL LETTER OMEGA)
I talked to Omega just before this post went live. She said that as the last letter in the Greek alphabet (who was pretty much always therefore last in the queue), she understood the cost of keeping letters in order. Any performance benefit is a good one, to her mind. Especially since a binary sort would let her come before her little sister (U+03c9, GREEK SMALL LETTER OMEGA) for once.


# Panos Theofanopoulos on 30 Dec 2004 8:29 AM:

for 1000000 comparisons

Ordinal : 47 mseconds
InvariantCulture : 299 mseconds

OrdinalIgnoreCase : 188 mseconds
InvariantCultureIgnoreCase : 245 mseconds

This post brought to you by "Ώ" (U+038F, GREEK CAPITAL LETTER OMEGA WITH TONOS) which beats Ω, ω and ώ in a binary sort :-)

# Michael Kaplan on 30 Dec 2004 8:36 AM:

Heh heh heh -- see what I mean?

You do have to be careful with the perf testing in an actual app situation to use representative strings. The reason is that in both cases (ordinal *and* invariant) the code will exit as soon as it finds the winner.

In order (therefore) to get the most effective test you have to craft the strings to match the kind of comparisons you will see in your app.

I'll get more into how it works internally in a future post. :-)

# Michael Kaplan on 30 Dec 2004 8:48 AM:

BTW -- Panos, in answer to your suggestion to the MSDN Feedback Center:

http://lab.msdn.microsoft.com/ProductFeedback/viewfeedback.aspx?feedbackid=2682c87e-5c46-4697-bbed-6f0de0047a7d

True benchmarks for these methods under load, with indications of where the true slowdowns are, are the way to proceed here. IMHO, all perf. work has to be done that way....

# Norman Diamond on 4 Jan 2005 5:10 PM:

> Note that these string comparisons still had
> much linguistic value -- half of the locales
> in Windows use that default table, so an
> invariant sort would not only avoid varying,
> it would also look right to a lot of the
> world.

Isn't linguistic value counterproductive for the intended value of the invariant locale? Didn't you say in an earlier posting that people shouldn't use the invariant locale for human-oriented operations? Surely the way to discourage people from using the invariant locale for sorting is to make the results appear wrong for every known culture in the world?

# Michael Kaplan on 4 Jan 2005 5:20 PM:

> Isn't linguistic value counterproductive for the intended value of the invariant locale?

No, not at all. I have explained the exact purpose of the invariant locale, and that is NOT it.

> you say in an earlier posting that people shouldn't use the invariant locale for human-oriented operations?

No, I said that about ORDINAL comparisons.

> Surely the way to discourage people from using the invariant locale for sorting is to make the results appear wrong for every known culture in the world?

Well, no one is trying to discourage invariant sorting. So I have to reject the premise here.

# Norman Diamond on 4 Jan 2005 10:51 PM:

In page
http://weblogs.asp.net/michkap/archive/2004/12/08/278170.aspx
you said:
> The invariant locale is pretty weird. Let's
> take a look at its interesting
> characteristics.
[...]
> Its data was chosen to dissuade people from
> trying to use it as a locale. And that is
> putting it charitably.

That did not say ordinal comparisons, that said the invariant locale. Please look at your own posting and un-reject the premise here.

Now, to help dissuade people from trying to use the invariant locale as a locale, would it not have been beneficial to avoid giving it linguistic values? Would it not have been better if sort results were wrong for every human language?

# Michael Kaplan on 5 Jan 2005 12:01 AM:

It was intended to be used for sorting, we want it used for sorting. This was the REASON it was added. It will not change, and we would not want it to change even if we had the option.

When I spoke of trying to make it less appealing as a locale, I am referring to EVERYTHING ELSE beyond collation.

I hope these words are plain enough.

# Norman Diamond on 5 Jan 2005 4:38 PM:

Yes it is clearer this time.

Though I still think it would be better if the invariant locale's sort ordering differed from every human language so that no one would be tempted to use it for human-oriented sorting. For internal ordering of databases and finding hash keys and running scripts in a uniform manner it would still be fine (which I thought its intended purpose was).

# Michael Kaplan on 5 Jan 2005 5:03 PM:

Norman, I understand you think this.

Do you understand that we do not, and that we intend it to be used, and we have no intention of telling people not to use it?

It is basically the same as saying "use English" or whatever -- the default table. That's what its job is, and it does that well.

It's not fair to change the requirement based on what you think it ought to do, and thus make it not perform well. :-(

# Norman Diamond on 5 Jan 2005 11:05 PM:

Then you should call it the English locale and not the invariant locale. (Unless you really meant the en-US locale.) Attaching a non-provincial label for the purpose of appearing non-provincial while still making the factual behavior every bit as provincial as your other thread said you were trying to avoid, just yields another bit of pretense with continued provincialism. Also since facts of the invariant locale look like en-US, it will no longer be surprising when US developers overuse it and use it for undesirable purposes.

# Michael Kaplan on 6 Jan 2005 1:08 AM:

Norman, you are wrong once again. Over half of the languages there are also in the default table. I can say use Hebrew. Or German. Or any number of locales instead. Thinking it is provincial to take the word "English" out of it is plain silly.

Now if you look at the REST of the data, it looks nothing like English. Or Arabic. Or German. Or Hebrew. Or any of the other 50 locales that use the table. So go ahead and try to claim it is a provincial attempt to push a particular culture if you must -- but those who are reading will have long seen who gets it.

What we do here with invariant has a valid and explainable purpose.

# Michael Kaplan on 6 Jan 2005 5:13 PM:

Further comments that find the intended purpose and use of the invariant locale's collation behavior (to use the DEFAULT collation table used by half of the locales in Windows) to be inappropriate will not be accepted.

Those with dissenting opinions on the topic are invited to create their own blogs or web pages where they can explain the amount of evil-ness (evility?) in Microsoft's implementation choices.

# Norman Diamond on 7 Jan 2005 7:41 PM:

1/6/2005 1:08 AM Michael Kaplan

> Thinking it is provincial to take the
> word "English" out of it is plain silly.

Of course it is. That's why I didn't use the word provincial that way. I used the word provincial in the way you used it in page
http://weblogs.asp.net/michkap/archive/2004/12/08/278170.aspx
and I agree with the way you used it there.

# Michael Kaplan on 7 Jan 2005 7:56 PM:

Yes, but did you read the whole article?

"...since the only real goal anyway was to "use the default table" for sorting..."

Our point is therefore entirely clear here for all to see (and read).

Microsoft Windows and the .NET Framework work as designed in regard to the INVARIANT locale/culture. The only bug we have now is the inappropriate expectations of one Norman Diamond. :-)

# Phil Hackett on 2 Sep 2008 12:00 PM:

Thanks Michael - that article was -really- useful.  It's a shame the official Microsoft help/documentation doesn't explain this, really, as it's such an obvious question that people are bound to ask.


referenced by

2007/05/12 The exception that proves the rule that was the exception that proves another rule (aka On the variability of the Invariant)

2007/04/25 The nature of OrdinalIgnoreCase vs. intuitive expectations

2007/04/10 When methods use collation to 'disturb the peace' we charge them with being 'out of sorts'

2006/08/27 It has not always been so invariant

2006/05/24 Invariant vs. Ordinal, the third

2005/12/22 New in Windows Vista: OrdinalIgnoreCase for Win32

2005/10/15 If you are using INVARIANT then you are probably MISusing it, #1

2005/04/26 Intelligent unmanaged string comparison

2005/04/13 Invariant and Ordinal Redux

2005/04/03 TechEd Bloggers does not work for this site?

2005/02/11 Surrogate pairs and binary (Ordinal) comparisons

2005/01/23 SQL Server has its own version of .NET "ordinal" comparisons

2004/12/30 How do sort keys work?
