Sometimes, ignoring case is stupid

by Michael S. Kaplan, published on 2006/08/01 09:01 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2006/08/01/685351.aspx


I talked about some of this once before in CompareString ignores case by lowercasing...., as you may recall.

There are some people who think of ignoring case as required to keep things from sorting in the 'ASCII order' -- by which I mean

   ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

But this really is not true -- and (for example) file names will never sort that way in Explorer.

The difference is between

   AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz

and

   AabBcCDdeEfFgGHhIijJkKlLMmNnOopPqQRrSstTuUvVWwXxyYzZ

!

In other words, what we are talking about when it comes to comparisons and ignoring case, we are talking about IGNORING case and giving no deterministic order between items that only differ in case.

Other people (for example the Shell) just sort of implicitly assume that the way that they order files should use NORM_IGNORECASE since NTFS, FAT, and FAT32 all ignore case.

Though I am forced to point again to this article to hopefully help people to realize that the argument is flawed. Or perhaps a moment to think about the difference between 'Which comes first?' vs. 'Are they equal?' will make it clearer.

The bottom line is that if one chooses to make the answer to an IDENTITY question a case-insensitive one, that in no way means that the question of ORDERING of items also needs to be made case insensitive. In fact, if they are being done as two separate operations than it is likely better if they are not done the same way at all.

So why not do two separate operations in two separate ways? :-)

Others argue that the performance is improved if there are less things to look for, though in practice this does not tend to be an issue. The reason for this is that case is never an issue unless two elements have identical primary weights (which is unusual in most situations), and if that happens then the "case winner" is stored, and subsequent case differences are ignored. Just like as if you passed the flag anyway.

The bottom line is that people actually do kind of like determinism, a certain way that patterns will behave in a consistent manner. And if you don't ignore case, things can be ordered a bit more deterministically!

Some of the wittier readers might think of the other ignore flags like NORM_IGNOREKANA and NORM_IGNOREWIDTH and realize that the same issues apply to them as well. But NORM_IGNORECASE is kind of special due to the confusion about filesystems and their case insensitivity and such.

Now all of the above are referring to CompareString and its managed analogues. I'll talk about where sort keys fit into this another time....

 

This post brought to you by (U+ff98, a.k.a. HALFWIDTH KATAKANA LETTER RI)


# Dean Harding on Tuesday, August 01, 2006 8:02 PM:

I suppose people get that misconception because they're used to strcmp vs. stricmp for example. Since they basically do an ordinal sort, vs. a lower-case-then-ordinal sort...

# Adam on Wednesday, August 02, 2006 9:24 AM:

OK, I've read through the previous links, and I still don't get it.

From http://blogs.msdn.com/michkap/archive/2005/11/22/495033.aspx

"The world of the less lexicographic and more linguistic CompareString and mythical EqualString functions is a very different one. [...] This is hardest on CompareString where with those weights there are so many different levels of difference, with some levels trumping other levels. Therefore, any difference between the two strings is an interesting bit of trivia unless no greater difference is found later on. [...] And this is where the fictional EqualString function would have it easier -- because like its binary/ordinal cousins it would be able to return after any difference is found, of any weight."

If, _for a given collation_ (e.g. English, case insensitive), you're asking "Which comes first?" and "Are they equal?", surely "Are they equal?" should return true _if and only if_ "Which comes first?" is "neither".

How is it supposed to work otherwise? How are you defining identity for things that can be ordered, if not by how they are ordered?


And what is the difference between the two "aa...zz" strings you posted? If you're looking at them case-insensitively, surely they're identical? No?

/very confused.

# Michael S. Kaplan on Wednesday, August 02, 2006 10:29 AM:

Hi Adam,

The primary issue is that there is never anything to lose in having a deterministic ordering that says that "aa" always comes before "AA", even if they are in some context basically equal. And there is much to lose in having the behavior be random here -- having a consistent ordering is never a bad thing.

# Adam on Wednesday, August 02, 2006 10:55 AM:

Whoa!

I thought you were trying to get at the difference being that although some things "ordered the same", they might not be considered identical. i.e. "AA" and "aa" would return 0 from CompareString(), but EqualString() might return false. I mean, that almost kind of makes sense. They go "in the same position in a list", but they're not "exactly the same".

But you're saying instead that two things which are "identical" can have a well-defined order? But ... but ... but ... that's absurd!

OK, maybe not absurd. But ... things which are identical are indistinguishable from each other. That's the meaning of identity. Isn't it? How ...? Wha ...?

Even more confused! Brain ... melting ...

(Although the original post is now starting to make a bit more sense now I know which way round the equality/comparison disconnect you're talking about is.)

# Michael S. Kaplan on Wednesday, August 02, 2006 11:30 AM:

Actually, what I am saying is that although you might consider two things to be equivalent, you still want their order to be fixed if they can appear to be a bit different.

Because a list like

AA
aa
aA
aA
Aa

looks much better as

aa
aA
aA
Aa
AA

and so on....

# Adam on Wednesday, August 02, 2006 11:30 AM:

Also, would you be kind enough to give an example of lossage that might occur if the ordering of identical strings is not determined by their content?[0] I'm having a hard time thinking of one.


[0] I keep rereading that sentence and my brain keeps telling me there's an error in it. I'm pretty sure it's not spelling or grammar, and the warning from my brain is the kind I get when I've gone back to a post and edited half a sentence, but not updated the other half to match. I think it's just confused about what it thinks "identity" means/is supposed to mean. :-/

# Michael S. Kaplan on Wednesday, August 02, 2006 11:34 AM:

Think of it is that if you order a list as if it were case sensitive, then you will consider all of these items to be almost identical, and they will sort NEAR each other.

When ordering a list, there is no need to lose these small distinctions. It is only when asking the generic question of "ARE YOU EQUAL TO ____?" that one might need to be less discriminating, based on specific rules....

# Michael S. Kaplan on Wednesday, August 02, 2006 11:36 AM:

Adam --

Also note that the same problem extended to the NORM_IGNORENOSPACE flag is the cause for this (related) problem....

# Adam on Wednesday, August 02, 2006 11:43 AM:

Hmmm....maybe. Although if I'd wanted some kind of "fixed order" when sorting a list with equal items, I'd likely have used a stable sort to maintain the original order of things that are equal. Except that a stable sort will no longer work with this behaviour. (Depending on your definition of "work", of course)

# Michael S. Kaplan on Wednesday, August 02, 2006 11:51 AM:

Well, a stable sort will only work if everything was in the right order in the first place -- and if items are added randomly, this will not be the case.

And as I showed in the Korean example in that link above, equal is not always what you want here -- esp. when you have over 100 characters that are "equal" in some sense....

# Adam on Wednesday, August 02, 2006 12:32 PM:

So, and I think I've got this now, EqualString(), when "ignoring case", _does_ ignore case, and will tell you if two strings _are_ equal to each other. However, this does not mean that CompareString() will say there is no difference between them because it doesn't _ignore_ case, it just does something different with it.

Hang on - in an English locale, "A" and "a" _are_ next to each other anyway, aren't they? Isn't the situation you describe just a case-*sensitive* sort in an English locale? So, what's the difference between ignoring case and not ignoring case in such a scenario? Isn't CompareString() just ignoring the ignore case flag?


Sorry - is there some kind of primer I should be reading to get all this? The posts you linked to (and the posts they linked to) have great insights on individual facets of this, but ... this feels like a book-length subject. Can you recommend one?

# Michael S. Kaplan on Wednesday, August 02, 2006 3:05 PM:

I do not know of a book, but there is this blog. :-)

# Adam on Wednesday, August 02, 2006 6:28 PM:

:-)

But I feel guilty about taking up your time with this. One or two questions seems "reasonable". 5 or 6 seems intrusive, and makes me think that this must surely have been written down somewhere before.

But, if that comment can be taken as an invitation - my last question is really intriguing me now. What is the difference between the case-sensitive and case-insensitive versions of StringCompare()  in an English locale?

An example - given the string "ABABabab", I'd expect the following results from StringCompare() in an English locale:

"en" locale, case sensitive "AAaaBBbb" (stable)
"en" locale, case insensitive "aAaABbbB" (unstable)

For comparison with the old "C" locale:

"C" locale, codepoint sensitive: "AABBaabb" (stable)
"C" locale, case insensitive: "AaaAbBbB" (unstable)

However, you seem to claim (in your 11:30 post) that "en, case insensitive" should be the same as what I think "en, case sensitive" would be. So, what's the point in the case-sensitivity flag? What am I missing?

# Michael S. Kaplan on Wednesday, August 02, 2006 7:22 PM:

I tell people when they are annoying me, you haven't hit that limit. :-)

Note that we sort the opposite way in Windows (lowercase first), other than that the difference is that either *a* and *A* have a specific order, or they do not. There is no in between.

(the stable/unstable issue is a separate one which has more to do with whether re-sorting the list one just sorted can change the order by flipping some of these "identical" entries)

# Dean Harding on Wednesday, August 02, 2006 7:41 PM:

Adam: I think you're right. In case-insensitive, "A" and "a" will appear next to each other, but you don't know which one will come first. In case-sensitive, "A" and "a" still come next to each other, but "a" will always come before "A".

In C-locale, the "case-sensitive" strcmp will sort "a" next to "Z" (or is it the other way around?)

# Michael S. Kaplan on Wednesday, August 02, 2006 7:46 PM:

The other way around. :-)

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

# Adam on Thursday, August 03, 2006 3:33 AM:

Dean - that was my point. If I've understood Michael correctly, in a case-insensitive comparison you _do_ know which will come first! The order is well-defined - leaving me to wonder what the NORM_IGNORECASE flag is for. What _does_ it do if there is apparently no difference between a case-sensitive and case-insensitive sort?

# Michael S. Kaplan on Thursday, August 03, 2006 5:15 AM:

Hi Adam --

I explain it quite literally in that first link -- NORM_IGNORECASE gives you no deterministic ordering between them. You simply do not know.

There is a clear difference between the two -- case sensitive is deterministic and case in-sensitive is random.

# Adam on Thursday, August 03, 2006 6:50 AM:

Huh? Isn't that the opposite of what you said in your 10:30AM post?

So are you saying now that EqualString() _would_ return true if and only if CompareString() returns 0, if they were both passed the NORM_IGNORECASE flag? (Or they were both _not_ passed the flag?)

(Which is what I originally asked, but you said was not the case. Didn't you? Are we just going round in circles now?)

# Michael S. Kaplan on Thursday, August 03, 2006 7:23 AM:

Actually, no -- I never said really anything about EqualString in the comments, since the function does not exist.

If you mean the post that gives the two different ways five strings might be ordererd, the first one has NORM_IGNORECASE passed; the second one does not have it passed.

What I have been saying in terms of a summary is:

DETERMINING ORDER -- do not pass the NORM_IGNORECASE flag

VERIFYING IDENTITY -- pass the NORM_IGNORECASE flag.

I never said the function would behave differently; I am saying that the caller should CALL them differently. And thus the the post title "Sometimes, ignoring case is stupid."

# Adam on Thursday, August 03, 2006 8:54 AM:

Ahhhhhhhhh............! That makes sense now!

Sorry - reading through the back articles, particularly the one I quoted in my first post, I got the impression that there was a fundamental difference between how order and identity is determined, due to an (probably incorrectly perceived by me) implication that EqualString() would be inherently faster than CompareString().

(I'm aware that EqualString() doesn't exist - hence "would" instead of "does" in my last post)

Right - yes - if you use different ordering criteria for determining identity than you do for actually ordering things, then yes, of course your definition of "the same" will be different in each case.


I'm still not sure *why* you'd use different collation criteria for determining identity and order, but now that I get what you're actually saying I at least have a chance of figuring that one out! :-)

Sorry for the misunderstanding.


(p.s. Is it just me, or does your 2005/11/22 post I originally linked to imply that the mythical EqualString() would be inherently faster than CompareString() given the same collation options?)

# Michael S. Kaplan on Thursday, August 03, 2006 9:36 AM:

EqualString would be faster, as the question it answers takes less effort. :-)

For the real example of why you would want the two operations different, see that Korean post I linked to -- people consider the behasvior to be broken if you *don't* make the two operations separate.....

referenced by

2011/05/10 On Compare ≠ Equals, and the Relevance of both

go to newer or older post, or back to index or month or day