Casing and IgnoreCase are still not the same thing....

by Michael S. Kaplan, published on 2006/03/15 03:50 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/15/551773.aspx


Rob commented in the Suggestion Box:

It might be worthwhile to address this article:

http://www.codeproject.com/buglist/comparenocase.asp

The author doesn't quite "get it" in that he thinks that you could, say, uppercase a string without caring about the locale (sharp S? dotted I?) but he does seem to have found some confusing documentation.

As Rob pointed out, there are a few problems here, mainly in perception. Add to that a few bugs, mainly in documentation, to help add to the confusion....

To start with, as I have pointed out previously, Collation != Case. And more importantly in this case, ignoring case in a linguistic comparison is not exactly the same as uppercasing (or lowercasing) both strings and then doing a binary comparison.

Perhaps the best way to look at it is to realize that ignoring case is not guaranteed to be the same as trying to force the case to be identical....
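A minimal sketch of why forcing case is not a neutral operation, assuming a deliberately tiny, hand-written case table (the helper names simple_upper, simple_lower and equal_after are hypothetical, not CRT or Windows functions). It only covers ASCII plus two characters whose simple Unicode case mappings are one-way: U+0131 (dotless i) uppercases to "I", and U+212A (KELVIN SIGN) lowercases to "k". Whether two strings "match" then depends on which direction you force the case:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical mini case table: ASCII plus U+0131, which uppercases to 'I'.
char32_t simple_upper(char32_t c) {
    if (c >= U'a' && c <= U'z') return c - U'a' + U'A';
    if (c == U'\u0131') return U'I';   // LATIN SMALL LETTER DOTLESS I
    return c;                          // everything else is left alone
}

// Hypothetical mini case table: ASCII plus U+212A, which lowercases to 'k'.
char32_t simple_lower(char32_t c) {
    if (c >= U'A' && c <= U'Z') return c - U'A' + U'a';
    if (c == U'\u212A') return U'k';   // KELVIN SIGN
    return c;                          // everything else is left alone
}

// Binary comparison after forcing both strings through the same mapping.
bool equal_after(char32_t (*force)(char32_t),
                 const std::u32string& a, const std::u32string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (force(a[i]) != force(b[i])) return false;
    return true;
}

void demo_case_forcing() {
    // Dotless i vs. "I": equal after uppercasing, different after lowercasing.
    assert(equal_after(simple_upper, U"\u0131", U"I"));
    assert(!equal_after(simple_lower, U"\u0131", U"I"));
    // Kelvin sign vs. "K": equal after lowercasing, different after uppercasing.
    assert(equal_after(simple_lower, U"\u212A", U"K"));
    assert(!equal_after(simple_upper, U"\u212A", U"K"));
}
```

Neither direction gives "the" case-insensitive answer, and neither is what a linguistic comparison that ignores case would necessarily do. A real implementation would use full case tables and locale rules, not this toy mapping.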

(As you can see in a comment to that article by Abhinaba, there is confusion on this issue and the relation to the file system over on the SQL Server team as well, sometimes.)

Then, as I pointed out last May in this article, there are some locale-specific behaviors in some collation operations that claim to be independent of such matters. It is easy enough, though, to read this as the docs talking about locale sensitivity in the collation rather than in the casing. Perhaps the docs should point out more explicitly what is happening here (I am sure it is too late to change the behavior, though I am loath to suggest that yet another comparison method be added to the CRT to capture the OS comparison behavior)....

Now admittedly the various CString methods that Geert Delmeiren pointed out, like CString::CollateNoCase and CString::CompareNoCase, don't make it any easier to discern what is happening. With such interesting text as:

The generic-text function _tcscoll, which is defined in TCHAR.H, maps to either _stricoll, _wcsicoll, or _mbsicoll depending on the character set that is defined at compile time. Each of these functions performs a case-insensitive comparison of the strings, according to the code page currently in use.

for CollateNoCase and

Compares this CString object with another string using the generic-text function _tcsicmp. The generic-text function _tcsicmp, which is defined in TCHAR.H, maps to either _stricmp, _wcsicmp, _mbsicmp depending on the character set that is defined at compile time. Each of these functions performs a case-insensitive comparison of the strings, and is not affected by locale.

for CompareNoCase, it is easy to see why anyone could get confused if they do not selectively ignore the docs and read blog entries here. :-)

However, I would not suggest using Geert's suggested MyCompareNoCase function as a generic solution, since the exact solution depends on the context and any attempt at a "one size fits all" solution is bound to fail....

 

This post brought to you by "é" (U+00e9, a.k.a. LATIN SMALL LETTER E WITH ACUTE)


# Gabe on 15 Mar 2006 5:27 AM:

I can understand how it's possible to get unexpected matches with CompareNoCase. For example, I would expect the word "rèsumé" to uppercase with accents but compare as equal to "resume" because in my locale (us-en) accents don't matter.

What I don't understand is why CompareNoCase finds two strings different even though they roundtrip properly between cases? The only thing I can conclude is that CompareNoCase is using a different canonicalization function than strupr.

# Michael S. Kaplan on 15 Mar 2006 9:28 AM:

Well I believe source is available for CString for those who doubt? :-)

# Dean Harding on 15 Mar 2006 6:12 PM:

You got the documentation around the wrong way: CompareNoCase uses xxxicmp and CollateNoCase uses xxxicoll :-)

I was looking through the source, and if you use the C locale, then _wcsicmp uses an internal function __ascii_towlower, which is #define'd as:

#define __ascii_towlower(c)     ( (((c) >= L'A') && ((c) <= L'Z')) ? ((c) - L'A' + L'a') : (c) )

So that's obviously why it didn't work for him in the first place. If you change the locale (with setlocale, like he did towards the end) then instead of this #define, it ends up calling LCMapString to do the lowercasing.

On the other hand, I believe _wcsicoll uses CompareString. But it depends on how the LC_CTYPE and LC_COLLATE settings are set by default.

It's all very confusing!! I'd much prefer to just use CompareString/LCMapString directly. At least then you're being explicit about how you want the comparison/casing to work. I'd say you should only use the CRT functions if you want to port to other operating systems.
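To illustrate the C-locale path Dean describes, here is a rough sketch that applies the same logic as the quoted __ascii_towlower macro. The function names ascii_towlower and ascii_equal_nocase are hypothetical stand-ins written for this example; this is not the CRT source. It shows why only A-Z get folded, so "É" and "é" compare as different:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same logic as the __ascii_towlower macro quoted above, as a function.
wchar_t ascii_towlower(wchar_t c) {
    return (c >= L'A' && c <= L'Z') ? static_cast<wchar_t>(c - L'A' + L'a') : c;
}

// Hypothetical helper: "case-insensitive" equality that only knows about A-Z.
bool ascii_equal_nocase(const std::wstring& a, const std::wstring& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (ascii_towlower(a[i]) != ascii_towlower(b[i])) return false;
    return true;
}

void demo_ascii_fold() {
    // Plain ASCII letters fold as expected.
    assert(ascii_equal_nocase(L"RESUME", L"resume"));
    // U+00C9 and U+00E9 are outside A-Z, so they are left untouched and differ.
    assert(!ascii_equal_nocase(L"\u00C9", L"\u00E9"));
}
```

Under this kind of folding, any cased letter outside ASCII is compared by its raw code point, which matches the "didn't work in the first place" behavior described above.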

# Michael S. Kaplan on 15 Mar 2006 6:50 PM:

Hi Dean,

Actually, I got the documentation right -- note that the one it points to comes right after the quote in both cases. :-)

I agree that the CRT stuff is way too confusing here....

# Dean Harding on 15 Mar 2006 8:01 PM:

Oh sorry, you're right heh :-) I saw the link CString::CompareNoCase right above the text for the collate one and got mixed up...

