Similar descriptions does not mean similar methodologies

by Michael S. Kaplan, published on 2005/05/08 13:30 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/05/08/415522.aspx

The other day, I had to take a look at the various unmanaged case-insensitive string comparison functions. I thought I would post the comparison/contrast information.

First the locale sensitive functions:

CompareStringW (kernel32.dll) -- the mother of all of the functions below, you can choose the locale, the flags, and whether the strings are counted or null-terminated. Embedded nulls are allowed.
lstrcmpiW (user32.dll) -- assumes null-terminated strings, then calls CompareStringW with the NORM_IGNORECASE flag and the thread locale (if that fails then it tries again with the system locale; in the unlikely event both fail, it uses a call to _wcsicmp).
_wcsicoll (CRT) -- assumes null-terminated strings. If using the "C" locale, does an ASCII (A to Z) ToLowercase followed by a binary compare; otherwise it calls CompareStringW with the LCID of the CRT locale and the SORT_STRINGSORT and NORM_IGNORECASE flags.
_wcsnicoll (CRT) -- takes one count parameter for both strings, but will also exit on an embedded null. If using the "C" locale, does an ASCII (A to Z) ToLowercase followed by a binary compare; otherwise it calls CompareStringW with the LCID of the CRT locale and the SORT_STRINGSORT and NORM_IGNORECASE flags (note that using just one count parameter will break compressions on locales that use them and expansions on all locales).
StrCmpIW (shlwapi.dll) -- assumes null-terminated strings, then calls CompareStringW with the NORM_IGNORECASE flag and the thread locale (if that fails then it tries again with the system locale). Manages to look a lot like lstrcmpiW, though not completely so in rare scenarios.
StrCmpNIW (shlwapi.dll) -- takes one count parameter for both strings, but will also exit on an embedded null. It calls CompareStringW with the thread locale and the NORM_IGNORECASE flag (note that using just one count parameter will break compressions on locales that use them and expansions on all locales). Manages to look a lot like a hybrid of lstrcmpiW and _wcsnicoll.
StrCmpLogicalW (shlwapi.dll) -- does linguistic comparisons using the thread locale (falling back to the system locale on failure), cleverly wrapping multiple calls to CompareStringW to support treating the 0123456789 digits as numbers.
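
To make the "digits as numbers" idea concrete, here is a minimal portable sketch. It is not the shlwapi implementation: the real function hands the non-digit pieces to CompareStringW, while this sketch substitutes an ASCII-only lowercase fold so that it runs anywhere. The names logical_compare_sketch, is_ascii_digit and ascii_lower are made up for illustration, and ignoring leading zeros is an assumption of the sketch, not a documented behavior.

```c
#include <wchar.h>

/* Hypothetical helper: only the ASCII digits 0123456789 count as digits. */
static int is_ascii_digit(wchar_t c)
{
    return c >= L'0' && c <= L'9';
}

/* Hypothetical helper: ASCII-only lowercase, a stand-in for a real
 * linguistic comparison of the non-digit pieces. */
static wchar_t ascii_lower(wchar_t c)
{
    return (c >= L'A' && c <= L'Z') ? (wchar_t)(c - L'A' + L'a') : c;
}

/* Sketch: runs of digits compare by numeric value, everything else
 * compares character by character. Returns <0, 0 or >0. */
int logical_compare_sketch(const wchar_t *a, const wchar_t *b)
{
    while (*a && *b) {
        if (is_ascii_digit(*a) && is_ascii_digit(*b)) {
            /* Assumption of this sketch: leading zeros are ignored. */
            while (*a == L'0') a++;
            while (*b == L'0') b++;

            const wchar_t *end_a = a, *end_b = b;
            while (is_ascii_digit(*end_a)) end_a++;
            while (is_ascii_digit(*end_b)) end_b++;

            /* With zeros stripped, the longer run is the larger number. */
            size_t len_a = (size_t)(end_a - a), len_b = (size_t)(end_b - b);
            if (len_a != len_b) return len_a < len_b ? -1 : 1;

            /* Same length: the first differing digit decides. */
            for (; a < end_a; a++, b++)
                if (*a != *b) return *a < *b ? -1 : 1;
            continue; /* equal runs; both pointers sit just past them */
        }

        wchar_t ca = ascii_lower(*a), cb = ascii_lower(*b);
        if (ca != cb) return ca < cb ? -1 : 1;
        a++;
        b++;
    }
    if (*a) return 1;   /* a has characters left, so it sorts later */
    if (*b) return -1;  /* b has characters left, so it sorts later */
    return 0;
}
```

With this sketch, "file2.txt" sorts before "file10.txt", which is the opposite of what a plain binary comparison such as wcscmp gives for the same pair.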

And now the locale insensitive functions:

RtlCompareUnicodeString (ntdll.dll) -- taking lengths in its UNICODE_STRING parameters (and allowing embedded nulls), it converts characters to uppercase and then does a binary comparison on them. This comparison matches what a lot of the operating system does for many of its objects (most of which use this very function!).
_wcsicmp (CRT) -- assumes null-terminated strings. If using the "C" locale, on each character it does an ASCII (A to Z) ToLowercase followed by a binary compare; otherwise on each character it does a full ToLowercase followed by a binary compare.
_wcsnicmp (CRT) -- takes one count parameter for both strings, but will also exit on an embedded null. If using the "C" locale, on each character it does an ASCII (A to Z) ToLowercase followed by a binary compare; otherwise on each character it does a full ToLowercase followed by a binary compare.
StrCmpICW (shlwapi.dll) -- assumes null-terminated strings. On each character it does an ASCII (A to Z) ToLowercase followed by a binary compare. It matches the "C" locale behavior of _wcsicmp, which of course does not match the OS behavior at all.
StrCmpNICW (shlwapi.dll) -- takes one count parameter for both strings, but will also exit on an embedded null. On each character it does an ASCII (A to Z) ToLowercase followed by a binary compare. It matches the "C" locale behavior of _wcsicmp, which of course does not match the OS behavior at all.
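
The "ASCII (A to Z) ToLowercase followed by a binary compare" pattern that keeps coming up in this list can be sketched in a few lines of portable C. This is only an illustration of the described behavior, not the CRT or shlwapi source; the names ascii_to_lower, ascii_icmp_sketch and ascii_nicmp_sketch are hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/* Lowercase only A to Z; every other character passes through untouched. */
static wchar_t ascii_to_lower(wchar_t c)
{
    return (c >= L'A' && c <= L'Z') ? (wchar_t)(c - L'A' + L'a') : c;
}

/* Null-terminated variant: fold each character, then compare code units. */
int ascii_icmp_sketch(const wchar_t *a, const wchar_t *b)
{
    for (;; a++, b++) {
        wchar_t ca = ascii_to_lower(*a), cb = ascii_to_lower(*b);
        if (ca != cb) return ca < cb ? -1 : 1;
        if (ca == L'\0') return 0; /* both strings ended together */
    }
}

/* Counted variant: one count for both strings, and it still stops
 * early at an embedded null, as described for the counted functions. */
int ascii_nicmp_sketch(const wchar_t *a, const wchar_t *b, size_t count)
{
    for (; count > 0; count--, a++, b++) {
        wchar_t ca = ascii_to_lower(*a), cb = ascii_to_lower(*b);
        if (ca != cb) return ca < cb ? -1 : 1;
        if (ca == L'\0') return 0;
    }
    return 0; /* the first count characters matched */
}
```

Under this sketch "README" equals "readme", but U+00C9 (É) does not equal U+00E9 (é), because nothing outside A to Z is ever folded.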

A few interesting points about these functions:

1) According to comments in the SHLWAPI source, many of them were initially added because the CRT and user32 counterparts were not supported on earlier versions of Win9x. Kind of ironic when you note the small behavior differences between them all, huh?

2) Given the Georgian casing issue, it is a little sad that almost all of these functions that convert prior to comparison use a lowercasing operation when so much of the core OS uses uppercasing. Especially given how often people use the functions to emulate the OS behavior for tidier validation messages. Luckily, the amount of data in Khutsuri is small so the inconsistency is not often noticed.

3) Am I the only person who thinks it is weird that _wcsicmp and _wcsnicmp have locale-specific behaviors, especially such really weird ones? They doc this a bit I guess, but until I looked at the code I would never have guessed.

4) CompareStringW is definitely the king of the linguistic comparison -- everyone else is either (a) calling our function, (b) doing the job wrong, or (c) both!
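
On point 2, the direction of the case conversion matters because the two directions do not always agree on which characters are "the same". The sketch below does not reproduce the Georgian case or any OS casing table. As a stand-in it uses the three Greek sigmas, whose Unicode simple case mappings are: U+03C2 (ς) and U+03C3 (σ) both uppercase to U+03A3 (Σ), while Σ lowercases only to σ. The table and every function name here are hand-written for illustration.

```c
#include <stddef.h>
#include <wchar.h>

/* Toy uppercase: ASCII letters plus the two lowercase sigmas. */
static wchar_t toy_upper(wchar_t c)
{
    if (c == 0x03C2 || c == 0x03C3) return 0x03A3; /* final sigma, sigma -> capital sigma */
    if (c >= L'a' && c <= L'z') return (wchar_t)(c - L'a' + L'A');
    return c;
}

/* Toy lowercase: ASCII letters plus capital sigma. Final sigma is
 * already lowercase, so it is left alone. */
static wchar_t toy_lower(wchar_t c)
{
    if (c == 0x03A3) return 0x03C3; /* capital sigma -> sigma */
    if (c >= L'A' && c <= L'Z') return (wchar_t)(c - L'A' + L'a');
    return c;
}

typedef wchar_t (*fold_fn)(wchar_t);

/* Fold each character with the given function, then compare code units. */
static int folded_compare(const wchar_t *a, const wchar_t *b, fold_fn fold)
{
    for (;; a++, b++) {
        wchar_t ca = fold(*a), cb = fold(*b);
        if (ca != cb) return ca < cb ? -1 : 1;
        if (ca == L'\0') return 0;
    }
}

int compare_upper_sketch(const wchar_t *a, const wchar_t *b)
{
    return folded_compare(a, b, toy_upper);
}

int compare_lower_sketch(const wchar_t *a, const wchar_t *b)
{
    return folded_compare(a, b, toy_lower);
}
```

Comparing "ς" with "σ" reports equal through compare_upper_sketch and not equal through compare_lower_sketch, so code that lowercases before comparing can disagree with code that uppercases, even on the same input.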

Now there is no king (nor good heir apparent) for the non-linguistic comparison right now in unmanaged code, like I talk about here.

Yes, I am still thinking about it. :-)

The situation is kind of like when you have a vacancy in management and a lot of "wannabe" replacements (like these other functions), none of whom really fit the bill and none of whom can get the job done themselves. If you know what I mean....

This post brought to you by "ς" (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA)

# Sriram on Sunday, May 08, 2005 2:40 PM:

All for the sake of comparing 2 strings. Whatever happened to good old strcmp? :-)

# Michael S. Kaplan on Sunday, May 08, 2005 2:51 PM:

Ah, remember my criteria -- Unicode, case insensitive. The strcmp function (intrinsic or CRT) is none of those. :-)

# Michael S. Kaplan on Sunday, May 08, 2005 3:37 PM:

Don't worry, I'll point out the explosion of methods and overrides in managed code soon. I hinted at them in http://blogs.msdn.com/michkap/archive/2005/04/14/408116.aspx

:-)

# Dean Harding on Sunday, May 08, 2005 7:04 PM:

Well, to be honest, I prefer lots of overloads to lots of differently-named functions. At least with overloads you can look in the same place for all the documentation whereas with differently-named functions, you've got to rely on the documentation to include pointers to all the other possible variants.

Still, one function that can do it all would be best of all, even if I have to write my own little wrappers for my own special cases. At least then I can follow my own standards, rather than trying to remember the difference between RtlCompareUnicodeString, StrCmpNIW and lstrcmpiW for example...

# Michael S. Kaplan on Sunday, May 08, 2005 7:17 PM:

Well, me too.

But I prefer fewer functions with fewer overrides best of all -- with lots of intuitive enumerations, which intellisense also help with....

# Someone passing by on Monday, May 09, 2005 5:56 PM:

# StrCmpLogicalW (shlwapi.dll) -- does linguistic comparisons using the thread locale (falling back to the system locale on failure), cleverly wrapping multiple calls to CompareStringW to upport treating the 0123456789 digits as numbers.
^

to support;)

# Michael S. Kaplan on Monday, May 09, 2005 9:26 PM:

Good catch -- fixed now. :-)

# Nazgul on Tuesday, July 05, 2005 9:16 AM:

Hi. I'm trying to use CompareStringW to compare some WideStrings and I need to compare them case-sensitively. However, I always get them compared case-insensitively. I did NOT set the "NORM_IGNORECASE" flag on.
So, when I sort strings "France", "Portugal" and "other", I want the result to be either

France
Portugal
other

or

other
France
Portugal

but what I get is

France
other
Portugal

cuz when I compare "France" and "Portugal", the result is 1 (this is correct), comparing "other" and "Portugal" gives 1 (that's correct, too), but comparing "France" and "other" also gives 1 (incorrect, should be 3).
It's interesting that when I call CompareStringW on "portugal" and "Portugal" the result I get is not 2, but 1. It looks like this function does a case-insensitive comparison, and only if the compared strings don't differ (case-insensitively) does it look at the case.
Is there a way to make the CompareStringW function not ignore the case?
I am using locale MAKELCID(MAKELANGID(LANG_CZECH, SUBLANG_DEFAULT), SORT_DEFAULT), but it behaves exactly in the same way even if I set it to MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_DEFAULT), SORT_DEFAULT).

# Michael S. Kaplan on Tuesday, July 05, 2005 10:32 AM:

Hi Nazgul, see my post 'What it means to be case insensitive' at http://blogs.msdn.com/michkap/archive/2005/06/16/429667.aspx to understand what is meant here. There is no NLS function that does what you want here, and it would certainly not be an 'ignore case' since that is the opposite of what you are doing -- you are not only *not* ignoring case, you are going out of your way to pay attention to it in non-intuitive ways! :-)


