When _wcsnicmp can't hack it, CompareStringW delivers

by Michael S. Kaplan, published on 2006/10/23 00:11 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2006/10/23/860181.aspx


The question that nikos asked in the microsoft.public.win32.programmer.international newsgroup was:

This occurs within code that searches text within files

If the text file (UTF-8) contains a bulgarian P (capital) and I try to match it with a lowercase p, _wcsnicmp "fails" (i.e. says strings are not identical but they ARE when case is not considered) whereas lstrcmpi succeeds.

A bit more information:

The letter appears in the file as these bytes
CAPITAL P (UTF8): (EF BB BF) D0  A0

I load the file and convert to unicode as such:
MultiByteToWideChar(CP_UTF8, 0, ....);

in unicode trim the same letter has the following encoding:
normal(UPPER) Разредка
00D642D0  20 04 30 04 37 04 40 04   .0.7.@.
00D642D8  35 04 34 04 3A 04 30 04  5.4.:.0.
target (LOWER) разредка
00D64500  40 04 30 04 37 04 40 04  @.0.7.@.
00D64508  35 04 34 04 3A 04 30 04  5.4.:.0.


_wcsnicmp fails to see these strings as identical whereas lstrcmpi is ok. Clearly this has something to do with the locale CRT uses and WINAPI uses, but I didn't touch anything, the program is running with the defaults of a unicode application for an english (UK) pc.

I could switch to lstrcmpi but my problem is that i search for text in a big buffer which isn't 0-terminated, so i'd have to copy it out and that would slow things down

any clues?
thanks
nikos

Now there are two separate issues here (the UTF-8 piece in the beginning is a red herring), and one note:

The note is that the text should be viewed in the debugger as WORD values rather than BYTE values, to avoid the byte reversal seen above (the actual text has code point values of U+0420, U+0430, and so on).
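To make the byte reversal concrete, here is a minimal sketch in portable C that reassembles the debugger's byte dump into 16-bit values. The helper name bytes_to_utf16le is made up for illustration and is not part of any API mentioned in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine a little-endian byte dump into UTF-16
   code units. The low byte comes first in memory, so the bytes
   "20 04" form the single WORD 0x0420. */
void bytes_to_utf16le(const uint8_t *bytes, size_t byte_count, uint16_t *out)
{
    assert(byte_count % 2 == 0);   /* a whole number of 16-bit units */
    for (size_t i = 0; i + 1 < byte_count; i += 2) {
        out[i / 2] = (uint16_t)(bytes[i] | (bytes[i + 1] << 8));
    }
}
```

Fed the first row of the dump above (20 04 30 04 37 04 40 04), this yields 0x0420, 0x0430, 0x0437, 0x0440, which is what a WORD view in the debugger would show directly.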

Issue #1 is why _wcsnicmp's results don't match lstrcmpiW's. The reason is that _wcsnicmp does a lexicographic (binary) comparison, while lstrcmpiW does a linguistic one. If you want to stay with the CRT here, switch to its linguistic comparison function, _wcsnicoll.

And then, once you take care of the first issue and have two functions that both return linguistically appropriate results, you have Issue #2 to deal with: making sure they use appropriate locale values. With lstrcmpiW one has no choice: in versions of Windows before Vista the thread locale is used, and in Vista the user locale is used instead. But with _wcsnicoll one must either set the locale appropriately (it starts up with the "C" locale, which only handles A-Z/a-z casing and so will never match lstrcmpiW), or else call the new _wcsnicoll_l, which allows you to pass the locale you wish to use.
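The startup behavior is easy to check for yourself. The following is a rough sketch using only standard C; the function names are invented for illustration, and it deliberately does not call _wcsnicoll, so it only shows the locale state that such a call would inherit.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: nonzero if the program is still in the "C" locale.
   Passing NULL to setlocale queries the current locale without changing it. */
int current_locale_is_c(void)
{
    const char *name = setlocale(LC_ALL, NULL);
    return name != NULL && strcmp(name, "C") == 0;
}

/* Hypothetical helper: call this first thing in the program to confirm
   the startup state described above. */
void verify_startup_locale(void)
{
    assert(current_locale_is_c());
}

/* Hypothetical helper: switch to the locale from the user's environment.
   Returns nonzero on success; setlocale returns NULL if the request
   cannot be honored, in which case the locale is left unchanged. */
int adopt_user_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}
```

Only after something like adopt_user_locale() succeeds would the CRT collation functions have a chance of agreeing with what the Win32 side does.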

Now of course this points to what may be the best solution for a single function that will let you pass string length, ignore case, choose an appropriate locale, and work in different versions of Windows -- the master NLS collation function, CompareStringW!

And lstrcmpiW is just a wrapper around CompareStringW anyway, so if you almost liked the behavior of lstrcmpiW then the behavior of CompareStringW should be perfect. :-)
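For the original scenario (searching a big buffer that is not 0-terminated), a sketch along the following lines might do. The two helper names are hypothetical; CompareStringW, LOCALE_USER_DEFAULT, NORM_IGNORECASE and CSTR_EQUAL are the real Win32 names. The #ifdef guard is only there so the file still compiles on non-Windows compilers. Treat it as an illustration under those assumptions rather than a tuned search routine.

```c
#ifdef _WIN32   /* Win32-only sketch; the guard lets the file compile elsewhere */
#include <assert.h>
#include <windows.h>

/* Hypothetical helper: nonzero if the first cch UTF-16 code units of a and b
   compare equal, ignoring case, under the user's default locale. Neither
   buffer needs a terminating 0 because the lengths are passed explicitly. */
int equal_ignore_case_n(const wchar_t *a, const wchar_t *b, int cch)
{
    assert(cch >= 0);   /* a negative count would mean "0-terminated" */
    /* CompareStringW returns 0 on failure, which is treated as "not equal". */
    return CompareStringW(LOCALE_USER_DEFAULT, NORM_IGNORECASE,
                          a, cch, b, cch) == CSTR_EQUAL;
}

/* Hypothetical helper: offset of the first case-insensitive match of needle
   in haystack, or -1 if there is none. Compares equal-length windows only. */
int find_ignore_case(const wchar_t *haystack, int haystack_cch,
                     const wchar_t *needle, int needle_cch)
{
    for (int i = 0; i + needle_cch <= haystack_cch; ++i) {
        if (equal_ignore_case_n(haystack + i, needle, needle_cch)) {
            return i;
        }
    }
    return -1;
}
#endif
```

One design caveat: because this compares windows of the same length as the search text, it can miss matches where a linguistic comparison would consider strings of different lengths equal. For plain case differences like Р versus р that is not an issue.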

 

This post brought to you by Р (U+0420, a.k.a. CYRILLIC CAPITAL LETTER ER)


# Adam on Monday, October 23, 2006 3:44 AM:

"But with _wcsnicoll one must either set the locale appropriately (it starts up with the "C" locale which only handles A-Z/a-z casing, which will never match lstrcmpiW)"

Wha!?!

Why on earth does the locale not propagate to the C runtime properly? When setting the Windows locale, why isn't setlocale() also called (or faked) so that the C functions (like wcsnicoll()) use the same locale as the Windows functions?

Are you /trying/ to make the standard C functions look broken?

# Michael S. Kaplan on Monday, October 23, 2006 3:51 AM:

Me? No.

But I believe the standard defines the default behavior here. You have to call setlocale yourself to choose the behavior that matches the OS user settings.

# Adam on Monday, October 23, 2006 10:38 AM:

Apologies - by "you", I meant MS.

Still, according to the POSIX standard[0] (which follows the C standard wherever the C standard defines behaviour) for how the locale is set up:

"If the LANG environment variable is not set or is set to the empty string, the implementation-dependent default locale is used."

So, if the user does not set a locale in their environment (which most users will not), the implementation is free to use any suitable default locale. With windows, that would appear to me to be the current windows locale for the user. I'd have certainly thought it would be more appropriate than the "C" locale!

Further, while the user *can* use setlocale()[1] to change their locale for a program using the C runtime, I am not aware of *any* prohibition in the C (or even POSIX) standard on implementors providing their own high-level functions that call any other standard function (e.g. setlocale()) as one small part of their operation. Especially if such a call/behaviour was documented. Frankly, I'd be absolutely amazed if this kind of prohibition existed.

(Note - many of the standard functions are defined as being not allowed to affect the shared state of some non-reentrant library functions, e.g. rand(), but again that is only a limitation of the functions defined by the standard)

[0] http://www.opengroup.org/onlinepubs/007908799/xbd/envvar.html

[1] http://www.opengroup.org/onlinepubs/007908799/xsh/setlocale.html

# Michael S. Kaplan on Monday, October 23, 2006 11:02 AM:

Well, in that case the most likely reason would be the perf issue, I guess -- I mean, the "C" locale has the advantage of being much faster (at the cost of being somewhat linguistically lame).

# Mihai on Monday, October 23, 2006 1:09 PM:

From the C Standard (ISO 9899):

<<

At program startup, the equivalent of

   setlocale(LC_ALL, "C");

is executed.

>>

(section 7.11.1.1, "The setlocale function")

# Michael S. Kaplan on Monday, October 23, 2006 2:10 PM:

Ah, I guess my initial recollection was correct! :-)

# Adam on Monday, October 23, 2006 3:22 PM:

Mihai > Cool. But could you post some more context there? (C standard is expensive :( )

I'm pretty sure that this isn't mandated if LC_ALL (or any of the other language variables) is already set to something other than "C" in the environment. Why bother with the other environment vars if conforming apps are forced to override them all at program startup?

Seems odd.

# Dean Harding on Monday, October 23, 2006 8:36 PM:

Adam, C and POSIX are two different standards...

You can download drafts of ISO standards for free. For example, the latest working draft:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf

From there:

4. At program startup, the equivalent of

setlocale(LC_ALL, "C");

is executed.

5. The implementation shall behave as if no library function calls the setlocale function.

# Adam on Tuesday, October 24, 2006 2:33 PM:

Dean,

Cheers for the link. I'm aware that C and POSIX are different. But the C standard is expensive and (as far as I *was* aware) not online, while POSIX (Single Unix Specification), which follows the C standard as much as it can, is. That's a great help though, thanks.

As for the "The implementation shall behave as if no library function calls the setlocale() function" - hmm......I'd always taken that to mean that the implementation shall behave as if no *standard* library function calls the setlocale() function, but the standard does seem to be quite precise about using "library" and "standard library". But srand() has a similar clause (7.20.2.2) which means that an implementation may not provide, say, a "setgenseed(char *, unsigned)" function that would set the PRNG algorithm and seed at the same time, which seems - strange.

I can understand forcing such guarantees on *standard* library functions - a strictly conforming program should be able to rely on, say, strtok() not breaking on some system because part of that system's library implementation uses it without acting as-if it did not.

But yeah, your reading seems right.

But still - that leads me to wonder about where "the implementation" ends, and where "another library that happens to be supplied by the vendor" begins. There's certainly no prohibition on 3rd party libraries acting as-if they call setlocale() or srand() or strtok(). At what point does a library that happens to be supplied by the compiler vendor (e.g. Win32, as supplied by MS), where all parts of the library are defined in separate (non-standard) headers, stop being part of "the implementation"?

Maybe I should head over to comp.lang.c.moderated or comp.std.c :)

