When conversions ignore the errors...

by Michael S. Kaplan, published on 2009/09/04 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2009/09/04/9890852.aspx


The other day I was sent mail about a Connect bug. This Connect bug, in fact.

The title alone (mbstowcs_s does not return an error when the current code page does not support all the characters in mbstr) might suggest what is going on to some of you.

And the description will give a hint to some of you too:

When mbstr contains characters not supported by the current process code page, mbstowcs_s does not return an error and put garbage characters in wcstr.

Example:

setlocale(LC_CTYPE, ".1252"); //set the process to use a locale with English code page
    //you can also try not setting the locale. The default process LC_CTYPE locale is C
    //which means 7-bit ASCII.
const char* mbTestStr = "Test. 真的."; //this is a 9 character string with 2 Chinese characters.
size_t charCount;
wchar_t wcStr[50];
errno_t error = mbstowcs_s(&charCount, wcStr, 50, mbTestStr, _TRUNCATE); //_TRUNCATE is (size_t)-1

After the call, no error is returned, charCount becomes 12, and wcStr contains "Test. ÕæµÄ." It seems charCount is the actual byte count in mbStr. The two Chinese characters each takes two bytes in mbStr.

The function should fail to convert the Chinese characters and return an error because the code page does not support Chinese characters.

If I set the locale to ".936" (936 is a code page for simplified Chinese). No error is returned, charCount becomes 10, and wcStr contains "Test. 真的.". Everything is correct.

_mbstowcs_s_l has the same problem if you give it a locale that does not support all the characters in mbstr.
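The numbers in the report can be checked by hand. The following is a minimal sketch, not the CRT's implementation: `decode_cp1252_simple` and `kTestBytes` are hypothetical names, and the decoder only covers ASCII and the 0xA0-0xFF range, where code page 1252 matches Unicode one to one. It assumes the compiler stored the two Chinese characters as the code page 936 bytes D5 E6 and B5 C4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode bytes as code page 1252, handling only
   ASCII (0x00-0x7F) and 0xA0-0xFF, where 1252 agrees with Unicode
   code point for code point. Returns the number of code points
   written, or (size_t)-1 if a byte falls in 0x80-0x9F (not covered
   by this sketch) or the output buffer is too small. */
size_t decode_cp1252_simple(const unsigned char *src, size_t len,
                            uint32_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = src[i];
        if (b >= 0x80 && b <= 0x9F) return (size_t)-1;
        if (n == cap) return (size_t)-1;
        dst[n++] = b; /* one byte becomes one code point */
    }
    return n;
}

/* "Test. 真的." as code page 936 bytes: D5 E6 and B5 C4 are the two
   double-byte characters. */
const unsigned char kTestBytes[] = {
    'T', 'e', 's', 't', '.', ' ', 0xD5, 0xE6, 0xB5, 0xC4, '.'
};
```

Decoding `kTestBytes` this way yields 11 code points, with U+00D5, U+00E6, U+00B5 and U+00C4 (Õ, æ, µ, Ä) in place of the two Chinese characters. Adding the null terminator gives the 12 that the report sees in charCount.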

Sound familiar yet? :-)

When people started digging into the issue, they found that under the covers, MultiByteToWideChar was being called with the MB_ERR_INVALID_CHARS flag.

At first glance, that flag should protect developers from this kind of thing: if a character is invalid, there are times you would like it to be treated as such!

Unfortunately, like I pointed out back in 2007 in What's up with MB_ERR_INVALID_CHARS?, it doesn't always get to work this way.

In fact the byte in question (0x8F) is not defined in code page 1252, yet MB_ERR_INVALID_CHARS does not reject it. So you get this "ignore it" behavior, with the byte mapped to a control character that comes up as garbage.
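A caller who wants strict behavior today could scan the input before converting it. This is only a sketch under stated assumptions: `find_undefined_cp1252` and `is_undefined_in_cp1252` are hypothetical helpers, not CRT or Win32 functions, and the list is the five byte values that have no assigned character in code page 1252 (0x81, 0x8D, 0x8F, 0x90, 0x9D).

```c
#include <assert.h>
#include <stddef.h>

/* Bytes that have no assigned character in Windows code page 1252. */
static int is_undefined_in_cp1252(unsigned char b)
{
    return b == 0x81 || b == 0x8D || b == 0x8F || b == 0x90 || b == 0x9D;
}

/* Hypothetical helper (not a CRT or Win32 function): returns the index
   of the first byte that is undefined in code page 1252, or len if
   every byte is defined. */
size_t find_undefined_cp1252(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        if (is_undefined_in_cp1252(s[i])) return i;
    }
    return len;
}
```

If the function returns anything other than `len`, the caller can treat the string as invalid instead of passing it on to the conversion. A check like this catches 0x8F, but it cannot catch bytes such as 0xD5 0xE6, which are perfectly valid in 1252.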

So the backcompat issue rears its ugly head, with the argument being that this behavior has always been there.

When I think of all the breaks that have been introduced in the last few years in code pages for stated reasons like security hardening and Unicode conformance, I wonder whether it is a good time to question these issues and clean up crap like this.

Though I may be the only one who feels that way....


Random832 on 4 Sep 2009 11:32 AM:

"Test. 真的."; //this is a 9 character string with 2 Chinese characters.

No, it's an 11-byte string (not including the null terminator) containing the bytes that the compiler put there. Which happen to be 0xD5 0xE6 0xB5 0xC4, all of which are, obviously, defined and valid in codepage 1252.

There is no way for any function taking an ANSI string to distinguish this from a codepage 1252 string that actually is "Test. ÕæµÄ.", so it's not really a "the behavior has always been there" thing.

Michael S. Kaplan on 4 Sep 2009 1:30 PM:

Um, if you look at the Connect bug you will see that some of the examples deal with the case I describe (I did not pull that lead byte out of my ear!)....

