Garbage in, garbage out -- and this means Ü!

by Michael S. Kaplan, published on 2009/07/20 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2009/07/20/9840848.aspx


So the other day a colleague over in C++-ville forwarded a bug report he was looking at, one he wanted my thoughts about.

It went something like this:

Repro Steps:
+ Change language in Control Panel under "Regional and Language Options", tab "Advanced", to Japanese
+ Make new C++ console project with MFC support
+ Change Character set to Multi-Byte (_MBCS)
+ Compile following code (0xFC => "ü"):


    char* test= "\xfc";
    CString tt;
    tt = test;
    tt.MakeUpper();


+ Run or Debug the program => Crashes when MakeUpper() is called.

The result of running this code, as the last step mentions, is a crash in the "Microsoft Visual C++ Debug Library".

Obviously, the question was whether this was a bug....

Now one of the cool things about the CRT, MFC, and ATL from the developer's point of view is that you don't have to take my word for it, you can look at the source if you don't believe me!

In this case, CStringA's MakeUpper function calls an internal function StringUppercase that calls (when non-Unicode data is passed) CharUpperA or CharUpperBuffA -- in this case I think CharUpperA. These functions both call LCMapString eventually, after some convolutions of their own....

But the important point is the

0xFC => "ü"

claim in the repro steps.

This is very true in Windows code page 1252, but since the steps require changing the default system locale to be Japanese, the code is running with Windows code page 932.

And on that code page, 0xFC is not LATIN SMALL LETTER U WITH DIAERESIS; it is a valid lead byte, which means a trail byte is expected to follow it.

So now we see what is happening.

Since a legal lead byte was found, the assumption is that a legal trail byte will follow. And when that attempt to access the non-existent trail byte happens, a crash occurs.
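To make that failure mode concrete, here is a minimal, portable sketch. It is not the actual CharUpperA or LCMapString code; the names is_cp932_lead_byte and naive_dbcs_scan are hypothetical, and the lead byte ranges (0x81-0x9F and 0xE0-0xFC) are the ones Windows reports for code page 932. The scanner takes an explicit buffer size so the overrun can be observed safely rather than actually happening.

```cpp
#include <cstddef>

// Hypothetical helper: lead byte ranges for Windows code page 932.
bool is_cp932_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical naive scanner: trusts every lead byte to have a trail byte,
// so it advances two bytes without checking what the second byte is.
// Returns the offset of the NUL it stops at, or `size` if it ran off the
// end of the buffer without finding one.
std::size_t naive_dbcs_scan(const unsigned char* buf, std::size_t size) {
    std::size_t i = 0;
    while (i < size && buf[i] != 0) {
        i += is_cp932_lead_byte(buf[i]) ? 2 : 1;
    }
    return i < size ? i : size;
}
```

Given the bytes { 0xFC, 0x00, 'X', 'Y', 0x00 }, this scanner stops at offset 4 instead of offset 1, because the real terminator was consumed as a "trail byte". Given just { 0xFC, 0x00 }, it runs off the end of the buffer entirely.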

So whose bug would it be?

Sorting out what should be changed/fixed or if anything should be is something I'll leave to the owners to track down.... :-)

John Cowan on 20 Jul 2009 8:36 PM:

If a system function causes a user program to crash when the user passes in bad data, the answer is: It's Microsoft's fault.

Michael S. Kaplan on 21 Jul 2009 2:54 AM:

Um, okay.

But as I went to some length to explain, there are a lot of potential places to look here -- notice that even if you are right, there are three groups across two divisions involved.

Sure I can blame Google if there is a bug in a Google product. But I doubt that the investigative process within Google wouldn't be a bit more granular in its determination of where the bug might lie, or that a blog post (were someone to be as forthcoming about it as I have been) would not do the same....

Tim Greenwood on 22 Jul 2009 11:33 AM:

If we were discussing an end user application that crashed when given bad data then John would be correct. The author of the application would be at fault. But this is incorrect data being passed to a low level programming function. Complaining about a crash here is analogous to complaining about a segmentation error if I wrote

    int *pa=0;
    (*pa)++;

Random832 on 9 Aug 2009 11:56 PM:

Of course, the opposite assumption to the NLS one (that a lead byte is always followed by a trail byte) is a somewhat widespread C/C++ assumption - i.e. a zero byte (or a zero WCHAR) ends a string, full stop.

Which is probably where the ATLMFC people are coming from.

And you did mix up one thing - reading the byte after \xfc (i.e. the non-existent trail byte) is fine - that's what it's there for, it's the string terminator. It's reading (and/or writing) the byte after that, and so on until encountering another zero that _isn't_ preceded by a lead byte, because it missed the real null terminator because it skipped over it thinking it was a trail byte, that is the problem.

Ultimately it depends on whether a buffer containing { 0xFC, 0x00 } is a valid "null-terminated string". If it is, then the bug is in CharUpperA (and would be in any length-determining function that would not always return 1 on this buffer).
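As an illustration of the "always return 1 on this buffer" behavior, here is a hedged sketch of a character counter that treats a zero byte as the end of the string no matter what precedes it. The names is_lead_byte_932 and dbcs_char_count are hypothetical and not part of any Windows, CRT, or MFC API; the lead byte ranges are assumed to be those of code page 932.

```cpp
#include <cstddef>

// Hypothetical helper: lead byte ranges for Windows code page 932.
bool is_lead_byte_932(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical counter: a lead byte consumes a trail byte only if the next
// byte is not the terminator. A dangling lead byte counts as one character.
std::size_t dbcs_char_count(const char* s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    std::size_t count = 0;
    while (*p != 0) {
        // p[1] is safe to read: *p is non-zero, so at worst p[1] is the NUL.
        p += (is_lead_byte_932(*p) && p[1] != 0) ? 2 : 1;
        ++count;
    }
    return count;
}
```

With this approach dbcs_char_count("\xfc") is 1 and the scan never moves past the terminator. Whether a dangling lead byte should count as a character, be rejected, or be replaced is a separate design decision that this sketch does not settle.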

Michael S. Kaplan on 11 Aug 2009 12:40 PM:

Some follow-up thoughts and the actual cause here...


referenced by

2009/08/11 On the nonubiquitousness of Ü
