On the nonubiquitousness of Ü

by Michael S. Kaplan, published on 2009/08/11 12:29 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2009/08/11/9864576.aspx

So, in a quick follow-up to Garbage in, garbage out -- and this means Ü! from the other day, the investigation into the cause has happened:

ok, what happens is that MFC is actually throwing an exception, because _mbsupr_s() does return EILSEQ. The crash is caused by the unhandled exception.

if you put a try catch around the MakeUpper, you'll see the exception being caught:

        try {
        } catch(CException *) {
            printf("mfc exception\n");

_mbsupr_s() is called by Checked::mbsupr_s from StringUppercase:

        static LPSTR __cdecl StringUppercase( _Inout_cap_(size) LPSTR psz, _In_ size_t size ) throw()
            Checked::mbsupr_s(reinterpret_cast< unsigned char* >( psz ), size);
            return psz;

_mbsupr_s() returns EILSEQ because 0xFC is a lead byte in MBCS Japanese.

Okay, so now blame can be assigned. :-)

Everyone is clearly doing what they meant to be doing, at some level. Though the question of whether there should be a different way to handle this case is an interesting one (an exception in a MakeUpper method call seems like it may be somewhat unexpected.

Why do I think that?

Well, the data itself is invalid on its face -- the meaning of a solitary 0xFC in dats purporting to be Japanese is invalid.

But is it truly the job of MakeUpper to do data validation?

At the point where the data is already in a string and one wishes to uppercase it is perhaps not the best time for anyone to start screaming about failing to tag second base.

And I am not so keen on MakeUpper in the role of umpire, either :-)

So while I will still say the caller of the code, the who made the invalid code page assumption (about the ubiquity of Ü) that is at fault, it is hard to wonder whether this small amount of over-officiousness is also at least a little bit to blame for the problem, too.

This is not unlike the things that bother me about over-validation in codepage conversions -- in both cases I'd want options to have the functions I call do less work and be less demanding of their notion of data purity.

Just a desire to trust the caller unless the caller asks for validation of some kind....

I'll admit that I am till wrapping my head around the problem here, but I thought I'd run it up the flagpole and see what others thought.

Ideas, anyone?

bg on 12 Aug 2009 5:51 AM:

the exception in makeupper does seem a little extreme. but you kinda want that security layer that the xxx_s stuff gives you as well. you can't really have one and not the other these days.

are there a lot of corner issues stopping the user from validating the string first?

oh and don't u need to call delete on that exception ;)


Random832 on 13 Aug 2009 6:54 AM:

It's funny, because my reaction is the opposite - when I thought it was walking past the end of the stream I thought its behavior was wrong, whereas with it throwing the exception it's more reasonable. The only arguable "fault" is in throwing an exception rather than leaving it as an error return that can be more easily ignored.

Since it has to do the actual case mapping (I assume) in unicode, the other "reasonable" thing to do would be to silently replace the invalid character with a substitution character. As it turns out, if I try to convert this character, I get U+30FB, which becomes cp932 0x8145 - which would not necessarily _fit_ in the buffer. The obvious answer would be to use 0x3F (or maybe 0xA5, the halfwidth version of this character, though this would require that specific information in the codepage) instead.

Mihai on 27 Aug 2009 2:34 PM:

I tend to agree with bg here: exception is ok, especially in a function that promises secure behavior.

Michael S. Kaplan on 27 Aug 2009 8:07 PM:

The MakeUpper method does not have a specific, documented secure behavior guaranteed though, AFAIK. What does this do to the opinion?

go to newer or older post, or back to index or month or day