If the data is invalid, the results can be invalid too

by Michael S. Kaplan, published on 2007/08/29 03:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/29/4624037.aspx

They say that a good lawyer never asks a question in court without already knowing the answer.

Well, I'd probably make a lousy lawyer.

Because when I was doing the research for "What's up with MB_ERR_INVALID_CHARS?", I did not know the full extent of the limitations of the flag.

But given what I discovered, I made some recommendations.

Though I find myself really agreeing with Yossi and the comment Yossi left:

This inconsistency is pretty bad (the difference between how the actual Code page and best fit tables treat invalid characters). It renders MultiByteToWideChar pretty much useless in certain cases where these invalid characters find their way into the output stream.

I'm using MSXML2 to read an XML file which was produced after converting an MBCS character stream to Unicode. Since the following characters:

0x81 0x0081
0x8d 0x008d
0x8f 0x008f
0x90 0x0090
0x9d 0x009d

in the 1252 best fit appear to be "OK", MSXML2 just fails to parse the file.

Is there a way to resolve this problem (other than scanning the stream in a for-loop and replacing these invalid characters)?

Is there a version of MSXML2 that is consistent with the behavior of MultiByteToWideChar?

I do find myself curious about what method MSXML2 is using for its conversions, given that it manages to fail on characters that are technically mapped in the code page 1252 that the system defines. How is this component doing its conversions, exactly?

But on the other hand, I am left with the knowledge that this never-before-defined behavior involves bytes that are hardly useful in a stream of text.

So if you are seeing them, it is entirely reasonable to consider the text to be corrupt. What is that expression? Garbage in, garbage out.
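
For anyone who wants to act on that, here is a minimal sketch in Python (not the Win32 API, and not anything MSXML2 is known to do) of checking a byte stream for the five values from Yossi's list before converting it. The names find_undefined_bytes and decode_cp1252_checked are made up for this illustration, and rejecting the whole input is just one possible policy; logging the offsets and going after the source of the data may be more useful.

```python
# The five cp1252 byte values from the comment above. The best-fit table
# maps each of them to the C1 control code point with the same number.
UNDEFINED_CP1252_BYTES = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def find_undefined_bytes(data):
    """Return a list of (offset, byte value) pairs for every suspect byte."""
    return [(i, b) for i, b in enumerate(data) if b in UNDEFINED_CP1252_BYTES]


def decode_cp1252_checked(data):
    """Decode bytes as cp1252, refusing input that contains a suspect byte.

    Hypothetical helper: treating any of these bytes as a sign of corrupt
    text is an assumption of this sketch.
    """
    suspects = find_undefined_bytes(data)
    if suspects:
        offset, value = suspects[0]
        raise ValueError(
            "%d undefined cp1252 byte(s); first is 0x%02x at offset %d"
            % (len(suspects), value, offset)
        )
    return data.decode("cp1252")


if __name__ == "__main__":
    sample = b"caf\xe9 \x81ok\x9d"
    for offset, value in find_undefined_bytes(sample):
        print("offset %d: 0x%02x" % (offset, value))
```

Python's own cp1252 codec already treats these five bytes as undefined and raises UnicodeDecodeError on them in strict mode, so the explicit scan mainly buys a clearer report of where the garbage sits.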

And then of course relying on code pages in this day and age is not the best plan even when you stay within the valid mapped characters that make up the long-documented portions of the code pages.

The best thing to do is just stay away from them, especially if you think you might have invalid data like Yossi was seeing (or maybe investigate whatever is converting text to these unexpected code points!).

This post brought to you by ǻ (U+01fb, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE)
