What's up with MB_ERR_INVALID_CHARS?

by Michael S. Kaplan, published on 2007/07/25 03:11 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/07/25/4037646.aspx


It seems like it was just yesterday that I posted about how TAV is in the public use area.

Admittedly the reason is that it was only yesterday.... :-)

Now there is one issue that has never been discussed, and which frankly has never been described well anywhere in the docs -- the way that the code points with no assigned character in the code pages are being mapped to the Private Use Area (for the sake of EUDC, i.e. end-user-defined characters, if you want to believe the comments!), or elsewhere.

Now I have talked about MB_ERR_INVALID_CHARS before, like in "A few of the gotchas of MultiByteToWideChar" and "How does it detect invalid characters?", and believe you me I will be rethinking that flag's usefulness a bit from here on!

Anyway....

Here are the "strange" characters in each code page, both the weirdly mapped and the ones mapped to the PUA, all of which are documented in the original tables as having no mapping:

Code page 874 (best fit table):

0x81 0x0081
0x82 0x0082
0x83 0x0083
0x84 0x0084
0x86 0x0086
0x87 0x0087
0x88 0x0088
0x89 0x0089
0x8a 0x008a
0x8b 0x008b
0x8c 0x008c
0x8d 0x008d
0x8e 0x008e
0x8f 0x008f
0x90 0x0090
0x98 0x0098
0x99 0x0099
0x9a 0x009a
0x9b 0x009b
0x9c 0x009c
0x9d 0x009d
0x9e 0x009e
0x9f 0x009f
0xdb 0xf8c1 ;Undefined -> EUDC
0xdc 0xf8c2 ;Undefined -> EUDC
0xdd 0xf8c3 ;Undefined -> EUDC
0xde 0xf8c4 ;Undefined -> EUDC
0xfc 0xf8c5 ;Undefined -> EUDC
0xfd 0xf8c6 ;Undefined -> EUDC
0xfe 0xf8c7 ;Undefined -> EUDC
0xff 0xf8c8 ;Undefined -> EUDC

Code page 932 (best fit table) -- SBCS only:

0x80 0x0080  ;Control
0xa0 0xf8f0 ;Undefined -> EUDC
0xfd 0xf8f1 ;Undefined -> EUDC
0xfe 0xf8f2 ;Undefined -> EUDC
0xff 0xf8f3 ;Undefined -> EUDC

Code page 936 (best fit table) -- SBCS only:

0xff 0xf8f3 ;?

Code page 949 (best fit table) -- SBCS only:

0x80 0x0080 ;Undefined -> Control
0xff 0xf8f7 ;Undefined -> EUDC

Code page 950 (best fit table) -- SBCS only:

0x80 0x0080 ;Undefined -> Control
0xff 0xf8f8 ;Undefined -> EUDC

Code page 1250 (best fit table):

0x81 0x0081
0x83 0x0083
0x88 0x0088
0x90 0x0090
0x98 0x0098

Code page 1251 (best fit table):

0x98 0x0098

Code page 1252 (best fit table):

0x81 0x0081
0x8d 0x008d
0x8f 0x008f
0x90 0x0090
0x9d 0x009d

Code page 1253 (best fit table):

0x81 0x0081
0x88 0x0088
0x8a 0x008a
0x8c 0x008c
0x8d 0x008d
0x8e 0x008e
0x8f 0x008f
0x90 0x0090
0x98 0x0098
0x9a 0x009a
0x9c 0x009c
0x9d 0x009d
0x9e 0x009e
0x9f 0x009f
0xaa 0xf8f9 ;Undefined -> EUDC
0xd2 0xf8fa ;Undefined -> EUDC
0xff 0xf8fb ;Undefined -> EUDC

Code page 1254 (best fit table):

0x81 0x0081
0x8d 0x008d
0x8e 0x008e
0x8f 0x008f
0x90 0x0090
0x9d 0x009d
0x9e 0x009e

Code page 1255 (best fit table):

0x81 0x0081 ;Undefined -> Control
0x8a 0x008a ;Undefined -> Control
0x8c 0x008c ;Undefined -> Control
0x8d 0x008d ;Undefined -> Control
0x8e 0x008e ;Undefined -> Control
0x8f 0x008f ;Undefined -> Control
0x90 0x0090 ;Undefined -> Control
0x9a 0x009a ;Undefined -> Control
0x9c 0x009c ;Undefined -> Control
0x9d 0x009d ;Undefined -> Control
0x9e 0x009e ;Undefined -> Control
0x9f 0x009f ;Undefined -> Control
0xd9 0xf88d ;Undefined -> EUDC
0xda 0xf88e ;Undefined -> EUDC
0xdb 0xf88f ;Undefined -> EUDC
0xdc 0xf890 ;Undefined -> EUDC
0xdd 0xf891 ;Undefined -> EUDC
0xde 0xf892 ;Undefined -> EUDC
0xdf 0xf893 ;Undefined -> EUDC
0xfb 0xf894 ;Undefined -> EUDC
0xfc 0xf895 ;Undefined -> EUDC
0xff 0xf896 ;Undefined -> EUDC

Code page 1256 (best fit table):

None (I talked about this in a previous post.)

Code page 1257 (best fit table):

0x81 0x0081
0x88 0x0088
0x8a 0x008a
0x8c 0x008c
0x90 0x0090
0x98 0x0098 ;Not Used
0x9a 0x009a
0x9c 0x009c
0x9f 0x009f
0xa1 0xf8fc ;Undefined -> EUDC
0xa5 0xf8fd ;Undefined -> EUDC

Code page 1258 (best fit table):

0x81 0x0081 ;Undefined -> Control
0x8a 0x008a ;Undefined -> Control
0x8d 0x008d ;Undefined -> Control
0x8e 0x008e ;Undefined -> Control
0x8f 0x008f ;Undefined -> Control
0x90 0x0090 ;Undefined -> Control
0x9a 0x009a ;Undefined -> Control
0x9d 0x009d ;Undefined -> Control
0x9e 0x009e ;Undefined -> Control

So in the end, every single code point from 0x00 to 0xFF is accounted for in the database, and the MB_ERR_INVALID_CHARS flag functions only by detecting the characters that map to the PUA (the 0xf8xx entries above, which were highlighted in green in the original post) -- all of the ones that map to control characters are NOT considered invalid.

Further, if you don't pass MB_ERR_INVALID_CHARS, then none of them will be replaced by the default character -- they map as indicated here.
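
To make that rule concrete, here is a toy model in Python. It is not the Windows API and it calls nothing from it. The names `CP1253_ODD`, `is_pua` and `model_convert` are made up for illustration, the table holds only a handful of rows copied from the code page 1253 list above, and passing every other byte through unchanged is a simplification for the demo. The one thing it tries to show is the behavior described here: with the flag, only the PUA-mapped bytes are rejected; without it, everything maps as listed.

```python
# Toy model of the rule described in this post; NOT the real MultiByteToWideChar.
# A few rows copied from the code page 1253 list above (byte -> UTF-16 code unit).
CP1253_ODD = {
    0x81: 0x0081,  # undefined byte mapped to a C1 control
    0x88: 0x0088,  # undefined byte mapped to a C1 control
    0xAA: 0xF8F9,  # undefined byte mapped to the PUA ("EUDC")
    0xD2: 0xF8FA,  # undefined byte mapped to the PUA ("EUDC")
    0xFF: 0xF8FB,  # undefined byte mapped to the PUA ("EUDC")
}

PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # Private Use Area in the BMP


def is_pua(code_point):
    """True if the code point lies in the BMP Private Use Area."""
    return PUA_FIRST <= code_point <= PUA_LAST


def model_convert(data, table, err_invalid_chars=False):
    """Hypothetical stand-in for the conversion.

    Bytes found in `table` map as listed; any other byte is passed through
    unchanged (a demo simplification, not what a real code page does).
    With err_invalid_chars=True, only PUA-mapped bytes raise an error.
    """
    out = []
    for byte in data:
        code_point = table.get(byte, byte)
        if err_invalid_chars and is_pua(code_point):
            raise ValueError(
                "byte 0x%02x maps to PUA U+%04X" % (byte, code_point)
            )
        out.append(chr(code_point))
    return "".join(out)
```

In this model, `model_convert(b"A\x81", CP1253_ODD, err_invalid_chars=True)` succeeds and hands back the C1 control U+0081 untouched, while `model_convert(b"\xaa", CP1253_ODD, err_invalid_chars=True)` raises. Without the flag the second call quietly returns U+F8F9.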

Now as I said at the beginning of this post, very little of this is currently documented. I'll make sure that the doc folks get to hear about this to address the problem at some point....

 

This post brought to you by ת (U+05ea, a.k.a. HEBREW LETTER TAV)


# ctate on 31 Jul 2007 2:34 PM:

I'm afraid this is only tangentially related to the topic here, but it's similar and I am coming up with dead zero information about it, so here goes...

Take an application built without /DUNICODE, i.e. one using straight ASCII and code pages and so on.  Let's also assume for now that we're using the default US locale and code page set, with no jiggery-pokery.

If that application, running on Windows XP [or 98 or 95 or...] calls ExtTextOut() to write the character 0x7f, a solid block is drawn on the screen.

In Vista, nothing is drawn.  In fact, nothing is drawn in Vista when calling ExtTextOut() in this application with any extended ASCII character.

This is a problem... how is a non-Unicode application running in Vista supposed to draw a solid block, or diacritics, or...?  I realize that MS is trying to push people towards using Unicode, but for legacy apps this is major engineering, especially when those apps' source is shared across platforms.

Do you at least know of any documentation on this Vista change and how to work with it?

# Michael S. Kaplan on 31 Jul 2007 5:11 PM:

0x7f is not a character - it is the delete code. There are all kinds of wonky things that code can do, this would be one of them....

# Yossi on 28 Aug 2007 7:36 PM:

This inconsistency is pretty bad (the difference between how the actual code page and best fit tables treat invalid characters). It renders MultiByteToWideChar pretty much useless in certain cases where these invalid characters find their way into the output stream.

I'm using MSXML2 to read an XML file which was produced after converting an MBCS character stream to Unicode. Since the following characters:

0x81 0x0081
0x8d 0x008d
0x8f 0x008f
0x90 0x0090
0x9d 0x009d

in the 1252 best fit table appear to be "OK", MSXML2 just fails to parse the file.

Is there a way to resolve this problem (other than scanning the stream in a for-loop and replacing these invalid characters)?

Is there a version of MSXML2 that is consistent with the behavior of MultiByteToWideChar?


referenced by

2012/02/20 Where short file names can fail

2009/09/04 When conversions ignore the errors...

2008/05/08 In hindsight, they may have BEST FIT these files where the sun never shines

2007/08/29 If the data is invalid, the results can be invalid too
