TAV is in the public use area

by Michael S. Kaplan, published on 2007/07/24 13:06 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/07/24/4031609.aspx


Some time yesterday, Chris Shearer Cooper asked in a whole mess of newsgroups (microsoft.public.il.hebrew.vc, microsoft.public.il.hebrew.windows2000, microsoft.public.win32.programmer.international, and microsoft.public.win32.programmer.kernel):

Can anyone tell me what MultiByteToWideChar() does if you pass it an invalid input character but do not set the MB_ERR_INVALID_CHARS flag?

By "invalid character" I have a specific test case, which is trying to convert character 0xFA in code page 1255 (Hebrew) into Unicode.  There is no character at that position in that code page.

I'm trying to understand the documentation for MultiByteToWideChar(), it says an invalid character is "a character that is not the default character in the source string but translates to the default character when MB_ERR_INVALID_CHARS is not set".  How do I know what "the default character" is in the source string, and I assume that "the default character" could be a different value in the destination string?

What I see on my machine, is that MultiByteToWideChar() succeeds (as it should, I didn't set the MB_ERR_INVALID_CHARS flag) and in my output string it has placed the Unicode character 0xF894 which is in the "Private Use" area of Unicode.  Can I rely on MultiByteToWideChar() to convert invalid 8-bit characters to Unicode characters in the range E000-F8FF?  Or is there a Windows function to get the value of "the default character"?

Thanks,
Chris

Thankfully, one of those newsgroups was one I follow. :-)

Anyway, Bob Eaton got there first and pointed out a bunch of facts to help explain what appears there in this case:

On my XP/SP2 system, the code page 1255 (Ansi--Hebrew) shows the following information:

Max bytes/character: 1
Default legacy character: (?) (i.e. question mark = d63)
Default Unicode character: (?) (i.e. the same = u003f)

and it does have a Unicode equivalent for d250 (=0xFA) which is ת (u05ea).

But I think the answer to your question is if you were to pass it an invalid input character (which I don't think 0xFA is), then it should return a question mark character at that position.

The Windows function to get the default character information is: GetCPInfoEx

Bob
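For anyone who wants to check the values Bob quotes on their own machine, here is a minimal sketch that calls GetCPInfoExW through Python's ctypes. The helper name query_code_page is made up for illustration, the structure layout is transcribed from the documented CPINFOEXW definition, and the function simply returns None when it is not running on Windows. Treat it as a rough probe under those assumptions rather than a definitive implementation.

```python
import ctypes
import sys

# Sizes taken from the Windows SDK headers.
MAX_DEFAULTCHAR = 2
MAX_LEADBYTES = 12
MAX_PATH = 260


class CPINFOEXW(ctypes.Structure):
    """Mirror of the CPINFOEXW structure filled in by GetCPInfoExW."""
    _fields_ = [
        ("MaxCharSize", ctypes.c_uint),
        ("DefaultChar", ctypes.c_ubyte * MAX_DEFAULTCHAR),
        ("LeadByte", ctypes.c_ubyte * MAX_LEADBYTES),
        ("UnicodeDefaultChar", ctypes.c_wchar),
        ("CodePage", ctypes.c_uint),
        ("CodePageName", ctypes.c_wchar * MAX_PATH),
    ]


def query_code_page(code_page):
    """Hypothetical helper: return (max bytes per character, default legacy
    character bytes, default Unicode character), or None when not on Windows."""
    if sys.platform != "win32":
        return None
    info = CPINFOEXW()
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # The second argument (dwFlags) is reserved and must be 0.
    if not kernel32.GetCPInfoExW(code_page, 0, ctypes.byref(info)):
        raise ctypes.WinError(ctypes.get_last_error())
    return info.MaxCharSize, bytes(info.DefaultChar), info.UnicodeDefaultChar


if __name__ == "__main__":
    print(query_code_page(1255))
```

The values Bob lists above (one byte per character, a question mark for both defaults) are what he saw on XP SP2; other systems may report something different.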

I was also unable to repro the problems that Chris was reporting, and Windows code page 1255 does have a character there. And I never saw any private use character in the mix, U+f894 or other. Given the valid character (U+05ea, HEBREW LETTER TAV) is very much in public use I am not sure where the PUA character might have come from for Chris. It is certainly not expected.

I responded a bit after that to talk a bit about MB_ERR_INVALID_CHARS (or not) and how it affected things (or not), not much else to say since Bob had really covered the major points.

And then funnily enough, Ken "Skywing" Johnson, the Windows SDK MVP who is behind Nynaeve (Adventures in Windows debugging and reverse engineering), who I didn't even know had any idea that I existed (but probably would have if I had looked at his blogroll any of the times I had dropped in on his blog!), put in a small postscript:

<OffTopic>
Funny.  I was just thinking "you know, this would be a great Michael Kaplan
blog question", and here you are following the newsgroups.
</OffTopic>

Well I can't ignore something like that, can I? :-)

Anyway, looking at the tables for cp1255, while 0xFA does indeed map to TAV, 0xFB (which is undefined) maps to the PUA (you cannot see it here but you can see it here). This is part of a misguided attempt long ago to let invalid data roundtrip:

0xf88d 0xd9 ;EUDC -> Undefined
0xf88e 0xda ;EUDC -> Undefined
0xf88f 0xdb ;EUDC -> Undefined
0xf890 0xdc ;EUDC -> Undefined
0xf891 0xdd ;EUDC -> Undefined
0xf892 0xde ;EUDC -> Undefined
0xf893 0xdf ;EUDC -> Undefined
0xf894 0xfb ;EUDC -> Undefined
0xf895 0xfc ;EUDC -> Undefined
0xf896 0xff ;EUDC -> Undefined 

(I do not imagine that EUDC usage is all that common in Hebrew!)

So, mystery solved -- a classic off-by-one error! :-)
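To make the off-by-one concrete, here is a small sketch that rebuilds the roundtrip mapping from the table excerpt above. It leans on Python's built-in cp1255 codec for the defined bytes, which is an assumption: that codec is not the Windows table, and it rejects undefined bytes outright instead of mapping them to the PUA. The names PUA_FALLBACK, to_unicode, and to_byte are made up for illustration, and only the ten bytes listed in the excerpt are covered.

```python
# Entries copied from the table excerpt: "<Unicode code point> <cp1255 byte> ;comment".
TABLE_EXCERPT = """\
0xf88d 0xd9 ;EUDC -> Undefined
0xf88e 0xda ;EUDC -> Undefined
0xf88f 0xdb ;EUDC -> Undefined
0xf890 0xdc ;EUDC -> Undefined
0xf891 0xdd ;EUDC -> Undefined
0xf892 0xde ;EUDC -> Undefined
0xf893 0xdf ;EUDC -> Undefined
0xf894 0xfb ;EUDC -> Undefined
0xf895 0xfc ;EUDC -> Undefined
0xf896 0xff ;EUDC -> Undefined
"""

# Map each undefined cp1255 byte to its private use code point.
PUA_FALLBACK = {}
for line in TABLE_EXCERPT.splitlines():
    code_point, byte_value = line.split(";")[0].split()
    PUA_FALLBACK[int(byte_value, 16)] = int(code_point, 16)


def to_unicode(byte_value):
    """Decode one cp1255 byte, using the PUA entry when the byte is undefined."""
    try:
        return bytes([byte_value]).decode("cp1255")
    except UnicodeDecodeError:
        # Raises KeyError for undefined bytes that are not in the excerpt.
        return chr(PUA_FALLBACK[byte_value])


def to_byte(char):
    """Reverse direction: a PUA character from the excerpt goes back to its byte."""
    for byte_value, code_point in PUA_FALLBACK.items():
        if ord(char) == code_point:
            return byte_value
    return char.encode("cp1255")[0]


if __name__ == "__main__":
    for b in (0xFA, 0xFB):
        ch = to_unicode(b)
        print(f"0x{b:02X} -> U+{ord(ch):04X} -> 0x{to_byte(ch):02X}")
```

Byte 0xFA comes out as U+05EA (TAV) while its neighbor 0xFB comes out as U+F894, which is the value Chris reported, and both go back to the byte they started from.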

 

This post brought to you by ת (U+05ea, a.k.a. HEBREW LETTER TAV), the first letter in תשעה באב, which (apropos of nothing) is actually today.



referenced by

2007/07/25 What's up with MB_ERR_INVALID_CHARS?
