Not every code page value is supported

by Michael S. Kaplan, published on 2005/08/02 02:30 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/08/01/446475.aspx


The other day, John Bates asked in the suggestion box:

This suggestion is probably just a documentation update, but here goes.

One of my applications (compiled for Unicode) allows the caller to specify a code page for output. During testing I found WideCharToMultiByte works for most CPs but it fails for 1200, 1201, 12000 and 12001. The "Code-Page Identifiers" page lists these as valid CP values, but my system's NLS key doesn't have any values for these CPs.

Is there something that has to be installed for this to work, or is there another API (or series of APIs) that should be called instead?

I think there's a need for a (simple) encoding-to-encoding conversion API!

Regards,

John Bates

Well, I will have to take this apart one piece at a time. :-)

Now, if there ever were a function to handle "code page" 1200, it would not be WideCharToMultiByte, which has the job of converting UTF-16 LE into a byte-based encoding of some type, and by no stretch of the imagination can "cp 1200" be considered such a thing. :-)

I'll break that one-piece-at-a-time rule for the rest -- the other three "code pages", 1201, 12000, and 12001 (a.k.a. UTF-16 BE, UTF-32 LE, and UTF-32 BE), fall under a similar rule. They are not byte-based and thus really not something I would want to see us bend the WideCharToMultiByte and MultiByteToWideChar functions to handle. It is (in my humble opinion) unfortunate that we went this route with the Encoding class in the .NET Framework, but that is not by itself a reason to mess up the model for the Win32 NLS API functions....

Further, there is no need for a conversion to "code page 1200", since that would be converting something to itself. If you want to convert an LPWSTR or a WCHAR * to an LPBYTE or a BYTE *, just cast it and you are done -- no need to go through a conversion function....

As for the UTF-16 BE, UTF-32 LE, and UTF-32 BE cases, Murray Sargent of Microsoft once explained to Asmus Freytag of Unicode fame (who accosted me at a Unicode conference to make a similar demand for UTF-32 support) that there was no need for this -- the conversions in question are macros and do not have to be full functions. I think Asmus mostly backed down after being out-accosted, but I very much appreciated the support. :-)

The only useful excuse for functions in any of these cases would of course be to also handle validation (i.e. is it actual, valid Unicode), and I do not want to minimize that. But it is not a reason to back down from that model (in my opinion). Perhaps it is a reason for another function in the Win32 NLS API for these types of conversions, if there were a lot of customer requests that expressed such a need. We are not quite there yet, though; at this point those macros can still handle the immediate need....

Sorry, John. :-( But I will talk to someone about the doc issue here, in any case. :-)

 

This post brought to you by "𐒑" (U+10491, a.k.a. OSMANYA LETTER MIIN)
A character that is just as comfortable being U+10491 as it is being U+D801 U+DC91, because it is not self-conscious about its weight. :-)


Aaron Ballman on 1 Apr 2008 1:11 PM:

Maybe I'm just being dense... but!  Let's say I have input bytes in UTF8 (CP_UTF8), and I want to convert them to UTF-16BE (1201).  Wouldn't I just call MultiByteToWideChar to convert UTF8 to UTF-16LE, and then WideCharToMultiByte to convert UTF-16LE to UTF-16BE?  If not, then how would I do it, aside from writing my own byte-swapping function?  How does that change if I want to go from UTF-8 to UCS4 BE?

UTF16/UCS2 mess on 4 Jul 2010 7:08 AM:

>>> Further, there is no need to have a conversion with "code page 1200" since that is converting something to itself

It looks like MS has mixed up the UTF-16 and UCS-2 encodings. UTF-16 is not UCS-2. Conversion involving UTF-16 is absolutely necessary.

Michael S. Kaplan on 4 Jul 2010 7:33 AM:

Microsoft Windows currently assumes all the data is valid UTF-16 but will strip out invalid stuff and replace it with the replacement character. In theory this could make cp1200 useful, but it will take more than wishing to make it so -- these functions have no "colonic" capabilities of that nature...


referenced by

2007/08/11 Documentation does not always imply existence
