Not every code page value is supported

by Michael S. Kaplan, published on 2005/08/02 02:30 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/08/01/446475.aspx


The other day, John Bates asked in the suggestion box:

This suggestion is probably just a documentation update, but here goes.

One of my applications (compiled for Unicode) allows the caller to specify a code page for output. During testing I found WideCharToMultiByte works for most CPs but it fails for 1200, 1201, 12000 and 12001. The "Code-Page Identifiers" page lists these as valid CP values, but my system's NLS key doesn't have any values for these CPs.

Is there something that has to be installed for this to work, or is there another API (or series of APIs) that should be called instead?

I think there's a need for a (simple) encoding-to-encoding conversion API!

Regards,

John Bates

Well, I will have to take this apart one piece at a time. :-)

Now, if there ever were a function to handle "code page" 1200, it would not be WideCharToMultiByte, which has the job of converting UTF-16 LE into a byte-based encoding of some type, and by no stretch of the imagination can "cp 1200" be considered such a thing. :-)

I'll break that one-piece-at-a-time rule for the rest -- the other three "code pages", 1201, 12000, and 12001 (a.k.a. UTF-16 BE, UTF-32 LE, and UTF-32 BE), fall under a similar rule. They are not byte-based and thus really not something I would want to see us bend the WideCharToMultiByte and MultiByteToWideChar functions to handle. It is (in my humble opinion) unfortunate that we went this route with the Encoding class in the .NET Framework, but that is not by itself a reason to mess up the model for the Win32 NLS API functions....

Further, there is no need for a conversion to "code page 1200", since that would be converting something to itself. If you want to convert an LPWSTR or a WCHAR * to an LPBYTE or a BYTE *, just cast it and you are done -- no need to go through a conversion function....

As for the UTF-16 BE, UTF-32 LE, and UTF-32 BE cases, Murray Sargent of Microsoft once explained to Asmus Freytag of Unicode fame (who accosted me at a Unicode conference to make a similar demand for UTF-32 support) that there was no need for this -- the conversions in question are macros and do not have to be full functions. I think Asmus mostly backed down after being out-accosted, but I very much appreciated the support. :-)

The only useful excuse for functions in any of these cases would of course be to also handle validation (i.e. is it actual, valid Unicode), and I do not want to minimize that. But it is not a reason to back down from that model (in my opinion). Perhaps it is a reason for another function in the Win32 NLS API for these types of conversions, if there were a lot of customer requests that expressed such a need. We are not quite there yet, though; at this point those macros can still handle the immediate need....

Sorry, John. :-( But I will talk to someone about the doc issue here, in any case. :-)

 

This post brought to you by "𐒑" (U+10491, a.k.a. OSMANYA LETTER MIIN)
A character that is just as comfortable being U+10491 as it is being U+D801 U+DC91, because it is not self-conscious about its weight. :-)


Aaron Ballman on 1 Apr 2008 1:11 PM:

Maybe I'm just being dense... but!  Let's say I have input bytes in UTF8 (CP_UTF8), and I want to convert them to UTF-16BE (1201).  Wouldn't I just call MultiByteToWideChar to convert UTF8 to UTF-16LE, and then WideCharToMultiByte to convert UTF-16LE to UTF-16BE?  If not, then how would I do it, aside from writing my own byte-swapping function?  How does that change if I want to go from UTF-8 to UCS4 BE?

UTF16/UCS2 mess on 4 Jul 2010 7:08 AM:

>>> Further, there is no need to have a conversion with "code page 1200" since that is converting something to itself

It looks like MS has mixed up the UTF-16 and UCS-2 encodings. UTF-16 is not UCS-2. Conversion involving UTF-16 is absolutely necessary.

Michael S. Kaplan on 4 Jul 2010 7:33 AM:

Microsoft Windows currently assumes all the data is valid UTF-16 but will strip out invalid stuff and replace it with the replacement character. In theory this could make cp1200 useful, but it will take more than wishing to make it so -- these functions have no "colonic" capabilities of that nature...


referenced by

2007/08/11 Documentation does not always imply existence
