by Michael S. Kaplan, published on 2010/07/30 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/07/30/10033760.aspx
A familiar question I got the other day:
We are considering porting a Win32 application to use Unicode for internal string handling and are trying to decide which encoding to use. We would like to use UTF-8 and wondered whether there is any way to tell the application to use this encoding without having to compile using the _UNICODE compiler switch. What we'd really like is to be able to call an equivalent to the SetConsoleCP function for a full blown Win32 application as opposed to just a console application. We have tried to achieve this by changing the locale setting but see no way to set code page 65001 using the SetThreadLocale API function. Is there any way to do this without compiling with the _UNICODE compiler switch?
Thanks in advance for your help.
I think I've been asked this one before!
The reason Andreas was unable to find a way to do this is that there is no way to do this.
You cannot have a CP_ACP that is 65001 (UTF-8). Period.
Having that would make migration to Unicode "easier", for developer "customers".
But the cost to Microsoft would be scrubbing literally thousands (and when I say thousands I mean the high thousands!) of functions to make sure they behave okay with UTF-8 (a significant percentage of them will not, in fact).
There would likely be little other time to do feature work in the next release of Windows beyond this one huge feature that most users would neither understand nor care about. That would be a hard sell to management, believe me!
Now this is not to say it hasn't been prototyped to see what works, etc. Because it has (at least twice, over the years)....
Pascal Craponne on 30 Jul 2010 7:26 AM:
I had the same problem, a few years ago.
We finally decided to write a wrapper for the console, in order to convert internal unicode strings to something that could be displayed in the console.
This worked fine, although not every character could be displayed. Since the console was not our main output device, that was good enough for us.
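A minimal sketch of that kind of wrapper, written in Python for brevity. The function name `to_console_bytes` and the choice of code page 437 as the console code page are assumptions for illustration, not details from the comment above. Characters the console code page cannot represent are replaced with `?`, which is the "not every character could be displayed" trade-off.

```python
def to_console_bytes(text: str, console_encoding: str = "cp437") -> bytes:
    """Encode an internal Unicode string for a console code page.

    Characters that the code page cannot represent become '?'
    instead of raising an error.
    """
    return text.encode(console_encoding, errors="replace")


if __name__ == "__main__":
    sample = "caf\u00e9 \u65e5\u672c"  # "café" followed by two kanji
    shown = to_console_bytes(sample).decode("cp437")
    print(shown)  # café ??
```

The accented letter survives because code page 437 contains it; the two kanji do not.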
Michael S. Kaplan on 30 Jul 2010 8:06 AM:
Note that if you run the same app with some of the modifications I posted, many more characters can be displayed.
And if you run it in Powershell ISE, then you can see pretty much all of the characters!
John Cowan on 30 Jul 2010 8:23 AM:
Yet CP_ACP can be set to mixed 8/16 code pages, correct? It's strange that UTF-8 would break in ways that SJIS (say) would not. Not that I doubt you, it's just peculiar.
Michael S. Kaplan on 30 Jul 2010 8:40 AM:
UTF-8's principal source of breakage as an ACP is in all the code that assumes a maximum of two bytes per character. As an OEMCP the bugs are more subtle but are still present in a few cases....
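The two-bytes-per-character assumption can be illustrated with a short Python sketch. It uses Python's `cp932` codec as a stand-in for a double-byte Windows "ANSI" code page, and the helper names are hypothetical. It shows nothing about any particular Windows function, only why a buffer sized at two bytes per character stops being enough once the encoding is UTF-8.

```python
def max_bytes_per_char(text: str, encoding: str) -> int:
    """Largest number of bytes any single character of text needs."""
    return max(len(ch.encode(encoding)) for ch in text)


def fits_two_byte_buffer(text: str, encoding: str) -> bool:
    """True if the encoded text fits a buffer sized at 2 bytes per character."""
    return len(text.encode(encoding)) <= 2 * len(text)


if __name__ == "__main__":
    sample = "\u65e5\u672c\u8a9e"  # three kanji
    print(max_bytes_per_char(sample, "cp932"))    # 2
    print(max_bytes_per_char(sample, "utf-8"))    # 3
    print(fits_two_byte_buffer(sample, "cp932"))  # True
    print(fits_two_byte_buffer(sample, "utf-8"))  # False
    # Characters outside the Basic Multilingual Plane need four bytes.
    print(len("\U0001F600".encode("utf-8")))      # 4
```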
Yuhong Bao on 31 Jul 2010 12:58 PM:
Well, you know that it is standard for Unix nowadays, don't you? Which is why it is unfortunate that the CRT does not support UTF-8.
Michael S. Kaplan on 31 Jul 2010 2:12 PM:
I know it's the standard for UNIX. As is writing all their shell scripts in Notepad. :-)
Yuhong Bao on 31 Jul 2010 3:36 PM:
Yep, I remember that UTF-8 BOM fiasco, a curse of the Microsoft bureaucracy. What do you think of that issue?
Michael S. Kaplan on 31 Jul 2010 3:44 PM:
I've already written about it. UNIX shell scripts authors are a bunch of whiners who shouldn't use programs from companies they claim to hate so much!
Yuhong Bao on 31 Jul 2010 3:48 PM:
Yes, but I am talking about the UTF-8 BOM fiasco in general, not just shell scripts.
Yuhong Bao on 31 Jul 2010 3:55 PM:
And when I mention "a curse of the Microsoft bureaucracy", I am particularly thinking of your patch which added not only UTF-8 without BOM support but also OEM code page support handy for editing Windows batch files.
Michael S. Kaplan on 31 Jul 2010 3:56 PM:
I don't see it as a fiasco. The scenario is valid and Microsoft makes use of the scenario. End of story....
Yuhong Bao on 31 Jul 2010 4:07 PM:
BTW, even better would be to provide a text box in the Open/Save dialogs where the user can type in a codepage number, which of course would require an UI change.
Michael S. Kaplan on 31 Jul 2010 4:25 PM:
Better for who? Unicode is the future, code pages are the past -- guess which one is the priority in top level Windows UI? :-)
Dan Bishop on 31 Jul 2010 5:11 PM:
Where is all this code that assumes a maximum of two bytes per character? Seems to me that a more common false assumption would be ONE byte per character. Or assuming that you can use a byte-oriented Find function for strings, which breaks for the Asian code pages but NOT for UTF-8.
I'm working on some cross-platform libraries (using UTF-8 strings) and it's VERY frustrating that I can't even call fopen on Windows because the character encoding is wrong.
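The point about byte-oriented Find can be sketched in Python, again using the `cp932` codec as a stand-in for an Asian double-byte code page. The function name `naive_byte_find` is hypothetical. In that code page the kanji U+8868 is encoded as the bytes 0x95 0x5C, and 0x5C on its own is the backslash, so a byte search for a backslash reports a match inside the kanji. UTF-8 lead and continuation bytes are all 0x80 or above, so the same search cannot produce a false hit on an ASCII character.

```python
def naive_byte_find(haystack: str, needle: str, encoding: str) -> int:
    """Byte-oriented search that ignores character boundaries.

    Returns the byte offset of the first match, or -1 if none.
    """
    return haystack.encode(encoding).find(needle.encode(encoding))


if __name__ == "__main__":
    kanji = "\u8868"  # no backslash anywhere in this string
    print(kanji.encode("cp932"))                  # b'\x95\\'
    print(naive_byte_find(kanji, "\\", "cp932"))  # 1  (false match on the trail byte)
    print(naive_byte_find(kanji, "\\", "utf-8"))  # -1 (no false match)
```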
Michael S. Kaplan on 31 Jul 2010 5:34 PM:
The "ANSI" code path. Since no Windows "ANSI" code page is ever more than two bytes per "character", the assumption is valid for all intended cases. Of course, this design limits the options for using some code pages, or UTF-8, as an ACP. It is a true design limitation, which was the point of this blog: there is no option to magically support a UTF-8 ACP.