Unicode without UNICODE/_UNICODE?

by Michael S. Kaplan, published on 2010/07/30 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/07/30/10033760.aspx


A familiar question I got the other day:

We are considering porting a Win32 application to use Unicode for internal string handling and are trying to decide which encoding to use. We would like to use UTF-8 and wondered whether there is any way to tell the application to use this encoding without having to compile using the _UNICODE compiler switch. What we'd really like is to be able to call an equivalent to the SetConsoleCP function for a full-blown Win32 application, as opposed to just a console application. We have tried to achieve this by changing the locale setting but see no way to set code page 65001 using the SetThreadLocale API function. Is there any way to do this without compiling with the _UNICODE compiler switch?

Thanks in advance for your help.

I think I've been asked this one before!

The reason Andreas was unable to find a way to do this is that there is no way to do this.

You cannot have a CP_ACP that is 65001 (UTF-8). Period.

Having that would make migration to Unicode "easier" for developer "customers".

But the cost to Microsoft would be scrubbing literally thousands (and when I say thousands I mean the high thousands!) of functions to make sure they behave okay with UTF-8 (a significant percentage of them will not, in fact).

There would likely be little other time to do feature work in the next release of Windows beyond this one huge feature that most users would neither understand nor care about. That would be a hard sell to management, believe me!

Now this is not to say it hasn't been prototyped to see what works, etc. Because it has (at least twice, over the years)....


Pascal Craponne on 30 Jul 2010 7:26 AM:

I had the same problem, a few years ago.

We finally decided to write a wrapper for the console, in order to convert internal Unicode strings to something that could be displayed in the console.

This worked fine, but of course not all characters could be displayed. Anyway, since the console was not our main output device, it was good enough for us.
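The kind of wrapper Pascal describes can be sketched as follows (an illustrative Python sketch, not his actual code, which was presumably native Win32; `to_console_bytes` and the choice of cp437 as the console code page are assumptions for the example). The conversion to a narrower console code page is exactly why not every character could be displayed:

```python
def to_console_bytes(text: str, console_cp: str = "cp437") -> bytes:
    """Convert an internal Unicode string to the console's code page.

    Characters outside the code page are replaced with '?', which is
    why not every character survives the trip to the console.
    """
    return text.encode(console_cp, errors="replace")

# 'é' exists in cp437 (byte 0x82); '€' does not and becomes '?'
print(to_console_bytes("é€"))  # b'\x82?'
```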

Michael S. Kaplan on 30 Jul 2010 8:06 AM:

Note that if you run the same app with some of the modifications I posted, many more characters can be displayed.

And if you run it in Powershell ISE, then you can see pretty much all of the characters!

John Cowan on 30 Jul 2010 8:23 AM:

Yet CP_ACP can be set to mixed single-/double-byte code pages, correct? It's strange that UTF-8 would break in ways that SJIS (say) would not. Not that I doubt you, it's just peculiar.

Michael S. Kaplan on 30 Jul 2010 8:40 AM:

UTF-8's principal source of breakage as an ACP is in all the code that assumes a maximum of two bytes per character. As an OEMCP the bugs are more subtle but are still present in a few cases....
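The two-bytes-per-character assumption Michael points to can be checked directly (a small Python illustration added for this archive; the specific characters are arbitrary examples):

```python
# A DBCS "ANSI" code page such as Shift-JIS (cp932) never needs more
# than two bytes per character, but UTF-8 needs up to four.
assert len("あ".encode("cp932")) == 2          # Shift-JIS: two bytes
assert len("あ".encode("utf-8")) == 3          # UTF-8: three bytes for the same character
assert len("\U0001F600".encode("utf-8")) == 4  # up to four for supplementary characters
print("all length checks pass")
```

Any code sized around a two-byte maximum per character would therefore truncate or corrupt UTF-8 data.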

Yuhong Bao on 31 Jul 2010 12:58 PM:

Well, you know that it is the standard for Unix nowadays, don't you? Which is why it is unfortunate that the CRT does not support UTF-8.

Michael S. Kaplan on 31 Jul 2010 2:12 PM:

I know it's the standard for UNIX. As is writing all their shell scripts in Notepad. :-)

Yuhong Bao on 31 Jul 2010 3:36 PM:

Yep, I remember that UTF-8 BOM fiasco, a curse of the Microsoft bureaucracy. What do you think of that issue?

Michael S. Kaplan on 31 Jul 2010 3:44 PM:

I've already written about it. UNIX shell script authors are a bunch of whiners who shouldn't use programs from companies they claim to hate so much!

Yuhong Bao on 31 Jul 2010 3:48 PM:

Yes, but I am talking about the UTF-8 BOM fiasco in general, not just shell scripts.

Yuhong Bao on 31 Jul 2010 3:55 PM:

And when I mention "a curse of the Microsoft bureaucracy", I am particularly thinking of your patch, which added not only support for UTF-8 without a BOM but also OEM code page support, handy for editing Windows batch files.

Michael S. Kaplan on 31 Jul 2010 3:56 PM:

I don't see it as a fiasco. The scenario is valid and Microsoft makes use of the scenario. End of story....

Yuhong Bao on 31 Jul 2010 4:07 PM:

BTW, even better would be to provide a text box in the Open/Save dialogs where the user can type in a code page number, which of course would require a UI change.

Michael S. Kaplan on 31 Jul 2010 4:25 PM:

Better for who? Unicode is the future, code pages are the past -- guess which one is the priority in top level Windows UI? :-)

Dan Bishop on 31 Jul 2010 5:11 PM:

Where is all this code that assumes a maximum of two bytes per character?  Seems to me that a more common false assumption would be ONE byte per character.  Or assuming that you can use a byte-oriented Find function for strings, which breaks for the Asian code pages but NOT for UTF-8.

I'm working on some cross-platform libraries (using UTF-8 strings) and it's VERY frustrating that I can't even call fopen on Windows because the character encoding is wrong.
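Dan's point about byte-oriented search can be demonstrated concretely (a Python sketch added for this archive, using the well-known Shift-JIS "ソ" trail-byte collision): in Shift-JIS the second byte of a double-byte character can fall in the ASCII range, so a naive byte search finds phantom matches, while every byte of a UTF-8 multi-byte sequence is ≥ 0x80:

```python
text = "ソ"                  # KATAKANA LETTER SO, U+30BD
sjis = text.encode("cp932")  # b'\x83\x5c' -- the trail byte is ASCII '\'
utf8 = text.encode("utf-8")  # b'\xe3\x82\xbd' -- no byte is in the ASCII range

assert b"\\" in sjis         # byte-oriented search finds a bogus backslash
assert b"\\" not in utf8     # UTF-8 lead/continuation bytes are never < 0x80
print("Shift-JIS false match demonstrated")
```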

Michael S. Kaplan on 31 Jul 2010 5:34 PM:

The "ANSI" code path. Since no Windows "ANSI" code page is ever more than two bytes per "character", the assumption is valid for all intended cases. Of course this design limits options for making use of some code pages or UTF-8 as an ACP; it is a true design limitation, which was the point of this blog -- that there is no option to magically support a UTF-8 ACP.


