Why can't the CP_ACP be UTF-8?

by Michael S. Kaplan, published on 2006/10/11 11:05 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/10/11/816996.aspx


Bart van der Werf asks:

I was working on getting an 8-bit string application ready for other languages and I was surprised to see that the ACP (ANSI code page) character encoding (the default encoding for 8-bit strings in Windows) couldn't accept UTF-8 as a valid MBCS encoding.

Why isn't this the case? It would allow all sorts of cleaned-up legacy apps to become multilingual overnight.

Now I'd be the first to criticize the search on this site, though there are many times it comes through. In this case a search on the terms ACP UTF-8 returns as its first item Can the ACP be UTF-8?, posted a few months ago (it finds a ton of others that are pretty unrelated, but what it finds first is what counts!).

To say a bit more on the topic, yes he is right -- it would make some operations much easier. But beyond that, given the current model that ties the default system code page to the default system locale, would this mean that you could only use such a UTF-8 ACP when dealing with Unicode-only locales? And would it really make internationalization easier to force a system-wide setting to be changed in order to make one application work? Ugh. Breaking any existing application that relies on the current behavior would be just as bad.

And then of course there is the cost: the huge PM/test/dev effort to bring every "A" API function (and the underlying kernel API functions as well) up to speed handling up to four bytes per character when so many of them are strictly limited to handling only one or two. Thousands of functions and whole systems (like window messaging). And some of these algorithms really do come from the original Win 3.x source for the "A" version!
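To make the "one or two bytes versus up to four" gap concrete, here is a minimal Python sketch. It only counts encoded bytes with Python's built-in codecs; it says nothing about how any "A" function is actually implemented. The choice of cp1252 and cp932 as stand-ins for a single-byte and a double-byte ACP is an assumption for illustration, and the helper names `encoded_length` and `max_bytes` are hypothetical.

```python
# Illustrative sketch (not Windows code): count the bytes one character
# occupies in a few encodings, using Python's built-in codecs.
# cp1252 and cp932 stand in for typical single- and double-byte ACPs.

CODECS = ["cp1252", "cp932", "utf-8"]


def encoded_length(ch, codec):
    """Return the number of bytes `ch` needs in `codec`, or None if
    the encoding cannot represent the character at all."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


def max_bytes(codec, sample):
    """Largest per-character byte count seen across `sample`,
    ignoring characters the encoding cannot represent."""
    lengths = [encoded_length(ch, codec) for ch in sample]
    return max(n for n in lengths if n is not None)


if __name__ == "__main__":
    sample = "A€日😀"
    for codec in CODECS:
        per_char = {ch: encoded_length(ch, codec) for ch in sample}
        print(codec, per_char, "max:", max_bytes(codec, sample))
```

Code written on the assumption that a character is never longer than two bytes (a lead byte plus at most one trail byte) has no place to put the third and fourth bytes that UTF-8 can produce, which is the kind of assumption that would have to be found and fixed function by function.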

It is not exaggerating to suggest that this effort could cost billions, impact thousands of people, and take years. For code that is and has been set up as legacy. There is simply no real way to even contemplate this as an effort that would get approval to proceed....

 

This post brought to you by Ω (U+03a9, a.k.a. GREEK CAPITAL LETTER OMEGA)


# Raymond Chen - MSFT on 11 Oct 2006 1:12 PM:

The C language also makes it difficult, since streams support only one character of pushback. Since a UTF-8-encoded character can require three characters of pushback, where do you put the other two?

# Ben Bryant on 11 Oct 2006 2:17 PM:

A quick way of saying it is that Windows 'A' supports single and double byte code pages, and not any other kind of "multi-byte" encoding. Another problem is all of the documentation that refers to 'W' APIs as "Unicode" and 'A' as "non-Unicode". To support Unicode in 'A' APIs would turn a great deal of Win32 documentation on its head.

# Michael S. Kaplan on 11 Oct 2006 4:06 PM:

Nobody ever seems to read the documentation, so we are probably safe on that part. :-)

# Ben Bryant on 11 Oct 2006 4:55 PM:

:)

# Leo Kislov on 11 Oct 2006 7:07 PM:

Any hint when the DOS console will be phased out? PowerShell seems like a good opportunity to do it. Push the DOS console into Accessories/System Tools and introduce a new Unicode console in Vista+1. Kill the DOS console 3 major releases after Vista.

# Dean Harding on 11 Oct 2006 7:43 PM:

I don't think you could ever KILL cmd.exe, but I wouldn't be averse to relegating it to a more hidden place.

Actually, I was a bit disappointed to see that PowerShell used the same host as cmd.exe. I hate the fact that you can't resize the window by just dragging the corners. It's so old-skool! Let's hope it's improved in version 2 :)

# Leo Kislov on 11 Oct 2006 8:39 PM:

Dean, I believe if there is a will there is a way to kill cmd.exe. Make PowerShell a parallel universe: a separate resizable Unicode window *without* the ability to call old command line programs. You don't really believe we're stuck with cmd.exe for the next 10,000 years, do you? So the question is actually how long it will take. My suggestion to keep cmd.exe in a closet for 3 major releases is probably too aggressive, OK, let's make it 6 releases.

# Dean Harding on 11 Oct 2006 10:32 PM:

> You don't really believe we're stuck with cmd.exe for the next 10,000 years, do you?

Actually, yes I do. Assuming we're still using Windows in 10,000 years, anyway. As long as companies rely on batch scripts that were written for Windows NT 3.5 (and there are plenty of them, just ask Raymond Chen), cmd.exe will have to hang around.

Think about it. If Windows 2015 removed cmd.exe, and you've got 1,500 lines of batch script that your organization DEPENDS on, what are you going to do? Spend 3 months rewriting it in PowerShell, testing, debugging and finally deploying it on your 900 desktops? Or are you just going to stick with Windows 2012, which ran them just fine? It's hard enough convincing companies to upgrade as it is...

# Bart van der Werf on 15 Oct 2006 11:49 AM:

Thanks for the response :)

Too bad the response to these kinds of issues always seems to come down to problems with compatibility with sloppy implementations.

Porting these applications to UTF-16 is not really possible because of 3rd party components that either have no UTF-16 support, assume UCS-2 or are discontinued.

# Michael S. Kaplan on 15 Oct 2006 12:43 PM:

Well, of course that was not the only reason I gave -- the 18 buttloads of work that could easily go into the billions is not to be sneered at, either? :-)

# Yuhong Bao on 21 Aug 2008 10:35 PM:

"Actually, I was a bit disappointed to see that PowerShell used the same host as cmd.exe."

Which actually IS Unicode capable, but you have to use a TrueType font. And it is in csrss.exe, not cmd.exe.

# Michael S. Kaplan on 22 Aug 2008 2:56 AM:

Well, sort of. Managed code in the CMD-hosted console has many non-Unicode limitations that have little to do with font selection....

# Yuhong Bao on 5 Mar 2010 8:54 PM:

"Managed code in the CMD-hosted console"

Which is not actually CMD-hosted, but CSRSS-hosted, or CONHOST-hosted in Win7 and later.

# Michael S. Kaplan on 5 Mar 2010 9:06 PM:

Not a relevant detail in this case, though entirely accurate. You sound a little like a Microsoft employee. :-)

# Yuhong Bao on 5 Mar 2010 9:19 PM:

Reminds me of this article:

http://blogs.msdn.com/michkap/archive/2007/05/11/2547703.aspx


referenced by

2008/08/15 Yet another time that UTF-8 can't be the ACP

2007/02/23 The MSL8 project? Cool!
