Myth busting in the console

by Michael S. Kaplan, published on 2010/10/07 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/10/07/10072032.aspx

I have been writing about internationalization and the console off and on for over five years now, and this fact highlights two problems with blogs:

Because of this, I thought I would take a little time to really summarize the current state of affairs, with examples in-line for shorter stuff and with links for the longer stuff. So you can just look here to get the full story in one place.

Random people trying to improve their console application story can use this blog to find out about everything they need to know. They are perhaps not in the ideal order here either, but at least they are all here. In a "myth-busting" format.

Myth #1: You cannot detect within a console application whether a console handle has been redirected to a file.

Given some of the central differences between applications that are principally used within the console and ones that expect to spend the bulk of their time redirected, this is a pretty important issue, especially given other myths related to the "least common denominator" of each when it comes to Unicode support.

This myth, however, is demonstrably false. You can (for example) check out the IsConsoleRedirected function from this blog to see how easy this is to do. There are many other examples but this seemed like the most contained.
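For readers who do not want to chase the link, here is a minimal sketch of the idea in C. It is not the IsConsoleRedirected function from the linked post. stream_is_console is a made-up name, and the sketch assumes that GetConsoleMode failing on a handle means the handle is a file or a pipe rather than a console. The non-Windows branch is only there so the sketch compiles elsewhere.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200809L /* exposes fileno() under strict -std=c11 */
#endif
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#include <io.h>
#else
#include <unistd.h>
#endif

/* Hypothetical helper: returns 1 if the stream is attached to a real
   console, 0 if it has been redirected to a file or a pipe. */
int stream_is_console(FILE *stream)
{
    assert(stream != NULL);
#ifdef _WIN32
    HANDLE h = (HANDLE)_get_osfhandle(_fileno(stream));
    DWORD mode;
    if (h == INVALID_HANDLE_VALUE)
        return 0;
    /* Assumption: GetConsoleMode only succeeds on console handles. */
    return GetConsoleMode(h, &mode) != 0;
#else
    /* Fallback so the sketch builds outside Windows. */
    return isatty(fileno(stream)) == 1;
#endif
}
```

Calling stream_is_console(stdout) at startup is enough to pick between the "real console" path and the "redirected" path that the later myths depend on.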

Myth #2: You cannot detect whether a console handle that has been redirected to a file is appending to an existing file or is creating a new file of its own.

Another myth to bust -- this one is trivial to detect, as I point out in the section entitled "First if all" in this blog:

A simple call to GetFileSizeEx will tell you that, immediately! Just pass in the stdout handle that you have already determined is a redirected file, and then you will know by the size of the file if they redirected with a > or a >>.
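A rough sketch of that size check, assuming you start from a CRT stream such as stdout rather than a raw handle. stream_file_size is a hypothetical helper name. On Windows it calls GetFileSizeEx as described above, and the fstat branch exists only so the sketch builds on other systems.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200809L /* exposes fileno() under strict -std=c11 */
#endif
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#include <io.h>
#else
#include <sys/stat.h>
#endif

/* Hypothetical helper: current size in bytes of the file behind a
   stream, or -1 if the size cannot be determined. */
long long stream_file_size(FILE *stream)
{
    assert(stream != NULL);
#ifdef _WIN32
    HANDLE h = (HANDLE)_get_osfhandle(_fileno(stream));
    LARGE_INTEGER size;
    if (h == INVALID_HANDLE_VALUE || !GetFileSizeEx(h, &size))
        return -1;
    return size.QuadPart;
#else
    struct stat st;
    if (fstat(fileno(stream), &st) != 0)
        return -1;
    return (long long)st.st_size;
#endif
}
```

Call it on stdout before writing anything. A nonzero result means the user appended with >> to a file that already has content. Zero means either > or a >> onto a new or empty file, and in both of those cases you are starting the file yourself.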

Myth #3: You should never add a BOM (byte order mark) to console application output.

Given the truths behind Myth #1 and Myth #2, this one obviously is weird. If you

then you can write U+FEFF, the Unicode Byte Order Mark. Let the underlying encoding do its thing with the BOM and you don't need to worry about anything.
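To make "let the underlying encoding do its thing" concrete: in UTF-8, U+FEFF comes out as the three bytes EF BB BF, and in UTF-16LE as FF FE. The sketch below is a plain UTF-8 encoder for code points up to U+FFFF, written only to show the arithmetic. utf8_encode_bmp is a hypothetical name, and in a real console application the CRT or Win32 conversion layer would normally do this for you.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes one code point from the Basic
   Multilingual Plane as UTF-8. Returns the number of bytes written
   (1 to 3), or 0 for surrogates and code points above U+FFFF. */
size_t utf8_encode_bmp(unsigned int cp, unsigned char out[3])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0; /* outside the scope of this sketch */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Passing 0xFEFF fills the buffer with EF BB BF, which is exactly the UTF-8 signature that tools look for at the start of a file.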

Myth #4: When appending console output to an existing file you don't/can't know what the encoding is.

Now this myth causes all kinds of problems because one can easily talk oneself into being unable to safely support anything beyond ASCII -- and I have seen people do this very thing.

But untrue is untrue, and in Orwellian terms this one is perhaps even doubleplusuntrue.

A quick call to GetFinalPathNameByHandle gets you the path, and then you can look at the contents, see what is in there, and make the appropriate decision (if you need it to be pre-Vista you can use code like this).

Remember that the file is guaranteed to be opened to you or else all of your write operations would fail. So you are the one person with access.

Combine the knowledge behind Myth #3 and Myth #4 with the general truth that people who "redirect append" do so to files created by the very same application, or by applications that use similar techniques. In most cases you can then look at the first few bytes of the file and detect rather precisely whether it is Unicode.
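Looking at the first few bytes can be sketched as below. This is an assumption-laden illustration, not code from any of the linked posts: bom_encoding and sniff_bom are made-up names, only the UTF-8 and UTF-16 signatures are checked (UTF-32 is ignored), and a file with no BOM simply comes back as "unknown", which does not prove it is not Unicode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for the sketch. */
typedef enum {
    ENC_UNKNOWN, /* no recognized signature */
    ENC_UTF8,    /* EF BB BF */
    ENC_UTF16LE, /* FF FE */
    ENC_UTF16BE  /* FE FF */
} bom_encoding;

/* Hypothetical helper: inspects the first bytes of an existing file
   and reports which byte order mark, if any, it starts with. */
bom_encoding sniff_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16BE;
    return ENC_UNKNOWN;
}
```

In the append case you would read the first three bytes from the path you recovered, pass them in, and then write your own output in whatever encoding the file already uses.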

Myth #5: The console does not support Unicode.

This is a pretty weird one, and between all of the mystery surrounding the default use of CP_OEMCP in the console and the longstanding poor documentation and bugs surrounding the console in general and the CRT in particular, it has been one of the most enduring myths of all time.

But once you use the knowledge behind Myth #1's debunking to determine that you are in the console, you can call the WriteConsoleW and ReadConsoleW Win32 API functions. They have supported Unicode for not quite as long as CMD.EXE has existed, but certainly on any version of Windows you are likely to see.

Myth #6: The console is able to support Unicode, but sometimes it doesn't work and for some characters it doesn't work -- and you can't ever tell what's what.

Thank goodness the myth was worded this way, so I can once again say it is wrong. Wrong, wrong, WRONG.

By the use of those two functions, you can know exactly what is supportable/supported. From there you can choose (if you so wish) to make intelligent decisions on how to proceed, even perhaps going so far as to warn people what they ought to be doing instead if they are likely to be unable to support the text your console application might want to output.

Myth #7: The Microsoft Visual C Run-time library (msvcrt*) doesn't support Unicode output to a Unicode console.

before calling Unicode console functions like wprintf and getting back error text is all you need to have the CRT do all the work behind Myth #1 and properly handle console output. This has been true since Visual Studio 2005 (VC++ 8.0).

that will do much wonderfuler things for Unicode text in the console using the CRT.
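The call that the text above refers to did not survive in this copy of the post. One CRT call that matches the description is _setmode with the _O_U16TEXT flag, so the sketch below assumes that is what was meant. use_unicode_stdout is a hypothetical wrapper name, and the non-Windows branch is a stub that only keeps the sketch compilable.

```c
#include <stdio.h>

#ifdef _WIN32
#include <io.h>
#include <fcntl.h>
#endif

/* Hypothetical wrapper: switches stdout to the CRT's UTF-16 text mode.
   Returns 0 on success, -1 on failure. */
int use_unicode_stdout(void)
{
#ifdef _WIN32
    /* Flush anything already buffered before changing the mode. */
    fflush(stdout);
    /* Assumption: this is the call the post had in mind. */
    return _setmode(_fileno(stdout), _O_U16TEXT) == -1 ? -1 : 0;
#else
    /* Stub: there is no CRT translation mode to switch here. */
    return 0;
#endif
}
```

After a successful call you would write with the wide functions, for example wprintf(L"%ls\n", text). As far as I know, the narrow functions such as printf are not meant to be used on a stream in this mode, so pick one family and stick with it.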

Myth #8: The Microsoft Visual C Run-time library (msvcrt*) completely supports Unicode text processing in a Unicode console.

Ah, I must have lulled you into complacency after Myth #7 was proven wrong, and you took it too far.

Because the truth is that there is a bug that has existed in one form or another in every version of the CRT since 2005 that makes it so that even though stdout and stderr can handle Unicode, stdin cannot.

I discuss the issue in this blog, if you are interested in details. It is one of my most fervent hopes that this bug is fixed at the next available opportunity in every place it can be fixed.

And since one cannot count on either hopes or prayers to make such things happen, I am trying to make sure that the fix is made by more conventional means within my job....

Myth #9: You should really use resource fallback to handle the scenario of a console that cannot fully support the text.

As this blog discusses at length, the claim that blogs like this one make about the need to force the resource loader to do fallback in order to avoid problems in the console has many flaws in it.

There are scenario-based flaws (e.g. that the world is teeming with people writing console applications that output Arabic text), implementation flaws (e.g. the ridiculous fallback to en-us for most complex script locales guarantees that a lot of text that might have succeeded, such as that used by many European languages, will fail due to the en-us use of a 437 CP_OEMCP), and conceptual flaws (e.g. fixing the UI language does nothing for text created by the user locale, like date formats, which will have the same problem and no ready sensible solution).

As I explained in the debunking of Myth #6, the answer if you detect a case where the application may not fully support the text is at best to detect and warn the person, and at worst to output the junk but make sure that the right documentation or KB articles are available to tell people what is going on, so they can address the problem.

your console application does itself and its users a tremendous disservice by unconditionally falling back.

Now the world of MUI_CONSOLE_FILTER isn't the worst thing you can do to users, since it is not legally assault by a software developer. But it is still pretty bad....

Myth #10: Okay, I am convinced. You can support Unicode in pretty much all of the console.

Wow, that is incorrect. I'm sorry I lulled you into thinking everything works. Across all of the built-in commands in CMD.EXE itself and all of the common binaries that extend the console, many support Unicode but not all of them do.

No list exists of which commands and executables fall into each category, or of the rules about each one (for example, find.exe never supports Unicode, while as I discussed in this blog the type intrinsic completely supports Unicode but the file must have a BOM in front of it). Such a list would be really cool, but it does not currently exist and no one seems to want to take the time to create it....

Plus occasionally other long-standing issues can exist like the one I talked about in this blog, a bug that none other than Mark Zbikowski ended up fixing for me in the Vista cmd.exe. The effort was truly appreciated.

Myth #11: All of the defaults in CMD.EXE like whether you are using a TrueType font are out of your control.

You can change this setting in any CMD.EXE shortcut -- something that every single VS, SDK, and WDK console shortcut should be doing!!!

I am totally serious here. I can understand why people would be afraid to change the defaults in CMD.EXE itself for back-compat reasons of changing behavior in legacy console apps, but the shortcuts? Seriously, these should get updated.

Myth #12: You can't change the setting of whether a console window is using a TrueType font.

I have used a few console API functions and that IsConsoleFontTrueType function from this blog to change the font within a console window to a TrueType font, from code running in the console window.

This is something I would never recommend in production code, mind you; I only did it because someone told me it wasn't possible and I was sure she was mistaken.

The impact of the accomplishment was interesting, mind you; she and I dated for about a month after that. ;-)

And the story there may be worthy of its own dedicated myth-busting blog, along with the code itself. If people are interested, I mean. Let me know....

Myth #13: There are thirteen incorrect myths about support of Unicode text within the console that this blog will talk about.

If I say you are wrong and that there are only 12 myths, then this is a genuine myth. Which means there are 13 myths.

But if there are 13 myths then you are right and this isn't a myth, it's just a fact. In which case there are just 12.

This post reminded me of an issue I had with Unicode output in the post build event of a Visual Studio project. When I build the one-line program below and run it from a new cmd or PowerShell window, I get the expected output of "'\u202F' is ' '". However, when I call it from a project's post build event, the VS output window shows "'\u202F' is '?'".

class Program { static void Main() { System.Console.WriteLine("'\\u202F' is '\u202F'"); } }

I've messed with System.Console.OutputEncoding and chcp, but can't get the issue to reproduce outside of VS, or get VS to behave nicely (I did find that calling "chcp 65001" first produces the interesting output "'\u202F' is 'ΓÇ»'", which seems to be the interpretation of 0x202F under the US codepage :).

Just curious if you had any insight into what VS is up to!

Don't forget about the fact that CJK can't be displayed in the console without the console codepage being set to the needed codepage, and that requires that the OEMCP be set to the same codepage too.

Actually, that is not true (yet another myth!). Using the rules I have given and Unicode input/output:

1) Redirection works with no help needed;

2) Console display will show square boxes in CMD that you can copy/paste to show valid text;

3) PowerShell ISE console display is perfect.