Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT?

by Michael S. Kaplan, published on 2008/03/18 14:01 +00:00, original URI: http://blogs.msdn.com/michkap/archive/2008/03/18/8306597.aspx


Please read disclaimer; content of Michael Kaplan's blog not approved by Microsoft!

Once upon a time, everybody knew that the Earth was flat.

And not too long after that, everybody knew that the Sun orbited around the Earth instead of the other way around.

And most developers know that there is no way to get Unicode output support in the console that will work properly both in the console and when redirected to a file.

I was just looking at some internal guidelines for developers inside of Microsoft that bemoaned all the things that don't work in the console:

The main limitations can be summarized as:

  1. Unicode I/O with ReadConsoleW/WriteConsoleW is always supported; but default raster font will break it.
  2. Changing console codepage is always possible, but DBCS codepages are supported under DBCS system locales (not necessarily matching).
  3. The console does not support complex script languages such as Arabic or the various Indic languages that can only be rendered with Uniscribe. 
  4. Unicode I/O is supported through Win32, but with several limitations:

Now of these caveats that everybody knows, most of them are not true!

And not too long ago someone sent me a piece of mail about that second sub-bullet in #4 and the code he wrote to work around it:

I think I’m mostly ready. However, I’m still not quite certain about proper handling of redirection of output to a file. The site above suggests the option of using WriteConsole for normal output, and WriteFile+BOM for file output; this is the direction I’ve tried, but I wanted to get confirmation that I’m doing it correctly.

I’ve written the following function (mostly copied from other code):

static const WCHAR UNICODE_BOM = 0xFEFF;

void UPrint (LPCWSTR String) {
    DWORD ConsoleMode;
    BOOL ConsoleOutput;
    DWORD FileType;
    BOOL Result;
    HANDLE StdOut;
    DWORD StringCharCount;
    DWORD Written;

    //
    // StdOut describes the standard output device.  This can be the console
    // or (if output has been redirected) a file or some other device type.
    //
    StdOut = GetStdHandle(STD_OUTPUT_HANDLE);

    if (StdOut == INVALID_HANDLE_VALUE) {
        goto PrintExit;
    }

    //
    // Check whether the handle describes a character device.  If it does, then
    // it may be a console device.  A call to GetConsoleMode will fail with
    // ERROR_INVALID_HANDLE if it is not a console device.
    //
    FileType = GetFileType(StdOut);

    if ((FileType == FILE_TYPE_UNKNOWN) && (GetLastError() != ERROR_SUCCESS)) {
        goto PrintExit;
    }

    FileType &= ~(FILE_TYPE_REMOTE);

    if (FileType == FILE_TYPE_CHAR) {
        Result = GetConsoleMode(StdOut, &ConsoleMode);

        if ((Result == FALSE) && (GetLastError() == ERROR_INVALID_HANDLE)) {
            ConsoleOutput = FALSE;
        } else {
            ConsoleOutput = TRUE;
        }
    } else {
        ConsoleOutput = FALSE;
    }

    //
    // If StdOut is a console device then just use the UNICODE console write
    // API.  This API doesn't work if StdOut has been redirected to a file or
    // some other device.  In this case, write to StdOut using WriteFile.
    //

    StringCharCount = (DWORD) wcslen(String);

    if (ConsoleOutput != FALSE) {
        WriteConsoleW(StdOut,
                      (PVOID)String,
                      StringCharCount,
                      &Written,
                      NULL);
    } else {
        //
        // Write out a Unicode BOM to ensure proper processing by text readers
        //
        WriteFile(StdOut,
                  (PVOID)&UNICODE_BOM,
                  sizeof(UNICODE_BOM),
                  &Written,
                  NULL);

        //
        // The number of bytes to write to standard output must exclude the null
        // terminating character.
        //
        WriteFile(StdOut,
                  (PVOID)String,
                  (StringCharCount * sizeof(WCHAR)),
                  &Written,
                  NULL);
    }

PrintExit:
    return;
}

Based on a couple quick tests, this seems to do the right thing, but review from someone more familiar with the area would be much appreciated. :-)

Thanks!

Well, at the time the only comment he had gotten back was that it was a little odd that the BOM was being written on every call, since it should only be at the beginning of the file.
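
A minimal sketch of what "only at the beginning" could look like, not the mail author's actual fix: the caller keeps a flag and the BOM goes out on the first write only. The name write_utf16_once_bom and the bom_written flag are made up for illustration, and standard C fwrite on a binary stream stands in for WriteFile so the idea can be tried anywhere. Code units are written in host byte order, just as the WriteFile version writes WCHARs as they sit in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the quoted UPrint code).
 * Writes UTF-16 code units to `out`, emitting the BOM only while
 * *bom_written is still 0. Returns 0 on success, -1 on a short write.
 */
int write_utf16_once_bom(FILE *out, int *bom_written,
                         const uint16_t *units, size_t count)
{
    static const uint16_t bom = 0xFEFF;

    assert(out != NULL && bom_written != NULL);

    if (!*bom_written) {
        /* First call for this stream: the BOM goes out exactly once. */
        if (fwrite(&bom, sizeof bom, 1, out) != 1) {
            return -1;
        }
        *bom_written = 1;
    }

    /* Every call, first or not, writes only the payload from here on. */
    if (count > 0 && fwrite(units, sizeof units[0], count, out) != count) {
        return -1;
    }
    return 0;
}
```

The flag lives with the caller rather than in a static so that each redirected stream can track its own state.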

Anyway, remember the other day in Some armchair root cause analysis of the suckage of lstrcmpi, when I mentioned that STL dropped by and we were talking about stuff?

One of the things I mentioned was this problem, and related ones like how you had to use binary mode to write out Unicode text with the CRT functions, thus losing all of the newline and line semantics. He agreed this was lame.
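
To make the complaint concrete: in binary mode nothing turns "\n" into CR LF for you, so the code has to do it by hand. A minimal sketch, assuming UTF-16 code units held in uint16_t and plain C fwrite on a binary stream; write_utf16_crlf is a made-up name, not a CRT function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: writes UTF-16 code units to a binary-mode stream,
 * expanding every LF (U+000A) into CR LF by hand. This is the newline
 * translation that text mode would otherwise perform.
 * Returns 0 on success, -1 on a short write.
 */
int write_utf16_crlf(FILE *out, const uint16_t *units, size_t count)
{
    static const uint16_t crlf[2] = { 0x000D, 0x000A };
    size_t i;

    assert(out != NULL);
    assert(units != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (units[i] == 0x000A) {
            /* Replace the lone LF with the CR LF pair. */
            if (fwrite(crlf, sizeof crlf[0], 2, out) != 2) {
                return -1;
            }
        } else if (fwrite(&units[i], sizeof units[i], 1, out) != 1) {
            return -1;
        }
    }
    return 0;
}
```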

Then last night he showed me that in both Visual Studio 2005 and 2008 (well, Visual C++ 8.0 and 9.0) it was not true!

Basically he created a file something like this:

#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void) {
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
    return 0;
}

And then I compiled the file in the Visual Studio 2005 command line (where I already had my console font set to Lucida Console):

cl /W4 foo.c

And that FOO.EXE worked beautifully, outputting the Cyrillic and ideographic text (кошка 日本国) without corruption, to the command line....

And when I redirected it to a file, it worked then too, writing out the file as Unicode!

Notepad opened it fine and detected it as Unicode even without the BOM; I did have to save it in Notepad to have the console see it that way when I used the type command in the console.
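
A minimal sketch for checking which of the two kinds of file you have, assuming the difference after the Notepad save comes down to a UTF-16LE BOM (the bytes FF FE) at the start of the file. The function name is made up for illustration, and the caller is assumed to have already read the first bytes of the file into a buffer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the buffer starts with the
 * UTF-16LE byte order mark (FF FE), 0 otherwise.
 */
int starts_with_utf16le_bom(const unsigned char *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);

    /* Fewer than two bytes cannot hold a UTF-16 BOM. */
    return len >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE;
}
```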

Here was that console window:

When I copied those boxes from the console and pasted them into Notepad, I once again got the Unicode text both times.

So much for conventional wisdom. All that WriteConsoleW blitting, the binary file mode, the chcp, the console output CP crap. All to get an answer not as cool as the above.

The earth? It isn't flat.

The sun? It doesn't orbit around the earth.

And the CRT? Starting in 2005/8.0, it knows more about Unicode than any of us have been giving it credit for....

The heroes of the day? _O_U16TEXT and _O_U8TEXT, which I will probably talk about more at some point.

Or you can look at the _setmode and _wsopen topics (the latter is the only place that _O_U16TEXT and _O_U8TEXT seem to be mentioned):

_O_U16TEXT
Open the file in Unicode UTF-16 mode. This option is available in Visual C++ 2005.

_O_U8TEXT
Open the file in Unicode UTF-8 mode. This option is available in Visual C++ 2005.

_O_WTEXT
Open the file in Unicode mode. This option is available in Visual C++ 2005. 

But it works right here too.
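
A minimal sketch of the UTF-8 sibling, assuming _O_U8TEXT can be handed to _setmode the same way _O_U16TEXT was in the program above. The helper name wprint_utf8 is made up for illustration, and the #ifdef is only there so the file still compiles where the Microsoft CRT is not available (in which case the helper just reports failure). Output still goes through the wide functions; only the bytes that land in a redirected file change.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

/*
 * Hypothetical helper: switch stdout to the CRT's UTF-8 translation
 * mode and print a wide string through wprintf.
 * Returns 0 on success, -1 on failure or when built without the
 * Microsoft CRT, where these translation modes do not exist.
 */
int wprint_utf8(const wchar_t *text)
{
    assert(text != NULL);
#ifdef _WIN32
    if (_setmode(_fileno(stdout), _O_U8TEXT) == -1) {
        return -1;
    }
    /* Keep using wide output calls once the stream is in this mode. */
    return (wprintf(L"%ls", text) < 0) ? -1 : 0;
#else
    (void)text;
    return -1;
#endif
}
```

On Windows, a call such as wprint_utf8(L"\x043a\x043e\x0448\x043a\x0430\n") from main would be the UTF-8 counterpart of the earlier test.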

Awesome, truly.

And conventional wisdom is quite retarded!

 

This post brought to you by U (U+0055, aka LATIN CAPITAL LETTER U)



referenced by

2010/12/01 The IN door can go a different way than the OUT door

2010/10/07 Myth busting in the console

2010/09/23 A confluence of circumstances leaves a stone unturned...

2010/06/27 Bugs hidden in plain sight, and commented that way too ANSWERS

2010/06/18 Bugs hidden in plain sight, and commented that way too

2010/05/07 Cunningly conquering communicated console caveats. Comprende, mon Capitán?

2010/04/07 Anyone who says the console can't do Unicode isn't as smart as they think they are

2009/12/01 When changing behavior is like killing puppies

2009/08/14 Header files are not retarded, aka What the @#%&* is _O_WTEXT?

2008/03/19 The forensic typographers found no link to Lucida Console, and the D.A. had nothing to fallback to

2008/03/19 Before you say "What's next?" you have to figure out the action items
