Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT?

by Michael S. Kaplan, published on 2008/03/18 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx


Please read disclaimer; content of Michael Kaplan's blog not approved by Microsoft!

Once upon a time, everybody knew that the Earth was flat.

And not too long after that, everybody knew that the Sun orbited around the Earth instead of the other way around.

And most developers know that there is no way to get Unicode output support in the console that will work properly both in the console and when redirected to a file.

I was just looking at some internal guidelines for developers inside of Microsoft that bemoaned all the things that don't work in the console:

The main limitations can be summarized as:

  1. Unicode I/O with ReadConsoleW/WriteConsoleW is always supported; but default raster font will break it.
  2. Changing console codepage is always possible, but DBCS codepages are supported under DBCS system locales (not necessarily matching).
  3. The console does not support complex script languages such as Arabic or the various Indic languages that can only be rendered with Uniscribe. 
  4. Unicode I/O is supported through Win32, but with several limitations:

Now, of these caveats that everybody knows, most are not true!

And not too long ago someone sent me a piece of mail about that second sub-bullet in #4 and the code he wrote to work around it:

I think I’m mostly ready. However, I’m still not quite certain about proper handling of redirection of output to a file. The site above suggests the option of using WriteConsole for normal output, and WriteFile+BOM for file output; this is the direction I’ve tried, but I wanted to get confirmation that I’m doing it correctly.

I’ve written the following function (mostly copied from other code):

static const WCHAR UNICODE_BOM = 0xFEFF;

void UPrint (LPCWSTR String) {
    DWORD ConsoleMode;
    BOOL ConsoleOutput;
    DWORD FileType;
    BOOL Result;
    HANDLE StdOut;
    DWORD StringCharCount;
    DWORD Written;

    //
    // StdOut describes the standard output device.  This can be the console
    // or (if output has been redirected) a file or some other device type.
    //
    StdOut = GetStdHandle(STD_OUTPUT_HANDLE);

    if (StdOut == INVALID_HANDLE_VALUE) {
        goto PrintExit;
    }

    //
    // Check whether the handle describes a character device.  If it does, then
    // it may be a console device.  A call to GetConsoleMode will fail with
    // ERROR_INVALID_HANDLE if it is not a console device.
    //
    FileType = GetFileType(StdOut);

    if ((FileType == FILE_TYPE_UNKNOWN) && (GetLastError() != ERROR_SUCCESS)) {
        goto PrintExit;
    }

    FileType &= ~(FILE_TYPE_REMOTE);

    if (FileType == FILE_TYPE_CHAR) {
        Result = GetConsoleMode(StdOut, &ConsoleMode);

        if ((Result == FALSE) && (GetLastError() == ERROR_INVALID_HANDLE)) {
            ConsoleOutput = FALSE;
        } else {
            ConsoleOutput = TRUE;
        }
    } else {
        ConsoleOutput = FALSE;
    }

    //
    // If StdOut is a console device then just use the UNICODE console write
    // API.  This API doesn't work if StdOut has been redirected to a file or
    // some other device.  In this case, write to StdOut using WriteFile.
    //

    StringCharCount = (DWORD) wcslen(String);

    if (ConsoleOutput != FALSE) {
        WriteConsoleW(StdOut,
                      (PVOID)String,
                      StringCharCount,
                      &Written,
                      NULL);
    } else {
        //
        // Write out a Unicode BOM to ensure proper processing by text readers
        //
        WriteFile(StdOut,
                  (PVOID)&UNICODE_BOM,
                  sizeof(UNICODE_BOM),
                  &Written,
                  NULL);

        //
        // The number of bytes to write to standard output must exclude the null
        // terminating character.
        //
        WriteFile(StdOut,
                  (PVOID)String,
                  (StringCharCount * sizeof(WCHAR)),
                  &Written,
                  NULL);
    }

PrintExit:
    return;
}

Based on a couple quick tests, this seems to do the right thing, but review from someone more familiar with the area would be much appreciated. :-)

Thanks!

Well, at the time the only comment he had gotten back was that it was a little odd that the BOM was being written on every call, since it should only appear at the beginning of the file.
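One way to act on that comment, sketched in portable C rather than Win32 so the idea stands on its own: remember whether the BOM has gone out, and emit it only ahead of the first write. The Utf16Writer struct and utf16_write function are hypothetical names invented for this illustration, and the sketch assumes the caller already has UTF-16 code units in hand and a stream opened in binary mode.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: remembers whether the BOM was already written. */
typedef struct {
    FILE *stream;      /* assumed to be opened in binary mode */
    int   bom_written; /* 0 until the first successful BOM write */
} Utf16Writer;

/* Writes `count` UTF-16 code units as little-endian bytes, preceded by a
   BOM on the first call only. Returns 0 on success, -1 on a write error. */
static int utf16_write(Utf16Writer *w, const uint16_t *units, size_t count)
{
    size_t i;

    if (!w->bom_written) {
        /* U+FEFF serialized little-endian */
        static const unsigned char bom[2] = { 0xFF, 0xFE };
        if (fwrite(bom, 1, sizeof bom, w->stream) != sizeof bom) {
            return -1;
        }
        w->bom_written = 1;
    }

    for (i = 0; i < count; i++) {
        unsigned char bytes[2];
        bytes[0] = (unsigned char)(units[i] & 0xFF); /* low byte first */
        bytes[1] = (unsigned char)(units[i] >> 8);
        if (fwrite(bytes, 1, sizeof bytes, w->stream) != sizeof bytes) {
            return -1;
        }
    }
    return 0;
}
```

Calling utf16_write twice on the same Utf16Writer produces one BOM followed by both strings, which is the shape a text reader expects.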

Anyway, remember the other day in Some armchair root cause analysis of the suckage of lstrcmpi, when I mentioned that STL dropped by and we were talking about stuff?

One of the things I mentioned was this problem, and related ones like how you had to use binary mode to write out Unicode text with the CRT functions, thus losing all of the newline and line semantics. He agreed this was lame.
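To make the newline complaint concrete: in binary mode the CRT no longer turns a line feed into a carriage return plus line feed on output, so code writing UTF-16 that way has to do the expansion itself. A rough sketch of that chore follows; expand_newlines_utf16 is a made-up name, and the input is assumed to already be UTF-16 code units.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies `count` UTF-16 code units from `in` to `out`, inserting U+000D
   before every U+000A, which is the translation text mode would have done
   on output. Returns the number of units written, or 0 if `out_cap` is too
   small (an empty input also yields 0). */
static size_t expand_newlines_utf16(const uint16_t *in, size_t count,
                                    uint16_t *out, size_t out_cap)
{
    size_t i, n = 0;

    for (i = 0; i < count; i++) {
        if (in[i] == 0x000A) {
            if (n == out_cap) {
                return 0; /* no room for the carriage return */
            }
            out[n++] = 0x000D;
        }
        if (n == out_cap) {
            return 0; /* no room for the unit itself */
        }
        out[n++] = in[i];
    }
    return n;
}
```

A worst-case output buffer is twice the input length, since every unit could be a line feed.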

Then last night he showed me that with both Visual Studio 2005 and 2008 (well, Visual C++ 8.0 and 9.0), it was not true!

Basically he created a file something like this:

#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void) {
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
    return 0;
}

And then I compiled the file in the Visual Studio 2005 command line (where I already had my console font set to Lucida Console):

cl /W4 foo.c

And that FOO.EXE worked beautifully, outputting the Cyrillic and Ideographic text (кошка 日本国) without corruption, to the command line....

And when I redirected it to a file, it worked then too, writing out the file as Unicode!

Notepad opened it fine and detected it as Unicode even without the BOM; I did have to save it in Notepad to have the console see it that way when I used the type command in the console.
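The type behavior is consistent with a plain BOM check rather than any guessing about the content. What such a check might look like, as a sketch only: the function and enum names are invented here, and this is not a claim about how cmd.exe or Notepad is actually implemented.

```c
#include <stddef.h>

/* Hypothetical result type for this illustration. */
typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE } BomKind;

/* Looks only at the first bytes of a buffer. Note that a UTF-32 LE BOM
   (FF FE 00 00) would be reported as UTF-16 LE by this simple check. */
static BomKind detect_bom(const unsigned char *data, size_t len)
{
    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return BOM_UTF8;
    }
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return BOM_UTF16_LE;
    }
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return BOM_UTF16_BE;
    }
    return BOM_NONE;
}
```

A redirected file with no BOM comes back as BOM_NONE here, and a reader that stops at this check would fall back to treating the bytes as a legacy code page.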

Here was that console window:

When I copied those boxes from the console and pasted them into Notepad, I once again got the Unicode text both times.

So much for conventional wisdom. All that WriteConsoleW blitting, the binary file mode, the chcp, the console output CP crap. All to get an answer not as cool as the above.

The Earth? It isn't flat.

The sun? It doesn't orbit around the Earth.

And the CRT? Starting in 2005/8.0, it knows more about Unicode than any of us have been giving it credit for....

The heroes of the day? _O_U16TEXT and _O_U8TEXT, which I will probably talk about more at some point.

Or you can look at the _setmode and _wsopen topics (the latter is the only place that _O_U16TEXT and _O_U8TEXT seem to be mentioned):

_O_U16TEXT
Open the file in Unicode UTF-16 mode. This option is available in Visual C++ 2005.

_O_U8TEXT
Open the file in Unicode UTF-8 mode. This option is available in Visual C++ 2005.

_O_WTEXT
Open the file in Unicode mode. This option is available in Visual C++ 2005. 

But it works right here too.

Awesome, truly.

And conventional wisdom is quite retarded!

 

This post brought to you by U (U+0055, aka LATIN CAPITAL LETTER U)


# Ian Griffiths on 18 Mar 2008 1:50 PM:

The "flat earth" model was never a popular one, despite the surprisingly persistent myth to the contrary. There's no evidence to support this assertion, and plenty of evidence to support the idea that people have believed that the Earth was roundish for as long as there are records indicating that people were asking the question "What shape is the earth?"

If you're looking for good examples of conventional wisdom that's wrong, then ironically, "in the old days, people thought the earth was flat" turns out to be a good example, but not for the reasons you supposed. It's not that the belief in a flat earth was conventional wisdom. It's the belief in the belief in a flat earth that is the example of wrong conventional wisdom!

# Marc Durdin on 18 Mar 2008 5:21 PM:

Might help if the _wsopen documentation mentioned that _O_WTEXT is UTF-16 with a BOM and _O_U16TEXT is UTF-16 without a BOM.

Conventional wisdom is not retarded...  just the documentation.  Conventional wisdom dictates that the documentation should be authoritative.  Your typical garden programmer would peruse the _setmode documentation and deduce that _O_WTEXT, _O_U16TEXT and _O_U8TEXT are definitely not permissible:

"The mode must be one of two manifest constants: _O_TEXT or _O_BINARY."

Can that documentation be fixed?  So we can all participate in Unicode Console joy without feeling dirty because we are using 'undocumented' flags?

Finally, this doesn't resolve the original writer's problem of the BOM.  You still have to do magic if you want to stick a BOM in a redirected file -- in this case, _O_WTEXT appears to behave identically to _O_U16TEXT and does not output a BOM when redirecting to a file.

# Mihai on 25 Mar 2008 7:21 PM:

Let's not forget about the cmd.exe switches:

  /A      Causes the output of internal commands to a pipe or file to be ANSI

  /U      Causes the output of internal commands to a pipe or file to be Unicode

These might also change things.

# Michael S. Kaplan on 25 Mar 2008 10:59 PM:

They change some things, though not all things (and this code overrides these settings).

# Wei Wang on 31 Aug 2010 9:48 PM:

I don't understand why the author thinks having those Japanese chars displayed as *boxes* in a console window is cool.

- In my console window, the font is set to Lucida truetype (which i think is capable of displaying unicode chars)

- In my machine's "Regional Settings" I have the 'language for non-unicode program' set to US English (which is just the default).

Is there a way to get true Japanese chars displayed in a console window under these settings? In other words, is the following statement under point 4. above true or false?

"CJK languages are supported only under CJK system locales"

# Michael S. Kaplan on 31 Aug 2010 9:51 PM:

Cool is relative.

It is "cooler" than question marks, because at the very least one can use file redirection to get the text, or one can copy/paste to get the text from the console. If you have question marks, then the data is permanently gone.

# Wei Wang on 31 Aug 2010 10:07 PM:

Wow, I didn't expect to get a reply in 3 mins.

Anyhow, guess the takeaway (for me at least) is that getting those empty squares displayed for asian characters is the best we can have under the current version of Windows. I do agree that being able to copy/paste those squares into notepad to get the correct asian chars is cool:)

# Michael S. Kaplan on 31 Aug 2010 10:50 PM:

Well, you can also use PowerShell ISE and see the text directly in this case, even when the system locale isn't CJK. :-)

One never knows when I'll be around looking at the blog....

# NB on 30 Nov 2010 5:25 AM:

Can you somehow use this to make a "pipe-utility-program" which converts the standard output of other console programs (outputting OEM/ANSI/"utf8") to O_U16TEXT?

Maybe there is something like that already?

# Alf P. Steinbach on 3 Nov 2011 12:28 PM:

Well it would be nice with UTF-8 on the outside (generally), and for Windows programs, UTF-16 for internal strings, since that's most efficient wrt. to the API, and also to avoid confusion about ANSI versus UTF-8 for narrow character internal strings.

But I can't make wscanf & family convert to UTF-16 on the inside.

It's like, it simply Does Not Work(TM).

Also, when I used the above idea for output at the C level, using C++ wcout crashed.

It seems that this support is a bit flaky... But any help would be appreciated!

Cheers,

- Alf



referenced by

2010/12/01 The IN door can go a different way than the OUT door

2010/10/07 Myth busting in the console

2010/09/23 A confluence of circumstances leaves a stone unturned...

2010/06/27 Bugs hidden in plain sight, and commented that way too ANSWERS

2010/06/18 Bugs hidden in plain sight, and commented that way too

2010/05/07 Cunningly conquering communicated console caveats. Comprende, mon Capitán?

2010/04/07 Anyone who says the console can't do Unicode isn't as smart as they think they are

2009/12/01 When changing behavior is like killing puppies

2009/08/14 Header files are not retarded, aka What the @#%&* is _O_WTEXT?

2008/03/19 The forensic typographers found no link to Lucida Console, and the D.A. had nothing to fallback to

2008/03/19 Before you say "What's next?" you have to figure out the action items
