Unicode in the console (including STDIN)?

by Michael S. Kaplan, published on 2011/11/07 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/11/07/10234592.aspx


Over in the Suggestion box, Alf P. Steinbach asked:

In an earlier topic you discussed how to do Unicode output to a console window at the C library level, by using _setmode.

It would be nice with a discussion also of input at the C level (which does not seem to work).

And then, how to do this at the C++ level, using C++ iostreams?

Most of the console samples I have given here use ReadConsoleW and WriteConsoleW because, as "Myth #8" in this blog points out, the CRT has had problems in some versions with STDIN, STDOUT, and STDERR.

The CRT is just a bug or two away from working here, with input being the final frontier.

At this point, assuming you want code that works in any circumstance (in any version), then I highly recommend you move to ReadConsoleW.
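For readers who want to see what that recommendation might look like, here is a minimal sketch of reading one line with ReadConsoleW. The helper names (strip_line_ending, read_console_line) and the 1024-character buffer are made up for illustration, and a line longer than the buffer would simply be left for the next read. The Windows part is guarded with _WIN32, so the small string helper also compiles elsewhere.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: drops the trailing "\r\n" that a line-mode console
// read leaves at the end of the text.
std::wstring strip_line_ending(std::wstring s) {
    while (!s.empty() && (s.back() == L'\n' || s.back() == L'\r')) {
        s.pop_back();
    }
    return s;
}

#ifdef _WIN32
// Hypothetical helper: reads one line of UTF-16 text straight from the
// console, bypassing the CRT. Returns false when STDIN is not a console
// (for example, when it is redirected) or when the read fails.
bool read_console_line(std::wstring& out) {
    HANDLE h = GetStdHandle(STD_INPUT_HANDLE);
    DWORD mode = 0;
    if (h == NULL || h == INVALID_HANDLE_VALUE || !GetConsoleMode(h, &mode)) {
        return false;
    }
    wchar_t buf[1024];  // arbitrary size chosen for this sketch
    DWORD chars_read = 0;
    if (!ReadConsoleW(h, buf, 1024, &chars_read, NULL)) {
        return false;
    }
    out = strip_line_ending(std::wstring(buf, chars_read));
    return true;
}
#endif
```

A caller would check the return value of read_console_line and fall back to some other input path when it is false, since ReadConsoleW only works on real console handles.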

As for C++, I don't trust it in this case.

There are just too many times that streams will "helpfully" convert text to some code page because it thinks it would be best to do.

Truth be told, it's why I don't trust the CRT here, either.

Better to use the method that's been working since NT 3.1 than to use something that has had a variety of problems since then that are only now being totally eradicated....

Links from other 'myths' in this blog will get you the "if something is redirected" code, which is the principal nominal benefit of using things like the CRT. And I love the work of colleagues like Philip Lucido (the former CRT owner, who used to be incredibly helpful to me!), but this one area simply has passed the mark of trust, for me....
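As a rough illustration of what "if something is redirected" handling can look like (a sketch, not necessarily the code those links lead to), one common approach is to treat a handle as a console only when GetFileType reports a character device and GetConsoleMode also succeeds, since the NUL device is a character device but not a console. The function names here are hypothetical, and writing UTF-8 bytes to a redirected STDOUT is an assumption of this sketch rather than anything the post prescribes.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical decision rule, kept separate so it can be checked without
// a console: both conditions must hold for the handle to be a console.
bool is_console_handle(bool is_char_device, bool console_mode_ok) {
    return is_char_device && console_mode_ok;
}

#ifdef _WIN32
// Hypothetical helper: true when STDOUT is an actual console window.
bool stdout_is_console() {
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    if (h == NULL || h == INVALID_HANDLE_VALUE) {
        return false;
    }
    DWORD mode = 0;
    bool is_char = (GetFileType(h) & ~FILE_TYPE_REMOTE) == FILE_TYPE_CHAR;
    return is_console_handle(is_char, GetConsoleMode(h, &mode) != 0);
}

// Hypothetical helper: WriteConsoleW for a console, raw bytes otherwise.
bool write_text(const std::wstring& text) {
    if (text.empty()) {
        return true;
    }
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;
    if (stdout_is_console()) {
        return WriteConsoleW(h, text.data(), (DWORD)text.size(),
                             &written, NULL) != 0;
    }
    // Redirected to a file or pipe. Assumption of this sketch: emit UTF-8.
    int n = WideCharToMultiByte(CP_UTF8, 0, text.data(), (int)text.size(),
                                NULL, 0, NULL, NULL);
    if (n <= 0) {
        return false;
    }
    std::string bytes((size_t)n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, text.data(), (int)text.size(),
                        &bytes[0], n, NULL, NULL);
    return WriteFile(h, bytes.data(), (DWORD)n, &written, NULL) != 0;
}
#endif
```

The same console-or-not check applies to STDIN and STDERR by swapping the standard handle constant.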


Simon Buchan on 7 Nov 2011 2:19 PM:

And to think that all user-side complexity/brokenness could have been avoided if Microsoft had a time machine when they created NT, so they could use UTF-8 for everything. :(

Michael S. Kaplan on 7 Nov 2011 3:52 PM:

Our Flux Capacitor is in beta during FY12 Q1, so maybe we can take care of that? :-)

Yuhong Bao on 7 Nov 2011 6:03 PM:

Well, UTF-8 was invented in year 1992.

Simon Buchan on 7 Nov 2011 6:59 PM:

@Yuhong: They're much closer than I thought, actually: according to Wikipedia, UTF-8 started development in early '92 and was publicly announced in January '93, while NT 3.1 was in development from '88 and released in July '93 - which just makes it more disappointing that we were perhaps a couple of years off significantly less complex text handling in C on Windows :(.

Michael S. Kaplan on 7 Nov 2011 7:44 PM:

NT wasn't going to take further delays to either gut existing "W" functions, or add a third set of functions for UTF-8 (I shudder at trying to convince DaveC of either choice there!). This new encoding was not an official part of the Unicode standard then, and we were implementing Unicode -- not potentially interesting RFCs....

Yuhong Bao on 8 Nov 2011 7:13 PM:

What about allowing UTF-8 to be the ACP/OEMCP?

Michael S. Kaplan on 8 Nov 2011 10:27 PM:

Asked and answered many times. By you, even....

Yuhong Bao on 9 Nov 2011 10:06 AM:

I mean back in year 1993.

Michael S. Kaplan on 9 Nov 2011 10:24 AM:

Again, asked and answered -- we were Unicode focused, not RFC focused....

