File redirection corruption?
by Michael S. Kaplan, published on 2006/04/07 14:21 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/04/07/570980.aspx
A question I received in email:
In the FRA and ESN OSes, when I type some word on the command prompt with an acute-accented e like génération and redirect it to a file (eg: “echo génération > abcd.txt”) then the file contains a comma instead of the é. (The file has g,n,ration). But when I don’t redirect, I can see the character properly in the command prompt. I am also able to copy-paste to that file with the characters intact. Only the redirection is causing trouble.
Can you advise as to why this could happen? I would have thought that it has the wrong code page but the fact that I can see the characters properly on screen seems to preclude that guess. Any help would be very appreciated.
Of course, any time the difference is between a console application and a regular windows application, the first guess as to the problem is one of those OEMCP vs. ACP issues.
So, looking at some code pages (1252 and 437):
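On code page 437, 0x82 is U+00E9 (é -- LATIN SMALL LETTER E WITH ACUTE), while on code page 1252, 0x82 is U+201A (‚ -- SINGLE LOW-9 QUOTATION MARK).
So the output was never different at all -- but the way that the underlying byte was being interpreted was.... The console wrote 0x82 to the file using the OEM code page, and whatever opened the file afterward read that same byte using the ANSI code page.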
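For anyone who wants to see this without a French or Spanish machine handy, here is a minimal sketch in Python (not something from the original email thread). It assumes the console's OEM code page is 437 and the ANSI code page is 1252; on the FRA and ESN systems in the question the OEM code page is more likely 850, which also puts é at 0x82.

```python
# Sketch: one byte sequence, two interpretations.
# Assumption: OEMCP is 437 and ACP is 1252, as in the comparison above.
word = "génération"

# The bytes a cp437 console would write to the redirected file.
raw = word.encode("cp437")

as_oem = raw.decode("cp437")    # how the console (or "type") reads the file
as_ansi = raw.decode("cp1252")  # how an ANSI application reads the file

print(raw.hex(" "))
print(as_oem)
print(as_ansi)
```

The bytes never change; only the code page used to decode them does.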
This post brought to you by "‚" (U+201a, SINGLE LOW-9 QUOTATION MARK)
# Maurits [MSFT] on 7 Apr 2006 4:53 PM:
> I am also able to copy-paste to that file with the characters intact
So the copy/paste is switching code pages automatically? How does that work?
# Michael S. Kaplan on 7 Apr 2006 5:21 PM:
Hi Maurits --
Well, usually each application that is not smart enough to use Unicode (such as the console) is smart enough to properly pivot from the code page it is using TO Unicode (either converting and putting CF_UNICODETEXT on the clipboard or just putting up the code page and letting the clipboard map and convert through synthetic clipboard formats)....
# Maurits [MSFT] on 7 Apr 2006 6:17 PM:
I see... copying from the console gets you to Unicode (through WM_COPY, presumably) but output redirection is a naked string of bytes.
And for some reason (?) the console is using a different code page than Notepad.
So "type abcd.txt" shows the accents, and "notepad abcd.txt" shows the commas. (Verified)
# Maurits [MSFT] on 7 Apr 2006 6:27 PM:
# Gabe on 8 Apr 2006 12:43 AM:
I would just run "chcp 1252" so that the console code page was the same as the system code page.
# Srivatsn on 8 Apr 2006 3:12 AM:
Why is it that when I run chcp 1252 and paste é (U+00E9) from character map, it displays Θ which is E9 in cp 437? What is the conversion that happens here and on what basis?
# Michael S. Kaplan on 8 Apr 2006 7:09 PM:
Well, chcp affects the output code page -- but what you enter in the console is input, not output. So the OEMCP is used....
# Gabe on 9 Apr 2006 5:16 AM:
So if I type:
echo génération > abcd.txt
Notepad will show an eacute.
Now, by default the console uses the Terminal font, which has a theta at code point 0xE9. However, Lucida Console is a Unicode font and things show up as I expect them to.
I would just recommend that the original email user set his console font to Lucida Console and use chcp 1252, and he should get what he expects.
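The byte-level coincidence behind the Θ in this thread can be checked with a short Python sketch. This only illustrates the code page tables; it does not model what the console or the Terminal font actually does internally, and the variable names are purely illustrative.

```python
# Sketch: where é and Θ sit in code pages 1252 and 437.
e_acute = "\u00e9"

ansi_byte = e_acute.encode("cp1252")  # é as a single byte in code page 1252
oem_view = ansi_byte.decode("cp437")  # the same byte read as code page 437
oem_byte = e_acute.encode("cp437")    # é as a single byte in code page 437

print(ansi_byte.hex(), oem_view, oem_byte.hex())
```

So 0xE9 is é in 1252 but Θ in 437, while é in 437 lives at 0x82.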
# Michael S. Kaplan on 9 Apr 2006 9:50 AM:
Or try the /U option and have some of those other scenarios work, too.... :-)
# Gabe on 10 Apr 2006 12:52 AM:
When I write batch files that require "international" characters, I put "chcp 1252" at the beginning of them because I can't guarantee that they'll be run by a Unicode cmd.exe.
# Michael S. Kaplan on 10 Apr 2006 1:49 AM:
Um, "international" is at a minimum worthy of a "chcp 65001", isn't it?
I mean, with 1252 being such a far cry from "international" ? :-)
# Dean Harding on 10 Apr 2006 9:46 PM:
I think that's why he put "international" in quotes... at least it's "more" international than US-ASCII.
Anyway, for serious international stuff, I'd say switch to Monad or something (if possible anyway)... it's much more consistent, being .NET and all Unicode internally.
# Maurits [MSFT] on 11 Apr 2006 6:11 PM:
There doesn't seem to be a code page that will allow "type" to print a UTF-16 text file. chcp 1200 and chcp 1201 both return "Invalid code page." This could be worked around with some kind of utf16le_to_utf8.exe, which would read UTF-16LE and spit it out as UTF-8:
rem make a utf16le file
cmd /c /u echo génération > utf16le.txt
rem switch console to the UTF8 code page
chcp 65001
rem type the file back to the console with the utf16le_to_utf8 shim
type utf16le.txt | utf16le_to_utf8
# Maurits [MSFT] on 11 Apr 2006 6:15 PM:
Er, switch /c and /u in that cmd call:
cmd /u /c echo génération > utf16le.txt
# Maurits [MSFT] on 11 Apr 2006 8:07 PM:
Yup, that works.
Active code page: 437
C:\>cmd /u /c echo génération > utf16le.txt
(Opening utf16le.txt in Notepad and a hex editor confirms the UTF16-LE-ness of the file.)
Θ n Θ r a t i o n
Active code page: 65001
(type'ing a UTF16-LE-encoded file in a UTF8 code page doesn't work...)
C:\>type utf16le.txt | perl utf16le_to_utf8.pl
(... but piping it through a converter does.)
For the sake of completeness, here's the code for the converter:
use strict;
use Encode;
# slurp whole files to avoid spurious line break issues with 0d 00 0a 00 etc.
undef $/;
# read text
my $text = <>;
# convert text
Encode::from_to($text, 'UTF-16LE', 'UTF-8');
# output converted text
print $text;
(It's probably reasonably trivial to write a simple .exe to convert from UTF-16LE on wcin to UTF-8 on cout... that would obviate the need for Perl.)
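For readers without Perl, the same idea can be sketched in Python. This is an assumed equivalent, not the converter used above: the function name utf16le_to_utf8 is borrowed from the hypothetical .exe mentioned earlier, and the BOM stripping is an addition for files saved by editors that write one.

```python
import sys


def utf16le_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16LE bytes to UTF-8 bytes, dropping a leading BOM if present."""
    text = data.decode("utf-16-le")
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.encode("utf-8")


if __name__ == "__main__":
    # Read and write raw bytes so no newline translation touches 0d 00 0a 00.
    sys.stdout.buffer.write(utf16le_to_utf8(sys.stdin.buffer.read()))
```

Saved as utf16le_to_utf8.py (a hypothetical file name), it would slot into the same pipeline: type utf16le.txt | python utf16le_to_utf8.py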