'for' handling Unicode? That would be running against 'type'

by Michael S. Kaplan, published on 2006/09/27 05:04 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/09/27/773448.aspx

Tony's question was straightforward enough:

I have a CMD file that attempts to read and parse the contents of a text file but fails due to the file being Unicode. Is there any way to get For to process the file as Unicode or do I need to copy the file to ANSI?

I’ve attached Unicode and Ansi text files and if you execute the following “For” lines you’ll see that the Unicode version does not output any results.

C:\>For /F %a In (readmeBuildAnsi.txt) Do echo %a

C:\>echo Update

Update

C:\>For /F %a In (readmeBuildUnicode.txt) Do echo %a

C:\>

What Tony has run across is one of those pieces of the console that is unapologetically non-Unicode, even when you run CMD with the /U flag that "Causes the output of internal commands to a pipe or file to be Unicode".

It is hardly alone here, as there are many pieces of the console that do not support Unicode.

Though Stephen Malcolm suggested a good workaround for this particular issue:

Did you try?

for /f %a in ('type readmebuildunicode.txt') do @echo %a

It appears that the for command doesn’t recognize Unicode files natively but type does.

This is quite true -- the type command actually goes through special effort to support Unicode parsing of the files it processes, and it can be used to do the heavy lifting in many of these cases.
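
To make concrete what type has to recognize, here is a small Python sketch of the two files at the byte level. The file contents are an assumption on my part (the single word from the transcript above, with a byte order mark on the Unicode version), and the sketch does not establish exactly where for gives up, only that the bytes are not what a code page based reader expects.

```python
import codecs

# Hypothetical stand-ins for readmeBuildAnsi.txt and readmeBuildUnicode.txt.
# Assumption: the Unicode file is UTF-16LE with a leading byte order mark.
ansi_file_bytes = "Update\r\n".encode("cp1252")
unicode_file_bytes = codecs.BOM_UTF16_LE + "Update\r\n".encode("utf-16-le")

print(ansi_file_bytes)     # b'Update\r\n'
print(unicode_file_bytes)  # b'\xff\xfeU\x00p\x00d\x00a\x00t\x00e\x00\r\x00\n\x00'

# For ASCII-range text, every character carries one NUL byte in UTF-16LE.
nul_count = unicode_file_bytes.count(0)
print(nul_count)           # 8
```

A reader that treats those bytes as code page text sees two bytes of BOM and then a NUL after every letter, which is a long way from the word Update.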

Stephen also pointed out another example of this kind of use of the type command, this time using the non-Unicode findstr.exe:

Another pain with Unicode files is trying to search them with findstr. This command doesn’t understand Unicode files but you can still use it by doing the following:

type <file>|findstr <search_string>

This works for the same reason, because type does recognize Unicode files.

The key here (and the reason that this is such an effective workaround, generically) is that type puts the text into one of the standard handles, and once it is there any tool or command has better luck processing it, because the text is converted out of Unicode and into the console's code page, which all of the non-Unicode command line tools and console commands can handle.
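
The conversion described above can be modeled in a few lines of Python. This is only a sketch of the idea, not the actual implementation of type: the function name type_like is made up, the BOM check is my assumption about how a Unicode file might be detected, and cp437 is just a stand-in for whatever the console's code page happens to be.

```python
import codecs

def type_like(raw: bytes, console_code_page: str = "cp437") -> bytes:
    """Hypothetical model: file bytes in, console code page bytes out."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        # Assumed detection rule: a UTF-16LE byte order mark means Unicode.
        text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
        # Re-encode in the code page that the downstream tool understands.
        return text.encode(console_code_page, errors="replace")
    # No BOM: assume the bytes are already code page text and pass them on.
    return raw

raw = codecs.BOM_UTF16_LE + "Update caf\u00e9\r\n".encode("utf-16-le")
piped = type_like(raw)

print(piped)                  # b'Update caf\x82\r\n'
print(b"\x00" in piped)       # False
```

What arrives on the other side of the pipe is one byte per character with no BOM and no NULs, which is exactly the kind of input that for and findstr were written to handle.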

Of course the downside is that these non-Unicode tools will not be able to do meaningful processing on Unicode data outside of that code page; the workaround is simply making it easier for the non-Unicode tools to work as they always would (I would actually love a Unicode findstr!).

But I guess we'll have to wait for Monad....er, PowerShell, for a better Unicode story here in the bulk of the console's processing....

This post brought to you by U (U+0055, a.k.a. LATIN CAPITAL LETTER U)

Adam on 27 Sep 2006 5:58 AM:

What do you mean by "won't be able to do meaningful processing"? If the unicode file contains characters that are not available in the console's code page, what happens? Are all characters converted to some kind of marker ("?"?); is conversion stopped and an error generated? Something else?

Michael S. Kaplan on 27 Sep 2006 6:30 AM:

They get question marks. Conversion continues, but data is clearly lost....
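
A rough Python illustration of that loss, assuming cp437 as the console code page and using Python's "replace" error handler as a stand-in for the real conversion (which may also apply best-fit mappings that this sketch does not model):

```python
# Latin text that cp437 can represent, plus two Cyrillic letters it cannot.
text = "Update \u0414\u0430 ok"

converted = text.encode("cp437", errors="replace")
print(converted)                   # b'Update ?? ok'

# Decoding again does not bring the original characters back.
round_tripped = converted.decode("cp437")
print(round_tripped == text)       # False
```

Each character outside the code page becomes a single question mark, the rest of the line survives, and nothing downstream can tell which characters were there originally.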
