Encoding mysteries

by Michael S. Kaplan, published on 2006/01/03 05:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/03/508589.aspx

Jennifer Reyes asked:

Michael -

We have a system with terabytes of historical files, about 20% of which we are unable to view due to nonprintable ascii characters.  These files seem to be reports sent from an IBM mainframe.

Do you have any suggestions how we could begin to determine what code set is being used and how to translate it into something viewable?

Thanking you in advance

Wow, 20% is a pretty large share of the files!

IE's AutoDetect, which of course uses MLang's AutoDetect, probably won't do well here since it is geared toward internet data and not legacy mainframe data.

Now it is hard to comment without even knowing what bytes these are, but you could try two different approaches to this problem if you have no information about the original source:

But it is really hard to guess what the data might be without that kind of additional information.

It could even be that the non-printable ASCII is actually using the various control codes that used to be a part of some communication protocols with mainframes and devices. Looking at what the bytes are would be the best way to determine if that is the case....
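
As a rough illustration of "looking at what the bytes are", here is a minimal Python sketch that tallies byte values and reports how much of the data falls outside printable ASCII. The function name `summarize_bytes` and the sample data are made up for this example, and the sample assumes EBCDIC code page 037, which may or may not be what the real files use.

```python
from collections import Counter

def summarize_bytes(data: bytes, top: int = 5):
    """Return the most common byte values and the share outside printable ASCII."""
    counts = Counter(data)
    # Printable ASCII plus tab, line feed and carriage return.
    printable = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    nonprintable = sum(n for b, n in counts.items() if b not in printable)
    share = nonprintable / len(data) if data else 0.0
    return counts.most_common(top), share

# Hypothetical sample: "HELLO WORLD" encoded with EBCDIC code page 037.
sample = "HELLO WORLD".encode("cp037")
common, share = summarize_bytes(sample)
for value, n in common:
    print(f"0x{value:02X}: {n}")
print(f"non-printable share: {share:.0%}")
```

For a real file, something like `open(path, "rb").read(1 << 20)` would give a first megabyte to inspect. A histogram dominated by 0x40 and values above 0x80 points in a different direction than one dominated by values below 0x20, which would be more consistent with control codes.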


This post brought to you by U+0016, known in some circles as the SYNCHRONOUS IDLE

# G on 3 Jan 2006 7:32 AM:

That's a large amount of data, but I've found the Unix file[1] command to be pretty smart at identifying file encodings.

It's part of cygwin.

[1] ftp://ftp.astron.com/pub/file

# Peter Ibbotson on 3 Jan 2006 8:09 AM:

Well, the obvious guess is that these are EBCDIC files. However, on an IBM mainframe (historically) it could be almost anything, depending on the slug train put in the printer. (APL springs to mind.)
http://en.wikipedia.org/wiki/EBCDIC has a link to an online converter that could be tried.
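
One way to test the EBCDIC guess locally is sketched below in Python. It is a crude heuristic, not a real detector: the helper name `looks_like_ebcdic` and the 0.8 threshold are arbitrary choices for this example, and code page 037 is only an assumption (Python also ships `cp500` and other EBCDIC codecs). It relies on the fact that EBCDIC encodes the space as 0x40 and places letters and digits at 0x81 and above.

```python
def looks_like_ebcdic(data: bytes, threshold: float = 0.8) -> bool:
    """Heuristic: EBCDIC text is mostly 0x40 (space) or bytes >= 0x81."""
    if not data:
        return False
    hits = sum(1 for b in data if b == 0x40 or b >= 0x81)
    return hits / len(data) >= threshold

# Hypothetical sample: the bytes for "REPORT 123" in code page 037.
sample = bytes.fromhex("D9C5D7D6D9E340F1F2F3")
if looks_like_ebcdic(sample):
    # cp037 is an assumed code page; the real files may use another one.
    print(sample.decode("cp037"))
```

Text in other 8-bit encodings with many high bytes could also pass this check, so a positive result is only a hint to look more closely at the decoded output.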

# Nicholas Allen on 3 Jan 2006 8:18 PM:

It looks like this post's featured character has killed the feeds for your blog.

# Michael S. Kaplan on 3 Jan 2006 9:00 PM:

Grrrr... Community Server bugs!

Fixed now....

# Mihai on 8 Jan 2006 4:28 AM:

Here is a trick to easily try various code pages:
- in MS Word "Tools" -> "Options"
- in "General" tab check "Confirm conversion at Open"
- "File" -> "Open" and select the desired file
- when asked, select "Encoded Text"
- in the "File Conversion" dialog check "Other encoding"
- select each encoding in the list and see the preview windows
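
With terabytes of files, the same trial-and-preview idea can be scripted. The Python sketch below decodes the start of a byte string with several candidate code pages and returns a preview for each. The candidate list and the function name `preview_encodings` are assumptions for this example, not a recommendation of which code pages the files actually use.

```python
# Candidate codecs to try; this list is an assumption, adjust as needed.
CANDIDATES = ["ascii", "cp037", "cp500", "cp1252", "cp437"]

def preview_encodings(data: bytes, encodings=CANDIDATES, width: int = 60):
    """Decode data with each codec; map codec name to a preview or None."""
    previews = {}
    for name in encodings:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            previews[name] = None  # these bytes are not valid in this codec
            continue
        # Replace non-printable characters so the preview stays readable.
        previews[name] = "".join(
            ch if ch.isprintable() else "." for ch in text[:width]
        )
    return previews

# Hypothetical sample: "TOTAL: 42" encoded with EBCDIC code page 500.
sample = "TOTAL: 42".encode("cp500")
for name, text in preview_encodings(sample).items():
    print(f"{name:8} {text!r}")
```

As in the Word dialog, a person still has to judge which preview looks like a real report; several code pages will often decode without error and produce different text.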
