by Michael S. Kaplan, published on 2006/01/03 05:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/03/508589.aspx
Jennifer Reyes asked:
Michael -
We have a system with terabytes of historical files, about 20% of which we are unable to view due to non-printable ASCII characters. These files seem to be reports sent from an IBM mainframe.
Do you have any suggestions how we could begin to determine what code set is being used and how to translate it into something viewable?
Thanking you in advance
Wow, 20% of several terabytes is a pretty large amount of text!
IE's AutoDetect, which of course uses MLang's AutoDetect, probably won't do well here, since it is geared toward Internet data rather than legacy mainframe data.
Now it is hard to comment without even knowing what bytes these are, but you could try two different approaches to this problem if you have no information about the original source:
But it is really hard to guess what the data might be without that kind of additional information.
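As a rough illustration of guessing without that information, here is a minimal Python sketch that decodes a sample of bytes under a handful of candidate code pages and scores how much of the result looks like plain report text. The candidate list is only an assumption (IBM mainframes commonly use EBCDIC code pages such as 037, 500, or 1140, but nothing in the question confirms that), the function names are made up for this example, and the scoring assumes the reports are mostly basic Latin letters, digits, and punctuation.

```python
import string

# Candidate encodings to try. This list is an assumption for illustration:
# cp037, cp500 and cp1140 are EBCDIC code pages that Python ships codecs for.
CANDIDATES = ["ascii", "cp1252", "cp037", "cp500", "cp1140"]

# Characters expected in a plain English-language report (an assumption).
PLAUSIBLE = set(string.printable)


def plausible_ratio(data: bytes, encoding: str) -> float:
    """Fraction of decoded characters that look like ordinary report text."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return 0.0  # the bytes are not even valid in this encoding
    if not text:
        return 0.0
    hits = sum(1 for ch in text if ch in PLAUSIBLE)
    return hits / len(text)


def rank_encodings(data: bytes, candidates=CANDIDATES):
    """Return (score, encoding) pairs, best score first."""
    scores = [(plausible_ratio(data, enc), enc) for enc in candidates]
    # Sort on the score only; ties keep the order of the candidate list.
    return sorted(scores, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample: a short header encoded as EBCDIC code page 037.
    sample = "MONTHLY REPORT 2005".encode("cp037")
    for score, enc in rank_encodings(sample):
        print(f"{enc:8s} {score:.2f}")
```

Several EBCDIC code pages will often tie, because they differ in only a few punctuation characters. A score like this narrows the field, and a person still has to look at the decoded text to pick the right one.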
It could even be that the non-printable ASCII is actually the various control codes that used to be part of some communication protocols between mainframes and devices. Looking at what the bytes are would be the best way to determine whether that is the case.
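To look at what the bytes actually are, a small tally of the control codes in a sample can help. The sketch below is only an illustration: the function name and the sample bytes are made up, and the name table assumes the data uses ASCII control code values, which would not hold if the files turn out to be EBCDIC.

```python
from collections import Counter

# Names of some ASCII C0 control codes that were used by link protocols.
# The values are the ASCII ones; EBCDIC assigns different byte values.
CONTROL_NAMES = {
    0x01: "SOH", 0x02: "STX", 0x03: "ETX", 0x04: "EOT", 0x05: "ENQ",
    0x06: "ACK", 0x10: "DLE", 0x15: "NAK", 0x16: "SYN", 0x17: "ETB",
}

# Tab, line feed and carriage return are normal in text files, so skip them.
ORDINARY = {0x09, 0x0A, 0x0D}


def control_code_report(data: bytes):
    """List (hex value, name, count) for control bytes, most frequent first."""
    counts = Counter(data)
    report = []
    for value, n in counts.most_common():
        is_control = value < 0x20 or value == 0x7F
        if is_control and value not in ORDINARY:
            name = CONTROL_NAMES.get(value, "?")
            report.append((f"0x{value:02X}", name, n))
    return report


if __name__ == "__main__":
    # Hypothetical sample: two SYN bytes, then a block framed by STX and ETX.
    sample = b"\x16\x16\x02HELLO\x03\r\n"
    for hex_value, name, n in control_code_report(sample):
        print(hex_value, name, n)
```

If a few control values dominate and appear in regular positions (around each record, say), that points toward protocol framing rather than a different character set.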
This post brought to you by U+0016, known in some circles as SYNCHRONOUS IDLE.
# G on 3 Jan 2006 7:32 AM:
# Peter Ibbotson on 3 Jan 2006 8:09 AM:
# Nicholas Allen on 3 Jan 2006 8:18 PM:
# Michael S. Kaplan on 3 Jan 2006 9:00 PM:
# Mihai on 8 Jan 2006 4:28 AM: