Before jumping into the stream, you might want to peek at it

by Michael S. Kaplan, published on 2007/04/16 10:59 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/04/16/2154397.aspx


Chris asked:

I am using the following constructor for StreamReader:

StreamReader (String, Boolean)    Initializes a new instance of the StreamReader class for the specified file name, with the specified byte order mark detection option.

However, when I look at the “.CurrentEncoding” property of the StreamReader class; it always appears to be UTF8 no matter what the encoding is of the file that was opened.  How can I get the encoding of the file that was opened?

--Chris

Before anyone had a chance to respond (like maybe 10 minutes later!), he answered his own question though:

Never mind.  By executing the .Peek() method the .CurrentEncoding property gets set appropriately.

I thought about it later and looked at the docs, which had this to say about that bool parameter:

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.

A part of me is hoping that there is a small doc. bug here and that, given the existence of the UTF32Encoding class, it is the first four bytes that are being looked at, and not just the first three. :-)
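To see why three bytes cannot be the whole story, here is a small sketch that asks each Unicode encoding class for its preamble via Encoding.GetPreamble. The class name PreambleLengths is made up for illustration; the encodings are constructed with their byte order mark option explicitly turned on, which is an assumption of the sketch rather than anything the docs quoted above require.

```csharp
using System;
using System.Text;

static class PreambleLengths
{
    // Returns the number of bytes in the byte order mark an encoding would emit.
    public static int BomLength(Encoding encoding)
    {
        return encoding.GetPreamble().Length;
    }

    static void Main()
    {
        Encoding[] encodings =
        {
            new UTF8Encoding(true),            // UTF-8 with BOM
            new UnicodeEncoding(false, true),  // UTF-16 little-endian with BOM
            new UnicodeEncoding(true, true),   // UTF-16 big-endian with BOM
            new UTF32Encoding(false, true),    // UTF-32 little-endian with BOM
            new UTF32Encoding(true, true),     // UTF-32 big-endian with BOM
        };

        foreach (Encoding encoding in encodings)
        {
            byte[] bom = encoding.GetPreamble();
            // Print the encoding name, the BOM length, and the BOM bytes in hex.
            Console.WriteLine("{0}: {1} byte(s), {2}",
                encoding.WebName, bom.Length, BitConverter.ToString(bom));
        }
    }
}
```

The UTF-8 preamble is three bytes and the UTF-16 ones are two, but both UTF-32 preambles are four bytes long, so a detector that truly stopped at the third byte could not tell UTF-32 apart.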

But then I thought about the fact that StreamReader.CurrentEncoding does not reflect even those first few bytes until after one starts reading data from the stream. The only case I could think of where this matters is code that depends on the value to decide what to do, and which therefore might look at CurrentEncoding before reading anything. In which case one's code could make the wrong decision, right?
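A minimal sketch of that pitfall, assuming a temporary file written as UTF-16 with a byte order mark. The names PeekFirst and Probe are hypothetical helpers for this illustration, not part of the framework.

```csharp
using System;
using System.IO;
using System.Text;

static class PeekFirst
{
    // Returns the encoding name reported before and after the first Peek().
    public static string[] Probe(string path)
    {
        // true = detectEncodingFromByteOrderMarks
        using (var reader = new StreamReader(path, true))
        {
            // Nothing has been read yet, so this is still the constructor default.
            string before = reader.CurrentEncoding.WebName;

            // Peek() fills the internal buffer, which is when the BOM gets examined.
            reader.Peek();
            string after = reader.CurrentEncoding.WebName;

            return new[] { before, after };
        }
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Encoding.Unicode is UTF-16 little-endian and emits a BOM when writing a new file.
            File.WriteAllText(path, "hello", Encoding.Unicode);

            string[] names = Probe(path);
            Console.WriteLine("Before Peek: " + names[0]);
            Console.WriteLine("After Peek:  " + names[1]);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

If the behavior Chris ran into holds, the first line should name the UTF-8 default and only the second should name the encoding the BOM actually indicates. Code that branches on CurrentEncoding would want to do so after the Peek() call, not before it.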

The moral of the story? Be sure to take a quick peek before you make any big decisions with the StreamReader!

 

This post brought to you by U+feff, a.k.a. ZERO WIDTH NO-BREAK SPACE, a.k.a. the BYTE ORDER MARK


# Paul Dempsey on 16 Apr 2007 6:26 PM:

I was just working in a similar area. I DID come across a mention in the documentation that you must read from the stream in order for detection to happen -- I just can't find it now to point you to the evidence that it's really in there somewhere.

The .NET docs are quite insufficient/infuriating in explaining in detail how and when detection happens. The docs for the constructors that turn on detection would be appropriate places to add this important bit of information.

# Paul Dempsey on 16 Apr 2007 6:34 PM:

Ah, there it is, in the doc for StreamReader.CurrentEncoding:

"The current character encoding used by the current reader. The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method. "

# Dean Harding on 16 Apr 2007 8:13 PM:

Hehe, that's gotta be one of the best puns you've come up with :-)

# Washu on 17 Apr 2007 2:07 PM:

It would appear that it does check for UTF32Encoding, as seen below (pulled from the DetectEncoding() private method in StreamReader)

    else if ((((this.byteLen >= 4) && (this.byteBuffer[0] == 0)) && ((this.byteBuffer[1] == 0) && (this.byteBuffer[2] == 0xfe))) && (this.byteBuffer[3] == 0xff))
    {
        this.encoding = new UTF32Encoding(true, true);
        flag = true;
    }

# Michael S. Kaplan on 17 Apr 2007 5:15 PM:

Yep, I assumed this was just a doc. issue.... :-)

