No byte order marks when using encodings in StreamWriters?

by Michael S. Kaplan, published on 2006/11/20 05:40 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/11/20/1108172.aspx


Sometimes the convenient shortcuts blind us to the functionality we actually need, forcing us to work a bit harder than we have to. Like when David asked not too long ago:

There are a handful of Encodings that have a preamble (byte order marks that get written to the file to identify the encoding).

Suppose I want to create a StreamWriter using a utf-8 encoding (or some other encoding that has a preamble) but do not want the byte order marks written to the stream. What is the easiest way to do it?

Can anyone suggest something better than defining the following and using new NoPreambleEncoding(Encoding.UTF8) as the encoding for the StreamWriter?

        class NoPreambleEncoding : Encoding {
            Encoding _e;

            public NoPreambleEncoding(Encoding e) : base(e.CodePage) {
                _e = e;
            }
            public override byte[] GetPreamble() {
                return new byte[] { };
            } 
            public override int GetByteCount(char[] chars, int index, int count) {
                return _e.GetByteCount(chars, index, count);
            }
            public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex) {
                return _e.GetBytes(chars, charIndex, charCount, bytes, byteIndex);
            }
            public override int GetCharCount(byte[] bytes, int index, int count) {
                return _e.GetCharCount(bytes, index, count);
            }
            public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex) {
                return _e.GetChars(bytes, byteIndex, byteCount, chars, charIndex);
            }
            public override int GetMaxByteCount(int charCount) {
                return _e.GetMaxByteCount(charCount);
            }
            public override int GetMaxCharCount(int byteCount) {
                return _e.GetMaxCharCount(byteCount);
            }
        } 

Now of course the above could be made to work with a bit more effort, but it is way more complicated than the actual answer, which is to skip the static Encoding.UTF8 property and instead use the built-in UTF8Encoding class, which has a specific UTF8Encoding(bool encoderShouldEmitUTF8Identifier) constructor overload. Passing false for that parameter gives you an encoding whose preamble is empty. :-)
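To make the difference concrete, here is a minimal sketch that writes the same string through a StreamWriter twice, once with the Encoding.UTF8 shortcut and once with a UTF8Encoding constructed to omit the identifier. The class name NoBomDemo and the helper Write are made up for illustration; they are not part of the framework or of David's code, and a MemoryStream stands in for whatever stream you are really writing to.

```csharp
using System;
using System.IO;
using System.Text;

static class NoBomDemo   // hypothetical name, for illustration only
{
    // Writes text through a StreamWriter and returns the raw bytes produced.
    public static byte[] Write(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            }   // disposing the writer flushes it and closes the stream

            // MemoryStream.ToArray still works after the stream is closed.
            return stream.ToArray();
        }
    }

    static void Main()
    {
        // The shortcut property: its preamble is the UTF-8 BOM.
        byte[] withBom = Write("hi", Encoding.UTF8);

        // The constructor overload: false means no preamble is emitted.
        byte[] withoutBom = Write("hi", new UTF8Encoding(false));

        Console.WriteLine(BitConverter.ToString(withBom));    // EF-BB-BF-68-69
        Console.WriteLine(BitConverter.ToString(withoutBom)); // 68-69
    }
}
```

The only line that changes between the two cases is the encoding passed to the StreamWriter constructor, so no wrapper class is needed just to suppress the preamble.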

I often find myself trash talking these shortcut properties which make some things easier, at the cost of making it appear like certain other things are impossible. This is probably unfair to them since this approach only speaks to the people I see who have trouble due to specialized requirements; it says nothing to those whose problems are solved with the shortcuts....

(Perhaps others would have been happier if the default had been to not include the BOM, but that particular ship has long since sailed of course!)

 

This post brought to you by  "" (U+feff, a.k.a. ZERO WIDTH NO-BREAK SPACE, of course)


# b6s on 20 Nov 2006 8:12 AM:

To my knowledge, the static property UTF8Encoding.UTF8 was without BOM in .NET 1.1; however, it is inherited from Encoding.UTF8 now in .NET 2.0 and certainly with BOM. This difference makes me have to revise my several file I/O codes... :(

# Michael S. Kaplan on 20 Nov 2006 9:48 AM:

UTF8Encoding.UTF8 and Encoding.UTF8 are kind of the same thing, so if one changed they both probably would. Hmmm.... that's not very backward compatible, if it's true. That should not have changed....

# Dean Harding on 20 Nov 2006 5:40 PM:

I like shortcuts... after all, if programming languages didn't ever have shortcuts, we'd be stuck with "if" and "goto" and nothing else :-)


