by Michael S. Kaplan, published on 2008/09/16 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/09/16/8953434.aspx
Over in the Suggestion Box, Gregory asked:
Hi Michael,
I stumbled across a piece of code that implements a .NET HttpModule to remove whitespace and other junk from pages as you hit them (as can be seen here).
Speaking purely about encoding issues in the code, are there any? The particular lines I am worried about are:
class PageCleanFilter : Stream
{
    public override void Write(byte[] buffer, int offset, int count)
    {
        // Copy the chunk out of the caller's buffer...
        byte[] data = new byte[count];
        Buffer.BlockCopy(buffer, offset, data, 0, count);

        // ...and decode it using the default system code page.
        string html = Encoding.Default.GetString(data);
        ...
    }
}
This code makes me uncomfortable. Specifically:
Is it okay to assume that the response has the encoding specified by Encoding.Default?
This code assumes (I think) that the byte buffer range it works with has all the character data - what if a two-byte character happens to be split up by the calling method? I can't imagine this could be a good thing...
Thanks in advance
Well, Gregory has good instincts, the kind that I like to think SiaO helps to teach people about. :-)
This code basically takes a chunk of text and assumes it is in the default system code page.
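To see what that assumption costs, consider a minimal sketch (my own example, not code from the module); I use Encoding.GetEncoding(1252) in place of Encoding.Default so it behaves the same on any machine, since Encoding.Default is whatever the server's ANSI code page happens to be:

using System;
using System.Text;

class DefaultCodePageDemo
{
    static void Main()
    {
        // The page is actually sent as UTF-8 ("café" is five bytes: 63 61 66 C3 A9)...
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("café");

        // ...but the filter decodes with the system code page. On a U.S. English
        // server that is Windows-1252, and the two-byte é comes out as mojibake:
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(utf8Bytes));   // cafÃ©

        // Decoding with the encoding the bytes are really in round-trips cleanly:
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));                // café
    }
}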
If you follow that link, you'll see that between the time I got the message and now, the author took some feedback (from Gregory!) via a comment he left there.
The code now gets rid of an extraneous copy operation.
But it still has the same bad code page assumption, which could easily break depending on the server's settings, and really the whole operation should be going through UTF-8 here anyway.
On the one hand it does not matter much since the developer, and the site, are going to be in a single code page.
But on the other hand, this is advice for a useful bit of code you can download, with descriptive information such as:
This article details a HttpModule that removes white space, certain javascript comments, as well as optimising ASPX post-back javascript. This is useful when trying to save on the bandwidth your blog is using, or just plain and simply trying to decrease load time of your pages. I've also implemented a custom configuration section to allow the consumer to enable only the functionality required - you can look here for information on creating your own custom sections.
And the sample is being posted on a blog on the World Wide Web, which means anyone looking for a code sample might find it, and want to use it.
Plus, with methods like RemoveWhiteSpace and RemoveLineBreaks, it is worth considering how incomplete their work might be without all of Unicode to work with (not to mention the additional string copies each of these methods makes, but let's stay focused on the international stuff!).
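To make that concrete, here is a minimal sketch (again my own example, with a hypothetical U+0020-only filter standing in for the module's actual logic) of what a naive notion of "white space" misses:

using System;
using System.Linq;

class WhiteSpaceDemo
{
    static void Main()
    {
        // U+0020 is not the only whitespace: U+00A0 NO-BREAK SPACE and
        // U+3000 IDEOGRAPHIC SPACE are whitespace too.
        string s = "a b\u00A0c\u3000d";

        // A filter that only strips U+0020 leaves the other two behind:
        Console.WriteLine(s.Replace(" ", string.Empty));

        // A Unicode-aware test (char.IsWhiteSpace) catches them all:
        Console.WriteLine(new string(s.Where(ch => !char.IsWhiteSpace(ch)).ToArray()));
    }
}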
And then, when all is said and done, the current code is still converting text in and out of the default system code page when it does not have to. At the top of the Write method:
string html = Encoding.Default.GetString(buffer, offset, count);
and then at the bottom of the method:
byte[] outdata = Encoding.Default.GetBytes(html);
_sink.Write(outdata, 0, outdata.GetLength(0));
instead of (like I said) going through UTF-8 here -- which will only lose invalid data, rather than potentially losing any character that does not fit in the (comparatively small) code page.
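A minimal sketch of that change (RemoveWhiteSpace here stands in for whatever clean-up the module does, and _sink is the wrapped stream from the quoted code):

public override void Write(byte[] buffer, int offset, int count)
{
    // UTF-8 can represent every character the page might contain,
    // so decoding and re-encoding through it loses nothing valid:
    string html = Encoding.UTF8.GetString(buffer, offset, count);
    html = RemoveWhiteSpace(html);

    byte[] outdata = Encoding.UTF8.GetBytes(html);
    _sink.Write(outdata, 0, outdata.Length);
}

Better still would be to key off the encoding the response is actually being written with (HttpResponse.ContentEncoding), so the filter and the response can never disagree.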
But (taking a step back) where is the flaw?
Is it truly in the people who misuse the tools, or is it in those who design tools that so easily suggest usages that are not ideal?
Or as I put it in the title:
Is the flaw in the constructs, or in the one who constructs?
Which then leaves me with an odd sentence, where identical words are being used and the only difference is in the pronunciation, you know:
kən-strŭkt' vs. kŏn'strŭkt'
I like reminders like this for when people act skeptical when I mention that Han ideographs have multiple pronunciations, since the same thing clearly happens in English too. It helps keep people humble.
But to get back to the encoding question.
Gregory is right; the code is wrong on a few levels, though primarily the concerns I have would be:
The assumption that the bytes handed to Write are in the default system code page, when the response could be in UTF-8 or any other encoding;
The assumption that each Write call contains only complete characters, when a multi-byte character can easily be split across two calls (see the sketch just below).
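On that second concern, the usual fix is to hold on to a Decoder (from Encoding.GetDecoder()) for the lifetime of the stream, since a Decoder buffers the trailing bytes of an incomplete character between calls. A minimal standalone sketch, assuming UTF-8:

using System;
using System.Text;

class SplitCharacterDemo
{
    static void Main()
    {
        byte[] cafe = Encoding.UTF8.GetBytes("café");   // 63 61 66 C3 A9

        // Decoding each chunk independently would tear the é (C3 A9) in half,
        // yielding "caf" + U+FFFD and then another U+FFFD. A Decoder keeps
        // state across calls, so the é survives the split:
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[cafe.Length];

        int n1 = decoder.GetChars(cafe, 0, 4, chars, 0);    // "caf" (C3 is buffered)
        int n2 = decoder.GetChars(cafe, 4, 1, chars, n1);   // A9 completes the é

        Console.WriteLine(new string(chars, 0, n1 + n2));   // café
    }
}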
The lots-of-copying stuff is something that others probably are much more interested in. I mean, I care -- but not for the purposes of this blog...
I think I am probably going to have to jump into ASP.NET a bit here and see if I can put some samples together. I'll probably have to dig up a website I have a bit more control over that runs managed code, but I think it would be a useful exercise, described later from soup to nuts.
Anyway sorry to dēkŏn'strŭkt' things so much. Or actually, to dēkən-strŭkt' them. I'll try to kən-strŭkt' a better kŏn'strŭkt' if I can, later.... :-)
This blog brought to you by ə (U+0259, aka LATIN SMALL LETTER SCHWA)
Mike Dimmick on 16 Sep 2008 10:09 AM:
The benefit you get from removing blank spaces is generally dwarfed by the benefit of compressing the page entirely. IIS 6.0 has two checkboxes to enable compressing static and dynamic pages.
Jan Kučera on 17 Sep 2008 2:11 PM:
I am able to offer a public Windows Server 2008 / IIS7 / .NET 3.5 server for your ASP.NET experiments if you would like, including switching the language settings / MUI on the fly...
Jan