Is the text in XKCD broken?

by Michael S. Kaplan, published on 2010/04/01 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/04/01/9988607.aspx

The other day my friend Samantha (of "sam I am" fame, for regular readers, my part time music groupie friend) pointed me at an XKCD column.

She thought it might be right up my alley as it looked like it had one of those "cobe page" (that isn't a typo, she called it COBE page) problems.

Hover over the comic to see the problem in the tooltip, which should look something like this:

Yet if you are using Internet Explorer (I was using 8.0, and so was Sam), the browser does not show it that way. This is what both of us saw:

If you want to see the text right and you do not have Firefox and you don't want to right-click and change the encoding in IE, you can:

The text will suddenly look right as these two sequences are proof enough for Notepad to know this is UTF-8, not cp1252:

True, there is a (useless) encoding="utf-8" for the xml (useless because the default encoding for xml is utf-8).

But apparently the browsers (or at least IE) ignore the xml "envelope" and just deal with the html part. And there is nothing to indicate that it is utf-8.

The HTTP response header says just

Content-Type text/html

instead of

Content-Type text/html; charset=UTF-8

And the html file does not have a Content-Type meta in the head section.

(<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />)
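To make the header difference concrete, here is a minimal Python sketch of how a client might pick a decoder from the Content-Type value. The helper names (charset_from_content_type, decode_body) and the cp1252 fallback are assumptions for illustration; real browsers use more elaborate sniffing rules.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lowercased, None when absent


def decode_body(body, content_type, fallback="cp1252"):
    """Decode bytes with the declared charset, else an assumed fallback."""
    charset = charset_from_content_type(content_type) or fallback
    return body.decode(charset, errors="replace")


body = "café".encode("utf-8")
print(decode_body(body, "text/html"))                 # cafÃ©
print(decode_body(body, "text/html; charset=UTF-8"))  # café
```

With no charset in the header, the UTF-8 bytes are read as cp1252 and the accented letter turns into two unrelated characters, which is the same kind of damage seen in the tooltip.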

So in the end I would not blame IE (too much :-)

It's messed up in Chrome and Opera.

It is an error for the XML declaration and http-equiv meta tag not to match the charset declared in the HTTP Content-Type header, regardless of what the HTML, XML and XHTML standards say about the matter. It is an error, it can crash intermediate systems, and if an intermediate system (forward/reverse proxy, etc.) is in play, it can conceivably crash or cause unexpected results in the browser itself.

For example, a certain blue company that is rather big has a reverse proxy product that runs regular expressions against text types, and rewrites URLs. To keep it fast, they use one code path regardless of the type of text, and they assume that the charset declared in the HTTP header is accurate.

So if the header says Shift-JIS, and you're sending UTF-8, it's going to try to apply the Shift-JIS decoder to your UTF-8 text. I don't expect that to work very well.
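A quick way to see why a wrong charset label hurts, in either direction, is to decode the same text with the wrong codec. This is only a Python sketch of the general effect, not a model of any particular proxy, and the sample string is an arbitrary choice.

```python
text = "日本語"

# UTF-8 bytes read through a Shift-JIS decoder: no exception with
# errors="replace", but the result is not the original text.
garbled = text.encode("utf-8").decode("shift_jis", errors="replace")
print(garbled == text)  # False

# Shift-JIS bytes read through a strict UTF-8 decoder: rejected outright.
try:
    text.encode("shift_jis").decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False
print(strict_utf8_ok)  # False
```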

When they encode text, they encode it using the same charset you specified. So if the charset in the HTTP transaction was UTF-8, but you're actually giving it UTF-16 XML, you get (dependent on byte order) either the first less-than sign, or nothing at all, being processed and passed along to the user.
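The byte-order remark is easier to follow with the bytes in view. The sketch below assumes the intermediate system treats the body as a C-style NUL-terminated string, which the description above implies but does not state, and that the UTF-16 data carries no byte order mark; c_string_view is a hypothetical helper.

```python
def c_string_view(data: bytes) -> bytes:
    """Return the bytes a NUL-terminated string routine would see."""
    return data.split(b"\x00", 1)[0]


xml = "<?xml version='1.0'?>"

# Little-endian: '<' is stored as 3C 00, so the string ends after '<'.
little_endian = xml.encode("utf-16-le")
print(c_string_view(little_endian))  # b'<'

# Big-endian: '<' is stored as 00 3C, so the string is empty.
big_endian = xml.encode("utf-16-be")
print(c_string_view(big_endian))     # b''
```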

The specs do technically allow you to specify the encoding via the XML declaration or the meta tag, which works fine for files on disk or being transported as SMTP attachments (although the declared charset should match there as well). Most file systems don't have a useful standardized way of communicating the information.

The reason they don't catch this at a spec level is that it works for the most common case. The most common case is that the web server declares US-ASCII/ISO Latin as the charset (because that's usually the default), and since TCP is 8-bit clean, and that is a single-byte character set, it works OK.

However, the reality is, the intermediate systems involved in an HTTP transaction rely on the HTTP header being set correctly, and so you really do need for it to be correct.

IE doesn't do XHTML in any released version. It's planned for IE 9.

Use of the <?xml XML declaration causes IE 8 and earlier to miss the following DOCTYPE declaration and drop into Quirks mode. XKCD also has a processing instruction that IE doesn't understand before the DOCTYPE. Fundamentally it's an HTML, not XHTML, browser.

XKCD is relying entirely on this XML declaration - it specifies Content-Type text/html with no Charset (and this will cause some conforming browsers to drop into HTML, rather than XHTML, mode, and still ignore the XML declaration).

@Headache: the reason is that the server specifies utf-8. The actual HTTP header overrules any META tag in the document. The META HTTP-EQUIV feature was intended to instruct the server what HTTP headers it should output, but that feature does not work on any server that I'm aware of.

In addition that page has an HTML comment, apparently from the version control system, which precedes the DOCTYPE declaration and causes IE to drop into Quirks mode.
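A simple way to check a page for this kind of problem is to look at what comes before the DOCTYPE. The following Python sketch is just a lint-style helper (the function name and sample markup are made up for illustration); it does not reproduce IE's actual mode-switching logic.

```python
def text_before_doctype(html):
    """Return whatever precedes the DOCTYPE, or None if there is no DOCTYPE."""
    idx = html.lower().find("<!doctype")
    if idx == -1:
        return None
    return html[:idx].strip()


# Hypothetical page start: XML declaration, then a comment, then the DOCTYPE.
sample = (
    '<?xml version="1.0" encoding="utf-8" ?>\n'
    '<!-- hypothetical version control comment -->\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html></html>'
)
print(text_before_doctype(sample))
```

An empty string means the DOCTYPE is the first thing in the document; anything else is a candidate for what the comments above describe.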

Mike: I know you don't manage the blog software, but you might like to make the people who do aware that the CAPTCHA image doesn't work a large proportion of the time. /blogs/JpegImage.aspx just redirects to the aspxerrorpage.htm page - presumably dynamic error handling. You have to refresh a lot and be lucky if you want to leave a comment.