Is the text in XKCD broken?

by Michael S. Kaplan, published on 2010/04/01 07:01 -04:00, original URI:

The other day my friend Samantha (of "sam I am" fame, for regular readers, my part time music groupie friend) pointed me at an XKCD column.

She thought it might be right up my alley as it looked like it had one of those "cobe page" (that isn't a typo, she called it COBE page) problems.

It was this comic:

Using a ring to bind someone you covet into your dark and twisted world? Wow, just got the subtext there. Also, the apparently eager Beyoncé would've made one badass Nazgȗl.

Hover over it to see the problem in the tooltip; it should look something like this:

Or maybe I could just show it right here:

Using a ring to bind someone you covet into your dark and twisted world? Wow, just got the subtext there. Also, the apparently eager Beyoncé would've made one badass Nazgȗl.


Let's look at the source of the page:

Aha, my regular readers might know what's going on here....

It may remind people of blogs like Consistent garbage text can be incorrect encoding identification (or detection) and Do not adjust your browser, a.k.a. sometimes two wrongs DO make a right, a.k.a. dumb quotes, and one of these is pretty much what is going on here, as a matter of fact.

I mean, the coDe page for that page is indeed marked UTF-8:

Yet if you are using Internet Explorer (I was using 8.0, and so was Sam), the browser did not seem to either of us to be seeing it that way:

On the other hand, there is Firefox.

My Firefox seems to do it right:

If you want to see the text right and you do not have Firefox and you don't want to right-click and change the encoding in IE, you can:

  1. Open Notepad on a machine with a cp1252 system default codepage;
  2. Paste the text in green above into Notepad;
  3. Save the file;
  4. Close the file;
  5. Open the file.

The text will suddenly look right as these two sequences are proof enough for Notepad to know this is UTF-8, not cp1252:

Using a ring to bind someone you covet into your dark and twisted world? Wow, just got the subtext there. Also, the apparently eager Beyoncé would've made one badass Nazgȗl.
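The Notepad round trip above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the two non-ASCII characters are taken to be U+00E9 and U+0217, the function names are made up for this example, and `looks_like_utf8` is a crude stand-in for encoding detection, not a description of what Notepad actually does internally.

```python
# Part of the tooltip text, with its two non-ASCII characters written as
# escapes (assumed here to be U+00E9 and U+0217).
TEXT = "Beyonc\u00e9 would've made one badass Nazg\u0217l."


def misread_as_cp1252(text: str) -> str:
    """Encode as UTF-8, then decode those bytes as cp1252 (the mislabeling)."""
    return text.encode("utf-8").decode("cp1252")


def looks_like_utf8(raw: bytes) -> bool:
    """Hypothetical heuristic: non-ASCII bytes that all form valid UTF-8."""
    if raw.isascii():
        return False  # pure ASCII proves nothing either way
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


garbled = misread_as_cp1252(TEXT)   # what the mislabeled tooltip shows
saved = garbled.encode("cp1252")    # bytes written by a cp1252 "Save"

print(garbled)
print(looks_like_utf8(saved))       # the two sequences are valid UTF-8
print(saved.decode("utf-8") == TEXT)
```

Saving the garbled text as cp1252 writes back exactly the original UTF-8 bytes, which is why reopening the file can make the text look right again.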

Wow! That is like Behind 'How to break Windows Notepad', but in reverse!

Just like Sauron would have wanted it. :-)

Mihai on 1 Apr 2010 12:08 PM:

True, there is a (useless) encoding="utf-8" for the xml (useless because the default encoding for xml is utf-8).

But apparently the browsers (or at least IE) ignore the xml "envelope" and just deal with the html part. And there is nothing to indicate that it is utf-8.

The HTTP response header says just

  Content-Type text/html

instead of

  Content-Type text/html; charset=UTF-8

And the html file does not have a Content-Type meta in the head section.

(<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />)

So in the end I would not blame IE (too much :-)
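As a rough sketch of the lookup order described in this comment (HTTP header first, then the meta tag, then a default), something like the following Python could be used. The function names, the 1024-byte scan window, and the `iso-8859-1` fallback are assumptions made for this example; real browsers apply more elaborate sniffing rules.

```python
import re
from email.message import Message
from typing import Optional

# Loose pattern for a charset declared inside a <meta> tag (illustrative only).
META_RE = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def charset_from_header(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if present."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None when absent


def charset_from_meta(html: bytes) -> Optional[str]:
    """Look for a charset in a <meta> tag near the start of the document."""
    match = META_RE.search(html[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def pick_charset(content_type: str, html: bytes,
                 default: str = "iso-8859-1") -> str:
    """Header wins, then the meta tag, then an assumed default."""
    return charset_from_header(content_type) or charset_from_meta(html) or default


# The situation described above: no charset in the header, no meta tag.
print(pick_charset("text/html", b"<html><head></head><body></body></html>"))
# What the header could have said instead.
print(pick_charset("text/html; charset=UTF-8", b""))
```

With neither source of information present, the sketch falls back to the default, which is the behavior the comment is describing for IE.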

Michael S. Kaplan on 1 Apr 2010 12:22 PM:

Sam's comment to me was that she doesn't right click. :-)

Skip on 1 Apr 2010 1:54 PM:

Interesting thing about this one - I read xkcd in google reader, inside of IE (I think IE8 on Win7 when this comic came out).   And as far as I know, the popup text was correct there, so I never saw this.  I guess google reader does this correctly.

Michael S. Kaplan on 1 Apr 2010 3:05 PM:

Well, as Mihai's words point out, "correct" is relative; I think HTML has a different default than XML such that technically IE might be more conformant by assuming no encoding equals ISO 8859-1. Though of course more conformant to correctness also has its benefits as well. :-)

Yuhong Bao on 1 Apr 2010 6:53 PM:

BTW, the <?xml version=1.0> forces quirks mode in IE

Dave Bacher on 2 Apr 2010 12:42 PM:

It's messed up in Chrome and Opera.

It is an error for the xml-declaration and http-equiv meta tag not to match the HTTP content-encoding header, regardless of what the HTML, XML and XHTML standards say about the matter.  It is an error, it can crash intermediate systems, and if an intermediate system (forward/reverse proxy, etc.) is in play, it can conceptually crash or cause unexpected results in the browser itself.

For example, a certain blue company that is rather big has a reverse proxy product that runs regular expressions against text types, and rewrites URLs.  To keep it fast, they use one code path regardless of the type of text, and they assume that the HTTP content encoding is accurate.

So if that says shift-jis, and you're sending UTF-8, it's going to try to apply the shift-jis decoder to your UTF-8 text.  I don't suspect that is going to work very well.

When they encode text, they encode it using the same content-encoding you specified.  So if the content-encoding in the HTTP transaction was UTF8, but you're actually giving it UTF16 XML, you get (dependent on byte order) either the first less than, or nothing at all, being processed and passed along to the user.
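One way to picture the "first less than, or nothing at all" outcome is the small Python sketch below. It assumes, purely for illustration, that the intermediate system treats the payload as a NUL-terminated single-byte string and that the UTF-16 data carries no byte order mark; the sample document and the function name are hypothetical, and the comment does not say how the actual product works.

```python
# A tiny XML document standing in for the real payload (made up for this example).
XML = '<?xml version="1.0"?><root/>'


def seen_by_byte_oriented_proxy(raw: bytes) -> bytes:
    """Hypothetical proxy view: stop at the first NUL byte, C-string style."""
    return raw.split(b"\x00", 1)[0]


little_endian = XML.encode("utf-16-le")  # starts with b'<\x00?\x00'
big_endian = XML.encode("utf-16-be")     # starts with b'\x00<\x00?'

print(seen_by_byte_oriented_proxy(little_endian))
print(seen_by_byte_oriented_proxy(big_endian))
```

In little-endian order the first byte is the less-than sign followed by a NUL, so only that one character survives; in big-endian order the very first byte is NUL, so nothing does.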

The specs do technically allow you to specify the encoding via the XML declaration or the meta tag, which works fine for files on-disk or being transported as SMTP attachments (although content-encoding should match there as well).  Most file systems don't have a useful standardized way of communicating the information.

The reason they don't catch this, at a spec level, is because it works for the most common case.  The most common case is that the web server is serving US ASCII/ISO Latin as the content encoding (because that's usually the default), and since TCP is 8bit clean, and that is a single byte character set, it works OK.

However, the reality is, the intermediate systems involved in an HTTP transaction rely on the HTTP header being set correctly, and so you really do need for it to be correct.

Headache on 4 Apr 2010 1:10 PM:

When I look at the hostname I am quite amused that this page isn't shown "correctly" in either IE or Firefox.

Mike Dimmick on 6 Apr 2010 10:52 AM:

IE doesn't do XHTML in any released version. It's planned for IE 9.

Use of the <?xml XML declaration causes IE 8 and earlier to miss the following DOCTYPE declaration and drop into Quirks mode. XKCD also has a processing instruction that IE doesn't understand before the DOCTYPE. Fundamentally it's an HTML, not XHTML, browser.

XKCD is relying entirely on this XML declaration - it specifies Content-Type text/html with no Charset (and this will cause some conforming browsers to drop into HTML, rather than XHTML, mode, and still ignore the XML declaration).

@Headache: the reason is that the server specifies utf-8. The actual HTTP header overrules any META tag in the document. The META HTTP-EQUIV feature was intended to instruct the server what HTTP headers it should output, but that feature does not work on any server that I'm aware of.

In addition that page has an HTML comment, apparently from the version control system, which precedes the DOCTYPE declaration and causes IE to drop into Quirks mode.

Mike: I know you don't manage the blog software, but you might like to make the people who do aware that the CAPTCHA image doesn't work a large proportion of the time. /blogs/JpegImage.aspx just redirects to the aspxerrorpage.htm page - presumably dynamic error handling. You have to refresh a lot and be lucky if you want to leave a comment.
