If you lie, that replacement character might pop in (the one that isn't Paul Westerberg)

by Michael S. Kaplan, published on 2010/11/30 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/11/30/10098113.aspx

Late last month, JC Ahangama sent me the following question via email (to the trigeminal.com webmaster address):

Hi,

I am writing to this address because I could not find the address of great
Guru Michael Kaplan. The attached file explains my question as well as my
complaint. I do not have high connections to Unicode to do this but sure Dr.
Kaplan is the man for it.

It is about European characters getting step-motherly treatment.

Thanks.

JC (Ahangma)

Those characters would show up as square boxes (notdef glyphs) in Internet Explorer, and as the diamond question mark character in Firefox.

Browsers must explicitly make some choices, and in this case, both Internet Explorer and Firefox are choosing to trust the charset meta tag.

And since the bytes from 0xA1 to 0xFF that encode these characters do not form valid UTF-8 sequences on their own, each successive byte is converted to the replacement character (U+FFFD).
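
To make that concrete, here is a minimal sketch in Python of what happens when bytes saved in a legacy single-byte encoding are decoded as UTF-8. The sample text and the choice of windows-1252 (`cp1252`) are assumptions for illustration, not taken from JC's file, and the browsers' actual decoders may differ in detail from Python's.

```python
# Sample text; an assumption for illustration, not from the original file.
text = "café à la crème"

# Save it in a legacy single-byte encoding (assumed here: windows-1252).
# The accented letters become the single bytes 0xE9, 0xE0 and 0xE8.
raw = text.encode("cp1252")

# A decoder that trusts a wrong "charset=utf-8" label sees lone bytes that
# are not valid UTF-8 sequences and substitutes U+FFFD for each of them.
mislabeled = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were really saved in round-trips.
correct = raw.decode("cp1252")

print(mislabeled)
print(correct)
```

Each accented letter in this sample is followed by an ASCII character, so each stray byte is replaced on its own, which matches the one-box-per-character effect described above.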

As he noted, the problem is fixed if you change the charset meta tag -- at which point the page is no longer lying about how it is encoded....

The moral of the story is not to put the wrong charset meta tag in the page -- if you want it to be tagged as UTF-8, make sure it is saved as UTF-8.
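
One way to catch this before publishing is to check that the bytes of the page really decode under the charset it declares. The following is only a sketch: the helper name `matches_declared_charset` is hypothetical, and the check is only meaningful for strict encodings such as UTF-8, since most single-byte code pages will accept nearly any byte sequence.

```python
def matches_declared_charset(data: bytes, declared: str) -> bool:
    """Return True if `data` decodes cleanly under the declared charset.

    Hypothetical helper for illustration. A True result does not prove
    the label is right, only that the bytes are not invalid for it.
    """
    try:
        data.decode(declared)  # strict decoding: raises on invalid bytes
    except UnicodeDecodeError:
        return False
    return True


# A page saved as windows-1252 but tagged as UTF-8 fails the check.
saved_as_cp1252 = "résumé".encode("cp1252")
saved_as_utf8 = "résumé".encode("utf-8")

print(matches_declared_charset(saved_as_cp1252, "utf-8"))
print(matches_declared_charset(saved_as_utf8, "utf-8"))
```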

According to Mark Davis of Google, there are a lot of incorrectly tagged web pages out there, which Google indexes using the encoding it believes the page is actually in rather than the one declared. This leads to interesting problems, since many browsers will not display the page in the same way that Google indexed it (meaning you may not be able to see the text that you were searching for and that Google claimed to find on such a page).

I wish that it were as easy to get, rather than U+FFFD REPLACEMENT CHARACTER, the "Replacements character," Paul Westerberg.

Though if Mark is right about the number of incorrectly tagged pages then that would mean one hell of a touring schedule (and a lot more common than saying 'Biggie Smalls' three times!)....

I think that Google do the right thing to index the pages with their true charset. Both IE and FF allow the user to override the declared character encoding, and for Russian webpages this has for many years been the panacea, because of the koi8 fight against win1251 (and others). Recently I find the pages render much better, not only due to the wide acceptance of UTF-8, but also better education of webmasters and easier i18n configuration of popular servers.

Anyway, both browsers display the high-ascii symbols of JC's example correctly as soon as the encoding menu is not auto.

I might agree if it were not so easy to misunderstand how to fix it, or even that it needs fixing -- people are easily confused here; the browsers could all do better.
