Never doubt that a program like Notepad can change the world. Indeed, it is one of the only things that ever has!

by Michael S. Kaplan, published on 2011/10/03 14:01 +00:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/10/03/10219090.aspx

The question came in just the other day:

...customer says that when they specify a page as “UTF-8N”, it doesn’t render correctly on IE, but does on other browsers. I searched for “UTF-8N” and found references only on Japanese-language pages. The one English-language reference I found is this one where someone claimed that “UTF-8N” is simply UTF-8 without a BOM. Is that true? (How could anyone expect such a scheme to work?) Do we explicitly not support “UTF-8N”? Do we differ from competitors in this regard?

Peter pointed out what UTF-8N is:

UTF-8N was a proprietary designation that once appeared in some IBM documentation. It is not a valid charset identifier for use in HTML or XML. Valid charset identifiers are registered with IANA and can be found in this registry page:

http://www.iana.org/assignments/character-sets

This is specified, for instance, in section 5.2 of the HTML 4.01 spec:

http://www.w3.org/TR/1999/REC-html401-19991224/charset.html

Now in the original thread it was NaseerBatt who pointed out the same meaning, without mentioning the "non-standard nature" of it:

UTF-8 without BOM is UTF-8N .Do you mean there no in-built mechanism or hack in C#/.NET that helps to detect this encoding (UTF-8N)?

Now the issue of the UTF-8 BOM is an interesting one that Microsoft has been in the center of since the beta of Windows 2000, where Notepad changed the world in its simple decision to always emit the Byte Order Mark (BOM) any time you save a file as UTF-8.

It made life easier for many people working on other Microsoft products, since it is faster to work with the first few bytes of a file than to validate the encoding of the entire file.
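
To make that cost difference concrete, here is a minimal Python sketch of the two checks. The helper names has_utf8_bom and is_valid_utf8 are made up for illustration; this is not how Notepad or any Microsoft product actually does it, just the general idea of peeking at three bytes versus walking the whole buffer.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    # Cheap: only looks at the first three bytes (EF BB BF).
    return data.startswith(codecs.BOM_UTF8)

def is_valid_utf8(data: bytes) -> bool:
    # Expensive: the decoder has to examine every byte of the input.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Sample inputs (illustrative only).
with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")
without_bom = "héllo".encode("utf-8")
latin1 = "héllo".encode("latin-1")

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
print(is_valid_utf8(without_bom), is_valid_utf8(latin1))
```

Note that the BOM check only says a file claims to be UTF-8; a file without one can still be perfectly good UTF-8, which is the catch the rest of this post is about.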

Though outside of Microsoft, it isn't as well thought of.

So why do some other browsers seem to support it but not IE?

Well, remember that you can always have UTF-8 without a BOM -- in fact many people often prefer it.

Now there are two ways to deal with an HTML document that has a non-official "UTF-8N" encoding tag:

You can treat the file as some default encoding like ISO-8859-1 or Windows CP1252, or the CP_ACP, or
You can go through the various encoding detection steps that you would if there was no encoding tag.
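
The two approaches can be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the function names are hypothetical, Python's own codec registry stands in for "is this a label we recognize?", cp1252 stands in for the default, and the detection step is reduced to a BOM check plus a UTF-8 validity check. It is not a description of what IE or any other browser really does.

```python
import codecs

def resolve_label(label: str):
    # Return the canonical codec name, or None if the label is unknown.
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None

def decode_with_default(data: bytes, label: str, default: str = "cp1252") -> str:
    # Approach 1: unknown label, so fall straight back to a default encoding.
    encoding = resolve_label(label) or default
    return data.decode(encoding, errors="replace")

def decode_with_detection(data: bytes, label: str, default: str = "cp1252") -> str:
    # Approach 2: unknown label, so behave as if there were no label at all.
    encoding = resolve_label(label)
    if encoding is None:
        if data.startswith(codecs.BOM_UTF8):
            encoding = "utf-8-sig"  # strips the BOM while decoding
        else:
            try:
                return data.decode("utf-8")
            except UnicodeDecodeError:
                encoding = default
    return data.decode(encoding, errors="replace")

page = "café".encode("utf-8")  # UTF-8 bytes with no BOM
print(decode_with_default(page, "UTF-8N"))
print(decode_with_detection(page, "UTF-8N"))
```

With the first approach the mislabeled page comes out as mojibake; with the second it reads fine, which would match the difference the customer saw between browsers.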

There are strong philosophical issues underlying the choice of approaches, including the one that appears to be used by IE in this case: if the text is mislabeled, don't try to do any further work....

In the end, the best answer is probably just not to tag text inappropriately....

I won't suggest inserting a BOM in front of UTF-8 unless you're sure you won't start complaining about why that's a bad idea.

comments not archived
