by Michael S. Kaplan, published on 2006/07/18 06:10 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/07/18/669509.aspx
Occam's Razor is a principle easily stated in Latin (entia non sunt multiplicanda praeter necessitatem) or English (entities should not be multiplied beyond necessity). Applying it to UTF-8 is an obvious matter -- the shortest form of the encoding is the correct one. :-)
Anyway, Tom Gewecke has been wanting me to talk about a particular issue to do with the problem of non-shortest forms for a few months now....
First he sent me the following via the Contacting Michael link:
I recently came across some bizarre behavior by Win IE 6 , where it will interpret 3 different illegal UTF-8 byte sequences the same as a legal one, presumably because only the last 6 bits of the last 2 bytes in 3 byte sequences are being read. I had never seen reference to this anywhere before, and don't know if it is an app issue or an OS issue. It made it almost impossible to convince the author of a particular Greek UTF-8 web page that his code was totally wrong, since Win IE displayed it correctly. Have you ever heard of this or know if it is supposed to get fixed (or is it perhaps not considered a bug)? A demo is at
Around the same time, a rather extensive thread was going on over on the Unicode List about the same issue... the fact that IE 6.0 and Outlook HTML mail were accepting non-shortest form UTF-8 and interpreting it.
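To make Tom's guess concrete, here is a minimal Python sketch of a decoder that behaves the way he describes, reading only the low six bits of the two trailing bytes of a 3-byte sequence and never checking their top two bits. The function name `lax_decode3` and the sample character U+1F00 are assumptions chosen for illustration; this is not IE's actual code, only one way the described symptom could arise.

```python
def lax_decode3(seq: bytes) -> str:
    """Hypothetical lax decoder for one 3-byte sequence.

    Keeps the low 4 bits of the lead byte and the low 6 bits of each
    trailing byte, without verifying that the trailing bytes start
    with the required bit pattern 10xxxxxx.
    """
    if len(seq) != 3:
        raise ValueError("expected exactly 3 bytes")
    cp = ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)
    return chr(cp)


# A legal 3-byte sequence: U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI.
legal = "\u1f00".encode("utf-8")  # bytes E1 BC 80

# Three illegal variants: same low 6 bits in the last byte, but the
# top two bits are 00, 01, or 11 instead of the required 10.
illegal = [bytes([0xE1, 0xBC, top | 0x00]) for top in (0x00, 0x40, 0xC0)]

for seq in illegal:
    # The lax decoder cannot tell these apart from the legal sequence...
    assert lax_decode3(seq) == lax_decode3(legal) == "\u1f00"
    # ...while Python's own strict UTF-8 decoder rejects each of them.
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        pass
    else:
        raise AssertionError("strict decoder accepted an illegal sequence")
```

With a decoder like this, one legal sequence and three illegal ones all display as the same character, which matches the "3 different illegal UTF-8 byte sequences" in the report.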
A little while later he posted a comment in a thread here:
Some Windows programs (IE and Outlook that I know of) will also interpret invalid UTF-8 sequences as if they were real characters. A test of this is at
Of course, there is one of those rules that if you wait long enough (I just had not gotten to posting about the issue yet; I've been busy!), issues will resolve themselves. He wrote to me just the other day via the Contacting Michael link:
I have one report that IE 7 beta 3 fixes the problem with IE interpreting bad utf-8 as actual Unicode characters, so if that is correct my earlier query regarding this issue can be considered OBE.
First, I want to apologize to Tom; I had not meant to wait that long. I have been keeping busy lately, though, and at some point I'll probably be able to explain what has been keeping me busy.
Second, it does indeed look like folks on the IE team at the very least read the Unicode List, and were able to turn the feedback into the effort to fix the issue. I do not know the exact build that the fix is in, but I too have been told that it has been addressed. :-)
And last, I did want to point out one thing about the issue in general, especially as it relates to places that the final fix does not reach....
I believe that this was a very sensible issue to address, especially since a UTF-8 corrigendum went out some time ago that expressly stated that non-shortest forms of UTF-8 should not be accepted. The older text stated that while the non-shortest form should never be produced, it could be accepted by a process -- and it was under that older rule that a lot of the UTF-8 support in MS products was written.
The truth is that when Unicode makes a change to the standard such as this, it takes time before the change gets propagated (and the change does not always get propagated to every version of every product produced by every company)....
It is one of the reasons that Microsoft has started taking a more active interest in adopting the most recent versions of Unicode rather than 'hanging back', an issue that I will talk more about soon: to make sure that things can be implemented sooner, and of course because the more people are reviewing things, the greater the likelihood of finding and avoiding problems.
In the end, everybody wins (well, everyone other than people with non-shortest form UTF-8 web pages?). :-)
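The difference between the older rule and the corrigendum can be sketched in a few lines of Python. The decoder below is illustrative only: `decode_one`, `MIN_FOR_LENGTH`, and the `allow_overlong` flag are hypothetical names, and the sketch deliberately skips other checks a real decoder needs (surrogates, code points above U+10FFFF). It is not how any Microsoft product implements UTF-8.

```python
# Smallest code point that legitimately needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_one(seq: bytes, allow_overlong: bool = False) -> int:
    """Decode exactly one UTF-8 sequence to a code point (sketch only).

    allow_overlong=True mimics the older, permissive rule;
    allow_overlong=False rejects non-shortest forms.
    """
    if not seq:
        raise ValueError("empty input")
    lead = seq[0]
    if lead < 0x80:
        length, cp = 1, lead
    elif (lead & 0xE0) == 0xC0:
        length, cp = 2, lead & 0x1F
    elif (lead & 0xF0) == 0xE0:
        length, cp = 3, lead & 0x0F
    elif (lead & 0xF8) == 0xF0:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if (b & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    # The shortest-form check: the value must actually need this many bytes.
    if not allow_overlong and cp < MIN_FOR_LENGTH[length]:
        raise ValueError("non-shortest form")
    return cp


# C0 AF is a 2-byte, non-shortest encoding of U+002F ('/').
overlong_slash = b"\xc0\xaf"
print(hex(decode_one(overlong_slash, allow_overlong=True)))  # permissive
try:
    decode_one(overlong_slash)  # strict
except ValueError as err:
    print("rejected:", err)
```

The classic reason for the stricter rule is visible in the example: a filter that scans the raw bytes for `/` never sees one in `C0 AF`, yet a permissive decoder downstream produces it anyway.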
This post brought to you by ༺ (U+0f3a, a.k.a. TIBETAN MARK GUG RTAGS GYON)