Occam's Razor, as applied to UTF-8

by Michael S. Kaplan, published on 2006/07/18 06:10 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/07/18/669509.aspx


Occam's Razor is a principle easily stated in Latin (entia non sunt multiplicanda praeter necessitatem) or English (entities should not be multiplied beyond necessity). Applying it to UTF-8 is an obvious matter -- the shortest form of the encoding is the correct one. :-)
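To make the "shortest form" idea concrete, here is a small sketch in Python (the byte values are standard UTF-8; the example itself is mine, not from the original post). A scalar value has exactly one shortest encoding, and a conformant decoder must reject any longer byte sequence that would carry the same bits:

```python
# U+0041 'A' has exactly one shortest (and therefore valid) UTF-8 form.
shortest = 'A'.encode('utf-8')
assert shortest == b'\x41'

# An overlong two-byte form of the same scalar (110_00001 10_000001)
# carries identical payload bits, but a strict decoder must reject it.
overlong = b'\xc1\x81'
try:
    overlong.decode('utf-8')
    raise AssertionError('overlong form was accepted')
except UnicodeDecodeError:
    pass  # Python's strict UTF-8 decoder rejects non-shortest forms
```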

Anyway, Tom Gewecke has been wanting me to talk about a particular issue to do with the problem of non-shortest forms for a few months now....

First he sent me the following via the Contacting Michael link:

I recently came across some bizarre behavior by Win IE 6, where it will interpret 3 different illegal UTF-8 byte sequences the same as a legal one, presumably because only the last 6 bits of the last 2 bytes in 3-byte sequences are being read. I had never seen reference to this anywhere before, and don't know if it is an app issue or an OS issue. It made it almost impossible to convince the author of a particular Greek UTF-8 web page that his code was totally wrong, since Win IE displayed it correctly. Have you ever heard of this or know if it is supposed to get fixed (or is it perhaps not considered a bug)? A demo is at

http://homepage.mac.com/thgewecke/badutf8.html

Around the same time, a rather extensive thread was going on over on the Unicode List about the same issue... the fact that IE 6.0 and Outlook HTML mail were accepting non-shortest form UTF-8 and interpreting it.

A little while later he posted a comment in a thread here:

Some Windows programs (IE and Outlook that I know of) will also interpret invalid UTF-8 sequences as if they were real characters. A test of this is at

http://homepage.mac.com/thgewecke/badutf8.html

Of course, there is one of those rules that if you wait long enough (I just had not gotten around to posting about the issue yet -- I've been busy!), the issues will resolve themselves. He asked me just the other day via the Contacting Michael link:

I have one report that IE 7 beta 3 fixes the problem with IE interpreting bad utf-8 as actual Unicode characters, so if that is correct my earlier query regarding this issue can be considered OBE.

http://homepage.mac.com/thgewecke/badutf8.html

First, I want to apologize to Tom; I had not meant to wait that long. I have been keeping busy lately, though -- at some point I'll probably be able to explain what has been keeping me busy.

Second, it does indeed look like folks on the IE team at the very least read the Unicode List, and were able to turn the feedback into the effort to fix the issue. I do not know the exact build that the fix is in, but I too have been told that it has been addressed. :-)

And last, I did want to point out one thing, about the issue in general. Especially as it relates to places that the final fix does not reach....

I believe this was a very sensible issue to address, especially since a UTF-8 corrigendum went out some time ago that expressly stated that non-shortest forms of UTF-8 must not be accepted. The older text stated that while the non-shortest form should never be produced, it could be accepted by a process -- and it was under that older rule that a lot of the UTF-8 support in MS products was done.
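The security motivation behind that corrigendum can be sketched in Python (the 0xC0 0xAF overlong slash is the classic textbook illustration; it is my example, not one from the post). A filter that scans raw bytes for '/' sees nothing suspicious in the overlong form, so if a later, lenient decoder accepts it, the check is bypassed:

```python
# U+002F '/' is one byte in shortest form, so a byte-level check like
# b'/' not in data is a plausible guard against path separators.
overlong_slash = b'\xc0\xaf'   # overlong two-byte encoding of '/'

assert b'/' not in overlong_slash   # the byte-level filter sees nothing

# A lenient decoder that just reads the low six bits of each trail byte
# would still produce '/'; a strict decoder refuses instead.
try:
    overlong_slash.decode('utf-8')
    raise AssertionError('overlong form was accepted')
except UnicodeDecodeError:
    pass  # rejected, as Corrigendum #1 requires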

The truth is that when Unicode makes changes to the standard such as this, it takes time before the change gets propagated (and the change does not always get propagated to every version of every product produced by every company)....

It is one of the reasons that Microsoft has started taking a more active interest in adopting the most recent versions of Unicode rather than 'hanging back', an issue that I will talk more about soon. The goal is to make sure that things can be implemented sooner, and of course, if more people are reviewing things, there is a greater likelihood of finding and avoiding problems.

In the end, everybody wins (well everyone other than people with non-shortest form UTF-8 web pages?). :-)

 

This post brought to you by ༺ (U+0F3A, a.k.a. TIBETAN MARK GUG RTAGS GYON)


# Tom Gewecke on 18 Jul 2006 1:29 PM:

Thanks much for your comments! However, I don't think this particular bug had anything to do with "non-shortest form" or any earlier change in the definition of acceptable UTF-8. It was just a matter of a faulty UTF-8 decoder which interpreted some byte sequences which have never represented characters under any definition as if they were the same as a sequence that does represent one (all sequences having the same length).

# Michael S. Kaplan on 18 Jul 2006 2:15 PM:

Hi Tom,

The definition of non-shortest form UTF-8 is captured in these characters, which were once legal to accept (but not emit). We just needed the product to catch up to the standard, that's all....

# Nick Lamb on 18 Jul 2006 8:21 PM:

You're quite right Tom, your example sequences aren't over-long, they're just plain invalid. Don't worry about Michael, he makes a lot of mistakes and isn't very good at admitting it.

Anyway, now that we're here, let's look at one of the examples in binary...

E1 BC D0 = 11100001 10111100 11010000

The Internet Explorer decoder does exactly what Tom said: it sees that E1 should have two trail bytes, reads out the bottom six bits of each of the next two code units, and combines them with the E1 lead byte to give the result U+1F10 -- which is incorrect.

A compliant decoder reads the first two code units OK, but the third code unit should be a trail byte, and its leading '11' bits show that it is actually a lead byte. That is an error, so the decoder emits U+FFFD* and tries again with D0 as a lead byte; but there are no trail bytes after it, so it emits another U+FFFD before continuing to decode the document.

The other examples behave similarly.

* Of course it would also be compliant to throw an exception, return an error result etc. but that's not very practical in a tag soup web browser.
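Nick's walk-through can be checked against any strict decoder; here is a small sketch in Python using the byte sequence from the comment (the bit arithmetic reconstructing the buggy result is my own illustration):

```python
bad = b'\xe1\xbc\xd0'   # E1 BC D0, the example sequence above

# A compliant decoder consumes E1 BC, hits the unexpected lead byte D0,
# substitutes U+FFFD, then fails again on the truncated D0 sequence.
decoded = bad.decode('utf-8', errors='replace')
assert decoded == '\ufffd\ufffd'

# What the buggy decoder computed: the low four bits of the E1 lead byte
# plus the low six bits of each following byte, skipping the trail-byte check.
wrong = ((0xE1 & 0x0F) << 12) | ((0xBC & 0x3F) << 6) | (0xD0 & 0x3F)
assert wrong == 0x1F10   # the incorrect U+1F10 that IE displayed
```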

# Tom Gewecke on 18 Jul 2006 9:08 PM:

Thanks for the comments, Nick.  My impression is that such byte sequences were illegal even before the Unicode Corrigendum #1.  TUC 3.0 (3.8, D31) seems to indicate that a decoder running into them should react by "signaling an error, filtering the code value out, or representing the code value with a marker such as U+FFFD."

# Michael S. Kaplan on 19 Jul 2006 3:10 AM:

Whether this particular example was contrived in error is not the real issue here. And neither is any sort of fetish about how wrong anyone thinks I am....

The fact is that in the original examples that concerned Tom there were actual UTF-8 pages that looked okay even though they were incorrect byte sequences. And IE, whose goal was to work with a wide variety of crap that existed on the web -- including crap created by MS/non-MS tools and for MS/non-MS browsers.

It was only later that people starting worrying about the security issues here -- and arguably IE is a little later coming to the party than some other products. That's all....

# Nick Lamb on 19 Jul 2006 5:03 AM:

That's right Tom, your examples are almost identical to those found in Unicode test suites as illegal lead/trail sequences. No compliant decoder, regardless of the version of the Unicode / IETF / ISO standard, could ever have given the results that you see from Internet Explorer.

If you look at Ken Thompson's sketched code for processing FSS-UTF you might see a family resemblance to the Internet Explorer code. It has all the relevant bugs here (but of course that code was an illustration, and FSS-UTF isn't compatible with UTF-8).

Did you ever find out what was generating the invalid sequences in the Greek page?

# Tom Gewecke on 19 Jul 2006 9:35 AM:

One thing I find interesting about this is how little trouble has apparently resulted from having a non-compliant utf-8 decoder in such wide use for a considerable time. The (long-gone) Greek site which alerted me to the issue is the only case of systematically illegal utf-8 on the web that I have ever seen or heard of, and it was the result of the author writing his own utf-16 to utf-8 encoder. Even the case where unexpected Chinese is displayed, when Latin-1 accented chars are read as if they were utf-8, seems to be very uncommon in practice, or at least has never bothered anyone very much. Whether this phenomenon could be a security issue in certain circumstances I still don't know.
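The Latin-1-read-as-UTF-8 mojibake Tom mentions is easy to reproduce (an illustrative sketch in Python; the particular characters are my own example, not from the thread). Three consecutive Latin-1 accented/symbol bytes can happen to form one valid UTF-8 sequence, so the same bytes render as either three Latin letters or one CJK character:

```python
data = bytes([0xE4, 0xB8, 0xAD])

# Read as Latin-1, these are three separate characters:
# 'ä' (U+00E4), '¸' (U+00B8), and a soft hyphen (U+00AD).
assert data.decode('latin-1') == '\xe4\xb8\xad'

# Read as UTF-8, the same three bytes form one valid sequence: U+4E2D, 中.
assert data.decode('utf-8') == '\u4e2d'
```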
