One of the coolest parts of my job is when I don't have to do it

by Michael S. Kaplan, published on 2011/06/02 07:01 -04:00, original URI:

Maybe I should explain the title statement before it ends up having an effect on my upcoming review....

You see, the other day, Dan asked:

Subject: What are the F4 80 80 80 bytes in a UTF-8 HTML email body for?


I’m working on an issue involving UTF-8 encoding.  The customer is seeing these bits coming back from an Exchange EAS call which contains utf-8 encoded content which contains: F4 80 80 80. This content is in the HTML of an email body which contains simplified Chinese and English characters.   The customer is saying that these bits are not valid.  Are these some sort of special marker in UTF-8 encoding?

Thank you,

I saw the message on my phone, but I was at a late lunch (it was 1:34pm on a Wednesday). I figured I could answer it when I got back to my office.

But there really was no need....

Because at 2:08pm, just 34 minutes later, colleague Laurentiu responded:

Hello Dan,

<F4, 80, 80, 80> is a well-formed UTF-8 sequence that represents the Unicode character U+100000 (five zeros), which is a Unicode supplementary private-use character.  Unicode contains three private-use areas: U+E000–U+F8FF in Plane 0 (Basic Multilingual Plane), and two supplementary planes, Planes 15 and 16 (Supplementary Private Use Area-A and -B).  U+100000 is the first character of Plane 16.

Although they can contain anything the user defines, supplementary private-use areas are typically used for CJK ideographs that are not encoded in Unicode.  So it looks like somebody is using a PUA character, possibly a private CJK ideograph.
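Laurentiu's answer can be checked by hand. Here is a minimal sketch in Python, assuming nothing beyond the bytes and ranges quoted in the email; the helper names (`decode_four_byte_utf8`, `private_use_area`) are made up for illustration and are not part of any Exchange or Windows API. It unpacks the four bytes into a code point and then looks that code point up in the three private-use ranges. The supplementary ranges stop at xxFFFD because the last two code points of every plane are noncharacters.

```python
# Hypothetical helpers for illustration; only the byte values and the
# private-use ranges come from the email above.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF, "Private Use Area (Plane 0, BMP)"),
    (0xF0000, 0xFFFFD, "Supplementary Private Use Area-A (Plane 15)"),
    (0x100000, 0x10FFFD, "Supplementary Private Use Area-B (Plane 16)"),
]


def decode_four_byte_utf8(data: bytes) -> int:
    """Manually decode one four-byte UTF-8 sequence into a code point."""
    # A four-byte lead byte has the bit pattern 11110xxx.
    if len(data) != 4 or (data[0] & 0xF8) != 0xF0:
        raise ValueError("not a four-byte UTF-8 sequence")
    # Each continuation byte has the bit pattern 10xxxxxx.
    if any((b & 0xC0) != 0x80 for b in data[1:]):
        raise ValueError("bad continuation byte")
    cp = (
        ((data[0] & 0x07) << 18)
        | ((data[1] & 0x3F) << 12)
        | ((data[2] & 0x3F) << 6)
        | (data[3] & 0x3F)
    )
    # Reject overlong forms and values beyond the Unicode range.
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("overlong or out-of-range sequence")
    return cp


def private_use_area(cp: int):
    """Return the name of the private-use area containing cp, or None."""
    for lo, hi, name in PRIVATE_USE_RANGES:
        if lo <= cp <= hi:
            return name
    return None


raw = bytes([0xF4, 0x80, 0x80, 0x80])
cp = decode_four_byte_utf8(raw)

# Python's own strict decoder accepts the sequence and agrees on the value.
assert cp == ord(raw.decode("utf-8"))

print(f"U+{cp:06X}", private_use_area(cp))
# prints: U+100000 Supplementary Private Use Area-B (Plane 16)
```

The lead byte F4 contributes the bits 100, and each 80 continuation byte contributes six zero bits, which is how the sequence lands exactly on the first code point of Plane 16.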


And there it is. Knowing that there are other people who know the answers to many of the random globalization, localizability, world-readiness, and other questions that come up, and who are willing and able to respond, is a huge help as more and more people try to do the right thing in their projects, products, and support cases.

I still do answer many questions. But there are many others who help out as well.

And that is one of the coolest parts of my job -- the fact that so many others are around to help do it! :-)

There are still questions that no one else seems to know the answers to, so I still have some utility. But it's nice to know that there are others around with knowledge and interest....

Andrew West on 2 Jun 2011 7:10 AM:

Not TUTF now ;-)

Michael S. Kaplan on 2 Jun 2011 8:29 AM:

Ah, there are still those areas I have unique and indispensable knowledge in that no one else seems to want to take the time to learn. :-)

Raymond Chen - MSFT on 2 Jun 2011 12:13 PM:

I like how the customer claimed the bytes were invalid by fiat. "These bytes are not valid because I say they aren't valid."

Van on 6 Jun 2011 7:24 AM:

I wonder if by "invalid", the customer means "showing up as a little box (or four)"?
