by Michael S. Kaplan, published on 2011/02/07 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/02/07/10125541.aspx
So it all started when Michael Everson started a new blog:
This is not a porn site.
Because that's not LATIN SMALL LETTER P there. It's LATIN SMALL LETTER THORN.
Aren't IDNs awesome? :-)
Anyway, David Starner noted:
On Sun, Feb 6, 2011 at 3:14 PM, Michael Everson <everson@evertype.com> wrote:
> Pleased to announce a new blog, http://žorn.info
And yet even today, Unicode email is reliably unreliable.
Interesting!
Others identified the problem; it's one that has been discussed before.
Michael was using Apple Mail to send the email.
Apple Mail encoded the mail as being ISO-8859-1. This is reasonable since every character in the email can fit in that code page.
And David was using gmail for his email client.
And this is where the wheels came off the wagon.
Alexandros (Αλέξανδρος) explained it most fully:
The original message was correctly tagged as ISO-8859-1, but it looks
like both people responding saw it interpreted as ISO-8859-13. Judging
from the Message-IDs, both seem to be posting from Gmail, so this must be
an example of Google's encoding guessing, which has been discussed here
in the past: since many web pages and mail messages in other encodings
are mistagged as ISO-8859-1, Google uses various heuristics which are
easy to go wrong when there's only a few non-ASCII characters in the
text.
As I recall, posting in UTF-8 makes the problem go away, although it's
hard to find fault with Apple Mail for going with the most conservative
and appropriate encoding for the content (i.e. ISO-8859-1).
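To make the failure concrete, here is a minimal Python sketch of what Alexandros describes. It assumes nothing about how Gmail actually guesses encodings; it only shows that the single byte Apple Mail sent for the thorn reads differently under the two ISO-8859 code pages, and that the UTF-8 form is a different byte sequence altogether. The variable names are mine, chosen for illustration.

```python
# The domain from the announcement; it starts with U+00FE LATIN SMALL LETTER THORN.
domain = "\u00feorn.info"

# Every character fits in ISO-8859-1, so the sender's choice of charset is valid.
raw = domain.encode("iso-8859-1")
print(raw)            # b'\xfeorn.info'

# Decoding with the charset the sender declared round-trips the text.
as_declared = raw.decode("iso-8859-1")
print(as_declared)    # þorn.info

# Decoding the very same bytes as ISO-8859-13 maps 0xFE to U+017E
# (LATIN SMALL LETTER Z WITH CARON), which is what the quoted reply shows.
as_guessed = raw.decode("iso-8859-13")
print(as_guessed)     # žorn.info

# In UTF-8 the thorn becomes the two-byte sequence C3 BE instead of a lone 0xFE.
utf8 = domain.encode("utf-8")
print(utf8)           # b'\xc3\xbeorn.info'
```

With only one non-ASCII byte in the whole message, a guesser has very little evidence to go on, which fits the explanation that the heuristics go wrong easily on such text.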
I personally don't care for Google's behavior here.
There are certainly programs that get things wrong. Hell, the company I work for produced a version of FrontPage that had a serial inability to properly tag and use 8859-1 and CP1252.
A lot of what Alexandros was saying came from things said by Mark Davis of Google in the past. Mark has explained at length about all the data they work from and how much is tainted (incorrectly tagged).
But to be honest, this behavior? I find it to be disturbing.
Remember, this is a world where Google looks at the stats and detects that over 50% of the web is Unicode (as I discussed in >50% of the web is Unicode? Meh, I say. Meh.).
So maybe it is time for Google, which officially suggests people use Unicode to get correct results, and which works in a world where most modern clients produce correct results anyway, to start shifting over to being more trusting of the huge number of clients out there that aren't getting this wrong.
They are supposed to be running rings around everyone with the ability to use the whole Internet as a corpus.
So perhaps Google isn't being pwned by this problem, but this clear willingness to trust an algorithm that distrusts others (and that minimal investigation shows to be a detectable phenomenon, albeit one which Google doesn't go so far as to detect) is proving to be a bit of a þ (thorn) in their side....
L. on 7 Feb 2011 8:29 AM:
> a version of FrontPage that had a serial inability to properly tag and use 8859-1 and CP1252.
MS Office Outlook still (in Office 2010) tags CP1252 as iso-8859-1, by the way.
Yuhong Bao on 7 Feb 2011 9:18 AM:
Yea, the IETF charset mailing list archives seem to always tag as ISO-8859-1, for example.
JohnGalt on 7 Feb 2011 9:35 AM:
The problem is that Apple should have been, and should be, encoding all messages as UTF-8, period, end of conversation. In fact we should be killing off all code pages ASAP. There's no reason for them any longer other than for backwards compatibility.
And yes, Google has to guess the code page because most email clients do not report the code page properly. Yet another reason to use UTF.
Michael S. Kaplan on 7 Feb 2011 9:41 AM:
Given how easy it is to detect the Apple mail client (which gets it right) and trust its claims, perhaps Google could do a bit more work here?
In any case, Google's algorithm clearly has problems. They should either (a) abandon it and trust the claims of others, or (b) enhance it. Because it is much better to be wrong when it is someone else's fault than your own!!!
Michael Everson on 12 Feb 2011 4:07 PM:
Þanks for your interest!