Consistent garbage text can be incorrect encoding identification (or detection)

by Michael S. Kaplan, published on 2006/05/27 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/05/27/608351.aspx

Mushy asked in the Suggestion Box:

Michael,
You may have an old post that explains the following situation, so point me to it if you do. Here is the issue. If you look on my blog page you'll see some quotation marks printed as â€™ instead of ". When I first posted the material they weren't there. Now, most of the " marks are in some other code. How do you get rid of them? Is this caused by Word, .doc, .txt, .rtc formats? Which is best to post in?

P.S. I like your blog and have bookmarked it.

Thanks,
Mushy

(For those who are interested, Mushy's blog is Cross+Hairs.)

Some people may be familiar with the byte sequence; if you are, then you are a geek. :-)

But to see what is going on, let us first consider Microsoft Notepad's detection behavior around encodings:

If a UTF-16 LE BOM is there, then it's UTF-16 LE.
If a UTF-16 BE BOM is there, then it's UTF-16 BE.
If a UTF-8 BOM is there, then it's UTF-8.
If it appears to be valid UTF-8 according to the old RFC2279 definition, then it is assumed to be UTF-8.
Otherwise, it is assumed to be in the default system code page, CP_ACP.
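The order of those checks can be sketched in a few lines of Python. This is only an illustration of the list above, not Notepad's actual code: the function name `guess_encoding` and the `system_code_page` parameter are made up for the example, and Python's strict UTF-8 decoder stands in for the RFC2279-style validity check (it is somewhat stricter than that older definition).

```python
import codecs


def guess_encoding(data: bytes, system_code_page: str = "cp1252") -> str:
    """Return a Python codec name, checking in the order listed above.

    Hypothetical helper: system_code_page stands in for CP_ACP.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        # No BOM: see whether the bytes happen to be well-formed UTF-8.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the default code page.
        return system_code_page
```

With this sketch, the three bytes 0xE2 0x80 0x99 carry no BOM but are well-formed UTF-8, so the fourth check wins and the default code page is never consulted.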

So, armed with this knowledge, let's try the following:

Create a new text file in Notepad.
Add the following string to it: â€™
Save the file and close it.
Open the file again.

What you will find is that the byte sequence 0xE2 0x80 0x99, which in code page 1252 (and the original saved file) looks like:

â€™

has been interpreted by the new instance of Notepad as:

’

in UTF-8 -- because the sequence 0xE2 0x80 0x99 is what U+2019 (RIGHT SINGLE QUOTATION MARK) looks like in the underlying UTF-8 datastream.
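For anyone who wants to check the bytes without Notepad, here is a small Python sketch of the same round trip. It assumes Python's `cp1252` codec is a fair stand-in for Windows code page 1252 for these three bytes; the variable names are just for the example.

```python
import unicodedata

quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# Encode the character the way a UTF-8 file stores it.
utf8_bytes = quote.encode("utf-8")
print(" ".join(f"0x{b:02X}" for b in utf8_bytes))

# Read the very same bytes as if they were code page 1252.
as_1252 = utf8_bytes.decode("cp1252")
for ch in as_1252:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The first line printed is the byte sequence 0xE2 0x80 0x99, and the loop names the three characters those bytes turn into under code page 1252: LATIN SMALL LETTER A WITH CIRCUMFLEX, EURO SIGN and TRADE MARK SIGN.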

If a web page is showing such sequences, this is usually caused by incorrect charset meta tag info on the page, incorrect header info from the server, incorrect code page detection on the client, or some combination of those issues....
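When text has already been through the wrong decoder, the damage can often be undone by reversing the mistake: turn the characters back into bytes with the wrong code page, then decode those bytes as UTF-8. The sketch below assumes the wrong guess was code page 1252. The helper name `repair_mojibake` is hypothetical, and Python's `cp1252` codec leaves a few bytes (such as 0x9D) unassigned, so some sequences may not round-trip this way.

```python
def repair_mojibake(text: str, wrong_codec: str = "cp1252") -> str:
    """Undo a UTF-8-read-as-code-page mistake, if the text allows it.

    Hypothetical helper: wrong_codec is the code page that was
    (incorrectly) used to decode the original UTF-8 bytes.
    """
    try:
        # Recover the original bytes, then decode them correctly.
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not recoverable this way; leave the text untouched.
        return text


garbled = "don\u00e2\u20ac\u2122t"  # "don" + the three garbage characters + "t"
fixed = repair_mojibake(garbled)
print(ascii(fixed))
```

Text that was never garbled either fails the round trip or comes back unchanged, which is why the helper returns its input as-is on error rather than raising.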

If the problem occurs with other code pages, the exact representation will be different:

874 - โ€
932 - 窶
936 - 鈥
949 - ?
950 - ?
1250 - â€™
1251 - вЂ™
1252 - â€™
1253 - β€™
1254 - â€™
1255 - ג€™
1256 - â€™
1257 - ā€™
1258 - â€™
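A list like this can be regenerated by decoding the same three bytes with each code page in turn. The following is a sketch using Python's built-in codecs, which are assumed here to match the Windows code pages closely enough. Bytes that a codec does not define are replaced with U+FFFD, so the output for the East Asian code pages and for 874 may not match what Windows itself displays.

```python
RAW = b"\xe2\x80\x99"  # U+2019 as stored in UTF-8

CODE_PAGES = [874, 932, 936, 949, 950,
              1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258]


def misread(raw: bytes, code_page: int) -> str:
    """Decode raw bytes with the given code page, never raising."""
    return raw.decode(f"cp{code_page}", errors="replace")


table = {cp: misread(RAW, cp) for cp in CODE_PAGES}

for cp, text in table.items():
    # Print code points rather than glyphs so the output is console-safe.
    print(cp, "-", " ".join(f"U+{ord(ch):04X}" for ch in text))
```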

There are slight differences in some of them, but the list helps point out why strange garbage character sequences are often just a failure to properly detect UTF-8....

This post brought to you by ’ (U+2019, a.k.a. RIGHT SINGLE QUOTATION MARK)

# Sebastian Redl on 27 May 2006 9:06 AM:

I'm a geek. I recognized it.

# Michael S. Kaplan on 27 May 2006 9:53 AM:

Welcome to the club, Sebastian. We should get t-shirts. :-)

# Michael Dunn_ on 28 May 2006 12:52 PM:

While I didn't remember that â€™ was a smart quote, three random characters starting with an accented "a" is a sure sign of a misinterpreted UTF-8 character.

# Tom Gewecke on 29 May 2006 9:01 PM:

Some Windows programs (IE and Outlook that I know of) will also interpret invalid UTF-8 sequences as if they were real characters. A test of this is at

http://homepage.mac.com/thgewecke/badutf8.html

# Michael S. Kaplan on 29 May 2006 10:16 PM:

Yes Tom, I saw your message to the contact link, and it is on my list to consider for a future post....

I have many thoughts on this issue, and will be talking about it soon. So there is no need to handle it offtopic here. :-)

# cate on 30 May 2006 8:37 PM:

So... I'm still unclear. Is this something that can be fixed and if so, how?

# Michael S. Kaplan on 30 May 2006 9:54 PM:

If it is broken on the client side? Then sure, it can be fixed -- just pick a new encoding choice....

# rolfhub on 15 Jun 2006 4:05 PM:

Well, that sequence ("â€™") surely is engraved into my brain ...

I had the problem that, after installing Kubuntu Linux on my laptop, the standard shell ("Konsole") ran very slowly, so I installed "rxvt" (very small and fast alternative), but the output of every manpage came out with many copies of the sequence mentioned.

I needed quite some time to figure out that Kubuntu is completely Unicode-based, but rxvt isn't, so the manpage parser was outputting correct Unicode sequences, but rxvt didn't know that. In case anyone wants to know: just install "urxvt" (unicode-rxvt) -- that solved my problems.

Am I a proper Unicode geek now? ;-)


