by Michael S. Kaplan, published on 2006/05/27 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/05/27/608351.aspx
Mushy asked in the Suggestion Box:
Michael,
You may have an old post that explains the following situation, so point me to it if you do. Here is the issue. If you look on my blog page you'll see some quotation marks printed as ’ instead of ". When I first posted the material they weren't there. Now, most of the " marks are in some other code. How do you get rid of them? Is this caused by Word, .doc, .txt, .rtc formats? Which is best to post in?
P.S. I like your blog and have bookmarked it.
Thanks,
Mushy
(For those who are interested, Mushy's blog is Cross+Hairs.)
Some people may be familiar with the byte sequence; if you are, then you are a geek. :-)
But to see what is going on, let us first consider Microsoft Notepad's detection behavior around encodings:
So, armed with this knowledge, let's try the following:
What you will find is that the byte sequence 0xE2 0x80 0x99, which in code page 1252 (and the original saved file) looks like:
’
has been interpreted by the new instance of Notepad as:
’
in UTF-8 -- because the sequence 0xE2 0x80 0x99 is what U+2019 (RIGHT SINGLE QUOTATION MARK) looks like in the underlying UTF-8 datastream.
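For anyone who would rather see the round trip than take it on faith, here is a minimal Python sketch of the same thing. It is only an illustration (Notepad is obviously not doing this in Python), and the variable names are made up for the example.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the character the author actually typed.
quote = "\u2019"

# Encode it the way a UTF-8 file or web page stores it.
utf8_bytes = quote.encode("utf-8")
print(" ".join(f"0x{b:02X}" for b in utf8_bytes))

# Now read those same three bytes as if they were code page 1252.
misread = utf8_bytes.decode("cp1252")
print(misread)
print([f"U+{ord(c):04X}" for c in misread])

# Reading them as UTF-8 gives back the single original character.
print(utf8_bytes.decode("utf-8") == quote)
```

One character goes in, three bytes come out, and a reader that guesses code page 1252 shows one character per byte.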
If a web page is showing such sequences, this is usually caused by incorrect charset meta tag info on the page, incorrect header info from the server, incorrect code page detection on the client, or some combination of those issues....
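If the damage is already baked into stored text, it can sometimes be undone by reversing the wrong step. The sketch below assumes the text was UTF-8 that got decoded as code page 1252 exactly once; `repair_mojibake` is a hypothetical helper name, not part of any library, and the real fix is still to get the charset information right at the source.

```python
def repair_mojibake(text, wrong="cp1252", right="utf-8"):
    """Undo one round of 'UTF-8 bytes read as code page 1252'.

    Hypothetical helper: re-encode with the wrongly assumed code page to
    recover the original bytes, then decode those bytes as UTF-8.
    If either step fails, the text was probably not damaged this way,
    so it is returned unchanged.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# "It" + U+00E2 U+20AC U+2122 + "s" is the damaged form of "It" + U+2019 + "s".
damaged = "It\u00e2\u20ac\u2122s"
print(repair_mojibake(damaged))

# Text that was never damaged comes back as-is.
print(repair_mojibake("plain ASCII"))
print(repair_mojibake("caf\u00e9"))
```

The try/except is a deliberate choice: text that is legitimately in code page 1252 will usually not form valid UTF-8 when re-encoded, so it falls through untouched rather than being mangled a second time.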
If the problem occurs with other code pages, the exact representation will be different:
There are slight differences in some of them, but it helps point out why strange garbage character sequences are so often just UTF-8 that was not properly detected....
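A quick way to see how the same three bytes render under different code pages is to decode them with each one in turn. This is a sketch only: the particular set of code pages below is my own pick for illustration, and the codec names are the ones Python uses.

```python
# The UTF-8 encoding of U+2019 RIGHT SINGLE QUOTATION MARK.
data = b"\xe2\x80\x99"

# An arbitrary sample of single-byte code pages (an assumption for this example).
codecs_to_try = ("cp1252", "cp1250", "cp1251", "cp1253", "cp1254", "latin-1")

views = {}
for name in codecs_to_try:
    # errors="replace" substitutes U+FFFD for any byte the code page leaves undefined.
    views[name] = data.decode(name, errors="replace")

for name, shown in views.items():
    code_points = " ".join(f"U+{ord(c):04X}" for c in shown)
    print(f"{name:8} {shown!r:20} {code_points}")
```

Each single-byte code page turns the three bytes into three characters; which three depends on the code page, which is why the garbage looks different from one locale to the next.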
This post brought to you by ’ (U+2019, a.k.a. RIGHT SINGLE QUOTATION MARK)
referenced by
2010/04/01 Is the text in XKCD broken?
2008/04/23 That brings new meaning to having "a ç-section" (Ãç§), doesn't it?
2007/10/17 CSI: Unicode?
2007/08/11 Should old aquaintance *not* be forgot, code pages may screw up their names anyhow
2007/07/21 Avoiding an international mailto maelstrom
2006/12/23 Do not adjust your browser, a.k.a. sometimes two wrongs DO make a right, a.k.a. dumb quotes
2006/07/18 Occam's Razor, as applied to UTF-8
2006/06/14 Behind 'How to break Windows Notepad'