by Michael S. Kaplan, published on 2006/12/23 18:03 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/12/23/1354366.aspx
(The first part of the title is meant to be an allusion to The Outer Limits and the narration of the Control Voice, which is what I thought of when the inspiration for this post presented itself!)
The other day I was looking at a web site and it contained a bunch of text like the following:
as far as Iâ€™m concerned, weâ€™d give
and so on.
Now people who have been dealing with international issues for a while (and maybe people who have read this post or maybe this one) might think they know what is going on here -- we are looking at UTF-8 text being interpreted as if it were in Windows code page 1252.
Good guess. But no. In fact when I right clicked on the page to look at the encoding, I got the same results as you see in this blog:
So it wasn't UTF-8 encoded text being displayed as code page 1252. The browser already thought the page was UTF-8, which means the underlying content had to be wrong: the text had been mangled before it was ever encoded for the page.
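Here is a minimal Python sketch of how text might end up in this state. The sample string and variable names are mine, and I am only guessing at the sequence of events behind the page in question: correct UTF-8 bytes get read once as code page 1252, and the resulting characters are then saved as UTF-8 again. The page is honestly UTF-8 at that point, but what it honestly encodes is the mangled text.

```python
# Assumed sample text containing U+2019 (RIGHT SINGLE QUOTATION MARK).
original = "as far as I\u2019m concerned, we\u2019d give"

# Step 1: correct UTF-8 bytes. U+2019 becomes the three bytes E2 80 99.
utf8_bytes = original.encode("utf-8")

# Step 2: some tool wrongly reads those bytes as code page 1252,
# turning E2 80 99 into U+00E2, U+20AC, U+2122.
mangled = utf8_bytes.decode("cp1252")

# Step 3: the mangled characters are saved as UTF-8 again.
double_encoded = mangled.encode("utf-8")

# A browser that (correctly) decodes the page as UTF-8 now shows the
# mangled characters, so changing the browser encoding cannot help.
assert double_encoded.decode("utf-8") == mangled
assert mangled != original

# ascii() keeps the output printable on any console.
print(ascii(mangled))
```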
Clearly, adjusting the browser was not going to improve the experience. Which is what made me think of the Control Voice....
Now by taking the encoding investigative techniques I talked about in Behind 'How to break Windows Notepad' and this post, first we'll put the text in Notepad:
then save it, close Notepad, and open the file in Notepad again. You will see:
So clearly there is a UTF-8 problem in the heritage here (by the way the above steps will only work for you if your default system code page is 1252). The only thing that makes this problem harder is that there is no easy way to fix broken content since the only fix is to interpret the text in a way that is technically wrong in order to correct the wrong that screwed it up in the first place (proving that if carefully planned, two wrongs can indeed make a right)....
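The Notepad round trip above (save as code page 1252, reopen as UTF-8) can be sketched in Python. The function name `unmangle` and its fallback behavior are my own choices for illustration, not an existing feature of any tool: it deliberately encodes the text in the "wrong" code page and then reads the bytes back as UTF-8, and it leaves the text alone if that does not work.

```python
def unmangle(text: str, wrong_codec: str = "cp1252") -> str:
    """Undo one round of 'UTF-8 bytes misread as wrong_codec' (hypothetical helper)."""
    try:
        # The deliberate second wrong: get back the bytes that were misread,
        # then decode them the way they should have been decoded.
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit the double-encoding pattern; return it as-is.
        return text


# The mangled sample from the web page, written with escapes:
# U+00E2 U+20AC U+2122 is what U+2019 turns into.
broken = "as far as I\u00e2\u20ac\u2122m concerned, we\u00e2\u20ac\u2122d give"
fixed = unmangle(broken)
assert fixed == "as far as I\u2019m concerned, we\u2019d give"

# Text that was never mangled passes through unchanged.
assert unmangle("caf\u00e9") == "caf\u00e9"
```

One caveat with this sketch: Python's `cp1252` codec leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so mangled text that passed through one of those bytes will not survive this particular round trip and is returned unrepaired.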
What would we call that feature -- targeted code page mangling? How'd that look on a right click menu?
I guess we could also blame the problem on Microsoft Word (since the web page appeared to be a copy of an email written in Outlook via Word mail) and its conversion of ' (U+0027, a.k.a. APOSTROPHE) into ‘ (U+2018, a.k.a. LEFT SINGLE QUOTATION MARK) and ’ (U+2019, a.k.a. RIGHT SINGLE QUOTATION MARK) via that exciting "smart quotes" feature that in some cases is affected by this encoding problem that we could easily name "dumb quotes". :-)
This post brought to you by ‘ (U+2018, a.k.a. LEFT SINGLE QUOTATION MARK)
# Tom Gewecke on 24 Dec 2006 9:16 AM:
A very interesting case. I've never seen it on a web page, but I have seen it a couple of times where people tried to change the encodings of the ID3 tags on their songs to get non-Latin text to display right in iTunes, etc. It happens when someone opens UTF-8 or Big-5 text, for example, as Latin-1 and then resaves it as UTF-8. The fix is usually to copy the final UTF-8 text, save it as Latin-1, reopen it in the original encoding (UTF-8 or Big-5 in this case) and then use the text. Not a lot of fun if you don't know what they started with.
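When the starting encoding is unknown, one option is to apply the "save as Latin-1, reopen in the original encoding" recipe once per guess and look at the results. A rough Python sketch, where the function name and the default list of guesses are assumptions made for illustration:

```python
def candidate_repairs(text, misread_as="latin-1", guesses=("utf-8", "big5")):
    """Try each guessed original encoding; None marks a guess that cannot decode."""
    # "Save it as Latin-1": recover the raw bytes that were misread.
    raw = text.encode(misread_as)
    results = {}
    for codec in guesses:
        try:
            # "Reopen in the original encoding".
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results


# Assumed sample: two Chinese characters stored as Big-5, opened as Latin-1.
original = "\u4e2d\u6587"
damaged = original.encode("big5").decode("latin-1")

results = candidate_repairs(damaged)
assert results["big5"] == original
assert results["utf-8"] is None
```

A guess that decodes without error is not necessarily the right one, so a person still has to look at the candidates and pick the one that reads as real text.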
# Thorsten Glaser on 22 Mar 2007 10:04 PM:
You could use iconv(1) to fix that, by telling it to convert from latin1 into utf-8 on the already-but-broken utf-8 text. While I did a systematic conversion of some files (code & data) in MirBSD to UTF-8, I accidentally made that mistake because some, usually append-only, logs were a mix of latin1 and utf-8 data. But fixing these places was easy with an editor that can pipe blocks of the text into an external programme, such as my favourite, jupp (WordStar-like key UI, and works with Cygwin and Interix).
jupp -> http://mirbsd.de/jupp (GPL'd)
iconv -> http://www.mirbsd.org/man1/iconv.htm
PS: This input field is too small in lynx; good that it can spawn an external editor (jupp, of course).
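For files that mix latin1 and utf-8 data, such as the append-only logs mentioned above, a blanket conversion will double-encode the lines that were already UTF-8. A hedged Python sketch of a line-by-line alternative follows. This is a stand-in heuristic of mine, not the editor-plus-iconv approach described in the comment, and both function names are hypothetical.

```python
def decode_mixed_line(raw: bytes) -> str:
    """Decode one line, preferring UTF-8 and falling back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode("latin-1")


def normalize_log(data: bytes) -> bytes:
    """Re-encode every line of a mixed latin1/utf-8 log as UTF-8."""
    lines = data.split(b"\n")
    return b"\n".join(decode_mixed_line(line).encode("utf-8") for line in lines)


# Assumed sample: one Latin-1 line followed by one UTF-8 line.
mixed = "caf\u00e9".encode("latin-1") + b"\n" + "na\u00efve".encode("utf-8") + b"\n"
clean = normalize_log(mixed)
assert clean.decode("utf-8") == "caf\u00e9\nna\u00efve\n"
```

The heuristic relies on Latin-1 text with non-ASCII characters rarely being valid UTF-8 by accident. It can still guess wrong on an individual line, and it does not repair lines that have already been double-encoded.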