That brings new meaning to having "a ç-section" (Ãç§), doesn't it?

by Michael S. Kaplan, published on 2008/04/23 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/04/23/8417318.aspx

Content of Michael Kaplan's personal blog not approved by Microsoft (see disclaimer)!
Regular readers should keep in mind that all I said in The End? still applies; the allusion to the X-Files continues for people who understand such references....

The question that was asked over in the microsoft.public.dotnet.international newsgroup by Samuel was:

I upload a text file with the following French character 'ç' and the server receives the following: 'Ã§' instead

Any explanation?

Thank you,
Samuel

Regular readers might get an astounding sense of deja vu at that, due to remembering past blog posts like Should old aquaintance *not* be forgot, code pages may screw up their names anyhow or Do not adjust your browser, a.k.a. sometimes two wrongs DO make a right, a.k.a. dumb quotes or Linguistic and Unicode considerations (or Language-specific Processing #4) or What's the encoding, again? or Consistent garbage text can be incorrect encoding identification (or detection).

It is UTF-8. :-)

If one looks at Windows code page 1252 one will see the following mappings:

Ã -- 0xc3 -- U+00c3, aka LATIN CAPITAL LETTER A WITH TILDE
§ -- 0xa7 -- U+00a7, aka SECTION SIGN

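The UTF-8 encoding of ç (U+00e7) is that same pair of bytes, 0xc3 0xa7, so it makes sense how someone could mix up those two bytes and read them as code page 1252 rather than as UTF-8.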
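A minimal Python sketch of that mix-up, assuming the uploaded file really was saved as UTF-8 and that something downstream decoded it as code page 1252 (the variable names are just for illustration):

```python
# The character Samuel uploaded: U+00E7, LATIN SMALL LETTER C WITH CEDILLA.
original = "\u00e7"

# Encoding it as UTF-8 produces two bytes.
utf8_bytes = original.encode("utf-8")
print([hex(b) for b in utf8_bytes])            # ['0xc3', '0xa7']

# Reading those same two bytes as code page 1252 gives two characters.
garbled = utf8_bytes.decode("cp1252")
print([f"U+{ord(c):04X}" for c in garbled])    # ['U+00C3', 'U+00A7']
```

The two code points printed at the end are the Ã and § from the mappings above.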

In fact you can do it in Notepad! If your default system code page is 1252, type those two characters into a new text file, save it, close it, and open it again.

You will see your

Ã§

become

ç

the same way.
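The Notepad round trip can be approximated in Python. This is only a sketch: it assumes the save writes the text in code page 1252 and that the reopen treats the bytes as UTF-8, which is presumably what Notepad's detection settles on here.

```python
# What gets typed into Notepad: U+00C3 followed by U+00A7.
typed = "\u00c3\u00a7"

# Saving with the default system code page (assumed to be 1252) writes two bytes.
saved = typed.encode("cp1252")
print([hex(b) for b in saved])                  # ['0xc3', '0xa7']

# Reopening with those bytes treated as UTF-8 yields a single character.
reopened = saved.decode("utf-8")
print([f"U+{ord(c):04X}" for c in reopened])    # ['U+00E7']
```

The same encode-as-1252, decode-as-UTF-8 step is also a common way to repair text that has been garbled in the other direction, as long as no bytes were lost along the way.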

In the end, Samuel's bug report has a few possible causes:

It could be a bug in the tool he uploads with, marking the text as being in code page 1252 despite the fact that it is UTF-8;
it could be a bug in the place he is uploading to, assuming it is code page 1252 despite the fact that it is UTF-8;
it could be a bug in a browser or other application looking at the uploaded content and misreading it as code page 1252 despite the fact that it is UTF-8;
various other permutations of the above, with or without associated tagging.

so it may or may not still be a bug -- but with whom, or where, does the bug lie? To answer, more information is definitely required....
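One useful piece of that information is the raw bytes that actually ended up on the server. A rough Python sketch follows; `classify` is a hypothetical helper, and the three cases it checks are assumptions about the likely states of a single stored ç, not an exhaustive list:

```python
def classify(raw: bytes) -> str:
    """Guess what happened to a single 'ç' from the bytes that were stored."""
    utf8 = "\u00e7".encode("utf-8")                   # C3 A7
    doubled = utf8.decode("cp1252").encode("utf-8")   # C3 83 C2 A7
    ansi = "\u00e7".encode("cp1252")                  # E7

    if raw == utf8:
        # The upload is intact; whatever displays it is reading it as 1252.
        return "stored as UTF-8"
    if raw == doubled:
        # Something decoded the UTF-8 as 1252 and re-encoded it as UTF-8.
        return "double-encoded"
    if raw == ansi:
        # The file was never UTF-8 in the first place.
        return "stored as code page 1252"
    return "unknown"


print(classify(b"\xc3\xa7"))           # stored as UTF-8
print(classify(b"\xc3\x83\xc2\xa7"))   # double-encoded
print(classify(b"\xe7"))               # stored as code page 1252
```

If the stored bytes are still 0xc3 0xa7, the upload itself did no harm and the problem is in whatever reads the file afterwards; if they have grown to four bytes, some step in between did the misreading.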

This post brought to you by ç (U+00e7, aka LATIN SMALL LETTER C WITH CEDILLA)

Gwyn on 23 Apr 2008 4:52 PM:

I'm envisioning a tool, I don't know if it exists or not, but maybe if this kind of encoding problem arises, this tool could be used to identify text blocks like this. You could enter in the garbled text on one side, and on the other side it would spit out a selection of code pages/encodings that possibly match, along with what the text would look like in that encoding. I wonder if such a thing exists?

mpz on 27 Apr 2008 5:55 PM:

I think such a thing exists. A popular filesharing software that goes by the name of *zureus pops up a dialog with the filenames decoded in a few different encodings, asking you to pick the correct one (if the torrent metadata isn't new enough that it is UTF-8 always). It might be a feature of Java.
