That brings new meaning to having "a ç-section" (Ã§), doesn't it?

Regular readers should keep in mind that all I said in The End? still applies; the allusion to the X-Files continues for people who understand such references....

The question that was asked over in the newsgroup by Samuel was:

I upload a text file with the following French character 'ç' and the server receives the following: 'Ã§' instead.

Any explanation?

Regular readers might get an astounding sense of déjà vu at that, due to remembering past blog posts like "Should old aquaintance *not* be forgot, code pages may screw up their names anyhow"; "Do not adjust your browser, a.k.a. sometimes two wrongs DO make a right, a.k.a. dumb quotes"; "Linguistic and Unicode considerations (or Language-specific Processing #4)"; "What's the encoding, again?"; or "Consistent garbage text can be incorrect encoding identification (or detection)".

It is UTF-8. :-)

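If one looks at Windows code page 1252, one will see the following mappings:

0xC3 -> U+00C3 (Ã, LATIN CAPITAL LETTER A WITH TILDE)
0xA7 -> U+00A7 (§, SECTION SIGN)

Those two bytes, 0xC3 0xA7, are also the UTF-8 encoding of ç, so it makes sense how someone could mix up those two bytes and think they are UTF-8 (or the other way around).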
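As a quick check of the byte values involved, here is a minimal Python sketch of the direction Samuel describes: ç is encoded as UTF-8 and the resulting bytes are then read as code page 1252. The variable names are only illustrative, and nothing here is claimed about what Samuel's server actually runs.

```python
# Illustrative only: reproduce "UTF-8 bytes read as code page 1252".
original = "\u00e7"                       # LATIN SMALL LETTER C WITH CEDILLA
utf8_bytes = original.encode("utf-8")     # the two bytes that go over the wire
misread = utf8_bytes.decode("cp1252")     # each byte taken as one cp1252 character

# Print bytes and code points in ASCII so the output survives any console encoding.
print(" ".join(f"0x{b:02X}" for b in utf8_bytes))
print(" ".join(f"U+{ord(ch):04X}" for ch in misread))
```

One character goes in and two come out, which is the usual signature of UTF-8 being read through a single-byte code page.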

In fact, you can do it in Notepad! If your default system code page is 1252, type those two characters into a new text file, save it, close it, and open it again.

You will see your




the same way.
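The Notepad experiment can be approximated in Python. This is only a sketch under one assumption: that the editor, on reopening the file, guesses UTF-8 whenever the bytes happen to form valid UTF-8. Notepad's real detection logic is not modeled here, and `looks_like_utf8` is a hypothetical helper.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical stand-in for an editor's guess: valid UTF-8 means 'treat as UTF-8'."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

typed = "\u00c3\u00a7"              # the two characters typed into the editor
saved = typed.encode("cp1252")      # saved using the default system code page
guess = "utf-8" if looks_like_utf8(saved) else "cp1252"
reopened = saved.decode(guess)      # what the editor shows on reopening

print(guess, " ".join(f"U+{ord(ch):04X}" for ch in reopened))
```

With no byte order mark and no other text in the file, the two bytes carry nothing to say which reading was intended, so any guess is just that.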

In the end, Samuel's bug report has a few possible causes:

so it may or may not still be a bug -- but with whom, or where, does the bug lie? To answer that, more information is definitely required....


This post brought to you by ç (U+00e7, aka LATIN SMALL LETTER C WITH CEDILLA)

Gwyn on 23 Apr 2008 4:52 PM:

I'm envisioning a tool, I don't know if it exists or not, but maybe if this kind of encoding problem arises, this tool could be used to identify text blocks like this. You could enter in the garbled text on one side, and on the other side it would spit out a selection of code pages/encodings that possibly match, along with what the text would look like in that encoding. I wonder if such a thing exists?

mpz on 27 Apr 2008 5:55 PM:

I think such a thing exists. A popular filesharing software that goes by the name of *zureus pops up a dialog with the filenames decoded in a few different encodings, asking you to pick the correct one (if the torrent metadata isn't new enough that it is UTF-8 always). It might be a feature of Java.
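For what it is worth, the kind of tool Gwyn describes can be sketched in a few lines of Python. This is a rough illustration, not any existing product: the function name `candidate_readings` and the two short encoding lists are arbitrary assumptions, and a real tool would need far more candidates and some way of ranking them.

```python
def candidate_readings(garbled,
                       shown_as=("cp1252", "latin-1", "cp1250", "cp437"),
                       really=("utf-8", "cp1252", "cp1251", "shift_jis")):
    """Return (shown_as, really, text) triples: what the garbled text would be
    if it had been decoded as `shown_as` when it was really `really`."""
    results = []
    for wrong in shown_as:
        try:
            raw = garbled.encode(wrong)      # recover the bytes behind the display
        except UnicodeEncodeError:
            continue                         # text cannot have come from this code page
        for right in really:
            if right == wrong:
                continue
            try:
                text = raw.decode(right)     # reinterpret under the other encoding
            except UnicodeDecodeError:
                continue                     # bytes are not valid in that encoding
            if text != garbled:
                results.append((wrong, right, text))
    return results

# ascii() keeps the output printable on any console.
for wrong, right, text in candidate_readings("\u00c3\u00a7"):
    print(f"shown as {wrong}, really {right}: {ascii(text)}")
```

The output is a list of possibilities rather than an answer; picking the right one still takes a human, which is what the dialog mpz mentions asks for.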

