Code pages don't overlap all that much

by Michael S. Kaplan, published on 2007/07/14 23:21 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/07/14/3873217.aspx

It was over two years ago that I mentioned how code pages are really not enough. But it is still a problem that comes up all the time....

Ahmet's question is very representative of the situation:

I need help for an encoding issue while creating a text file. A customer of mine is creating a text file with some strings and integer values. He is calculating byte representation of the integer and writing it to a text file.

For example, integer value is 222 and string contains “ĞİÜ”. If the customer uses ISO-8859-9 to keep Turkish characters unchanged, binary repr. of 222 changes to question mark, instead of Ş. However, if he uses ISO-8859-1, 222 is written as Ş but the Turkish characters are changed to nearest characters, for example, Ğ changes to G, İ changes to I (however, strangely, Ü stands same).

This really shows the most fundamental problem with code pages and multilingual text, doesn't it?

There is no one code page that contains all the characters in Unicode, with the exception of:

GB18030 (which promises a 1-to-1 mapping with Unicode),
UTF-16 (which is what Microsoft calls "Unicode"), and
UTF-8/UTF-32 (the other encoding forms of Unicode)

Any time you try to use some other code page, there are tens of thousands of characters that will not be able to survive the operation....
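A minimal Python sketch of that failure mode, assuming Python's built-in codecs as stand-ins for the code pages discussed. The helper name survives is made up for illustration. Note that Python substitutes ? for anything it cannot encode, rather than applying the Windows-style best-fit mapping (Ğ to G) that Ahmet saw, so the exact damage differs while the loss is the same.

```python
def survives(text, encoding):
    """Return True if text round-trips through the encoding unchanged."""
    # errors="replace" turns every unencodable character into "?"
    data = text.encode(encoding, errors="replace")
    return data.decode(encoding) == text


sample = "ĞİÜ"
for name in ("iso-8859-9", "iso-8859-1", "gb18030", "utf-16", "utf-8", "utf-32"):
    print(name, survives(sample, name))
```

Only the ISO 8859-1 line should report a loss here, since it is the one encoding in the list that lacks Ğ and İ.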

One last point of interest -- Ahmet's observation that Ü strangely is not affected. This is not all that strange, since ISO 8859-1 and ISO 8859-9 both contain Ü, at the same position (0xDC) -- these code pages really were designed to support various languages, and that letter happens to be used by several of them (at a minimum both German and Turkish use it, though there are several others!).
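To see exactly where the two code pages part ways, here is a small Python sketch, again assuming Python's built-in iso-8859-1 and iso-8859-9 codecs match the standards. The function name differing_bytes is hypothetical. It decodes every byte value under both code pages and lists the ones that come out as different characters.

```python
def differing_bytes(a, b):
    """List the byte values that decode to different characters in a and b."""
    return [n for n in range(256)
            if bytes([n]).decode(a) != bytes([n]).decode(b)]


diff = differing_bytes("iso-8859-1", "iso-8859-9")
for n in diff:
    one = bytes([n]).decode("iso-8859-1")
    nine = bytes([n]).decode("iso-8859-9")
    print(f"{n} (0x{n:02X}): {one} in ISO 8859-1, {nine} in ISO 8859-9")

# Ü is encoded as the same single byte in both code pages
print("Ü".encode("iso-8859-1") == "Ü".encode("iso-8859-9"))
```

The two differ in only six positions, where ISO 8859-9 swaps the Icelandic letters Ð Ý Þ ð ý þ for the Turkish Ğ İ Ş ğ ı ş. Byte 222 is one of those six, which is why it is the value that misbehaves in Ahmet's file while Ü passes through untouched.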

This post brought to you by Ü (U+00DC, a.k.a. LATIN CAPITAL LETTER U WITH DIAERESIS)

Joku on 15 Jul 2007 4:28 AM:

Is there something I'm missing here, why would you "calculate byte representation of the integer and write it to a text file"?

Christoph Päper on 15 Jul 2007 6:27 AM:

Amongst ISO 8859, German is really well supported. The umlauts and eszett are not only included in every Latin variant, they even stay in the same places.

Michael S. Kaplan on 15 Jul 2007 7:13 AM:

Hey Christoph,

Subconscious efforts on the part of German standards folks? :-)

Cristian Secară on 16 Jul 2007 12:28 PM:

Codepoint 222 (DE hex) is Ş (S with cedilla below) in ISO-8859-9 (Latin 5), but Þ (Latin capital letter Thorn) in ISO-8859-1 (Latin 1).

Not that this changes the rest much :)

Cristi
