Not all code pages work right

by Michael S. Kaplan, published on 2005/01/22 10:52 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/22/358675.aspx


Code pages are out there, and they are important.

A huge amount of legacy data exists in them, and we have to convert it all to Unicode to get anything done on Windows 2000, XP, Server 2003, or Longhorn.

But sometimes they are not designed very smartly.

Take for example code page 20269. It is intended to represent ISO 6937, a code page standard that is a little out of step. Basically, the characters of ISO 6937 are single letters and combinations of a letter with a diacritic. Only the combinations that occur in a list, the "repertoire" of ISO 6937, are legal. The diacritic has to precede the letter, but is not a character in and of itself. A free-standing diacritic is encoded by following the byte that represents the diacritical mark with a space. In this way some characters are encoded with one byte and others with two. The number of encodable characters is finite: the 333 of the repertoire.

The scheme of 6937 was abandoned in favor of the ISO-8859 scheme, which uses precomposed characters.

Unfortunately for ISO 6937, Windows and Unicode do things the other way around (base character followed by combining character). In order to properly handle conversions for ISO 6937, any of the following characters would have to be swapped with its neighbor as a part of the conversion: with the base character preceding it in the Unicode input when calling WideCharToMultiByte(20269, ...), and with the letter following it in the code page input when calling MultiByteToWideChar(20269, ...).

Unicode    cp20269    Character              
U+0306     0xc6       Combining Breve
U+0307     0xc7       Combining Dot Above
U+0308     0xc8       Combining Diaeresis
U+030a     0xca       Combining Ring Above
U+030b     0xcd       Combining Double Acute
U+030c     0xcf       Combining Hacek
U+0327     0xcb       Combining Cedilla
U+0328     0xce       Combining Ogonek
U+0332     0xcc       Combining Low Line
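
For illustration, here is a minimal Python sketch of what a reordering conversion might look like. It does not call Windows at all, and it is not how code page 20269 is implemented. The function names (decode_swapped, encode_swapped) are made up for this example, the mapping is just the nine entries from the table above, every other byte is assumed to be plain ASCII, and no check against the 333-character repertoire is attempted.

```python
# Diacritic bytes from the table above, mapped to Unicode combining marks.
NONSPACING = {
    0xC6: "\u0306",  # combining breve
    0xC7: "\u0307",  # combining dot above
    0xC8: "\u0308",  # combining diaeresis
    0xCA: "\u030a",  # combining ring above
    0xCD: "\u030b",  # combining double acute
    0xCF: "\u030c",  # combining hacek
    0xCB: "\u0327",  # combining cedilla
    0xCE: "\u0328",  # combining ogonek
    0xCC: "\u0332",  # combining low line
}
COMBINING_TO_BYTE = {mark: byte for byte, mark in NONSPACING.items()}


def _to_char(b: int) -> str:
    # Assumption for this sketch: everything outside the table is ASCII.
    if b > 0x7F:
        raise ValueError(f"byte 0x{b:02X} is not handled by this sketch")
    return chr(b)


def _to_byte(ch: str) -> int:
    if ord(ch) > 0x7F:
        raise ValueError(f"U+{ord(ch):04X} is not handled by this sketch")
    return ord(ch)


def decode_swapped(data: bytes) -> str:
    """Bytes to Unicode: diacritic byte + letter becomes letter + mark."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in NONSPACING:
            if i + 1 >= len(data):
                raise ValueError("diacritic byte at end of input")
            out.append(_to_char(data[i + 1]))  # base letter goes first
            out.append(NONSPACING[b])          # combining mark goes second
            i += 2
        else:
            out.append(_to_char(b))
            i += 1
    return "".join(out)


def encode_swapped(text: str) -> bytes:
    """Unicode to bytes: letter + mark becomes diacritic byte + letter."""
    out = bytearray()
    i = 0
    while i < len(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt in COMBINING_TO_BYTE:
            out.append(COMBINING_TO_BYTE[nxt])  # diacritic byte goes first
            out.append(_to_byte(text[i]))       # base letter goes second
            i += 2
        else:
            out.append(_to_byte(text[i]))
            i += 1
    return bytes(out)
```

The decoder treats a diacritic byte and the letter after it as one two-byte unit, which is the part a simple byte-for-byte table cannot express.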

Yet the code page is there in its current form, which converts everything in place. Every entry in the table above is converted as is, and thus if you have a string such as

         åėĭöŭ      U+0061 U+030a U+0065 U+0307 U+0069 U+0306 U+006f U+0308 U+0075 U+0306

in text that is properly using ISO 6937 it should (if it is following the standard) be represented as

               0xCA 0x61 0xC7 0x65 0xC6 0x69 0xC8 0x6F 0xC6 0x75

Unfortunately, code page 20269 would convert this to Unicode as follows:

               ̊ȧĕïŏu         U+030a U+0061 U+0307 U+0065 U+0306 U+0069 U+0308 U+006f U+0306 U+0075

Oops! Slight change, huh? :-)
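
The effect can be reproduced without Windows in a few lines of Python. This is only a sketch: decode_in_place is a hypothetical stand-in for a byte-for-byte table lookup, it knows just the four diacritic bytes used in the sample, and it assumes every other byte is ASCII. Normalizing the result to NFC makes it obvious which letter each mark ends up attached to.

```python
import unicodedata

# Only the four diacritic bytes that appear in the sample string.
IN_PLACE = {
    0xC6: "\u0306",  # combining breve
    0xC7: "\u0307",  # combining dot above
    0xC8: "\u0308",  # combining diaeresis
    0xCA: "\u030a",  # combining ring above
}


def decode_in_place(data: bytes) -> str:
    """Map each byte on its own, with no reordering (assumes ASCII otherwise)."""
    return "".join(IN_PLACE.get(b, chr(b)) for b in data)


sample = bytes([0xCA, 0x61, 0xC7, 0x65, 0xC6, 0x69, 0xC8, 0x6F, 0xC6, 0x75])

# What a byte-for-byte conversion produces, then composed where possible.
naive = decode_in_place(sample)
naive_composed = unicodedata.normalize("NFC", naive)

# What the text was meant to be: base letter first, combining mark second.
intended = "a\u030ae\u0307i\u0306o\u0308u\u0306"
intended_composed = unicodedata.normalize("NFC", intended)

for label, s in (("naive", naive_composed), ("intended", intended_composed)):
    print(label, " ".join(f"U+{ord(c):04X}" for c in s))
```

In the naive result each mark has slid one letter to the right: the ring is left with no base in front of it, and the final u loses its breve.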

Now there is no other code page that is broken like this. But a few are broken in other ways, like IBM EBCDIC Arabic, which is a visual-order Arabic code page. Converting that to Unicode (which represents Arabic in logical order) properly would require a very careful algorithm. But the simple table-based code page 20420 converts everything as is once again, and thus produces something that is not terribly useful. For example the string with the word for Arabic:

               العربية               U+0627 U+0644 U+0639 U+0631 U+0628 U+064a U+0629

would be a nightmare string to convert to or from IBM EBCDIC Arabic without having some sort of intelligent processor to convert between logical and visual order.
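
To give a feel for why a table is not enough, here is a small Python sketch that works on Unicode code points only (it does not touch code page 20420). The function name visual_to_logical_naive is invented for this example, and the "2005" string is my own illustration, assuming the usual display where digits keep their left-to-right order inside right-to-left text. Reversing the string is enough for a run made up purely of Arabic letters, and falls apart as soon as anything else is mixed in.

```python
def visual_to_logical_naive(visual: str) -> str:
    """Reverse the whole string; only valid for a pure right-to-left run."""
    return visual[::-1]


# The word from the post, in logical (typing) order.
logical = "\u0627\u0644\u0639\u0631\u0628\u064a\u0629"

# Visual order stores the leftmost glyph first, i.e. the last letter typed.
visual = logical[::-1]
recovered = visual_to_logical_naive(visual)

# Add a number: on screen the digits sit to the left of the word but still
# read left to right, so the visual-order string starts with "2005 ".
visual_mixed = "2005 " + visual
broken = visual_to_logical_naive(visual_mixed)  # the digits come out as "5002"
```

Getting the mixed case right means finding the directional runs first, which is the job of the Unicode bidirectional algorithm rather than of a lookup table.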

What better reason to just use Unicode, instead? :-)

 

This post sponsored by "ى" (U+0649, ARABIC LETTER ALEF MAKSURA)


# anonymous on 22 Jan 2005 2:40 PM:

When are you going to add mouse wheel support to charmap.exe? :)

# Michael Kaplan on 22 Jan 2005 3:33 PM:

Me personally? Probably not ever, sorry. :-(

Would something like that happen? It is hard to say. It's not the most robust piece of code in the universe, a fact that does shape ideas about it....

But with that said I will mention it to the owner. :-)

# John Cowan on 14 Sep 2008 1:35 PM:

20269 should really be treated like SJIS or another 8/16-bit encoding rather than an 8-bit one.


referenced by

2008/09/14 Johab to be kidding me!

2007/08/30 The main criteria in determing whether a code page sucks? Suckage, of course!

2007/07/17 Sometimes people use code pages even when the code pages are really lame
