by Michael S. Kaplan, published on 2005/01/22 10:52 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/22/358675.aspx
Code pages are out there, and they are important.
A huge amount of legacy data exists in them, and we have to convert it all to Unicode to get anything done on Windows 2000, XP, Server 2003, or Longhorn.
But sometimes they are not designed very smartly.
Take for example code page 20269. It is intended to represent ISO 6937, a code page standard that is a little out of step. Basically, the characters of ISO 6937 are single letters and combinations of a letter with a diacritic. Only the combinations that appear in a fixed list, the "repertoire" of ISO 6937, are legal. The diacritic has to precede the letter, but is not a character in and of itself. A free-standing diacritic is encoded by putting a space after the byte that represents the diacritical mark. As a result, some characters are encoded in one byte and others in two. The number of encodable characters is finite: the 333 of the repertoire.
The scheme of 6937 was abandoned in favor of the ISO-8859 scheme, which uses precomposed characters.
Unfortunately for ISO 6937, Windows and Unicode do things the other way around (base character followed by combining character). In order to properly handle conversions for ISO 6937, each of the following characters would have to be swapped (as a part of the conversion) with the character preceding it in the Unicode string when calling WideCharToMultiByte(20269, ...) and with the character following it in the byte string when calling MultiByteToWideChar(20269, ...):
Unicode cp20269 Character
U+0306 0xc6 Combining Breve
U+0307 0xc7 Combining Dot Above
U+0308 0xc8 Combining Diaeresis
U+030a 0xca Combining Ring Above
U+030b 0xcd Combining Double Acute
U+030c 0xcf Combining Hacek
U+0327 0xcb Combining Cedilla
U+0328 0xce Combining Ogonek
U+0332 0xcc Combining Low Line
Yet the code page is there in its current form, which converts everything in place. Every entry in the table above is converted as is, and thus if you have a string such as
åėĭöŭ U+0061 U+030a U+0065 U+0307 U+0069 U+0306 U+006f U+0308 U+0075 U+0306
in text that is properly using ISO 6937 it should (if it is following the standard) be represented as
0xCA 0x61 0xC7 0x65 0xC6 0x69 0xC8 0x6F 0xC6 0x75
Unfortunately, code page 20269 would convert this to Unicode as follows:
̊ȧĕïŏu U+030a U+0061 U+0307 U+0065 U+0306 U+0069 U+0308 U+006f U+0306 U+0075
Oops! Slight change, huh? :-)
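To make the difference concrete, here is a minimal Python sketch of the two behaviors: the in-place table lookup that code page 20269 does today, and the reordering that ISO 6937 would actually need. The function names are hypothetical, the mapping covers only the nine marks in the table above, and every other byte is assumed to be plain ASCII (the real code page maps many more bytes). It is an illustration, not how Windows implements anything.

```python
# Diacritic bytes from the table above -> Unicode combining marks.
DIACRITIC_TO_COMBINING = {
    0xC6: "\u0306",  # breve
    0xC7: "\u0307",  # dot above
    0xC8: "\u0308",  # diaeresis
    0xCA: "\u030A",  # ring above
    0xCB: "\u0327",  # cedilla
    0xCC: "\u0332",  # low line
    0xCD: "\u030B",  # double acute
    0xCE: "\u0328",  # ogonek
    0xCF: "\u030C",  # hacek
}
COMBINING_TO_DIACRITIC = {v: k for k, v in DIACRITIC_TO_COMBINING.items()}


def decode_in_place(data: bytes) -> str:
    """Mimic the table-only conversion: map every byte where it stands."""
    # Assumption: any byte not in the table is treated as ASCII.
    return "".join(DIACRITIC_TO_COMBINING.get(b, chr(b)) for b in data)


def decode_reordered(data: bytes) -> str:
    """Swap each diacritic byte with the byte after it, so the mark follows its base."""
    out = []
    i = 0
    while i < len(data):
        mark = DIACRITIC_TO_COMBINING.get(data[i])
        if mark is not None and i + 1 < len(data):
            out.append(chr(data[i + 1]) + mark)  # base first, then the mark
            i += 2
        else:
            out.append(mark if mark is not None else chr(data[i]))
            i += 1
    return "".join(out)


def encode_reordered(text: str) -> bytes:
    """Move each combining mark in front of the base character that precedes it."""
    out = bytearray()
    for ch in text:
        code = COMBINING_TO_DIACRITIC.get(ch)
        if code is not None and out:
            out.insert(len(out) - 1, code)  # diacritic byte goes before the base
        else:
            # Assumption: everything else is ASCII; a leading mark is emitted as is.
            out.append(code if code is not None else ord(ch))
    return bytes(out)


iso6937 = bytes([0xCA, 0x61, 0xC7, 0x65, 0xC6, 0x69, 0xC8, 0x6F, 0xC6, 0x75])
intended = "a\u030Ae\u0307i\u0306o\u0308u\u0306"

print(decode_in_place(iso6937) == intended)   # False: marks land on the wrong letters
print(decode_reordered(iso6937) == intended)  # True
print(encode_reordered(intended) == iso6937)  # True
```

The sketch ignores the repertoire check (only listed letter and diacritic pairs are legal), so it accepts combinations that real ISO 6937 would reject.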
Now there is no other code page that is broken like this. But a few are broken in other ways, like IBM EBCDIC Arabic, which is a visual-order Arabic code page. Converting that to Unicode (which represents Arabic in logical order) properly would require a very careful algorithm. But the simple table-based code page 20420 once again converts everything as is, and thus produces something that is not terribly useful. For example, the string with the word for Arabic:
العربية U+0627 U+0644 U+0639 U+0631 U+0628 U+064a U+0629
would be a nightmare string to convert to or from IBM EBCDIC Arabic without having some sort of intelligent processor to convert between logical and visual order.
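For the simplest possible case, the ordering problem can be sketched in a few lines of Python. This assumes a single run consisting only of right-to-left letters, where visual order is just the reverse of logical order. The function name is hypothetical, and the sketch deliberately refuses anything else: digits, spaces, Latin text, and combining marks need the full Unicode Bidirectional Algorithm, and the shaping that visual-order Arabic data usually carries is not addressed at all.

```python
import unicodedata


def visual_to_logical_simple(visual: str) -> str:
    """Reverse a run made up purely of right-to-left letters.

    Hypothetical helper: handles only the trivial case and raises for
    anything that would need a real bidi implementation.
    """
    if any(unicodedata.bidirectional(ch) not in ("R", "AL") for ch in visual):
        raise ValueError("mixed-direction text needs a full bidi implementation")
    return visual[::-1]


# The word for Arabic in logical (Unicode) order, as given in the post.
logical = "\u0627\u0644\u0639\u0631\u0628\u064A\u0629"

# What a visual-order store would hold for this pure RTL run: the letters reversed.
visual = logical[::-1]

print(visual_to_logical_simple(visual) == logical)  # True
```

Even this toy version shows why a plain lookup table cannot do the job: the right output position for each character depends on its neighbors, not just on its own code.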
What better reason to just use Unicode, instead? :-)
This post sponsored by "ى" (U+0649, ARABIC LETTER ALEF MAKSURA)
# John Cowan on 14 Sep 2008 1:35 PM:
20269 should really be treated like SJIS or another 8/16-bit encoding rather than an 8-bit one.