Code pages are really not enough....

by Michael S. Kaplan, published on 2005/03/01 06:29 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/03/01/382289.aspx

Helen Custer, in Inside Windows NT, describes the situation back then in an interesting way:

The lowest layer of localization is the representation of individual characters, the code sets. The United States has traditionally employed the ASCII (American Standard Code for Information Interchange) for representing data. For European and other countries, however, ASCII is not adequate because it lacks the common symbols and punctuation. For example, the British pound sign is omitted, as are the diacritical marks used in French, German, Dutch, and Spanish.

The International Standards Organization (ISO) established a code set called Latin1 (ISO standard 8859-1), which defines codes for all of the European characters omitted by ASCII. Microsoft Windows uses a slight modification of Latin1 called the Windows ANSI code set. Windows ANSI is a single-byte coding scheme because it uses 8 bits to represent each character. The maximum number of characters that can be expressed using 8 bits is 256 (2⁸).

A script is a set of letters required to write in a particular language. The same script is often used for several languages. (For example, the Cyrillic script is used for both the Russian and Ukrainian languages.) Windows ANSI and other single-byte coding schemes can encode enough characters to express the letters in Western scripts. However, Eastern scripts such as Japanese and Chinese, which employ thousands of separate characters, cannot be encoded using a single-byte encoding scheme. These scripts are typically stored using a double-byte encoding scheme, which uses 16 bits for each character, or a multibyte encoding scheme, in which some characters are represented by an 8-bit sequence and others are represented by a 16-bit, 24-bit, or 32-bit sequence. The latter scheme requires complicated parsing algorithms to determine the storage width of a particular character. Furthermore, a proliferation of different code sets means that a particular code might yield entirely different characters on two different computers, depending on the code set each computer uses.
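
The two problems that passage ends on can be seen in a few lines. What follows is only a sketch: it uses Python's bundled codecs as stand-ins for the Windows code pages, and the helper names (`decode_under`, `char_widths`) and the particular code pages are assumptions chosen for illustration, not anything from the book.

```python
# Illustration only: Python's built-in codecs stand in for Windows code pages.
# The helper names below are hypothetical, not part of any Windows API.

def decode_under(raw: bytes, code_pages: list[str]) -> dict[str, str]:
    """Decode the same bytes under each code page and collect the results."""
    return {cp: raw.decode(cp) for cp in code_pages}


def char_widths(text: str, code_page: str) -> list[int]:
    """Return how many bytes each character occupies in the given code page."""
    return [len(ch.encode(code_page)) for ch in text]


# One byte value, three different letters (Latin, Cyrillic, Greek),
# depending on which code page the reader assumes.
same_byte = decode_under(b"\xe9", ["cp1252", "cp1251", "cp1253"])
print(same_byte)

# In a multibyte code page such as 932 (Shift-JIS), the storage width
# varies from one character to the next and must be found while parsing.
print(char_widths("A\u3042", "cp932"))
```

The bytes themselves carry no record of which code page produced them, so the reader has to get that information from somewhere else.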

I thought the way some of the technology terms were framed was interesting. Several of them definitely do not fit the terminology we use today. But what really caught my eye was the implicit idea that each of these code pages was enough for a language, and that the only real problems were the lack of good cross-code page support and the difficulty of parsing some of the more complex cases.

The truth is much further from these points than you might guess, because there are very few languages for which a code page (especially one of the 'Windows ANSI' code pages) actually has adequate coverage. I'd say that these code pages are perhaps 'good enough' for some languages but do not really contain all of the characters one might want to use to fully express information in most languages. Unicode in this context becomes more than just a luxury -- if you are missing letters you need in your language then it becomes a necessity.
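
A rough way to probe those coverage gaps is to try encoding a character into each code page and see where it fails. The sketch below assumes Python's codecs as stand-ins for the Windows code pages; the helper names (`can_encode`, `pages_missing`) and the sample characters are illustrative choices, not something taken from the thread or the book. Python's codecs are strict, whereas a Windows conversion may silently substitute a "best fit" character instead of failing, so a real application can lose the character without any error at all.

```python
# Illustration only: hypothetical helpers built on Python's bundled codecs.

def can_encode(text: str, code_page: str) -> bool:
    """True if every character in text has a mapping in the code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


def pages_missing(text: str, code_pages: list[str]) -> list[str]:
    """List the code pages that cannot represent the text."""
    return [cp for cp in code_pages if not can_encode(text, cp)]


# COPYRIGHT SIGN (U+00A9) is present in some code pages and absent in others.
print(pages_missing("\u00a9", ["cp1252", "cp1251", "cp932", "cp874"]))

# Letters that a language needs can be missing from "its" code page:
# U+0219 (s with comma below, Romanian) versus code page 1250,
# U+0175 (w with circumflex, Welsh) versus code page 1252.
print(can_encode("\u0219", "cp1250"), can_encode("\u0175", "cp1252"))

# A Unicode encoding form has no such gaps.
print(can_encode("\u00a9\u0219\u0175", "utf-16"))
```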

There was a recent thread in the microsoft.public.win32.programmer.international forum entitled "Developing ANSI application for multi-national Windows" where someone was strongly advocating not moving to Unicode because they believed their application (written in C, over 1 million lines, with over 50,000 strings, heavily relying on pragmas giving the code page and locale per source file to get their work done) was better served by keeping it all out of Unicode and relying on code page support. Of course almost immediately there were problems:

My biggest wonderment, which perhaps you can answer or even solve, is why a non-Unicode localized application (for MBCS languages) will only run properly if the *system* default locale is set to the proper language.

I run the international versions of XP and 2000, but only Unicode applications run properly unless the system default locale is set; there are no provisions that I have found that let me say, "This application uses Japanese.Japan.932." Dialog boxes, drawn text, and other problems are abundant.

These issues are obviated by Unicode, but for a project my size that is an undertaking that will take quite a while and detract from product enhancements that are necessary for the marketplace.

Though people did point to AppLocale as a workaround, the fundamental problems in trying to make a complex application work with such methods will (in my opinion) quickly outweigh the "benefits" of avoiding the move to Unicode. Because in the end, code pages are not really enough....

This post brought to you by "©" (U+00A9, a.k.a. COPYRIGHT SIGN)
One of the most common code points people complain they lose in their non-Unicode applications since it is not on all ACPs

# CN on 1 Mar 2005 5:41 AM:

One could make a lot of silly jokes about what code pages would most appropriately lack the (c) symbol, in relation to their intellectual property laws...

# Jonathan Payne on 1 Mar 2005 5:42 AM:

Why did Microsoft use the term 'Windows ANSI' when the character set is not ANSI?

Why does the Microsoft documentation sometimes use the term 'Unicode' when it is really talking about Unicode limited to 2 bytes per character (UCS-16?)?

# Michael Kaplan on 1 Mar 2005 6:05 AM:

Well, if you look at code pages like Microsoft Shift-JIS, they became an industry standard. I don't think cp1252 was any worse than that. I think Helen's text gives the most coherent plausible source for calling it ANSI.

I personally find cp1252 to be more useful than ISO 8859-1. No one needs those control characters, and it supports more languages. The only bad thing is when web pages misreport which one they are using.

As for the Unicode -- both UCS-2 (what I think you meant) and UTF-16 *are* Unicode. So why do the docs call it that? Because they are being correct.

But Windows is not limited to UCS-2 and has not been since Windows 2000 (or even NT4 in IE and Office).

# Mihai on 8 Mar 2005 10:17 AM:

And in the quoted text there is the same problem that confuses everybody:
==================
"Microsoft Windows uses a slight modification of Latin1 called the Windows ANSI code set"

"Windows ANSI is a single-byte coding scheme"

"Windows ANSI and other single-byte coding schemes can encode enough characters to express the letters in Western scripts."
==================
For Windows "ANSI Code Page = System Active Code Page (ACP)", so it can be double byte or non-Latin 1.
This is for the readers, not for Michael, he knows it too well :-)
http://blogs.msdn.com/michkap/archive/2005/02/08/369197.aspx
