by Michael S. Kaplan, published on 2005/03/01 06:29 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/03/01/382289.aspx
Helen Custer, in Inside Windows NT, describes the situation back then in an interesting way:
The lowest layer of localization is the representation of individual characters, the code sets. The United States has traditionally employed ASCII (the American Standard Code for Information Interchange) for representing data. For European and other countries, however, ASCII is not adequate because it lacks common symbols and punctuation. For example, the British pound sign is omitted, as are the diacritical marks used in French, German, Dutch, and Spanish.
The International Standards Organization (ISO) established a code set called Latin1 (ISO standard 8859-1), which defines codes for all of the European characters omitted by ASCII. Microsoft Windows uses a slight modification of Latin1 called the Windows ANSI code set. Windows ANSI is a single-byte coding scheme because it uses 8 bits to represent each character. The maximum number of characters that can be expressed using 8 bits is 256 (2⁸).
A script is a set of letters required to write in a particular language. The same script is often used for several languages. (For example, the Cyrillic script is used for both the Russian and Ukrainian languages.) Windows ANSI and other single-byte coding schemes can encode enough characters to express the letters in Western scripts. However, Eastern scripts such as Japanese and Chinese, which employ thousands of separate characters, cannot be encoded using a single-byte encoding scheme. These scripts are typically stored using a double-byte encoding scheme, which uses 16 bits for each character, or a multibyte encoding scheme, in which some characters are represented by an 8-bit sequence and others are represented by a 16-bit, 24-bit, or 32-bit sequence. The latter scheme requires complicated parsing algorithms to determine the storage width of a particular character. Furthermore, a proliferation of different code sets means that a particular code might yield entirely different characters on two different computers, depending on the code set each computer uses.
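To make that "complicated parsing algorithms" point concrete, here is a minimal sketch in C of walking a string in a double-byte code page with the Win32 IsDBCSLeadByteEx function (the choice of code page 932 and the sample bytes are purely illustrative, not anything from the book):

#include <windows.h>
#include <stdio.h>

/* Walk a string in a double-byte code page (932 is used here purely as an
   example) and report how many bytes each character occupies; this is the
   kind of per-character parsing a single-byte code page never needs. */
void WalkDbcsString(UINT codePage, const char *s)
{
    const char *p = s;

    while (*p)
    {
        int width = IsDBCSLeadByteEx(codePage, (BYTE)*p) ? 2 : 1;

        if (width == 2 && p[1] == '\0')
            break;  /* truncated lead byte at the end; stop rather than overrun */

        printf("character at offset %d is %d byte(s) wide\n",
               (int)(p - s), width);
        p += width;
    }
}

int main(void)
{
    /* "A", then the CP932 byte pair 0x82 0xA0 (Hiragana A), then "B";
       the expected widths are 1, 2, 1. */
    WalkDbcsString(932, "A\x82\xA0" "B");
    return 0;
}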
I thought it was interesting the way some of the technology terms were framed; several of them definitely do not match the terminology we use today. But what really caught my eye was the implicit idea that each of these code pages was enough for a language, and that the only real problems were the lack of good cross-code page support and the difficulty of parsing some of the more complex cases.
The truth is much further from these points than you might guess, because there are very few languages for which a code page (especially one of the 'Windows ANSI' code pages) actually has adequate coverage. I'd say that these code pages are perhaps 'good enough' for some languages but do not really contain all of the characters one might want to use to fully express information in most languages. Unicode in this context becomes more than just a luxury -- if you are missing letters you need in your language then it becomes a necessity.
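If you want to see for yourself whether a given code page can hold the text you care about, here is a minimal sketch using WideCharToMultiByte and its "used default char" output; the specific code pages tested (1252 and 932) and the assumption that 932 has no mapping for U+00A9 are just illustrations:

#include <windows.h>
#include <stdio.h>

/* Returns TRUE if every character in the string can be represented in the
   given code page without falling back to the default character. */
BOOL FitsInCodePage(UINT codePage, const WCHAR *text)
{
    BOOL usedDefault = FALSE;
    char buffer[256];  /* plenty for this illustration */

    WideCharToMultiByte(codePage, WC_NO_BEST_FIT_CHARS,
                        text, -1, buffer, sizeof(buffer),
                        NULL, &usedDefault);
    return !usedDefault;
}

int main(void)
{
    const WCHAR copyrightSign[] = { 0x00A9, 0 };  /* U+00A9 COPYRIGHT SIGN */

    printf("1252: %s\n", FitsInCodePage(1252, copyrightSign) ? "fits" : "lost");
    printf("932:  %s\n", FitsInCodePage(932,  copyrightSign) ? "fits" : "lost");
    return 0;
}

Run something like that across the characters your product actually needs and the "good enough" claim gets a lot easier to evaluate.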
There was a recent thread in the microsoft.public.win32.programmer.international forum entitled "Developing ANSI application for multi-national Windows" where someone was strongly advocating not moving to Unicode because they believed their application (written in C, over 1 million lines, with over 50,000 strings, heavily relying on pragmas giving the code page and locale per source file to get their work done) was better served by keeping it all out of Unicode and relying on code page support. Of course almost immediately there were problems:
My biggest wonderment, which perhaps you can answer or even solve, is why a non-Unicode localized application (for MBCS languages) will only run properly if the *system* default locale is set to the proper language.
I run the international versions of XP and 2000, but only Unicode applications run properly unless the system default locale is set; there are no provisions that I have found that let me say, "This application uses Japanese.Japan.932." Dialog boxes, drawn text, and other problems are abundant.
These issues are obviated by Unicode, but for a project my size that is an undertaking that will take quite a while and detract from product enhancements that are necessary for the marketplace.
Though people did point to AppLocale as a workaround, the fundamental problems in trying to make a complex application work with such methods will (in my opinion) quickly outweigh the "benefits" of avoiding the move to Unicode. Because in the end, code pages are not really enough....
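For the curious, here is a minimal sketch of the mismatch behind that newsgroup complaint, assuming (as in the thread) an application built around code page 932; the "A" versions of the Win32 APIs always convert through the system ACP that GetACP reports, not through whatever code page the application had in mind:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    UINT expected = 932;        /* the code page the application assumes   */
    UINT actual   = GetACP();   /* the code page the "A" APIs actually use */

    if (actual != expected)
    {
        printf("System ACP is %u but this application expects %u;\n"
               "dialog text and drawn strings will be converted through\n"
               "the wrong code page.\n", actual, expected);
    }
    return 0;
}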
This post brought to you by "©" (U+00a9, a.k.a. COPYRIGHT SIGN)
One of the most common code points people complain about losing in their non-Unicode applications, since it is not present in all ACPs
referenced by
2011/11/30 On not being well served by the mantra "must support Unicode"
2008/04/30 Why WC2MB needs a CP, chaver sheli!
2007/07/14 Code pages don't overlap all that much
2006/07/05 Custom code pages?
2006/01/07 Getting the characters in a code page
2005/09/10 Does size matter? And if so, how do you measure it?
2005/05/22 You may want to rethink your choice of UTF, #2 (Speed of operations)