Can a codepage be changed? How about which codepage a locale points to?

by Michael S. Kaplan, published on 2005/02/06 13:37 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/06/368081.aspx

Earlier today I explored the question Can I get my characters into Unicode?, but Ivan Petrov's question also asked what could be done about code page 1251, which is likewise missing these 20 Cyrillic characters.

Unfortunately, there is nothing that can be done with it, for several reasons.

First, as Mike Dimmick tried to point out in a comment to that post (moderated to avoid spoilers, sorry Mike!), code page 1251 has only one free slot, and there is really no way to add 20 characters to it. This of course makes it impossible on its face to update cp1251.
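One way to see how little room is left is to probe every byte value against a cp1251 decoding table. The sketch below uses the `cp1251` codec bundled with Python, on the assumption that it mirrors the Windows table; the helper name `unassigned_bytes` is made up for illustration.

```python
def unassigned_bytes(codec):
    """Return the byte values that a single-byte codec refuses to decode."""
    free = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            # The codec has no character assigned to this byte.
            free.append(value)
    return free


free_slots = unassigned_bytes("cp1251")
print([hex(b) for b in free_slots])
```

With the codec that ships with CPython this should report a single unassigned byte, which is nowhere near enough for 20 new characters.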

Second, as a matter of policy, Microsoft does not update the so-called ANSI¹ code pages. Ever. We can't. We have tried, twice:

Some time between Windows NT 4.0 and 2000 (and also between Windows 98 and Me), some but not all of the code points required for Farsi were added to cp1256 (there was not room for all of them).
In the Windows 2000 timeframe, almost all of the ANSI code pages were updated to include the Euro².

We are still dealing with the fallout of both of those changes, and have promised many interested parties both inside and outside of Microsoft that we would not make the same mistake again. Doing so affects persistence formats, application compatibility, and platform/cross-platform compatibility.

The Microsoft ANSI code pages are weird anyway. They are not an ANSI standard, and most of them are modelled after ISO-8859 code pages. The main difference is that the C1 area (0x80-0x9F, which ISO-8859 reserves for control codes that are also seen in Unicode) is used for characters. On the positive side this makes them more honest-to-goodness useful; on the negative side the data is often mistaken for the analogous ISO-8859 code page and Microsoft gets to be called evil for messing up standards.
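The C1 difference is easy to observe by decoding the same byte both ways. This is a small sketch using Python's `cp1252` and `iso-8859-1` codecs as stand-ins for the Windows and ISO tables; the byte 0x93 is just one arbitrary example from the 0x80-0x9F range.

```python
import unicodedata

raw = b"\x93"  # one byte from the C1 range (0x80-0x9F)

as_windows = raw.decode("cp1252")   # Windows "ANSI" Western European code page
as_iso = raw.decode("iso-8859-1")   # Python's ISO-8859-1 codec

# The Windows code page puts a printable character here...
print(unicodedata.name(as_windows))
# ...while the ISO-8859-1 codec yields a C1 control code (category "Cc").
print(unicodedata.category(as_iso))
```

Data written as cp1252 but labelled as ISO-8859-1 therefore turns curly quotes and similar characters into invisible control codes, which is the mix-up described above.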

The simple fact is that for many languages, 8 bits are really not enough.

Dr. International was talking about it back in August of 2000, going so far as to suggest that in many cases the GetLocaleInfo/LOCALE_IDEFAULTANSICODEPAGE of a locale was more of a "best fit mapping" which may not contain all of the characters a language needs. In poker terms, I can see Ivan's concerns about cp1251 and Bulgarian, and raise him the doctor's examples for cp1256 (inadequate for Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Kurdish, Pashto, Sindhi, Uighur, and Urdu, two of which are supported now, with some more to come later).

There are four different ways that this has traditionally been solved:

The "Microsoft" method, which I mentioned above where we added code points to fill in the unallocated spaces. We have abandoned this approach.
The "ISO" method, by which I am referring to the ISO-8859 series, where a new code page is issued as an update whenever characters have to be added. Obviously this causes interoperability problems galore since only some people pick up the updates.
The "IBM" method, by which I am referring to the original DOS "OEM" code pages and also the EBCDIC series. The solution is similar to the ISO method but much more free about issuing new code pages. As far as I know IBM is not doing this anymore, though I could be mistaken (I do know that Microsoft is not picking up new OEM/EBCDIC code pages).
The "Unicode" method, by which I am referring to getting out of using the non-Unicode code pages.

This fourth method is the one Microsoft uses now (after seeing from our own experience and that of others how bad the other three can be).

Anyone is free to limp along as best they can without Unicode (easy for some languages, not so easy for others), or they can move to Unicode and see their language supported as well as the current definitions allow (which is actually quite a long way!).

Then there is an additional question, about changing the value of the code page that is used by a locale, i.e. changing the OEMCP value returned by GetLocaleInfo/LOCALE_IDEFAULTCODEPAGE so that a default system locale setting will have updated behavior. Doing this would cause any file previously saved from the console or in the OEM code page to be corrupted, and there is no possible way that the benefit to any language can outweigh the pain of data corruption. The answer here is also definitely Unicode.
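To make the corruption concrete, here is a sketch of what happens when bytes saved under one OEM code page are later read under another. The pairing of `cp866` and `cp437` is purely an assumption for illustration (no locale actually switched between these two); any two differing OEM code pages would show the same effect.

```python
text = "Привет"

# The file is written while the locale's OEM code page is (hypothetically) cp866.
saved = text.encode("cp866")

# Later the locale's default is changed, and the same bytes are read as cp437.
reread = saved.decode("cp437")

print(text)
print(reread)           # same bytes, different characters
print(reread == text)   # False: the stored data no longer means what it did
```

The bytes on disk never change; only their interpretation does, which is why no after-the-fact fix can tell old files from new ones.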

The final related question that Ivan raised has to do with the Bulgarian MIK OEM code page, which is one we cannot add to Windows and, even if it were there, could not make the default for Bulgarian. The time has come to move to Unicode, especially if you are using a language that needs it. Bulgarian is in the same spot as Urdu and another 600+ languages for which 8 bits are insufficient.

1 - Raymond Chen discusses why this code page is misnamed as "ANSI" in his post Why is the default 8-bit codepage called "ANSI"? in May 2004. I will probably add to it one day, as there is more to tell....
2 - If you know which one it is without looking at the Windows code pages, I would be impressed, but since I have no way of knowing whether you looked I will have to stay unimpressed today.

This post sponsored by "?" (U+003F, a.k.a. QUESTION MARK)
The character that appears for almost all code pages when you try to convert from Unicode into them and the character does not exist....
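The same substitution can be reproduced outside Windows. In this sketch, Python's `errors="replace"` handler stands in for the default-character behavior described above (an assumption about equivalence, not a claim about how the Windows conversion is implemented), and U+045D is one example of a Cyrillic letter that cp1251 lacks.

```python
missing = "\u045d"  # CYRILLIC SMALL LETTER I WITH GRAVE, not present in cp1251

# A lossy conversion silently substitutes a question mark.
lossy = missing.encode("cp1251", errors="replace")
print(lossy)

# A strict conversion reports the problem instead of hiding it.
try:
    missing.encode("cp1251")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```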

# Steve loughran on 6 Feb 2005 2:00 PM:

Every time I get some more detail on how things adapted to the euro symbol, I get more annoyed about how stupid they were to invent a whole new symbol. Nobody with a US keyboard can type it; old printers can't handle it; it's just a disaster. What were our EU masters thinking?

# Brodie Thiesfield on 7 Feb 2005 2:39 AM:

Do you know if Microsoft ever entertained the idea of creating a UTF-8 codepage, providing Unicode support in the way most Un*x do. I know that there is a codepage code for conversions with MultiByteToWideChar, et al, but is this a real codepage that the system can be set to?

# Michael Kaplan on 7 Feb 2005 8:57 AM:

Thought? Perhaps... but facts trump thought here every time. It is not possible given both the current architecture (which must work in both user and kernel mode) and also the inherent assumption in several subsystems (like USER) that the ACP's maximum number of bytes per character is 2.

It is before my time but people have led me to believe it was considered for the "Unicode only" locales until it became clear that it wasn't really possible to change that much legacy....

# Ivan Petrov on 8 Feb 2005 3:14 PM:

Hi again Michael ;-).

First of all I want to tell you that I'm very satisfied with your answers!
I think I clearly understood you about the 'It's UNICODE time!' tendency, but at that moment 2 simple questions arose in me:

1) What to do with the tons of OEM-encoded text files and documents, strings in compiled command prompt (DOS) programs and utilities, etc., ESPECIALLY those encoded with OEM codepages not supported by Windows (such as the Bulgarian MIK OEM Codepage, for example), to just read them correctly using UNICODE?

2) How to type characters like 'CYRILLIC CAPITAL LETTER A WITH GRAVE' in applications like Word, for example, as they are supported in UNICODE but not supported in codepages like 1251?

Thank You in advance.

Regards,
Ivan.

# Michael Kaplan on 8 Feb 2005 3:20 PM:

Hi Ivan! Regarding your questions --

#1 -- If the code page exists then it is certainly outside the bounds of Unicode today -- so someone needs to do work to convert the data to Unicode. As it is obviously a one-time operation per data file, a permanent mapping is likely not the best solution here. But a one-time tool that mapped each byte to the appropriate Unicode code point or code points would be best (note that a code page would not work anyway, since there is no good way to map one byte to two Unicode characters with MultiByteToWideChar).

#2 -- A ligature can easily be authored in MSKLC for any such character that is needed (dead keys would not work here).
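The one-time conversion tool from #1 could look roughly like the sketch below. The table is deliberately partial: it only covers the Cyrillic block, on the assumption that MIK places А..я (U+0410..U+044F) at bytes 0x80..0xBF. Every other high byte is left unmapped and would have to be filled in from the real MIK table. The names `MIK_CYRILLIC` and `convert_bytes` are hypothetical.

```python
# Partial table (assumption): MIK bytes 0x80..0xBF hold U+0410..U+044F in order.
# Values are strings, so a single byte may map to more than one code point.
MIK_CYRILLIC = {0x80 + i: chr(0x0410 + i) for i in range(64)}


def convert_bytes(data, high_table):
    """Map each byte to its Unicode string; unknown high bytes become U+FFFD."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))  # ASCII range passes through unchanged
        else:
            out.append(high_table.get(b, "\ufffd"))
    return "".join(out)


print(convert_bytes(b"\x80\xa0", MIK_CYRILLIC))
```

Because the table values are strings rather than single characters, an entry such as `{0x80: "\u0418\u0300"}` (a made-up mapping, only to show the shape) can expand one byte into a base letter plus a combining mark, which is the case that MultiByteToWideChar cannot express.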

# Unni.Vishwanathan on 27 Aug 2005 10:57 PM:

Back in May of 2004, Quan Nguyen sent a message to Dr. International about Vietnamese collation...

# Yuhong Bao on 19 Feb 2011 12:57 AM:

"The 'IBM" method"

IBM's codepages are much more precise than MS's codepages. For example, when MS added the euro sign to codepage 1252, IBM issued a new codepage instead.

# Michael S. Kaplan on 19 Feb 2011 4:46 AM:

Um, that is what I said:

"The solution is similiar to the ISO method but much more free about issuing new code pages."

# Yuhong Bao on 28 Dec 2012 12:33 AM:

"In the Windows 2000 timeframe, almost all of the ANSI code pages were updated to include the Euro"

Actually, I think it was in the Win98 timeframe, with updates issued for older versions of Windows dating back to NT 3.51 and Win3.x.
