by Michael S. Kaplan, published on 2005/02/06 13:37 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/06/368081.aspx
Earlier today I explored the question "Can I get my characters into Unicode?", but Ivan Petrov's question also asked what could be done about code page 1251, which was also missing these 20 Cyrillic characters.
Unfortunately, there is nothing that can be done with it, for several reasons.
First, as Mike Dimmick tried to point out in a comment to that post (moderated to avoid spoilers, sorry Mike!), code page 1251 has only one free slot², and there is really no way to add 20 characters to it. This of course makes it impossible on its face to update cp1251.
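For anyone who wants to check the free-slot claim themselves, here is a minimal sketch. It assumes that Python's bundled `cp1251` codec (built from the published mapping table) is a fair stand-in for the Windows code page; the helper name `free_slots` is made up for illustration. Windows' own conversion functions may treat an unassigned byte differently than this codec does. It prints only a count, to keep the spoiler intact.

```python
def free_slots(codec_name):
    """Return the byte values that a single-byte codec refuses to decode."""
    slots = []
    for value in range(256):
        try:
            bytes([value]).decode(codec_name)
        except UnicodeDecodeError:
            # No character is assigned to this byte in the codec's table.
            slots.append(value)
    return slots


unassigned = free_slots("cp1251")
print(len(unassigned), "unassigned byte value(s) in cp1251")
```

Printing `unassigned` instead of its length reveals which slot it is.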
Second, as a matter of policy, Microsoft does not update the so-called ANSI¹ code pages. Ever. We can't. We have tried, twice:
We are still dealing with the fallout of both of those changes, and have promised many interested parties both inside and outside of Microsoft that we would not make the same mistake again. It affects persistence formats, application compatibility, and platform/cross-platform compatibility to do so.
The Microsoft ANSI code pages are weird anyway. They are not an ANSI standard, and most of them are modelled after ISO-8859 code pages. The main difference is that the C1 area (in ISO-8859 reserved for control codes that are also seen in Unicode) is used for characters. On the positive side this makes them more honest-to-goodness useful; on the negative side the data is often mistaken for the analogous ISO-8859 code page and Microsoft gets to be called evil for messing up standards.
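The C1 difference is easy to see with one pair of code pages. The sketch below uses Python's `cp1252` and `latin-1` (ISO-8859-1) codecs as the example pair; the sample bytes are my own choice, picked from the 0x80-0x9F range.

```python
# Three bytes from the C1 range (0x80-0x9F) around some plain ASCII.
raw = b"\x93ANSI\x94 \x80"

as_windows = raw.decode("cp1252")   # C1 range holds printable characters
as_iso = raw.decode("latin-1")      # C1 range holds control codes

for byte, win_char, iso_char in zip(raw, as_windows, as_iso):
    if byte >= 0x80:
        print(f"0x{byte:02X}: cp1252 gives U+{ord(win_char):04X}, "
              f"ISO-8859-1 gives U+{ord(iso_char):04X}")
```

Under `cp1252` the bytes become curly quotes and the euro sign; under ISO-8859-1 the same bytes are invisible control codes, which is what goes wrong when the data is mislabelled.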
The simple fact is that for many languages, 8 bits are really not enough.
Dr. International was talking about it back in August of 2000, going so far as to suggest that in many cases the GetLocaleInfo/LOCALE_IDEFAULTANSICODEPAGE of a locale was more of a "best fit mapping" which may not contain all of the characters a language needs. In poker terms, I can see Ivan's concerns about cp1251 and Bulgarian, and raise him the doctor's examples for cp1256 (inadequate for Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Kurdish, Pashto, Sindhi, Uighur, and Urdu, two of which are supported now, some more of which will come later).
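A quick way to find out whether a "best fit" code page really covers a given piece of text is to try encoding it one character at a time. This is only a sketch: the helper name `missing_from` is hypothetical, and the two extra letters in the sample (U+0462 and U+046A, both historic Cyrillic letters) are my own illustration, not necessarily among the 20 characters Ivan asked about.

```python
def missing_from(text, codec_name):
    """Return the distinct characters of text that codec_name cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec_name)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# "България" followed by two letters chosen for illustration.
sample = "\u0411\u044a\u043b\u0433\u0430\u0440\u0438\u044f \u0462 \u046a"

for ch in missing_from(sample, "cp1251"):
    print(f"U+{ord(ch):04X} has no slot in cp1251")
```

Running the same check against `"utf-8"` returns an empty list.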
There are four different ways that this has traditionally been solved:
This fourth method is the one Microsoft uses now (after seeing from our own experience and that of others how bad the other three can be).
Anyone is free to limp along as best they can without Unicode (easy for some languages, not so easy for others), or they can move to Unicode and see their language supported as well as the current definitions allow (which is actually quite a long way!).
Then there is an additional question, about changing the value of the code page that is used by a locale, i.e. changing the OEMCP value returned by GetLocaleInfo/LOCALE_IDEFAULTCODEPAGE so that a default system locale setting will have updated behavior. Doing this would cause any file previously saved from the console or in the OEM code page to be corrupted, and there is no possible way that the benefit to any language can outweigh the pain of data corruption. The answer here is also definitely Unicode.
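To make the corruption concrete, here is a small sketch of what happens when bytes saved under one OEM code page are later read under another. The two codecs, `cp866` and `cp855`, are just a pair of Cyrillic OEM code pages that Python ships; I am using them as stand-ins and am not claiming either one is what a particular locale uses before or after such a change.

```python
original = "\u0411\u044a\u043b\u0433\u0430\u0440\u0438\u044f"  # "България"

# What a console application would have written to disk yesterday.
stored = original.encode("cp866")

# Reading it back is only safe while the code page stays the same.
reread_same = stored.decode("cp866")
reread_changed = stored.decode("cp855", errors="replace")

print("same code page:   ", reread_same)
print("changed code page:", reread_changed)
```

The bytes on disk never changed; only their interpretation did, and nothing in the file records which code page it was written in.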
The final related question that Ivan raised has to do with the Bulgarian MIK OEM code page, which is one we cannot add to Windows, and even if it were there we could not switch Bulgarian over to use it. The time has come to move to Unicode, especially if you are using a language that needs it. Bulgarian is in the same spot as Urdu and about another 600+ languages for which 8 bits are insufficient.
1 - Raymond Chen discusses why this code page is misnamed as "ANSI" in his post Why is the default 8-bit codepage called "ANSI"? in May 2004. I will probably add to it one day, as there is more to tell....
2 - If you know which one it is without looking at the Windows code pages, I would be impressed, but since I have no way of knowing whether you looked I will have to stay unimpressed today.
This post sponsored by "?" (U+003F, a.k.a. QUESTION MARK)
The character that appears for almost all code pages when you try to convert from Unicode into them and the character does not exist....
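A short sketch of the sponsor at work, using Python's `errors="replace"` handler as an analogue of that behavior. The sample letter U+0462 is my own pick of a character that cp1251 lacks, and the conversion functions on Windows may choose a best-fit character in some cases instead.

```python
text = "\u0411\u0462"  # first letter is in cp1251, second is not

# The encodable letter survives; the other one becomes a question mark.
encoded = text.encode("cp1251", errors="replace")
round_trip = encoded.decode("cp1251")

print(encoded)
print(round_trip)
```

Once the question mark is written out, there is no way to tell which character it used to be.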
# Steve loughran on 6 Feb 2005 2:00 PM:
# Brodie Thiesfield on 7 Feb 2005 2:39 AM:
# Michael Kaplan on 7 Feb 2005 8:57 AM:
# Ivan Petrov on 8 Feb 2005 3:14 PM:
# Michael Kaplan on 8 Feb 2005 3:20 PM:
# Unni.Vishwanathan on 27 Aug 2005 10:57 PM:
# Yuhong Bao on 19 Feb 2011 12:57 AM:
"The 'IBM' method"
IBM's codepages are much more precise than MS's codepages. For example, when MS added the euro sign to codepage 1252, IBM issued a new codepage instead.
# Michael S. Kaplan on 19 Feb 2011 4:46 AM:
Um, that is what I said:
"The solution is similiar to the ISO method but much more free about issuing new code pages."
# Yuhong Bao on 28 Dec 2012 12:33 AM:
"In the Windows 2000 timeframe, almost all of the ANSI code pages were updated to include the Euro"
Actually, I think it was in the Win98 timeframe, with updates issued for older versions of Windows dating back to NT 3.51 and Win3.x.