Why ACP != OEMCP (usually)

by Michael S. Kaplan, published on 2005/02/08 11:57 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/08/369197.aspx

One thing people may notice right away when dealing with the command console is that the default ANSI code page (ACP) does not match the OEM code page (OEMCP) for most locales.
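One way to see the mismatch on a given machine is to ask the system directly. The following is a minimal sketch in Python that calls the Win32 functions GetACP and GetOEMCP through ctypes; the helper name get_code_pages is made up for illustration, and the code simply reports nothing useful when it is not running on Windows. On a US English system one would typically expect 1252 and 437, but the actual values depend on the system locale.

```python
import ctypes
import sys


def get_code_pages():
    """Return (ACP, OEMCP) on Windows, or None on other platforms."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    # GetACP and GetOEMCP take no arguments and return the code page number.
    return kernel32.GetACP(), kernel32.GetOEMCP()


pages = get_code_pages()
if pages is None:
    print("Not running on Windows; no ACP/OEMCP to query.")
else:
    acp, oemcp = pages
    print(f"ACP={acp} OEMCP={oemcp} same={acp == oemcp}")
```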

When there was DOS, the code page story was much more controlled by IBM than by Microsoft. Many "original" code pages came out of this time, from the interesting to the downright weird (yes, I am thinking of code page 437, G*!), though some of them may even predate IBM (not sure on this point).

Then, with Windows came the Windows code pages, engineered (if one can say that about data) to handle more languages at one time. They were modeled after the ISO-8859 series but plugged in characters useful to more languages, for various markets (like basic support of French in the Arabic code page):

The idea was that DOS applications (not considered "legacy" then due to their prevalence) would have the same old code pages to fall back on, and Windows applications would have their own code pages to support more languages. They even added APIs to make the base file system functions work in one mode or the other, since the file system is the one thing both kinds of applications would have to access. And AreFileApisANSI, SetFileApisToOEM, and SetFileApisToANSI were born¹.
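To make the "two modes" concrete, here is a small sketch using Python's built-in codecs rather than the Win32 APIs themselves. It assumes code page 1252 as the ANSI code page and 437 as the OEM code page (a common pairing, chosen here purely for illustration) and shows that the same character gets a different byte in each, so a byte written in one mode is misread in the other.

```python
# Code pages picked for illustration (an assumption, not queried from the OS):
# 1252 standing in for the ACP, 437 standing in for the OEMCP.
ACP = "cp1252"
OEMCP = "cp437"

name = "ì"  # U+00EC, LATIN SMALL LETTER I WITH GRAVE

ansi_bytes = name.encode(ACP)    # the byte a "Windows" application would use
oem_bytes = name.encode(OEMCP)   # the byte a "DOS" application would use

# Same character, different byte values in the two code pages.
print(ansi_bytes.hex(), oem_bytes.hex())

# Interpreting the ANSI byte under the OEM code page gives another character.
misread = ansi_bytes.decode(OEMCP)
print(repr(name), "->", repr(misread))
```

This is the kind of confusion the file API mode switch exists to avoid: both sides have to agree on which code page the bytes of a file name are in.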

Ah, you say -- but why are the ACP and the OEMCP the same for the CJK locales? We have:

And they act as both the ANSI code page and the OEM code page for most East Asian locales.

There are many reasons for this. One of the obvious reasons is that these four code pages, being originally based on specific standards (governmental or industrial), had no additional backcompat issue. And no one wanted to gratuitously start making up code pages.

One architectural reason that affects NT is the rules about code pages in kernel mode. APIs like RtlUnicodeStringToOemString and RtlUnicodeStringToAnsiString have some implicit assumptions in the architecture that the size of the string will never change if you move between the ANSI and the OEM code pages. And for the non-CJK code pages this is no problem since either the character is on the code page and it is one byte or the character is not and you will get a question mark (which is also one byte). But in the CJK code pages a character could be two bytes versus one in many cases. That would have been really bad.
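The size argument can be checked with a rough sketch. It uses Python's codecs, not the Rtl* functions, with errors="replace" standing in for the question-mark fallback. The helper name encoded_len and the choice of code pages 1252, 437 and 932 are assumptions made for the example.

```python
def encoded_len(text, codepage):
    # errors="replace" substitutes "?" for characters missing from the page,
    # mirroring the question-mark behavior described above.
    return len(text.encode(codepage, errors="replace"))


samples = ["ì", "Ω", "日"]
for ch in samples:
    sizes = {cp: encoded_len(ch, cp) for cp in ("cp1252", "cp437", "cp932")}
    print(repr(ch), sizes)
```

For the two single-byte code pages every sample comes out as one byte, whether or not the character is on the page. With code page 932 in the mix, the same character can be one byte on one side and two on the other, which is exactly the size change the kernel-mode assumption cannot tolerate.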

(These functions also assume that every Unicode character is two bytes -- which they are, for ANSI and OEM code pages. And for those who are wondering, the functions in ntdll.dll do not have the question about precomposed versus composite Unicode -- they only support the precomposed form. And Julie takes no responsibility for them, though she did fix bugs at one point in the early days!)
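The "two bytes" assumption is about UTF-16 code units, and it can be spot-checked for single-byte code pages with a short sketch. The helper names are hypothetical, and U+2000B is just one supplementary-plane character picked to show what a counterexample would look like.

```python
def utf16_bytes(ch):
    # Number of bytes the character occupies in UTF-16 (little endian, no BOM).
    return len(ch.encode("utf-16-le"))


def max_utf16_bytes(codepage):
    # Widest UTF-16 size of any character in a single-byte code page.
    widest = 0
    for b in range(256):
        # errors="ignore" skips byte values the code page leaves undefined.
        ch = bytes([b]).decode(codepage, errors="ignore")
        if ch:
            widest = max(widest, utf16_bytes(ch))
    return widest


for cp in ("cp1252", "cp437"):
    print(cp, max_utf16_bytes(cp))

# A character outside the Basic Multilingual Plane needs a surrogate pair.
print(utf16_bytes("\U0002000B"))
```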

Anyway, the EA code pages are relieved of needing two different code pages. Which is just as well since they have other things to worry about, such as functions like IsDBCSLeadByte....

Aren't you glad you are using Unicode and do not need to worry about any of this? :-)

1 - Don't get me started on a naming convention where one kind of acronym (API) is not fully capitalized and another kind (ANSI, OEM) is. Geez....

This post brought to you by "ì" (U+00ec, a.k.a. LATIN SMALL LETTER I WITH GRAVE)

Of course I did not include 874 (Thai) but that was not a "125x" series codepage....

>Aren't you glad you are using Unicode and do not need to worry about any of this? :-)

... and can instead devote your life to normalization issues!

"These function also assume that every Unicode character is two bytes -- which they are, for ANSI and OEM code pages. "

Which is why Shift_JIS-2004 can't be the ACP, BTW.
