Oh I know that I am no sage but I won't be an ANSI code page

by Michael S. Kaplan, published on 2010/11/27 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/11/27/10097342.aspx

The title of this blog will seem a little less stupid if it is hummed along with the section of the Schoolhouse Rock song I'm Just a Bill that goes "Oh I hope and pray that I will but today I am still just a bill."

Over in the Suggestion Box, yuhong2 asked:

Shift-JIS_2004 and Big5 with HKSCS can't be the ACP as some of the characters convert to UTF-16 surrogates. Are there other requirements that must be met for a codepage to be the ACP/OEMCP?

Well, the first and most important rule is that it must be one of the existing ACP or OEMCP values since none are being added!

Beyond that, the rules don't have much to do with surrogates directly, though other rules disqualify them anyway.

I suppose the rules can be enumerated:

1. The code page must already be on the list of ACP or OEMCP code pages;
2. The Unicode side must be a single UTF-16 code unit;
3. For non-DBCS code pages, the non-Unicode side can be only one byte;
4. For DBCS code pages, the non-Unicode side can be only one or two bytes;
5. The ACP and OEMCP values can never change the length of the string if the same string is round-tripped through one versus the other.

Now rule #2 does slam out all of the ones yuhong2 mentioned.
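To make rule #2 concrete, here is a minimal Python sketch, not anything Windows itself runs. It assumes Python's built-in `big5hkscs` and `shift_jis_2004` codecs are reasonable stand-ins for the encodings yuhong2 mentioned, and that `cp932` and `cp950` approximate the Windows tables. The helper names `utf16_units` and `count_rule2_violations` are made up for illustration. It walks every candidate double-byte sequence and counts those whose Unicode form does not fit in one 16-bit UTF-16 value.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit UTF-16 code units needed to store text."""
    return len(text.encode("utf-16-le")) // 2


def count_rule2_violations(codec: str) -> int:
    """Count double-byte sequences that decode to more than one UTF-16 code unit."""
    violations = 0
    for lead in range(0x81, 0xFF):
        try:
            bytes([lead]).decode(codec)
        except UnicodeDecodeError:
            pass  # cannot stand alone, so treat it as a lead byte
        else:
            continue  # decodes on its own: a single-byte character, skip it
        for trail in range(0x40, 0xFF):
            try:
                text = bytes([lead, trail]).decode(codec)
            except UnicodeDecodeError:
                continue  # unassigned or invalid sequence
            if utf16_units(text) > 1:
                violations += 1
    return violations


for codec in ("cp932", "cp950", "shift_jis_2004", "big5hkscs"):
    print(codec, count_rule2_violations(codec))
```

Under those assumptions, the two extended encodings report a nonzero count (supplementary-plane characters and base-plus-combining-mark pairs), while `cp932` reports none.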

And rules #2 and #3 and #4 eliminate UTF-8.
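A quick way to see why UTF-8 fails is to measure a few characters. The Python sketch below is only an illustration; the helper name `utf8_and_utf16_sizes` is hypothetical.

```python
def utf8_and_utf16_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 code unit count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le")) // 2


# LATIN CAPITAL A, E WITH ACUTE, EURO SIGN, GRINNING FACE
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    utf8_bytes, utf16_code_units = utf8_and_utf16_sizes(ch)
    print(f"U+{ord(ch):04X}: {utf8_bytes} UTF-8 byte(s), {utf16_code_units} UTF-16 unit(s)")
```

The euro sign already takes three bytes, which breaks the one-byte and one-or-two-byte limits of rules #3 and #4. The emoji takes four bytes and also needs a surrogate pair, which breaks rule #2.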

It is rule #5 that forces locales using one of the double-byte code pages (932, 936, 949 or 950) to use that same code page as both the ACP and the OEMCP; there are many unpleasant, hard-to-predict consequences when the two do not always produce strings of matching length....
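As a rough illustration of rule #5, the Python sketch below compares byte lengths for a matched pair (932 for both) and for a hypothetical mismatched pair (932 as ACP, 437 as OEMCP). No real locale is configured that way; the pairing is an assumption made purely for the example. Python's `errors="replace"` stands in for whatever substitution Windows would actually perform, and `encoded_length` is a made-up helper.

```python
def encoded_length(text: str, codec: str) -> int:
    """Byte length of text in codec, substituting '?' for unmappable characters."""
    return len(text.encode(codec, errors="replace"))


text = "\u65e5\u672c\u8a9e"  # three kanji

acp_len = encoded_length(text, "cp932")         # double-byte: two bytes per kanji
matched_oem_len = encoded_length(text, "cp932")
mismatched_oem_len = encoded_length(text, "cp437")  # single-byte: one '?' per kanji

print("matched pair:   ", acp_len, matched_oem_len)
print("mismatched pair:", acp_len, mismatched_oem_len)
```

With the matched pair, a buffer sized for one encoding is always the right size for the other. With the mismatched pair the lengths differ, so any code that sizes a buffer from one and fills it from the other has guessed wrong.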

Yuhong Bao on 27 Nov 2010 11:48 AM:

To be more precise, I think these (except number 1) are the requirements for ANY table-based codepage, as they are imposed by the format. Table-based codepages are the only codepages that can be ACP/OEMCP. Any other codepages have to be algorithmic and can't be the ACP/OEMCP.

Yuhong Bao on 27 Nov 2010 11:51 AM:

BTW, while the table-based codepage format (.NLS) is undocumented, it is very simple and would not be hard to figure out.

Yuhong Bao on 27 Nov 2010 12:15 PM:

For example by using GetCPInfoEx and looking at where the values are stored in the NLS file.

Michael S. Kaplan on 27 Nov 2010 12:19 PM:

Yes, but the code that uses the tables relies on rules 2, 3, 4, and 5.... you cannot build an ACP or an OEMCP based on HKSCS or JIS 2004....

Yuhong Bao on 27 Nov 2010 12:23 PM:

"Yes, but the code that uses the tables relies on rules 2, 3, 4, and 5.... "

What I mean is that (#2, #3, #4) constraints are imposed by the table format itself!

Michael S. Kaplan on 27 Nov 2010 12:28 PM:

No, the assumptions behind the rules are also implicit in the CODE that uses the data. Rules 2-4 are implicit in the format while Rule 5 is not enforced by the format at all. But the code that runs them often assumes all the rules are in force.

Yuhong Bao on 27 Nov 2010 12:32 PM:

Yep, #5 is old and dates back to the 16-bit Windows age and its AnsiToOem and OemToAnsi functions, which for example allow in-place string conversion.

Michael S. Kaplan on 27 Nov 2010 12:38 PM:

I wasn't even thinking of that, I was thinking about places where the wrong assumption is made about which controls the length of non-Unicode and Unicode strings because the particular code does not know which one will be used later. Buffer overflows were protected from happening but the underlying wrong guess code is still there in places and a locale with ACP and OEMCP that violate the rule can produce corrupt data....

Yuhong Bao on 20 Feb 2011 1:14 PM:

Even if UTF-8 support as ACP could be added to Windows, many ANSI apps would still break on characters longer than two bytes. For example, Visual Basic had the Asc and Chr functions. Shift-JIS_2004 and Big5 with HKSCS would not be as bad though.
