All code page architectures are created equal

by Michael S. Kaplan, published on 2005/07/26 09:30 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/26/443375.aspx


Yes, I said it -- all code page architectures are created equal. But in the most Orwellian sense, some are more equal than others....

First I will digress into a favorite Ogden Nash poem of mine, which is very short. I pretty much memorized it:

Let's talk about eggs:
Eggs have no legs.
Let's talk about chickens:
Chickens do have legs.
The plot thickens --
eggs come from chickens!
But they have no legs under 'em
What a conundrum!

Why this poem popped into my head may become apparent shortly. If not then it is still a nice poem (Ogden Nash at his finest!).

Anyway....

If you look at the official, sanctioned encoding architectures owned by the GIFT team, there are three of them:

(there is a fourth model for Kernel mode and the Rtl* functions that can be used in both kernel and user mode, but I will cover that another day -- for my purposes here just consider it for now like Win32 but more limited!)

If these were three entirely separate models, it all might be easier. However:

Talk about conundrums -- these three models are so interrelated, and yet their behavior differs so often, that I doubt anyone will ever be able to sort out all of the behavioral differences.

It represents complex pieces of code in three code bases written across nine versions of Windows, three versions of IE, and three versions of the BCL, using unmanaged, managed, and COM based code. It is very hard to figure out what is a bug to fix, what is a bug we are stuck with for backcompat reasons, what is an intentional feature that only looks like a bug because the behavior was not documented well enough. You can get a headache trying to figure it out sometimes (and many have!).

So what does it all mean?

Well, as Shawn Steele, the owner of the bulk of this complex set of code bases, likes to say, people ought to just be using Unicode. And Shawn is spot on here -- the more complex the code page work you do, the more likely you are to run into problems with it.

Now I do not include UTF-8 (or even UTF-32 in the .NET Framework) with the rest of those code pages, since it is a Unicode encoding form and all, but just about everything else ought to be a "use if you have to convert something, but then once it is converted stop using!" model.
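To make that "convert it, then stop using it" model a little more concrete, here is a minimal sketch in Python. It is only an illustration: it uses Python's own `cp932` and `euc_jp` codecs rather than any of the Windows or .NET models discussed here, and the helper name `to_unicode` and the sample string are made up for the example.

```python
def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode legacy bytes exactly once, at the boundary of the program.

    Hypothetical helper for illustration; it simply wraps bytes.decode().
    """
    return raw.decode(source_encoding)


# The same text as it might arrive from two different legacy sources.
sample = "日本語"
sjis_bytes = sample.encode("cp932")   # Python's codec for code page 932
euc_bytes = sample.encode("euc_jp")   # Python's codec for EUC-JP

# Convert at the edge; from here on only Unicode strings are passed around.
from_sjis = to_unicode(sjis_bytes, "cp932")
from_euc = to_unicode(euc_bytes, "euc_jp")

# The byte sequences differ, but the decoded text is identical.
assert sjis_bytes != euc_bytes
assert from_sjis == from_euc == sample

# If bytes are needed again on the way out, use a Unicode encoding form.
utf8_out = from_sjis.encode("utf-8")
```

The point of the shape is that the code page name shows up in exactly one place (the decode call at the edge), so any behavioral quirks of a particular converter are confined to that one spot.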

But please just try to use Unicode, like the operating system and the .NET Framework prefer, and were basically designed for....

 

This post brought to you by "ೡ" (U+0ce1, a.k.a. KANNADA LETTER VOCALIC LL)


# Paul Ballard on 26 Jul 2005 11:06 AM:

My personal favorite Ogden Nash Poem is...

A cow is of a bovine ilk,
One end is moo and the other is milk

You just can't get finer literature than Ogden Nash! :-)

# Ivo on 26 Jul 2005 2:53 PM:

There is another one - the CRT functions. Are they wrappers on top of Win32 or Win32 on top of CRT?
I was doing some testing with the CRT on XP set to Japanese. GetACP() returns 932 (as expected). But I found that setlocale(LC_CTYPE,"") or setlocale(LC_CTYPE,".ACP") have no effect on the CRT functions. setlocale(LC_CTYPE,".932") works fine. According to the help they should produce the same result...

In another test I found that GetCPInfo recognizes codepages 20949 and 28063, but GetCPInfoEx doesn't. Is that a bug to fix, bug for compatibility reasons, or an intentional feature? :))

# Michael S. Kaplan on 26 Jul 2005 3:13 PM:

The CRT functions are basically Win32 wrappers. I do not include them since they really add no new functionality (but they do add a little model confusion, as you noted!).

That code page being recognized issue looks like a bug to fix if you ask me.

Are there any GIFT testers reading this? :-)

# HASEGAWA Yosuke on 26 Jul 2005 10:04 PM:

Hi.
The wiconv program <http://openmya.hacker.jp/hasegawa/wiconv/wiconv-0.2.lzh> that I wrote is a very tiny Win32-based program for converting the code page of strings.
This program supports two conversion methods - Win32 NLS and MLang.

And I noticed when writing this program that EUC-JP (the most popular encoding for Japanized UNIX) is
defined as code page 51932. The MLang functions support CP51932, but the Win32 NLS functions do not. So we use the undocumented CP20932 instead of CP51932 for Win32 NLS.

So some programs using Win32 NLS can't handle EUC-JP encoding correctly.

Sorry for buggy English.

# Michael S. Kaplan on 27 Jul 2005 2:35 AM:

No worries, I think your English is fine. :-)

Yep, there are all kinds of weirdnesses in the "DLL-based" range (the 5xxxx range)....

# Michel Lemay on 5 Aug 2005 10:45 AM:

Indeed.. there are lots of strange things happening in that range !

Here is the current status:
- WinXP sp2: using MLANG, it works well with most of the codepages for which I don't have a custom conversion routine: 932, 51932, 949, 50220, 50225.

The problem is: I must support Win2000 installs:
- mlang ConvertStringToUnicode fails on some broken character within a 932 string (it does well on XP)
- Using Win32 NLS MultiByteToWideChar seems to fix the problem for my 932 issue but fails mysteriously for code pages 5xxxx (IsValidCodePage also returns false)

Possible solutions:
- Redistribute an updated version of Mlang to Win2k users (from what I've seen, this solution seems to do the job but might not be the best way to go because of the WFP feature of the OS; I could try to LoadLibrary the dll from my application bin folder and call ConvertInet... manually)
- install missing NLS code pages on target computers (not sure how to do this since the Advanced options in Regional Settings do not have the check box for 51932, and also IsValidCodePage returns false for 5022x even if they seem to be enabled in the Regional Settings)
- write custom conversion routines for all charsets I will ever use! (seems like a waste of time to me!)

What do you think of the possible solutions? What would be your horse?

Michel

# Yuhong Bao on 2 Dec 2008 11:10 PM:

"(there is a fourth model for Kernel mode and the Rtl* functions that can be used in both kernel and user mode, but I will cover that another day -- for my purposes here just consider it for now like Win32 but more limited!)"

I'd just call it the Native API model.

# Michael S. Kaplan on 3 Dec 2008 12:43 AM:

Well, that might cause more confusion for some people as "Native" has become the preferred term for the C++ team when referring to code that is not managed code (they prefer "native" to "unmanaged").

I know kernel mode devs had the term first, but there are fewer of them so they may not win that one. :-)

# Yuhong Bao on 14 Nov 2010 7:52 PM:

"It represents complex pieces of code in three code bases written across nine versions of Windows, three versions of IE, and three version of the BCL, using unmanged, managed, and COM based code. It is very hard to figure out what is a bug to fix, what is a bug we are stuck with for backcompat reaons, what is an intentional feature that only looks like a bug because the behavior was not documented well enough. You can get a headache trying to figure it out sometimes (and many have!)."

Yea, the fundamental flaw is that MLang was originally part of IE and was layered on top of the Win32 codepage model. As such, it had to run on multiple versions of Windows, accounting for changes in the Win32 codepage model underneath between various versions of Windows. Often when the Win32 codepage model changed, MLang had to be changed as well (for example, removing a workaround for a bug that has been fixed in the Win32 codepage model depending on the version of Windows). Eventually MLang became part of Windows itself, but still retains most of the cruft.

# Yuhong Bao on 14 Nov 2010 8:04 PM:

Add the fact that IE (which was what MLang was part of) was updated independently from Windows, so if a bug-fix from Windows interferes with a workaround from MLang, IE would have to be updated at the same time to fix MLang.


referenced by

2006/12/25 Anyone out there switching modes in JIS?
