'Unicode' doesn't corrupt, but 'ANSI' can corrupt, absolutely!

by Michael S. Kaplan, published on 2006/08/22 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/08/22/707665.aspx

The post that came from Chris Becke on the microsoft.public.platformsdk.gdi and the [now defunct] microsoft.public.platformsdk.localization newsgroups was:

We had a problem with DrawTextA corrupting chinese text.

Inside our Dialog's WM_PAINT handler we were getting text from the application's string table and drawing it :-

    char szBuffer[256];
    LoadString(hResrc, IDS_MESSAGE, szBuffer, sizeof(szBuffer));

No problem at all. At least on most systems we tried it on. Until our chinese translators said the string was garbage on our simplified chinese localization. Windows XP specifically. Our Windows 2000 and 98 testers hadn't reported problems...

So I got hold of a simplified chinese XP box, traced through the code, and examined the contents of szBuffer, and found that it contained the expected byte sequence for the 936 codepage. The LoadString *had* used the correct codepage to convert the string to ansi from unicode.

The DrawText however is displaying garbage characters. Not blocks, or ??'s, but random looking 'chinese'ish characters (to my untrained western eyes)

On a hunch, I took the 936 byte sequence, and did a MultiByteToWideChar (950,... conversion. 950 being the "traditional chinese" codepage. And it produced in the debug window the exact text the DrawText was producing.

So, LoadString is using the system code page. DrawText was using codepage 950 to render the string! Why would it do this? GetACP, GetThreadLocale etc *all* indicated a simplified, not traditional, chinese localized system. The ansi APIs in a system should all be using the same ansi code-page right?

Wait. The problem was traced to our font creation. When creating the font we had inadvertently specified the CHINESEBIG5_CHARSET, rather than the GB2312_CHARSET character set.

However - the parameters to CreateFont are meant to be hints to the font mapper: I can understand the incorrect font definition causing a font without symbol support being loaded. i.e. I'd expect the undisplayable character glyph to be used, perhaps even ??'s.

I wouldn't expect DrawTextA to use a different code page. The ability of DrawTextA to use codepages other than the system ansi code page certainly isn't mentioned *anywhere* in MSDN or in the MS KB.

No question there obviously, since he figured out what the problem was. Though it does underscore that he did not see my What code page does MSLU convert with? post. Which was technically posted on an MSDN blog. :-)

As I mentioned there:

Some [functions] use the ACP based on a particular device context handle (HDC), which has a Charset associated with it;
(e.g. most of the GDI functions that take an HDC parameter, like GetTextExtentExPoint)
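
To see why Chris got plausible-looking characters rather than blocks or ??'s, it helps to look at the byte structure of the two code pages. The following is a minimal sketch, not anything GDI itself runs: the function names are made up for illustration, the byte ranges are the commonly documented lead and trail byte ranges for code pages 936 and 950, and single bytes 0x80 and 0xFF are simply rejected to keep things short. It only checks that a buffer parses as well-formed double-byte pairs; whether a given pair maps to an assigned character is a separate question.

```c
#include <stdbool.h>
#include <stddef.h>

/* Both code pages use 0x81..0xFE as lead bytes. */
static bool is_lead(unsigned char b) { return b >= 0x81 && b <= 0xFE; }

/* Code page 936 (GBK) trail bytes: 0x40..0xFE except 0x7F. */
static bool is_trail_936(unsigned char b)
{
    return b >= 0x40 && b <= 0xFE && b != 0x7F;
}

/* Code page 950 (Big5) trail bytes: 0x40..0x7E or 0xA1..0xFE. */
static bool is_trail_950(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
}

/* True if buf is ASCII plus well-formed lead/trail pairs under is_trail. */
static bool dbcs_well_formed(const unsigned char *buf, size_t len,
                             bool (*is_trail)(unsigned char))
{
    size_t i = 0;
    while (i < len) {
        if (buf[i] < 0x80) {          /* single-byte ASCII */
            i++;
            continue;
        }
        if (!is_lead(buf[i]) || i + 1 >= len || !is_trail(buf[i + 1]))
            return false;             /* stray or truncated byte */
        i += 2;
    }
    return true;
}

/* Hypothetical convenience wrappers, one per code page. */
bool well_formed_936(const unsigned char *buf, size_t len)
{
    return dbcs_well_formed(buf, len, is_trail_936);
}

bool well_formed_950(const unsigned char *buf, size_t len)
{
    return dbcs_well_formed(buf, len, is_trail_950);
}
```

Every GB2312 character has its lead byte in 0xA1..0xF7 and its trail byte in 0xA1..0xFE, so it also parses as a lead/trail pair under 950. That is consistent with what Chris saw: each pair of bytes still turns into one character, just not the intended one.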

That post alludes to another interesting story. I remember a long time ago when I was talking about the non-Unicode nature of Win9x on The Unicode List and Microsoftie Chris Wendt (who, like many oldtimers, used to be on the Win9x team) sent me email to point out that I was wrong -- large parts of Win9x are actually 100% Unicode supporting.

And technically, Chris is right -- GDI (the piece in particular that is relevant here) is almost entirely Unicode internally, for example. It is even why there are several functions that support Unicode on Win9x within GDI.

I mean they were there anyway, so they figured why not expose them?

Imagine how silly it would be to provide a stub externally, while internally including the actual function?

Though there were times that was done in other places, so I guess it was not too silly. :-)

The problem is that most users and most developers never get any real benefit from the pieces of Win9x that support Unicode, since they were exposed so seldom. That is why I think most people would tend to think of the point Chris was making as being like the old joke about MS employees:

A pilot is flying a single-engine charter plane to Seattle, with a couple of VIPs aboard, when he runs into thick fog. Visibility is less than a mile and the instrument panel starts to malfunction. He is in big trouble.

Unable to find the airport, he circles around looking for landmarks. After an hour he is low on fuel and his passengers are getting nervous. Then, through an opening in the fog, he spots a tall building with one guy working alone on the fifth floor. The pilot banks and shouts through the window:
"Hey, buddy, where am I?"
The lone worker replies:
"You're in an airplane."

The pilot banks instantly into a 275 degree turn and makes a perfect blind landing on the airport runway five miles away. Just as the 'plane stops, the engine coughs and dies from lack of fuel.

The stunned passengers ask the pilot how he did it.

"Easy," he says. "I just asked the guy in that building a simple question. The answer he gave me was 100 percent accurate but absolutely useless. So I knew it must be the Microsoft Support Office - which is five miles from the runway on heading 087!"

Whether or not the moral of the joke is true, the simple fact is that the non-Unicode functions had to convert to Unicode in order to work with the TrueType functions, which were basically Unicode. They could have converted with CP_ACP, but why not use the extra information they were being given if someone went to the effort to provide a font charset in the device context that suggested another language? It kind of makes sense, in its own way.

Of course the number of times that people ever make use of this behavior is rather limited, which is probably why it does not come up much....

Though it is true that a more serious form of documentation than my blog would perhaps be in order, at this point (with Win9x no longer supported and all) I can see why the effort to update dozens of topics to describe the behavior that very few people ever actually run across may not be the biggest priority on anyone's list. :-)


This post brought to you by (U+a1df, a.k.a. YI SYLLABLE GIEX)

# Chris Becke on 22 Aug 2006 5:57 AM:

"Though it does underscore that he did not see my What code page does MSLU convert with? post. Which was technically posted on an MSDN blog. :-)"

You know, depressingly enough, I actually read that post.  Not too depressing though, I think perhaps it was the vague memory of that post that made me suspect that perhaps GDI was using a different code page which helped me resolve the problem that much faster.

One question remains in my mind: perhaps I can "fix" the code to work when the character set is misconfigured. Is there any way, short of hardcoding a table, to determine the code page GDI will use for any given character set value?

# Michael S. Kaplan on 22 Aug 2006 7:29 AM:

You can call code to convert a font charset to a code page -- something like this:

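UINT CpgFromHdc(HDC hdc)
{
   CHARSETINFO csi;
   int chs = GetTextCharset(hdc);

   /* TCI_SRCCHARSET: pass the charset value itself, cast to a pointer */
   if(TranslateCharsetInfo((DWORD *)(DWORD_PTR)chs, &csi, TCI_SRCCHARSET))
      return csi.ciACP;
   return GetACP();   /* fallback to the system ACP is an assumption */
}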

As a way to simply know what GDI might do, on a per charset basis....
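
For reference, here is roughly what the hardcoded table Chris wanted to avoid would look like. This is a portable sketch rather than anything GDI ships: the struct and function names are hypothetical, the numeric charset values are the ones defined in wingdi.h, and the pairings are the usual charset-to-code-page associations. On a real system, asking TranslateCharsetInfo is the safer choice, since it reflects what the OS actually knows.

```c
#include <stddef.h>

/* One row of the (hypothetical) lookup table. */
typedef struct {
    unsigned charset;    /* font charset value, as in wingdi.h */
    unsigned code_page;  /* Windows code page usually paired with it */
} CharsetCodePage;

static const CharsetCodePage k_charset_map[] = {
    {   0, 1252 },  /* ANSI_CHARSET        */
    { 128,  932 },  /* SHIFTJIS_CHARSET    */
    { 129,  949 },  /* HANGUL_CHARSET      */
    { 130, 1361 },  /* JOHAB_CHARSET       */
    { 134,  936 },  /* GB2312_CHARSET      */
    { 136,  950 },  /* CHINESEBIG5_CHARSET */
    { 161, 1253 },  /* GREEK_CHARSET       */
    { 162, 1254 },  /* TURKISH_CHARSET     */
    { 163, 1258 },  /* VIETNAMESE_CHARSET  */
    { 177, 1255 },  /* HEBREW_CHARSET      */
    { 178, 1256 },  /* ARABIC_CHARSET      */
    { 186, 1257 },  /* BALTIC_CHARSET      */
    { 204, 1251 },  /* RUSSIAN_CHARSET     */
    { 222,  874 },  /* THAI_CHARSET        */
    { 238, 1250 },  /* EASTEUROPE_CHARSET  */
};

/*
 * Returns the code page paired with charset, or fallback when the charset
 * has no fixed code page in this table (DEFAULT_CHARSET, SYMBOL_CHARSET,
 * OEM_CHARSET, or anything unknown). Callers would typically pass the
 * system ANSI code page as fallback.
 */
unsigned code_page_for_charset(unsigned charset, unsigned fallback)
{
    size_t i;
    for (i = 0; i < sizeof k_charset_map / sizeof k_charset_map[0]; i++) {
        if (k_charset_map[i].charset == charset)
            return k_charset_map[i].code_page;
    }
    return fallback;
}
```

With a table like this, the bug in the original post is easy to see: a font created with CHINESEBIG5_CHARSET (136) points at code page 950, even though the bytes in the buffer were produced with 936.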
