WideCharToMultiByte vs. DrawTextW? In tennis terms, 15-Love!

by Michael S. Kaplan, published on 2012/04/05 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/04/05/10290985.aspx

Some long-time regular readers may recall when I blogged "Cyrillic looks so spacy when viewed from some parts of East Asia...." almost five years ago.

Here is the art I showed at the time:

See what happens with the "spacy" Cyrillic text? :-)

At the time, SDiZ commented:

I believe this is, indeed, a "feature".

In big5 encoding (at least in the big5 ETen-extension), Cyrillic scripts are full-width. To compatible with old system, Chinese fonts always comes with full width Cyrillic characters.

Though it can't really be that simple. In both the case above and the similar case we were looking at today (a Japanese default system locale with a Russian user UI language, where line breaking was being done wrong), all of the text is 100% Unicode.

If you look at Microsoft's code pages:

Japanese Windows Code Page 932 has some Cyrillic in Lead Byte 0x84, and
Korean Windows Code Page 949 has some Cyrillic in Lead Byte 0xAC, and
Simplified Chinese Windows Code Page 936 has some Cyrillic in Lead Byte 0xA7.

But Traditional Chinese Windows Code Page 950 has no Cyrillic in it whatsoever!

So, it is not a mere code page issue. Or at least not for the reasons one might usually expect.

The various non-full-width CJK fonts don't give the Cyrillic characters full width visually.

Though that is just common sense -- if they did, then instead of "spacy" Cyrillic, you'd just get "fat" Cyrillic!

So that isn't it either.

And it isn't some kind of Uniscribe issue (which I'll be honest was my first guess!), either.

I found this out yesterday afternoon, when one of the Uniscribe developers took a look at the bug, and while debugging through it found the actual problem without seeing the root cause, which was actually right there:

Ok, debugged this. (Michael, you may want to take a look if this is expected in regards to WideCharToMultiByte behavior)

The reason for this is indeed the use of the Russian UI on a Japanese system. The DrawText api in a word wrapping mode pays attention to the user’s codepage (it does DWORD dwCodePage = USERGETCODEPAGE(hdc)). The codepage in this case is 0x3a4. When analyzing text for wraps, it checks the current character if it is a Full Width character in the current codepage using

cChars = WideCharToMultiByte((UINT)dwCodePage,0,&wChar,1,NULL,0,NULL,NULL);

This API with these parameters reports that all Russian characters are Full Width (why?), i.e. it returns cChars=2. Then the following code treats this Russian character as a full width, and believes it can break anywhere after it:

                /*
                 * Otherwise, we just return the character that is next of FullWidth
                 * Character. Because we treat A FullWidth character as A Word.
                 */

So it never uses the Uniscribe for word breaking for Russian text in this configuration.

Aha!

Do you see the bug?

Admittedly, I didn't see it right away, myself.

I saw it about 15 minutes after I read and responded to the mail.

It isn't in WideCharToMultiByte at all!

Did you spot it?

The bug is in using WideCharToMultiByte to detect "wide" characters by converting them using a code page.

Because every CJK code page has some characters that are not double width but are situated in one of the double-byte ranges of the code page (other characters that can hit this problem include random symbols and such).
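To make the failure concrete, here is a minimal sketch of the byte-count test, modeled in portable C. It is not the actual DrawText source. The table and the names `cp932_sample`, `sample_byte_count` and `heuristic_is_fullwidth` are hypothetical and exist only for illustration. The byte counts are hard-coded for a handful of code page 932 characters rather than obtained from WideCharToMultiByte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sample only: how many bytes each character occupies when
 * encoded in code page 932. Real code would get this number from
 * WideCharToMultiByte; here it is hard-coded so the sketch runs anywhere. */
struct cp932_sample {
    uint16_t code_point;
    int      byte_count;
};

static const struct cp932_sample kSamples[] = {
    { 0x0041, 1 },  /* LATIN CAPITAL LETTER A        */
    { 0xFF71, 1 },  /* HALFWIDTH KATAKANA LETTER A   */
    { 0x3042, 2 },  /* HIRAGANA LETTER A             */
    { 0x0410, 2 },  /* CYRILLIC CAPITAL LETTER A     */
    { 0x0391, 2 },  /* GREEK CAPITAL LETTER ALPHA    */
    { 0x00B1, 2 },  /* PLUS-MINUS SIGN               */
};

/* Returns the sample byte count, or 0 if the character is not in the table. */
static int sample_byte_count(uint16_t code_point)
{
    for (size_t i = 0; i < sizeof kSamples / sizeof kSamples[0]; ++i) {
        if (kSamples[i].code_point == code_point) {
            return kSamples[i].byte_count;
        }
    }
    return 0;
}

/* The flawed test: "takes two bytes in the code page" is treated as
 * "is a full width character". */
static bool heuristic_is_fullwidth(uint16_t code_point)
{
    return sample_byte_count(code_point) == 2;
}
```

With this table, `heuristic_is_fullwidth(0x0410)` is true, so the Cyrillic letter lands in the same bucket as the hiragana letter, and so do the Greek letter and the plus-minus sign. The byte count only says where the code page put the character, not how wide it is.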

Clearly to fix this in any version would require a different, more reliable test for full width characters!

For example, a simple call to GetStringTypeW(CT_CTYPE3, ...) checking for the C3_FULLWIDTH character type flag -- the non locale specific code that should have been there all along....
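A check along those lines might look like the sketch below. It is an assumption about how such a test could be written, not the fix that was (or would be) applied to DrawText, and the helper names `ctype3_is_fullwidth` and `is_fullwidth_char` are hypothetical. The flag test is kept separate from the GetStringTypeW call so that it also compiles off Windows, where the two C3_ values are defined by hand to match winnls.h.

```c
#include <stdbool.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Off Windows, mirror the winnls.h values so the flag test still compiles. */
#define C3_HALFWIDTH 0x0040
#define C3_FULLWIDTH 0x0080
#endif

/* Pure flag test on a CT_CTYPE3 result word. */
static bool ctype3_is_fullwidth(uint16_t ctype3)
{
    return (ctype3 & C3_FULLWIDTH) != 0;
}

#ifdef _WIN32
/* Asks NLS for the CT_CTYPE3 flags of one UTF-16 code unit. No code page
 * or locale is involved in the call. */
static bool is_fullwidth_char(WCHAR wch)
{
    WORD ctype3 = 0;
    if (!GetStringTypeW(CT_CTYPE3, &wch, 1, &ctype3)) {
        return false;  /* on failure, conservatively report "not full width" */
    }
    return ctype3_is_fullwidth(ctype3);
}
#endif
```

A caller in the word-wrapping path would then ask `is_fullwidth_char(wChar)` instead of comparing a converted byte count to 2, so the answer no longer depends on the code page in effect.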

Of course, you should never call the "Ex" version, as I point out in To Ex or not to Ex? THAT is the question.!

At this point, the decision to apply that fix or not will have to depend on how worthwhile the scenario is considered.

The fact that 75% of the CJK code pages have Cyrillic in them and all four have code points that could run afoul of the bug may have some impact, though the fact that the bug has apparently been around for over a decade will have impact too.

Backcompat and people who might even be depending on the behavior could be an issue, too.

What would you do in the current not-yet-shipping version?

What would you do in prior versions?

MGetz on 6 Apr 2012 7:24 AM:

a) The backcompat solution sounds like it will break in CP950

b) The question as I see it is why is a unicode function paying attention to an ANSI Code Page?

c) This might be a case where it would be useful to post the question out to various East Asian programming sites and see what comes back, it's highly likely that this behavior is not wanted.

d) What happens in Unicode only Locales? Does the bug cause rendering issues there?

If A and D are true then I think a breaking change is justified, if not.... further investigation would probably be needed. In any case I would consider doing C.

Joshua on 9 Apr 2012 1:26 PM:

Screw it. Your recent attitude about codepages will soon result in massive breakage for any system not on 1252 anyway.

Random832 on 10 Apr 2012 1:03 PM:

"Because we treat A FullWidth character as A Word." was a bad decision to begin with, at most they should apply it to ideographs (maybe GetStringTypeW wasn't available back then).

Also, what width does GetStringTypeW return for ambiguous width characters? Is it not locale-dependent?

Michael S. Kaplan on 10 Apr 2012 11:10 PM:

GetStringTypeW was in the first build of NT that had DrawTextW. And there are no ambiguous characters from the view of GetStringTypeW as long as they are defined...

Random832 on 11 Apr 2012 7:18 AM:

So then what width class does it return for the Cyrillic characters, given they are full width in several codepages (and in the corresponding "monospaced" fonts)? How were the width classes of GetStringTypeW envisioned as being used? The only use I can think of is counting them up to determine how many character cells a string will take up in an east-asian-locale console window.

Michael S. Kaplan on 11 Apr 2012 8:07 AM:

It uses the Unicode character properties, of course!


referenced by

2013/02/13 The time has come to fix a bug that has been in Windows more than twice as long as I have...
