The problem with characters stuffing the ballot box

by Michael S. Kaplan, published on 2007/08/16 17:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/16/4421328.aspx


Standards are great, but sometimes deciding to follow them after the fact can be complicated.

This post talks about such a situation. 

The bug report that came in was reasonably straightforward (I'll provide the translated version to avoid that extra level of indirection):

In ANSI(MBCS) MFC application, if you write DBCS text to print preview pane using CDC::TextOut, there appears extra space between double-byte characters written in Meiryo font.

Not repro with MS Gothic or MS Mincho families(their internal leading values are zero.)

[Note]
- Same behavior on WinXP. Not a regression.
- Not repro on screen device context (the problem occurs on printer device context only).
- Not repro if you build the application with _UNICODE compile flag (the problem occurs in ANSI MFC application only).
- Not repro if you use CDC::ExtTextOut(without justification weight specified) instead of CDC::TextOut. 
- The problem constantly happens regardless of default printer settings
- The extra space does not appear on the actual printout

Some art was even included so people would be able to understand what was being discussed, with the Vista Meiryo font being used:

and with MS Gothic being used:

You can see the difference with the following test code that repros the metrics difference:

LONG GetAvgTextWidth(HWND hWnd, LPCTSTR szFontName) {
      PAINTSTRUCT ps;
      HDC hdc;

      hdc = BeginPaint(hWnd, &ps);

      HFONT hFont = ::CreateFont(-12*20, 0, 0,0, FW_NORMAL, 
            FALSE, FALSE, FALSE, SHIFTJIS_CHARSET, 
            OUT_DEFAULT_PRECIS, CLIP_DEFAULT_PRECIS, 
            DEFAULT_QUALITY, DEFAULT_PITCH | FF_MODERN, szFontName);

      TEXTMETRIC tm;

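      HFONT hOldFont = (HFONT)SelectObject(hdc, hFont);
      GetTextMetrics(hdc, &tm);

      LONG lAvgTextWidth = tm.tmAveCharWidth;

      // Restore the original font and free ours before releasing the DC.
      SelectObject(hdc, hOldFont);
      DeleteObject(hFont);
      EndPaint(hWnd, &ps);

      TCHAR str[256];
      _sntprintf_s(str, _countof(str), _TRUNCATE, _T("%s:%ld \n"), szFontName, lAvgTextWidth);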
      MessageBox(hWnd, str, 0, 0);

      return lAvgTextWidth;
}

void Test(HWND hWnd) {
      LONG lMSPGothic = GetAvgTextWidth(hWnd, _T("MS PGothic"));
      LONG lMeiryo = GetAvgTextWidth(hWnd, _T("Meiryo"));
      LONG lArial = GetAvgTextWidth(hWnd, _T("Arial"));
      LONG lKaiTi = GetAvgTextWidth(hWnd, _T("KaiTi"));
}

Now of course it is easy to dismiss this as just another case of Microsoft screwing with ANSI applications, but that isn't what is happening here, at all (as the above repro code shows, you can have the same problems in Unicode apps too, in some cases).

The difference is actually in the metrics of the two fonts. Specifically, Meiryo reports a value for the TEXTMETRIC structure's tmAveCharWidth member that is about double what MS Gothic reports. That member is simply defined as follows:

Specifies the average width of characters in the font (generally defined as the width of the letter x). This value does not include the overhang required for bold or italic characters.

Now of course one only has to look at Meiryo and MS Gothic side by side to see that there is no visual reason to expect a tmAveCharWidth that is double the size.

So what is going on here?

Starting from the spec:

 

xAvgCharWidth

Format: 2-byte signed short

Units: Pels / em units

Title: Average weighted escapement.

Description: The Average Character Width parameter specifies the arithmetic average of the escapement (width) of all non-zero width glyphs in the font.

Comments: The value for xAvgCharWidth is calculated by obtaining the arithmetic average of the width of all non-zero width glyphs in the font. Furthermore, it is strongly recommended that implementers do not rely on this value for computing layout for lines of text. Especially, for cases where complex scripts are used.
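
For anyone who wants to look at the raw number rather than what GDI hands back, here is a minimal sketch of pulling xAvgCharWidth out of the bytes of a font's OS/2 table. ReadXAvgCharWidth is a hypothetical helper name, and how you obtain the table bytes (GetFontData is one possibility on Windows) is left out. The only layout facts it relies on are that the table opens with a 16-bit version field, that xAvgCharWidth is the signed 16-bit field right after it, and that both are big-endian.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: reads xAvgCharWidth from the raw bytes of an OS/2
// table. Offset 0 holds the uint16 version; offset 2 holds the int16
// xAvgCharWidth. Font table data is stored big-endian.
// Returns false if the buffer is too short to contain the field.
bool ReadXAvgCharWidth(const std::uint8_t* data, std::size_t size,
                       std::int16_t* out) {
    if (data == nullptr || out == nullptr || size < 4) {
        return false;
    }
    // Assemble the two bytes most-significant first, then treat as signed.
    const std::uint16_t raw =
        static_cast<std::uint16_t>((data[2] << 8) | data[3]);
    *out = static_cast<std::int16_t>(raw);
    return true;
}
```

The value that comes back is in font design units, so it still has to be scaled to the em size in use before it can be compared with tmAveCharWidth.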

   

Well, for a long time there has been a kind of hack built into a lot of East Asian fonts, whereby the metric behind tmAveCharWidth was not calculated by adding up the width of every character and dividing by the total number of them. Instead, it was calculated by averaging only the non-ideographic characters.

Perhaps as comments go, they should update to say not to rely on it for East Asian characters, either? :-)

Now as hacks go this is a fairly decent one, since following the letter of the TrueType spec here makes the value somewhat useless -- the ideographs have a fixed width among themselves that by convention will be bigger than most of the other characters. Dumping thousands (or even tens of thousands) of such characters into the mix to be averaged in does not really make the resultant tmAveCharWidth very useful for most purposes, and in particular not for what MFC was trying to do with it.

If you think of the average width value as an election then this is a great example of characters stuffing the ballot box!
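
To put rough numbers on the ballot stuffing, here is a small sketch that averages widths both ways. Everything in it is made up for illustration: the GlyphGroup type, the AverageWidth and HypotheticalFont names, and especially the glyph counts and widths, which do not come from Meiryo or any real font.

```cpp
#include <cstddef>
#include <vector>

// One entry per group of glyphs that share the same advance width.
struct GlyphGroup {
    std::size_t count;  // number of glyphs in the group
    int width;          // advance width in design units (0 = zero-width)
    bool ideographic;   // true for the fixed-width ideographs
};

// Arithmetic average of all non-zero widths, as the spec describes it.
// With skipIdeographs set, ideographic glyphs are left out of the vote,
// which mimics the convention the older East Asian fonts followed.
double AverageWidth(const std::vector<GlyphGroup>& groups, bool skipIdeographs) {
    long long total = 0;
    std::size_t n = 0;
    for (const GlyphGroup& g : groups) {
        if (g.width == 0) continue;                     // spec excludes these
        if (skipIdeographs && g.ideographic) continue;  // the "hack"
        total += static_cast<long long>(g.count) * g.width;
        n += g.count;
    }
    return n == 0 ? 0.0 : static_cast<double>(total) / static_cast<double>(n);
}

// Invented font: 200 half-width glyphs and 20,000 full-width ideographs.
std::vector<GlyphGroup> HypotheticalFont() {
    return { {200, 500, false}, {20000, 1000, true} };
}
```

With these invented numbers the non-ideographic average is 500, while the spec-conformant average over everything is about 995. The 200 narrow glyphs barely move the result, which is the same "about double" effect described above.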

Now Meiryo does not have this hack -- it contains the actual average width across every character. It is completely conformant to the TrueType specification and its definitions.

Of course the end result is not entirely pleasant if one tries to use the font:

So, what is an honest developer (or huge framework) to do?

Well, calculating the value using an explicit subset is one way (this is code that actually is used by Windows in some cases if more complex calls fail):

static WCHAR wszAvgChars[] = L"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
SIZE size;
int iAve;
// Change from tmAveCharWidth. We will calculate a true average
// as opposed to the one returned by tmAveCharWidth.
GetTextExtentPointW(hdc, wszAvgChars, (sizeof(wszAvgChars) / sizeof(WCHAR)) - 1, &size);
// size.cx covers 52 letters, so size.cx / 26 is twice the average width.
iAve = ((size.cx / 26) + 1) / 2; // round to nearest

And suddenly by using this value, things start looking right again!
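
The last line of that snippet is terse, so here is the same arithmetic pulled out into a function that can be checked without a device context. AverageFromSampleExtent is a hypothetical name, and its argument stands in for the size.cx that GetTextExtentPointW returns for the 52-letter sample string.

```cpp
// Average character width from the total extent of the 52-letter sample
// string, using the same integer arithmetic as the snippet above.
// totalExtent / 26 is twice the average (52 letters / 26 = 2), and adding 1
// before halving rounds to the nearest whole unit instead of truncating.
int AverageFromSampleExtent(int totalExtent) {
    return ((totalExtent / 26) + 1) / 2;
}
```

For example, an extent of 364 (exactly 7 per letter) gives 7, and an extent of 390 (7.5 per letter) gives 8.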

Now I know from my own Unicode Layer days that making changes to MFC and recompiling it is not easy to do, so in the meantime the easier answer for most people might be to avoid Meiryo in these kinds of situations until an MFC update is available.

So maybe indirectly it is a problem with Microsoft and ANSI apps.

Though to be honest it is easier to blame it on piss-poor cross-business unit dogfood efforts (DevDiv not wanting to install Vista, Windows not wanting to use MFC).

Or you could even blame the lack of solid international app compat testing (that Japanese problem is not hard to hit) on Vista.

I guess this could be used as an excuse to rail against whatever cause one chooses to take up (well, other than the war in Iraq). :-)

 

This post brought to you by ヘ (U+30d8, a.k.a. KATAKANA LETTER HE)


grant on 19 Jan 2009 10:43 PM:

Still investigating a bug report from a client, but looks like a problem with Meiryo. This blog entry is the closest I have to throwing the light on the matter, until I take a closer look at the code.

The string "Chassis/Suspension/Wheels" (in English, despite using a Japanese locale) displays in a grid cell with the trailing 's' partly truncated. Would appear to be using a standard TextOut to draw the text in the cell.

Change the font to anything else works OK. Suspecting the '/' char, I tried a few variations; removing the forward slash and the problem goes away, adding more & more of the last char is truncated.

So this font (and only this font) mis-reports the length of the string.. maybe something with '/' vs '¥'  ?

Michael S. Kaplan on 20 Jan 2009 6:04 PM:

That would definitely be a separate issue, unrelated to the tmAveCharWidth/xAvgCharWidth value in Meiryo, right?

