If the text is not complex, does it have to be treated like it is?

by Michael S. Kaplan, published on 2006/02/07 11:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/07/526359.aspx

James Brown asked in the microsoft.public.win32.programmer.international newsgroup:

The method itself should be sound; this use of ScriptIsComplex is very similar to the method that LPK.DLL (discussed previously) uses to determine whether to forward text rendering calls to Uniscribe.

(Of course in the case of LPK.DLL, Uniscribe is not called in the non-complex case, ExtTextOutW is; there may be a performance benefit to doing this since ScriptTextOut must eventually call ExtTextOutW to do the actual rendering -- so eliminating the extra overhead may be to everyone's advantage.)

Now, I am not entirely clear on why non-complex text would be broken into separate runs (especially text for which ScriptIsComplex returns S_FALSE), so I will probably try to dig a little deeper on that point.

I had exactly the same problem. In particular, splitting strings like "the car, that" into three separate runs "the car|, |that" prevents kerning of 'r' and ',' for fonts that support advanced typographic features. (You can see this in Notepad if complex script support is turned on: Set the font to Palatino Linotype 72pt, and type "To the car, that"; note that kerning is applied to "To" but not to "r,".)

I resorted to substituting punctuation (C1_PUNCT) with space characters (U+0020) prior to calling ScriptItemize. ScriptItemize seems to accept space characters for any script, so it does not generate separate items for punctuation and alphabetic characters. Of course, this method has undesirable effects in some cases. For example, it should not be used if the string contains RTL text because it would hide the directional information implied by the punctuation. Also it might cause problems in ScriptShape if the SCRIPT_ANALYSIS returned for the space character does not actually support punctuation marks.
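The substitution described above can be sketched roughly as follows. This is a portable illustration, not the poster's actual code: MaskPunctuationForItemizing is a hypothetical helper name, and std::iswpunct is used as a stand-in for the C1_PUNCT test, so the two may disagree on some characters. The masked copy would be passed to ScriptItemize, while shaping and rendering would still use the original string; because the length is unchanged, the item offsets apply to both.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper (not a Uniscribe API): returns a copy of `text` in
// which every punctuation character is replaced by U+0020. The length is
// preserved, so character offsets computed on the copy remain valid for
// the original string.
// Assumption: std::iswpunct approximates the C1_PUNCT classification.
std::wstring MaskPunctuationForItemizing(const std::wstring& text)
{
    std::wstring masked = text;
    for (wchar_t& ch : masked) {
        if (std::iswpunct(static_cast<std::wint_t>(ch))) {
            ch = L' ';
        }
    }
    return masked;
}
```

As the caveats above note, a caller would want to skip this step for strings that contain RTL text.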

Thanks (again) for picking this issue up from Usenet, Michael! I'm part-way to finding an answer after stumbling over this quote from MSDN under the "Shaping Engines" section:

"Only scripts that have the property fComplex should be shaped with the script returned by the ScriptItemize function. All other runs may be merged and shaped with SCRIPT_UNDEFINED specified in the SCRIPT_ANALYSIS structure"

Took me a while to figure out what this meant exactly. The fComplex flag belongs to the SCRIPT_PROPERTIES structure; however, this is not directly related to any of the results returned by ScriptItemize, so how do I tell which script a SCRIPT_ITEM belongs to? Here's how:

#include <windows.h>
#include <usp10.h>

BOOL IsRunComplex(const SCRIPT_ITEM *item)
{
    const SCRIPT_PROPERTIES **propList;
    int propCount;

    // get pointer to the global script table
    ScriptGetProperties(&propList, &propCount);

    // eScript is an index into that table
    int scriptIndex = item->a.eScript;

    return propList[scriptIndex]->fComplex;
}

This method was buried within the Uniscribe docs under "Determining if a script requires glyph shaping"... well at least it was mentioned, wouldn't really call it 'documented' at any rate.

So after this, any run which is _not_ complex can be merged, and the run's SCRIPT_ANALYSIS.eScript value set to SCRIPT_UNDEFINED. This will require careful coding as longer runs won't be breakable for word-wrapping so I guess this "non-complex-script" merging should happen after ScriptBreak has been called.
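A rough sketch of that merging pass, written against a simplified stand-in so it compiles without the Windows headers. The Run struct, kScriptUndefined and MergeSimpleRuns are hypothetical names for illustration: Run mirrors only SCRIPT_ITEM's iCharPos and a.eScript plus the result of IsRunComplex, and kScriptUndefined mirrors SCRIPT_UNDEFINED. The sketch assumes the terminal item that ScriptItemize appends has been left out of the input, and it ignores bidi levels, which real code would probably need to compare before merging.

```cpp
#include <vector>

// Hypothetical stand-in for the SCRIPT_ITEM fields used here.
struct Run {
    int  charPos;    // like SCRIPT_ITEM::iCharPos (start of the run)
    int  script;     // like SCRIPT_ITEM::a.eScript
    bool isComplex;  // result of IsRunComplex for this run
};

const int kScriptUndefined = 0;  // mirrors SCRIPT_UNDEFINED in usp10.h

// Collapses each stretch of adjacent non-complex runs into a single run
// whose script is kScriptUndefined. Complex runs are copied unchanged.
std::vector<Run> MergeSimpleRuns(const std::vector<Run>& runs)
{
    std::vector<Run> merged;
    for (const Run& run : runs) {
        if (run.isComplex) {
            merged.push_back(run);
            continue;
        }
        if (!merged.empty() && !merged.back().isComplex) {
            continue;  // absorbed into the preceding simple run
        }
        merged.push_back(Run{run.charPos, kScriptUndefined, false});
    }
    return merged;
}
```

Since each run's length is implied by the next run's start position, dropping the absorbed entries is enough to extend the merged run.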

It's not perfect - some forms of English punctuation/whitespace still aren't mergeable. After testing I found a 30% "reduction" overall. Better than nothing I guess...