If the text is not complex, does it have to be treated like it is?

by Michael S. Kaplan, published on 2006/02/07 11:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/07/526359.aspx


James Brown asked in the microsoft.public.win32.programmer.international newsgroup:

I have a fully working Uniscribe wrapper which renders a line of Unicode text, using the low-level ScriptItemize/Layout/Shape/Place/TextOut calls. It's working pretty well (very well in fact) but there is still one area I am not happy with. For a regular string of "English" text (i.e. non-complex), ScriptItemize always breaks the string into individual words. For a long line of text, containing much white-space and punctuation, this can result in quite a number of SCRIPT_ITEMs being returned.

This results in a large number of calls to ScriptTextOut to render the text, which is where the problem is. Because I am required to call ScriptTextOut for each "item-run" in the text, this results in a fairly slow mode of operation, a lot slower than calling ExtTextOut for the whole line for example. It's not that ScriptTextOut itself is slow, it is just the sheer number of calls to the OS that is causing the problem.

So my idea is as follows:

After Shaping, all of the returned glyph-data for every item-run in the string is stored consecutively in a large buffer. Ordinarily I isolate each run in this buffer and draw the runs individually with ScriptTextOut.

However for a "simple" string of text (i.e. one that ScriptIsComplex recognizes as such), I am proposing to pass the entire buffer of glyphs, widths etc. to ScriptTextOut in one go. So even if there were 30 runs of text, I would just treat this as one run and call ScriptTextOut just once, in essence recombining all script-items into one single unit.

Assuming for the moment that I am using just one font, does anyone see any problem in this approach? The only issue I can see is specifying a correct SCRIPT_ANALYSIS structure (there is a unique structure per run so which would I specify?)

I have seen hints that maybe ScriptTextOut performs some trickery prior to calling ExtTextOut (for complex scripts) and that combining runs prior to calling it would be bad... but for regular English text (code-points < 255 for example) would this be ok?

I have tested this method, and it does seem to work - and it is *much* faster this way... it would be nice for a Microsoft uniscribe/typography rep to comment on this approach.

The method itself should be sound; this use of ScriptIsComplex is very similar to the method that LPK.DLL (discussed previously) uses to determine whether to forward text rendering calls to Uniscribe or not.

(Of course in the case of LPK.DLL, Uniscribe is not called in the non-complex case, ExtTextOutW is; there may be a performance benefit to doing this since ScriptTextOut must eventually call ExtTextOutW to do the actual rendering -- so eliminating the extra overhead may be to everyone's advantage.)

Now, I am not entirely clear on why non-complex text would be broken into separate runs (especially text for which ScriptIsComplex returns FALSE), so I will probably try to dig a little deeper on that point.

Does anyone have any theories? :-)

 

This post brought to you by "ཛྷ" (U+0f5c, a.k.a. TIBETAN LETTER DZHA)


# Robert on 7 Feb 2006 8:13 PM:

I had exactly the same problem. In particular, splitting strings like "the car, that" into three separate runs "the car|, |that" prevents kerning of 'r' and ',' for fonts that support advanced typographic features. (You can see this in Notepad if complex script support is turned on: Set the font to Palatino Linotype 72pt, and type "To the car, that"; note that kerning is applied to "To" but not to "r,".)

I resorted to substituting punctuation (C1_PUNCT) with space characters (U+0020) prior to calling ScriptItemize. ScriptItemize seems to accept space characters for any script, so it does not generate separate items for punctuation and alphabetic characters. Of course, this method has undesirable effects in some cases. For example, it should not be used if the string contains RTL text because it would hide the directional information implied by the punctuation. Also it might cause problems in ScriptShape if the SCRIPT_ANALYSIS returned for the space character does not actually support punctuation marks.
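A minimal sketch of the substitution described above, in portable C++. It assumes the masked copy is used only as the input to ScriptItemize while the original string is still the one that gets shaped and drawn. The function name is hypothetical, and `std::iswpunct` is a stand-in for the C1_PUNCT test (which on Windows would come from GetStringTypeW); the two classifications are not guaranteed to agree for every character.

```cpp
#include <cassert>
#include <cwctype>
#include <string>

// Hypothetical helper: returns a copy of `text` in which every punctuation
// character is replaced by U+0020. The copy is meant for itemization only.
std::wstring maskPunctuationForItemizing(const std::wstring& text)
{
    std::wstring masked = text;
    for (wchar_t& ch : masked) {
        // Stand-in for a C1_PUNCT check; an assumption, not the Win32 API.
        if (std::iswpunct(static_cast<std::wint_t>(ch)))
            ch = L' ';
    }
    // The replacement is one-for-one, so character offsets reported for the
    // masked copy still index the original string.
    assert(masked.size() == text.size());
    return masked;
}
```

Because the replacement keeps the length unchanged, run boundaries computed on the masked copy can be applied directly to the original text. As noted above, this should not be used when the string may contain RTL text.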

# Michael S. Kaplan on 7 Feb 2006 10:02 PM:

Wow, is this on all versions of Windows/Uniscribe, or is it specific to particular versions?

# James Brown on 8 Feb 2006 12:01 PM:

Thanks (again) for picking this issue up from Usenet, Michael! I'm part-way to finding an answer after stumbling over this quote from MSDN under the "Shaping Engines" section:

"Only scripts that have the property fComplex should be shaped with the script returned by the ScriptItemize function. All other runs may be merged and shaped with SCRIPT_UNDEFINED specified in the SCRIPT_ANALYSIS structure"

Took me a while to figure out what this meant exactly. The fComplex flag belongs to the SCRIPT_PROPERTIES structure, however this is not directly related to any of the results returned by ScriptItemize, so how do I tell which script a SCRIPT_ITEM belongs to? Here's how:

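#include <windows.h>
#include <usp10.h>

BOOL IsRunComplex(const SCRIPT_ITEM *item)
{
  const SCRIPT_PROPERTIES **propList;
  int propCount;

  // get pointer to the global script table
  if (FAILED(ScriptGetProperties(&propList, &propCount)))
    return TRUE;  // on failure, be conservative and treat the run as complex

  // eScript is an index into the global script table
  int scriptIndex = item->a.eScript;
  if (scriptIndex >= propCount)
    return TRUE;  // out of range: again, treat as complex

  return propList[scriptIndex]->fComplex;
}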

This method was buried within the Uniscribe docs under "Determining if a script requires glyph shaping"... well at least it was mentioned, wouldn't really call it 'documented' at any rate.

So after this, any run which is _not_ complex can be merged, and the run's SCRIPT_ANALYSIS.eScript value set to SCRIPT_UNDEFINED. This will require careful coding as longer runs won't be breakable for word-wrapping so I guess this "non-complex-script" merging should happen after ScriptBreak has been called.

It's not perfect - some forms of English punctuation/whitespace still aren't mergeable. After testing I found a 30% "reduction" overall. Better than nothing I guess.
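The merging step described in this comment could be sketched roughly as follows. This is an illustration under stated assumptions, not Uniscribe code: `Run` is a hypothetical stand-in for the few SCRIPT_ITEM fields that matter here, `complex` is assumed to hold the result of a check such as IsRunComplex, `kScriptUndefined` mirrors SCRIPT_UNDEFINED (defined as 0 in usp10.h), and the text is assumed to use a single font and a single left-to-right direction.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the parts of a SCRIPT_ITEM used by the merge.
struct Run {
    int  charPos;  // index of the run's first character (cf. iCharPos)
    int  script;   // script id (cf. SCRIPT_ANALYSIS::eScript)
    bool complex;  // assumed result of a check such as IsRunComplex()
};

// Mirrors SCRIPT_UNDEFINED from usp10.h.
constexpr int kScriptUndefined = 0;

// Collapses each maximal sequence of adjacent non-complex runs into one run
// tagged with kScriptUndefined. Complex runs are copied through unchanged.
// A run ends where the next run in the list starts, so dropping the later
// runs of a sequence extends the first one automatically.
std::vector<Run> mergeSimpleRuns(const std::vector<Run>& runs)
{
    std::vector<Run> merged;
    int lastPos = -1;
    for (const Run& run : runs) {
        // Runs must arrive in ascending character order.
        assert(run.charPos > lastPos);
        lastPos = run.charPos;

        if (!run.complex && !merged.empty() && !merged.back().complex)
            continue;  // absorbed into the preceding simple run

        Run out = run;
        if (!out.complex)
            out.script = kScriptUndefined;
        merged.push_back(out);
    }
    return merged;
}
```

As the comment suggests, a step like this would presumably run after ScriptBreak, so that word-wrap positions are computed from the original, unmerged items.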




referenced by

2007/03/13 We need to be optimizing for more than just the simple cases

2006/07/10 The PUA isn't complex enough
