Mixing MLang and Uniscribe

by Michael S. Kaplan, published on 2006/02/15 05:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/15/531653.aspx

First, I don't think your question I responded to in the post The Lack of Uniscribe Samples was specifically open-ended, other than (as I pointed out in that post) that there needs to be more info on the actual Uniscribe scenarios that are involved. Scenarios do exist, but samples do need to explain a bit more fully the scenarios whose solutions they illustrate....

Talking to David Brown about your question about Uniscribe and Asian text, he had this to say:

Now Ian -- you may have noticed what David has (perhaps intentionally?) provided here -- some good example scenarios for potential future samples.

And yes I am taking note and will see if I can provide some of these in the future! :-)

For that last point (well actually your second point -- I told you I was working out of order here!)....

The interaction between MLang font linking and Uniscribe font fallback is somewhat involved, to put it mildly.

MLang is creating a synthetic font that can be used to render text that spans script boundaries, as I discuss in Font substitution and linking #2. Awesome feature, but....

Since this support is external to GDI, MLang is not creating an updated CMAP table for this synthetic font, and it is indeed the CMAP table that Uniscribe uses to figure out whether the font supports all the characters. So how to get Uniscribe to use this synthesized font that may have been created by carefully getting the best glyphs possible?

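Well, you were looking in the right place -- after that call to ScriptShape. But there are actually two things to check here: whether ScriptShape returned USP_E_SCRIPT_NOT_IN_FONT, and whether the glyphs it did return include any missing glyphs.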

The reason for the second check for East Asian text is that there are no typographic features required of fonts, so you wouldn't really expect to see USP_E_SCRIPT_NOT_IN_FONT. You basically must scan the output of ScriptShape for missing glyphs in order to discover when to do font substitution.
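To make that scan a little more concrete, here is a minimal, portable sketch rather than real Uniscribe code. It assumes the glyph buffer has already been filled in by ScriptShape and that the font's default ("missing") glyph index is known, for example from the wgDefault member of SCRIPT_FONTPROPERTIES as reported by ScriptGetFontProperties. The names findMissingGlyphRanges and GlyphRange are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Half-open range [first, second) of glyph positions that came back missing.
using GlyphRange = std::pair<std::size_t, std::size_t>;

// Assumption: `glyphs` stands in for the glyph buffer ScriptShape produced,
// and `defaultGlyph` for the font's missing-glyph index (wgDefault).
// Returns each consecutive stretch of missing glyphs as one range.
std::vector<GlyphRange> findMissingGlyphRanges(
    const std::vector<std::uint16_t>& glyphs, std::uint16_t defaultGlyph) {
    std::vector<GlyphRange> ranges;
    std::size_t i = 0;
    while (i < glyphs.size()) {
        if (glyphs[i] != defaultGlyph) {
            ++i;  // this glyph was found in the font
            continue;
        }
        const std::size_t start = i;
        while (i < glyphs.size() && glyphs[i] == defaultGlyph) {
            ++i;  // extend the range over adjacent missing glyphs
        }
        ranges.emplace_back(start, i);
    }
    return ranges;
}

// Usage example with made-up glyph values, taking 0 as the default glyph.
inline void demoFindMissingGlyphRanges() {
    const std::vector<std::uint16_t> glyphs = {42, 0, 0, 17};
    const std::vector<GlyphRange> ranges = findMissingGlyphRanges(glyphs, 0);
    assert(ranges.size() == 1);
    assert(ranges[0].first == 1 && ranges[0].second == 3);
}
```

In a real caller the ranges are glyph positions, so they would still have to be mapped back to character positions (ScriptShape's logical cluster output is the usual route) before the affected text could be handed to another font.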

But if Uniscribe is using the font's CMAP tables, it becomes difficult to know how to tell it to just try to render and ignore what the CMAP table is telling it.

To accomplish things here you would have to take these East Asian runs and then (as David mentioned) implement font fallback, line breaking and vertical run handling yourself. Decidedly non-trivial....

However, the truth is that combining MLang and Uniscribe for the East Asian scripts does not make as much sense since the fonts themselves are big enough that it is unlikely to require a synthetic font, and Uniscribe itself does not handle East Asian all that specially anyway. Picking a font that supports the characters to start with may well be the best answer for this case.
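As a rough illustration of "picking a font that supports the characters to start with", here is a portable sketch that walks a preference-ordered list of candidate fonts and returns the first one whose coverage includes every code point in the text. The FontCoverage struct, the pickCoveringFont function and the face names are all hypothetical. In a real Windows program the coverage would come from the font's CMAP rather than a hand-built set, for instance by calling GetGlyphIndices with GGI_MARK_NONEXISTING_GLYPHS.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a font and the code points its CMAP covers.
struct FontCoverage {
    std::string faceName;
    std::set<char32_t> codePoints;
};

// Returns the first candidate that covers every code point in `text`,
// or nullptr if no single candidate covers all of them.
const FontCoverage* pickCoveringFont(
    const std::vector<FontCoverage>& candidates, const std::u32string& text) {
    for (const FontCoverage& font : candidates) {
        bool coversAll = true;
        for (char32_t cp : text) {
            if (font.codePoints.count(cp) == 0) {
                coversAll = false;  // one uncovered character rules the font out
                break;
            }
        }
        if (coversAll) {
            return &font;
        }
    }
    return nullptr;
}

// Usage example with made-up coverage data.
inline void demoPickCoveringFont() {
    const std::vector<FontCoverage> fonts = {
        {"HypotheticalLatin", {U'a', U'b'}},
        {"HypotheticalCjk", {U'a', U'b', U'\u56D7'}},
    };
    const FontCoverage* chosen = pickCoveringFont(fonts, U"a\u56D7");
    assert(chosen != nullptr);
    assert(chosen->faceName == "HypotheticalCjk");
}
```

Because the candidates are tried in order, the caller's preferred font still wins whenever it can render the whole string, and a null result signals that no single font will do.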

For the actual complex script cases, you can of course work to modify those 'missing character' entries if you know that the font supports the glyphs in question.

This post brought to you by "囗" (U+56d7, a CJK Unified Ideograph meaning erect, proud, upright, or bald)

"This post brought to you by "囗" (U+56d7, a CJK Unified Ideograph meaning erect, proud, upright, or bald)"

... which I'm afraid is yet another Unihan mistake that has been around for years without anyone noticing.

U+56D7 囗 is the archaic form of U+570D wéi 圍 "to encircle, surround". In modern usage it is also used as an ultra-simplified form of U+56FD guó 国 "country". But it has never meant anything like erect, proud, upright or bald (!). The definition must be for a different character, although it's not immediately obvious which one.

None of the non-normative fields in the Unihan database can be relied on to any great extent, and the definitions are particularly unreliable. There are some *really* awful ones -- U+6A36 is one that I have noticed just now (prize to anyone who can explain the definition).

Given that in many cases it is impossible to accurately or meaningfully define a character's meaning without giving lengthy, dictionary-like entries, if it were up to me I would dispense with the definitions altogether (if nothing else, it would knock off over 1MB in the size of Unihan.txt).