Mixing MLang and Uniscribe

by Michael S. Kaplan, published on 2006/02/15 05:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/15/531653.aspx

Ian Treleaven asked in the Suggestion Box:

Thanks for getting to my Uniscribe question this past December, which I now realized was very open-ended.

I would like to see an example or hints on font substitution when using Uniscribe.  It seems you can use MLang to build a font, but there seem to be some tricks involved in getting that font used.  I assume the right place is after a call to ScriptShape that returns USP_E_SCRIPT_NOT_IN_FONT.

At the 2005 PDC, David Brown mentioned informally that Uniscribe should not be used to render Asian text.  Can you elaborate?

Hopefully Ian will not mind if I take some of his post out of order. :-)

First, I don't think the question I responded to in the post The Lack of Uniscribe Samples was especially open-ended, other than (as I pointed out in that post) that there needs to be more information about the actual scenarios in which Uniscribe is being used. Scenarios do exist, but samples need to explain more fully the scenarios whose solutions they illustrate....

When I asked David Brown about your question on Uniscribe and Asian text, he had this to say:

I don’t remember what the context of this was – that certainly seems too general a statement given the amount of East Asian support that has been added to Uniscribe over the years.

Background: Originally Uniscribe was designed to solve complex script rendering by a team (and team lead – me) that had little experience of East Asian issues. One particularly messy issue that took a few years before it was fixed was the rendering of surrogate codepoints in vertical text – before that fix went in surrogate codepoints were not rotated correctly.

I’m sure there are plenty of things still left that don’t work too well without extra work by the customer. ScriptBreak for example provides wordbreak points for complex scripts, but not for East Asian text, indeed it isn’t designed right to handle East Asian - it currently is passed individual runs, which doesn’t allow it to handle the junction between runs correctly.

Font fallback is designed entirely for complex scripts. If you roll your own font linking, life gets a lot more complex when you want to support both complex scripts and East Asian; with hindsight we should have made it easier....

The thing that we should probably tell customers about rendering East Asian text is that they will need to implement font fallback, line breaking and vertical run handling themselves, and that this requires more code than just supporting Western and complex scripts does. (And we don’t have a sample). Nonetheless they should be sending East Asian text through Uniscribe as it is required to get the OpenType features just coming in with the latest fonts.

Now Ian -- you may have noticed what David has (perhaps intentionally?) provided here -- some good example scenarios for potential future samples.

And yes I am taking note and will see if I can provide some of these in the future! :-)

For that last point (well actually your second point -- I told you I was working out of order here!)....

The interaction between MLang font linking and Uniscribe font fallback is somewhat involved, to put it mildly.

MLang is creating a synthetic font that can be used to render text that spans script boundaries, as I discuss in Font substitution and linking #2. Awesome feature, but....

Since this support is external to GDI, MLang is not creating an updated CMAP table for this synthetic font, and it is indeed the CMAP table that Uniscribe uses to figure out whether the font supports all the characters. So how to get Uniscribe to use this synthesized font that may have been created by carefully getting the best glyphs possible?

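Well, you were looking in the right place -- after that call to ScriptShape. But there are actually two things to check here:

1. Whether ScriptShape returned USP_E_SCRIPT_NOT_IN_FONT
2. For East Asian text, whether the glyphs that ScriptShape returned include any missing glyphs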

The reason for the second check for East Asian text is that these scripts require no particular typographic features of a font, so you wouldn't really expect to see USP_E_SCRIPT_NOT_IN_FONT. You basically must scan the output of ScriptShape for missing glyphs in order to discover when to do font substitution.
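To make the two checks concrete, here is a minimal, portable sketch of just the decision logic. It is not Uniscribe code and it is not how any Microsoft component does it: `HResult`, `ShapeOutcome`, `HasMissingGlyph`, and `ClassifyShapeResult` are hypothetical names invented for illustration, and the constant is a stand-in that I am assuming matches USP_E_SCRIPT_NOT_IN_FONT in usp10.h. In a real program the glyph buffer would come from ScriptShape, and the missing glyph index would presumably come from the wgDefault field filled in by ScriptGetFontProperties.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for the Windows HRESULT type so the sketch compiles anywhere.
using HResult = std::int32_t;

// Assumption: mirrors USP_E_SCRIPT_NOT_IN_FONT from usp10.h, i.e.
// MAKE_HRESULT(SEVERITY_ERROR, FACILITY_ITF, 0x200).
constexpr HResult kScriptNotInFont = static_cast<HResult>(0x80040200u);
constexpr HResult kOk = 0;  // stand-in for S_OK

// Hypothetical result type for this sketch.
enum class ShapeOutcome { UseFont, TryAnotherFont, Failed };

// Check 2: scan the shaped glyphs for the font's "missing glyph" index.
// In real code defaultGlyph would be SCRIPT_FONTPROPERTIES::wgDefault.
bool HasMissingGlyph(const std::uint16_t* glyphs, std::size_t count,
                     std::uint16_t defaultGlyph) {
    assert(glyphs != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        if (glyphs[i] == defaultGlyph) {
            return true;
        }
    }
    return false;
}

// Combine both checks into one decision about the current font.
ShapeOutcome ClassifyShapeResult(HResult hr, const std::uint16_t* glyphs,
                                 std::size_t count,
                                 std::uint16_t defaultGlyph) {
    // Check 1: the shaping engine says the font lacks the script.
    if (hr == kScriptNotInFont) {
        return ShapeOutcome::TryAnotherFont;
    }
    // Any other failure is not a font substitution problem.
    if (hr < 0) {
        return ShapeOutcome::Failed;
    }
    // Shaping "succeeded", but individual characters may still be missing.
    return HasMissingGlyph(glyphs, count, defaultGlyph)
               ? ShapeOutcome::TryAnotherFont
               : ShapeOutcome::UseFont;
}
```

A caller would run this after each shaping attempt for a run and, on `TryAnotherFont`, select the next fallback font into the device context and shape the run again. Keeping the two checks in one function keeps the complex script path and the East Asian path from drifting apart.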

But if Uniscribe is using the font's CMAP tables, it becomes difficult to know how to tell it to just try to render and ignore what the CMAP table is telling it.

To accomplish things here you would have to take these East Asian runs and then (as David mentioned) implement font fallback, line breaking and vertical run handling yourself. Decidedly non-trivial....

However, the truth is that combining MLang and Uniscribe for the East Asian scripts does not make as much sense since the fonts themselves are big enough that it is unlikely to require a synthetic font, and Uniscribe itself does not handle East Asian all that specially anyway. Picking a font that supports the characters to start with may well be the best answer for this case.
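As a rough illustration of "picking a font that supports the characters to start with", the sketch below walks a list of candidate fonts and returns the first one whose coverage includes every code point in the text. Everything in it is hypothetical: `CodepointRange`, `CandidateFont`, `Covers`, and `PickCoveringFont` are made-up names, and the coverage ranges are supplied by hand. On Windows you would have to obtain real coverage from the font itself (for example by asking GDI for glyph indices and looking for the "no such glyph" marker), which this portable sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of what a font claims to cover.
struct CodepointRange {
    char32_t first;
    char32_t last;  // inclusive
};

struct CandidateFont {
    std::string faceName;
    std::vector<CodepointRange> coverage;
};

// True if the code point falls inside any of the font's ranges.
bool Covers(const CandidateFont& font, char32_t cp) {
    for (const CodepointRange& range : font.coverage) {
        assert(range.first <= range.last);
        if (cp >= range.first && cp <= range.last) {
            return true;
        }
    }
    return false;
}

// Returns the index of the first candidate that covers every code point
// in the text, or -1 if no single candidate does.
int PickCoveringFont(const std::vector<CandidateFont>& fonts,
                     const std::u32string& text) {
    for (std::size_t i = 0; i < fonts.size(); ++i) {
        bool coversAll = true;
        for (char32_t cp : text) {
            if (!Covers(fonts[i], cp)) {
                coversAll = false;
                break;
            }
        }
        if (coversAll) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

The candidates are tried in order, so the caller decides the preference (for example, the user's chosen font first, then progressively broader fonts). A result of -1 is the signal that a single font is not enough and real per-run fallback is needed after all.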

For the true complex script cases, you can of course modify those 'missing character' entries yourself if you know that the font supports the glyphs in question.

A very rich area for future posts!


This post brought to you by "囗" (U+56d7, a CJK Unified Ideograph meaning erect, proud, upright, or bald)

# Andrew West on 15 Feb 2006 5:44 PM:

"This post brought to you by "囗" (U+56d7, a CJK Unified Ideograph meaning erect, proud, upright, or bald)"

... which I'm afraid is yet another Unihan mistake that has been around for years without anyone noticing.

U+56D7 囗 is the archaic form of U+570D wéi 圍 "to encircle, surround". In modern usage it is also used as an ultra-simplified form of U+56FD guó 国 "country". But it has never meant anything like erect, proud, upright or bald (!). The definition must be for a different character, although it's not immediately obvious which one.

None of the non-normative fields in the Unihan database can be relied on to any great extent, and the definitions are particularly unreliable. There are some *really* awful ones -- U+6A36 is one that I have noticed just now (prize to anyone who can explain the definition).

Given that in many cases it is impossible to accurately or meaningfully define a character's meaning without giving lengthy, dictionary-like entries, if it were up to me I would dispense with the definitions altogether (if nothing else, it would knock off over 1MB in the size of Unihan.txt).

referenced by

2007/02/06 MLang and GDI and Uniscribe, oh my!
