Documented, schmockumented! It's still kind of cool....

by Michael S. Kaplan, published on 2007/09/24 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/24/5082103.aspx

(No, this post is not about my social life or anything related to it, though I suppose there may have been times the title might have been partially descriptive¹; this is a technical post and also a world premiere discussion of an obscure but accidentally not-yet-documented-but-nevertheless-included-in-the-SDK flag for two of the most important GDI functions for rendering text!)

I still have a few posts left in that series I've been working on, though being out of town has caused a minor break in the rhythm there. It will be resuming soon.

Meanwhile over in the Suggestion Box, posts have been building up, and I figured I should pick a few of them off....

Fer² instance, there's Tihiy's question:

Hello Michael!

Can you suggest me a way to convert font character glyphs string back to Unicode string? I'm intercepting ExtTextOut to be able to read word under the cursor and I want to have human-readable string when ETO_GLYPH_INDEX is passed!

Funny question there, one that just came up almost in the end game of Vista before it shipped.

The answer won't exactly be a direct answer to Tihiy's question, but it will supply the method and go from there....

Regular readers may remember when I was talking about device fonts in posts like Printing TrueType as graphics and Device fonts are people too.

Well, one of the things that happened in Vista is that a lot more printing via glyph ID values was happening, a factor which (among other things) forces device fonts to not be used.

Glyph ID values are strongly tied to specific fonts, you see -- so device fonts were being taken out of the running.

Now while this is all well and good for Uniscribe and especially complex scripts, it is not so good for many of the East Asian scripts that I was talking about in that second post, where people were relying on device fonts that were fully loaded (containing all the glyphs that were needed) and were more performant than the system that had now become the default in so many circumstances.

The difference in performance for some cases was bad enough to be considered a legitimate regression, and therefore something had to be done.

I took a look in the Vista SDK header files and saw that the constant being used to trigger this effort was not documented, and so with the permission of the folks behind the code³ in Uniscribe and GDI, I am going to break the news here -- I am sure it will be in some future update to the SDK docs.

The constant is in WinGdi.h, circa line 185:

#if (_WIN32_WINNT >= _WIN32_WINNT_LONGHORN)
#define ETO_REVERSE_INDEX_MAP 0x10000
#endif

This new-in-Vista constant, ETO_REVERSE_INDEX_MAP, will basically (in a TextOut/ExtTextOut call to a device like a printer) try to convert a bunch of glyph ID values back to characters again, using a map that it builds up from the font's "Format 4" Microsoft/Unicode subtable of the CMAP.

This code works great with simple fonts that don't do other mappings -- because

any time a string you pass has no such simple mapping back to a character for every glyph ID, the code will just cut out and use the glyph ID values;
any time a string you pass has multiple characters mapping to the same glyph ID⁴, the code will detect this ambiguous case and again use the glyph ID values as they are;
any time the glyph ID values would actually have been obtained via other means (like the GSUB glyph substitution table or via more advanced features like VERT for vertical writing), no characters will be found.

Of course for the specific scenarios that inspired the work to be done, the feature is sufficient, but for even mildly complex cases involving ligatures or glyph substitutions or complex scripts in general, it will not assist at all, really.
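The fallback rules above can be sketched in portable C. This is only an illustration of the idea, not the actual GDI code: the CharGlyph type, the NO_CHAR and AMBIGUOUS sentinels, and both function names are hypothetical, the forward character-to-glyph entries are assumed to have been read from the font already, and the space/hyphen exception described in footnote 4 is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinels for the reverse map (not GDI constants). */
#define NO_CHAR   0xFFFFFFFFu  /* no character maps to this glyph */
#define AMBIGUOUS 0xFFFFFFFEu  /* more than one character maps to this glyph */

/* Hypothetical forward entry: one character and the glyph the font gives it. */
typedef struct { uint32_t ch; uint16_t glyph; } CharGlyph;

/* Fill reverse[0..glyph_count) from the forward entries. */
void build_reverse_map(const CharGlyph *cmap, size_t n,
                       uint32_t *reverse, size_t glyph_count)
{
    for (size_t g = 0; g < glyph_count; g++)
        reverse[g] = NO_CHAR;
    for (size_t i = 0; i < n; i++) {
        uint16_t g = cmap[i].glyph;
        if (g >= glyph_count)
            continue;  /* ignore entries outside the font's glyph range */
        /* A second character landing on the same glyph makes it ambiguous. */
        reverse[g] = (reverse[g] == NO_CHAR) ? cmap[i].ch : AMBIGUOUS;
    }
}

/* Returns 1 and fills out[] only if every glyph maps back to exactly one
   character; returns 0 otherwise, meaning the caller keeps the glyph IDs. */
int glyphs_to_chars(const uint32_t *reverse, size_t glyph_count,
                    const uint16_t *glyphs, size_t len, uint32_t *out)
{
    for (size_t i = 0; i < len; i++) {
        if (glyphs[i] >= glyph_count)
            return 0;
        uint32_t ch = reverse[glyphs[i]];
        if (ch == NO_CHAR || ch == AMBIGUOUS)
            return 0;  /* one bad glyph abandons the whole string */
        out[i] = ch;
    }
    return 1;
}
```

The all-or-nothing return value mirrors the behavior described above: a single unmapped or ambiguous glyph means the whole string stays as glyph IDs.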

And of course Tihiy's question of how to obtain the results of the mapping is not helped at all (unless one is a printer driver!).

But for all the limitations here, it certainly does provide a roadmap to how one might do the work to try to reverse the process and convert glyph ID values to characters if one wanted to handle more complex cases, whether the ones I suggested above or more complex ones like digit substitution based on settings, reordering found in some Indic scripts, or even accepting ambiguous mappings and taking the first mapping as it is, etc.

One would have to dig into OpenType a bit, but a few GetFontData calls and some code that starts as a reverse to the code in KB241020, opening up additional OpenType tables and subtables as desired for the text in question, and one is in business!
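As a rough starting point for that kind of project, here is a minimal sketch in portable C of a forward lookup in a cmap Format 4 subtable, plus a brute-force reverse scan over the BMP. It assumes the subtable bytes are already in memory (on Windows they could come from GetFontData, which is not called here), the function names be16, cmap4_lookup and cmap4_chars_for_glyph are made up for illustration, and glyph 0 is treated as the missing glyph. It does not touch GSUB or any other table.

```c
#include <stddef.h>
#include <stdint.h>

/* OpenType data is big-endian; read one 16-bit value. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Forward lookup of a BMP character in a cmap Format 4 subtable.
   Returns 0 (missing glyph) if unmapped or if the data looks malformed. */
uint16_t cmap4_lookup(const uint8_t *t, size_t size, uint16_t ch)
{
    if (size < 16 || be16(t) != 4)
        return 0;
    size_t seg_x2 = be16(t + 6);              /* segCountX2 */
    if (16 + 4 * seg_x2 > size)
        return 0;                             /* four parallel arrays + pad */
    const uint8_t *end_code   = t + 14;
    const uint8_t *start_code = end_code + seg_x2 + 2;  /* skip reservedPad */
    const uint8_t *id_delta   = start_code + seg_x2;
    const uint8_t *id_range   = id_delta + seg_x2;

    for (size_t off = 0; off < seg_x2; off += 2) {
        if (be16(end_code + off) < ch)
            continue;                         /* want first endCode >= ch */
        uint16_t start = be16(start_code + off);
        if (start > ch)
            return 0;                         /* ch falls between segments */
        uint16_t delta = be16(id_delta + off);
        uint16_t range = be16(id_range + off);
        if (range == 0)
            return (uint16_t)(ch + delta);    /* arithmetic is modulo 65536 */
        /* idRangeOffset is relative to its own position in the table. */
        const uint8_t *p = id_range + off + range + 2 * (size_t)(ch - start);
        if ((size_t)(p - t) + 2 > size)
            return 0;
        uint16_t g = be16(p);
        return g ? (uint16_t)(g + delta) : 0;
    }
    return 0;
}

/* Brute-force reverse lookup: counts the BMP characters that map to glyph
   and stores the first one found in *first (untouched if none). */
unsigned cmap4_chars_for_glyph(const uint8_t *t, size_t size,
                               uint16_t glyph, uint16_t *first)
{
    unsigned count = 0;
    if (glyph == 0)
        return 0;
    for (uint32_t ch = 0; ch <= 0xFFFF; ch++) {
        if (cmap4_lookup(t, size, (uint16_t)ch) == glyph) {
            if (count++ == 0)
                *first = (uint16_t)ch;
        }
    }
    return count;
}
```

A count greater than one is the ambiguous case; a caller could reject it, as the flag does, or accept the first mapping and live with the consequences.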

Now in the long run, this is the kind of thing I would love to see built in, but obviously features can't be decided solely on the basis of what people like me (or maybe people like you!) think is cool; there have to be real scenarios like the measurable performance issue found with Japanese device fonts not being utilized. Until then, this could be an exciting project for some ISV to work on, or maybe a sample I could try to put together at some point (my last bit of digging into OpenType stuff was over 18 months ago in Getting all of the localized names of a font; I think I might be due at some point!).

Also, keep the concepts behind the post in mind, they will provide assistance in a quite unrelated feature I will be posting about over the next few weeks sometime....

1 - Perhaps fodder for future, non-technical blog posts if there is sufficient interest!
2 - Intentionally misspelled to try to give the illusion of SiaO being "jus country"⁵,⁶!
3 - Thanks, Sergey and Mike!
4 - An exception to this rule is some specific characters commonly mapped to the same glyph, like U+0020/U+00a0/U+2002/U+2003/U+3000 mapping to the space and U+002d/U+00ad/U+2010/U+2011/U+2012 for the hyphen; in such cases, the mappings will not be considered ambiguous, even though the text mapped may or may not match the original code points that were converted to glyph ID values.
5 - Another intentional misspelling.
6 - Also, an unintentional misuse of the idiom due to lack of knowledge of how to represent it and spell it!

This post brought to you by ‒ (U+2012, a.k.a. FIGURE DASH)

# Tihiy on 24 Sep 2007 11:19 AM:

I've managed to solve my task by using GetGlyphIndices on whole Unicode range and reverse-looking up character.

# Michael S. Kaplan on 24 Sep 2007 11:37 AM:

That's cool -- as long as you recognize the similar limitation that "reverse lookup map" will have. But otherwise it is not an unlike solution, at all. :-)

# Mihai on 24 Sep 2007 12:14 PM:

<<I've managed to solve my task by using GetGlyphIndices on whole Unicode range>>

Except that GetGlyphIndices does not work outside BMP, so the Unicode range it is not quite whole :-)

# Mike on 24 Sep 2007 12:59 PM:

I really want to stress that if you are required to get characters back out of a glyph index string your first effort should be to change your design, not to start writing code to do the reverse mapping. It's completely impossible to do in general because there is a many-to-many mapping between characters and indexes. You can get it to mostly-sorta-work-on-occasion but it will never be fully reliable.
