Working beyond the BMP is going off script (according to GDI)

by Michael S. Kaplan, published on 2006/06/29 08:06 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/06/29/650680.aspx


Yesterday when I was talking about WYSIWYG font dropdowns and using the GetGlyphIndices function to determine if a font supported a character or set of characters, regular reader Mihai mentioned in a comment:

GetGlyphIndices does not handle chars outside BMP (true in XP SP2 and an older Vista build)

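This is true; it doesn't. The function looks at each individual UTF-16 code unit and checks whether that code unit is in the font, so a surrogate pair is treated as two separate lookups rather than as one supplementary character.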
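To make the limitation concrete, here is a small Python sketch of what a per-code-unit lookup gets to see. It is only an illustration of UTF-16, not a call into GDI, and the helper names `utf16_code_units` and `is_surrogate` are made up for this example.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_surrogate(unit):
    """True if the code unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF

bmp = utf16_code_units("\u09ab")        # BENGALI LETTER PHA, on the BMP
supp = utf16_code_units("\U00010400")   # DESERET CAPITAL LETTER LONG I, off the BMP

print([hex(u) for u in bmp])    # ['0x9ab']
print([hex(u) for u in supp])   # ['0xd801', '0xdc00']
print(all(is_surrogate(u) for u in supp))  # True
```

A function that checks code units one at a time only ever gets asked about D801 and DC00 separately, never about U+10400 itself.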

Now in theory you could create a font that puts surrogate code points in the ligature table, and that might work here, but (a) it won't help GetGlyphIndices, which will just report that the high surrogate and the low surrogate are each present and not whether an actual ligature is defined for the pair, and (b) it is not generally considered good typographic practice to create an "off the BMP" font that way.

(We don't mind it so much in keyboards, but that's a story for another day!)

There is actually a real solution that can handle things off the Basic Multilingual Plane -- the ScriptGetCMap function, which is described as being able to retrieve "...the glyph indexes of the Unicode characters in a string according to either the TrueType cmap table or the standard cmap table implemented for old style fonts."

It does the extra work for supplementary characters encoded as surrogate pairs, and it is even easier to use than GetGlyphIndices for determining whether the font supports the string: it simply returns S_FALSE if one or more of the code points were mapped to the default glyph.

There is one warning in the documentation which is theoretically troubling:

Note that some code points can be rendered by a combination of glyphs as well as by a single glyph — for example, 00C9; LATIN CAPITAL LETTER E WITH ACUTE. In this case, if the font supports the capital E glyph and the acute glyph but not a single glyph for 00C9, ScriptGetCMap will show 00C9 is unsupported.

In practice the situation is very uncommon, because in general any time a particular composite sequence is supported in a font, the precomposed character (if it exists) is supported as well. That is just the way fonts usually work; it is much less common for the precomposed character to be missing.

If one suspects it may be happening anyway, the docs describe how to handle that situation via the ScriptShape function.
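For readers who want to see the two spellings that the documentation warning contrasts, here is a short Python sketch using the standard `unicodedata` module. It shows only the Unicode side of the story (precomposed versus decomposed); it does not query any font, and the variable names are just for illustration.

```python
import unicodedata

precomposed = "\u00c9"  # LATIN CAPITAL LETTER E WITH ACUTE as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)

# The decomposed form is the base letter followed by the combining acute accent.
print(["U+%04X" % ord(c) for c in decomposed])  # ['U+0045', 'U+0301']

# Both spellings are canonically equivalent, but they are different code point sequences.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(decomposed == precomposed)                                # False
```

A cmap-only check asks about U+00C9 itself, which is why a font that can only draw the letter as E plus a combining acute would be reported as not supporting it.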

Now I was talking to Peter about this whole issue yesterday, and he pointed out that people could simply special case their code to use GetGlyphIndices for BMP cases and ScriptGetCMap for when things are off the BMP (in truth GDI handles nothing off the BMP anyway, so it is hardly an artificial split -- if you are using supplementary characters, you are using complex scripts as far as GDI is concerned).
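The split Peter describes boils down to one question: does the string contain anything off the BMP? Here is a minimal Python sketch of that decision. The function names `has_supplementary` and `pick_lookup_api` are hypothetical, and the sketch only returns the name of the Windows function one would call rather than calling it.

```python
def has_supplementary(text):
    """True if any code point in text lies beyond the BMP (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)

def pick_lookup_api(text):
    """Name the lookup function the BMP / non-BMP split would choose."""
    return "ScriptGetCMap" if has_supplementary(text) else "GetGlyphIndices"

print(pick_lookup_api("\u09ab"))       # GetGlyphIndices
print(pick_lookup_api("\U00010400"))   # ScriptGetCMap
print(pick_lookup_api("A\U00010400"))  # ScriptGetCMap
```

In native code working on a UTF-16 buffer, the equivalent test would presumably be a scan for code units in the surrogate range D800..DFFF.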

Though in truth, if one is going to use Uniscribe here, using it all the time when it is available is probably better, or at least more consistent. And why not use Uniscribe explicitly in little ways if it is going to be used implicitly in bigger ways like rendering anyway? :-)

 

This post brought to you by ফ (U+09ab, a.k.a. BENGALI LETTER PHA)


# Nick Lamb on 29 Jun 2006 11:26 AM:

"you could create a font that puts surrogate code points in the ligature table"

This would not be legal so far as I can see. Surrogate code points are not Unicode characters, they're just an artefact of UTF-16.

# Michael S. Kaplan on 29 Jun 2006 12:22 PM:

re: "not legal"

Well, notice I am not recommending it, and also notice that the meaning of "not legal" is unclear when there are no law enforcement personnel who are going to pull over a font. :-)
