Stick a fork in GetCharacterPlacement

by Michael S. Kaplan, published on 2005/09/23 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/23/472656.aspx

Joseph asked in the newsgroups the other day:

I have a problem rendering some Unicode characters. I have a particular instance although it is representative of a more general problem.

I have a sequence of five Unicode characters. They are 0A38 0A36 0A47 0A30 0A47, where the 0A47 characters are the problem. These are GURMUKHI VOWEL SIGN EE, a nominally zero-width diacritical mark.

My problem is that I have to render one "character" at a time. In this case, a "character" is the pair 0A36+0A47 or 0A30+0A47. My problem is to figure out how to render such sequences as single glyphs.

I find a lot of the Uniscribe documentation fairly incomprehensible. I tried to use GetCharacterPlacement to get the offsets, figuring that if I saw two offsets the same I'd have the pair to render. However, when presented with this string, and flags of 0, I get a return value of 0; the GCP_RESULTS values are all left unset, yet when I call GetLastError, I get a return value of 0 (S_OK).

What I'm looking for is a way to partition the sequence of characters into a sequence of glyphs that I can render. What I should be able to obtain here is the three substrings

0A38 0A36+0A47 0A30+0A47

which I can then render using TextOut (I'm using Arial Unicode MS as my font, in a Unicode application)

thanks
joe

The awful story is pretty clearly (some would say baldly) laid out in the GetCharacterPlacement documentation in the Platform SDK, the text I marked in pink:

The GetCharacterPlacement function retrieves information about a character string, such as character widths, caret positioning, ordering within the string, and glyph rendering. The type of information returned depends on the dwFlags parameter and is based on the currently selected font in the specified display context. The function copies the information to the specified GCP_RESULTS structure or to one or more arrays specified by the structure.

Although this function was once adequate for working with character strings, a need to work with an increasing number of languages and scripts has rendered it obsolete. It has been superseded by the functionality of the Uniscribe module. For more information, see Uniscribe.

The simple truth is that you can stick a fork in GetCharacterPlacement; it's done.

So, the string in question is U+0a38 U+0a36 U+0a47 U+0a30 U+0a47, which is to say:

GURMUKHI LETTER SA; GURMUKHI LETTER SHA; GURMUKHI VOWEL SIGN EE; GURMUKHI LETTER RA; GURMUKHI VOWEL SIGN EE

It would look something like this when it is all put together (assuming you are on a machine with rendering support): ਸਸ਼ੇਰੇ.

But the big question here is WHY only one "character" can be rendered at a time, when the whole point of complex script rendering is to stop thinking that way and work with runs of text instead. It is all well and good to want to render it in chunks like

ਸ ਸ਼ੇ ਰੇ

but it is not entirely clear how that would satisfy the needs of the script. Why not just work with the full run? If you are using TextOutW or ExtTextOutW, it will forward the call to Uniscribe to make sure all the shaping happens for you....

As it turns out, Joe did have a specific reason for the request:

The problem is that the text layout requires special rendering. I can't change the requirement because it relates to how the layout is being done. My problem is that somehow I have to make fonts like this work within that constraint. Since the layout is also dynamic based on a maximum width of the widest character, I need to first compute the maximum width required, then render the characters to fit in the area. The height of the font is based on the number of character cells I need to fit into the space. So the initial thought of using DrawText won't work either. TextOut will render the text correctly if all the text were being displayed in a single line, but it is not; it actually must be rendered vertically, e.g.,

ABC

would render as

A
B
C

but I have to deal with rendering

Win

as

W
i
n

with the characters centered in the bounding box determined by the widest character.

I realize that the requirement is the problem, but I can't really change the requirement, and it was written with English in mind. Then there was a decision to localize, and I've been running "torture tests" trying to get worst-case scenarios. This was one of the first I fell over, and it represents a problem that will apply to many languages and many character sequences.

I would consider this a case of good internationalization at a superficial technical level and poor internationalization at a user level.

Why do I say that? Well, Joe is taking a UI paradigm and trying to apply it to every language, whether it makes sense or not. And there are a myriad of languages where this will make no sense, or destroy meaning due to breaking expected shaping behavior, or truly annoy users. Which is really not a good way to go, at all.

You could also think of it as the software coding equivalent of the difference between translation and localization -- one is an attempt to just transfer from a source to some targets, one is an attempt to truly understand the differences between languages and cultures and work with that knowledge to provide compelling features.

With that said, if you truly want to know where those boundaries are, you can ask Uniscribe for them with the ScriptBreak function, which reports the breaks in the string. ScriptBreak requires you to call ScriptItemize first to retrieve a set of SCRIPT_ITEM structs, one for each run of text, each containing a SCRIPT_ANALYSIS. ScriptBreak then fills in an array of SCRIPT_LOGATTR structs, one per UTF-16 code unit, with information about each one:

typedef struct tag_SCRIPT_LOGATTR {
    BYTE fSoftBreak  :1;
    BYTE fWhiteSpace :1;
    BYTE fCharStop   :1;
    BYTE fWordStop   :1;
    BYTE fInvalid    :1;
    BYTE fReserved   :3;
} SCRIPT_LOGATTR;

So Joe, if you need a small sample written up, let me know -- but this one is pretty easy to just do (unlike some Uniscribe requirements out there!)....
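For anyone who wants a starting point in the meantime, here is a rough sketch of the last step only: turning the fCharStop flags into the substrings Joe asked for. It deliberately leaves out the ScriptItemize and ScriptBreak calls so that it compiles without the Windows headers. The flags are passed in as a plain byte array standing in for the fCharStop bit of each SCRIPT_LOGATTR, and cluster_starts is a made-up helper name, not a Uniscribe function.

```c
#include <assert.h>
#include <stddef.h>

/* char_stop[i] stands in for psla[i].fCharStop as filled in by ScriptBreak:
   nonzero when a cluster may start at code unit i.
   Writes the start index of each cluster into starts[] (a capacity of n
   entries is always enough) and returns the number of clusters.
   Cluster k covers [starts[k], starts[k + 1]); the last one ends at n.
   cluster_starts is a hypothetical helper, not part of Uniscribe. */
size_t cluster_starts(const unsigned char *char_stop, size_t n, size_t *starts)
{
    size_t i, count = 0;

    assert(char_stop != NULL && starts != NULL);
    for (i = 0; i < n; i++) {
        /* Index 0 always opens a cluster, even if its flag is clear. */
        if (i == 0 || char_stop[i])
            starts[count++] = i;
    }
    return count;
}
```

For Joe's string U+0A38 U+0A36 U+0A47 U+0A30 U+0A47, flags of 1, 1, 0, 1, 0 (assumed here, not captured from a real ScriptBreak call) give cluster starts at 0, 1 and 3, which is the 0A38 / 0A36+0A47 / 0A30+0A47 split he was after. Each substring can then be measured and drawn on its own.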

This post brought to you by "ੇ" (U+0a47, GURMUKHI VOWEL SIGN EE)

# Gabe on 23 Sep 2005 3:26 AM:

I must say that writing English vertically has the same problem of breaking expected shaping behavior and truly annoying users. Are there any scripts besides CJK ones that don't have a problem when written vertically?

# Michael S. Kaplan on 23 Sep 2005 4:32 AM:

Hi Gabe -- there are a few that expect some form of vertical writing to be possible, like Mongolian and CJK. But as you mention, for most languages it is at least an annoyance....

# Nicholas Allen on 23 Sep 2005 9:30 AM:

Even CJK text won't quite work if you try to do it manually like this. Some characters take different forms and positions in vertical writing than they do in horizontal writing.

# Michael S. Kaplan on 23 Sep 2005 9:31 AM:

Absolutely, Nicholas -- and the answer is Uniscribe! :-)

# David on 23 Sep 2005 5:51 PM:

Sometimes when Uniscribe is incomprehensible, I've found it useful to examine other approaches. Such as IBM's "International Components for Unicode" library (ICU). It's nicely implemented in both C and Java, and since it's open source, you can figure out exactly what it's doing!

See: http://www-306.ibm.com/software/globalization/icu/index.jsp

# Michael S. Kaplan on 23 Sep 2005 7:51 PM:

David -- ICU does not have full support for rendering as many scripts.... and this really is a rendering issue.

# Pierre on 13 Nov 2011 7:55 AM:

For some applications, TextOutW or ExtTextOutW are too simplistic. In my app I have to display the letters along a curved arc. Each letter requires creating an individual font with a different lfEscapement and lfOrientation. It would be nice to display the letters correctly.

Uniscribe only works on Vista and Windows 7. A lot of my customers are still using XP.

# Michael S. Kaplan on 13 Nov 2011 8:12 AM:

Um, Uniscribe was added in Windows 2000 (and updated in XP and XP SP2). It requires complex script support to be installed until Vista, but that is a far cry from unavailable!

# Pierre on 13 Nov 2011 9:28 AM:

The documentation for some functions (ex: ScriptGetFontFeatureTags, ScriptGetFontAlternateGlyphs, ScriptGetFontLanguageTags) says it requires "Minimum supported client: Windows Vista".

See:

msdn.microsoft.com/.../dd368547%28v=VS.85%29.aspx

Also, Visual Studio 2010 documentation.

# Pierre on 13 Nov 2011 10:15 AM:

I'm trying to do something rather basic, and would prefer to use GetCharacterPlacement rather than wade into Uniscribe.

The problem I'm having is that for simple English script, the characters returned in lpGlyphs[] bear no resemblance to the original input. I must be doing something wrong.

Here is my code:

GCP_RESULTSW gcp_results;
wchar_t ww[] = L"Gnarfle";
int len = (int)wcslen(ww);

memset(&gcp_results, 0, sizeof(gcp_results));
gcp_results.lStructSize = sizeof(gcp_results);

// calloc already zero-fills both arrays, so no extra memset is needed
gcp_results.lpOrder = (UINT *)calloc(len, sizeof(UINT)); // order after reordering
gcp_results.lpGlyphs = (WCHAR *)calloc(len, sizeof(WCHAR));
gcp_results.nGlyphs = len;
gcp_results.lpGlyphs[0] = 0; // normal ligation

GetCharacterPlacementW(hdc, ww, len, 0, &gcp_results,
                       GCP_GLYPHSHAPE | GCP_REORDER | GCP_LIGATE);

Results after calling GetCharacterPlacementW:

ww: [0]=0047/G [1]=006E/n [2]=0061/a [3]=0072/r [4]=0066/f [5]=006C/l [6]=0065/e

gcp_results.lpOrder [0]=0 [1]=1 [2]=2 [3]=3 [4]=4 [5]=5 [6]=6

gcp_results.lpGlyphs [0]=002A/* [1]=0051/Q [2]=0044/D [3]=0055/U [4]=0049/I [5]=004F/O [6]=0048/H

What is in 'lpGlyphs'??

I did a test with Arabic, and lpGlyphs returns with non-Arabic characters. Why doesn't this work?

# Michael S. Kaplan on 13 Nov 2011 10:47 AM:

Okay, the docs are talking about some of the new functions they added -- not all functions. And none of those new functions are GetCharacterPlacement replacement functions.

GetCharacterPlacement is not being updated. If you want to do more than it does and you want to support XP, then moving to Uniscribe is your only option.
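On the "What is in 'lpGlyphs'??" question: the numbers Pierre posted are consistent with lpGlyphs holding glyph indices rather than characters. The GCP_RESULTS documentation describes that array as receiving values that identify the glyphs, and every value in his output differs from the matching input code unit by the same amount, 0x1D. The sketch below checks that using only the values from the comment above. constant_offset and the two array names are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 and stores the common difference in *offset when every
   chars[i] - glyphs[i] has the same value; returns 0 otherwise.
   constant_offset is a hypothetical helper name, not a GDI function. */
int constant_offset(const unsigned short *chars, const unsigned short *glyphs,
                    size_t n, int *offset)
{
    size_t i;

    assert(chars != NULL && glyphs != NULL && offset != NULL);
    if (n == 0)
        return 0;
    *offset = (int)chars[0] - (int)glyphs[0];
    for (i = 1; i < n; i++) {
        if ((int)chars[i] - (int)glyphs[i] != *offset)
            return 0;
    }
    return 1;
}

/* Values copied from the ww and gcp_results.lpGlyphs dumps in the comment. */
static const unsigned short gnarfle_chars[7] =
    {0x0047, 0x006E, 0x0061, 0x0072, 0x0066, 0x006C, 0x0065};
static const unsigned short gnarfle_glyphs[7] =
    {0x002A, 0x0051, 0x0044, 0x0055, 0x0049, 0x004F, 0x0048};
```

The fixed offset is a property of how that particular font happens to number its glyphs, so it is nothing to rely on. Glyph values like these are meant to be handed to ExtTextOut with the ETO_GLYPH_INDEX flag rather than read back as text, which would also explain why the Arabic test appeared to return non-Arabic "characters".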
