by Michael S. Kaplan, published on 2005/09/23 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/23/472656.aspx
Joseph asked in the newsgroups the other day:
I have a problem rendering some Unicode characters. I have a particular instance although it is representative of a more general problem.
I have a sequence of five Unicode characters. They are 0A38 0A36 0A47 0A30 0A47, where the 0A47 characters are the problem. These are GURMUKHI VOWEL SIGN EE, a nominally zero-width diacritical mark.
My problem is that I have to render one "character" at a time. In this case, a "character" is the pair 0A36+0A47 or 0A30+0A47. My problem is to figure out how to render such sequences as single glyphs.
I find a lot of the Uniscribe documentation fairly incomprehensible. I tried to use GetCharacterPlacement to get the offsets, figuring that if I saw two offsets the same I'd have the pair to render. However, when presented with this string, and flags of 0, I get a return value of 0; the GCP_RESULTS values are all left unset, yet when I call GetLastError, I get a return value of 0 (S_OK).
What I'm looking for is a way to partition the sequence of characters into a sequence of glyphs that I can render. What I should be able to obtain here is the three substrings
0A38 0A36+0A47 0A30+0A47
which I can then render using TextOut (I'm using Arial Unicode MS as my font, in a Unicode application)
thanks
joe
The awful story is pretty clearly (some would say baldly) laid out in the GetCharacterPlacement documentation in the Platform SDK, the text I marked in pink:
The GetCharacterPlacement function retrieves information about a character string, such as character widths, caret positioning, ordering within the string, and glyph rendering. The type of information returned depends on the dwFlags parameter and is based on the currently selected font in the specified display context. The function copies the information to the specified GCP_RESULTS structure or to one or more arrays specified by the structure.
Although this function was once adequate for working with character strings, a need to work with an increasing number of languages and scripts has rendered it obsolete. It has been superseded by the functionality of the Uniscribe module. For more information, see Uniscribe.
The simple truth is that you can stick a fork in GetCharacterPlacement; it's done.
So, the string in question is U+0a38 U+0a36 U+0a47 U+0a30 U+0a47, which is to say:
GURMUKHI LETTER SA; GURMUKHI LETTER SHA; GURMUKHI VOWEL SIGN EE; GURMUKHI LETTER RA; GURMUKHI VOWEL SIGN EE
It would look something like this when it is all put together (assuming you are on a machine with rendering support): ਸਸ਼ੇਰੇ.
But the big question here is WHY only one "character" can be rendered at a time, when the point of complex script rendering is to not think that way but instead to do runs of text? It is all well and good to want to render it in chunks like
ਸ ਸ਼ੇ ਰੇ
but it is not entirely clear how that would satisfy the needs of the script. Why not just work with the full run? If you are using TextOutW or ExtTextOutW, it will forward the call to Uniscribe to make sure all the shaping happens for you....
As it turns out, Joe did have a specific reason for the request:
The problem is that the text layout requires special rendering. I can't change the requirement because it relates to how the layout is being done. My problem is that somehow I have to make fonts like this work within that constraint. Since the layout is also dynamic based on a maximum width of the widest character, I need to first compute the maximum width required, then render the characters to fit in the area. The height of the font is based on the number of character cells I need to fit into the space. So the initial thought of using DrawText won't work either. TextOut will render the text correctly if all the text were being displayed in a single line, but it is not; it actually must be rendered vertically, e.g.,
ABC
would render as
A
B
C
but I have to deal with rendering
Win
as
W
i
n
with the characters centered in the bounding box determined by the widest character.
I realize that the requirement is the problem, but I can't really change the requirement, and it was written with English in mind. Then there was a decision to localize, and I've been running "torture tests" trying to get worst-case scenarios. This was one of the first I fell over, and it represents a problem that will apply to many languages and many character sequences.
I would consider this a case of good internationalization at a superficial technical level and poor internationalization at a user level.
Why do I say that? Well, Joe is taking a UI paradigm and trying to apply it to every language, whether it makes sense or not. And there are a myriad of languages where this will make no sense, or destroy meaning due to breaking expected shaping behavior, or truly annoy users. Which is really not a good way to go, at all.
You could also think of it as the software coding equivalent of the difference between translation and localization -- one is an attempt to just transfer from a source to some targets, one is an attempt to truly understand the differences between languages and cultures and work with that knowledge to provide compelling features.
With that said, if you truly want to know where those boundaries are, then you can look into some specific Uniscribe functionality to query for the info, namely the ScriptBreak function, which will return the breaks in the string. ScriptBreak requires you to call ScriptItemize first to retrieve a set of SCRIPT_ITEM structs, each carrying a SCRIPT_ANALYSIS, one for each chunk of text. ScriptBreak will then fill in an array of SCRIPT_LOGATTR structs that give information about each code point:
typedef struct tag_SCRIPT_LOGATTR {
  BYTE fSoftBreak  :1;
  BYTE fWhiteSpace :1;
  BYTE fCharStop   :1;
  BYTE fWordStop   :1;
  BYTE fInvalid    :1;
  BYTE fReserved   :3;
} SCRIPT_LOGATTR;
So Joe, if you need a small sample written up, let me know -- but this one is pretty easy to just do (unlike some Uniscribe requirements out there!)....
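In that spirit, here is a minimal sketch of the partitioning step only, not a full Uniscribe sample. It assumes the fCharStop bits have already been obtained (in a real program, from ScriptItemize followed by ScriptBreak) and groups code units into the chunks Joe asked for. The names LogAttrSketch, Cluster, split_clusters, joe_text and joe_attrs are hypothetical, and the flag values for Joe's string are an assumption about what ScriptBreak would report (set on each base letter, clear on each U+0A47), not captured output.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the single SCRIPT_LOGATTR bit this sketch needs.
 * Hypothetical type; real code would use SCRIPT_LOGATTR from usp10.h. */
typedef struct {
    unsigned int fCharStop : 1;
} LogAttrSketch;

/* One chunk of text: code units [start, start + length). */
typedef struct {
    size_t start;
    size_t length;
} Cluster;

/*
 * Split n code units into clusters. A new cluster begins at index 0 and
 * at every later index whose fCharStop bit is set; any other code unit
 * is appended to the cluster before it.
 * Writes at most max_out clusters and returns how many were written.
 */
size_t split_clusters(const LogAttrSketch *attrs, size_t n,
                      Cluster *out, size_t max_out)
{
    size_t count = 0;
    assert(attrs != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (i == 0 || attrs[i].fCharStop) {
            if (count == max_out)
                break;              /* output array is full */
            out[count].start = i;
            out[count].length = 1;
            ++count;
        } else {
            ++out[count - 1].length;
        }
    }
    return count;
}

/* Joe's string from the question. */
const unsigned short joe_text[5] = { 0x0A38, 0x0A36, 0x0A47, 0x0A30, 0x0A47 };

/* ASSUMED fCharStop values: set on base letters, clear on vowel signs. */
const LogAttrSketch joe_attrs[5] = { {1}, {1}, {0}, {1}, {0} };
```

Under that assumption the five code units fall into three clusters, 0A38, 0A36+0A47 and 0A30+0A47. Each (start, length) pair could then be measured and drawn on its own, for example by passing joe_text + start and length to a text-output call, though whether per-chunk rendering suits a given script is a separate question.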
This post brought to you by "ੇ" (U+0a47, GURMUKHI VOWEL SIGN EE)
# Gabe on 23 Sep 2005 3:26 AM:
# Michael S. Kaplan on 23 Sep 2005 4:32 AM:
# Nicholas Allen on 23 Sep 2005 9:30 AM:
# Michael S. Kaplan on 23 Sep 2005 9:31 AM:
# David on 23 Sep 2005 5:51 PM:
# Michael S. Kaplan on 23 Sep 2005 7:51 PM:
Pierre on 13 Nov 2011 7:55 AM:
For some applications, TextOutW or ExtTextOutW are too simplistic. In my app I have to display the letters along a curved arc. Each letter requires creating an individual font with a different lfEscapement and lfOrientation. It would be nice to display the letters correctly.
Uniscribe only works on Vista and Windows 7. A lot of my customers are still using XP.
Michael S. Kaplan on 13 Nov 2011 8:12 AM:
Um, Uniscribe was added in Windows 2000 (and updated in XP and XP SP2). It requires complex script support to be installed until Vista, but that is a far cry from unavailable!
Pierre on 13 Nov 2011 9:28 AM:
The documentation for some functions (ex: ScriptGetFontFeatureTags, ScriptGetFontAlternateGlyphs, ScriptGetFontLanguageTags) says it requires "Minimum supported client: Windows Vista".
See:
msdn.microsoft.com/.../dd368547%28v=VS.85%29.aspx
Also, Visual Studio 2010 documentation.
Pierre on 13 Nov 2011 10:15 AM:
I'm trying to do something rather basic, and would prefer to use GetCharacterPlacement rather than wade into Uniscribe.
The problem I'm having is that for simple English script, the characters returned in lpGlyphs[] bear no resemblance to the original input. I must be doing something wrong.
Here is my code:
GCP_RESULTSW gcp_results;
wchar_t ww[] = L"Gnarfle";
int len = (int)wcslen(ww);
memset(&gcp_results, 0, sizeof(gcp_results));
gcp_results.lStructSize = sizeof(gcp_results);
gcp_results.lpOrder = (UINT *)calloc(len, sizeof(UINT)); // order after reordering
memset(gcp_results.lpOrder, 0, len * sizeof(UINT));
gcp_results.lpGlyphs = (WCHAR *)calloc(len, sizeof(WCHAR));
memset(gcp_results.lpGlyphs, 0, len * sizeof(WCHAR));
gcp_results.nGlyphs = len;
gcp_results.lpGlyphs[0] = 0; // normal ligation
GetCharacterPlacementW(hdc, ww, len, 0, &gcp_results,
                       GCP_GLYPHSHAPE | GCP_REORDER | GCP_LIGATE);
Results after calling GetCharacterPlacementW:
ww: [0]=0047/G [1]=006E/n [2]=0061/a [3]=0072/r [4]=0066/f [5]=006C/l [6]=0065/e
gcp_results.lpOrder [0]=0 [1]=1 [2]=2 [3]=3 [4]=4 [5]=5 [6]=6
gcp_results.lpGlyphs [0]=002A/* [1]=0051/Q [2]=0044/D [3]=0055/U [4]=0049/I [5]=004F/O [6]=0048/H
What is in 'lpGlyphs'??
I did a test with Arabic, and lpGlyphs returns with non-Arabic characters. Why doesn't this work?
Michael S. Kaplan on 13 Nov 2011 10:47 AM:
Okay, the docs are talking about some of the new functions they added -- not all functions. And none of those new functions are GetCharacterPlacement replacement functions.
GetCharacterPlacement is not being updated. If you want to do more than it does and you want to support XP, then moving to Uniscribe is your only option.
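An editorial aside on the lpGlyphs output quoted in the thread above: the GCP_RESULTS documentation describes lpGlyphs as receiving glyph indices in the selected font, not characters, so printing those values as characters produces unrelated letters. The posted numbers fit that reading, since every glyph value sits at the same distance below its input code unit. The sketch below only checks that arithmetic on the values from the comment; common_offset, pierre_chars and pierre_glyphs are hypothetical names, and the constant offset is a property of whatever font was selected, not something to rely on in general.

```c
#include <assert.h>
#include <stddef.h>

/*
 * If chars[i] - glyphs[i] is the same for every i, store that difference
 * in *offset and return 1. Otherwise (or if n is 0) return 0.
 */
int common_offset(const unsigned short *chars,
                  const unsigned short *glyphs,
                  size_t n, long *offset)
{
    assert(chars != NULL && glyphs != NULL && offset != NULL);
    if (n == 0)
        return 0;
    long first = (long)chars[0] - (long)glyphs[0];
    for (size_t i = 1; i < n; ++i) {
        if ((long)chars[i] - (long)glyphs[i] != first)
            return 0;
    }
    *offset = first;
    return 1;
}

/* Input code units for L"Gnarfle", as listed in the comment. */
const unsigned short pierre_chars[7] =
    { 0x0047, 0x006E, 0x0061, 0x0072, 0x0066, 0x006C, 0x0065 };

/* Values reported in gcp_results.lpGlyphs, as listed in the comment. */
const unsigned short pierre_glyphs[7] =
    { 0x002A, 0x0051, 0x0044, 0x0055, 0x0049, 0x004F, 0x0048 };
```

For the posted data the difference is 0x1D at every position, which is what one would expect if the font simply lays out its basic Latin glyphs in code-point order. The same reasoning would explain why an Arabic test string comes back looking like non-Arabic characters: the array holds glyph numbers, which are only meaningful to the font.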