On character justification (in both senses)

by Michael S. Kaplan, published on 2010/01/26 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/01/26/9952907.aspx

A few days ago, via several different methods (the Visual C++ Development Center forum, email to my non-Microsoft account, the contact link here, and multiple off-topic comments, each showing more impatience for a solution than the last), Rajesh asked:

Hello Michael:

I have visited your blog, and know that you are an expert in Windows Uniscribe, here I have some questions about Uniscribe to ask you.

Inter-character spacing for labeling results in a composite text collection with each character being split as a separate one. Hence each character is presented as a separate one and cannot arrive at a combination character. Problem with combinational characters is not only specific to right to left language( Arabic Language- Example:يُساوِي), the problem can exist with left to right language(Hindi Language - Example:ठऑक्षझॉ) also.

So,Please let us know if there exists any API that identifies the given set of pre composed characters comprises a composite character.

Thanks in advance,
Rajesh Reddy

Now of course I generally can't do the kind of 1-on-1 support that the many messages entailed, and people who are looking for support like that really need to find a more appropriate method, as I point out in my Contacting Me link.

But the question is an interesting one, and the blog that was going to be put in for today has to have a bit more done to it, so I thought I'd take a stab at it.

For starters we'll have to take the word composite out of the mix. Not that the word isn't descriptive enough; it just carries some baggage with it. It can confuse people into thinking the question is more about code pages and the difference between what Microsoft calls composite vs. precomposed sequences. This is the problem that the support engineer had in this forum thread at first.

Now the biggest problem is in the assumption that simply adding space in between every character is the right thing to do, as any language/script that does shaping when certain characters are placed next to each other will fail -- and this is the very problem that Rajesh points out.

What someone trying to do a complex operation like full justification could use is the information that Uniscribe returns from its ScriptString_pLogAttr function (if one is using the ScriptString* functions) or its ScriptBreak function (if one is calling the fuller low-level Uniscribe functions) -- in particular, the array of SCRIPT_LOGATTR structures that each function returns, which will, for each character Uniscribe is processing, provide all of the following information:

whether breaking the line in front of the character, called a "soft break", is valid;
whether the character is one of the many Unicode characters classified as breakable white space, which can break a word;
whether the character is a valid position for showing the caret upon a character movement keyboard action (set for most characters, but not on code points inside Indian and Southeast Asian character clusters - it can be used to implement LEFT ARROW and RIGHT ARROW operations in editors);
whether the character is a valid position for showing the caret upon a word movement keyboard action (it can be used to implement the CTRL+LEFT ARROW and CTRL+RIGHT ARROW operations in editors);
whether the character is or is part of an invalid or undisplayable combination.

Now once one has all of this information, one knows the safe places where space can be inserted if one is trying to extend the width of a line in order to make the justification match other lines, if one is using simple space insertion to do so.
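To make the idea of "safe places" concrete, here is a minimal sketch in Python rather than a real Uniscribe call. The LogAttr class is a hypothetical stand-in whose field names mirror the SCRIPT_LOGATTR flags listed above; the helper names split_at_char_stops and safe_gap_positions are made up for illustration, and the flag values in the demo are filled in by hand instead of coming from ScriptBreak. It also assumes one attribute record per Python character, which only lines up with Uniscribe's UTF-16 code units for text without surrogate pairs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LogAttr:
    """Hypothetical stand-in for the per-character flags described above."""
    fSoftBreak: bool = False   # a line may break in front of this character
    fWhiteSpace: bool = False  # breakable white space
    fCharStop: bool = False    # valid caret position for character movement
    fWordStop: bool = False    # valid caret position for word movement
    fInvalid: bool = False     # part of an invalid/undisplayable combination


def split_at_char_stops(text: str, attrs: List[LogAttr]) -> List[str]:
    """Group characters into clusters; each fCharStop starts a new cluster."""
    if len(text) != len(attrs):
        raise ValueError("need exactly one attribute record per character")
    clusters: List[str] = []
    for ch, attr in zip(text, attrs):
        if attr.fCharStop or not clusters:
            clusters.append(ch)
        else:
            clusters[-1] += ch  # stays glued to the preceding character
    return clusters


def safe_gap_positions(attrs: List[LogAttr]) -> List[int]:
    """Indices i > 0 where extra space could go in front of character i."""
    return [i for i, a in enumerate(attrs) if i > 0 and a.fCharStop]


# Hand-filled flags for "a" + COMBINING ACUTE ACCENT + "b" (an assumption,
# not real ScriptBreak output): the combining mark is not a caret stop.
demo_text = "a\u0301b"
demo_attrs = [LogAttr(fCharStop=True), LogAttr(), LogAttr(fCharStop=True)]
demo_clusters = split_at_char_stops(demo_text, demo_attrs)
demo_gaps = safe_gap_positions(demo_attrs)
```

In the demo the combining accent stays attached to its base letter, so the only gap offered is the one in front of the "b".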

But this is the wrong approach.

Note that in pretty much all cases such an algorithm has a pretty fundamental flaw, which is that the actual widths one might need to insert can be different and using full characters between the words will make the text jagged on the far edge (as can the different widths of the words themselves).

The better way to perform such operations is by use of the ScriptJustify function, possibly supplemented by a more advanced editor's own logic, as the function's documentation indicates:

This function provides a simple implementation of multilingual justification. It establishes the amount of adjustment to make at each glyph position on the line. It interprets the SCRIPT_VISATTR array generated by a call to ScriptShape, giving top priority to kashida. The function uses interword spacing if no kashida points are available. It uses intercharacter spacing if no interword points are available.

Note: Sophisticated text formatters might generate their own delta dx array by combining formatter-specific features with the information retrieved by ScriptShape in the SCRIPT_VISATTR array.

The application should pass the justified advance widths generated by ScriptJustify to ScriptTextOut in the piJustify parameter.

ScriptJustify creates a justified array containing updated advance widths for each glyph. When an advance width for a glyph is increased, the extra width is rendered to the right of the glyph, with a white space or, for Arabic text, a kashida.
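The priority order in that description (kashida first, then interword, then intercharacter) can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions, not what ScriptJustify actually does internally: the opportunity labels and the justify_advances helper are hypothetical, the extra width is simply split evenly among the highest-priority opportunities that exist, and details such as a minimum kashida width are ignored.

```python
from typing import List

# Hypothetical opportunity classes, listed from highest to lowest priority.
KASHIDA, INTERWORD, INTERCHAR, NONE = "kashida", "interword", "interchar", "none"
PRIORITY = (KASHIDA, INTERWORD, INTERCHAR)


def justify_advances(advances: List[int], kinds: List[str], extra: int) -> List[int]:
    """Return widened advance widths that sum to sum(advances) + extra.

    advances: original advance width of each glyph, in pixels
    kinds:    the justification opportunity each glyph offers
    extra:    pixels to add so the line reaches the desired width
    """
    if len(advances) != len(kinds):
        raise ValueError("need one opportunity label per glyph")
    if extra < 0:
        raise ValueError("this sketch only widens lines")
    justified = list(advances)
    for kind in PRIORITY:
        slots = [i for i, k in enumerate(kinds) if k == kind]
        if not slots:
            continue  # nothing of this class on the line; try the next one
        share, remainder = divmod(extra, len(slots))
        for n, i in enumerate(slots):
            # Leftover pixels go to the first few slots, one each.
            justified[i] += share + (1 if n < remainder else 0)
        return justified
    return justified  # no opportunities at all: widths are left unchanged


# A line with two interword gaps and no kashida points, 7 pixels short.
demo = justify_advances(
    [10, 4, 12, 4, 9],
    [INTERCHAR, INTERWORD, INTERCHAR, INTERWORD, INTERCHAR],
    7,
)
```

Here demo comes out as [10, 8, 12, 7, 9]: only the two interword glyphs grow, because intercharacter spacing is used only when no interword points exist. The point of the sketch is that the adjustment is a per-glyph width in pixels, not a whole space character inserted into the string.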

This is the Uniscribe model for dealing with the kind of advanced justification one might see in a program like Word or PowerPoint or Publisher -- as it can be used to precisely place text to allow desired justification to take place....

For the other issue, the way of getting my (or anyone's) attention, I expect things will go fine in most cases if one just thinks of me not as an employee of one's own company, or as one's personal employee, but as someone who has a job and really just blogs because it is fun and interesting to talk about the things that interest me (such as Uniscribe). If you met such a person, how would you approach them? If you had their email address, how would you word the email? And what would your expectation be? I expect the majority of people who frame the question that way will come up with an appropriate answer.

If the answer is needed urgently (which I assume it is) then there are many more formal support options that will guarantee the timeliness of the response, much more effectively than shouting the question from the rooftops (sometimes I end up involved with those too, and I serve at the pleasure of the customer).

I mean if they have an interesting enough question maybe I'll answer anyway. But my interests are pretty hard to pin down sometimes, and even the girl I go out with wonders how she catches my eye (though she does and I suppose once one catches my eye and not my ire then the hardest part is taken care of!).

And all of that is ignoring the challenges of figuring out my blogging schedule!

Kemp on 27 Jan 2010 10:42 AM:

I'm not even famous, and I can sympathise with people taking that approach. I can't begin to imagine what it's like for you. I had one guy who, after realising that I knew some aspects of coding, decided that every time he had a problem it should result in me receiving PMs on a forum we frequented, IMs (even if I wasn't logged in at the time), emails at least once, and finally an actual post on the forum where he actually had a chance of being answered by someone else. I have no idea how people's minds work sometimes.

Michael S. Kaplan on 27 Jan 2010 11:31 AM:

Well of course the harder part is the notes that get more insistent. I used to be a consultant (and still am, kind of) so usually I'd point out my rates (at least USD$350/hour) when people want to go that route....

Kemp on 27 Jan 2010 11:42 AM:

s/I can sympathise with people taking that approach/I can sympathise with you regarding people taking that approach/

I've just realised that means something a bit different.

Michael S. Kaplan on 27 Jan 2010 1:50 PM:

Both make sense, though yes they mean two very different things - essentially who is most sympathetic!

MG on 29 Jan 2010 11:39 AM:

As I must remind my family every holiday, just because I work in the computer industry does not mean I will spend my vacation fixing your computer.

As I said to my dad, "If you have a friend who is a dentist, do you demand a free cleaning every time you meet?"

rajesh on 5 Feb 2010 3:28 AM:

Thank you Michael. The above blog is very useful and I am trying to get the solution for my problem. And we have one more question on this issue.

We need to split the Unicode characters honoring combined characters and ligatures. Given a substring we can identify whether it is a combining character or not using IsScriptShape or .Net’s StringInfo.CombiningCharacters API. However we need an API that identifies whether a given substring wholly forms a ligature so that we will not split such substrings.

Michael S. Kaplan on 5 Feb 2010 7:24 AM:

See the first part of this very blog, the non-solution for justification, for the complete information on valid break points -- just don't use that as the sole way to justify as you originally intended but as the way to find break points, and you will have what you want....

Rajesh on 2 Jul 2010 5:12 AM:

I have tried with ScriptShape and ScriptBreak Uniscribe API and was not successful.

My requirement is to split the string as shown below. The “Original String” should be split as shown under “Correct Split” preserving the combined characters as it is.

Original String: شارع الخراج

Correct Split: شا ر ع ا لخر ا ج

Wrong Split: ش ا رع ا ل خ ر ا ج

Michael S. Kaplan on 2 Jul 2010 5:29 AM:

Since I said that was the wrong way to do it, I am not surprised (though since you did not describe how they were used it is unclear that they were used correctly, either). I said that the ScriptJustify function is the right way to justify here....


referenced by

2010/08/31 And how exactly do you justify those frigging kashidas?
