Short-sighted text processing #3: The Protcols of the EDIT for i18n

by Michael S. Kaplan, published on 2010/12/30 07:01 -05:00, original URI:

Previous blogs in this series:

Today would be another case where the role of Uniscribe is misunderstood, and the behavior of the things that aren't Uniscribe might be called into question a bit.

Now over in the Suggestion Box, Shachar Shemesh asked:

I wanted to ask about Uniscribe's suggested policy of passing strings to have their BiDi levels calculated and reordered only after the word wrap. It seems that this suggestion would create a BiDi display that is compatible with neither the Unicode BiDi algorithm nor common sense. What's worse, it seems like Windows (at least on Windows XP) actually does it this way.

One example is using LRO over the entire text. Windows, at least in an edit control, seems to "forget" the override as soon as the line break comes. A more subtle, but also more serious, example (as it doesn't use any control characters that no one has heard of) is the following. Assume an LTR paragraph:

english HEBREW 123 AND MORE.

If the text fits within a single line, Windows correctly reorders it as:

english EROM DNA 123 WERBEH.

Notice how the "123" is on the right of the "AND MORE". If a line break takes place immediately after the "HEBREW", the text looks like this:

english WERBEH
123 EROM DNA.

The 123 is on the left of the "AND MORE", which neither makes sense nor is standards-conforming.

Funny how the text went right to talking about Uniscribe policies and Unicode conformance.

Anyway, it isn't actually Uniscribe.

It was Andrew who unraveled the mystery for me.

What he found was that the error seems to be in Notepad and the simple shell EDIT control. Although both Wordpad/RichEdit and Word retain the directional state, Notepad/EDIT forgets it on a line wrap. Since the String* Uniscribe functions that are being used here are working at a per-line level, Notepad considering each line to terminate a run, regardless of whether it ends a paragraph or not, is not entirely unreasonable.
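To make the difference concrete, here is a minimal Python sketch of the two orderings of work: resolving levels per line (what EDIT appears to do) versus resolving them over the whole paragraph and only then reordering each line (what the Unicode BiDi algorithm describes). This is not Uniscribe or EDIT code. It assumes a toy classifier in which uppercase letters stand in for Hebrew, as in the example above, it only handles an LTR paragraph, and it implements a small subset of the UAX #9 rules (roughly W7, N1/N2, I1 and L2). All function names are hypothetical.

```python
def classify(ch):
    """Toy bidi classes: uppercase stands in for Hebrew (R)."""
    if ch.isdigit():
        return "EN"
    if ch.isalpha():
        return "R" if ch.isupper() else "L"
    return "N"  # spaces and punctuation are treated as neutrals


def resolve_levels(text):
    """Resolve levels for an LTR paragraph (simplified subset of UAX #9)."""
    types = [classify(ch) for ch in text]

    # W7-like: a number whose preceding strong type is L (or start) becomes L.
    last_strong = "L"
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            types[i] = "L"

    def direction(t):
        # For neutral resolution, remaining numbers count as R.
        return "R" if t in ("R", "EN") else "L"

    # N1/N2: neutrals take the surrounding direction if both sides agree,
    # otherwise the paragraph direction (L here).
    i, n = 0, len(types)
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        before = direction(types[i - 1]) if i > 0 else "L"
        after = direction(types[j]) if j < n else "L"
        types[i:j] = [before if before == after else "L"] * (j - i)
        i = j

    # I1 for an even base level: L -> 0, R -> 1, EN -> 2.
    return [{"L": 0, "R": 1, "EN": 2}[t] for t in types]


def reorder(text, levels):
    """Rule L2: reverse runs from the highest level down to level 1."""
    chars = list(text)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)


def display_per_line(lines):
    """EDIT-like: every wrapped line is resolved as its own paragraph."""
    return [reorder(line, resolve_levels(line)) for line in lines]


def display_per_paragraph(lines):
    """Resolve levels once over the paragraph, then reorder each line."""
    levels = resolve_levels(" ".join(lines))
    shown, start = [], 0
    for line in lines:
        end = start + len(line)
        shown.append(reorder(line, levels[start:end]))
        start = end + 1  # skip the space consumed by the line break
    return shown


wrapped = ["english HEBREW", "123 AND MORE."]
print(display_per_line(wrapped))       # ['english WERBEH', '123 EROM DNA.']
print(display_per_paragraph(wrapped))  # ['english WERBEH', 'EROM DNA 123.']
```

In the per-line version the "123" starts a fresh line with no preceding Hebrew, so it resolves as plain left-to-right text and ends up on the left. In the per-paragraph version it keeps the level it earned from the Hebrew before the line break.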

Now Wordpad/RichEdit's behavior is slightly different here, though still perhaps less than perfect. It does not consider a new line to be a new run, though it does consider a hard line break to be a run boundary. Thus an inserted RLO will be broken across paragraph boundaries, not line boundaries.
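The two policies can be modeled in a few lines of Python. This is only a toy model of the behavior described above, not how either control is implemented: it handles a single RLO (U+202E) in otherwise left-to-right text, ignores PDF and every other BiDi rule, and the `render` function and its `reset_on_soft_wrap` flag are hypothetical names.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE


def render(paragraphs, reset_on_soft_wrap):
    """Return the displayed lines for a list of paragraphs.

    Each paragraph is a list of soft-wrapped lines. A hard line break
    (a new paragraph) always drops the override. A soft wrap drops it
    only when reset_on_soft_wrap is True (the EDIT-like policy).
    """
    shown = []
    for lines in paragraphs:
        override = False  # hard break: directional state starts fresh
        for line in lines:
            if reset_on_soft_wrap:
                override = False  # EDIT-like: forget state at every wrap
            if not override and RLO in line:
                head, tail = line.split(RLO, 1)
                override = True
                # Text after the RLO is displayed right to left.
                shown.append(head + tail.replace(RLO, "")[::-1])
            elif override:
                shown.append(line.replace(RLO, "")[::-1])
            else:
                shown.append(line)
    return shown


doc = [[RLO + "abc def", "ghi"], ["jkl"]]
print(render(doc, reset_on_soft_wrap=True))   # ['fed cba', 'ghi', 'jkl']
print(render(doc, reset_on_soft_wrap=False))  # ['fed cba', 'ihg', 'jkl']
```

With the EDIT-like policy the override is lost on the second wrapped line. With the RichEdit-like policy it survives the wrap and is lost only at the hard return.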

You can see them contrasted here:

And those hard returns can show the different RichEdit behavior, as right here:

Now neither of these things can rightfully be called Uniscribe policy, though when one considers that they are in the two core text edit controls (Shell EDIT and RichEdit), these two distinct behaviors are going to be pretty prevalent.

The different (smarter) behavior in Word of course indicates that anyone can do things their own way and not be required to behave similarly. Though in practice most people won't, so the behaviors of these two controls will be more common.

The Shell EDIT/Notepad behavior is following a particular design and I wouldn't necessarily feel comfortable trying to push for a change there.

The RichEdit/Wordpad behavior I am more willing to consider a bug, since Word compatibility in the editing experience is often a goal of the control.

Though the owners of the control might disagree.

In general it would make more sense if both behaviors were configurable, since a case could be made that neither is ideal for every case. Though I imagine it would be hard to find anyone willing to spec that work, do it, and test it....

and Uniscribe isn't doing any of it by its own policies; it is The Protcols of the EDIT of Microsoft....

Cheong on 30 Dec 2010 6:08 PM:

I wonder if the missing 'o' in Protocal is intented because it's both wrong in the title and the last paragraph.

Michael S. Kaplan on 30 Dec 2010 11:32 PM:

Ssshhhh! Don't mention that, it's foreshadowing for part 4!
