The Bidi Algorithm's own SEP Field

by Michael S. Kaplan, published on 2008/08/25 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/08/25/8893033.aspx


 

There are many nice things that I can (and sometimes do) say about Unicode Standard Annex #9 (Unicode Bidirectional Algorithm), which I will call for the rest of this blog the UBA in order to avoid the repetitive and tiresome nature of "Unicode Bidirectional Algorithm". I know that it is not a pronoun, but saying all those nouns over and over again really does wear you down, so whatever shortcut works. :-)

Anyway, what was I talking about?

Oh yeah, about how there are so many nice things that I can (and sometimes do) say about UBA.

This blog is not about any of them.

Instead, this blog is going to focus on two particular limitations that in my opinion make the UBA less useful in software.

I am thinking mainly about Windows, but after listening to people who work on the Mac and in Linux I think this is really a platform agnostic set of issues.

Now I know some people think the issues are with input, but really they aren't. I mean I mentioned in blogs like Mirroring and Keyboards are complicated but that isn't what makes this really hard for application developers, most of the time. And it isn't why applications have bad or inconsistent behavior, by and large.

In fact, it is not the input itself that is to blame but the rendering -- so cursor movement and all that are interesting but most of it is okay often enough that people would probably not notice problems if other things weren't going on.

Plus, those other items are kind of subject to some variability based on platform and expectations, so while recommendations are nice these are not the blocking issues.

I am therefore going to be looking elsewhere.

The two issues I am focusing on here are:

Now these items are ones I started really jumping into with other blogs like Mixing it up with bidirectional text and The Bug(s) Spotted, aka Design flaws are worse than bugs and The mythical nature of bidirectional support, and where the wheels come off the wagon.

The simple problem is best stated as:

The Unicode Bidirectional Algorithm cannot handle text from both left-to-right and right-to-left languages together in the same line of text.

That is it, right there.

Sure, the UBA has all of that hand-wavey text about "higher level protocols" but all they really did was create their own SEP field.

You know what an SEP is, right?

It's a Douglas Adams thing, so I'll let him explain it:

An SEP is something we can't see, or don't see, or our brain doesn't let us see, because we think that it's somebody else's problem.... The brain just edits it out, it's like a blind spot. If you look at it directly you won't see it unless you know precisely what it is. Your only hope is to catch it by surprise out of the corner of your eye.

This basically also explains why Unicode hasn't dealt with the issue, since they rely "...on people's natural predisposition not to see anything they don't want to, weren't expecting, or can't explain..." and talk about higher level protocols as a way of saying that someone else has to deal with it.

But I can look at things like this:

and this:

and I know that there are quite a few inadequate somebody elses out there.

Even my Mac runs into those same problems. Even when the text is plain:

http://www.trigeminal.com/images/TextEditBidi.png

The section in the UBA about Higher Level Protocols shows how much clients are left on their own to figure stuff out:

4.3 Higher-Level Protocols

The following clauses are the only permissible ways for systems to apply higher-level protocols to the ordering of bidirectional text. Some of the clauses apply to segments of structured text. This refers to the situation where text is interpreted as being structured, whether with explicit markup such as XML or HTML, or internally structured such as in a word processor or spreadsheet. In such a case, a segment is span of text that is distinguished in some way by the structure. 

HL1. Override P3, and set the paragraph embedding level explicitly.
  • A higher-level protocol may set the paragraph level explicitly and ignore P3. This can be done on the basis of the context, such as on a table cell, paragraph, document, or system level.
HL2. Override W2, and set EN or AN explicitly.
  • A higher-level process may reset characters of type EN to AN, or vice versa, and ignore W2. For example, style sheet or markup information can be used within a span of text to override the setting of EN text to be always be AN, or vice versa.
HL3. Emulate directional overrides or embedding codes.
  • A higher-level protocol can impose a directional override or embedding on a segment of structured text. The behavior must always be defined by reference to what would happen if the equivalent explicit codes as defined in the algorithm were inserted into the text. For example, a style sheet or markup can set the embedding level on a span of text.
HL4. Apply the Bidirectional Algorithm to segments.
  • The Bidirectional Algorithm can be applied independently to one or more segments of structured text. For example, when displaying a document consisting of textual data and visible markup in an editor, a higher-level process can handle syntactic elements in the markup separately from the textual data.
HL5. Provide artificial context.
  • Text can be processed by the Bidirectional Algorithm as if it were preceded by a character of a given type and/or followed by a character of a given type. This allows a piece of text that is extracted from a longer sequence of text to behave as it did in the larger context.
HL6. Additional mirroring.
  • Characters with a resolved directionality of R that do not have the Bidi_Mirrored property can also be depicted by a mirrored glyph in specialized contexts. Such contexts include, but are not limited to, historic scripts and associated punctuation, private-use characters, and characters in mathematical expressions. (See Section 6, Mirroring.)

Clauses HL1 and HL3 are not logically necessary; they are covered by applications of clauses HL4 and HL5. However, they are included for clarity because they are more common operations.

As an example of the application of HL4, suppose an XML document contains the following fragment. (Note: This is a simplified example for illustration: element names, attribute names, and attribute values could all be involved.)

ARABICenglishARABIC<e1 type='ab'>ARABICenglish<e2 type='cd'>english

This can be analyzed as being five different segments:

  a. ARABICenglishARABIC
  b. <e1 type='ab'>
  c. ARABICenglish
  d. <e2 type='cd'>
  e. english

To make the XML file readable as source text, the display in an editor could order these elements all in a uniform direction (for example, all left-to-right) and apply the Bidirectional Algorithm to each field separately. It could also choose to order the element names, attribute names, and attribute values uniformly in the same direction (for example, all left-to-right). For final display, the markup could be ignored, allowing all of the text (segments a, c, and e) to be reordered together.

When text using a higher-level protocol is to be converted to Unicode plain text, for consistent appearance formatting codes should be inserted to ensure that the order matches that of the higher-level protocol.
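
To make the HL4 example in that quoted section a little more concrete, here is a minimal Python sketch of the segmentation step only. The function name split_segments and the regular expression are hypothetical, chosen for illustration (nothing in the UBA specifies them), and a real implementation would presumably use an actual XML parser and then run the Bidirectional Algorithm on each text segment:

```python
import re

def split_segments(source):
    """Split a source string into alternating text and markup segments.

    Hypothetical helper: treats anything between '<' and '>' as markup.
    """
    # The capturing group makes re.split keep the tags in the result.
    parts = re.split(r"(<[^>]*>)", source)
    # Drop empty strings that appear when two tags are adjacent.
    return [part for part in parts if part]

def is_markup(segment):
    """True for segments that would be laid out in one uniform direction."""
    return segment.startswith("<") and segment.endswith(">")

fragment = "ARABICenglishARABIC<e1 type='ab'>ARABICenglish<e2 type='cd'>english"
segments = split_segments(fragment)
text_segments = [s for s in segments if not is_markup(s)]
```

Here segments holds the five pieces listed in the quoted example, and text_segments holds the three that would be reordered together for final display.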

This information is so helpful that implementers can't even have their text look wrong in a consistent way -- every implementation has its own mistakes.

Even in plain text, when the whole higher level protocol is arguable.
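
For plain text, about all the algorithm has to go on for the overall direction is the first strong character in the paragraph (rules P2 and P3, which HL1 lets a higher-level protocol override). A rough Python sketch of that idea follows. The function name paragraph_level and its override parameter are assumptions made for illustration, and the sketch ignores details of the real rules, such as the way current versions of the UBA skip characters inside isolates:

```python
import unicodedata

def paragraph_level(text, override=None):
    """Return 0 (left-to-right) or 1 (right-to-left) for a paragraph.

    override stands in for HL1: a higher-level protocol that sets the
    level explicitly and ignores the first-strong heuristic.
    """
    if override is not None:
        return override
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return 0
        if bidi_class in ("R", "AL"):
            return 1
    # No strong character found: default to left-to-right.
    return 0

hebrew_alef = "\u05d0"
print(paragraph_level("hello " + hebrew_alef))   # 0: Latin comes first
print(paragraph_level(hebrew_alef + " hello"))   # 1: Hebrew comes first
print(paragraph_level("12345", override=1))      # 1: set by the protocol
```

The same two words give a different paragraph direction depending only on which one happens to come first, which is a big part of why mixed text so easily ends up looking wrong.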

And yes, you can solve all such cases with RLM and LRM and RLE and LRE and PDF, sure. But there is no standard on how to apply these in plain text, or on how to make the standard itself pass my own "smart as an 8-year-old" test (something those eight-year-olds can pass in cases like the above and in harder cases like in The mythical nature of bidirectional support, and where the wheels come off the wagon).
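
For reference, the characters in question are ordinary code points that anyone can insert into plain text. Here is a small Python sketch of doing so by hand. The helper name embed and its direction argument are hypothetical, and deciding where such characters belong is exactly the part that nothing standardizes:

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def embed(text, direction):
    """Wrap text in an explicit embedding, closed with PDF.

    direction is "rtl" or "ltr" (a hypothetical convention for this sketch).
    """
    opener = RLE if direction == "rtl" else LRE
    return opener + text + PDF

# An "island" of Hebrew in a "sea" of English, marked up by hand.
island = embed("\u05e9\u05dc\u05d5\u05dd!", "rtl")
sentence = "The sign said " + island + " and nothing else."

# The marks are invisible but behave as strong characters.
print(unicodedata.bidirectional(LRM))  # L
print(unicodedata.bidirectional(RLM))  # R
```

Note that the code has to be told which direction the island runs in; nothing in the text itself carries that information.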

Certainly some cases are exceptional, but the default case of mixed-language text is broken now.

More importantly, the "islands of text of one language in a sea of another language" case is also broken. For no good reason, really.

Perhaps the organization that Microsoft and all of these other big companies pay ten times the price of an Optimus keyboard a year to needs to start doing a bit of higher level work here, rather than passing the buck to random protocols.

Because it is clearly our problem (and everyone else's)....

Which makes it theirs! :-)


This blog brought to you by U+200e and U+200f (aka LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK)


# anon on 28 Aug 2008 6:02 AM:

Do you think this problem is solvable?  Can the UBA do much better without knowing the meaning of the text it's laying out?



referenced by

2010/07/23 It used to be Windows doing it right, and Office following. But now...

2009/06/29 UCS-2 to UTF-16, Part 11: Turning it up to Eleven!

2008/09/15 Hi, I'm a PC. And I have a MAC. Wait, isn't that backwards? No worries, we're talking Bidi here!

2008/09/01 Bidi, in your face[book]
