When did Bidi happen?

by Michael S. Kaplan, published on 2005/08/03 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/08/03/447052.aspx


ronab49 asked in the suggestion box:

:) Do you know the story of bidi and the timeline of its adoption by Unicode and Windows?

Interestingly enough, a conversation has been going on over on the Unicode List and some history was discussed!

Ken Whistler noted:

Visual order Arabic and Hebrew implementations on computers were probably "invented" in the '70s, and saw fairly widespread use in that timeframe on mainframes and later in the '80s on PCs. A lot of that work was done by IBM. An inherent bidirectionality algorithm was invented at Xerox PARC in the '80s, I think, although others might have had an earlier hand in it. It was implemented on the Xerox Star system in that timeframe. You can see it discussed in Joe Becker's 1984 Scientific American article, for example. And that was the immediate precursor of Arabic and Hebrew support on the Macintosh, as well as the inspiration for the Unicode bidirectional algorithm.

[Some historians on the list can, no doubt, nail this stuff down more precisely...]

...

...nobody has claimed that the Arabic *language* is inherently bidi. Nor has anybody claimed that the Arabic *script* is inherently bidi. So try understanding what the people implementing these systems *are* claiming.

Any functional information processing system concerned with textual layout that is aimed at the Hebrew or Arabic language markets *must* support bidirectional layout of text. That is simply a fact.

Furthermore, to do so interoperably -- that is, with the hope that Implementation A by Company X will lay out the same underlying text as Implementation B by Company Y in the same order, so that a human sees and reads it as the "same" text -- they depend on a well-defined encoding of the characters and a well-defined bidirectional layout algorithm. One possible choice is consistent visual ordering. One possible choice is consistent logical ordering and an inherent bidirectional algorithm. The Unicode Standard chose the latter, for a number of very good reasons. Trying to mix the two is a quick road to hell.
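To make the "logical order plus an inherent bidirectional algorithm" model a bit more concrete, here is a tiny sketch of mine (not from the list thread) using Python's standard unicodedata module to dump the Bidi_Class property that the algorithm consumes; the sample string is just an arbitrary mix of Latin, Hebrew, and digits:

```python
import unicodedata

# Unicode stores text in logical order (the order in which it is typed and
# read); the bidirectional algorithm derives the visual order at display
# time from each character's Bidi_Class property.
logical = "abc \u05D0\u05D1\u05D2 123"  # Latin, Hebrew alef/bet/gimel, digits

for ch in logical:
    # bidirectional() returns the Bidi_Class: 'L' (left-to-right),
    # 'R' (right-to-left), 'EN' (European number), 'WS' (whitespace), ...
    print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch)}")
```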

Jony Rosenne then added some additional info:

Visual order for Hebrew ("inverted") was used in the 1950s on "unit record" equipment, i.e., punched cards. Arabic was more complicated because of shaping, and I am not qualified to discuss it, but I believe it was around at about the same time.

When PDPs became common in the '70s, everybody did his own version of Hebrew; there were about eight of them, all visual, some inverted and some not. The main determinant was whether the application required just Hebrew or both Hebrew and English.

The IBM 3270 had a button to determine whether the screen should be right to left or left to right, and a way to reverse numbers and English text during data entry.

The IBM 5250 used non-inverted visual order.

An important landmark was the article "Arabic Word Processing" by Joe Becker of Xerox, in Communications of the ACM, July 1987, Volume 30, Number 7:

"Recently developed word processing software can correctly format the cursive, interacting letters of the Arabic script. Moreover, new layout procedures can automatically intermix right-to-left Arabic writing with left-to-right text in European or other languages."

Patrick Andries then commented about this quote:

Well, I believe ASMO 708 may have been the real landmark in 1986: it allowed the encoding of mixed Latin and Arabic text and required contextualization. It later became ISO 8859-6:1987, which was itself integrated into ISO 10646.

A previous employer of mine (the now defunct Alis Technologies) was founded as Arabic Latin Information System and developed several Arabic contextualization and bidi solutions soon after its founding in 1981: terminal emulations, printer firmware, and even participation in MS-DOS Arabic in 1987 (I think; I wasn't there!), before coming up with the first Unicode Arabic browser (Tango).

In other words, Joe Becker identified and documented a trend in the industry that had already led to the standardization of Arabic character sets that allowed for mixed Latin and Arabic text requiring contextualization (and shared Latin LTR digits).

And then finally Mark Davis said:

The choice of whether or not to clone characters was made consciously. We had experience with the other model: I wrote the first implementation of Arabic and Hebrew on the Mac back in 1986ish, and in that implementation cloned the common characters, giving the clones RTL directionality.

We found many problems with this, because identical-looking characters had bizarre effects when cut and pasted into different fields. Arabic and Hebrew users are not working in a vacuum; they will be cutting and pasting in text from a variety of sources, including LTR sources. Cloning parentheses (or interpreting them according to visual appearance) meant that every program that analyzed text for open/close parentheses (e.g., regexes) failed. And we didn't do numbers as LSDF (least-significant digit first); that would have caused huge problems in compatibility because software is just not set up to recognize LSDF numbers. And this is not to speak of the security problems with these clones (see http://www.unicode.org/reports/tr36/).

Thus when it came time to do the original BIDI algorithm, we decided not to use the cloning approach.
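To make Mark's point about parentheses concrete: in the model Unicode settled on, paired punctuation is not cloned. A single U+0028 carries the Bidi_Mirrored property, and the display engine mirrors the glyph in right-to-left runs, so code scanning the underlying text for open/close pairs keeps working. A quick sketch of mine with Python's standard unicodedata module:

```python
import unicodedata

# Rather than cloning '(' into a separate RTL character, Unicode keeps a
# single U+0028 marked with the Bidi_Mirrored property; the renderer swaps
# the glyph in right-to-left runs.
for ch in "()[]{}":
    print(f"U+{ord(ch):04X} {ch!r} mirrored={bool(unicodedata.mirrored(ch))}")
```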

The BIDI algorithm is not an impediment to the development of software globalized for BIDI. Most programs will simply use OS-supported text widgets that handle all the details for them. Text/Word processors can use the lower-level implementations of the BIDI algorithm: there are plenty of solid implementations around, either supported by the OS or in libraries like ICU. The barriers that I have seen to people globalizing their products for BIDI are more the other aspects, such as dialog layout in the applications, etc.
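For the "use an existing implementation" route Mark describes, here is a minimal sketch using the third-party python-bidi package (one freestanding implementation of the algorithm; it is my choice of library for illustration, not something named in the thread, and the sample text is arbitrary):

```python
# Requires the third-party python-bidi package: pip install python-bidi
from bidi.algorithm import get_display

# A logically-ordered string mixing Latin and Hebrew (arbitrary sample).
logical = "The key is \u05DE\u05E4\u05EA\u05D7 here."

# get_display() runs the bidirectional algorithm and returns the string
# reordered into visual order, ready for a left-to-right renderer.
print(get_display(logical))
```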

Moreover, it would be certainly possible for a program to use visual layout on the screen, then translate that internal format to and from logical layout for transmission as Unicode. Quite frankly, while you find the BIDI algorithm difficult to use, all of the other approaches had such serious problems that it is really the only practical approach.

(Notwithstanding that, if I had the chance to go back in time and undo a few things, I would have simplified the weak processing to make numbers independent of their surroundings. But that's water far, far under the bridge.)

That should kind of cover a bit of the history that went into the Unicode Bidirectional algorithm....

 

This post brought to you by "ڇ" (U+0687, a.k.a. ARABIC LETTER TCHEHEH)


# ronab49 on 22 Aug 2005 9:32 AM:

Thanks, Michael. I have not been reading your blog recently, and just discovered this entry of yours. I (re)discovered the logical ordering and my own variant of BiDi around 1991, which means I was at most 10 years behind!


referenced by

2006/05/25 Where'd I find that?
