Even if the text is right underneath, it may look wrong close up....

by Michael S. Kaplan, published on 2008/04/19 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/04/19/8409778.aspx


Content of Michael Kaplan's personal blog not approved by Microsoft (see disclaimer)!
Regular readers should keep in mind that all I said in The End? still applies; the allusion to the X-Files continues for people who understand such references....

Regular reader Jan Kučera asked over in the Suggestion Box:

Hi again,

I know the behaviour I mention here is not a problem you can solve, but I'm interested in handling RTL fragments in "plain text". What I've encountered is like this.

I have both IMAP and web access to my e-mail. I don't have an SMTP server, so I send the mails from web and read them in Outlook 2007. One day, I wanted to know the author of Hebrew lyrics to the Maya the Bee song, so I wrote an e-mail to an Israeli TV station which had a page about the series. The title of the e-mail was "Maya the Bee (הדבורה מאיה)" and I repeated these words in the message body. Needless to say, I have the web mail configured to write plain text e-mails.

The surprise came with the answer. On the web, everything was okay, as I had written it. But in the Outlook, although the title remained ok (the hebrew phrase being selected from right to left), in the message body, I saw הדבורה first, followed by מאיה, letting the user select the row with hebrew text without troubles, char by char, from left to right.

When I copied it and pasted to the notepad, everything was ordered and behaving okay again.

The mail was encoded in 1255 and the sender used Thunderbird 2, but I don't think this is too important since in IE and other applications the text is formatted as it should.

What is more important is the title, encoded as "Subject: Re: Maya the Bee ( =?windows-1255?Q?=E4=E3=E1=E5=F8=E4_=EE=E0=E9?= =?windows-1255?Q?=E4=29?=" which could prevent Outlook from interpreting badly the title too.

E-mail reply was in HTML.

Now, the question is, beside whether this is a bug at all, how could be RTL phrase rendered in LTR, and what could we, as developers, do to avoid this issue in our programs.

PS: The answer to my question is Dan Zakai (דן זכאי). Or... דן and זכאי, as shown by Outlook? :)

It is actually not that hard to discern the relationship between

הדבורה מאיה

and the weird part of the string in

"Subject: Re: Maya the Bee ( =?windows-1255?Q?=E4=E3=E1=E5=F8=E4_=EE=E0=E9?= =?windows-1255?Q?=E4=29?="

Just look at that Windows code page 1255 chart:

So it is an encoding of the text into cp1255 (the "encoded-word" syntax of RFC 2047, using its Q encoding), with the text in the appropriate logical order, that anything that understands the format should be able to use to decipher the text.
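To make the relationship concrete, here is a minimal Python sketch that decodes the two encoded words from Jan's Subject line by hand. The helper names (q_decode, decode_encoded_words) are made up for illustration, and the sketch assumes only the Q form appears, as it does here; real mail code would be better off with a full parser such as the standard library's email.header module.

```python
import re

# The Subject value from Jan's message, with the "Subject: " prefix removed.
SUBJECT = ("Re: Maya the Bee ( =?windows-1255?Q?=E4=E3=E1=E5=F8=E4_=EE=E0=E9?= "
           "=?windows-1255?Q?=E4=29?=")

# =?charset?Q?payload?=  (only the Q form is handled in this sketch)
ENCODED_WORD = re.compile(r"=\?([^?]+)\?[Qq]\?([^?]*)\?=")


def q_decode(payload):
    """Turn a Q-encoded payload into raw bytes: =XX is a hex byte, _ is a space."""
    out = bytearray()
    i = 0
    while i < len(payload):
        ch = payload[i]
        if ch == "=":
            out.append(int(payload[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(0x20 if ch == "_" else ord(ch))
            i += 1
    return bytes(out)


def decode_encoded_words(header):
    """Replace each encoded word with its decoded text (hypothetical helper)."""
    # Whitespace that only separates two encoded words is not part of the text.
    header = re.sub(r"\?=\s+=\?", "?==?", header)
    return ENCODED_WORD.sub(
        lambda m: q_decode(m.group(2)).decode(m.group(1)), header)


decoded = decode_encoded_words(SUBJECT)
print(decoded)
# The Hebrew code points, in the order they are stored (logical order):
print([f"U+{ord(c):04X}" for c in decoded if "\u0590" <= c <= "\u05FF"])
```

The second print lists the code points in storage order, which starts with U+05D4 (ה), the first letter one would type, no matter how a given control chooses to draw the string.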

And on the other hand, anything that doesn't understand the encoding technique is quite apt to misinterpret it and not show what is expected....

For the body, if whatever control is holding the body knows how to properly use the Unicode Bidi algorithm then it will properly display the text, though the behavior Jan describes suggests that at least some pieces do not know how to interpret the text properly. The fact that it does not corrupt the text makes it somewhat easier to be okay with the interim display issues. :-)
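For the curious, the raw material the Bidi algorithm works from can be peeked at with Python's unicodedata module. This is only a sketch of the algorithm's inputs (each character's bidi class), not an implementation of the algorithm itself, and the sample string is simply the phrase from Jan's mail written out in logical order.

```python
import unicodedata

# Phrase from Jan's message, written with escapes so the source itself
# is unambiguous about the stored (logical) order.
text = "Maya the Bee (\u05d4\u05d3\u05d1\u05d5\u05e8\u05d4 \u05de\u05d0\u05d9\u05d4)"

# Bidi class per character: 'L' = left-to-right letter, 'R' = right-to-left
# letter, 'WS' = whitespace, 'ON' = other neutral (e.g. parentheses).
classes = [(f"U+{ord(c):04X}", unicodedata.bidirectional(c)) for c in text]

for code_point, bidi_class in classes:
    print(code_point, bidi_class)
```

A control that applies the algorithm reorders the run of 'R' characters for display only and leaves the stored order alone, which would be consistent with the paste into Notepad looking fine.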

Avoiding this kind of issue? More or less the answer is to avoid processing text in these interim stages, since it is likely way too easy to corrupt the text in the meantime.

Other recent posts of mine like this one and this one and this one jump into the handling of RTL fragments within LTR text and LTR fragments within RTL text, which is not easy under the best of circumstances, though stay tuned, as I might suggest some additional methodologies to consider. :-)

 

This blog brought to you by ה (U+05d4, aka HEBREW LETTER HE)


# Jan Kučera on 23 Apr 2008 6:39 AM:

Thanks for answering and references.

Actually I get it was encoding in the title, but the way and place it was inserted caused the rendering thing to render it correctly even without the direction marks (I guess the switch to 1255 codepage works as the mark, doesn't it?).

What I've found strange with the selection is that the letters in individual words were actually rendered in the RTL order, just the selection worked from left.

Okay, I've just found the message and tried one thing more, to select only part of the text and paste it into notepad.. wow, it copied different text than selected... :) This really is a bug..but there is no place for us to file it...

Picture for curiosity: http://195.122.199.198/Maya.jpg

The last thing I found is that for the title, RichEdit20WPT class is used, for the body... ummm... _WwG ... :-D and the whole body html is marked as 1255, no direction marks.

(Ctrl+Insert does not work in this strange control either :'(, by the way)

So if I get your post right, we can't run into this issue if we are using existing controls, like richedit, WinForms, WPF...

