Mixing it up with bidirectional text

by Michael S. Kaplan, published on 2007/01/06 06:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/01/06/1421178.aspx

So the question that Ziv asked was:

Hi,

I’m trying to display both English and Hebrew text in a single WinForms RichTextBox. Basically, the user types a string in one RichTextBox control (in either language) and I’m appending it to the contents of another RichTextBox control.

The problem is that “ambivalent” characters (such as “!” and “:”), while they get displayed correctly when the user types them in, are not displayed correctly once appended to the other RichTextBox.

For example, if the user types the following two strings:

Hello!
שלום!‏

Appending those strings to the existing RichTextBox yields the following display (if RightToLeft is set to “No”):

Hello!
שלום!

And yields the following display (if RightToLeft is set to “Yes”):

!Hello
שלום!

How can I trick the RichTextBox into behaving correctly?

Thanks, Ziv.

This is kind of like a problem I have discussed before in posts like this one, with a new twist -- the fact that one does not know what the text might be here -- whether it will be Hebrew or English. If one knows, then one can properly use U+200e (LEFT-TO-RIGHT MARK) and U+200f (RIGHT-TO-LEFT MARK) before these potentially visually leading/trailing characters that have a more neutral directionality.

If you have no idea whether things are LTR or RTL though, then you don't know what to insert.

Either way, you probably need to get at the Bidi category of each of the characters.

To do that in the .NET Framework, you currently have to use reflection to get at an internal method that some others have found by spelunking through the IL of the .NET Framework. At one point there was discussion of making it public, but that did not end up happening. The method works, though, and enough people have puzzled this one out using reflection that I figured I would just post it now and perhaps keep the next 100 people from having to do it. :-)

Here is a simplified example:

using System;
using System.Reflection;
using System.Globalization;

class CharUnicodeInfoReflection
{
    [STAThread]
    static void Main() {
        string st = "Hello!\r\nשלום!";
        // CharUnicodeInfo itself is public; only its GetBidiCategory(string, int) method is internal.
        Type typeCharUnicodeInfo = typeof(CharUnicodeInfo);
        BindingFlags bf = BindingFlags.NonPublic | BindingFlags.Static | BindingFlags.Instance | BindingFlags.InvokeMethod;
        MethodInfo getBidiCategory = typeCharUnicodeInfo.GetMethod("GetBidiCategory", bf);

        for(int ich = 0; ich < st.Length; ich++) {
            // The internal method takes the whole string plus an index, not a single char.
            Object [] parameters = new Object[2] {st, ich};

            // The target object is ignored here because the method is static.
            Object o = getBidiCategory.Invoke(typeCharUnicodeInfo, bf, null, parameters, CultureInfo.InvariantCulture);

            Console.WriteLine("U+" + ((ushort)st[ich]).ToString("x4") + "    " + o.GetType().ToString() + "    " + o.ToString());
        }
    }
}

This code will return the following when run:

U+0048    System.Globalization.BidiCategory    LeftToRight
U+0065    System.Globalization.BidiCategory    LeftToRight
U+006c    System.Globalization.BidiCategory    LeftToRight
U+006c    System.Globalization.BidiCategory    LeftToRight
U+006f    System.Globalization.BidiCategory    LeftToRight
U+0021    System.Globalization.BidiCategory    OtherNeutrals
U+000d    System.Globalization.BidiCategory    ParagraphSeparator
U+000a    System.Globalization.BidiCategory    ParagraphSeparator
U+05e9    System.Globalization.BidiCategory    RightToLeft
U+05dc    System.Globalization.BidiCategory    RightToLeft
U+05d5    System.Globalization.BidiCategory    RightToLeft
U+05dd    System.Globalization.BidiCategory    RightToLeft
U+0021    System.Globalization.BidiCategory    OtherNeutrals

So we've learned that we can get the Unicode bidi class of any Unicode character. In fact, we could probably get the type explicitly and use it more directly than this quick example does if we wanted to create a wrapper that makes it easier to call while hiding the reflection stuff. Anyone want to try and take a stab at that? ;-)
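One possible shape for such a wrapper is sketched below. The names BidiInfo and GetBidiCategoryName are made up for illustration. The sketch also assumes that the internal GetBidiCategory(string, int) method is present on whatever version of the framework you run on, which nothing guarantees for an undocumented method. Because the enum type it returns is internal too, the wrapper hands back the category name as a string.

```csharp
using System;
using System.Globalization;
using System.Reflection;

// Hypothetical helper; BidiInfo is not part of the .NET Framework.
public static class BidiInfo
{
    // Look the internal method up once. The parameter types are spelled out
    // so the lookup stays unambiguous if other overloads exist.
    private static readonly MethodInfo getBidiCategory =
        typeof(CharUnicodeInfo).GetMethod(
            "GetBidiCategory",
            BindingFlags.NonPublic | BindingFlags.Static,
            null,
            new Type[] { typeof(string), typeof(int) },
            null);

    // Returns the name of the bidi category of the character at s[index],
    // for example "LeftToRight" or "OtherNeutrals".
    public static string GetBidiCategoryName(string s, int index)
    {
        if (getBidiCategory == null)
            throw new NotSupportedException(
                "CharUnicodeInfo.GetBidiCategory(string, int) was not found.");

        // Static method, so no target object is needed.
        return getBidiCategory.Invoke(null, new object[] { s, index }).ToString();
    }
}
```

Called as BidiInfo.GetBidiCategoryName(st, ich) inside the loop from the sample, it should report the same category names as the last column of the listing above, assuming the runtime still exposes the same enum.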

And now we have the key here to solving Ziv's issue -- any time one finds neutral characters at whichever end of the string is going to be stuck onto another string, one has to add either an RLM or an LRM, matching the nearest character that has some direction, before the append. And for good measure we do it on the other end of the string too, so that a neutral on the other end is not misinterpreted.

Thus in this case (for example), where the string ends with ! (U+0021, a.k.a. EXCLAMATION MARK), we have to walk backwards in the string to the first character that has some direction. We see it is U+05dd and that this character is RightToLeft, so we add a U+200f to the end before we append or prepend another string (and we do something similar if the string we are appending/prepending has neutral characters at its ends, too).
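A rough sketch of that walk follows; treat it as one reading of the technique rather than a finished answer. BidiMarks, AddMarks and MarkFor are invented names. The bidi lookup is passed in as a delegate, so the reflection call or any other source of Bidi data can be plugged in. The sketch assumes the delegate returns category names like the ones in the listing above ("RightToLeftArabic" is an assumed name for Arabic letters), and it treats the whole string as a single paragraph.

```csharp
using System;

// Hypothetical helper; the class and method names are illustrative only.
public static class BidiMarks
{
    public const char Lrm = '\u200e'; // LEFT-TO-RIGHT MARK
    public const char Rlm = '\u200f'; // RIGHT-TO-LEFT MARK
    private const char NoMark = '\0';

    // categoryOf(s, i) is assumed to return a bidi category name for the
    // character at index i, such as "LeftToRight" or "OtherNeutrals".
    public static string AddMarks(string s, Func<string, int, string> categoryOf)
    {
        if (string.IsNullOrEmpty(s)) return s;
        char lead = MarkFor(s, false, categoryOf);
        char trail = MarkFor(s, true, categoryOf);
        return (lead == NoMark ? "" : lead.ToString())
             + s
             + (trail == NoMark ? "" : trail.ToString());
    }

    // Walks inward from one end. Returns the mark matching the first strongly
    // directional character, or NoMark when the end itself is already strong
    // or the string has no directional character at all.
    private static char MarkFor(string s, bool fromEnd, Func<string, int, string> categoryOf)
    {
        bool sawNeutral = false;
        int step = fromEnd ? -1 : 1;
        for (int i = fromEnd ? s.Length - 1 : 0; i >= 0 && i < s.Length; i += step)
        {
            // The low half of a surrogate pair is classified with its high half.
            if (i > 0 && char.IsLowSurrogate(s[i]) && char.IsHighSurrogate(s[i - 1])) continue;

            string name = categoryOf(s, i);
            if (name == "LeftToRight") return sawNeutral ? Lrm : NoMark;
            // "RightToLeftArabic" is an assumed name for the Arabic letter category.
            if (name == "RightToLeft" || name == "RightToLeftArabic") return sawNeutral ? Rlm : NoMark;
            sawNeutral = true;
        }
        return NoMark;
    }
}
```

With a lookup that reports U+05dd as RightToLeft and U+0021 as OtherNeutrals, AddMarks on the Hebrew string from the question appends a single U+200f, and on "Hello!" it appends a single U+200e. A string that already begins and ends with a directional character comes back unchanged.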

Should this be built in?

Well, maybe.

It is hard to imagine the exact semantics of such a method or what we would call it (or even what object it would go on, exactly).

In this world where the .NET Framework supports neither parsing nor formatting with LRM and RLM like Win32 does, it just seems a little premature to start adding code that will insert these characters so freely. Know what I mean? :-)

One special note -- GetBidiCategory does not seem to have an overload that takes a single char (a developer asked me about this a few weeks ago and wondered if he was missing something; he wasn't); it only has one that takes a string and an index (a signature I have discussed previously). That means if you pass a supplementary character in UTF-16 as a high surrogate and a low surrogate, you will get the bidi category of the supplementary character. This is what you would want for any code, but note that the code above would have to be modified so that any time one has a high and a low surrogate, one knows not to get the bidi category of the low surrogate by itself....
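One way to make that modification is to decide up front which indexes to query. The sketch below (CodePointWalker and StartIndexes are invented names) yields every index except the low half of a well-formed surrogate pair, so a loop over its results only ever asks a string-and-index method about whole characters.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the names are illustrative only.
public static class CodePointWalker
{
    // Yields each index at which a string/index API such as GetBidiCategory
    // should be queried: every char except a low surrogate that completes a pair.
    public static IEnumerable<int> StartIndexes(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            yield return i;

            // Skip the low half of a well-formed pair; the query at i covered it.
            if (char.IsSurrogatePair(s, i)) i++;
        }
    }
}
```

In the sample above, the for loop over ich would become a foreach over CodePointWalker.StartIndexes(st), and char.ConvertToUtf32(st, ich) could replace the (ushort) cast when printing the code point.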

If someone really wanted to take a stab at the generic function that would do all this, I think it meets the complexity level of a difficult interview question and I'd likely be impressed by code that would do the trick. :-)

This post brought to you by ! (U+0021, a.k.a. EXCLAMATION MARK)

# Erzengel on 6 Jan 2007 8:25 PM:

Isn't this relying on "implementation details"?

# Michael S. Kaplan on 6 Jan 2007 8:31 PM:

Since the markers will only ever do to text what you wanted to happen anyway, they will allow you (if you follow this technique) to be independent of implementation-specific details.

If the undocumented piece is what you are referring to, you can find some other source for Bidi info and use it instead if you want. :-)

