Making a mark in code windows

by Michael S. Kaplan, published on 2006/02/19 17:29 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/19/535148.aspx

This takes you right back to What are directional marks -- chumps who point?, but the other day Cyrus Najmabadi asked on an internal alias about one of those interesting pieces of mixed LTR and RTL text that cause so much grief, such as a function signature in Visual Studio like:

The string above has LRM (U+200e) inserted before all the neutral characters such as <, >, (, and ).
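As a rough illustration of that fix, here is a minimal Python sketch that inserts an LRM before each of those neutral characters. The function name and the sample signature are hypothetical, made up for this example; this is not how Visual Studio does it.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
NEUTRALS = set("<>()")  # the neutral characters named in the post

def add_lrm_before_neutrals(signature: str) -> str:
    """Return the signature with an LRM inserted before each listed neutral."""
    return "".join(LRM + ch if ch in NEUTRALS else ch for ch in signature)

# Hypothetical signature with Arabic identifiers, for illustration only.
ARABIC = "\u0639\u0631\u0628\u064a"
sample = f"void {ARABIC}<{ARABIC}>({ARABIC} {ARABIC})"
marked = add_lrm_before_neutrals(sample)
```

The marks are invisible and only influence how a bidi-aware renderer orders the runs; removing them again gives back the original string.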

The problem was that the wrong string (the one without those characters) kept coming up in a left-to-right context, and if Uniscribe support was turned on it would look something like the following:

Or if you take my "fixed" string above and put it in a right to left context, it turns out to also have problems:

Which leads me to make two observations and ask one question of everyone who is reading this....

The first observation is that this display issue does not affect the behavior -- you will not have code that does not compile.

This of course leads to the second observation, which is that if the code looks this bad, it does not seem terribly useful or usable, whether it works or not.

This leads to the question -- how do you think the problem should be solved (and you cannot use "disallow other languages" as a suggested solution!)....

Now if you respond, you can describe the solution (if you have trouble entering it due to comment limitations!).

If < > ( and ) were not neutral, but strictly left-to-right, would that look right? Of course, this means that VS.NET would have to have extra custom processing to ensure that they're counted as strictly left-to-right, and knowing what I know about Uniscribe (i.e. nothing) I don't know how feasible that is... or if it would even work!
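The premise that these characters are neutral can be checked against the Unicode Character Database. A minimal sketch using Python's standard unicodedata module follows. It only reports the bidi class Unicode assigns to each character and says nothing about what Uniscribe or VS.NET would do with an override.

```python
import unicodedata

# Bidi class of each punctuation character discussed above.
classes = {ch: unicodedata.bidirectional(ch) for ch in "<>()"}

# For comparison: a Latin letter, an Arabic letter, and the LRM itself.
latin = unicodedata.bidirectional("a")        # strong left-to-right
arabic = unicodedata.bidirectional("\u0639")  # ARABIC LETTER AIN
lrm = unicodedata.bidirectional("\u200e")     # LEFT-TO-RIGHT MARK

for ch, cls in classes.items():
    print(repr(ch), cls)
```

All four punctuation characters come back as ON (Other Neutral), while the LRM has the strong class L. That is why placing an LRM next to a neutral pulls it toward left-to-right.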

I think we should first check with a programmer using a RTL language. He would expect a RTL environment, or a LTR one.

Where do we need Arabic in a program code?
- variable/class/etc names
- comments
I don't consider hard-coded strings, which are bad, bad, bad: English, Arabic, or Klingon, I don’t care!

My feeling is that an LTR environment is more appropriate, since we are talking about a program (mostly English keywords and separators) with Arabic variable names, not an Arabic document with some English quotation. But I might be wrong :-)

Now, if we say LTR environment, where do we have the problem? Both strings look like this “void arabic‎<arabic‎>‎(arabic arabic‎)” (only that in the “fixed” string the Unicode character is visible).
I cannot reproduce it in Dev. Studio 2003 or 2005. I can reproduce it in Notepad, but this is not the issue, is it?
So, what is his/your environment? I am trying in Dev Studio 2003 and 2005, Windows XP SP2, English OS, English UI, user and system locales set to English US. (and I agree this is not what an Arab programmer would use).

The other thing: I would change the example to use different strings for function name, template parameter, argument and argument type.
The way it is, you cannot tell if the argument and argument type get switched:
void aaa<bbb>(ccc ddd) or void aaa<bbb>(ddd ccc)

And once we get some idea what is expected, we can also tackle the digits in that context aaa<bbb>(ccc ddd12 = 1234) :-)

As with Mihai, I feel that it's wrong to allow either Microsoft's "reading order" flags or the UAX #9 paragraph level rule to decide the high level flow of text which is essentially structured rather than natural language.

You probably want LTR flow here as a matter of policy, and you also want to avoid treating whole source code lines or blocks as UAX #9 paragraphs. The language grammar is probably the right place to decide blocks, and conveniently it's also the unit used by the syntax highlighting, so you should be able to get this working with minimal performance hit (e.g. in a Pango system you'd just add explicit runs along with the coloring for different pieces of syntax).

The result would be (usual conventions)

in memory: void FOO<BAR>(BAZ QUX)
on screen: void OOF<RAB>(ZAB XUQ)

This retains readability for individual symbol names that are RTL, while retaining the visual semblance of the grammar, which is important to anyone inspecting the code, regardless of their native or adopted reading order.
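A toy Python sketch of that token-by-token layout, using the same convention that UPPERCASE stands for right-to-left letters. The tokenizer is a made-up regex rather than a real C++ lexer, and reversing a string only simulates what a renderer would do with an explicit RTL run. It is neither the UAX #9 algorithm nor a Pango API.

```python
import re

# Each identifier, whitespace run, or punctuation character is its own run.
TOKEN = re.compile(r"[A-Z]+|[a-z]+|\s+|.")

def display_order(logical: str) -> str:
    """Lay tokens out left to right; reverse only the text inside RTL tokens."""
    out = []
    for tok in TOKEN.findall(logical):
        # isupper() is False for punctuation and spaces, so they stay as-is.
        out.append(tok[::-1] if tok.isupper() else tok)
    return "".join(out)

print(display_order("void FOO<BAR>(BAZ QUX)"))  # void OOF<RAB>(ZAB XUQ)
```

Because the order of the runs never changes, the punctuation keeps its grammatical position and only the identifiers read right to left.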

If someone develops a popular language in which the keywords and grammar are naturally expressed RTL you'd want to re-visit for that specific language. I don't expect any such language to appear.

A similar problem on a smaller scale with URIs has been discussed by an IETF WG (I think) many moons ago, traces of it should be in bidi@unicode.org if Michael is a subscriber.

It is easy (well, it is convenient) to take a position that:

1) things are complicated
2) they should be simpler
3) it should be done the way I prefer it.

It is not going to be a defensible position that either Unicode or Microsoft will accept, given existing legacy practice which does not match this attempted simplification.

To see if it is fair or not, work with the opposite default for a week; you will see what you are asking people to put up with....

Michael, I don't understand your explanation. Did you miss out a paragraph?

You write that you have a solution that's the way you prefer it, and then you mention Unicode.org and Microsoft won't accept it, but you don't explain what this solution is.

The reference to "if it is fair or not" and "opposite default" is unclear, perhaps about reading order flags again?

Did you find the IETF discussion? Were any of the proposals similar to your own idea?

Actually Nick, I was referring to your idea of changing both what MS does and what Unicode does, and how it is impractical and would NEVER happen.

I also pointed out how easy it was to ferret the issue out -- how someone who uses principally LTR thinks LTR should be the default for everyone.

It was a critique -- and an explanation for how you can prove yourself wrong -- by living with the opposite as your default for a week.

You asked "how do you think the problem should be solved" in reference to C++ source code and I explained how it would be solved in accordance with the principles outlined by Unicode.org and UAX #9. You don't seem to have understood, and space in this comment widget forbids a substantially more detailed explanation, which is why I thought the IETF material might be useful. If you try to position this as a problem for the natural language BiDi algorithm to solve then you set off in the wrong direction.

You may be right that (if chosen for use in Visual Studio) my suggested approach obliges Microsoft to make some changes, perhaps API changes or wholesale updates to their text renderer, although I doubt that there's really a lot of work needed if you already support a higher level markup for text style in the renderer.

Since my preferred desktop OS implements UAX #9 in full the only noteworthy consequence of "changing the default" as you describe it, is to push the cursor to the other edge of the text area when starting a new paragraph. I don't see why I should live with that for a week to prove some abstract point, either to you or to me, especially since your demand seems to be based on a misunderstanding.

Um, Nick -- you also suggested avoiding rules within UAX #9 -- to wit, allowing one to treat lines as paragraphs.

Microsoft has more legacy built on RTL scripts than just about anyone else in the world, so you will have to forgive Microsoft for wanting to find a solution that does not change behavior but still allows display to work properly.

If you would like me to take judicial note of your solution that kind of ignores that legacy, I will do so.

Since I have apparently once again failed to understand your point I will simply assume I am not quite bright enough to fathom your approach....