What are directional marks -- chumps who point?

by Michael S. Kaplan, published on 2006/01/19 06:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/19/514718.aspx

Earlier today in the post Just when you think you know a function... I talked about the secret way to use two U+200f (RIGHT-TO-LEFT MARK) characters in the MessageBox function to put MB_RTLREADING flag behavior in the hands of localizers, where it may often belong.

While I was talking to people about that post, I got a question about what U+200f was for when it was being used correctly, and what made me so sure that it was not dangerous to put two in a row that way.

I figured I should answer that question (since several native speakers of bidirectional languages helped give me some information too!).

The easiest way to explain it is to first look at how I talked about reading order in the post Sticky Keys vs. Reading Order. Basically, this 'Reading Order' setting allows you to set the context for the text before you even type it. It is non-destructive (in the sense that it does not alter the text in a harmful way) and easily changeable.

Then, you start typing. And now we will look at Unicode Standard Annex #9 - The Bidirectional Algorithm. It talks about how every character has a Bidi class that says what directionality the character has (if any) and how strong that directionality is.
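If you want to see those Bidi classes for yourself, here is a minimal sketch that assumes nothing beyond Python's standard unicodedata module. The helper name bidi_class and the sample characters are my own choices for illustration, not anything taken from UAX #9.

```python
import unicodedata

def bidi_class(ch: str) -> str:
    """Return the Bidi class abbreviation for one character (hypothetical helper)."""
    return unicodedata.bidirectional(ch)

# Sample characters chosen for illustration only.
samples = [
    ("a", "Latin letter"),
    ("\u05d0", "Hebrew letter alef"),
    ("\u0628", "Arabic letter beh"),
    ("1", "European digit"),
    (" ", "space"),
    ("!", "exclamation mark"),
]

for ch, label in samples:
    print(f"U+{ord(ch):04X} {label}: {bidi_class(ch)}")
```

L, R, and AL are the strong classes; EN (digits) is a weak class, and WS and ON (spaces and most punctuation) are neutral.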

Now most letters have what is known as a strong directionality, but the strength is very local and has very little effect on anything but the characters right around it. And this is where U+200e and U+200f come in -- they are just as strong as (but no stronger than) one of those letters might be (Left-to-Right and Right-to-Left, respectively). As UAX #9 says:

RLM	Right-to-Left Mark	Right-to-left zero-width character
LRM	Left-to-Right Mark	Left-to-right zero-width character

In fact the only difference between them and the letters is that LRM and RLM are not visible -- so two in a row has no more effect than two letters in a row -- which is to say none of any significance.
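A quick way to check this claim is to ask the Unicode character database directly. The sketch below again assumes only Python's unicodedata module; the constant names LRM, RLM, and ALEF are mine.

```python
import unicodedata

LRM = "\u200e"   # LEFT-TO-RIGHT MARK
RLM = "\u200f"   # RIGHT-TO-LEFT MARK
ALEF = "\u05d0"  # a Hebrew letter, for comparison

for ch in (LRM, RLM, ALEF, "a"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "bidi class:", unicodedata.bidirectional(ch),
        "category:", unicodedata.category(ch),
    )

# Two RLMs in a row carry the same Bidi classes as two Hebrew letters in a row.
two_marks = [unicodedata.bidirectional(c) for c in RLM + RLM]
two_letters = [unicodedata.bidirectional(c) for c in ALEF + ALEF]
print(two_marks == two_letters)
```

The marks report the same Bidi class as an ordinary letter of the matching direction; what sets them apart is the general category Cf (format), which is why nothing is drawn for them.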

And as More on cursor support: the rest of the answer certainly showed, even a misplaced LRM, RLM, or random letter with strong directionality will not convince any character with strong directionality to change its stripes. The only characters that have anything to fear are the weaker characters, though as UAX #9 indicates, those do exist. So it makes sense to put the marks in when you want to give an extra hint because you are not as sure of the context.
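To make the "extra hint" idea concrete, here is a rough sketch of what a mark does for a weaker character. It is not an implementation of the Bidirectional Algorithm: it only looks for the nearest strong character on each side of a position and ignores numbers, embeddings, overrides, and isolates. The helper names nearest_strong and neighbors are hypothetical, and the sample string is just a Hebrew word followed by an exclamation mark.

```python
import unicodedata

RLM = "\u200f"
STRONG = {"L", "R", "AL"}  # the strong Bidi classes

def nearest_strong(text, index, step):
    """Walk from index in direction step (+1 or -1) to the first strong character."""
    i = index + step
    while 0 <= i < len(text):
        cls = unicodedata.bidirectional(text[i])
        if cls in STRONG:
            return cls
        i += step
    return None  # ran off the end: only the surrounding context can decide

def neighbors(text, index):
    """Return (strong class to the left, strong class to the right) of text[index]."""
    return nearest_strong(text, index, -1), nearest_strong(text, index, +1)

word = "\u05e9\u05dc\u05d5\u05dd!"  # Hebrew word followed by a neutral "!"
bang = word.index("!")

print(neighbors(word, bang))        # ('R', None)
print(neighbors(word + RLM, bang))  # ('R', 'R')
```

Without the mark, the trailing "!" has a strong neighbor on only one side, so where it ends up depends on the context the text is dropped into. With an RLM after it, the "!" sits between two right-to-left characters no matter what the context is, while the strong letters themselves are untouched.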

I'll talk more about how functions use (and perhaps misuse?) this functionality soon....

Both of today's articles seem like good places to explain why UAX #9 3.3.1 either isn't implemented or is overridden by "higher protocols" that don't have any context hints.

In the text editor on this computer, for example, when I open a document written in Hebrew it is all displayed as RTL paragraphs starting from the right margin.

But in Notepad on a Windows PC, it is displayed as RTL embedding in LTR paragraphs starting from the left margin. The result is that a few characters of LTR text can force the start of a sentence into the middle of the text, rarely the desired outcome.

Actually Nick, you are mistaken. On a Hebrew system (or on any system where you have changed the default reading order of Notepad explicitly), you have a text editor that has a default RTL context -- not an LTR context with embedding.

Sure, if I switch my system to Hebrew, the text editor now defaults to RTL for new paragraphs. That's perfectly sensible.

However I don't really see how this addresses UAX #9 3.3.1 or the example I presented, except as a work around.

You really need 3.3.1 or (as in HTML) some higher level paragraph mark-up and preferably both. The default paragraph direction isn't enough. Try it.

I have tried it and fail to see a case where the reading order set to RTL on an English system gives different results than a Hebrew system. Do you see some sort of difference between the two?

"Do you see some sort of difference between the two?"

No. Why would there be any difference?

The Unicode BiDi algorithm gets this right (you should be able to test that easily enough) but Windows text widgets don't. Perhaps a text widget does have a "higher protocol" internally but that's not much of an excuse in an application.

If you're having trouble understanding this then it explains why you didn't see the significance of my suggested example for the selections articles. Hebrew paragraphs don't start in the middle of the text, and neither do English ones.

Nick,

I guess you just have a better parser of text than I. Because this conversation makes NO sense to me.

You claimed "But in Notepad on a Windows PC, it is displayed as RTL embedding in LTR paragraphs starting from the left margin."

This is not true, that is not how the Reading Order setting works. I said as much.

You then started making a new point without actually explaining what you meant in the original point.

And now you are claiming Unicode gets something right while Microsoft does not and have still not bothered to explain why.

So I guess I will yield at this point, since it makes no sense to have either a conversation or an argument if only one person understands what is going on.

But in any case it is a separate point from this post, so there is no point in having the conversation HERE in the post about directional marks. Let's try to keep it on topic....

This has nothing to do with the directional marks! Which is what this post is about.

Please put them in the post that referred to this issue you wanted a better example of. Future comments along this line put HERE will be removed since they are off topic.

If it is about any part of Bidi other than specifically the direction marks, then it cannot be posted here. Sorry, everyone else who is not insisting on putting stuff here after being told it is off-topic. :-(

