by Michael S. Kaplan, published on 2005/12/06 07:31 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/12/06/500475.aspx
Ilya Konstantinov asked in the Suggestion Box:
When Complex Text Layout is enabled, TEXT controls get additional new items in the context menu, among which is Insert Unicode Control Character. Among the control characters, you have something called RS and US. I've found those characters in the Unicode specs (looks like something inherited from terminals) but the BiDi TR says nothing about their influence, whereas in practice, they seem to break the BiDi run, whatever direction it is -- kind-of like you'd get with an RLM-LRM combination.
Now, let's introduce another issue. It's not uncommon for applications to concatenate a couple of strings into a status line of some sort. For example, Internet Explorer might compose its window caption as:
[page title] - [(customizable) IE caption]
and then there's the common issue of the two concatenated strings "blending" according to BiDi reordering rules. This is of course not a good thing, since each of those strings stands on its own and is not a "continuation" of the other. I've noticed that many Microsoft apps *don't* exhibit this problem, and I was wondering: What is the Right way to avoid this problem (besides emitting the text in two different TextOut calls)? Do you insert LRM-RLM (in the hope of breaking the run, whatever it is), increase embedding, or use one of those RS/US characters?
Otherwise, what are RS and US intended for?
Ok, first off so we know what we are talking about, it is this right click menu which you can get to from most EDIT controls:
Now, let me take care of a few misconceptions....
Every character in Unicode has a Bidirectional class value in the Unicode Character Database. The possible values (listed in ucd.html and explained in UAX #9 - The Bidirectional Algorithm) are:
Type  Description
L     Left-to-Right
LRE   Left-to-Right Embedding
LRO   Left-to-Right Override
R     Right-to-Left
AL    Right-to-Left Arabic
RLE   Right-to-Left Embedding
RLO   Right-to-Left Override
PDF   Pop Directional Format
EN    European Number
ES    European Number Separator
ET    European Number Terminator
AN    Arabic Number
CS    Common Number Separator
NSM   Non-Spacing Mark
BN    Boundary Neutral
B     Paragraph Separator
S     Segment Separator
WS    Whitespace
ON    Other Neutrals
For the two characters in question, the values are:
U+001e RECORD SEPARATOR B
U+001f UNIT SEPARATOR S
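For anyone who wants to check these values without digging through the Unicode Character Database files, here is a minimal sketch using Python's standard `unicodedata` module. The `CHARS` table and the `bidi_classes` helper are names made up for this illustration; LRM and RLM are included only for comparison with the characters Ilya mentions.

```python
import unicodedata

# Characters discussed in the post, plus the two directional marks
# for comparison. The short labels are just keys for this sketch.
CHARS = {
    "RS": "\u001e",   # record separator
    "US": "\u001f",   # unit separator
    "LRM": "\u200e",  # LEFT-TO-RIGHT MARK
    "RLM": "\u200f",  # RIGHT-TO-LEFT MARK
}


def bidi_classes(chars=CHARS):
    """Return the Bidirectional class of each character, keyed by label."""
    return {label: unicodedata.bidirectional(ch) for label, ch in chars.items()}


if __name__ == "__main__":
    for label, bidi_class in bidi_classes().items():
        print(f"{label}: U+{ord(CHARS[label]):04X} has Bidi class {bidi_class}")
```

The results depend on the version of the Unicode data bundled with the Python build, though the classes of these four characters have been stable for a long time.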
Now if you look at UAX #9 it does contain clear information on the effect that these two Bidi categories have on bidirectional text.
The categories can even be useful in the context that Ilya refers to later in his question (mixed runs of LTR and RTL text with surrounding neutral characters such as parentheses), with the added advantage that they are not visible so you do not have to worry about them.
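As a rough sketch of what that could look like for the caption scenario in the question, the snippet below places U+001F between two independently authored strings before they are handed to a single drawing call. The helper names, the default " - " separator, and the choice to put the invisible character right before the visible separator are all assumptions made for illustration; nothing here is taken from how Internet Explorer or any other application actually composes its caption, and how a particular renderer lays out the result is something to verify rather than assume.

```python
US = "\u001f"  # UNIT SEPARATOR, Bidi class S (segment separator)


def join_independent(parts, visible_sep=" - "):
    """Join strings that stand on their own, with an invisible US
    placed immediately before each visible separator.

    Hypothetical helper: the placement of US is an assumption of this
    sketch, not a documented recipe.
    """
    return (US + visible_sep).join(parts)


def strip_separators(text):
    """Remove the invisible separators again, e.g. before logging or
    comparing the text with the original strings."""
    return text.replace(US, "")


if __name__ == "__main__":
    caption = join_independent(["page title", "customizable caption"])
    # repr() makes the otherwise invisible character show up as \x1f.
    print(repr(caption))
```

Keeping a `strip_separators` counterpart around is a deliberate choice: since the character is invisible, it is easy to forget it is in the string when the text is later searched, compared, or copied.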
(Of course this can also be a disadvantage since not even the Show Unicode control characters entry on the right click menu will show any visible indication that they are there; like the wind, you can only see the effects -- a small oversight in the functionality, in my opinion!)
There is unfortunately no good answer to that later question other than to try to avoid the situation by creating strings that do not have the neutral characters that might not render the way you would like them to.
It even indirectly relates to my concerns about assuming the format of Windows language strings, since they may one day be changed to get away from this particular bug popping up at different times with bidirectional text....
This post brought to you by "Ὗ" (U+1f5f, a.k.a. GREEK CAPITAL LETTER UPSILON WITH DASIA AND PERISPOMENI)