Questions about the unit and record separators

by Michael S. Kaplan, published on 2005/12/06 07:31 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/12/06/500475.aspx


Ilya Konstantinov asked in the Suggestion Box:

When Complex Text Layout is enabled, TEXT controls get additional new items in the context menu, among which is Insert Unicode Control Character. Among the control characters, you have something called RS and US. I've found those characters in the Unicode specs (looks like something inherited from terminals) but the BiDi TR says nothing about their influence, whereas in practice, they seem to break the BiDi run, whatever direction it is -- kind-of like you'd get with an RLM-LRM combination.

Now, let's introduce another issue. It's not uncommon for applications to concatenate a couple of strings into a status line of some sort. For example, Internet Explorer might compose its window caption as:

[page title] - [(customizable) IE caption]

and then there's the common issue of the two concatenated strings "blending" according to BiDi reordering rules. This is of course not a good thing, since each of those strings stands on its own and is not a "continuation" of the other. I've noticed that many Microsoft apps *don't* exhibit this problem, and I was wondering: What is the Right way to avoid this problem (besides emitting the text in two different TextOut calls)? Do you insert LRM-RLM (in hope to break the run, whatever it is), increase embedding, or use one of those RS/US characters?

Otherwise, what are RS and US intended for?

Ok, first off so we know what we are talking about, it is this right click menu which you can get to from most EDIT controls:

Now, let me take care of a few misconceptions....

Every character in Unicode has a Bidirectional class value in the Unicode Character Database. The possible values (listed in ucd.html and explained in UAX #9 - The Bidirectional Algorithm) are:

Type  Description

L     Left-to-Right
LRE   Left-to-Right Embedding
LRO   Left-to-Right Override
R     Right-to-Left
AL    Right-to-Left Arabic
RLE   Right-to-Left Embedding
RLO   Right-to-Left Override
PDF   Pop Directional Format
EN    European Number
ES    European Number Separator
ET    European Number Terminator
AN    Arabic Number
CS    Common Number Separator
NSM   Non-Spacing Mark
BN    Boundary Neutral
B     Paragraph Separator
S     Segment Separator
WS    Whitespace
ON    Other Neutrals

For the two characters in question, the values are:

U+001e   RECORD SEPARATOR   B

U+001f   UNIT SEPARATOR     S
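
These values can be checked against the Unicode Character Database from a script. The following is a minimal sketch in Python, assuming the standard `unicodedata` module; the `CHARS` table and the `bidi_class` helper are names made up for this illustration. The labels are supplied by hand because control characters have no formal name in that database.

```python
import unicodedata

# The two control characters discussed in the post, labeled by hand
# (illustrative table, not part of any API).
CHARS = {
    "RECORD SEPARATOR": "\u001e",
    "UNIT SEPARATOR": "\u001f",
}

def bidi_class(ch):
    """Return the Bidi class short name (e.g. 'B', 'S', 'L') for one character."""
    return unicodedata.bidirectional(ch)

for label, ch in CHARS.items():
    print(f"U+{ord(ch):04X}   {label:<18} {bidi_class(ch)}")
```

The same helper can be pointed at LRM (U+200E) or RLM (U+200F) to compare them with RS and US.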

Now if you look at UAX #9 it does contain clear information on the effect that these two Bidi categories have on bidirectional text.
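
As one concrete case, rule P1 of UAX #9 splits the text into paragraphs at class B characters, keeping each separator with the paragraph it ends. Below is a rough sketch of just that step in Python; it is not a Bidi implementation, the `split_paragraphs` helper and the sample strings are invented for illustration, and it simplifies by treating every class B character (including each half of a CR LF pair) as its own separator.

```python
import unicodedata

def split_paragraphs(text):
    """Split text after every Bidi class B character (simplified rule P1)."""
    paragraphs = []
    current = []
    for ch in text:
        current.append(ch)
        if unicodedata.bidirectional(ch) == "B":
            # The separator stays with the paragraph it terminates.
            paragraphs.append("".join(current))
            current = []
    if current:
        paragraphs.append("".join(current))
    return paragraphs

# Hypothetical strings standing in for a page title and an application caption.
title = "\u05e9\u05dc\u05d5\u05dd"
caption = "Microsoft Internet Explorer"

# RS is class B, so the two strings land in separate paragraphs.
parts = split_paragraphs(title + "\u001e" + caption)
print(len(parts))

# US is class S, so it does not start a new paragraph.
print(len(split_paragraphs(title + "\u001f" + caption)))
```

How a given text control renders the result still depends on its own Bidi implementation, so this only illustrates the paragraph-splitting rule, not on-screen behavior.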

The categories can even be useful in the context that Ilya refers to later in his question (mixed runs of LTR and RTL text with surrounding neutral characters such as parentheses), with the added advantage that they are not visible so you do not have to worry about them.

(Of course this can also be a disadvantage since not even the Show Unicode control characters entry on the right click menu will show any visible indication that they are there; like the wind, you can only see the effects -- a small oversight in the functionality, in my opinion!)

There is unfortunately no good answer to that later question other than try to avoid the situation by creating strings that do not have the neutral characters that might not render the way you would like them to.

It even indirectly relates to my concerns about assuming the format of Windows language strings, since they may one day be changed to get away from this particular bug popping up at different times with bidirectional text....

 

This post brought to you by "Ὗ" (U+1f5f, a.k.a. GREEK CAPITAL LETTER UPSILON WITH DASIA AND PERISPOMENI)


# Ilya Konstantinov on 8 Dec 2005 4:23 AM:

"There is unfortunately no good answer to that later question other than try to avoid the situation by creating strings that do not have the neutral characters that might not render the way you would like them to."

That's not sufficient to avoid the problem. Even without the dash between the web page's title and the "Microsoft Internet Explorer" branding, the two could blend together in unexpected ways. The only way to avoid the problem is to break the BiDi run between concatenated strings, by using BiDi Control Characters and such.

And you didn't cover the other part of my question -- why were RS and US so important as to include them in this context menu?

# Michael S. Kaplan on 8 Dec 2005 7:22 AM:

If you read UAX#9 and see what those two Bidi categories can do, then the answer is obvious -- to have those two different effects on text....


