GetStringTypeW almost understands the Bidirectional Algorithm

by Michael S. Kaplan, published on 2006/10/07 12:51 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/10/07/801030.aspx

Since we're gonna talk about bidirectional text, it makes sense to start with UAX #9 (The Bidirectional Algorithm). Specifically, there is a table in there that has all of the bidirectional character types:

Table 3-7. Bidirectional Character Types

Category	Type	Description	General Scope
Strong	L	Left-to-Right	LRM, Most alphabetic, syllabic, Han ideographic characters, digits that are neither European nor Arabic, ...
Strong	LRE	Left-to-Right Embedding	LRE
Strong	LRO	Left-to-Right Override	LRO
Strong	R	Right-to-Left	RLM, Hebrew alphabet, most punctuation specific to that script, ...
Strong	AL	Right-to-Left Arabic	Arabic, Thaana, and Syriac alphabets, most punctuation specific to those scripts, ...
Strong	RLE	Right-to-Left Embedding	RLE
Strong	RLO	Right-to-Left Override	RLO
Weak	PDF	Pop Directional Format	PDF
Weak	EN	European Number	European digits, Eastern Arabic-Indic digits, ...
Weak	ES	European Number Separator	Plus Sign, Minus Sign
Weak	ET	European Number Terminator	Degree, Currency symbols, ...
Weak	AN	Arabic Number	Arabic-Indic digits, Arabic decimal & thousands separators, ...
Weak	CS	Common Number Separator	Colon, Comma, Full Stop (Period), Non-breaking space, ...
Weak	NSM	Non-Spacing Mark	Characters marked Mn (Non-Spacing Mark) and Me (Enclosing Mark) in the Unicode Character Database.
Weak	BN	Boundary Neutral	Most formatting and control characters, other than those explicitly given types above.
Neutral	B	Paragraph Separator	Paragraph Separator, appropriate Newline Functions, higher-protocol paragraph determination.
Neutral	S	Segment Separator	Tab
Neutral	WS	Whitespace	Space, Figure Space, Line Separator, Form Feed, General Punctuation Spaces, ...
Neutral	ON	Other Neutrals	All other characters, including OBJECT REPLACEMENT CHARACTER.

Now one of the interesting parts about this table is how there are specific types listed that only contain a single control character in them, basically LRE/LRO/RLE/RLO/PDF. Now let's compare the items in this table to the CT_CTYPE2 information in the GetStringTypeW function (which, as I have stated previously, is the best version of the function to call, no matter what the documentation may say!):

Ctype 2
These types support proper layout of Unicode text. The direction attributes are assigned so that the bi-directional layout algorithm standardized by Unicode produces accurate results. These types are mutually exclusive. For more information about the use of these attributes, see The Unicode Standard: Worldwide Character Encoding, Volumes 1 and 2, Addison Wesley Publishing Company: 1991, 1992, ISBN 0201567881.

Name	Value	Meaning
Strong
C2_LEFTTORIGHT	0x0001	Left to right
C2_RIGHTTOLEFT	0x0002	Right to left
Weak
C2_EUROPENUMBER	0x0003	European number, European digit
C2_EUROPESEPARATOR	0x0004	European numeric separator
C2_EUROPETERMINATOR	0x0005	European numeric terminator
C2_ARABICNUMBER	0x0006	Arabic number
C2_COMMONSEPARATOR	0x0007	Common numeric separator
Neutral
C2_BLOCKSEPARATOR	0x0008	Block separator
C2_SEGMENTSEPARATOR	0x0009	Segment separator
C2_WHITESPACE	0x000A	White space
C2_OTHERNEUTRAL	0x000B	Other neutrals
Not applicable
C2_NOTAPPLICABLE	0x0000	No implicit directionality (for example, control codes)
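One way to line the two tables up is a lookup from each UAX #9 type to the C2_* value with the matching name. Here is a minimal Python sketch of that comparison. The pairing is an assumption read off the names in the two tables (including folding AL into C2_RIGHTTOLEFT), not something the function's documentation states, and `UAX9_TO_C2` and `unmatched_types` are hypothetical names made up for illustration.

```python
# C2_* values copied from the CT_CTYPE2 table.
C2_NOTAPPLICABLE = 0x0000
C2_LEFTTORIGHT = 0x0001
C2_RIGHTTOLEFT = 0x0002
C2_EUROPENUMBER = 0x0003
C2_EUROPESEPARATOR = 0x0004
C2_EUROPETERMINATOR = 0x0005
C2_ARABICNUMBER = 0x0006
C2_COMMONSEPARATOR = 0x0007
C2_BLOCKSEPARATOR = 0x0008
C2_SEGMENTSEPARATOR = 0x0009
C2_WHITESPACE = 0x000A
C2_OTHERNEUTRAL = 0x000B

# All types from the UAX #9 table, in table order.
UAX9_TYPES = [
    "L", "LRE", "LRO", "R", "AL", "RLE", "RLO",
    "PDF", "EN", "ES", "ET", "AN", "CS", "NSM", "BN",
    "B", "S", "WS", "ON",
]

# Assumed pairing, made purely by name. This is not documented behavior.
UAX9_TO_C2 = {
    "L": C2_LEFTTORIGHT,
    "R": C2_RIGHTTOLEFT,
    "AL": C2_RIGHTTOLEFT,  # assumption: there is no separate Arabic-letter value
    "EN": C2_EUROPENUMBER,
    "ES": C2_EUROPESEPARATOR,
    "ET": C2_EUROPETERMINATOR,
    "AN": C2_ARABICNUMBER,
    "CS": C2_COMMONSEPARATOR,
    "B": C2_BLOCKSEPARATOR,
    "S": C2_SEGMENTSEPARATOR,
    "WS": C2_WHITESPACE,
    "ON": C2_OTHERNEUTRAL,
}


def unmatched_types():
    """Return the UAX #9 types that have no same-named C2_* value."""
    return [t for t in UAX9_TYPES if t not in UAX9_TO_C2]


print(unmatched_types())
# prints ['LRE', 'LRO', 'RLE', 'RLO', 'PDF', 'NSM', 'BN']
```

The sketch only says which UAX #9 types have no named slot in CT_CTYPE2. It makes no claim about what the function actually returns for NSM or BN characters.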

Now obviously the bit with pointing to version 1.0 of the Unicode Standard, ISBN and all, is something of an issue -- in XP and Server 2003 it should be talking about Unicode 3.0 (ISBN 0201616335) and in Vista it should be talking about Unicode 5.0 (ISBN 0321480910) and to be frank it should just be pointing to the online version and its data files anyway. :-)

Now obviously if one is using a character that is newer than the version of the OS it will also show up in C2_NOTAPPLICABLE and thus it is obvious that the documentation should point out that this category includes characters that the function does not know about. But that is a minor doc issue that can be fixed whenever.

But look at the other problem -- that C2_NOTAPPLICABLE category actually contains those control characters with strong directionality (LRE/LRO/RLE/RLO) and the one with weak directionality (PDF), even though it claims everything in that category has no implicit directionality.
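For anyone who wants to look at those five characters directly, here is a rough Python sketch. It prints the UAX #9 type that Python's `unicodedata` module reports for each one and, when run on Windows, also asks GetStringTypeW for the CT_CTYPE2 value through `ctypes`. The helper names `uax9_types` and `windows_ctype2` are hypothetical, the `unicodedata` result depends on the Unicode version bundled with your Python, and the sketch assumes BMP-only input so that `len(text)` equals the number of UTF-16 code units.

```python
import ctypes
import sys
import unicodedata

CT_CTYPE2 = 0x00000002  # dwInfoType value that selects the Ctype 2 data

# The five explicit directional formatting controls.
CONTROLS = {
    "LRE": "\u202a",
    "RLE": "\u202b",
    "PDF": "\u202c",
    "LRO": "\u202d",
    "RLO": "\u202e",
}


def uax9_types(text):
    """Return the Unicode bidi class of each character, per unicodedata."""
    return [unicodedata.bidirectional(ch) for ch in text]


def windows_ctype2(text):
    """Return GetStringTypeW CT_CTYPE2 values, or None when not on Windows.

    Assumes text contains only BMP characters.
    """
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    get_string_type = kernel32.GetStringTypeW
    # BOOL GetStringTypeW(DWORD, LPCWCH, int, LPWORD)
    get_string_type.argtypes = [
        ctypes.c_uint32,
        ctypes.c_wchar_p,
        ctypes.c_int,
        ctypes.POINTER(ctypes.c_uint16),
    ]
    get_string_type.restype = ctypes.c_int
    out = (ctypes.c_uint16 * len(text))()
    if not get_string_type(CT_CTYPE2, text, len(text), out):
        raise ctypes.WinError(ctypes.get_last_error())
    return list(out)


for name, ch in CONTROLS.items():
    api_value = windows_ctype2(ch)
    shown = "n/a (not Windows)" if api_value is None else f"0x{api_value[0]:04X}"
    print(f"U+{ord(ch):04X} {name}: UAX #9 type {uax9_types(ch)[0]}, CT_CTYPE2 {shown}")
```

If the function behaves as described here, the CT_CTYPE2 column should show 0x0000 (C2_NOTAPPLICABLE) for all five on Windows, but that is something to check on your own system rather than take from this sketch.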

Luckily, the code in GDI carries around its own tables so it is not making decisions based on this, though it is interesting to argue about whether this should be fixed or not. Yes, someone may be depending on the old values, but the whole point of the function is to provide information supplied by the Unicode Standard!

Wouldn't it be better to add some new values and update the value for these five characters? I mean, we do not hesitate to add new values, and if something changes within the Unicode Standard (not often, but it can happen) then we update. So we know where the bar is set.

If we can update due to external changes, doesn't that mean we should also be updating due to internal bugs?

Or should this case, on the assumption that any working implementation using CT_CTYPE2 values is already handling this case, be treated as a documentation issue for anyone else stumbling onto the function and wanting to use it?

This post brought to you by U+202c (a.k.a. PDF, a.k.a. POP DIRECTIONAL FORMATTING)

# Simon Montagu on 7 Oct 2006 2:47 PM:

"Now obviously if one is using a character that is newer that the version of the OS it will also show up in C2_NOTAPPLICABLE."

That's not obvious to me, in fact I would say it's a bug. The second bullet point below that table in the UBA says "Unassigned characters are given strong types in the algorithm. This is an explicit exception to the general Unicode conformance requirements with respect to unassigned characters."

# Michael S. Kaplan on 7 Oct 2006 3:06 PM:

Hi Simon,

Well, since they have also included the strong Cf characters, perhaps the function is on the right track and you could just consider that a doc bug?

Though that still leaves one not knowing what to do with the Cf characters with different strong directionality than the block would give them by default....

# Nick Lamb on 8 Oct 2006 4:25 AM:

The control characters you mentioned don't have implicit directionality, they have explicit directionality. And Simon is correct to tell you that unassigned characters have strong types.

As far as I can see these features are necessary to a useful implementation of UAX #9, but as you've explained previously you prefer Microsoft's implementation, which e.g. always ignores P2 and P3. The Unicode tables as provided aren't necessary for such a weak or incomplete implementation, so presumably Microsoft should be maintaining their own tables for this purpose (certainly some other users of the tables think so...)

# Michael S. Kaplan on 8 Oct 2006 4:47 AM:

Huh? Nick, I do not have a clue what you are talking about, as usual.

I was making the point that grouping:

* ON
* Unassigned
* Cf characters of strong and weak directionality

In one category is a bad thing in this function. Are you disagreeing with that or agreeing with that?

I also pointed out that the MS implementations are not using this function's data, they are doing their own thing. Do you disagree with that?

I understand you like to disagree, and I understand that every time I talk about Bidi you plan to make sure to point out that Microsoft does not handle Bidi correctly according to you and that the only person who knows less than Microsoft about Bidi is me. But could you at least address the points I am raising rather than your own agenda? I mean, really!

# Nick Lamb on 8 Oct 2006 7:01 AM:

Hmm, point by point...

* It is a bug to disobey the clear instruction in UAX#9 to include directionality for unassigned characters, and that bug should be fixed (but we won't hold our breath as usual)

* Other neutrals and explicit control characters belong together for the purpose of UAX #9 paragraph level direction. You could give them separate CT_CTYPE2 values but it seems like unnecessary clutter which also might interfere with any currently functional applications using this API.

* Microsoft should implement UAX #9 as completely as possible, because it's what users prefer (as at least one other user has pointed out to you before) and therefore this table and API call would be useful once the unassigned characters are fixed, although any implementation is acceptable if it produces the correct results.

* Currently Windows doesn't provide any equivalent to the paragraph order P2 and P3 stages of UAX#9 even when it has no meaningful higher protocol. Instead paragraph level is always overridden.

* It's pretty easy to find or write regression tests for UAX #9 (you can't directly use the tests in the document itself because they've been transformed into ASCII for readability) and see whether you get P2 and P3 right, so that isn't a matter of opinion.

This is a marginal case for me, users frustrated by Microsoft's BiDi implementation don't lose data but they may create documents with redundant control codes, or require a higher level protocol to be used because the rendering in Windows is incorrect with plain text. As it is, it's hard even to be convinced that this is deliberate rather than, as I originally suggested so long ago, just another bug.

# Michael S. Kaplan on 8 Oct 2006 7:22 AM:

Sigh....

So, the only real problem that has anything to do with this post is the second one.

Every other bullet point is unrelated to this post, which FWIW makes me wonder how you feel about how so many of the changes in UAX #9 over the last few years have been to bring it more in line with Microsoft's implementation. Though not enough to want to bother listening to you rant again, of course. :-)

Per your second point, everything is actually fine as it is in this function other than various doc issues, such as the doc description of C2_NOTAPPLICABLE. By your estimation, not even a code change would be required to take care of the issues IN THIS POST.

Please do not take every time I mention Bidi as your chance to get on your soapbox about Microsoft's evil implementation. It's old, it's tired, it's offtopic. And from now on the editing scissors will be applied as needed to keep your Bidi frothing on-topic....

Though I look forward to your own "Microsoft's raping and pillaging of Bidi prove it is an evil company" blog and will happily subscribe to it as soon as it has been set up. I'll even link to it, if you promise to make it entertaining.
