Fight the Future? (#2 of ??), aka Spot the Bug(s)!

by Michael S. Kaplan, published on 2008/04/04 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/04/04/8355738.aspx

Content of Michael Kaplan's personal blog not approved by Microsoft (see disclaimer)!
Regular readers should keep in mind that all I said in The End? still applies; the allusion to the X-Files continues for people who understand such references....

So anyhow, the other day developer Ben Karas (he of The Great Flying Tortoise) who I have mentioned before, sent a fun "find the bug" code problem to one of those aliases inside Microsoft that gets those kinds of problems.

It had some interesting issues from my point of view so I thought I would ask the problem again here.

If you internal to Microsoft and you answer the question anyway than you kind of officially suck -- though several issues were missed when the answers were given so if you were watching that thread but knew about bugs that never came up then you neatly manage to avoid suckage. :-)

So here we go. The question -- can you spot the bugs in this snippet?

if( _plv->_ci.dwExStyle & WS_EX_RTLREADING)
{
    if (item.pszText)
    {
        //
        // temp hack for the find.files to see if LtoR/RtoL mixing
        // works. if ok, we'll take this out and make that lv ownerdra
        //
        if ((item.pszText[0] != '\xfd') && (item.pszText[lstrlen(item.pszText)-1] != '\xfd'))
        {
            textflags |= SHDT_RTLREADING;
        }
    }
}

You can assume the code is compiled with UNICODE/_UNICODE defined, like all code should be.

Now you can name any bug you see, but of course when I start blathering Saturday or Monday or whenever, I'll focus on the issues that interest me. :-)

I'll moderate all posts until the answers are given....

This blog brought to you by ? (U+003f, aka QUESTION MARK)

# Tim Smith on 4 Apr 2008 10:17 AM:

Lets get one of the basic bugs out of the way...

What if lstrlen(x) == 0?

# Tim Smith on 4 Apr 2008 10:33 AM:

Ok, I'm going to go out on a limb here since I don't really get into LTR/RTL detection.

There seems to be one real problem and then a question about completeness.

1) If the program is in UNCODE, then aren't they using the wrong codes? Isn't FD/FE only for 8-bit latin/hebrew? The proper codes would be 200E and 200F.

2) Why are we only checking for LRM? I guess since we have already specified RTLREADING, then we are looking for the exception.

# Mihai on 4 Apr 2008 12:49 PM:

I am always split when you post a puzzle.

On one side I feel like commenting, on the other side I try to refrain and let it for potential interns.

Why spoil it for them? :-)

# Michael S. Kaplan on 4 Apr 2008 1:36 PM:

Mihai, there is no spoiling here -- I am not showing any comments with answers until after the next post -- so anyone can put in any issues they say without hurting their colleagues (or their interns!).

# John Hensley on 4 Apr 2008 2:00 PM:

When the string pszText is zero length, pszText[lstrlen(pszText)-1] will be addressing memory before the actual first character of the string, instead of what the developer assumed would be the last character in the string.

# Mihai on 4 Apr 2008 2:02 PM:

Ok, then here it is (the one I think you care about):

0xFD is LTR Mark in cp1256 (Arabic), not in Unicode. The code currently tests for ý :-)

And even using U+200E would not help much, because the LTR does not have to be the first/last character.

# Michael S. Kaplan on 4 Apr 2008 5:34 PM:

Okay, update for everyone --

A few responses so far (four to the post and seven to the Contact link), but no one yet has found what Ben referred to as the "main bug" from the non-internationalization side (though I would tend to think of it as having certain internationalization consequences).

And although some people saw a specific valid problem and pointed it out, no one has suggested what the code ought to be doing. Though I am still hopeful that we will come up with more responses! :-)

# Mihai on 4 Apr 2008 7:28 PM:

Well, I only commented about "the issues that interest me. :-)"

There are other issues, like for instance:

item.pszText[lstrlen(item.pszText)-1]

Guess what happens for an empty string?

item.pszText[-1]

Oups!

# John Hensley on 6 Apr 2008 2:25 PM:

The function is attempting to check for a unicode Left-to-right mark at the beginning and end of the item text string and then setting the SHDT_RTLREADING bit in textflags if the Left-to-right mark is not at the beginning or end of the string.

Bug 1 - "item.pszText[lstrlen(item.pszText)-1] != '\xfd'))" reads memory outside of the string if the string is zero length.

Bug 2 - SHDT_RTLREADING should not be set if the language does not support reading-order-alignment because WS_EX_RTLREADING will be ignored if the language does not support reading-order-alignment.

Bug 3 - GetStringTypeEx() should have been used to determine the directional information for the first and last characters in the string. This would eliminate bugs #2.

Possible bug 4 - Unicode strings can have bi-directional information embedded anywhere within the string not just at the beginning and end.

... John

# Henry D. on 6 Apr 2008 3:45 PM:

Isn't there a signed/unsigned problem with the two comparisons in the innermost if statement?

This may not apply if they set the right compiler flags but I think the default configuration in VC++ will have problems.

# Michiel on 7 Apr 2008 5:00 AM:

Just an assumption here, but does this code reuse ý for markup? It seems to use "ý...ý" as an indication that text should _not_ be Right-to-Left, despite the flag.

Now, that's definitely marked as a hack, and in RTL text I would not expect a string to contain ý, let alone at both ends.

Personally, I'd guess that both if-statements were added for this hack, and the textflags |= SHDT_RTLREADING was originally there. Now, if there is no text set, the SHDT_RTLREADING is no longer set even though it should (but since that's an MS-internal flag, I can't tell what other uses it has)

# Arun Philip on 7 Apr 2008 5:55 AM:

Couple of guesses:

* lstrlen assumes a null-terminated string

* Endianness of the data while checking for 0xFD? For the small fraction of Itaniums running in big-endian

* BIDI text can be entered within the text also by introducing 0xFD and 0xFE, so checking the beginning/end alone will not suffice

Aside: LtoR/RtoL -> amusing alternative to LTR/RTL, although my mind first sees Tolkien references!

# Mike Dimmick on 7 Apr 2008 6:28 AM:

a) Why are they looking for ý?

b) Why look for it at start and end of string only?

It looks like it's a special marker character, but I think they have flawed logic: they're expecting RTL text to appear in the buffer in LTR order, i.e. the first character printed (at the right-hand-end) appears at the right-hand-end of the buffer. Wrong. Text in a Unicode buffer always appears in reading order, even if that order happens to be right-to-left.

Let's, for the purpose of an example, give the latin/arabic numbers a right-to-left reading property. If the buffer was:

ABC 123 XYZ

it would be displayed as

ABC 321 XYZ.

If the buffer is right-to-left reading in the example, U+00FD always appears at the left.

In addition, the compiler will interpret \xFD with whatever code page it's interpreting the source file as, so it won't have a fixed meaning. I think there's also a chance it will sign-extend to U+FFFD (REPLACEMENT CHARACTER) rather than appear as U+00FD as the code does not use an L prefix on the character literal.

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2008/04/07 Fight the Future? (#8 of ??), aka The Bug(s) Spotted, aka Design flaws are worse than bugs

go to newer or older post, or back to index or month or day