UCS-2 to UTF-16, Part 9: The torrents of breaking CharNext/CharPrev

by Michael S. Kaplan, published on 2008/12/16 00:01 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2008/12/16/9223301.aspx

It was way back in January of 2005 when I first mentioned that CharNext(ch) != ch+1 a lot of the time, explaining that the CharNext function did more than just increment a character pointer.

Then James pointed out in a comment (here) that, as it turns out, on Windows XP and Server 2003, that kind of was all that CharNext was doing.

Remember how I talked about the way that even though NLS did not own some of these USER functions, that we pretty much "owned" them since we control their behavior, in this post?

So the decision was made to fix the bug (restoring the old functionality from NT <= Windows 2000), and at the same time to look at the other common complaint related to surrogate pairs.

Vista, therefore, supports the old functionality and took steps to add the new functionality (not splitting surrogate pairs).

The code in there swapped the check for high and low surrogates such that it was always skipping the high surrogate and always returning the low surrogate -- which is exactly the opposite of the behavior you want.

Now no one found the bug because, as it turns out, the tested case (and admittedly the most common scenario for the function?), which is "a more linguistically appropriate lstrlenW based on user character principles", will still work here, even though a single call will return the wrong result when faced with a high surrogate.
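To make the swapped check and the "counting still works" observation concrete, here is a minimal sketch in portable C. It is not the Windows source: the names next_intended, next_swapped, and count_steps are hypothetical, uint16_t stands in for WCHAR, combining characters are ignored, and the model (step one unit, then test the unit you landed on) is an assumption about the shape of the bug rather than a description of the real code.

```c
#include <stddef.h>
#include <stdint.h>

/* uint16_t stands in for WCHAR (one UTF-16 code unit). */
static int is_high_surrogate(uint16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
static int is_low_surrogate(uint16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

/* Hypothetical model of the intended behavior: step one unit, and if that
   lands on a low (trailing) surrogate, step past it so a pair is not split. */
static const uint16_t *next_intended(const uint16_t *p)
{
    if (*p == 0) return p;              /* stay on the terminator */
    ++p;
    if (is_low_surrogate(*p)) ++p;
    return p;
}

/* Same model with the check swapped: the high surrogate is skipped and the
   pointer comes to rest on the low surrogate, in the middle of the pair. */
static const uint16_t *next_swapped(const uint16_t *p)
{
    if (*p == 0) return p;
    ++p;
    if (is_high_surrogate(*p)) ++p;
    return p;
}

typedef const uint16_t *(*next_fn)(const uint16_t *);

/* "lstrlenW by user characters": count how many steps reach the terminator. */
static size_t count_steps(const uint16_t *s, next_fn next)
{
    size_t n = 0;
    while (*s) {
        s = next(s);
        ++n;
    }
    return n;
}
```

For the units 'A', 0xD83D, 0xDE00, 'B', a single call from 'A' gives different answers (the intended version stops on the high surrogate, the swapped version on the low surrogate), yet both versions take three steps to reach the end. In this particular model the totals only agree when the pair follows another character, so treat it as an illustration of how a length-style test could miss the bug, not as a statement about every input.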

What happens in the next version, and/or possibly in the next service pack of Vista/Server 2008?

2) Do we fix the bug with supplementary characters so that both they and the combining characters case will both work?

3) Do we give up on both and go back to the XP level behavior, which even though it was a regression from prior versions does represent a very popular platform?

4) Do we give up on trying to do anything here and just leave it broken as it is now, and perhaps in some unknown future version (it is a bit late in the cycle to start designing all new features) look into all new solutions to the problem(s) once they are identified?

Now the order of these four choices, due to the way the code is written and under the principle of minimal change, is technically in order from most difficult to least difficult. Though really the amount of difficulty involved here is not that much even as you move across all four options, so that does not really provide very much insight into a triage process.

In terms of platform popularity, I don't think there are many people outside of fans of the Windows "Mohave" commercials who would claim that XP isn't the most popular platform -- which does suggest that #3 is worth considering, at least.

My personal preference would be #2 since it is "the right thing to do", though when you have behavior that has been changing every few versions it might perhaps be better to take some time to think about the backward compatibility issues before concluding that "the right thing to do" and "the best thing to do" are necessarily the same.

But let's assume that a certain number of developers have noticed the odd behavior and chosen to work around it in their own code. Plenty of people do that, and many of them are either too cynical to report the bug or don't know of a good way to make the report. Or they just don't like Microsoft -- it happens.

The ones who just decide the function is unreliable and write their own can be removed from our consideration here, since even though they may be right, they will not be broken if the behavior changes. So we'll leave them aside for a moment.

If we don't want to break the people who found the bug and worked around it, we'd have to assume that they were essentially detecting the case where CharNext or CharPrev incorrectly return a high surrogate value (whether using the IS_HIGH_SURROGATE macro or simple range checking or whatever), and then doing an additional increment/decrement in those cases.

Perhaps they feel that their code was a really good idea since it will even "fix" prior versions like NT 4.0 or Windows 2000 or XP and thus they feel they are "future-proof" since no right-thinking developer would break against every version (note that none of the above solutions do that!).

Now if history is to be a guide, people might not do the full job here -- they might not be detecting the errant cases like unpaired surrogates or multiple high surrogates, so it might just be blind one WCHAR increment/decrement.

And some might go even farther and validate that a valid surrogate pair exists, which is not something Windows necessarily does but isn't unreasonable here.
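As a rough illustration of what "validating that a valid surrogate pair exists" could look like, here is a small sketch in portable C. The function name is_well_formed_utf16 is hypothetical, uint16_t stands in for WCHAR, and the input is assumed to be null-terminated. It rejects exactly the errant cases mentioned above: an unpaired surrogate and multiple high surrogates in a row.

```c
#include <stdint.h>

/* uint16_t stands in for WCHAR (one UTF-16 code unit). */
static int is_high_surrogate(uint16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
static int is_low_surrogate(uint16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

/* Returns 1 if every surrogate in the null-terminated string s is part of a
   high+low pair, 0 otherwise. Hypothetical helper, not a Windows API. */
static int is_well_formed_utf16(const uint16_t *s)
{
    while (*s) {
        if (is_high_surrogate(s[0])) {
            /* s[1] is safe to read: s[0] is not the terminator. */
            if (!is_low_surrogate(s[1]))
                return 0;       /* lone high, or two highs in a row */
            s += 2;             /* step over the whole pair */
        } else if (is_low_surrogate(s[0])) {
            return 0;           /* low surrogate with no high before it */
        } else {
            ++s;                /* ordinary BMP code unit */
        }
    }
    return 1;
}
```

A caller that does only the blind one-WCHAR adjustment skips this kind of check entirely, which is why the two groups of workarounds can behave differently on malformed input.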

But note that even in all of the above potential circumstances, the full fix described above in #2 is still entirely safe since the function would never return a character that was a high surrogate.

On balance, my gut feeling is that #2 would be the best thing to do (in the next version of Windows and possibly even in future Vista/Server 2008 service packs). That feeling rests mainly on it being the right thing to do, but it also appears to be the best solution for technical reasons.

I mean, as UTF-16 detection mechanisms go, the best that can be said about CharNext and CharPrev is that they [sometimes, in some versions] work. Which is not saying much, but is saying something, at least. It is better at least in the abstract to improve with each version, in my opinion....

Though perhaps others would analyze the situation and circumstances and come to a different conclusion.

This post brought to you by å (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE)
A character that is downright snooty about the fact that it has none of the problems mentioned in this article.
Despite the fact that a circle almost bigger than its body is super-glued to its head, something that would make me feel at least a little self-conscious.

On my side, I have my own.

The problem is, I cannot advise people on anything.

I used to say: use CharNext, it's safe. But not anymore :-(

Thing is, I would rather go with 4.

A function that changed behavior 3 times in 3 successive versions of Windows is so unreliable that it is not trustworthy anymore.

How can I advise someone to use this? What is the story? "It will be wrong in XP, Vista, and Server 2003, but will work in Vista SPx and Win 7?"

Should one detect the Win version and use the stock API or their own, depending on the result? Then why not always use their own? At least it is guaranteed to work the same on all platforms, so if you test on one you know it works on all.

In general it is easier to say "don't use this, never ever" instead of "it is safe to use on xyz, but not on abc".

People don't write code for xyz, they want their applications to run on multiple systems, as long as they are popular.

One of the things preventing (in my opinion) the faster adoption of new APIs is the lack of backport libraries.

Examples:

- Unicode: MSLU was too late

- MUI: the backport is not 100% compatible with the new API (folder naming convention is not portable)

- string locale id: there is no backport, just a mapping API between string and numeric locale identifiers

So I would say the best solution I can see is:

- Advise against CharNext/CharPrev

- Add a new API that is native in Win X (whatever version that is) and available as a back-ported library for everything between XP and Win X

I know that backporting is not in general a good idea (otherwise why would people migrate to the latest and greatest?).

But for some core, infrastructure stuff that people are really encouraged to use, it is better to backport. It will make adoption faster.

Let's keep the backport issue separate for a moment -- I happen to agree with you there, but there are at times powerful forces arrayed against it (I doubt, for example, that another XP SP is planned?).

And let's keep the new version, new feature issue separate too. That is also something I agree with, but again if not backported far down enough, it won't help much for years anyway.

But the real question is whether to leave the two functions that are literally broken in a specific way, knowing people will keep using them and never fixing them where they are broken, before they cause real problems (which in theory they can now)...

I'm not convinced what the best overall strategy would be (though I am fairly close to where you are in that last comment!), but the bug is not strategy, it's tactics. From a tactical point of view, fixing things on a less popular OS before a potentially more popular OS inherits the bug does have a specific visceral appeal to me.

But I talk as a potential beneficiary of this: this has been broken for about 10 years. It is dead and buried. I don't use it, and I recommend against it.

Something new backported as a library (not SP!) and deprecating the old API feels like the most useful thing.

I just don't see much benefit in fixing something that did not work for close to 10 years now. Stuff is broke, but fixing it might break existing software that was tested and works with the broken stuff.

I don't know what the standard MS approach is on this kind of stuff.

But I am trying to think a bit like Raymond :-)

Let's take as an example NLSDL (download here) -- only good for XPSP2 and higher, Server 2003 SP1 and higher. How far downlevel would this (technically non-NLS) set of functions have to go? There are complicated issues there, and no solution is likely until at least Win8, by which time it might be very hard to be allowed to support even XP, let alone 2000....

And in the meantime the two functions that exist stay wrong, for the next generation of developers. :-(

I would look at the statistics: what is out there, installed.

Correlated with the official support period.

Gut feeling: XP SP2 might be enough (it also depends when the thing gets released).

The next generation of developers should not use it.

It is deprecated; mark it as such.

How is one supposed to write code that works on Win 8, Win 7, Vista, and (maybe) XP?

1. New API + static library. The static lib is the backport doing the right thing on old win, and calls the native stuff on Win 8

2. Old API and force all users to install the latest SP (with the risk of breaking old applications that don't use the library)

3. Old API + backport lib? Is the lib supposed to "override" the system API? This one seems the most "unclean" version.

I kind of favor 1. But I think someone like Raymond Chen can offer better advice than me, really...

Right now my advice for devs is: write your own stuff and make sure to use it everywhere. This would allow fixing stuff in one single place when something gets fixed in Win.
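For what "write your own stuff" might look like, here is a minimal sketch in portable C of surrogate-aware next/previous helpers. The names my_char_next and my_char_prev are hypothetical, uint16_t stands in for WCHAR, and the sketch deliberately handles only surrogate pairs; the combining-character behavior discussed in the post is left out. An unpaired surrogate is treated as a single unit, which is an assumption of this sketch rather than documented Windows behavior.

```c
#include <stdint.h>

/* uint16_t stands in for WCHAR (one UTF-16 code unit). */
static int is_high_surrogate(uint16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
static int is_low_surrogate(uint16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

/* Advance past one code point: two units for a well-formed pair, one unit
   otherwise. Like CharNext, it does not move past the terminator. */
static const uint16_t *my_char_next(const uint16_t *p)
{
    if (*p == 0)
        return p;
    if (is_high_surrogate(p[0]) && is_low_surrogate(p[1]))
        return p + 2;
    return p + 1;
}

/* Step back one code point without going before start. If the unit before
   p is the low half of a well-formed pair, land on the high half instead. */
static const uint16_t *my_char_prev(const uint16_t *start, const uint16_t *p)
{
    if (p <= start)
        return start;
    --p;
    if (p > start && is_low_surrogate(p[0]) && is_high_surrogate(p[-1]))
        --p;
    return p;
}
```

Because these helpers depend only on the code units in the buffer, they behave the same on every platform, and keeping them in one place leaves a single spot to change if the system functions ever become trustworthy.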

"In terms of platform popularity, I don't think there are many people outside of fans of the Windows "Mohave" commercials who would claim that XP isn't the most popular platform -- which does suggest that #3 is worth considering, at least."

Reminds me of this comment:

http://blogs.msdn.com/michkap/archive/2008/11/11/9059132.aspx#9088973

my vote (same as Mihai):

- new static lib like NLSDL

- don't touch the current APIs - keep as-is in Win7 and Win8

- appcompat db for xp behavior

I'm new to this surrogate pair thing, so bear with me. ;)

I'm confused by these two seemingly contradictory statements:

"The code in there swapped the check for high and low surrogates such that it was always skipping the high surrogate and always returning the low surrogate -- which is exactly the opposite of the behavior you want."

"If we don't want to break the people who found the bug and worked around it, we'd have to assume that they were essentially detecting the case where CharNext or CharPrev incorrectly return a high surrogate value"

The first statement says that it's incorrect for CharNext/CharPrev to return the low surrogate value; the second says that it's incorrect to return the high surrogate value! Which is it? (Or am I misunderstanding all of this?)

The first statement makes more sense to me since my gut tells me that CharNext/CharPrev ideally would return the leading code unit of a surrogate pair, and I'm assuming that the leading code unit is the "high" one.

Hi James,

It is incorrect to return the low surrogate. What I was trying to convey (perhaps unclearly) was the notion that people who assumed we were going to stay wrong by detecting we were returning the wrong character would advance an extra WCHAR to try to make up for the bug....