by Michael S. Kaplan, published on 2010/05/31 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/05/31/10017323.aspx
Okay, a little Spot the Bug for everyone on a Monday morning when one shouldn't need to be working anyway....
The other day a question came up about some code.
It looked something like this:
whcar_t* newParams = new wchar[MAX_PATH + MAX_PATH]; whcar_t* p = newParams; while (*params) { *p++ = *params; params = ::CharNext(params); } *p = 0; return newParams;
Can you figure out what this code is doing? Assume it is compiled with UNICODE (since the compile would fail, otherwise!).
Well, I know what it is doing. The key is trying to understand what it was trying to do.
You can probably gather why whatever they were trying to do is doomed to failure, of course.
Also, you can look at previous blogs like The torrents of breaking CharNext/CharPrev if you want a quick refresher, of course.
Ignore the bug mentioned there, and think about what CharNextW does.
So, what does the code do, and what did the original authors want it to do?
Aaron Ballman on 31 May 2010 7:52 AM:
I'll go for the pedantic answer of "it doesn't compile because you can't assign a "new wchar[]" to a "wchar_t *" (note the lack of _t). But this code is a poor replacement for wcscpy, basically. Except that the rate you are skipping bytes in the destination string may not match the rate in the source string. Aka, you may advance 2 bytes in the destination string, but 4 bytes in the source string (since you're pulling out characters instead of surrogate pairs).
Michiel on 31 May 2010 8:15 AM:
Since the point of CharNext is that CharNext(ch) != ch+1, copying only ch but not ch+1 is a bad idea. You can't split a surrogate like that. I've got no idea what the intent was.
Robert on 31 May 2010 8:21 AM:
The original intent seems to be copying a string. It's doomed because CharNext() may advance more than one single byte per call (due to Unicode) and thus the copy would end up with garbage (half-characters)
Simon Montagu on 31 May 2010 8:40 PM:
Never mind surrogate pairs for the moment. I think what the code was trying to do was copy the source string without diacritics -- in other words the writer knew that CharNext(ch) != ch + 1 and was trying to exploit that, maybe in order to be able to compare "cafe" to "café" and have them match or something similar.
If one could guarantee that the input was in normalization form D, it might even work. Some of the time.
Mihai on 2 Jun 2010 10:52 AM:
1. If they wanted to strip all accents, it will work for string using combining characters.
Otherwise they would have to perform a decomposition step before.
2. If they wanted "an array of user characters", it does not work because *params will only get you the first code point.
Not to mention that a "user character" (what the user considers to be a character) does not fit in a wchar_t
But based on the MAX_PATH thing, it looks like intention 1 is more likely.
The intention is probably wrong in both cases, so the execution almost does not matter anymore :-)
Doug Ewell on 2 Jun 2010 4:00 PM:
I don't suppose the spelling "whcar_t" has anything to do with this...
referenced by
2010/06/26 BUG SPOTTING answers