We broke CharNext/CharPrev (or, bugs found through blogging?)

by Michael S. Kaplan, published on 2005/01/30 13:42 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/30/363420.aspx


(special thanks to James for pointing out this bug)

It is amazing how sometimes one can be so busy trying to make a point that one can miss the point.

A few days ago, I pointed out that CharNext(ch) != ch+1, a lot of the time.

That ought to be true. It is true if you are running Windows NT 3.51, Windows NT 4.0, or Windows 2000.

But in XP, things seem to have changed a bit.

It used to be that if you passed a combining character like U+0308 (COMBINING DIAERESIS) to the GetStringTypeW or GetStringTypeEx APIs with dwInfoType set to CT_CTYPE3, the API would return (C3_NONSPACING | C3_DIACRITIC). If you look at the Platform SDK topics for these APIs, the types are defined as follows:

Name             Value     Meaning
C3_NONSPACING    0x0001    Nonspacing mark.
C3_DIACRITIC     0x0002    Diacritic nonspacing mark.

Starting with Windows XP and continuing on with Windows Server 2003, it now just returns C3_DIACRITIC. Looking at the definitions, this makes sense -- C3_DIACRITIC claims it is for nonspacing marks, too. So the relevant part of the change is:

  1. There used to be no characters marked with just C3_DIACRITIC.
  2. There are no characters that are marked with just C3_NONSPACING now (there used to be several).

This would all be fine given the above definitions (well, not really -- but we'll let that lie for a bit). The problem is that the CharNext and CharPrev APIs are relying on that C3_NONSPACING definition to figure out when to skip characters.

I'm not sure what scares me more -- that this bug has been around since October of 2000, or that it was found due to a blog post that I might not have thought to do had not someone suggested it to me.

I'll see about making sure this bug gets put in on Monday.

So, between this one and the one I found myself (described in the answer to Guess #3 in Why I don't like the IsTextUnicode API), two longstanding bugs in Windows have been found through the act of blogging.

This answers the question I posted in OT -- They taste like chicken, don't they? once and for all. Blogging may annoy me, but that's not really relevant anymore. It helps me make the product better. So I think I'd better keep doing it....

Scoble, you reading this? :-)

 

This post sponsored by all 792 of the nonspacing marks in Unicode


# James on 30 Jan 2005 12:21 PM:

Thanks for investigating and sorry it took time from your weekend. I'm also glad you're going to keep blogging - it's a great read!

# Michael Kaplan on 30 Jan 2005 12:32 PM:

No apology needed -- it is a great find, and you were an important step to finding it. Feel free to read and question any time!

# Ken Smith on 30 Jan 2005 6:39 PM:

Any chance of a fix for downlevel platforms? :-)

# Michael Kaplan on 30 Jan 2005 7:10 PM:

Ken -- I honestly don't know, but I'll probably ask.

# BillG would call me Communist on 30 Jan 2005 7:49 PM:

Yes, how wonderful is open collaboration, it's rather incredible how much better it makes a development process! How much more wonderful still would be freedom of code!

Of course, this is why Free Software will inevitably triumph - possibly not organically, but rather by converting unbelievers like yourself, however grudgingly. Here you show yourself taking the first baby steps, and if it were not for your salary clouding your judgements your steps would be quicker and heartier still.

Guess what - Free Software is generally far more localis/zed and internationalis/zed and even globalis/zed than your own pathetically Anglocentric offerings. This is as a direct result of the freedom - because of the sense of ownership and community and collaboration that can only come from the trust encircled with code.

You are obviously an exceptionally talented man leading exceptionally talented men and women, probably the best internationalis/zation individuals in the world. It must hurt terribly to be outflanked despite this. Perhaps you should reflect on the reasons why, and start embracing and engaging all those who care at a deeper level.

Finally, some humo/ur related to internationalis/zation:
Russia Donates Cyrillic Characters To Alleviate Acronym Shortage
http://humorix.org/articles/2005/01/acronym-crisis/

# Sam on 16 Dec 2008 6:41 PM:

So why does ü (U+00FC) return C3_NONSPACING?  What does this flag mean if non-spacing characters (U+0308) don't have it, and characters which flow as normal do?

# Michael S. Kaplan on 16 Dec 2008 11:08 PM:

Actually, in Vista it returns C3_ALPHA | C3_DIACRITIC | C3_NONSPACING since it is a letter with a diacritic included in it....

U+0308 has C3_DIACRITIC | C3_NONSPACING there.

What full results are you seeing, and on what platform?


referenced by

2009/06/10 UCS-2 to UTF-16, Part 10: Variation[ Selector] on a theme...

2008/12/16 UCS-2 to UTF-16, Part 9: The torrents of breaking CharNext/CharPrev

2006/11/10 Some people feel really insecure about the size of their [string] members

2006/07/17 'A' and 'W' are sometimes living in two different worlds

2006/06/22 Things I [don't] like about blogging

2005/09/13 Here is an interview question for you. :-)

2005/04/29 Where did the new StringInfo stuff come from?
