by Michael S. Kaplan, published on 2005/01/30 13:42 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/30/363420.aspx
(special thanks to James for pointing out this bug)
It is amazing how sometimes one can be so busy trying to make a point that one can miss the point.
A few days ago, I pointed out that CharNext(ch) != ch+1, a lot of the time.
That ought to be true. It is true if you are running Windows NT 3.51, Windows NT 4.0, or Windows 2000.
But in XP, things seem to have changed a bit.
It used to be that if one took a combining character like U+0308 (COMBINING DIAERESIS) and passed it to the GetStringTypeW or GetStringTypeEx API with the CT_CTYPE3 dwInfoType, the call would return (C3_NONSPACING | C3_DIACRITIC). If you look at the Platform SDK topics for these APIs, the types are defined as follows:
Name            Value    Meaning
C3_NONSPACING   0x0001   Nonspacing mark.
C3_DIACRITIC    0x0002   Diacritic nonspacing mark.
Starting with Windows XP and continuing on with Windows Server 2003, it now just returns C3_DIACRITIC. Looking at the definitions, this makes sense -- C3_DIACRITIC claims it is for nonspacing marks, too. So the relevant part of the change is:
This would all be fine given the above definitions (well, not really -- but we'll let that lie for a bit). The problem is that the CharNext and CharPrev APIs are relying on that C3_NONSPACING definition to figure out when to skip characters.
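To make the failure mode concrete, here is a minimal sketch in portable C. It is not the actual CharNext implementation, which this post does not show; it only models the behavior described above, where stepping skips any following code units flagged C3_NONSPACING. The names wch, ctype3_fn, ctype3_pre_xp, ctype3_xp and char_next_sketch are hypothetical, and the two lookup functions are hand-written stand-ins for what GetStringTypeW reports for U+0308 on each platform, not real table data.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as listed in the Platform SDK table quoted above. */
#define C3_NONSPACING 0x0001
#define C3_DIACRITIC  0x0002

typedef unsigned short wch;               /* stand-in for WCHAR (one UTF-16 code unit) */
typedef unsigned short (*ctype3_fn)(wch); /* hypothetical CT_CTYPE3 lookup */

/* Assumed model of the pre-XP result: U+0308 carries both flags. */
static unsigned short ctype3_pre_xp(wch ch) {
    return ch == 0x0308 ? (C3_NONSPACING | C3_DIACRITIC) : 0;
}

/* Assumed model of the XP / Server 2003 result: only C3_DIACRITIC is left. */
static unsigned short ctype3_xp(wch ch) {
    return ch == 0x0308 ? C3_DIACRITIC : 0;
}

/* Simplified CharNext-style step: move past one code unit, then past any
   following code units that the lookup flags as C3_NONSPACING. */
static const wch *char_next_sketch(const wch *s, ctype3_fn ctype3) {
    assert(s != NULL && ctype3 != NULL);
    if (*s == 0) {
        return s;                         /* already at the terminator */
    }
    ++s;
    while (*s != 0 && (ctype3(*s) & C3_NONSPACING)) {
        ++s;                              /* keep the mark with its base character */
    }
    return s;
}
```

For the string 'u', U+0308, 'x', the sketch advances two code units with the pre-XP lookup and lands on 'x'. With the XP lookup it advances only one and lands on the diaeresis, because the test for C3_NONSPACING never sees the C3_DIACRITIC-only value.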
I'm not sure what scares me more -- that this bug has been around since October of 2000, or that it was found due to a blog post that I might not have thought to do had not someone suggested it to me.
I'll see about making sure this bug gets put in on Monday.
So, between this one and the one I found myself (described in the answer to Guess #3 in Why I don't like the IsTextUnicode API), two longstanding bugs in Windows have been found through the act of blogging.
This answers the question I posted in OT -- They taste like chicken, don't they? once and for all. Blogging may annoy me, but that's not really relevant anymore. Blogs help me make the product better. So I think I'd better keep doing it....
Scoble, you reading this? :-)
This post sponsored by all 792 of the nonspacing marks in Unicode
# James on 30 Jan 2005 12:21 PM:
# Michael Kaplan on 30 Jan 2005 12:32 PM:
# Ken Smith on 30 Jan 2005 6:39 PM:
# Michael Kaplan on 30 Jan 2005 7:10 PM:
# BillG would call me Communist on 30 Jan 2005 7:49 PM:
# Sam on 16 Dec 2008 6:41 PM:
So why does ü (U+00FC) return C3_NONSPACING? What does this flag mean if non-spacing characters (U+0308) don't have it, and characters which flow as normal do?
# Michael S. Kaplan on 16 Dec 2008 11:08 PM:
Actually, in Vista it returns C3_ALPHA | C3_DIACRITIC | C3_NONSPACING since it is a letter with a diacritic included in it....
U+0308 has C3_DIACRITIC | C3_NONSPACING there.
What full results are you seeing, and on what platform?