CharNext(ch) != ch+1, a lot of the time
by Michael S. Kaplan, published on 2005/01/14 00:08 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2005/01/14/352802.aspx
Earlier today, Raymond Chen sent me a piece of email that mentioned an important point for developers who iterate a string one character at a time. It's a lot more interesting than what I was going to post, so I'll do the boring one later or tomorrow (or if I am smart I'll just give it a miss entirely), and I'll do this one now instead.
It has to do with the CharNext and CharPrev APIs.
The email had a title similar to the one of this posting. Basically he was pointing out that CharNextW(p) != p+1. And he is right, they are not equal, a lot of the time.
The same issue that applies to CharNext(ch) not being the same as ch+1 applies to CharPrev(ch) and ch-1. Put more verbosely, this is because incrementing and decrementing a character pointer within a string is not nearly as functional as these APIs can be.
The reasons for this are many and as I said they apply to both CharPrev and CharNext (and incidentally also CharNextExA and CharPrevExA). They include:
- When not dealing with Unicode, strings on CJK (Chinese, Japanese, and Korean) systems have many characters that take up two bytes (including all Han/Kanji/Hanja, all Hangul and all full-width non-Kanji). Using simple byte increments will mean one is moving through half a character with each iteration.
- When dealing with Unicode, strings that use combining characters like å (U+0061 U+030a, a plus combining ring) or ü (U+0075 U+0308, u plus combining umlaut) take up two code points per character, so once again one will be moving through half a character with each iteration.
- Stretching to languages like Vietnamese, where double diacritics are commonly seen, a character such as ộ (U+006f U+0323 U+0302, o plus combining dot below plus combining circumflex) takes up three code points, so one is dealing with only a third of a character with each iteration!
- [crossed off] When one uses Unicode supplementary characters such as U+21532 (an Extension B ideograph), which a Unicode string on Windows actually represents as a surrogate pair (U+d845 U+dd32), one will once again be dealing with half a character.
I crossed that last item off of the list because CharNext and CharPrev do not currently handle surrogate pairs properly, so one will be just as bad off using the APIs as one would be just doing simple pointer arithmetic.
(Never fear, I will be looking into seeing if I can do something about this for the future now that someone reminded me!) [1]
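To make the surrogate arithmetic concrete, here is a minimal sketch in portable C++ of how a supplementary code point splits into two UTF-16 code units. The function names `to_surrogates` and `is_surrogate_pair` are made up for illustration; they are not Windows APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16
// surrogate pair. Returns {high surrogate, low surrogate}.
// Hypothetical helper for illustration, not part of any Windows API.
std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;   // 20-bit value
    const char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));    // top 10 bits
    const char16_t low  = static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // bottom 10 bits
    return {high, low};
}

// True if `p` points at a high surrogate immediately followed by a low one.
// Assumes `p` points into a null-terminated buffer.
bool is_surrogate_pair(const char16_t* p) {
    return p[0] >= 0xD800 && p[0] <= 0xDBFF &&
           p[1] >= 0xDC00 && p[1] <= 0xDFFF;
}
```

Calling `to_surrogates(0x21532)` gives the pair 0xD845, 0xDD32 mentioned above. An iterator that wanted to treat the pair as one character could check `is_surrogate_pair` before deciding whether to step by one code unit or two.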
How does it do this work? Well, CharNextA/CharPrevA use the IsDBCSLeadByte API on appropriate platforms to determine if the byte is a lead byte in a lead byte/trail byte pair, and CharNextW/CharPrevW use the GetStringTypeW API to figure out if a character is a non-spacing character like a ring or a diaeresis.
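The idea behind both flavors can be sketched in portable C++. This is only an illustration of the approach described above, not the actual implementation. Two loud assumptions: the lead-byte test is hard-coded for Shift-JIS (code page 932) rather than asking IsDBCSLeadByte about the system code page, and the non-spacing test covers only the U+0300..U+036F block rather than consulting GetStringTypeW. All four function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// ASSUMPTION: lead-byte ranges for Shift-JIS (code page 932) only. The real
// CharNextA asks IsDBCSLeadByte, which depends on the system code page.
bool is_lead_byte_sjis(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// CharNextA-style step: skip a lead byte together with its trail byte.
const char* next_char_dbcs(const char* p) {
    assert(p != nullptr);
    if (*p == '\0') return p;  // stay put on the terminator
    if (is_lead_byte_sjis(static_cast<unsigned char>(*p)) && p[1] != '\0')
        return p + 2;          // lead byte + trail byte
    return p + 1;              // single-byte character
}

// ASSUMPTION: only U+0300..U+036F (Combining Diacritical Marks) counts as
// non-spacing here. The post says CharNextW uses GetStringTypeW instead.
bool is_nonspacing(char16_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// CharNextW-style step: skip one code unit plus any non-spacing marks
// that follow it.
const char16_t* next_char_utf16(const char16_t* p) {
    assert(p != nullptr);
    if (*p == 0) return p;     // stay put on the terminator
    ++p;
    while (is_nonspacing(*p)) ++p;
    return p;
}
```

Walking the string U+0075 U+0308 U+006F U+0323 U+0302 with `next_char_utf16` advances by 2 code units and then by 3, which matches the "half a character" and "third of a character" arithmetic in the list above.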
Other APIs that do an even better job can be found in Uniscribe. The ScriptBreak API will return an array of SCRIPT_LOGATTR structs, and that structure's fCharStop member, when set, indicates that this is a valid place for a cursor to jump to. When such a cursor jump is valid, it indicates that you have moved forward by one "character" in the sense that a user might think of a character. And therefore if you use Uniscribe you will handle this job properly, even for supplementary characters.
Uniscribe is a useful enough library that I will talk more about it in a future post, maybe even with some sample code for easier operations (like this one). Uniscribe is behind most of the support of international text on Windows.
It was ported to Windows CE 5.0 as well, though it is described in documentation about the CE Platform Builder, which implies (to me) that it might only be included on SKUs that are built with it for shipment to places that require it. Those who know more about Windows CE and Uniscribe should feel free to contact me with more info so I can sound more intelligent the next time I talk about it!
This is a much easier task in managed code, where you can use the StringInfo class and its GetTextElementEnumerator and GetNextTextElement methods, which allow for easy iteration of a string. You can also use its ParseCombiningCharacters method to get an array of integers representing the same character boundaries represented by Uniscribe's SCRIPT_LOGATTR.fCharStop member.
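For readers who want to see what such an array of boundaries looks like, here is a rough C++ sketch that returns the starting index of each text element in a UTF-16 string. It is an assumption-laden simplification, not the .NET or Uniscribe logic: only U+0300..U+036F is treated as combining, and a low surrogate is attached to a preceding high surrogate. The name `text_element_starts` is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// ASSUMPTION: simplified boundary rules for illustration only.
// Index i starts a text element unless s[i] is a combining mark from
// U+0300..U+036F or the low half of a surrogate pair.
std::vector<std::size_t> text_element_starts(const std::u16string& s) {
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        const bool combining = c >= 0x0300 && c <= 0x036F;
        const bool low_half = i > 0 && c >= 0xDC00 && c <= 0xDFFF &&
                              s[i - 1] >= 0xD800 && s[i - 1] <= 0xDBFF;
        // The first code unit always starts an element, even a stray mark.
        if (i == 0 || (!combining && !low_half)) starts.push_back(i);
    }
    assert(starts.empty() || starts.front() == 0);
    return starts;
}
```

For the eight code units U+0061 U+030A, U+006F U+0323 U+0302, U+D845 U+DD32, U+0078 this returns the indexes 0, 2, 5, 7: one entry per place where a cursor should be allowed to stop.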
For those of you who are still awake (er, reading), I will point out one annoying issue I discovered while typing in the å, ü, and ộ characters earlier in this post. Problems? Well...
- The .Text blog system's editor that I was typing in did not handle this properly and moved past 1/2 or 1/3 of the character with each arrow keypress.
- When the character was being selected it either selected the whole character and then appeared to do nothing or appeared to do nothing and then selected the whole character, depending on whether I moved from left-to-right or right-to-left with the cursor.
- The deletion behavior was just as dismal. I don't even want to talk about it.
I also saw the same behavior in plain old EDIT text boxes in Internet Explorer, so it looks like .Text is off the hook, and that even though IE does use Uniscribe, they forgot to implement some of the possible features that library provides [2]. :-(
If someone wanted to try this in other browsers (cough! FireFox or Opera), I'd be curious about the results; the IE results were pretty disappointing to me. Try selecting the string "åüộåüộåüộåüộ" and pasting it into your browser wherever you see an EDIT box to test, or just try putting the cursor into and running the cursor through the INPUT control text right here:
and let me know what you see (unless you see 20 other people have already responded!). Be sure to mention the browser and version (I am using IE 6.0.3790.0 on Windows Server 2003 with all of the latest updates).
(All that talk about Uniscribe reminds me I want to talk about MLang at some point too -- I've added MLang and Uniscribe to my list of things to talk about!)
1 - It first started when I was working on MSLU, but even to this day I have not gotten fully used to being able to say stuff like this. I'll try to write more about it another time, cuz it is kind of cool.
2 - It works fine in Notepad and Word and VS.Net, all of which use Uniscribe.
This post brought to you by "å" (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE)
A character that is downright snooty about the fact that it has none of the problems mentioned in this article.
Despite the fact that a circle almost bigger than its body is super-glued to its head, something that would make me feel at least a little self-conscious.
# Serge Wautier on Friday, January 14, 2005 12:38 AM:
Firefox' job is only arguably slightly better:
- Simple arrow key navigation (left and right arrows) also require being pressed 2 or 3 times to move one char forward/backward but the caret doesn't move at all during the intermediate moves.
- the back key behaves the same way as IE6.
- Selection (Ctrl+ left/right) goes through 1/2 or 1/3 of char but the difference is that the (un)selected parts of the characters are moved (half-way) besides the 'main' character to show that they are included or not.
Version : Ouch... 0.9 :-(
# Troy on Friday, January 14, 2005 12:52 AM:
Safari 1.2.4 handles it fine (if you are interested in other OSs) - it didn't move half a character etc like IE 6 SP2 on XPSP2 did.
# Ludvig A. Norin on Friday, January 14, 2005 12:58 AM:
Be nice to the å, Swedes are very affectionate towards this character :-)
# Michael Kaplan on Friday, January 14, 2005 1:05 AM:
I work with a Swede who told me as much; hopefully my joke under the sponsorship will not be taken as too offensive (if so it will be taken down, I would not want "å" to retract its sponsorship).
I guess I could use the combining form as a stunt double for the precomposed form as a way around the problem, then it would be less likely to be offensive to it, since a stunt double rarely worries about the same things as the celebrity....
Or maybe this whole post is silly, too. Scratch the maybe! :-)
# Marcel on Friday, January 14, 2005 3:04 AM:
Same problems with Opera 7.54.
# Mike Dimmick on Friday, January 14, 2005 5:38 AM:
Re: Windows CE:
Platform Builder is the tool for OEMs to build their own custom platforms. It's entirely up to the OEM to decide what to put in. Evangelism may be required for OEMs producing PDA-like platforms.
I would expect Microsoft's own Windows Mobile platforms (Pocket PC and Smartphone) to include Uniscribe in future versions. Microsoft dictates which CE components are included in a Windows Mobile platform image. The OEM has responsibility for the OEM Adaptation Layer to adapt the platform to the specific hardware. At least, that's my understanding.
# Michael Kaplan on Friday, January 14, 2005 6:48 AM:
Thanks Mike,
Yep, seeing it in the Platform Builder means people have to end up choosing if it is a worthwhile addition to the platform. Getting them to do it can be a challenge since the space on the device is at such a premium....
# Centaur on Friday, January 14, 2005 7:49 AM:
Firefox 1.0 is no better. First Shift+Right selects the 'a' (displaying the 'a' selected and the ring shifted right and unselected), second Shift+Right selects the ring and it snaps back into place. Deleting is (pardon me) intuitive: first Backspace deletes the circumflex, second the combining dot below, and third the o. One can also delete the letter first, then the combining diacritics move to the previous letter. They, strangely, don’t combine but are positioned left to right.
# Larry Osterman on Friday, January 14, 2005 8:26 AM:
Btw, the IE address bar (and google's search bar) get it right. My suspicion is that's because they're standard windows edit controls, while the IE (and firefox) edit controls aren't real windows controls.
# Rick Schaut on Friday, January 14, 2005 8:44 AM:
Well, there goes my even-odd trick for Kanji-backspace :-).
For what it's worth, the reason Safari, on Mac OS X, gets this right is essentially the same reason the IE address bar gets it right. Safari uses ATSUI (Apple Text Services for Unicode Imaging).
Unfortunately, Mac OS X doesn't have a Uniscribe equivalent. One either eats the whole ATSUI pie, or one has to try and figure out character break points on one's own.
If I may offer a suggestion, I think some developers might well be interested in how to use Uniscribe to do context-based glyph substitution (e.g. Arabic).
# Michael Kaplan on Friday, January 14, 2005 8:53 AM:
Rick -- check out the "Suggest a topic for me!" link at the top of the page....
# Mike Dimmick on Friday, January 14, 2005 4:46 PM:
Interesting - at home your example renders correctly, while at work it didn't. I get boxes (the 'missing character' symbol) behind the o rather than the diacritics. Both machines run XP SP2.
One difference is that I use ClearType on the home machine, but I've just turned that off and it's still OK. I ought to have the same fonts on both...
Ah! Some stupid program has overwritten Arial with a version from what looks like Win95! Whatever it was has also trashed Times, Times Bold and Symbol.
# Ryan Myers [MSFT] on Friday, January 14, 2005 11:29 PM:
Does CharNextW only recognize combining characters, or does it use the UAX29 grapheme cluster algorithm, or something else that's locale-specific?
I've been considering the impact of jamo syllables upon this and debating whether to treat each jamo as a grapheme cluster for my "character-wise iterator", or treat a syllable cluster according to 3.12 as a cluster.
And yeah, Firefox 1.0 treats it boneheadedly, both in the input box and in the address bar.
As Larry alluded to, there are so many elements on any given page that creating a child window for everything would exhaust handle space easily, so IE has its own set of "windowless controls" that it uses for everything inside a rendered page.
# Michael Kaplan on Friday, January 14, 2005 11:59 PM:
It is not using anything out of Unicode, it is using the GetStringTypeW API's results to decide what is a non-spacing character. Unicode may choose to respect its elders since GetStringTypeW predates that UAX (and most of the others!).
Looking at the data, Jamos are not treated as combining characters. They are pretty much treated as:
CT_CTYPE1: C1_ALPHA | C1_DEFINED
CT_CTYPE2: C2_LEFTTORIGHT
CT_CTYPE3: C3_ALPHA
But the CTYPE data comes from Unicode and has for a while now, and Unicode does not define them as combining either; it calls them Lo (Letter, Other).
And I do see what IE is doing, but even in their own custom controls they could use Uniscribe to define the cursor/arrow/selection/deletion behavior....
# Dean Harding on Sunday, January 16, 2005 2:37 PM:
I could perhaps argue that the way it's handled in IE is by design. When I select the text with the mouse, it selects a whole character at a time (i.e. base char + combining chars) and I can delete the whole thing by just hitting backspace or delete.
When using the keyboard, it moves through the 1/2 or 1/3 of the character, which basically lets me edit the combining characters in-place. Maybe it's just a personal thing, but I think that's not entirely a bad way of doing it.
You're right in that it's not how Visual Studio does it - once you create the full character, it treats all two or three bytes as a single character, but I can see how this could be viewed as "by design".
# Michael Kaplan on Sunday, January 16, 2005 3:22 PM:
Well, one could argue that, except there is an MS-wide plan that gives the suggested means of dealing with complex scripts, which IE is not following. :-)
I'll try to talk a bit about complex scripts tonight....
# Dean Harding on Sunday, January 16, 2005 8:54 PM:
Heh, since when have the IE team been known for following the standards? :p~
# James on Sunday, January 30, 2005 6:57 AM:
Based on this post, I wrote a snippet to try to iterate through a string using the CharNextW function:
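#include <windows.h>
#include <iostream>

int main()
{
    // Print how many wchar_t units each CharNextW call advances.
    for (wchar_t const * p = L"\x0075\x0308\x006f\x0323\x0302", * n; *p; p = n)
        std::cout << (n = CharNextW(p)) - p << '\n';
}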
From what you've said, I'd expect '2' and '3' to be printed, corresponding to your "u plus combining umlaut" and "o plus combining dot below plus combining circumflex" examples. Instead, it prints a whole bunch of '1's. What have I misunderstood?
# Michael Kaplan on Sunday, January 30, 2005 7:40 AM:
What platform are you running on?
# James on Sunday, January 30, 2005 7:59 AM:
I ran the program above on XP Pro SP2. I had compiled it with Visual C++ 2003 and the current Platform SDK as an otherwise empty console project, adding just includes for Windows.h (for CharNextW) and iostream (to cout the differences). Under the debugger, the string displayed as the two characters I expected, but CharNext always returned p+1.
# Michael Kaplan on Sunday, January 30, 2005 8:46 AM:
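Hmmm.... weird. I tried to split it out a little to make it a little clearer what was happening at each step in the debugger:
#include <windows.h>
#include <iostream>

int main(int argc, CHAR* argv[])
{
    // The literal is read-only, so both pointers are const.
    const wchar_t * p = L"\x0075\x0308\x006f\x0323\x0302";
    const wchar_t * n;
    for (; *p; p = n) {
        n = CharNextW(p);             // step to the next "character"
        std::cout << n - p << '\n';   // wchar_t units advanced by this step
    }
    return 0;
}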
And it is iterating through one wchar_t at a time rather than jumping at the character boundaries. This is definitely not expected.
# Michael Kaplan on Sunday, January 30, 2005 11:43 AM: