by Michael S. Kaplan, published on 2005/12/21 04:01 -05:00, original URI:

I have often talked about how awesome I think the MSDN Product Feedback Center is. But I do have to admit that sometimes bugs can kind of get lost....

As an example, Ben Monroe sent me the following via the contact link:

Greetings. I enjoy reading your blog. I find many of the issues to be quite interesting. Thanks for your contributions.

I regularly tested and continue to test VS 2005. I came across an issue with surrogate handling. I reported it, but there hasn't been a response in 17 months. Perhaps it has been overlooked. If you have the time or interest, would you please review it or make sure that another capable person sees it?

"Backspace of Unicode characters above U+FFFF results in unpaired surrogates"
Bug ID: FDBK10993


Allow me to apologize, Ben.

No one meant to be unresponsive, but this particular bug had been reported several different times, by many different people (I saw over a dozen bug reports put in across several product bug databases!). Unfortunately, as the reports were resolved as duplicates of duplicates and so on, the follow-up information was not transferred to the active bug report. The old bug report kept getting your updates, but no one knew to look at it anymore since it was no longer active.

Again, I am sorry about that.

For the primary bug itself, the actual problem is an issue in the core OS control (you can reproduce the issue in Notepad, for example) and the managed controls are really wrappers around the core controls in this case. So the active bug in the "Whidbey" bug database was resolved as an 'external issue' while the bug report was put into the Windows bug database.

Believe it or not, the behavior is mostly intentional!

You see, when one is typing digraphs or other compounds that the user considers to be a single character even though they are not produced by a single keystroke, having the DELETE key get rid of the entire text element while the BACKSPACE key removes only the last piece is an intentional decision. It lets you avoid the frustration of having to retype 2-4 or more characters if you only mistyped the last one.
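As a rough illustration of that split (not the actual edit control code), here is a minimal Python sketch. It assumes that a "text element" is a base character plus any combining marks that follow it, which is a simplification of real text element rules, and the function names `backspace` and `delete_at` are hypothetical.

```python
import unicodedata

def backspace(text):
    """Remove only the last code point, keeping the rest of the text element."""
    return text[:-1]

def delete_at(text, index):
    """Remove the whole text element that starts at index.

    Assumption: a text element is one base character followed by
    zero or more combining marks. Real segmentation rules are richer.
    """
    end = index + 1
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return text[:index] + text[end:]

typed = "e\u0301"                   # 'e' followed by COMBINING ACUTE ACCENT
after_backspace = backspace(typed)  # "e": only the mistyped accent is gone
after_delete = delete_at(typed, 0)  # "": the whole text element is gone
```

With BACKSPACE the base letter survives, so only the accent has to be retyped; DELETE clears everything the user perceives as one character.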

However, while this is a reasonable plan of action in most cases, it is obviously not so reasonable for a supplementary character, since it is unlikely that the individual keystrokes are the high and low surrogate code units. In this case it would probably make more sense to delete both the high and the low surrogate rather than just the trailing low surrogate, especially since there is no recovery other than hitting BACKSPACE again (the keyboard is unlikely to have one surrogate code unit per key). And all of that ignores the very real concerns that Ben pointed out about having unpaired surrogates in data streams -- a situation that is definitely best avoided.
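To make the supplementary character case concrete, here is a small Python sketch that models an edit buffer as a list of UTF-16 code units. It is only a model under that assumption, not the Windows implementation, and the helper names (`to_utf16_units`, `backspace_naive`, `backspace_pair_aware`) are made up for illustration.

```python
def to_utf16_units(text):
    """Encode a str as a list of 16-bit UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def backspace_naive(units):
    """Remove only the last code unit (the behavior in the bug report)."""
    return units[:-1]

def backspace_pair_aware(units):
    """Remove the last code unit, and its high surrogate too if they form a pair."""
    if len(units) >= 2 and is_low_surrogate(units[-1]) and is_high_surrogate(units[-2]):
        return units[:-2]
    return units[:-1]

buffer = to_utf16_units("a\U00010080")  # 'a' followed by U+10080
naive = backspace_naive(buffer)
aware = backspace_pair_aware(buffer)

print([hex(u) for u in buffer])  # ['0x61', '0xd800', '0xdc80']
print([hex(u) for u in naive])   # ['0x61', '0xd800']  <- unpaired high surrogate
print([hex(u) for u in aware])   # ['0x61']
```

The pair-aware version only changes what happens when the buffer ends in a complete surrogate pair, so a combining sequence made of non-supplementary characters still loses just its last code unit.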

The bug in the Windows database was reported fixed a few months ago, but was never marked as resolved (and does not appear to be fixed in the latest build I have installed on my machine, from a few days ago, so it is probably not fixed in the December CTP build, either).

I will keep an eye on this bug in any case, not only to make sure it is fixed but also to make sure the desired behavior for the non-supplementary character scenario is not unintentionally broken.

In any case, I am sorry about how letting you know kind of fell through the cracks. The MSDN Product Feedback Center is still a very awesome resource and at the very least the issue itself was still being tracked. We've got your back on this one!


This post brought to you by "𐂀" (U+10080, a.k.a. LINEAR B IDEOGRAM B100 MAN)

# Anutthara MSFT on 21 Dec 2005 4:20 AM:

Wow - I actually came across this behaviour while testing my product with digraphs and after I thought about it, I was convinced this was by design.
But of course, as you pointed out, the supplementary char scenario doesn't make much sense!

# Marc Bernard on 21 Dec 2005 9:19 AM:

> Believe it or not, the behavior is mostly intentional!

Wow, this might be the first bug to say "this behaviour is mostly by design".


# Nick Lamb on 21 Dec 2005 11:28 AM:

Another bug caused by having a "wide character" that is not wide enough for storing characters. It's more or less inevitable that coders trying to use 16-bit integers as a "native" Unicode character will get these cases wrong over and over again.

If I were to go look in MSDN right now, how many examples would I find where there's an implicit assumption that characters fit in a 16-bit Win32 wchar_t ?

# Michael S. Kaplan on 21 Dec 2005 12:08 PM:

Nick, this is really not the point, since even in UTF-32, what a user sees as a single character may be in fact many Unicode code points.

For most purposes, buffer lengths are not based on a character count but on a code point count. Supplementary characters being two Unicode code units is no more of an issue than them being 4 bytes in UTF-8.
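A quick way to check those counts is the following Python sketch. It uses the character from this post (U+10080), and the combining sequence at the end is an example chosen here for illustration.

```python
ch = "\U00010080"  # LINEAR B IDEOGRAM B100 MAN, a supplementary character

utf8_bytes = len(ch.encode("utf-8"))            # 4 bytes
utf16_units = len(ch.encode("utf-16-le")) // 2  # 2 code units (a surrogate pair)
utf32_units = len(ch.encode("utf-32-le")) // 4  # 1 code unit

# Even in UTF-32, one user-perceived character can be several code points:
e_acute = "e\u0301"                                    # 'e' + COMBINING ACUTE ACCENT
combining_units = len(e_acute.encode("utf-32-le")) // 4  # 2 code units

print(utf8_bytes, utf16_units, utf32_units, combining_units)  # 4 2 1 2
```

In every encoding form, the length of a buffer and the number of things a user would call a character are separate questions.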

# Ben Monroe on 21 Dec 2005 8:15 PM:

Thank you for the response.
I had a feeling that the report got lost somewhere along the lines.

I am glad that Microsoft is aware of the issue. I understand and agree with your comments on digraphs. However, surrogates are a result of the chosen encoding (UTF-16) even when they map to a single Unicode scalar value. Unpaired surrogates are a serious concern to data integrity and conformance. I think this may warrant special handling, and I hope to see a resolution of some type in the near future.

How about defining Backspace to remove a single Unicode scalar value? This should preserve the current behavior designed for digraphs as well as handling surrogates. It is also independent of the encoding.

Ben Monroe
Tokyo, Japan

# Michael S. Kaplan on 21 Dec 2005 8:25 PM:

Hi Ben,

It is unfortunately not that simple -- at the level where these things are determined, there is no knowledge of Unicode scalar values.

But like I said, the fix should be available in Vista soon (and although Vista is not the first version to support supplementary characters, it is the first version to ship with font and input support for them in the box!).

# Nick Lamb on 21 Dec 2005 8:37 PM:

"Nick, this is really not the point, since even in UTF-32, what a user sees as a single character may be in fact many Unicode code points. "

Just so that we're clear, is it your contention that this bug was /not/ caused by the issue I'm describing ? If so, can you show your readers the affected code ?

Or is it instead that you believe it doesn't matter, that while a different API would avoid this mistake it would make some other type of error more common ?

# Michael S. Kaplan on 21 Dec 2005 9:32 PM:

My point is that the issue of the "wide character that is not wide enough for storing characters" being treated as a bug in and of itself is a fundamental misunderstanding of the Unicode standard and how it is formulated.

The issue I am describing is a side effect of the design decision that was made to have the DELETE key and the BACKSPACE key show different behaviors -- a decision that is still obviously not fixed for anyone who creates a Form 'D' keyboard that combines multiple code points on single keystrokes.

The fix of the supplementary character issue is obviously more critical than the other issue (which is good since of the two it is the one that is more fixable!).

But worrying about it as if it were a 'size of UTF-16 code units' issue kind of distracts from the actual problem....

# Pavel Radzivilovsky on 6 Dec 2009 12:55 AM:

Then you guys mixed two unrelated things together.

Deleting part of a character by backspace is legitimate and may have a meaning.

It should have nothing to do with how many bytes it takes in UTF-16 or any other encoding of Unicode.

Especially given that UTF-16 will hopefully not survive in Windows - the UI should be designed regardless of underlying text encoding.

# Michael S. Kaplan on 6 Dec 2009 7:26 AM:

UTF-16 is not going away, so your premise is somewhat faulty, Pavel.


