by Michael S. Kaplan, published on 2009/06/10 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2009/06/10/9723321.aspx
It has been a while since the last part of this series. Kind of coinciding with when I took a break here.
Sorry about that! :-)
This part kind of starts with a previous blog -- CharNext(ch) != ch+1, a lot of the time.
No, wait. That isn't the one.
It really came from a follow-up to that blog, namely We broke CharNext/CharPrev (or, bugs found through blogging?).
There I talked about the change that led GetStringTypeW to classify diacritics as C3_DIACRITIC rather than C3_NONSPACING | C3_DIACRITIC.
When I fixed this bug, I (wisely?) chose to make the check include C3_NONSPACING | C3_DIACRITIC rather than the minimal change of just C3_DIACRITIC.
Because there is a whole class of character, pretty much added to Unicode at a time that made it new in Vista, that has the C3_NONSPACING classification.
UNICODE VARIATION SELECTORS!!!
These characters are meant to change the visible representation of the character preceding them or, as the Unicode Display of Unsupported Characters FAQ states, to be invisible if not supported.
Either way, a variation selector should never be treated independently of the character preceding it; any operation of selection or truncation must treat it as part of the preceding character, just as much as a low surrogate is part of its preceding high surrogate!
Of course my GetStringTypeW discussion that I used to introduce the topic also points out the detection solution, one that would incidentally be used for a lot of the other cases I have discussed in the series.
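For anyone who wants to see the shape of that detection step, here is a minimal sketch in Python rather than against the Win32 API. It assumes strings of code points (so surrogate pairs are not in play), and it uses the Unicode general categories Mn/Me plus explicit variation selector ranges as a stand-in for the C3_NONSPACING check; it does not call GetStringTypeW. The helper names attaches_to_previous and safe_truncate are made up for illustration.

```python
import unicodedata

# Code point ranges of the Unicode variation selectors:
# VS1-VS16, VS17-VS256, and the Mongolian free variation selectors.
_VS_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF), (0x180B, 0x180D))


def attaches_to_previous(ch: str) -> bool:
    """True if ch should be kept together with the character before it.

    Assumption: nonspacing/enclosing marks (Mn, Me) and variation
    selectors approximate what a C3_NONSPACING test would catch.
    """
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in _VS_RANGES):
        return True
    return unicodedata.category(ch) in ("Mn", "Me")


def safe_truncate(text: str, max_chars: int) -> str:
    """Truncate to at most max_chars code points without orphaning
    a variation selector or nonspacing mark from its base character."""
    if max_chars >= len(text):
        return text
    cut = max(0, max_chars)
    # If the first dropped character attaches to the one before it,
    # the cut falls inside a unit; back up to the start of that unit.
    while cut > 0 and attaches_to_previous(text[cut]):
        cut -= 1
    return text[:cut]


if __name__ == "__main__":
    # U+2268 followed by U+FE00 is a standardized variation sequence.
    sample = "a\u2268\ufe00b"
    print(repr(safe_truncate(sample, 2)))
```

With this policy, a base character whose selector would not fit is dropped along with the selector, rather than leaving either half behind. The same attaches_to_previous test could drive selection or cursor movement instead of truncation.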
Of course this leads to another interesting case, which I will discuss next time....
And maybe more on variation selectors and other related matters, some other time entirely.
This blog brought to you by U+FE00, a Unicode Variation Selector.
John Cowan on 10 Jun 2009 4:19 PM:
I don't know about that. I'd rather see the VS truncated than both the whole previous character and the VS truncated. Without the VS, the result may look weird, but it'll be more usable than if the whole base character disappears as well.
Michael S. Kaplan on 10 Jun 2009 5:39 PM:
That can depend on your customer -- someone Japanese may well disagree with you about that in regard to form. Also note that truncation is not the only operation we are talking about here of course...
Andrew West on 11 Jun 2009 9:01 AM:
Actually, the result is more likely to look weird with the VS; and if stripped of the VS the result would generally look normal. But anyway, I agree with John that I would not consider the VS and its base character as an inseparable entity like a surrogate pair, and personally like to be able to treat VS's as characters when editing text -- deleting them and trying new different VS's on different characters to get the desired result.
Michael S. Kaplan on 11 Jun 2009 10:10 AM:
Okay, an opinion not brought up with Unicode (or at least not written in for its own recommendations...).
But let's take the cut/paste case as an example -- you'd really want the VS to be left dangling alone with no base character, even though the user selected the "changed" character?
Or cursor movement, when display works right. You really want the cursor weirdness that would manifest itself, with no other visual indication of what's going on?
Michael S. Kaplan on 11 Jun 2009 12:18 PM:
Okay, differences in opinion are good, always. :-)
But I am curious, especially of John and Andrew but really of everyone: in the myriad of cases, what is the best result? Imagine (in addition to the specific scenarios John and Andrew mentioned) the following:
-- Delete a section, including an alternate glyph at the end -- Should the VS be left in the old text, and does the answer change when the VS is alone at the beginning vs. attached to a new (possibly illegal or potentially worse legal but unintended choice of) base character?
-- Copy/Paste -- All of the above, plus: -- should the alternate glyph be lost?
You see what I am getting at -- what do you think should happen with the orphaned VS, which is now just in the text, unintended? I am worried about two principles, 1) what to do with the original intent of the text, and 2) what to do with the potentially illegal, potentially unintended result of the remaining text....
Michael S. Kaplan on 11 Jun 2009 12:20 PM:
As a by the way, this is one of the many reasons I was against the VS notion the entire time it has been alive.
Does the answer to any above change with the argument (of some, and which I also disagree with) that new simplified characters should be traditional characters with variation selectors?
Andrew West on 12 Jun 2009 5:17 AM:
I think I'm too close to variation selectors to be able to answer your questions objectively. I like to have full control over my text, including variation selectors, but I can see that many users would be annoyed if the appearance of their text kept getting screwed up because of some invisible entity that was not staying in its place.
Andrew West on 12 Jun 2009 5:25 AM:
"new simplified characters should be traditional characters with variation selectors"
I know that there are certain people in the UTC who favour this approach, but I don't think it has any chance of being adopted, because there would be massive opposition from IRG and WG2. At this stage in the game it is simply too late to change the model for Han encoding; and all we can do is bite the bullet and encode all required simplified character forms as separate characters (about half of the CJK-D characters currently under ballot are simplified forms of existing traditional characters).
John Cowan on 23 Jun 2009 5:57 PM:
Cut/paste isn't quite the same thing as selection: in some contexts, when we cut selected text, following whitespace also goes with it. In that case, I'd probably extend the selection to include any following invisible character (what is currently done for bidi marks in that situation?)
In addition to Andrew's reasons, another reason why VSes shouldn't be used for simplified characters is that N traditional characters often map to the same simplified character, so it would be impossible to tell by looking at a text what underlies it. There are other situations where this is true, like:
english = CIBARA
(which word do you read first?), but that's no reason to add to it.
Michael S. Kaplan on 23 Jun 2009 7:45 PM:
Not all clipboard cases grab surrounding spaces and most do not if you do not select the space.
The Bidi marks are another case, one I talk about tomorrow....
referenced by
2012/05/21 Whither WM_UNICHAR in Windows 7 (and 8!)
2012/04/27 Should considering UTF-16 be harmful be considered harmful?
2010/12/15 I think MaxLength needs protection to assure safer text
2010/05/20 Attempting to discourage some variations?
2009/06/29 UCS-2 to UTF-16, Part 11: Turning it up to Eleven!