UCS-2 to UTF-16, Part 10: Variation[ Selector] on a theme...

by Michael S. Kaplan, published on 2009/06/10 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2009/06/10/9723321.aspx

It has been a while since the last part of this series, which more or less coincided with when I took a break here.

In that part I talked about the change in how GetStringTypeW classifies diacritics: as C3_DIACRITIC rather than C3_NONSPACING | C3_DIACRITIC.

When I fixed this bug, I (wisely?) chose to make the check include C3_NONSPACING | C3_DIACRITIC rather than the minimal change of just C3_DIACRITIC.

I did that because there is a whole class of characters, the variation selectors, added to Unicode at a time that pretty much made them new in Vista, that has the C3_NONSPACING classification.

These characters have the goal of changing the visible representation of the character preceding them, or (as the Unicode Display of Unsupported Characters FAQ states) of being invisible if not supported.

Either way, such a character should never be treated independently of the character preceding it; any selection or truncation operation must treat it as part of the preceding character, just as much as a low surrogate and its preceding high surrogate!
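A rough sketch of that truncation rule, in Python rather than Win32 and purely illustrative. The helper names `is_variation_selector` and `safe_truncate` are made up for this example, the limit is counted in code points rather than UTF-16 code units, and the selector ranges checked (U+FE00..U+FE0F, U+E0100..U+E01EF, and the Mongolian free variation selectors U+180B..U+180D) are an assumption about which characters matter here:

```python
def is_variation_selector(ch: str) -> bool:
    """Hypothetical helper: True for the variation selector ranges assumed above."""
    cp = ord(ch)
    return (
        0xFE00 <= cp <= 0xFE0F          # VS1..VS16
        or 0xE0100 <= cp <= 0xE01EF     # VS17..VS256
        or 0x180B <= cp <= 0x180D       # Mongolian free variation selectors
    )


def safe_truncate(text: str, limit: int) -> str:
    """Truncate to at most `limit` code points without orphaning a selector.

    If the first dropped character is a variation selector, the cut point
    is moved back so that the base character is dropped along with it.
    """
    if limit >= len(text):
        return text
    cut = max(limit, 0)
    while cut > 0 and is_variation_selector(text[cut]):
        cut -= 1
    return text[:cut]


sample = "A\u8FBB\U000E0100B"   # 'A', a Han character, VS17, 'B'
shortened = safe_truncate(sample, 2)
```

With a limit of 2 the naive slice would keep the Han character and drop its selector; this version drops both, which is the behavior argued for above. Whether that is what users actually want is a separate question.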

Of course my GetStringTypeW discussion that I used to introduce the topic also points out the detection solution, one that would incidentally be used for a lot of the other cases I have discussed in the series.
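To give a feel for that detection approach without calling the Win32 API, here is a minimal sketch that swaps in Python's `unicodedata` general categories (Mn and Me) as a stand-in for the C3_NONSPACING check. That substitution is an assumption made for illustration, not a claim that the two classifications match character for character, and `is_nonspacing` and `clusters` are hypothetical names:

```python
import unicodedata


def is_nonspacing(ch: str) -> bool:
    """Stand-in for a C3_NONSPACING test: nonspacing or enclosing mark."""
    return unicodedata.category(ch) in ("Mn", "Me")


def clusters(text: str) -> list[str]:
    """Split text into units, gluing each nonspacing character onto
    the character that precedes it."""
    units: list[str] = []
    for ch in text:
        if units and is_nonspacing(ch):
            units[-1] += ch      # attach to the preceding unit
        else:
            units.append(ch)     # start a new unit (also covers a leading mark)
    return units


# 'e' + combining acute accent, then 'x' + VS16
parts = clusters("e\u0301x\uFE0F")
```

Selection, truncation, and cursor movement can then work in terms of these units rather than individual code points. This is much cruder than full grapheme cluster segmentation, but it shows the idea.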

Of course this leads to another interesting case, which I will discuss next time....

And maybe more on variation selectors and other related matters, some other time entirely.

I don't know about that. I'd rather see the VS truncated than both the whole previous character and the VS truncated. Without the VS, the result may look weird, but it'll be more usable than if the whole base character disappears as well.

That can depend on your customer -- someone Japanese may well disagree with you about that in regard to form. Also note that truncation is not the only operation we are talking about here of course...

Actually, the result is more likely to look weird with the VS; and if stripped of the VS the result would generally look normal. But, anyway I agree with John that I would not consider the VS and its base character as an inseparable entity like a surrogate pair, and personally like to be able to treat VS's as characters when editing text -- deleting them and trying new different VS's on different characters to get the desired result.

Okay, an opinion not brought up with Unicode (or at least not written in for its own recommendations...).

But let's take the cut/paste case as an example -- you'd really want the VS to be left dangling alone with no base character, even though the user selected the "changed" character?

Or cursor movement, when display works right. You really want the cursor weirdness that would manifest itself, with no other visual indication of what's going on?

Okay, differences in opinion are good, always. :-)

But I am curious, especially of John and Andrew but really of everyone: in the myriad of cases, what is the best result? Imagine (in addition to the specific scenarios John and Andrew mentioned) the following:

-- Delete a section, including an alternate glyph at the end -- Should the VS be left in the old text, and does the answer change when the VS is alone at the beginning vs. attached to a new (possibly illegal or potentially worse legal but unintended choice of) base character?

-- Copy/Paste -- All of the above, plus: -- should the alternate glyph be lost?

You see what I am getting at -- what do you think should happen with the orphaned VS, which is now just in the text, unintended? I am worried about two principles, 1) what to do with the original intent of the text, and 2) what to do with the potentially illegal, potentially unintended result of the remaining text....

As a by the way, this is one of the many reasons I was against the VS notion the entire time it has been alive.

Does the answer to any above change with the argument (of some, and which I also disagree with) that new simplified characters should be traditional characters with variation selectors?

I think I'm too close to variation selectors to be able to answer your questions objectively. I like to have full control over my text, including variation selectors, but I can see that many users would be annoyed if the appearance of their text kept getting screwed up because of some invisible entity that was not staying in its place.

"new simplified characters should be traditional characters with variation selectors"

I know that there are certain people in the UTC who favour this approach, but I don't think it has any chance of being adopted, because there would be massive opposition from IRG and WG2. At this stage in the game it is simply too late to change the model for Han encoding; and all we can do is bite the bullet and encode all required simplified character forms as separate characters (about half of the CJK-D characters currently under ballot are simplified forms of existing traditional characters).

Cut/paste isn't quite the same thing as selection: in some contexts, when we cut selected text, following whitespace also goes with it. In that case, I'd probably extend the selection to include any following invisible character (what is currently done for bidi marks in that situation?)
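A minimal sketch of that "extend the selection" idea, limited to variation selectors. The function names are hypothetical, offsets are in code points, and the selector ranges are the same assumed set as elsewhere (U+FE00..U+FE0F, U+E0100..U+E01EF, U+180B..U+180D); bidi marks and other invisible characters are deliberately left out:

```python
def is_variation_selector(ch: str) -> bool:
    """Hypothetical helper covering the selector ranges assumed above."""
    cp = ord(ch)
    return (
        0xFE00 <= cp <= 0xFE0F
        or 0xE0100 <= cp <= 0xE01EF
        or 0x180B <= cp <= 0x180D
    )


def extend_selection(text: str, start: int, end: int) -> tuple[int, int]:
    """Grow the half-open selection [start, end) so that any variation
    selectors immediately after it are cut along with it."""
    while end < len(text) and is_variation_selector(text[end]):
        end += 1
    return start, end


sample = "A\u8FBB\U000E0100B"   # 'A', a Han character, VS17, 'B'
start, end = extend_selection(sample, 1, 2)
cut_text = sample[start:end]
remaining = sample[:start] + sample[end:]
```

Selecting just the Han character at offsets 1..2 ends up cutting the selector with it, so nothing is left dangling in front of the 'B'.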

In addition to Andrew's reasons, another reason why VSes shouldn't be used for simplified characters is that N traditional characters often map to the same simplified character, so it would be impossible to tell by looking at a text what underlies it. There are other situations where this is true, like:

english = CIBARA

(which word do you read first?), but that's no reason to add to it.

Not all clipboard cases grab surrounding spaces and most do not if you do not select the space.

The Bidi marks are another case, one I talk about tomorrow....