UCS-2 to UTF-16, Part 4: Talking about the ask

by Michael S. Kaplan, published on 2008/10/06 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/10/06/8977552.aspx

Previous blogs in this series of blogs on this Blog:

If you have been following this series, you might be wondering what comes next.

Perhaps it would be about selection and all of the interesting issues there (especially the weirdness of selecting partial elements in some cases and, more interestingly, the way we would think it weird if partial elements were not selected in other cases). It may be impossible to separate our intuitive expectations coming fresh into computers from our intuitive behavior based on generations of typewriters and then computers.

But it is really kind of old hat, minus an issue or two like the ones summarized in More on cursor support: the rest of the answer and the earlier blogs it references, plus a blog or two where I mention how weird it is to select half a surrogate pair or half a composite (decomposed) character -- like the cases I mention in More on cursor movement.

And when you get down to it, these issues really are a natural extension of the ones involving cursor movement -- since selection is often just moving the cursor with the SHIFT key held down (and has to behave the same way even when it's not).

Or I could go down the road of how the more destructive operations (e.g. the BACKSPACE key and the DELETE key) come into play here, though they too are just natural extensions of cursor movement and selection (the one major difference being the exception I describe in BACKSPACE vs. DELETE and What do you get when you combine a base character with a buttload of diacritics?), and really I think the behavior difference between BACKSPACE and DELETE is pretty sensible, and pretty defensible.

If you want to get into the one that is harder to suss out, I would say it is probably easier to get confused about the difference between the BACKSPACE key and the BACK ARROW key. The fact that these two do not behave the same way is explainable but may be even less intuitive....

And of course none of it should happen for surrogate pairs in any case -- they should always be whole units and never split out, as the blogs above mention.
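To make the "whole units" rule concrete, here is a minimal sketch of cursor movement over UTF-16 code units that never stops between the two halves of a surrogate pair. The helper names (to_utf16_units, next_cursor, prev_cursor) are made up for illustration, and the sketch deliberately ignores combining sequences, which need the separate handling discussed in the cursor blogs mentioned above.

```python
def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def next_cursor(units, pos):
    """Move right by one code unit, or by two when a surrogate pair starts at pos."""
    if pos >= len(units):
        return len(units)
    if (is_high_surrogate(units[pos]) and pos + 1 < len(units)
            and is_low_surrogate(units[pos + 1])):
        return pos + 2
    return pos + 1


def prev_cursor(units, pos):
    """Move left by one code unit, or by two when a surrogate pair ends at pos."""
    if pos <= 0:
        return 0
    if (pos >= 2 and is_low_surrogate(units[pos - 1])
            and is_high_surrogate(units[pos - 2])):
        return pos - 2
    return pos - 1


# "a", U+10400 (a supplementary character, so two code units), "b"
units = to_utf16_units("a\U00010400b")
```

Stepping right from position 1 lands on position 3, and stepping left from position 3 lands back on position 1, so position 2 (the middle of the pair) is never a valid cursor stop. Selection with the SHIFT key held down would reuse the same two functions.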

But in the end what is left to talk about?

The main thing that is left to describe in this series is what support should be added to a product that is Unicode but thoroughly "UCS-2" based (using the definitions in this series) when one wants to move it up to be "UTF-16" based.

In other words, the obvious remaining practical question is what work needs to be done?

And that is a very good question.

Now for my example I will use a type of product that simultaneously

is much more UCS-2 based than any product ought to be, and
knows much more about this whole idea of clusters of characters than most products ever might.

And that product type is databases.

The first criterion is met by the fact that they have a firm base in the middle of allocation issues -- whether it is column lengths or whatever -- and also "character" based parameters/syntax in SQL.
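A small sketch of why allocation gets interesting, assuming a hypothetical column whose declared length counts 16-bit code units (the function names and the column model are illustrative, not any particular engine's behavior):

```python
def utf16_length(text):
    """Number of UTF-16 code units, which is what a UCS-2-minded length counts."""
    return len(text.encode("utf-16-le")) // 2


def code_point_length(text):
    """Number of Unicode code points (a Python str is indexed by code point)."""
    return len(text)


def truncate_to_units(text, max_units):
    """Keep at most max_units UTF-16 code units without splitting a surrogate pair."""
    kept = []
    used = 0
    for ch in text:
        width = 2 if ord(ch) > 0xFFFF else 1  # supplementary characters take two units
        if used + width > max_units:
            break
        kept.append(ch)
        used += width
    return "".join(kept)


sample = "ab\U00010400"  # two BMP letters plus one supplementary character
```

Here sample is three code points but four code units, so it does not fit a hypothetical three-unit column even though a user would count three "characters". A truncation that respects UTF-16 has to drop the whole supplementary character rather than keep half of it.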

And the second is met by the fact that, if you look at the categories from earlier in this series, the second and third categories (and if you include SQL Server 2008, the first category as well) are well handled by the collation support used by the engine for almost all of its comparison operations, even the ones it should not!

This schizophrenic behavior is one I have mentioned in the past in blogs like Wild[card] thing, You make my CHAR sing and the follow-up to it, with the other side of the equation somewhere in the realm of blogs like the Freudianly-themed Sometimes a WCHAR really *is* just a character.....

Perhaps with With SQL Server (and SQL itself) comes the illogic of 'trailing spaces' (and the myth of fixed width) thrown in for good measure....

But here and now we run into a bigger problem -- to do a full job here, up to three things would be needed:

For functions and such that currently take one Unicode character, a way to pass more than one would be needed -- somewhat like the way .NET took the Char.IsLetter method and its one IsLetter(Char) signature and added an IsLetter(String, Int32) overload;
For things like wildcards (as discussed in Wild[card] thing, You make my CHAR sing), a new wildcard would be needed to specify a linguistic character;
For things like linguistic string lengths, something like the StringInfo class would be needed.
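As a rough illustration of the first and last items, here is a sketch in Python rather than .NET. The function is_letter_at mimics the shape of a (string, index) overload by looking at the code unit at an index and, when a surrogate pair starts there, at the one after it. The function text_element_count is only an approximation of a linguistic length: it treats a base character plus any following combining marks as one element, which is not the full set of rules a real implementation would follow. Both names are hypothetical.

```python
import unicodedata


def is_letter_at(units, index):
    """Report whether the code point that starts at units[index] is a letter.

    units is a list of UTF-16 code units; a surrogate pair is decoded first.
    """
    unit = units[index]
    if (0xD800 <= unit <= 0xDBFF and index + 1 < len(units)
            and 0xDC00 <= units[index + 1] <= 0xDFFF):
        code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[index + 1] - 0xDC00)
    elif 0xD800 <= unit <= 0xDFFF:
        return False  # unpaired surrogate, or an index into the middle of a pair
    else:
        code_point = unit
    return unicodedata.category(chr(code_point)).startswith("L")


def text_element_count(text):
    """Approximate linguistic length: combining marks attach to the preceding base."""
    count = 0
    for ch in text:
        if count > 0 and unicodedata.category(ch).startswith("M"):
            continue  # part of the previous text element
        count += 1
    return count


# "a", U+10400 as a surrogate pair, "e", U+0301 COMBINING ACUTE ACCENT
units = [0x0061, 0xD801, 0xDC00, 0x0065, 0x0301]
text = "a\U00010400e\u0301"
```

With these inputs the string is five code units and four code points, but text_element_count reports three elements. A single-code-unit IsLetter-style check cannot say anything useful about index 1 on its own, while is_letter_at can, because it is allowed to look at the next code unit.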

Now obviously these three items, while simple enough conceptually, could lead to a ton of actual work, so how much would actually be needed would have to be triaged.

And ideally, if one can get away with just extending existing support, that would be much better (as would handling all three categories of linguistic character rather than just some of them).

Next time, I'll talk about some of the triage rules.

This blog brought to you by ດ (U+0e94, aka LAO LETTER DO)


