UCS-2 to UTF-16, Part 3: It starts with cursor movement (where MS simultaneously gets better and worse)

by Michael S. Kaplan, published on 2008/09/18 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/09/18/8956650.aspx

Okay, so far we have introduced the topic, pointed out that 9/10ths of what a person was going to run off and do is probably too much, and then jumped into defining the sequences of storage characters that meet the definition of what a user calls a character.

So what are the things you can do with them, if you are armed with this knowledge?

Well, since we are focusing on "user" characters, we'll start there -- with users moving through a text stream. You know, using the arrow keys to move either forward or backward through the text, and watching the cursor as they go.

The ideal behavior that the user expects without thinking about it is not too complicated: if they think of something as a single character, then they do not expect it to take multiple arrow keypresses to move through it.

In other words, they want the computer to understand the text in the same way that they do.

Now although that is simple conceptually, it is not always supportable by software today -- especially when one considers sort elements, where there is no easy function to call that finds those boundaries. The underlying data exists in collation algorithms (for example Microsoft's and the UCA's) and is used to define the sorting behavior of those elements, but when they are independent letters like the Traditional Spanish ch or the Hungarian dzs, there generally isn't an easy way to query for the information.

Now in this case there has not been such a method for as long as computers or even typewriters have been there, so it may be stretching the definition of expectation to assume that people would expect computers to understand the boundaries of a sort element when it comes to cursor movement. At best they would be pleasantly surprised if this happened, and at worst they would think of it as a bug.

Finding out whether this is learned behavior or an intuitive expectation would make for a fascinating study if the parameters for determining the truth could be defined. It makes me jealous of the ClearType folks when I compare the number of studies they do related to reading with how large the budget is to commission such a study in Windows International.

So realistically we can put that third type of linguistic character aside for now.

Looking at the other two types of linguistic characters -- surrogate pairs (aka supplementary characters) and grapheme clusters (aka text elements) -- generally users don't want or expect to require multiple keypresses to get through what they think of as a single character.

This raises an interesting question for a developer performing an operation that is enumerating a string.

Should the developer ever care about the answer on the length of a string or substring when they are scrolling through a character?

I mean, take the word 𐎀𐎇𐎖 (this is not a word so much as a stream of Ugaritic letters).

Using a modern browser like Firefox I have no problem seeing the string treated as three characters, despite the fact that under the covers it is actually six UTF-16 code units: U+d800 U+df80 U+d800 U+df87 U+d800 U+df96.

And if you try to click in the middle of a letter you are never given the opportunity -- it always picks a side and puts you in one spot or the other.
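To make the storage-versus-linguistic distinction concrete, here is a minimal Python sketch. It is not how Firefox or Windows does it; the helper names (utf16_units, snap_to_boundary, next_position) are made up for illustration. It lists the UTF-16 code units behind the three Ugaritic letters and shows one way a cursor index could be kept from landing between the two halves of a surrogate pair.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units (storage characters) of text as integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def snap_to_boundary(units, index):
    """If index falls between the halves of a surrogate pair, move it back one unit."""
    if (0 < index < len(units)
            and is_low_surrogate(units[index])
            and is_high_surrogate(units[index - 1])):
        return index - 1
    return index

def next_position(units, index):
    """Advance the cursor by one code point: two units for a pair, otherwise one."""
    if index >= len(units):
        return len(units)
    is_pair = (is_high_surrogate(units[index])
               and index + 1 < len(units)
               and is_low_surrogate(units[index + 1]))
    return index + (2 if is_pair else 1)

ugaritic = "\U00010380\U00010387\U00010396"  # the three Ugaritic letters above
units = utf16_units(ugaritic)
print(len(ugaritic))              # number of code points
print(len(units))                 # number of UTF-16 code units
print([hex(u) for u in units])    # the code units themselves
```

This only handles surrogate pairs. Grapheme clusters need more data than a range check, which is where classes like StringInfo come in.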

So clearly there are times that a developer might need to care about this fact, and therefore there should be a good way to provide this.

But there are things like .NET's StringInfo class, which will help map the storage characters to the linguistic characters -- something I have talked about before.

Though as I pointed out in Sometimes you need more than StringInfo, there are cases in between the second and third category that actually do have data somewhere.

Thus in Assamese ম্পা is four Unicode code points (U+09ae U+09cd U+09aa U+09be), two text elements according to StringInfo, but we know from prior "Virama-esque" posts like this one that this is actually a conjunct. So as Sometimes you need more than StringInfo points out, there is a construct that the computer understands that is not being provided as easily to developers.
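The gap can be illustrated with a rough Python sketch. To be clear about the assumptions: simple_text_elements is only an approximation of the "base character plus following combining marks" idea, not StringInfo's actual implementation, and merge_conjuncts is a hypothetical, deliberately naive rule (join across a virama when a letter follows) that is nowhere near what Uniscribe really does for shaping.

```python
import unicodedata

VIRAMA = "\u09cd"  # BENGALI SIGN VIRAMA, also used for Assamese

def simple_text_elements(text):
    """Group each character with any combining marks (Mn, Mc, Me) that follow it."""
    elements = []
    for ch in text:
        if elements and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements

def merge_conjuncts(elements):
    """Naive rule: an element ending in a virama absorbs a following element that starts with a letter."""
    merged = []
    for element in elements:
        starts_with_letter = unicodedata.category(element[0]).startswith("L")
        if merged and merged[-1].endswith(VIRAMA) and starts_with_letter:
            merged[-1] += element
        else:
            merged.append(element)
    return merged

text = "\u09ae\u09cd\u09aa\u09be"  # U+09ae U+09cd U+09aa U+09be
elements = simple_text_elements(text)
conjuncts = merge_conjuncts(elements)
print(len(text), len(elements), len(conjuncts))  # 4 2 1
```

Four code points, two text elements, one conjunct: the last number is the one a cursor should respect, and it is the one with no easy API behind it.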

Now I am tempted to call this yet another category, and it really is a grapheme cluster that is not a text element.

I think in the long run it would be better if Microsoft treated this as a limitation/bug in StringInfo and its definition of text element and either fixed it or added a new construct to handle this additional understanding of "characterness" that Uniscribe clearly understands even if StringInfo does not.

In other words, Microsoft ought to provide the mechanisms that it actually does expose in easier ways here.

Because no method should break up a conjunct, or put a cursor in the middle of one. But how is a developer supposed to support all that without a way to get at the data?

Now in Sometimes you need more than StringInfo I actually asked if samples for this kind of data would be desirable and nobody responded, but I'll ask again to see if I have inspired interest. Any takers? :-)

I realize you're only 3 blog posts into this series, but coming from the web side of things (where UTF-8 is king), I find your description pretty puzzling.

Sorry about that! -- Michael

The transition from UCS-2 to UTF-16 could be considered analogous to the transition from e.g. ISO-8859-1 (or any other 8-bit encoding) to UTF-8: instead of processing fixed-width units, you process variable-width units. And, in order to migrate old data to new data, you need to convert it/clean it up. In the case of UTF-16, this means making sure there are no unpaired surrogate units around. In the case of UTF-8, this means converting any 0x80-0xFF byte to a two-byte sequence.

Over here, we consider it to be much more than that... -- Michael
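For reference, the two cleanup steps named in that comment can be sketched in a few lines of Python. This is only an illustration of the comment's narrow storage-level view, with a made-up helper name (find_unpaired_surrogates); it does nothing about the linguistic-character issues this series is about.

```python
def find_unpaired_surrogates(units):
    """Return the indexes of surrogate code units that are not part of a high+low pair."""
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate is valid only when a low surrogate follows it.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate reached here was not preceded by a high surrogate.
            bad.append(i)
        i += 1
    return bad

# ISO-8859-1 byte 0xFF becomes a two-byte sequence when re-encoded as UTF-8.
latin1_as_utf8 = bytes([0xFF]).decode("latin-1").encode("utf-8")

print(find_unpaired_surrogates([0xD800, 0xDF80]))          # a well-formed pair
print(find_unpaired_surrogates([0xDF80, 0x0041, 0xD800]))  # two stray surrogates
print(latin1_as_utf8)
```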

In the case of UTF-8, I've never really seen anyone talk about treating the individual bytes as a series of 'characters' to skip over. That is, the notion of having the cursor in the middle of a multi-byte UTF-8 sequence seems absurd: characters only exist above the encoding level, and the UTF-8 bytes are merely a storage mechanism.

Well, the issue is that people doing Unicode in Win32 still often keep the storage character and the linguistic character together, and the fact is that moving from UCS-2 to UTF-16 is best thought of as separating the two -- Michael

From this point of view, it seems weird that you're mixing the storage aspects of UTF-16 surrogates with the linguistic aspects of handling characters. The former is encoding specific (and doesn't happen with UTF-32) while the latter applies regardless of the Unicode encoding used.

I'm not mixing them up, I am introducing a concept that a distressing number of programs do not handle at all! -- Michael

Saying that 'under the covers', U+10380 U+10387 U+10396 is really just U+d800 U+df80 U+d800 U+df87 U+d800 U+df96 (in UTF-16) seems as absurd to me as saying that 'under the covers' in UTF-8, U+00ff is really U+00c3 U+00bf.

The main difference is that your statement would not be true, since those byte sequences would not be represented that way! :-)

On a sort-of related note, it's always bugged me that the convention is to denote Unicode codepoints as U+xxxx by default, even though the range extends to U+10FFFF. I know about Unicode's 16-bit origins, but it seems silly today to write U+00ff instead of either U+ff (no padding) or U+0000ff (full padding). It certainly doesn't help when you're trying to explain that Unicode is not 16-bit when pretty much every technical document out there writes out BMP codepoints as if it is.

I will probably blog about this topic, it has been on my list for a while now - Michael