UCS-2 to UTF-16, Part 5: What's on the Next Level?

by Michael S. Kaplan, published on 2008/10/15 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/10/15/9000371.aspx

Previous blogs in this series of blogs on this Blog:

Now going through the previous parts, there are two conclusions that developers and architects can come to:

  1. The basic problem is much simpler than it seemed at first;
  2. The full problem is not only more complicated than it seemed, but there is currently no full, query-able support for it on the platform.

It was never about UCS-2; it was never about UTF-16. It was about character identity, and the desire to not destroy meaning....

But it is important to look at the full scope of the kinds of things that users might expect here.

The other day, a mail thread about the "delete" side of the problem (that is, how to safely truncate a string) touched on several of these aspects.

In that thread, Peter Constable suggested:

SCRIPT_LOGATTR::fCharStop returned by Uniscribe’s ScriptBreak() API ought to tell you valid truncation points – including not to break between surrogate pairs, between a base character and a combining mark, or within Indic clusters.
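To make the idea of "valid truncation points" concrete, here is a minimal Python sketch. It is not Uniscribe and does not call ScriptBreak(); it only approximates two of the cases Peter lists (surrogate pairs, and base character plus combining marks) and makes no attempt at Indic clusters. The names safe_truncate, utf16_len and is_extender are made up for this illustration, and the budget is counted in UTF-16 code units on the assumption that this is what the buffer holds.

```python
import unicodedata

def utf16_len(ch):
    """Number of UTF-16 code units needed for one code point."""
    return 2 if ord(ch) > 0xFFFF else 1

def is_extender(ch):
    """True for combining marks (general categories Mn, Mc, Me).
    This is an approximation, not Uniscribe's cluster logic."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def safe_truncate(text, max_units):
    """Longest prefix of text that fits in max_units UTF-16 code units
    without splitting a surrogate pair or a base + combining mark run."""
    used = 0   # UTF-16 code units consumed so far
    end = 0    # index of the last safe cut, in code points
    i = 0
    while i < len(text):
        # Gather one unit: a code point plus any marks that follow it.
        j = i + 1
        while j < len(text) and is_extender(text[j]):
            j += 1
        cost = sum(utf16_len(c) for c in text[i:j])
        if used + cost > max_units:
            break
        used += cost
        end = j
        i = j
    return text[:end]
```

With a budget of 2 code units, safe_truncate("a\U0001D11Eb", 2) keeps only "a" rather than half of the supplementary character, and safe_truncate("e\u0301x", 1) returns an empty string rather than a bare "e" stripped of its accent.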

This gives a practical suggestion to handle many of the problems pointed out earlier, problems that encompass the first two of the three major categories of linguistic characters I defined earlier in the series.

Peter then went on to mention:

One case that won’t tell you, though: In Thai and Lao, it’s best not to break between {U+0E40..U+0E44, U+0EC0..U+0EC4} and a following consonant letter {U+0E01..U+0E2E, U+0E81..U+0EAE}.
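The Thai and Lao rule is simple enough to sketch directly. The following Python is only an illustration of the check Peter describes, with hypothetical function names; it assumes the Lao consonant range is U+0E81..U+0EAE (the Lao counterpart of the Thai range), and it treats every code point in the quoted ranges as qualifying, including any unassigned ones.

```python
def is_leading_vowel(ch):
    """Thai U+0E40..U+0E44 or Lao U+0EC0..U+0EC4 (written before the consonant)."""
    cp = ord(ch)
    return 0x0E40 <= cp <= 0x0E44 or 0x0EC0 <= cp <= 0x0EC4

def is_thai_lao_consonant(ch):
    """Thai U+0E01..U+0E2E or Lao U+0E81..U+0EAE (Lao range is an assumption)."""
    cp = ord(ch)
    return 0x0E01 <= cp <= 0x0E2E or 0x0E81 <= cp <= 0x0EAE

def splits_vowel_consonant(text, cut):
    """True if cutting text before index `cut` would separate a leading
    vowel from the consonant that follows it."""
    if cut <= 0 or cut >= len(text):
        return False
    return is_leading_vowel(text[cut - 1]) and is_thai_lao_consonant(text[cut])

def adjust_cut(text, cut):
    """Move the cut back by one code point when it would split such a pair."""
    return cut - 1 if splits_vowel_consonant(text, cut) else cut
```

For example, in "\u0e01\u0e40\u0e01" a cut at index 2 lands between SARA E and KO KAI, so adjust_cut moves it back to index 1.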

Now these two cases actually also happen to fall under the third category which I named as sort elements, since these are the kinds of things that would also tend to impact collation operations.

Though the fact that a "collation dude" like me would call them sort elements while a "font dude" like Peter would not suggests that perhaps a name that encompasses a wider description of what these items would be might be in order. Otherwise we are each being partially functionally descriptive but not in any kind of robust, complete way.

In this case the result is the same, and there is no function to get all of the information anyway, so naming it might be less important at the moment. :-)

There was also another aspect of truncation discussed in the thread:

Also additional support for Arabic [could be] added since when breaking in the middle of a word, the character shape would change from the medial or initial form to the final form... ...inject a ZWJ to keep the form from changing so the user has context that additional text follows and wouldn’t change the meaning/context.
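Just to make that suggestion concrete (not to endorse it), here is a rough Python sketch of what "inject a ZWJ" could look like. Everything in it is an assumption for illustration: the function names, the restriction to the basic Arabic letters U+0621..U+064A, and above all the small hand-written set of letters that do not join to a following letter, which is not the full Unicode joining data. It also ignores vowel marks that may sit between two letters.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Hand-picked basic Arabic letters that do not connect to the letter
# after them (hamza, the alef forms, waw with hamza, teh marbuta, dal,
# thal, reh, zain, waw). An incomplete stand-in for real joining data.
NON_FORWARD_JOINING = {chr(cp) for cp in (
    0x0621, 0x0622, 0x0623, 0x0624, 0x0625, 0x0627,
    0x0629, 0x062F, 0x0630, 0x0631, 0x0632, 0x0648,
)}

def is_arabic_letter(ch):
    """Letters in U+0621..U+064A (tatweel, category Lm, is excluded)."""
    return 0x0621 <= ord(ch) <= 0x064A and unicodedata.category(ch) == "Lo"

def truncate_keep_form(text, cut):
    """Cut text before index `cut`; if that lands between two letters that
    would have been joined, append ZWJ so the last kept letter can keep
    its connecting shape instead of switching to the final form."""
    head = text[:cut]
    if 0 < cut < len(text):
        prev, nxt = text[cut - 1], text[cut]
        if (is_arabic_letter(prev) and is_arabic_letter(nxt)
                and prev not in NON_FORWARD_JOINING):
            head += ZWJ
    return head
```

Cutting the four-letter word "\u0643\u062A\u0627\u0628" after its second letter yields the two letters followed by ZWJ; cutting after the third letter (alef, which never joins forward) adds nothing.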

Note that this is a new idea that has not been discussed before in this series, or indeed this blog. I am still wrapping my head around the validity of the notion, since to a native reader of the language the different forms really are the same letter, and I am not convinced that the confusion of truncating in the middle of a word would be mitigated appropriately by such an approach.

Another interesting issue related to the Thai case that Peter mentioned was that while a user might occasionally expect cursor movement and/or selection to respect those category-three sort element boundaries, they would for the most part actually expect to be able to put the cursor in the middle of the sort element, and most might never have such an expectation at all.

This is similar to the way (to use the earlier example for sort elements) a Hungarian user might expect dzs to be handled.

At this point it may be a challenge to find a user who is sophisticated enough to ask about expectations who does not have decades of experience with typewriters and/or computers that have already built up expectations of proper behavior. And for every case where a user might expect multiple characters to be treated as one for movement/selection/deletion due to their experiences with dead keys, there can be just as many cases where the fact that the underlying entry/storage mechanism has multiple characters does not even allow for the possibility of treating the cluster as a single character.

Thus we have a design flaw in the platform where the underlying entry/storage mechanisms are guiding "expected" behavior, rather than the other way around.

Then we can look at yet another aspect of the problem: the way .NET does things.

You know how I quoted Peter Constable's ScriptBreak comments earlier?

This is an idea I mentioned before, in blogs like Stick a fork in GetCharacterPlacement and Sometimes you need more than StringInfo.

Now that blog prophetically talks about how StringInfo can't do it all here (in fact it only handles most of those first two categories of linguistic character boundaries without giving any of the word boundary information).

Add to it the fact that all of the methods that used to take just a System.Char -- for example Char.IsLetter(Char) -- now have a second overload that takes a string and an index -- for example Char.IsLetter(String, Int32). These have only one purpose -- handling surrogate pairs. Thus all of the other cases, where a user might think of a sequence as a single character, or where even Unicode would treat it as a grapheme cluster or as a character via canonical equivalence, are ignored.
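To show how narrow that "string plus index" shape is, here is a small Python model of the idea. It is not the .NET implementation and makes no claim about .NET's exact results; the helper names are invented, and strings are represented as lists of UTF-16 code units since that is the unit the index counts in.

```python
import unicodedata

def to_utf16_units(text):
    """Encode a Python string as a list of UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def code_point_at(units, index):
    """Code point starting at units[index]; combines a high surrogate
    with the low surrogate after it, otherwise returns the unit itself."""
    u = units[index]
    if 0xD800 <= u <= 0xDBFF and index + 1 < len(units):
        low = units[index + 1]
        if 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
    return u

def is_letter_at(units, index):
    """True if the code point at `index` has a letter category (L*).
    Looks at exactly one code point and nothing around it."""
    return unicodedata.category(chr(code_point_at(units, index))).startswith("L")
```

For U+10400 (a supplementary-plane letter), is_letter_at(to_utf16_units("\U00010400"), 0) is True because the pair is combined. For "e\u0301", index 0 is a letter and index 1 is not: the check answers about one code point at a time and has no notion that the two together are what a user would call one letter.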

And this addition of dozens of overloads to .NET >= 2.0 is a partial bandage over a perceived limitation -- the inability of functions to handle surrogate pairs properly -- that ignores all of the other scenarios here.

And .NET has no built-in support for the kind of things ScriptBreak can tell you, which are largely based in what Unicode and its character properties reveal (even though lots of the underlying properties are available via the CharUnicodeInfo class).

I guess this is what happens when the typography and rendering folks are not involved in design discussions that the core NLS folks work on. :-)

But you can see how lots of the pieces are available here, and can be used. Note that if you use the ScriptBreak-provided word break opportunities for your truncation, then you will be able to support all three levels of linguistic character, since none of them would ever straddle a word-breaking opportunity!


This blog brought to you by එ (U+0d91, aka SINHALA LETTER EYANNA)
