UCS-2 to UTF-16, Part 6: An exercise left for whoever needs some exercise

by Michael S. Kaplan, published on 2008/11/24 10:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/11/24/9134826.aspx


Previous blogs in this series of blogs on this Blog:

Now continuing on from prior blogs in the series, I thought I'd quote a bit from a recent email thread about an issue very similar to some of the prior discussion (especially Part 3), related to caret stops, a.k.a. the points where you can put the cursor as you navigate through the string.

The thread had wandered a bit (as threads tend to do!), and then colleague Jerry Dunietz (an architect I have worked with before on issues related to Unicode and cultures and locales and such) offered a great description of many of the issues that I have discussed here in this series. With his permission, I will quote from his response:

Logically, a Unicode string is a sequence of UCS-4 code-points.  (I intend to carefully distinguish between “code-point” and “code-unit” in the text below.)

There are several ways to encode a Unicode string in memory.  Ignoring for today the existence of a byte-order mark (and of byte-ordering variations), there are three ways that one could represent such a string in memory:

As a sequence of 32-bit code-units, each representing a single UCS-4 code-point.  (UTF-32)
As a sequence of 16-bit code-units.  Some UCS-4 code-points are represented by a single code-unit, and some are represented as a “surrogate pair” of two code-units.  (UTF-16)
As a sequence of 8-bit code-units.  The code-points from U+0000 to U+007F are represented as a single code-unit, but all other code-points are represented by a longer sequence of code-units.  (UTF-8)
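To make the three encoding forms concrete, here is a small Python sketch counting code-units for the same three-code-point string; U+10400 (DESERET CAPITAL LETTER LONG I) is just a handy example of a code-point outside the BMP:

```python
# One string, three code-points: 'a' (ASCII), 'é' (BMP, non-ASCII),
# and U+10400 (a supplementary-plane code-point).
s = "a\u00e9\U00010400"

# UTF-32: always one 32-bit code-unit per code-point.
assert len(s.encode("utf-32-le")) // 4 == 3

# UTF-16: U+10400 requires a surrogate pair, so 4 code-units, not 3.
assert len(s.encode("utf-16-le")) // 2 == 4

# UTF-8: 1 + 2 + 4 bytes respectively.
assert len(s.encode("utf-8")) == 7
```

Note how only UTF-32 keeps the code-unit count equal to the code-point count.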

Now imagine that you’re a programmer, and you want to pick a representation in memory for a Unicode string.  UTF-32 seems like the easiest to work with.  (But it is the fattest encoding of the three for any real-world corpus of text.)  If you want the fifth code-point in a string, you just array-index to the fifth code-unit in the string.  It seems (but wait) that if the user hits backspace after inputting a string, you just back up one code-unit.  It seems that if a user hits an arrow key to move a caret from one character to the next, you would just advance the caret position by one code-unit. (Given all of this apparent simplicity, there seems to be a compelling argument for defining C++’s wchar_t to be 32-bits long.)

But it turns out that the stuff I wrote using the word “seems” is an over-simplification.  Unicode has code-points that correspond to combining characters.  Such characters combine with a previous code-point (or sequence thereof) to present to the user what appears to be a single character.  Section 2.11 of the Unicode 5.0 spec provides lots of examples of different combining characters, in different languages.  From a font-rendering glyph selection point of view, or from the point of view of a user attempting to move a text cursor from character to character, a single glyph or character may correspond to a sequence of multiple UCS-4 code-points, and thus multiple UTF-32 code-units.  Given that such situations exist, an internationally-robust program working with a UTF-32 string must be prepared for the concept that multiple code-units or code-points correspond to the user’s concept of a single character.
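The combining-character point is easy to demonstrate in Python, comparing the precomposed and decomposed forms of "ë":

```python
import unicodedata

single = "\u00eb"      # 'ë' as one precomposed code-point
combined = "e\u0308"   # 'e' + U+0308 COMBINING DIAERESIS: two code-points

# Same user-perceived character, different code-point counts:
assert len(single) == 1 and len(combined) == 2

# The two forms are canonically equivalent (NFC folds them together):
assert unicodedata.normalize("NFC", combined) == single

# U+0308 is a combining mark: its canonical combining class is nonzero.
assert unicodedata.combining(combined[1]) != 0
```

So even in UTF-32, where every code-point is one code-unit, "one code-unit" still does not mean "one character" as far as the user is concerned.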

But if our program needs to deal with the possibility of a user’s concept of a character corresponding to multiple code-units, then the apparent advantage to the programmer of using a UTF-32 representation instead of a UTF-16 one goes away. (And given the real-world size advantages of UTF-16, it now seems that making wchar_t be 16 bits is a better choice than making it be 32-bits.)

Whether you choose UTF-16 or UTF-32 encoding, suppose you want to spec a file format that can be easily displayed, but for which a viewer program can easily support text selection.  You could ask the viewer program to build in lots of smarts to determine where a logical caret-stop can occur.  Or you can ask the program that built the file to encode the caret-stop information within the file itself, reducing the burden on the viewing program.
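The "encode the caret stops in the file itself" option might look something like this sketch (a made-up JSON layout purely for illustration, not any real file format):

```python
import json

def pack_with_stops(text, stops):
    # Store the text together with its precomputed caret stops, so a
    # viewer program never has to derive the stops itself.
    return json.dumps({"text": text, "caret_stops": stops})

# The viewer just deserializes and trusts the stored stops:
record = json.loads(pack_with_stops("abcde\u0308f", [0, 1, 2, 3, 4, 6, 7]))
assert record["caret_stops"] == [0, 1, 2, 3, 4, 6, 7]
```

The trade-off is the usual one: the writer does the expensive analysis once, and every reader is spared it, at the cost of a slightly larger file.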

This is really a great summary of the issues surrounding the UTF-16 vs. UTF-32 debate, as well as the whole UCS-2 vs. UTF-16 one I have been covering already. I can't really take full credit for his knowledge in this area since I was just one of several sources, but I like to think I'm in there somewhere, and it is always good to know when one has been helpful in influencing an influencer (which indirectly puts me in mind of both the influence vs. impact issue and the fact that Ms. Phair might be wrong about the influence of an Ant in Alaska, sometimes!).

Now the question of whether to store the information about the caret stops explicitly versus calculating them via a StringInfo-type technique (StringInfo is something I have talked about previously, in blogs like this one) is an interesting one.

Really, in the end it depends on whether you think the pure "derived from Unicode data" answer is sufficient or whether you have additional sources of information. In the particular case Jerry was thinking of, they rely on much more sophisticated methods, such that constantly calculating on the fly could really impact performance. Since the data itself in his case is read-only, there was never any worry about the potential need for recalculation, so calculating once and storing the information was a no-brainer for them.

But even in a read-write situation, providing a sorted array of indexes for easy enumeration is really just an ordinary interview question in disguise, and one that is pretty easily solved, too. Were it even vaguely "international" I'd say we should solve it here, but it's not, so I'll leave that as an exercise for whoever feels that they need the exercise. :-)

Though of course I'll point out a few issues to keep in mind for anyone who would want to implement such a cache....

Obviously one has to be ready to re-index the items after the point of change for the insertion/deletion case. This could be expensive, depending on where the action is happening and how much text follows it (when to move to slightly more complex schemes than the doubly linked list that the initial problem suggests is also left as an exercise!).
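The bulk-shift part of that re-indexing is the easy piece; here is a minimal Python sketch (a hypothetical helper, and note it only shifts the cached stops — it does not re-derive the stops in the neighborhood of the change, which is the harder problem the next point covers):

```python
def shift_stops(stops, at, delta):
    # After inserting (delta > 0) or deleting (delta < 0) code-units at
    # index 'at', every cached caret stop at or beyond that point moves
    # by delta; stops before the change point are untouched.
    return [s if s < at else s + delta for s in stops]

# Insert two code-units at index 3 of a six-character string:
assert shift_stops([0, 1, 2, 3, 4, 5, 6], 3, 2) == [0, 1, 2, 5, 6, 7, 8]
# Deleting them again restores the original stops:
assert shift_stops([0, 1, 2, 5, 6, 7, 8], 3, -2) == [0, 1, 2, 3, 4, 5, 6]
```

For very long texts, a flat array makes this shift O(n); that cost is what motivates the "slightly more complex schemes" mentioned above.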

But less obviously, in the area immediately preceding and following the insertion point, one can potentially need to recalculate caret stops due to changes in the text. For example, if one has the letters "abcdef" then one will have seven indices (one for the caret stop before each character as one moves across the string, plus one for the end):

{0, 1, 2, 3, 4, 5, 6}

Then if one decides to add a combining umlaut after the initial "e", the new string is "abcdëf" and the indices would now have to be:

{0, 1, 2, 3, 4, 6, 7}

and so on. And in other cases a formerly combining character before or after might now not be....
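A minimal Python sketch of that recalculation, treating any code-point with a nonzero combining class as attaching to the previous caret stop (real cluster boundaries, per UAX #29 or .NET's StringInfo, involve more rules — this is only an approximation sufficient for the combining-umlaut example above):

```python
import unicodedata

def caret_stops(s):
    # A new caret stop falls before any code-point that is not a
    # combining mark (canonical combining class 0), plus one final
    # stop for the end of the string.
    stops = [i for i, ch in enumerate(s) if unicodedata.combining(ch) == 0]
    stops.append(len(s))
    return stops

assert caret_stops("abcdef") == [0, 1, 2, 3, 4, 5, 6]
# Insert a combining umlaut after the 'e' ("abcdëf"): the stop at 5
# disappears, and everything after it shifts by one.
assert caret_stops("abcde\u0308f") == [0, 1, 2, 3, 4, 6, 7]
```

This is exactly why a cache cannot just shift indices after an edit: the stop that used to exist at index 5 is gone entirely.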

From there the exact methods used for calculation of caret stops and how to integrate them together comes into play, as does how the different methods [might] interact. It really can become a rather fascinating technical question, in the end. Though mostly not for this blog, except more on the various methods eventually. :-)

Getting back to the series for a second, there are a few points left to cover, such as the actual means of support and then the whole UTF-8 question (which is still fair game here!).

I'll cover these in upcoming blogs....


This blog brought to you by (U+17bf, aka KHMER VOWEL SIGN YA)


no comments

referenced by

2009/06/29 UCS-2 to UTF-16, Part 11: Turning it up to Eleven!

2009/06/10 UCS-2 to UTF-16, Part 10: Variation[ Selector] on a theme...

2008/12/16 UCS-2 to UTF-16, Part 9: The torrents of breaking CharNext/CharPrev

2008/12/09 UCS-2 to UTF-16, Part 8: It's the end of the string as we know it (and I feel ellipses)

2008/12/04 UCS-2 to UTF-16, Part 7: If it makes the SQL Server columns too small then it made the Oracle columns either too smallER or too smallEST
