UCS-2 to UTF-16, Part 8: It's the end of the string as we know it (and I feel ellipses)

by Michael S. Kaplan, published on 2008/12/09 10:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/12/09/9187379.aspx

Now when you only have a certain amount of space in a control and you have to truncate the text inside it, one common thing for developers to do is look at the StringInfo class that I have talked about before, so that the text only ever gets truncated on character boundaries.

If you need to do it in native rather than managed code, having to do all the work yourself may even seem a little daunting, but once aimed at the way to solve the perceived problem, you may be off and running.

You know, make it a simple UCS-2 vs. UTF-16 issue like the ones we have done before.

To be honest, this is usually the wrong approach.

It is a great approach if what one is worried about is the size of storage, like a SQL Server column whose size cannot be exceeded.
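For that storage case, a minimal sketch in portable C might look like the following. The function name utf16_truncate_units is made up for illustration, and it assumes the limit is counted in UTF-16 code units. It only protects surrogate pairs; combining sequences are not considered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True for a high (lead) surrogate, U+D800..U+DBFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* True for a low (trail) surrogate, U+DC00..U+DFFF. */
static int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/*
 * Hypothetical helper: returns how many UTF-16 code units of s (length len)
 * can be kept without exceeding max_units and without cutting a surrogate
 * pair in half. It does not look at combining sequences.
 */
size_t utf16_truncate_units(const uint16_t *s, size_t len, size_t max_units)
{
    assert(s != NULL || len == 0);
    if (len <= max_units)
        return len;

    size_t keep = max_units; /* keep < len here, so s[keep] is readable */

    /* If the cut would fall between the two halves of a pair, back up one. */
    if (keep > 0 && is_high_surrogate(s[keep - 1]) && is_low_surrogate(s[keep]))
        keep--;
    return keep;
}
```

So a four-unit string holding A, U+1F600 and B that has to fit in two units keeps only the A, rather than the A plus a lonely high surrogate.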

But not so good if one is dealing with a rendering issue -- with truncation of the display of text in a user interface.

Lest we forget, the gap between the width of the widest character and that of the skinniest one really shows the heart of the problem. Just think about what UI will do with

WWWWWWWWWWWWWWWWWWWW

vs.

iiiiiiiiiiiiiiiiiiii

In both cases we are seeing 20 characters, but clearly one string is gonna take up a whole lot more space than the other.

Using the character boundary, even the linguistic character boundary, is not the most sensible way to tackle the problem.

A good direction to turn in these rendering scenarios is toward rendering solutions, since they have two things over everyone else -- not only do they have the ACTUAL width of the text to work with, but they also pack some of the linguistic data about character boundaries right in them!

The simplest solution might be to go the DrawText direction for rendering -- with the special flag

DT_MODIFYSTRING

combined with one of the magical truncation-friendly flags:

DT_END_ELLIPSIS -- For displayed text, if the end of a string does not fit in the rectangle, it is truncated and ellipses are added. If a word that is not at the end of the string goes beyond the limits of the rectangle, it is truncated without ellipses.
DT_PATH_ELLIPSIS -- For displayed text, replaces characters in the middle of the string with ellipses so that the result fits in the specified rectangle. If the string contains backslash (\) characters, DT_PATH_ELLIPSIS preserves as much as possible of the text after the last backslash.
DT_WORD_ELLIPSIS -- Truncates any word that does not fit in the rectangle and adds ellipses.

What the DrawText documentation does not mention in there is that the handling for most of the problems we talked about before -- from surrogate pairs to combining characters to what happens when complex scripts are rendered and so on -- is all built in there. So most of the work will be done for us!
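A rough sketch of that call, under a few assumptions: the names draw_truncated and modifystring_buffer_chars are made up here, the text is drawn on a single line, and the caller wants the displayed string back. The DrawText documentation does say that DT_MODIFYSTRING can add up to four characters to the string, so the buffer has to leave room for them. The Windows-only part sits behind _WIN32 so the sizing helper still compiles elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: buffer size, in characters, for a string of text_len
 * characters passed to DrawText with DT_MODIFYSTRING. The function may add
 * up to four characters, and the terminator needs one more.
 */
size_t modifystring_buffer_chars(size_t text_len)
{
    assert(text_len <= SIZE_MAX - 5);
    return text_len + 4 + 1;
}

#ifdef _WIN32
#include <windows.h>
#include <wchar.h>

/*
 * Hypothetical wrapper: draws text into *rc on one line, truncating with an
 * end ellipsis if needed, and leaves in buf the string actually displayed.
 * Returns FALSE if buf is too small or DrawTextW reports failure.
 */
BOOL draw_truncated(HDC hdc, const wchar_t *text, const RECT *rc,
                    wchar_t *buf, size_t buf_chars)
{
    size_t len = wcslen(text);
    RECT r = *rc; /* DrawTextW takes a non-const rectangle */

    if (buf_chars < modifystring_buffer_chars(len))
        return FALSE;
    wmemcpy(buf, text, len + 1); /* copy including the terminator */

    /* DT_NOPREFIX keeps '&' from being treated as an accelerator prefix. */
    return DrawTextW(hdc, buf, -1, &r,
                     DT_SINGLELINE | DT_NOPREFIX |
                     DT_END_ELLIPSIS | DT_MODIFYSTRING) != 0;
}
#endif
```

Swapping DT_END_ELLIPSIS for DT_PATH_ELLIPSIS would be the obvious variation for file paths.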

Finally, a situation where the native code has more to say for itself than the managed code, since there isn't really a DrawText.NET analogue.

Though perhaps you are already imagining how one would handle the specific scenario you need based on the three flags if you had to write it yourself -- and whether you are daunted by the complexity or delighted by it, the problem, along with the complexities of combining characters and surrogate pairs, is at least a tractable one....
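For the curious, here is one way a hand-rolled end ellipsis could be sketched in portable C. Everything in it is an assumption made for illustration: the measure_fn callback stands in for whatever the real renderer uses to measure text, demo_width uses made-up per-unit widths, and the only boundary rule is "do not split a surrogate pair". Combining characters and complex scripts, which a real renderer deals with, are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback: rendered width of len UTF-16 code units. */
typedef int (*measure_fn)(const uint16_t *s, size_t len, void *ctx);

/* Stand-in measurer with made-up widths: 'W' is wide, 'i' and '.' are thin. */
int demo_width(const uint16_t *s, size_t len, void *ctx)
{
    int total = 0;
    (void)ctx;
    for (size_t i = 0; i < len; i++)
        total += (s[i] == 'W') ? 11 : (s[i] == 'i' || s[i] == '.') ? 3 : 7;
    return total;
}

/* True if cutting s just before index i would split a surrogate pair. */
static int splits_pair(const uint16_t *s, size_t len, size_t i)
{
    return i > 0 && i < len &&
           s[i - 1] >= 0xD800 && s[i - 1] <= 0xDBFF &&
           s[i] >= 0xDC00 && s[i] <= 0xDFFF;
}

/*
 * Writes into out either all of s, or the longest prefix of s that fits in
 * max_width once "..." is appended. The whole candidate is measured each
 * time rather than summing widths, since a real measurer may shape or kern.
 * out must hold at least len + 3 units. Returns the length written, not
 * counting the terminator.
 */
size_t fit_with_ellipsis(const uint16_t *s, size_t len, int max_width,
                         measure_fn measure, void *ctx,
                         uint16_t *out, size_t out_cap)
{
    assert(s != NULL && out != NULL && out_cap >= len + 3);

    if (measure(s, len, ctx) <= max_width) {
        memcpy(out, s, len * sizeof *s);
        out[len] = 0;
        return len;
    }
    /* Try shorter and shorter prefixes: len - 1 units down to 0. */
    for (size_t keep = len; keep-- > 0; ) {
        if (splits_pair(s, len, keep))
            continue; /* never cut a pair in half */
        memcpy(out, s, keep * sizeof *s);
        out[keep] = out[keep + 1] = out[keep + 2] = '.';
        /* With nothing left to drop, the bare ellipsis is returned as is. */
        if (keep == 0 || measure(out, keep + 3, ctx) <= max_width) {
            out[keep + 3] = 0;
            return keep + 3;
        }
    }
    out[0] = 0; /* only reached when len == 0 and nothing fits */
    return 0;
}
```

With those made-up widths and a limit of 60, twenty i's fit untouched while twenty W's come back as WWWW followed by three periods, which is the whole point about width versus character count.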

This blog brought to you by ܢ (U+0722, aka SYRIAC LETTER NUN)
