Where did the new StringInfo stuff come from?

by Michael S. Kaplan, published on 2005/04/29 02:36 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/04/29/413366.aspx


I used it in a very confusing and obfuscated way in Normalization as obfuscation in C#. And then yesterday I used it again in my internationally savvy palindrome checker, in a slightly more intuitive manner.

It is the all new StringInfo class in Whidbey.

Now the old StringInfo class had only static methods -- in other words it was a walking FxCop violation.

And the main method it had was StringInfo.ParseCombiningCharacters, a static method that takes a string and returns an array of int values, each one an index into that string showing where a new text element starts. A text element could be a single letter, a letter and a diacritic, a letter and a bunch of diacritics, a high and low surrogate forming a surrogate pair, etc.
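To make the index array concrete, here is a minimal C# sketch. The sample string is my own choice for illustration, not something from the original post: a plain letter, a letter followed by a combining accent, and a supplementary character stored as a surrogate pair.

```csharp
using System;
using System.Globalization;

public static class ParseDemo
{
    public static void Main()
    {
        // "a", then "e" + COMBINING ACUTE ACCENT (U+0301), then U+1D11E
        // (MUSICAL SYMBOL G CLEF), which UTF-16 stores as a surrogate pair.
        string sample = "ae\u0301\U0001D11E";

        // One entry per text element: the index at which that element starts.
        int[] starts = StringInfo.ParseCombiningCharacters(sample);

        Console.WriteLine(sample.Length);             // 5 (UTF-16 code units)
        Console.WriteLine(string.Join(",", starts));  // 0,1,3
    }
}
```

The array has three entries for five code units, so the caller still has to do the arithmetic to turn neighboring indexes into actual pieces of the string.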

ParseCombiningCharacters is an incredibly useful method, but it is not very intuitive to use, and certainly not to use effectively. The same goes for the other methods for dealing with text elements (GetTextElementEnumerator and GetNextTextElement) -- people were just getting confused.
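For comparison, a short sketch of the other two static methods in C#, using the same illustrative sample string (again my own assumption, not taken from the post):

```csharp
using System;
using System.Globalization;

public static class EnumeratorDemo
{
    public static void Main()
    {
        // "a", "e" + combining acute accent, and a surrogate pair (U+1D11E).
        string sample = "ae\u0301\U0001D11E";

        // Walk the string one text element at a time.
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(sample);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            Console.WriteLine("index {0}: {1} code unit(s)",
                elements.ElementIndex, element.Length);
        }

        // GetNextTextElement returns the one text element that starts at the given index.
        string second = StringInfo.GetNextTextElement(sample, 1);
        Console.WriteLine(second.Length); // 2: "e" plus the combining accent
    }
}
```

Both work, but neither reads like the Substring and Length calls that most people already know.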

But people have no problem understanding the need to be able to count entities based on what a typical user might think a character is. Once one explains what a text element is, they immediately understand the need for ways to make use of them.

So we had some meetings to talk about how to make the ways to work with text elements more intuitive, at least as intuitive as the concept of a text element itself. In the last of those meetings, someone pointed out that people usually had no problem understanding the semantics of the Substring method or the Length property of System.String. Maybe we could learn a lesson from that?

And voilà, the SubstringByTextElements method and the LengthInTextElements property were born!

Each behaves just like its cousin, the Substring method or the Length property, but rather than being based on UTF-16 code units, they are based on text elements, or what the user might reasonably point to and call a character. The same thing that the Win32 CharNext and CharPrev functions do (at least, when we have not accidentally broken them!).

Now the method and property are useless if there is not some object that they can hang off of which has the string. People were leery about adding them directly to System.String since they really want to try to keep that object as lightweight as they can (and some would even say they are not trying hard enough on that). That's when somebody remembered this class you could instantiate yet had no instance methods, this FxCop violation with a hat. And we added a constructor that takes a string and a StringInfo.String property to retrieve the string later if you wanted, or to change it without having to tear down the object.
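Putting the pieces together, here is a small C# sketch of the instance members described above. The sample string is an assumption chosen for illustration; the members themselves (the constructor, LengthInTextElements, SubstringByTextElements and the String property) are the ones named in this post.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    public static void Main()
    {
        // "a", "e" + combining acute accent, and a surrogate pair (U+1D11E).
        string sample = "ae\u0301\U0001D11E";
        StringInfo info = new StringInfo(sample);

        Console.WriteLine(sample.Length);              // 5 (UTF-16 code units)
        Console.WriteLine(info.LengthInTextElements);  // 3 (text elements)

        // One text element starting at text element 1: the accented letter, intact.
        string middle = info.SubstringByTextElements(1, 1);
        Console.WriteLine(middle == "e\u0301");            // True

        // The code-unit-based cousin cuts the accent off.
        Console.WriteLine(sample.Substring(1, 1) == "e");  // True

        // Reuse the same object for a different string.
        info.String = "abc";
        Console.WriteLine(info.LengthInTextElements);  // 3
    }
}
```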

Now we were rolling....

Internally, it just uses that incredibly useful but not-so-intuitive StringInfo.ParseCombiningCharacters and stores that System.Int32 array. That makes StringInfo.LengthInTextElements a simple call to Length on the array, and StringInfo.SubstringByTextElements is a simple tip-toe through the array, using the very start and length parameters that the method takes in order to know where and how far to go. So we get to be intuitive and pretty fast at the same time. And we get to get rid of that FxCop issue, to boot. Everybody wins!
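The approach described above can be sketched roughly as follows. This is not the actual StringInfo source: TextElementView is a hypothetical class name, and the argument checks are my own assumptions about reasonable behavior rather than a copy of what the framework does.

```csharp
using System;
using System.Globalization;

// Hypothetical wrapper illustrating the described technique; not the real StringInfo.
public sealed class TextElementView
{
    private readonly string text;
    private readonly int[] starts; // start index of each text element

    public TextElementView(string value)
    {
        if (value == null) throw new ArgumentNullException("value");
        text = value;
        starts = StringInfo.ParseCombiningCharacters(value);
    }

    // The count of text elements is just the length of the index array.
    public int LengthInTextElements
    {
        get { return starts.Length; }
    }

    public string SubstringByTextElements(int startingTextElement, int lengthInTextElements)
    {
        // Assumed validation: reject anything outside the index array.
        if (startingTextElement < 0 || lengthInTextElements < 0 ||
            startingTextElement > starts.Length ||
            lengthInTextElements > starts.Length - startingTextElement)
        {
            throw new ArgumentOutOfRangeException();
        }
        if (lengthInTextElements == 0) return string.Empty;

        // Map text element positions back to code unit positions.
        int first = starts[startingTextElement];
        int endElement = startingTextElement + lengthInTextElements;
        int end = endElement < starts.Length ? starts[endElement] : text.Length;
        return text.Substring(first, end - first);
    }
}

public static class TextElementViewDemo
{
    public static void Main()
    {
        TextElementView view = new TextElementView("ae\u0301\U0001D11E");
        Console.WriteLine(view.LengthInTextElements);                          // 3
        Console.WriteLine(view.SubstringByTextElements(1, 2).Length);          // 4
        Console.WriteLine(view.SubstringByTextElements(1, 1) == "e\u0301");    // True
    }
}
```

The parse happens once in the constructor, so the later calls only need array lookups plus one ordinary Substring call.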

 

This post brought to you by "¾" (U+00be, a.k.a. VULGAR FRACTION THREE QUARTERS)


# Maurits on Friday, April 29, 2005 8:13 AM:

Much more intuitive, good job!
I don't have Whidbey Beta 2, so I'll have to reverse-engineer LengthInTextElements and SubstringByTextElements.

# Srikanth on Friday, April 29, 2005 9:23 AM:

One another great article!
Keep doing..

Thanks

# Michael S. Kaplan on Friday, April 29, 2005 10:07 AM:

Or you could just get Beta 2. :-)

# Wayne Steele on Friday, April 29, 2005 12:20 PM:

OK, smart guy, so what happens for code points that contain more than one text element?

# Michael S. Kaplan on Friday, April 29, 2005 1:24 PM:

Not sure I understand, Wayne -- a text element is made up of one or more code points. We never divide code points into multiple text elements.

# Maurits on Friday, April 29, 2005 3:56 PM:

I think Wayne's thinking of things like ﬂ (that's fl in one grapheme) - these are ONE TEXT ELEMENT.

See my recent update:

http://channel9.msdn.com/ShowPost.aspx?PostID=63297

# Maurits on Friday, April 29, 2005 5:43 PM:

Ah, found the codepoint for the fl ligature:
http://www.fileformat.info/info/unicode/char/fb02/index.htm

# Michael S. Kaplan on Friday, April 29, 2005 9:01 PM:

You can use Unicode Normalization to fold out that ligature into more than one unique text element. Normalization form KC does the trick nicely. :-)

Now I would not tend to agree with your assessment of sort elements as being letters, FWIW. They are not really thought of in that way.
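The normalization step mentioned in this comment can be sketched in C# as below. It assumes the ligature under discussion is U+FB02, the code point linked earlier in the thread.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class LigatureDemo
{
    public static void Main()
    {
        string ligature = "\uFB02"; // LATIN SMALL LIGATURE FL

        // As a single code point, the ligature is one text element.
        Console.WriteLine(new StringInfo(ligature).LengthInTextElements); // 1

        // Normalization form KC applies the compatibility mapping to "f" + "l".
        string folded = ligature.Normalize(NormalizationForm.FormKC);
        Console.WriteLine(folded == "fl");                                // True
        Console.WriteLine(new StringInfo(folded).LengthInTextElements);   // 2
    }
}
```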


