Sometimes you need more than StringInfo

by Michael S. Kaplan, published on 2007/05/09 03:11 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/05/09/2497106.aspx


Sometimes people expect StringInfo to do more than it is designed to do.

Like just the other day when Federica asked:

Hi

We are trying to parse a set of strings and read each joined char as 1 text element
Using the System.Globalization.StringInfo::ParseCombiningCharacters it returns the indexes of each base character / high-surrogate
But we have examples where a high surrogate is in the middle of a joined char and this method fails.
Is there a way to handle correctly these strings?

Here are 2 examples in Assamese:

ম্পা   [4 joined chars] but ParseCombiningCharacters reads 2 text elements
text element ম্ starts at index 0
text element পা starts at index 2

ত্ত  [3 joined chars] but ParseCombiningCharacters reads 2 text elements
text element ত্ starts at index 0
text element ত starts at index 2

Thanks
Federica

Now I am the last person in the world to knock StringInfo, given all the times before that I have talked about it. :-)

But it is important to keep in mind that StringInfo bases all of its rules on the Unicode characteristics of the string, not on linguistic ones such as Federica is hoping for here.
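To make that concrete, here is a rough Python sketch (not the actual StringInfo code) of a segmentation driven purely by Unicode character properties. It assumes a simplified rule: a new text element starts at every character that is not a combining mark (general category Mn, Mc, or Me). The function name text_element_starts is made up for illustration. Python indexes by code point rather than UTF-16 code unit, but every character in Federica's examples is in the BMP, so the indexes line up with the ones she reported.

```python
import unicodedata

# General categories Unicode assigns to combining marks.
COMBINING_CATEGORIES = ("Mn", "Mc", "Me")

def text_element_starts(s):
    """Return the index of every character that starts a new text element.

    Simplified assumption: a character starts a new element unless it is a
    combining mark, in which case it attaches to the preceding base character.
    """
    starts = []
    for i, ch in enumerate(s):
        if i == 0 or unicodedata.category(ch) not in COMBINING_CATEGORIES:
            starts.append(i)
    return starts

# MA + VIRAMA + PA + VOWEL SIGN AA: the virama and the vowel sign are marks.
print(text_element_starts("\u09AE\u09CD\u09AA\u09BE"))  # [0, 2]

# TA + VIRAMA + TA: the second TA is a base letter, so it starts a new element.
print(text_element_starts("\u09A4\u09CD\u09A4"))  # [0, 2]
```

Nothing in the character properties says that a virama glues the next consonant onto the previous one, which is why the split lands at index 2 in both strings.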

In this case, one can use Uniscribe (specifically the ScriptItemize and ScriptBreak functions) to get what you are looking for here, which is to see each of these strings as being made up of larger clusters, just like Notepad does, for example....
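This is not a Uniscribe sample, but the kind of larger cluster being asked for can be roughly approximated in Python. The sketch below assumes one extra, deliberately naive rule on top of combining marks: a character that immediately follows BENGALI SIGN VIRAMA (U+09CD) stays in the current cluster. The name cluster_starts and that rule are illustrative assumptions, not what ScriptBreak actually does, and the rule will over-join in cases a real shaping engine handles properly (an explicit virama followed by ZWNJ, for instance).

```python
import unicodedata

VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA
COMBINING_CATEGORIES = ("Mn", "Mc", "Me")

def cluster_starts(s):
    """Return the index of every character that starts a new cluster.

    Naive assumption: a character continues the current cluster if it is a
    combining mark or if the character right before it is a virama.
    """
    starts = []
    for i, ch in enumerate(s):
        is_mark = unicodedata.category(ch) in COMBINING_CATEGORIES
        follows_virama = i > 0 and s[i - 1] == VIRAMA
        if i == 0 or not (is_mark or follows_virama):
            starts.append(i)
    return starts

# MA + VIRAMA + PA + VOWEL SIGN AA now comes back as a single cluster.
print(cluster_starts("\u09AE\u09CD\u09AA\u09BE"))  # [0]

# TA + VIRAMA + TA is a single cluster as well.
print(cluster_starts("\u09A4\u09CD\u09A4"))  # [0]
```

Two TA letters with no virama between them still come back as two clusters, so it is the virama rule alone that does the joining here.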

Maybe this one needs a sample too, what do you think? :-)

 

This post brought to you by ত (U+09a4, a.k.a. BENGALI LETTER TA)



