A little bit about the new CharUnicodeInfo class

by Michael S. Kaplan, published on 2005/01/28 05:31 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/28/362305.aspx


CharUnicodeInfo is a new class that is being added to Whidbey. It has one very straightforward job -- pick up property information from the Unicode Character Database. But there is a lot of data there!

The name provides the proper balance between being appropriately descriptive and showing up near the System.Char struct in the Object Browser.

It is much more functional than the FoldString API's MAP_FOLDDIGITS flag (discussed a little bit yesterday), which simply maps digits from various scripts to 0 - 9. And it carries much more information than System.Char struct methods like Char.IsWhiteSpace and Char.IsPunctuation. Plus, it is entirely based on Unicode character properties and has none of the backwards compatibility issues of the methods off of the Char struct (e.g. having to consider some characters as white space because other programs used to do so in their parsing). Pure Unicode all the way, baby!

Now you can hardly call a class a secret when even simple searches in Google and MSN find over 100 pages about it. But I'll try to give a rundown of some of its basic functionality....

Here is a list of some of the methods this new class contains:

GetDecimalDigitValue -- as the name implies, returns the actual value this character has as a decimal digit (or -1 if it is not a decimal digit at all). This is ever so much more useful than Char.IsDigit, which only returns a simple yes/no answer to the question! In official terms, it returns the value of Unicode's Numeric Type/Numeric Value fields whenever the Unicode category is Nd (Number, Decimal).

GetDigitValue -- For all those cases where a character is in fact a digit even though it is not a plain decimal digit (a superscript or circled digit, for example), the GetDigitValue method can retrieve the value (or -1 if the character has no digit value).

GetNumericValue -- For the times that a character is a number but may not even be a digit (such as the fraction characters), this method returns the numeric value as a floating-point number.
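To make the difference between the three numeric properties concrete, here is a minimal sketch. It does not use CharUnicodeInfo at all: it assumes Python's standard unicodedata module as a stand-in, since it reads the same Unicode Character Database fields. The helper name numeric_properties is made up for illustration, and -1 is passed as the "no value" default to mirror the convention described above.

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for one character.

    -1 is used as the "no such value" default, mirroring the convention
    described for the CharUnicodeInfo methods. This helper is hypothetical
    and only illustrates the three Unicode Character Database fields.
    """
    return (
        unicodedata.decimal(ch, -1),  # set only for decimal digits (category Nd)
        unicodedata.digit(ch, -1),    # also set for superscript, circled digits, etc.
        unicodedata.numeric(ch, -1),  # also set for fractions and other numbers
    )


# ASCII five, ARABIC-INDIC DIGIT FIVE, SUPERSCRIPT TWO,
# CIRCLED DIGIT ONE, VULGAR FRACTION ONE HALF, and a plain letter.
for ch in "5\u0665\u00b2\u2460\u00bdA":
    print(f"U+{ord(ch):04X}", numeric_properties(ch))
```

Each step down the list widens the net: every decimal digit also has a digit value and a numeric value, but the fraction has only a numeric value, and the letter has none of the three.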

GetBidiCategory -- There are many possible categories that describe the behavior of a character in bidirectional contexts, and every character falls into one of them: LeftToRight (L), LeftToRightEmbedding (LRE), LeftToRightOverride (LRO), RightToLeft (R), RightToLeftArabic (AL), RightToLeftEmbedding (RLE), RightToLeftOverride (RLO), PopDirectionalFormat (PDF), EuropeanNumber (EN), EuropeanNumberSeparator (ES), EuropeanNumberTerminator (ET), ArabicNumber (AN), CommonNumberSeparator (CS), NonSpacingMark (NSM), BoundaryNeutral (BN), ParagraphSeparator (B), SegmentSeparator (S), Whitespace (WS), and OtherNeutrals (ON) -- all members of the BidiCategory enumeration.

GetUnicodeCategory -- Arguably the most elemental property, a character's General Category (one per character) really defines what a character is. Possible values are UppercaseLetter (Lu), LowercaseLetter (Ll), TitlecaseLetter (Lt), ModifierLetter (Lm), OtherLetter (Lo), NonSpacingMark (Mn), SpacingCombiningMark (Mc), EnclosingMark (Me), DecimalDigitNumber (Nd), LetterNumber (Nl), OtherNumber (No), SpaceSeparator (Zs), LineSeparator (Zl), ParagraphSeparator (Zp), Control (Cc), Format (Cf), Surrogate (Cs), PrivateUse (Co), ConnectorPunctuation (Pc), DashPunctuation (Pd), OpenPunctuation (Ps), ClosePunctuation (Pe), InitialQuotePunctuation (Pi), FinalQuotePunctuation (Pf), OtherPunctuation (Po), MathSymbol (Sm), CurrencySymbol (Sc), ModifierSymbol (Sk), OtherSymbol (So), and OtherNotAssigned (Cn). And every one of them is a member of the UnicodeCategory enumeration.
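The two-letter codes in parentheses above are the abbreviations used in the Unicode Character Database itself. As a rough illustration, and again assuming Python's standard unicodedata module rather than CharUnicodeInfo, the sketch below looks up both properties for a handful of characters. The helper name describe is hypothetical; the module returns the short codes, not the long enumeration member names listed above.

```python
import unicodedata


def describe(ch):
    """Return (general category, bidi category) as short Unicode codes.

    Hypothetical helper for illustration only; the codes correspond to the
    abbreviations in parentheses in the two lists above.
    """
    return unicodedata.category(ch), unicodedata.bidirectional(ch)


# Latin capital A, Hebrew alef, Arabic alef, ASCII five,
# ARABIC-INDIC DIGIT FIVE, space, and a combining acute accent.
for ch in "A\u05d0\u06275\u0665 \u0301":
    category, bidi = describe(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {category}, {bidi}")
```

Note how the two properties are independent: both alefs are OtherLetter (Lo), yet one is RightToLeft (R) and the other RightToLeftArabic (AL), and the two fives share DecimalDigitNumber (Nd) while differing between EuropeanNumber (EN) and ArabicNumber (AN).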

Note that every one of these methods has two overloads -- one that accepts a single System.Char, and another that takes a System.String and an index value. The latter is for dealing with supplementary characters, which are represented by a high and a low surrogate (together known as a surrogate pair).
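Why the string-and-index form matters can be sketched as follows. This is an illustration of the surrogate arithmetic only, not the .NET implementation: it assumes a list of UTF-16 code units as a stand-in for a System.String, and the helper name code_point_at is hypothetical. The properties are then looked up with Python's standard unicodedata module.

```python
import unicodedata


def code_point_at(units, index):
    """Decode the code point that starts at units[index].

    `units` is a list of UTF-16 code units (a stand-in for a .NET string).
    A high surrogate followed by a low surrogate is combined into one
    supplementary code point; anything else is returned unchanged.
    """
    first = units[index]
    if 0xD800 <= first <= 0xDBFF and index + 1 < len(units):
        second = units[index + 1]
        if 0xDC00 <= second <= 0xDFFF:
            return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)
    return first


# "A" followed by U+1D7CE MATHEMATICAL BOLD DIGIT ZERO, which takes
# two code units in UTF-16: the surrogate pair D835 DFCE.
units = [0x0041, 0xD835, 0xDFCE]

# Looked at alone, the high surrogate is just a Surrogate (Cs) ...
print(unicodedata.category(chr(units[1])))

# ... but decoded together with its partner it is a decimal digit.
cp = code_point_at(units, 1)
print(f"U+{cp:04X}", unicodedata.category(chr(cp)), unicodedata.decimal(chr(cp), -1))
```

A single 16-bit character can never carry enough information to identify a supplementary character, which is why a lookup that sees only one code unit can report nothing more useful than Surrogate.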

Who knows what the future may bring to this class? The possibilities are endless, as the data that sits behind Unicode allows sophisticated text processing engines to use these properties in exciting ways. All written using the .NET Framework. Speaking as someone charged with writing tools such as MSKLC in the .NET Framework, I plan to try and be one of CharUnicodeInfo's best and most appreciative customers in the months and years to come. :-)

 

This post brought to you by the many Unicode Character Categories....


# Jochen Kalmbach on 28 Jan 2005 3:40 AM:

You should add the links to the class-documentation:
CharUnicodeInfo Class:
http://msdn2.microsoft.com/library/k43c6164.aspx
CharUnicodeInfo Members:
http://msdn2.microsoft.com/library/wtth5wtz.aspx

By the way: is there some equivalent to FoldString, especially "MAP_PRECOMPOSED" and "MAP_COMPOSITE"? Neither StringInfo nor TextInfo provides such a function, does it?

# Michael Kaplan on 28 Jan 2005 4:07 AM:

Heh -- you have the links in the very first comment. :-)

The .NET Framework has something even better than FoldString here -- I'll post on it tomorrow....

(I will not approve any spoiler comments until after I post it -- No one spoil the surprise!)

# Jonathan Wilson on 28 Jan 2005 4:26 AM:

Where does windows keep the actual database with all this information in it?
Is there a way to get at the database from straight C/C++ (i.e. no .NET stuff)?

# Michael Kaplan on 28 Jan 2005 4:38 AM:

Jonathan -- Windows does not keep the info -- the .NET Framework does (it keeps its own data). There is no way to get all of the info from unmanaged code, though a lot of the bidi and general category stuff is captured in the GetStringTypeW/GetStringTypeEx APIs.

# Michael Kaplan on 28 Jan 2005 4:40 AM:

Jochen, one more thing -- I talk about the Bidi class stuff, which the docs do not (yet).

And I have those cool links to the fileformat.info site (there were no similar links for bidi class -- the site has the info per character but they do not have pages that link to all of the characters). :-)

# Uwe Keim on 28 Jan 2005 5:00 AM:

Michael, I tried, but I failed:

http://lab.msdn.microsoft.com/productfeedback/viewfeedback.aspx?feedbackid=5c156734-8fef-4e31-a30f-789185aa4900

Uwe :-)

# Michael Kaplan on 28 Jan 2005 5:57 AM:

Ah, I have tried in the past, as well. But *at least* that difficult method does indeed work....

# Jochen Kalmbach on 28 Jan 2005 2:10 PM:

Michael! This Bidi-Stuff is really new! At least it is not available in Beta1! Hopefully it will be available in Beta2... :->

# Michael Kaplan on 28 Jan 2005 2:50 PM:

Indeed, the Bidi stuff is post-Beta1, and will be available in Beta2 of Whidbey.

referenced by

2011/11/21 One disadvantage to being supplementary...or Japanese?

2008/12/08 Lt is TC (and TC is Title Case, or Total Crap -- take your pick!)

2007/01/06 Mixing it up with bidirectional text

2006/07/22 Behind the return of the Unicode IME

2006/06/26 Not all GetUnicodeCategory methods are created equal

2005/09/09 Update on the CharUnicodeInfo class

2005/03/12 Stability of the Unicode Character Database

2005/02/27 Some suggested updates to the Win32-->.NET mapping for NLS functions....

2005/02/19 Stripping diacritics....

2005/01/31 FoldString.NET? No, but Whidbey has Normalization (which is kinda more cooler)
