by Michael S. Kaplan, published on 2005/01/28 05:31 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/28/362305.aspx
CharUnicodeInfo is a new class that is being added to Whidbey. It has one very straightforward job -- pick up property information from the Unicode Character Database. But there is a lot of data there!
The name provides the proper balance between being appropriately descriptive and showing up near the System.Char struct in the Object Browser.
It is much more functional than the FoldString API's MAP_FOLDDIGITS (discussed a little bit yesterday), which simply maps digits from various scripts to 0 - 9. And it carries much more information than System.Char struct methods like Char.IsWhiteSpace and Char.IsPunctuation. Plus, it is entirely based on Unicode character properties and has none of the backwards compatibility issues of the methods on the Char struct (e.g. having to consider some characters as white space because other programs used to do so in their parsing). Pure Unicode all the way, baby!
Now you can hardly call a class a secret when even simple searches in Google and MSN find over 100 pages about it. But I'll try to give a rundown of some of its basic functionality....
Here is a list of some of the methods this new class contains:
GetDecimalDigitValue -- as the name implies, returns the actual value this character has as a decimal digit (or -1 if it is not a decimal digit at all). This is ever so much more useful than Char.IsDigit, which only returns a simple yes/no answer to the question! In official terms, it returns the value of Unicode's Numeric Type/Numeric Value fields whenever the Unicode category is Nd (Number, Decimal).
GetDigitValue -- For those cases where a character represents a digit without being a true decimal digit (a superscript digit such as ², for example), the GetDigitValue method can retrieve the value (or -1 if the character has no digit value).
GetNumericValue -- For the times that it is a number but may not even be a digit (such as a fraction like ½), this method returns the numeric value as a double (or -1 if the character is not numeric).
GetBidiCategory -- There are many possible categories that describe the behavior of a character in bidirectional contexts, and every character falls into one of them: LeftToRight (L), LeftToRightEmbedding (LRE), LeftToRightOverride (LRO), RightToLeft (R), RightToLeftArabic (AL), RightToLeftEmbedding (RLE), RightToLeftOverride (RLO), PopDirectionalFormat (PDF), EuropeanNumber (EN), EuropeanNumberSeparator (ES), EuropeanNumberTerminator (ET), ArabicNumber (AN), CommonNumberSeparator (CS), NonSpacingMark (NSM), BoundaryNeutral (BN), ParagraphSeparator (B), SegmentSeparator (S), Whitespace (WS), and OtherNeutrals (ON) -- all members of the BidiCategory enumeration.
GetUnicodeCategory -- Arguably the most elemental property, a character's General Category (one per character) really defines what a character is. Possible values are UppercaseLetter (Lu), LowercaseLetter (Ll), TitlecaseLetter (Lt), ModifierLetter (Lm), OtherLetter (Lo), NonSpacingMark (Mn), SpacingCombiningMark (Mc), EnclosingMark (Me), DecimalDigitNumber (Nd), LetterNumber (Nl), OtherNumber (No), SpaceSeparator (Zs), LineSeparator (Zl), ParagraphSeparator (Zp), Control (Cc), Format (Cf), Surrogate (Cs), PrivateUse (Co), ConnectorPunctuation (Pc), DashPunctuation (Pd), OpenPunctuation (Ps), ClosePunctuation (Pe), InitialQuotePunctuation (Pi), FinalQuotePunctuation (Pf), OtherPunctuation (Po), MathSymbol (Sm), CurrencySymbol (Sc), ModifierSymbol (Sk), OtherSymbol (So), and OtherNotAssigned (Cn). And every one of them is a member of the UnicodeCategory enumeration.
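To make the difference between the three numeric lookups concrete, here is a minimal sketch that runs a handful of characters through the public methods listed above and prints the results next to the General Category. The class name UnicodeValueDemo and the Describe helper are made up for illustration, and the sample characters are just ones I picked; GetBidiCategory is left out of the sketch.

```csharp
using System;
using System.Globalization;

static class UnicodeValueDemo
{
    // Hypothetical helper: gathers the three numeric lookups and the
    // General Category for one UTF-16 code unit into a single line.
    public static string Describe(char c)
    {
        int decimalDigit = CharUnicodeInfo.GetDecimalDigitValue(c); // -1 unless a decimal digit
        int digit = CharUnicodeInfo.GetDigitValue(c);               // -1 if no digit value
        double numeric = CharUnicodeInfo.GetNumericValue(c);        // -1 if not numeric
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);

        return string.Format(CultureInfo.InvariantCulture,
            "U+{0:X4} decimal={1} digit={2} numeric={3} category={4}",
            (int)c, decimalDigit, digit, numeric, category);
    }

    static void Main()
    {
        Console.WriteLine(Describe('7'));      // ASCII digit
        Console.WriteLine(Describe('\u0663')); // ARABIC-INDIC DIGIT THREE
        Console.WriteLine(Describe('\u00B2')); // SUPERSCRIPT TWO
        Console.WriteLine(Describe('\u00BD')); // VULGAR FRACTION ONE HALF
        Console.WriteLine(Describe('A'));      // a letter, not a number
    }
}
```

The superscript two is the interesting row: it has a digit value and a numeric value but no decimal digit value, while the fraction has only a numeric value.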
Note that every one of these methods has two overloads -- one that accepts a single System.Char, and another that takes a System.String and an index value. The latter is for dealing with supplementary characters, which are made up of a high and a low surrogate (also known as a surrogate pair).
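A short sketch of why the String-plus-index form matters, assuming a supplementary digit as the test case (U+1D7D3 MATHEMATICAL BOLD DIGIT FIVE is my choice of example, and the class name SurrogatePairDemo is made up). The Char form only ever sees one half of the pair, while the String form can look at both halves together.

```csharp
using System;
using System.Globalization;

static class SurrogatePairDemo
{
    static void Main()
    {
        // U+1D7D3 lies outside the Basic Multilingual Plane, so it takes
        // two UTF-16 code units (a surrogate pair) in a .NET string.
        string s = char.ConvertFromUtf32(0x1D7D3);
        Console.WriteLine(s.Length); // 2

        // The Char overload sees only the high surrogate on its own.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(s[0]));   // Surrogate
        Console.WriteLine(CharUnicodeInfo.GetDecimalDigitValue(s[0])); // -1

        // The String + index overload reads the whole pair starting at index 0.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(s, 0));   // DecimalDigitNumber
        Console.WriteLine(CharUnicodeInfo.GetDecimalDigitValue(s, 0)); // 5
    }
}
```

When walking a string this way, the index has to skip over the low surrogate after each pair; Char.IsHighSurrogate is one way to decide whether to advance by one or by two.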
Who knows what the future may bring to this class? The possibilities are endless, as the data that sits behind Unicode allows sophisticated text processing engines to use these properties in exciting ways. All written using the .NET Framework. Speaking as someone charged with writing tools such as MSKLC in the .NET Framework, I plan to try and be one of CharUnicodeInfo's best and most appreciative customers in the months and years to come. :-)
This post brought to you by the many Unicode Character Categories....