by Michael S. Kaplan, published on 2006/06/26 22:40 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/06/26/648040.aspx
So the question that Matt asked was something like this:
Why is it that calling GetUnicodeCategory on character '00ad' returns a category of DashPunctuation? I expected the Unicode 4.1 category of Format. Here is the code I used:
    using System;
    using System.Text;
    namespace ConsoleApplication2 {
        class Program {
            static void Main(string[] args) {
                // U+00AD is SOFT HYPHEN
                TestChar(Convert.ToChar(Convert.ToInt32("00ad", 16)));
            }
            // Print the category and the punctuation flag reported by System.Char
            public static void TestChar(char testing) {
                Console.WriteLine("Categories for char 0x" + Convert.ToInt32(testing).ToString("x4"));
                Console.WriteLine("\tUCD Category : " + Char.GetUnicodeCategory(testing));
                Console.WriteLine("\tNLS+ Category : IsPunctuation=" + Char.IsPunctuation(testing));
            }
        }
    }
Do you know what is going on here?
Indeed, the problem is sort of the root of the title of this post. The simple fact is that not all GetUnicodeCategory() methods are created equal!
There is char.GetUnicodeCategory that has been around all along, and starting with version 2.0 of the .NET Framework there is CharUnicodeInfo.GetUnicodeCategory, which I have talked about previously (though in fairness not with 100% accuracy!).
They differ for a backward-compatibility reason: some existing programs simply assume a certain behavior. Let's see how different they are, with code like this:
    using System;
    using System.Text;
    using System.Globalization;
    namespace ConsoleApplication2 {
        class Program {
            static void Main(string[] args) {
                // Use an int counter so the loop can include U+FFFF and still terminate
                // (a ushort counter with "<=" would wrap around and never exit).
                for (int ich = 0x0000; ich <= 0xffff; ich++) {
                    UnicodeCategory ucC = char.GetUnicodeCategory((char)ich);
                    UnicodeCategory ucCui = CharUnicodeInfo.GetUnicodeCategory((char)ich);
                    // Report only the code units on which the two methods disagree
                    if (ucC != ucCui) {
                        Console.WriteLine("{0}\t{1}\t{2}", ich.ToString("x4"), ucC, ucCui);
                    }
                }
            }
        }
    }
If you run it, it will, amazingly enough, print just the one character on which the two methods disagree:
00ad DashPunctuation Format
Going forward, the plan is that CharUnicodeInfo.GetUnicodeCategory will be updated whenever Unicode is, while char.GetUnicodeCategory will usually be updated as well, though occasionally some sort of application dependency may force it to stay unchanged for specific characters.
Kind of a whole new way to solve the consistency vs. correctness argument -- support both? :-)
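If what you want is the category that tracks Unicode, the practical upshot is to ask CharUnicodeInfo rather than char. Here is a minimal sketch of that idea; the CategoryDemo class and the IsFormatChar helper are hypothetical names made up for illustration, not anything in the Framework, and what the char version prints may well depend on which runtime version you run it on:

```csharp
using System;
using System.Globalization;

static class CategoryDemo {
    // Hypothetical helper (an assumption for illustration, not a Framework API):
    // classify through CharUnicodeInfo, the method described above as following
    // Unicode updates.
    public static bool IsFormatChar(char c) {
        return CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.Format;
    }

    static void Main() {
        char softHyphen = '\u00ad'; // SOFT HYPHEN

        // Show both answers side by side; the char result can vary by runtime version.
        Console.WriteLine("CharUnicodeInfo: " + CharUnicodeInfo.GetUnicodeCategory(softHyphen));
        Console.WriteLine("char:            " + char.GetUnicodeCategory(softHyphen));
        Console.WriteLine("IsFormatChar:    " + IsFormatChar(softHyphen));
    }
}
```

Code that has always called char.GetUnicodeCategory keeps whatever behavior it was relying on, and new code can opt in to the CharUnicodeInfo answer one call site at a time.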
This post brought to you by U+00ad, a.k.a. SOFT HYPHEN