Not all GetUnicodeCategory methods are created equal

by Michael S. Kaplan, published on 2006/06/26 22:40 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/06/26/648040.aspx


So the question that Matt asked was something like this:

Why is it that calling GetUnicodeCategory on character '00ad' returns a category of DashPunctuation? I expected the Unicode 4.1 category of Format. Here is the code I used:

using System;
using System.Text;

namespace ConsoleApplication2 {
    class Program {
        static void Main(string[] args) {
            TestChar(Convert.ToChar(Convert.ToInt32("00ad", 16)));
        }

        public static void TestChar(char testing) {
            Console.WriteLine("Categories for char 0x" + Convert.ToInt32(testing).ToString("x4"));
            Console.WriteLine("\tUCD  Category : " + Char.GetUnicodeCategory(testing));
            Console.WriteLine("\tNLS+ Category : IsPunctuation=" + Char.IsPunctuation(testing));
        }
    }
}

Do you know what is going on here?

Indeed, the problem is sort of the root of the title of this post. The simple fact is that not all GetUnicodeCategory() methods are created equal!

There is char.GetUnicodeCategory that has been around all along, and starting with version 2.0 of the .NET Framework there is CharUnicodeInfo.GetUnicodeCategory, which I have talked about previously (though in fairness not with 100% accuracy!).

They differ for some sort of backward-compatibility reason: existing programs that simply assume certain behavior. Let's see how different they are, with code like this:

using System;
using System.Text;
using System.Globalization;

namespace ConsoleApplication2 {
    class Program {
        static void Main(string[] args) {
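            // An int counter lets the loop include U+FFFF; a ushort counter with <= would wrap around and never end.
            for(int ich = 0x0000; ich <= 0xffff; ich++) {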
                UnicodeCategory ucC = char.GetUnicodeCategory((char)ich);
                UnicodeCategory ucCui = CharUnicodeInfo.GetUnicodeCategory((char)ich);
                if(ucC != ucCui) {
                    Console.WriteLine("{0}\t{1}\t{2}", ich.ToString("x4"), ucC, ucCui);
                }
            }
        }
    }
}

If you run it, it will, amazingly enough, print just the one character that differs:

00ad    DashPunctuation Format
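
One caveat: a loop over char values only covers U+0000 through U+FFFF. Here is a minimal sketch of the same comparison for supplementary characters, assuming (per the documentation, as I read it) that the (string, int) overloads of both methods return the category of the whole surrogate pair. SupplementaryScan and FindDifferences are names made up for illustration, and I make no claim here about what it prints on any given framework version:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not a framework type): compares the two methods over
// supplementary code points, U+10000 through U+10FFFF.
static class SupplementaryScan
{
    public static List<int> FindDifferences()
    {
        List<int> differing = new List<int>();
        for (int cp = 0x10000; cp <= 0x10FFFF; cp++)
        {
            // ConvertFromUtf32 produces the surrogate pair for this code point.
            string s = char.ConvertFromUtf32(cp);
            UnicodeCategory ucC = char.GetUnicodeCategory(s, 0);
            UnicodeCategory ucCui = CharUnicodeInfo.GetUnicodeCategory(s, 0);
            if (ucC != ucCui)
            {
                differing.Add(cp);
            }
        }
        return differing;
    }
}

class Program
{
    static void Main()
    {
        // Print each differing code point as six hex digits.
        foreach (int cp in SupplementaryScan.FindDifferences())
        {
            Console.WriteLine("{0:x6}", cp);
        }
    }
}
```
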

Going forward, the plan is that CharUnicodeInfo.GetUnicodeCategory will be updated whenever Unicode is, while char.GetUnicodeCategory will usually be updated as well, though occasionally some application dependency may force it not to change for specific characters.
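
If your code needs the category the Unicode Standard assigns, whatever System.Char happens to report, one option is to route every lookup through CharUnicodeInfo. A minimal sketch; StandardCategory, Of and IsFormat are hypothetical names for illustration, not framework members:

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not a framework type): always asks CharUnicodeInfo,
// so results follow the Unicode data the framework shipped with.
static class StandardCategory
{
    public static UnicodeCategory Of(char c)
    {
        return CharUnicodeInfo.GetUnicodeCategory(c);
    }

    // True for characters classified as Cf (Format), such as U+00AD SOFT HYPHEN.
    public static bool IsFormat(char c)
    {
        return Of(c) == UnicodeCategory.Format;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(StandardCategory.Of('\u00ad'));   // Format
        Console.WriteLine(StandardCategory.IsFormat('-'));  // False: U+002D is DashPunctuation
    }
}
```

Keeping the lookup in one place means callers never have to remember which of the two GetUnicodeCategory methods they are supposed to be using.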

Kind of a whole new way to solve the consistency vs. correctness argument -- support both? :-)

 

This post brought to you by U+00ad, a.k.a. SOFT HYPHEN


# Mihai on 27 Jun 2006 7:28 PM:

<<Although in the future, the plan is that CharUnicodeInfo.GetUnicodeCategory will be updated when Unicode is, while char.GetUnicodeCategory will usually also be updated though occasionally there may be some sort of application dependency that would force it to not change for specific characters.>>

Kind of a mess, if you ask me.
Maybe keeping the two in sync and adding CharUnicodeLatestInfo would be an idea? And (in time) deprecating CharUnicodeInfo?

# Michael S. Kaplan on 27 Jun 2006 11:44 PM:

Nah, CharUnicodeInfo was specifically added to always match what the Unicode Standard has in it when it shipped -- deprecating that would make no sense!

And notice that there is only one character that freaked everyone out to change in the System.Char method -- all of the other changes happened and everyone lived....

# Mihai on 28 Jun 2006 1:36 PM:

If this was the idea, then maybe clearly spelling this out in the MSDN doc would be enough.
Nothing tells me what the difference is, or that CharUnicodeInfo.GetUnicodeCategory contains the latest info, while char.GetUnicodeCategory gives "compatibility" info.

Reading the current doc it feels like char.GetUnicodeCategory is just syntactic sugar for CharUnicodeInfo.GetUnicodeCategory, maybe even implemented by using that :-)

I was trying to make the name self-documented, but a note in the doc might be enough (and does not break any code :-)

# Michael S. Kaplan on 28 Jun 2006 4:56 PM:

That's what this blog post and the other one were all about! :-)

