No Regex in the Unicode room! (and no sex in the champagne room, either!)

by Michael S. Kaplan, published on 2007/04/26 09:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/04/26/2286153.aspx


(apologies to Chris Rock for the title!)

Ted first sent me mail years ago. He was asking some questions about MSLU, and Julie (who knew Ted back from when he was working for Microsoft) sent him to me. If memory serves, he actually pointed out an interesting bug or two in the course of answering those questions that I ended up fixing.... :-)

Anyway, a few years later he came back to Microsoft and from time to time a question would come up about some random Unicode or internationalization thing and I'd often know the answer.

But when a question came up yesterday from his colleague Kevin, I did not know for sure what was going on.

The problem amounted to a Regex expression that should have returned the same results as char.IsLetter, but didn't. This code listed the characters with the problem:

using System;
using System.IO;
using System.Text;
using System.Globalization;
using System.Text.RegularExpressions;
namespace UnicodeCategory {
    class Program {
        static void Main(string[] args)
        {
            StringBuilder sb = new StringBuilder();
            int cnt = 0;
            char c = char.MinValue;
            do {
                const RegexOptions opt = RegexOptions.Compiled
                    | RegexOptions.CultureInvariant
                    | RegexOptions.IgnoreCase
                    | RegexOptions.ExplicitCapture;
                Regex regex = new Regex(@"^([\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}]+)$", opt);
                bool regexOK = regex.Match(c.ToString()).Success;
                bool functionOK = Char.IsLetter(c);
                if (regexOK != functionOK) {
                    cnt++;
                    sb.AppendLine(string.Format("regex: {0}\tfunction: {1}\tchar in hex: {2:x} - {3}",
                                                regexOK, functionOK, (int)c, CharUnicodeInfo.GetUnicodeCategory(c)));
                }
                if (c == char.MaxValue) {
                    break;
                }
                c++;
            } while (true);
            sb.AppendLine(string.Format("TOTAL mismatches: {0}", cnt));
            File.WriteAllText("result.txt", sb.ToString());
        }
    }
}

The code found a total of 213 characters that char.IsLetter detected but that the Regex expression, which was literally searching for the same Unicode categories, did not. The full list of characters this code returned was:

regex: False    function: True    char in hex: 130 - UppercaseLetter
regex: False    function: True    char in hex: 1a6 - UppercaseLetter
regex: False    function: True    char in hex: 1c5 - TitlecaseLetter
regex: False    function: True    char in hex: 1c8 - TitlecaseLetter
regex: False    function: True    char in hex: 1cb - TitlecaseLetter
regex: False    function: True    char in hex: 1f2 - TitlecaseLetter
regex: False    function: True    char in hex: 1f6 - UppercaseLetter
regex: False    function: True    char in hex: 1f7 - UppercaseLetter
regex: False    function: True    char in hex: 1f8 - UppercaseLetter
regex: False    function: True    char in hex: 218 - UppercaseLetter
regex: False    function: True    char in hex: 21a - UppercaseLetter
regex: False    function: True    char in hex: 21c - UppercaseLetter
regex: False    function: True    char in hex: 21e - UppercaseLetter
regex: False    function: True    char in hex: 220 - UppercaseLetter
regex: False    function: True    char in hex: 222 - UppercaseLetter
regex: False    function: True    char in hex: 224 - UppercaseLetter
regex: False    function: True    char in hex: 226 - UppercaseLetter
regex: False    function: True    char in hex: 228 - UppercaseLetter
regex: False    function: True    char in hex: 22a - UppercaseLetter
regex: False    function: True    char in hex: 22c - UppercaseLetter
regex: False    function: True    char in hex: 22e - UppercaseLetter
regex: False    function: True    char in hex: 230 - UppercaseLetter
regex: False    function: True    char in hex: 232 - UppercaseLetter
regex: False    function: True    char in hex: 23a - UppercaseLetter
regex: False    function: True    char in hex: 23b - UppercaseLetter
regex: False    function: True    char in hex: 23d - UppercaseLetter
regex: False    function: True    char in hex: 23e - UppercaseLetter
regex: False    function: True    char in hex: 241 - UppercaseLetter
regex: False    function: True    char in hex: 3d2 - UppercaseLetter
regex: False    function: True    char in hex: 3d3 - UppercaseLetter
regex: False    function: True    char in hex: 3d4 - UppercaseLetter
regex: False    function: True    char in hex: 3d8 - UppercaseLetter
regex: False    function: True    char in hex: 3da - UppercaseLetter
regex: False    function: True    char in hex: 3dc - UppercaseLetter
regex: False    function: True    char in hex: 3de - UppercaseLetter
regex: False    function: True    char in hex: 3e0 - UppercaseLetter
regex: False    function: True    char in hex: 3f4 - UppercaseLetter
regex: False    function: True    char in hex: 3f7 - UppercaseLetter
regex: False    function: True    char in hex: 3f9 - UppercaseLetter
regex: False    function: True    char in hex: 3fa - UppercaseLetter
regex: False    function: True    char in hex: 3fd - UppercaseLetter
regex: False    function: True    char in hex: 3fe - UppercaseLetter
regex: False    function: True    char in hex: 3ff - UppercaseLetter
regex: False    function: True    char in hex: 400 - UppercaseLetter
regex: False    function: True    char in hex: 40d - UppercaseLetter
regex: False    function: True    char in hex: 48a - UppercaseLetter
regex: False    function: True    char in hex: 48c - UppercaseLetter
regex: False    function: True    char in hex: 48e - UppercaseLetter
regex: False    function: True    char in hex: 4c0 - UppercaseLetter
regex: False    function: True    char in hex: 4c5 - UppercaseLetter
regex: False    function: True    char in hex: 4c9 - UppercaseLetter
regex: False    function: True    char in hex: 4cd - UppercaseLetter
regex: False    function: True    char in hex: 4ec - UppercaseLetter
regex: False    function: True    char in hex: 4f6 - UppercaseLetter
regex: False    function: True    char in hex: 500 - UppercaseLetter
regex: False    function: True    char in hex: 502 - UppercaseLetter
regex: False    function: True    char in hex: 504 - UppercaseLetter
regex: False    function: True    char in hex: 506 - UppercaseLetter
regex: False    function: True    char in hex: 508 - UppercaseLetter
regex: False    function: True    char in hex: 50a - UppercaseLetter
regex: False    function: True    char in hex: 50c - UppercaseLetter
regex: False    function: True    char in hex: 50e - UppercaseLetter
regex: False    function: True    char in hex: 1f88 - TitlecaseLetter
regex: False    function: True    char in hex: 1f89 - TitlecaseLetter
regex: False    function: True    char in hex: 1f8a - TitlecaseLetter
regex: False    function: True    char in hex: 1f8b - TitlecaseLetter
regex: False    function: True    char in hex: 1f8c - TitlecaseLetter
regex: False    function: True    char in hex: 1f8d - TitlecaseLetter
regex: False    function: True    char in hex: 1f8e - TitlecaseLetter
regex: False    function: True    char in hex: 1f8f - TitlecaseLetter
regex: False    function: True    char in hex: 1f98 - TitlecaseLetter
regex: False    function: True    char in hex: 1f99 - TitlecaseLetter
regex: False    function: True    char in hex: 1f9a - TitlecaseLetter
regex: False    function: True    char in hex: 1f9b - TitlecaseLetter
regex: False    function: True    char in hex: 1f9c - TitlecaseLetter
regex: False    function: True    char in hex: 1f9d - TitlecaseLetter
regex: False    function: True    char in hex: 1f9e - TitlecaseLetter
regex: False    function: True    char in hex: 1f9f - TitlecaseLetter
regex: False    function: True    char in hex: 1fa8 - TitlecaseLetter
regex: False    function: True    char in hex: 1fa9 - TitlecaseLetter
regex: False    function: True    char in hex: 1faa - TitlecaseLetter
regex: False    function: True    char in hex: 1fab - TitlecaseLetter
regex: False    function: True    char in hex: 1fac - TitlecaseLetter
regex: False    function: True    char in hex: 1fad - TitlecaseLetter
regex: False    function: True    char in hex: 1fae - TitlecaseLetter
regex: False    function: True    char in hex: 1faf - TitlecaseLetter
regex: False    function: True    char in hex: 1fbc - TitlecaseLetter
regex: False    function: True    char in hex: 1fcc - TitlecaseLetter
regex: False    function: True    char in hex: 1ffc - TitlecaseLetter
regex: False    function: True    char in hex: 2102 - UppercaseLetter
regex: False    function: True    char in hex: 2107 - UppercaseLetter
regex: False    function: True    char in hex: 210b - UppercaseLetter
regex: False    function: True    char in hex: 210c - UppercaseLetter
regex: False    function: True    char in hex: 210d - UppercaseLetter
regex: False    function: True    char in hex: 2110 - UppercaseLetter
regex: False    function: True    char in hex: 2111 - UppercaseLetter
regex: False    function: True    char in hex: 2112 - UppercaseLetter
regex: False    function: True    char in hex: 2115 - UppercaseLetter
regex: False    function: True    char in hex: 2119 - UppercaseLetter
regex: False    function: True    char in hex: 211a - UppercaseLetter
regex: False    function: True    char in hex: 211b - UppercaseLetter
regex: False    function: True    char in hex: 211c - UppercaseLetter
regex: False    function: True    char in hex: 211d - UppercaseLetter
regex: False    function: True    char in hex: 2124 - UppercaseLetter
regex: False    function: True    char in hex: 2126 - UppercaseLetter
regex: False    function: True    char in hex: 2128 - UppercaseLetter
regex: False    function: True    char in hex: 212a - UppercaseLetter
regex: False    function: True    char in hex: 212b - UppercaseLetter
regex: False    function: True    char in hex: 212c - UppercaseLetter
regex: False    function: True    char in hex: 212d - UppercaseLetter
regex: False    function: True    char in hex: 2130 - UppercaseLetter
regex: False    function: True    char in hex: 2131 - UppercaseLetter
regex: False    function: True    char in hex: 2133 - UppercaseLetter
regex: False    function: True    char in hex: 213e - UppercaseLetter
regex: False    function: True    char in hex: 213f - UppercaseLetter
regex: False    function: True    char in hex: 2145 - UppercaseLetter
regex: False    function: True    char in hex: 2c00 - UppercaseLetter
regex: False    function: True    char in hex: 2c01 - UppercaseLetter
regex: False    function: True    char in hex: 2c02 - UppercaseLetter
regex: False    function: True    char in hex: 2c03 - UppercaseLetter
regex: False    function: True    char in hex: 2c04 - UppercaseLetter
regex: False    function: True    char in hex: 2c05 - UppercaseLetter
regex: False    function: True    char in hex: 2c06 - UppercaseLetter
regex: False    function: True    char in hex: 2c07 - UppercaseLetter
regex: False    function: True    char in hex: 2c08 - UppercaseLetter
regex: False    function: True    char in hex: 2c09 - UppercaseLetter
regex: False    function: True    char in hex: 2c0a - UppercaseLetter
regex: False    function: True    char in hex: 2c0b - UppercaseLetter
regex: False    function: True    char in hex: 2c0c - UppercaseLetter
regex: False    function: True    char in hex: 2c0d - UppercaseLetter
regex: False    function: True    char in hex: 2c0e - UppercaseLetter
regex: False    function: True    char in hex: 2c0f - UppercaseLetter
regex: False    function: True    char in hex: 2c10 - UppercaseLetter
regex: False    function: True    char in hex: 2c11 - UppercaseLetter
regex: False    function: True    char in hex: 2c12 - UppercaseLetter
regex: False    function: True    char in hex: 2c13 - UppercaseLetter
regex: False    function: True    char in hex: 2c14 - UppercaseLetter
regex: False    function: True    char in hex: 2c15 - UppercaseLetter
regex: False    function: True    char in hex: 2c16 - UppercaseLetter
regex: False    function: True    char in hex: 2c17 - UppercaseLetter
regex: False    function: True    char in hex: 2c18 - UppercaseLetter
regex: False    function: True    char in hex: 2c19 - UppercaseLetter
regex: False    function: True    char in hex: 2c1a - UppercaseLetter
regex: False    function: True    char in hex: 2c1b - UppercaseLetter
regex: False    function: True    char in hex: 2c1c - UppercaseLetter
regex: False    function: True    char in hex: 2c1d - UppercaseLetter
regex: False    function: True    char in hex: 2c1e - UppercaseLetter
regex: False    function: True    char in hex: 2c1f - UppercaseLetter
regex: False    function: True    char in hex: 2c20 - UppercaseLetter
regex: False    function: True    char in hex: 2c21 - UppercaseLetter
regex: False    function: True    char in hex: 2c22 - UppercaseLetter
regex: False    function: True    char in hex: 2c23 - UppercaseLetter
regex: False    function: True    char in hex: 2c24 - UppercaseLetter
regex: False    function: True    char in hex: 2c25 - UppercaseLetter
regex: False    function: True    char in hex: 2c26 - UppercaseLetter
regex: False    function: True    char in hex: 2c27 - UppercaseLetter
regex: False    function: True    char in hex: 2c28 - UppercaseLetter
regex: False    function: True    char in hex: 2c29 - UppercaseLetter
regex: False    function: True    char in hex: 2c2a - UppercaseLetter
regex: False    function: True    char in hex: 2c2b - UppercaseLetter
regex: False    function: True    char in hex: 2c2c - UppercaseLetter
regex: False    function: True    char in hex: 2c2d - UppercaseLetter
regex: False    function: True    char in hex: 2c2e - UppercaseLetter
regex: False    function: True    char in hex: 2c80 - UppercaseLetter
regex: False    function: True    char in hex: 2c82 - UppercaseLetter
regex: False    function: True    char in hex: 2c84 - UppercaseLetter
regex: False    function: True    char in hex: 2c86 - UppercaseLetter
regex: False    function: True    char in hex: 2c88 - UppercaseLetter
regex: False    function: True    char in hex: 2c8a - UppercaseLetter
regex: False    function: True    char in hex: 2c8c - UppercaseLetter
regex: False    function: True    char in hex: 2c8e - UppercaseLetter
regex: False    function: True    char in hex: 2c90 - UppercaseLetter
regex: False    function: True    char in hex: 2c92 - UppercaseLetter
regex: False    function: True    char in hex: 2c94 - UppercaseLetter
regex: False    function: True    char in hex: 2c96 - UppercaseLetter
regex: False    function: True    char in hex: 2c98 - UppercaseLetter
regex: False    function: True    char in hex: 2c9a - UppercaseLetter
regex: False    function: True    char in hex: 2c9c - UppercaseLetter
regex: False    function: True    char in hex: 2c9e - UppercaseLetter
regex: False    function: True    char in hex: 2ca0 - UppercaseLetter
regex: False    function: True    char in hex: 2ca2 - UppercaseLetter
regex: False    function: True    char in hex: 2ca4 - UppercaseLetter
regex: False    function: True    char in hex: 2ca6 - UppercaseLetter
regex: False    function: True    char in hex: 2ca8 - UppercaseLetter
regex: False    function: True    char in hex: 2caa - UppercaseLetter
regex: False    function: True    char in hex: 2cac - UppercaseLetter
regex: False    function: True    char in hex: 2cae - UppercaseLetter
regex: False    function: True    char in hex: 2cb0 - UppercaseLetter
regex: False    function: True    char in hex: 2cb2 - UppercaseLetter
regex: False    function: True    char in hex: 2cb4 - UppercaseLetter
regex: False    function: True    char in hex: 2cb6 - UppercaseLetter
regex: False    function: True    char in hex: 2cb8 - UppercaseLetter
regex: False    function: True    char in hex: 2cba - UppercaseLetter
regex: False    function: True    char in hex: 2cbc - UppercaseLetter
regex: False    function: True    char in hex: 2cbe - UppercaseLetter
regex: False    function: True    char in hex: 2cc0 - UppercaseLetter
regex: False    function: True    char in hex: 2cc2 - UppercaseLetter
regex: False    function: True    char in hex: 2cc4 - UppercaseLetter
regex: False    function: True    char in hex: 2cc6 - UppercaseLetter
regex: False    function: True    char in hex: 2cc8 - UppercaseLetter
regex: False    function: True    char in hex: 2cca - UppercaseLetter
regex: False    function: True    char in hex: 2ccc - UppercaseLetter
regex: False    function: True    char in hex: 2cce - UppercaseLetter
regex: False    function: True    char in hex: 2cd0 - UppercaseLetter
regex: False    function: True    char in hex: 2cd2 - UppercaseLetter
regex: False    function: True    char in hex: 2cd4 - UppercaseLetter
regex: False    function: True    char in hex: 2cd6 - UppercaseLetter
regex: False    function: True    char in hex: 2cd8 - UppercaseLetter
regex: False    function: True    char in hex: 2cda - UppercaseLetter
regex: False    function: True    char in hex: 2cdc - UppercaseLetter
regex: False    function: True    char in hex: 2cde - UppercaseLetter
regex: False    function: True    char in hex: 2ce0 - UppercaseLetter
regex: False    function: True    char in hex: 2ce2 - UppercaseLetter
TOTAL mismatches: 213

I probably should have recognized the list, since I have dealt with it before. But off the top of my head I didn't, and in the meantime Ryan over on the CLR team stepped in to help explain what was going on:

This appears to be a bug in the Regex class. If IgnoreCase is present we will translate Lu and Lt to just Ll since we call Char.ToLower for every character in the input. You would likely know more about this than I do but I verified that Char.ToLower for one of the characters returns the same character, presumably because there is no lower case version of the character. So the expression fails to match because the Unicode category for the character is still uppercase letter and we are trying to match Ll.

Ah, now it all came together.
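Ryan's description can be modeled in a few lines. This is only a rough sketch of the behavior as he describes it, not the actual Regex internals: the class and method names (IgnoreCaseSketch, SimulatedIgnoreCaseMatch, IsMismatch) are made up for illustration, and the use of char.ToLowerInvariant is an assumption that stands in for the culture-invariant lowercasing. U+2102 (DOUBLE-STRUCK CAPITAL C) is used as the sample because it is an uppercase letter with no lowercase mapping.

```csharp
using System;
using System.Globalization;

static class IgnoreCaseSketch
{
    // Hypothetical model of the described behavior (not the real Regex code):
    // the input character is lowercased, and Lu/Lt in the pattern are treated
    // as Ll, so only Ll, Lm and Lo can match.
    public static bool SimulatedIgnoreCaseMatch(char c)
    {
        char lowered = char.ToLowerInvariant(c);
        UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(lowered);
        return cat == UnicodeCategory.LowercaseLetter
            || cat == UnicodeCategory.ModifierLetter
            || cat == UnicodeCategory.OtherLetter;
    }

    // True when char.IsLetter accepts c but the modeled match rejects it.
    public static bool IsMismatch(char c)
    {
        return char.IsLetter(c) && !SimulatedIgnoreCaseMatch(c);
    }

    static void Main()
    {
        char sample = '\u2102'; // uppercase letter, no lowercase mapping
        Console.WriteLine("IsLetter: {0}", char.IsLetter(sample));
        Console.WriteLine("Lowercasing changes it: {0}", char.ToLowerInvariant(sample) != sample);
        Console.WriteLine("Modeled mismatch: {0}", IsMismatch(sample));
    }
}
```

Any character whose lowercased form is still in Lu or Lt falls through this model, which is the shape of the list above.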

Well, if you are running on Vista and have the updated casing table then they will work. But otherwise, when you are not running on Vista, the casing table does not cover all of Unicode 5.0 even though the property table in .NET 2.0 does.

(if you run on .NET 1.1 then you will be missing even more characters since not all characters are identified, though in that case they will not be listed as missing in the script since neither function knows about them!)

So if you are running on 2.0 or better, this Regex "optimization" is the cause of the bug.

Strictly speaking, there was no need to pass RegexOptions.IgnoreCase since char.IsLetter is going to pick both of them up anyway. So there is a workaround here -- don't pass flags that slow down the Regex and break its functioning anyway, and you can then freely use the Regex if you like (though it did still seem kinda slow to me, maybe there are some optimizations here.... :-)
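A minimal sketch of that workaround might look like the following. It keeps the character class from the original listing and simply drops RegexOptions.IgnoreCase. The names NoIgnoreCaseSketch, LetterRegex and CountMismatches are made up here, and building the Regex once outside the loop is a choice made for this sketch (an assumption about where some of the time goes, not something that was measured).

```csharp
using System;
using System.Text.RegularExpressions;

static class NoIgnoreCaseSketch
{
    // Same character class as the original listing, without RegexOptions.IgnoreCase.
    // Constructed once and reused for every character.
    public static readonly Regex LetterRegex = new Regex(
        @"^([\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}]+)$",
        RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);

    // Counts the UTF-16 code units where the regex and char.IsLetter disagree.
    public static int CountMismatches()
    {
        int count = 0;
        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            if (LetterRegex.IsMatch(c.ToString()) != char.IsLetter(c))
            {
                count++;
            }
        }
        return count;
    }

    static void Main()
    {
        Console.WriteLine("Mismatches without IgnoreCase: {0}", CountMismatches());
    }
}
```

If the two property lookups agree on a given runtime, the count printed here should drop to zero, though that is worth checking on the runtime you actually care about.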

 

This post brought to you by Ⰰ (U+2c00, a.k.a. GLAGOLITIC CAPITAL LETTER AZU)


# Michael S. Kaplan on 26 Apr 2007 9:20 AM:

This bug actually indirectly shows which developers weren't running on Vista for their primary development machine. :-)


