'The 44' (*not* 'The 4400')

by Michael S. Kaplan, published on 2007/04/26 18:05 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/04/26/2290760.aspx


The 4400 is an interesting television show that this post has nothing to do with.

This post is about what happens if you run the script from "No Regex in the Unicode room! (and no sex in the champagne room, either!)" on a Vista machine.

Basically, you will still get 44 characters for which char.IsLetter and the Regex expression give different results:

regex: False function: True char in hex: 130 - UppercaseLetter
regex: False function: True char in hex: 1c5 - TitlecaseLetter
regex: False function: True char in hex: 1c8 - TitlecaseLetter
regex: False function: True char in hex: 1cb - TitlecaseLetter
regex: False function: True char in hex: 1f2 - TitlecaseLetter
regex: False function: True char in hex: 23a - UppercaseLetter
regex: False function: True char in hex: 23e - UppercaseLetter
regex: False function: True char in hex: 3d2 - UppercaseLetter
regex: False function: True char in hex: 3d3 - UppercaseLetter
regex: False function: True char in hex: 3d4 - UppercaseLetter
regex: False function: True char in hex: 3f4 - UppercaseLetter
regex: False function: True char in hex: 1fc3 - LowercaseLetter
regex: False function: True char in hex: 1fcc - TitlecaseLetter
regex: False function: True char in hex: 1ff3 - LowercaseLetter
regex: False function: True char in hex: 1ffc - TitlecaseLetter
regex: False function: True char in hex: 2102 - UppercaseLetter
regex: False function: True char in hex: 2107 - UppercaseLetter
regex: False function: True char in hex: 210b - UppercaseLetter
regex: False function: True char in hex: 210c - UppercaseLetter
regex: False function: True char in hex: 210d - UppercaseLetter
regex: False function: True char in hex: 2110 - UppercaseLetter
regex: False function: True char in hex: 2111 - UppercaseLetter
regex: False function: True char in hex: 2112 - UppercaseLetter
regex: False function: True char in hex: 2115 - UppercaseLetter
regex: False function: True char in hex: 2119 - UppercaseLetter
regex: False function: True char in hex: 211a - UppercaseLetter
regex: False function: True char in hex: 211b - UppercaseLetter
regex: False function: True char in hex: 211c - UppercaseLetter
regex: False function: True char in hex: 211d - UppercaseLetter
regex: False function: True char in hex: 2124 - UppercaseLetter
regex: False function: True char in hex: 2126 - UppercaseLetter
regex: False function: True char in hex: 2128 - UppercaseLetter
regex: False function: True char in hex: 212a - UppercaseLetter
regex: False function: True char in hex: 212b - UppercaseLetter
regex: False function: True char in hex: 212c - UppercaseLetter
regex: False function: True char in hex: 212d - UppercaseLetter
regex: False function: True char in hex: 2130 - UppercaseLetter
regex: False function: True char in hex: 2131 - UppercaseLetter
regex: False function: True char in hex: 2133 - UppercaseLetter
regex: False function: True char in hex: 213e - UppercaseLetter
regex: False function: True char in hex: 213f - UppercaseLetter
regex: False function: True char in hex: 2145 - UppercaseLetter
regex: False function: True char in hex: 2c65 - LowercaseLetter
regex: False function: True char in hex: 2c66 - LowercaseLetter
TOTAL mismatches: 44
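
As a rough cross-check of the list above, the following sketch tallies the 44 code points by Unicode general category. It is written in Python and relies on Python's bundled `unicodedata` tables, not on the .NET script or on Vista's data, so it is only an illustration; the `MISMATCHES` list and the `tally_categories` helper are names made up for this example, and the code points were copied by hand from the output.

```python
import unicodedata
from collections import Counter

# The 44 code points from the output above, copied by hand (hex, no prefix).
MISMATCHES = """
130 1c5 1c8 1cb 1f2 23a 23e 3d2 3d3 3d4 3f4
1fc3 1fcc 1ff3 1ffc
2102 2107 210b 210c 210d 2110 2111 2112 2115 2119 211a 211b 211c 211d
2124 2126 2128 212a 212b 212c 212d 2130 2131 2133 213e 213f 2145
2c65 2c66
""".split()


def tally_categories(hex_codes):
    """Count Unicode general categories (Lu, Lt, Ll, ...) for hex code points."""
    return Counter(unicodedata.category(chr(int(h, 16))) for h in hex_codes)


if __name__ == "__main__":
    counts = tally_categories(MISMATCHES)
    for category, count in sorted(counts.items()):
        print(category, count)
    print("TOTAL", sum(counts.values()))
```

Counting the lines of the output by hand gives the same split: 34 UppercaseLetter, 6 TitlecaseLetter, and 4 LowercaseLetter entries.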

The remaining characters make up an interesting bunch that gives insight into the specific flaws of certain Regex operations.

So there you have it: a combination of characters that should not have failed, since they were already lowercase, and characters that failed because of that weird optimization that skips titlecase and uppercase characters on the grounds that it attempted to lowercase first.

That RegexOptions.IgnoreCase is just a nightmare!

Interestingly, the OS casing table combined with a non-invariant culture (which is not possible in the .NET Framework today) would have picked up many of these letter-like symbols and other one-way mappings. But not all of them....
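
To make "one-way mapping" concrete, here is a small sketch, again in Python and again only an approximation: it uses Python's own case mappings, which are not the OS casing table or the .NET ones. The `describe_casing` helper is hypothetical. It lowercases a character and then uppercases the result, and reports whether that round trip returns the original character.

```python
import unicodedata


def describe_casing(ch):
    """Classify a character by what a lower-then-upper round trip does to it."""
    lowered = ch.lower()
    if lowered == ch:
        # Nothing to lowercase to (or the character is already lowercase).
        return "unchanged by lowercasing"
    if lowered.upper() == ch:
        return "round-trips"
    # Lowercasing works, but uppercasing lands on a different character.
    return "one-way mapping"


if __name__ == "__main__":
    # A few letter-like symbols from the list of 44.
    for code in (0x2126, 0x212A, 0x212B, 0x2102, 0x211D):
        ch = chr(code)
        print(f"{code:04x} {unicodedata.name(ch)}: {describe_casing(ch)}")
```

With Python's data, OHM SIGN, KELVIN SIGN, and ANGSTROM SIGN lowercase to ordinary letters whose uppercase forms are different characters, while the double-struck capitals have no lowercase form at all, which fits the "but not all of them" caveat.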

 

This post brought to you by every member of "The 44"

