'The 44' (*not* 'The 4400')
by Michael S. Kaplan, published on 2007/04/26 18:05 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/04/26/2290760.aspx
The 4400 is an interesting television show that this post has nothing to do with.
This post is about what happens if you run the script from No Regex in the Unicode room! (and no sex in the champagne room, either!) on a Vista machine.
Basically, you will still get 44 characters with different results between char.IsLetter and the Regex expression:
regex: False function: True char in hex: 130 - UppercaseLetter
regex: False function: True char in hex: 1c5 - TitlecaseLetter
regex: False function: True char in hex: 1c8 - TitlecaseLetter
regex: False function: True char in hex: 1cb - TitlecaseLetter
regex: False function: True char in hex: 1f2 - TitlecaseLetter
regex: False function: True char in hex: 23a - UppercaseLetter
regex: False function: True char in hex: 23e - UppercaseLetter
regex: False function: True char in hex: 3d2 - UppercaseLetter
regex: False function: True char in hex: 3d3 - UppercaseLetter
regex: False function: True char in hex: 3d4 - UppercaseLetter
regex: False function: True char in hex: 3f4 - UppercaseLetter
regex: False function: True char in hex: 1fc3 - LowercaseLetter
regex: False function: True char in hex: 1fcc - TitlecaseLetter
regex: False function: True char in hex: 1ff3 - LowercaseLetter
regex: False function: True char in hex: 1ffc - TitlecaseLetter
regex: False function: True char in hex: 2102 - UppercaseLetter
regex: False function: True char in hex: 2107 - UppercaseLetter
regex: False function: True char in hex: 210b - UppercaseLetter
regex: False function: True char in hex: 210c - UppercaseLetter
regex: False function: True char in hex: 210d - UppercaseLetter
regex: False function: True char in hex: 2110 - UppercaseLetter
regex: False function: True char in hex: 2111 - UppercaseLetter
regex: False function: True char in hex: 2112 - UppercaseLetter
regex: False function: True char in hex: 2115 - UppercaseLetter
regex: False function: True char in hex: 2119 - UppercaseLetter
regex: False function: True char in hex: 211a - UppercaseLetter
regex: False function: True char in hex: 211b - UppercaseLetter
regex: False function: True char in hex: 211c - UppercaseLetter
regex: False function: True char in hex: 211d - UppercaseLetter
regex: False function: True char in hex: 2124 - UppercaseLetter
regex: False function: True char in hex: 2126 - UppercaseLetter
regex: False function: True char in hex: 2128 - UppercaseLetter
regex: False function: True char in hex: 212a - UppercaseLetter
regex: False function: True char in hex: 212b - UppercaseLetter
regex: False function: True char in hex: 212c - UppercaseLetter
regex: False function: True char in hex: 212d - UppercaseLetter
regex: False function: True char in hex: 2130 - UppercaseLetter
regex: False function: True char in hex: 2131 - UppercaseLetter
regex: False function: True char in hex: 2133 - UppercaseLetter
regex: False function: True char in hex: 213e - UppercaseLetter
regex: False function: True char in hex: 213f - UppercaseLetter
regex: False function: True char in hex: 2145 - UppercaseLetter
regex: False function: True char in hex: 2c65 - LowercaseLetter
regex: False function: True char in hex: 2c66 - LowercaseLetter
TOTAL mismatches: 44
The remaining characters make up an interesting bunch that gives insight into the specific flaws of certain Regex operations:
- U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) -- no lowercase form in the invariant table, only one in Turkish
- U+01c5 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON) -- no lowercase form in the invariant table
- U+01c8 (LATIN CAPITAL LETTER L WITH SMALL LETTER J) -- no lowercase form in the invariant table
- U+01cb (LATIN CAPITAL LETTER N WITH SMALL LETTER J) -- no lowercase form in the invariant table
- U+01f2 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z) -- no lowercase form in the invariant table
- U+023a (LATIN CAPITAL LETTER A WITH STROKE) -- no idea why this one fails, there is a lowercase form (U+2c65)
- U+023e (LATIN CAPITAL LETTER T WITH DIAGONAL STROKE) -- no idea why this one fails, there is a lowercase form (U+2c66)
- U+03d2 (GREEK UPSILON WITH HOOK SYMBOL) -- this is a symbol; no lowercase form in the invariant table
- U+03d3 (GREEK UPSILON WITH ACUTE AND HOOK SYMBOL) -- this is a symbol; no lowercase form in the invariant table
- U+03d4 (GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL) -- this is a symbol; no lowercase form in the invariant table
- U+03f4 (GREEK CAPITAL THETA SYMBOL) -- this is a symbol; no lowercase form in the invariant table
- U+1ff3 (GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI) -- no idea why this one fails, it IS a lowercase form
- U+1ffc (GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI) -- no idea why this one fails, it has a lowercase form (U+1ff3)
- U+2102 (DOUBLE-STRUCK CAPITAL C) -- this is a symbol; no lowercase form in the invariant table
- U+2107 (EULER CONSTANT) -- this is a symbol; no lowercase form in the invariant table
- U+210b (SCRIPT CAPITAL H) -- this is a symbol; no lowercase form in the invariant table
- U+210c (BLACK-LETTER CAPITAL H) -- this is a symbol; no lowercase form in the invariant table
- U+210d (DOUBLE-STRUCK CAPITAL H) -- this is a symbol; no lowercase form in the invariant table
- U+2110 (SCRIPT CAPITAL I) -- this is a symbol; no lowercase form in the invariant table
- U+2111 (BLACK-LETTER CAPITAL I) -- this is a symbol; no lowercase form in the invariant table
- U+2112 (SCRIPT CAPITAL L) -- this is a symbol; no lowercase form in the invariant table
- U+2115 (DOUBLE-STRUCK CAPITAL N) -- this is a symbol; no lowercase form in the invariant table
- U+2119 (DOUBLE-STRUCK CAPITAL P) -- this is a symbol; no lowercase form in the invariant table
- U+211a (DOUBLE-STRUCK CAPITAL Q) -- this is a symbol; no lowercase form in the invariant table
- U+211b (SCRIPT CAPITAL R) -- this is a symbol; no lowercase form in the invariant table
- U+211c (BLACK-LETTER CAPITAL R) -- this is a symbol; no lowercase form in the invariant table
- U+211d (DOUBLE-STRUCK CAPITAL R) -- this is a symbol; no lowercase form in the invariant table
- U+2124 (DOUBLE-STRUCK CAPITAL Z) -- this is a symbol; no lowercase form in the invariant table
- U+2126 (OHM SIGN) -- this is a symbol; no lowercase form in the invariant table
- U+2128 (BLACK-LETTER CAPITAL Z) -- this is a symbol; no lowercase form in the invariant table
- U+212a (KELVIN SIGN) -- this is a symbol; no lowercase form in the invariant table
- U+212b (ANGSTROM SIGN) -- this is a symbol; no lowercase form in the invariant table
- U+212c (SCRIPT CAPITAL B) -- this is a symbol; no lowercase form in the invariant table
- U+212d (BLACK-LETTER CAPITAL C) -- this is a symbol; no lowercase form in the invariant table
- U+2130 (SCRIPT CAPITAL E) -- this is a symbol; no lowercase form in the invariant table
- U+2131 (SCRIPT CAPITAL F) -- this is a symbol; no lowercase form in the invariant table
- U+2133 (SCRIPT CAPITAL M) -- this is a symbol; no lowercase form in the invariant table
- U+213e (DOUBLE-STRUCK CAPITAL GAMMA) -- this is a symbol; no lowercase form in the invariant table
- U+213f (DOUBLE-STRUCK CAPITAL PI) -- this is a symbol; no lowercase form in the invariant table
- U+2145 (DOUBLE-STRUCK ITALIC CAPITAL D) -- this is a symbol; no lowercase form in the invariant table
- U+2c65 (LATIN SMALL LETTER A WITH STROKE) -- no idea why this one fails, it IS a lowercase form
- U+2c66 (LATIN SMALL LETTER T WITH DIAGONAL STROKE) -- no idea why this one fails, it IS a lowercase form
So there you have it -- a combination of characters that shouldn't have failed, since they were already lowercase, and characters that failed because of that weird optimization: the Regex does not look at Title Case and Upper Case characters, on the assumption that it has already lowercased the input.
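The optimization half of that explanation can be pictured with a toy model. The sketch below is in Python rather than .NET, and everything in it is an assumption made for illustration: TOY_INVARIANT_LOWER is a hypothetical two-entry stand-in for the invariant casing table, and toy_ignorecase_is_letter only mimics the behavior described here (lowercase first, then test a reduced set of letter categories). It is not the actual Regex implementation, and it says nothing about the characters that were already lowercase.

```python
import unicodedata

# Hypothetical stand-in for an invariant lowercase table: only the entries
# needed for this demonstration. Anything missing maps to itself.
TOY_INVARIANT_LOWER = {
    "A": "a",            # ordinary letter with a lowercase form
    "\u01c4": "\u01c6",  # LATIN CAPITAL LETTER DZ WITH CARON -> small form
}

# Categories still tested once the input has been lowercased. Lu and Lt are
# skipped on the assumption that lowercasing already got rid of them.
CATEGORIES_AFTER_LOWERING = {"Ll", "Lm", "Lo"}


def is_letter(ch: str) -> bool:
    """Plain category test, analogous in spirit to char.IsLetter."""
    return unicodedata.category(ch) in {"Lu", "Ll", "Lt", "Lm", "Lo"}


def toy_ignorecase_is_letter(ch: str) -> bool:
    """Lowercase via the toy table, then test only the reduced category set."""
    lowered = TOY_INVARIANT_LOWER.get(ch, ch)
    return unicodedata.category(lowered) in CATEGORIES_AFTER_LOWERING


for ch in ["A", "\u01c4", "\u01c5", "\u2102"]:
    print(f"U+{ord(ch):04X} function: {is_letter(ch)} "
          f"toy regex: {toy_ignorecase_is_letter(ch)}")
```

In this model a character with no entry in the table keeps its Upper Case or Title Case category after "lowercasing", so the reduced category test rejects it even though the plain category test accepts it.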
That RegexOptions.IgnoreCase is just a nightmare!
Interestingly, the OS casing table combined with a non-invariant culture (which is not possible in the .NET Framework today) would have picked up many of these letterlike symbols and other one-way mappings. But not all of them...
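As a rough illustration of the "but not all of them" point, the sketch below asks Python's built-in Unicode case mapping which of a few characters from the list have any lowercase mapping at all. This is an assumption-laden stand-in: Python's Unicode data is not the Windows casing table and may well differ from it, and has_lowercase_mapping is a helper invented here.

```python
# A few code points taken from the mismatch list in this post.
SAMPLES = [0x0130, 0x01C5, 0x023A, 0x03D2, 0x2102,
           0x2126, 0x212A, 0x212B, 0x2145]


def has_lowercase_mapping(cp: int) -> bool:
    """True if Python's Unicode data lowercases the character to something else."""
    ch = chr(cp)
    return ch.lower() != ch


for cp in SAMPLES:
    print(f"U+{cp:04X} has a lowercase mapping: {has_lowercase_mapping(cp)}")
```

In this stand-in table OHM SIGN, KELVIN SIGN and ANGSTROM SIGN lower to ordinary letters (a one-way mapping, since those letters do not uppercase back to the signs), while the double-struck capitals have nowhere to go.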
This post brought to you by every member of "The 44"