I know I said 'µ' but I didn't really mean 'µ'. I meant 'μ', you know?

by Michael S. Kaplan, published on 2012/04/25 16:02 +02:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/04/25/10297456.aspx

So I recently got an email:

We recently had a bug filed against our team because on a PS-PS machine we were unable to do a proper search with a greek character. It turned out that the issue was caused because some greek lowercase characters do not compare correctly against their uppercase counterparts (and vice versa). The issue is actually a .Net bug. The attached bug is specifically for a RegEx check but it also fails when using .Net’s String.Compare function.

Example:

‘µ’.ToUpper() = ‘Μ’

Theoretically we would then expect that these two characters should compare true against each other when you do “IgnoreCase”. However they do not.

Ah yes, this is something I had seen before.

They were looking for µ, aka U+00b5 aka MICRO SIGN.

And they were unhappy that regular expressions that uppercased the text couldn't find the character again later.

Of course they were assuming it was μ, aka U+03bc, aka GREEK SMALL LETTER MU.

Unfortunately, several factors conspire to make things not work:

The 'linguistic' casing tables, which .NET uses by default, will uppercase U+00b5 to Μ, aka U+039c aka GREEK CAPITAL LETTER MU.
However, the collation tables tell a different story¹, so the three characters are not as interchangeable as one might want:
      0x00b5 10 11 2 2 ;Micro Sign
      0x03bc 15 24 2 2 ;Greek Small Mu
      0x039c 15 24 2 18 ;Greek Capital Mu
.NET's regular expression engine has some weird rules about matching.
Pseudo tends to do cutesy substitutions like that lowercase Mu for u.
Unicode has some differences here too, from UnicodeData.txt:
      00B5;MICRO SIGN;Ll;0;L;<compat> 03BC;;;;N;;;039C;;039C
      039C;GREEK CAPITAL LETTER MU;Lu;0;L;;;;;N;;;;03BC;
      03BC;GREEK SMALL LETTER MU;Ll;0;L;;;;;N;;;039C;;039C
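
The asymmetry in those three UnicodeData.txt lines can be seen in a small sketch. It uses Python rather than .NET, on the assumption that Python's built-in str.upper() and str.lower() follow the same Unicode case mappings shown above; it says nothing about .NET's linguistic casing tables or the collation weights. The names MICRO, SMALL_MU, CAPITAL_MU and describe are made up for the illustration.

```python
import unicodedata

# Hypothetical names for the three characters discussed above.
MICRO = "\u00b5"       # MICRO SIGN
SMALL_MU = "\u03bc"    # GREEK SMALL LETTER MU
CAPITAL_MU = "\u039c"  # GREEK CAPITAL LETTER MU


def describe(ch):
    """Return a code point and character name, e.g. 'U+00B5 MICRO SIGN'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Uppercase the micro sign, then lowercase the result again.
upper = MICRO.upper()
round_trip = upper.lower()

print(describe(MICRO), "-> upper ->", describe(upper))
print(describe(upper), "-> lower ->", describe(round_trip))
print("round trip returns the original:", round_trip == MICRO)
```

The uppercase mapping takes the micro sign to the Greek capital, but the capital's lowercase mapping points at the Greek small letter, so the micro sign never comes back.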

Now on the whole, pseudo is pretty cool.

It lets you find bugs that you usually wouldn't find until much later during the development cycle.

It does have one downside, though, one that makes pseudo pretty annoying.

When you substitute characters for kinda-lookalike characters with different properties and attributes, then you're going to get unexpected results sometimes....

Like this time!

1 - One can only speculate why the MICRO SIGN is treated so differently than other similar symbols, e.g. Ω (U+2126, aka OHM SIGN), K (U+212a aka KELVIN SIGN) and Å (U+212b, aka ANGSTROM SIGN). I only know that it has always been done this way. There is one workaround for those troubled by the discontinuity: Unicode normalization....
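
A minimal sketch of that normalization workaround, again in Python rather than .NET and using only the standard unicodedata module. The helper normalized and the SIGNS table are hypothetical names for the illustration. It suggests that the form matters: the micro sign carries only a compatibility mapping (the <compat> tag in the UnicodeData.txt line above), so it takes a compatibility form such as NFKC to fold it into the Greek letter, while the other three signs already change under NFC.

```python
import unicodedata


def normalized(ch, form):
    """Normalize a single character to the given Unicode normalization form."""
    return unicodedata.normalize(form, ch)


# Hypothetical table of the four signs mentioned in the footnote.
SIGNS = {
    "MICRO SIGN": "\u00b5",
    "OHM SIGN": "\u2126",
    "KELVIN SIGN": "\u212a",
    "ANGSTROM SIGN": "\u212b",
}

for name, ch in SIGNS.items():
    nfc = normalized(ch, "NFC")
    nfkc = normalized(ch, "NFKC")
    print(f"{name}: NFC -> U+{ord(nfc):04X}, NFKC -> U+{ord(nfkc):04X}")
```

Normalizing both the pattern and the text being searched to the same form before comparing is one way to apply this, at the cost of also folding every other compatibility character.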
