by Michael S. Kaplan, published on 2007/08/21 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/21/4489089.aspx
(Today's title has two possible meanings, thanks to the verbal Sargasso of unclarity that is English)
Katy King is one of the testers over in the managed world (and sometimes contributor to the BCL Team Blog) who I manage to run across from time to time.
The main reason is that she periodically finds results that seem unexpected or inconsistent, and she wants to ask if she is missing something.
Now her track record is actually pretty good since pretty much all the issues she has raised are either messy things that are known and by design (but still messy) or actual bugs.
So if she wanted to skip the step of sending mail to ask, she wouldn't be out of line. I mean I hope she still sends the mail, since having people interested is always nice and sometimes when they are known issues the conversations get fascinating; I'd hate to lose that. :-)
Anyway, she found a doozy of an issue this time....
The bug she found was that some of the case pairs seemed to be reversed!
Here are the UnicodeData.txt entries for the eight characters:
1FC3;GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI;Ll;0;L;03B7 0345;;;;N;;;1FCC;;1FCC
1FCC;GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI;Lt;0;L;0397 0345;;;;N;;;;1FC3;
1FF3;GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI;Ll;0;L;03C9 0345;;;;N;;;1FFC;;1FFC
1FFC;GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI;Lt;0;L;03A9 0345;;;;N;;;;1FF3;
2C65;LATIN SMALL LETTER A WITH STROKE;Ll;0;L;;;;;N;;;023A;;023A
023A;LATIN CAPITAL LETTER A WITH STROKE;Lu;0;L;;;;;N;;;;2C65;
2C66;LATIN SMALL LETTER T WITH DIAGONAL STROKE;Ll;0;L;;;;;N;;;023E;;023E
023E;LATIN CAPITAL LETTER T WITH DIAGONAL STROKE;Lu;0;L;;;;;N;;;;2C66;
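For anyone who wants to check the pairs mechanically, here is a minimal Python sketch that pulls the simple case mappings out of the eight entries above. It assumes the usual UnicodeData.txt layout, where fields 12 and 13 (counting from zero) hold the simple uppercase and simple lowercase mappings; the names `ENTRIES` and `parse_simple_case` are made up for the example.

```python
# The eight UnicodeData.txt entries quoted above, verbatim.
ENTRIES = """\
1FC3;GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI;Ll;0;L;03B7 0345;;;;N;;;1FCC;;1FCC
1FCC;GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI;Lt;0;L;0397 0345;;;;N;;;;1FC3;
1FF3;GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI;Ll;0;L;03C9 0345;;;;N;;;1FFC;;1FFC
1FFC;GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI;Lt;0;L;03A9 0345;;;;N;;;;1FF3;
2C65;LATIN SMALL LETTER A WITH STROKE;Ll;0;L;;;;;N;;;023A;;023A
023A;LATIN CAPITAL LETTER A WITH STROKE;Lu;0;L;;;;;N;;;;2C65;
2C66;LATIN SMALL LETTER T WITH DIAGONAL STROKE;Ll;0;L;;;;;N;;;023E;;023E
023E;LATIN CAPITAL LETTER T WITH DIAGONAL STROKE;Lu;0;L;;;;;N;;;;2C66;
""".splitlines()


def parse_simple_case(lines):
    """Return (upper, lower) dicts of simple case mappings keyed by code point."""
    upper, lower = {}, {}
    for line in lines:
        fields = line.split(";")
        code_point = int(fields[0], 16)
        if fields[12]:  # simple uppercase mapping, if any
            upper[code_point] = int(fields[12], 16)
        if fields[13]:  # simple lowercase mapping, if any
            lower[code_point] = int(fields[13], 16)
    return upper, lower


upper, lower = parse_simple_case(ENTRIES)
for src, dst in sorted(upper.items()):
    print(f"upper(U+{src:04X}) = U+{dst:04X}")
for src, dst in sorted(lower.items()):
    print(f"lower(U+{src:04X}) = U+{dst:04X}")
```

So per the Unicode data, the SMALL letters carry the uppercase mappings and the CAPITAL letters carry the lowercase mappings, which is the direction one would expect.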
The Vista uppercase table:
0x1fcc 0x1fc3 ; GREEK LETTER ETA WITH PROSGEGRAMMENI
0x1ffc 0x1ff3 ; GREEK LETTER OMEGA WITH PROSGEGRAMMENI
0x023a 0x2c65 ; LATIN LETTER A WITH STROKE
0x023e 0x2c66 ; LATIN LETTER T WITH DIAGONAL STROKE
And the Vista lowercase table:
0x1fc3 0x1fcc ; GREEK LETTER ETA WITH PROSGEGRAMMENI
0x1ff3 0x1ffc ; GREEK LETTER OMEGA WITH PROSGEGRAMMENI
0x2c65 0x023a ; LATIN LETTER A WITH STROKE
0x2c66 0x023e ; LATIN LETTER T WITH DIAGONAL STROKE
So indeed these four pairs have uppercase characters where one would expect lowercase, and vice versa.
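To make the reversal concrete, here is a small Python sketch that compares the two Vista tables with the simple mappings from the UnicodeData.txt entries. It assumes each table row reads as "input code point, then result", which is my reading of the listing rather than a documented format; all of the names are hypothetical.

```python
# Rows from the Vista tables, read as {input: result}.
# The column order is an assumption.
VISTA_UPPER = {0x1FCC: 0x1FC3, 0x1FFC: 0x1FF3, 0x023A: 0x2C65, 0x023E: 0x2C66}
VISTA_LOWER = {0x1FC3: 0x1FCC, 0x1FF3: 0x1FFC, 0x2C65: 0x023A, 0x2C66: 0x023E}

# Simple uppercase mappings from the UnicodeData.txt entries (small -> capital).
UCD_UPPER = {0x1FC3: 0x1FCC, 0x1FF3: 0x1FFC, 0x2C65: 0x023A, 0x2C66: 0x023E}


def reversed_pairs(vista_upper, vista_lower, ucd_upper):
    """List (small, capital) pairs whose direction is swapped in both Vista tables."""
    found = []
    for small, capital in sorted(ucd_upper.items()):
        upper_is_backwards = vista_upper.get(capital) == small
        lower_is_backwards = vista_lower.get(small) == capital
        if upper_is_backwards and lower_is_backwards:
            found.append((small, capital))
    return found


swapped = reversed_pairs(VISTA_UPPER, VISTA_LOWER, UCD_UPPER)
for small, capital in swapped:
    print(f"U+{small:04X} / U+{capital:04X}: direction swapped")
```

Under that reading, the Vista lowercase table is exactly what the uppercase table should have been for these pairs, and vice versa.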
Yick!
The interesting question is whether to fix (and when, if the decision is made to fix).
Now the default behavior of the filesystem and the NT object namespace, which is case preserving as I talked about in this post and this other one, would actually not be affected, since the characters are still treated as equal in comparisons. And since none of the characters are in any code pages, there is no non-Unicode behavior to worry about.
But the problem comes in for the people who actually do conversions and then use the results for things like case insensitive hashes -- which really happens. So the fix would really require something along the lines of an "opt-in" flag for these four case pairs.
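Here is a rough Python sketch of why comparisons survive a fix but stored hash keys do not. It assumes that each row of the uppercase table reads as "input, then result" and that a character missing from the table maps to itself; `upcase`, `SHIPPED_UPPER`, and `FIXED_UPPER` are hypothetical stand-ins, not the actual Windows functions or tables.

```python
# The four rows of the shipped uppercase table, read as {input: result}
# (an assumption about the column order).
SHIPPED_UPPER = {0x1FCC: 0x1FC3, 0x1FFC: 0x1FF3, 0x023A: 0x2C65, 0x023E: 0x2C66}
# What a corrected table would hold for the same four pairs.
FIXED_UPPER = {result: source for source, result in SHIPPED_UPPER.items()}


def upcase(text, table):
    """Map each character through the table; unmapped characters pass through."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in text)


small, capital = "\u2c65", "\u023a"  # a with stroke, small and capital

# Either table folds both letters to a single value, so a case-insensitive
# comparison built on it still treats the two as equal.
assert upcase(small, SHIPPED_UPPER) == upcase(capital, SHIPPED_UPPER)
assert upcase(small, FIXED_UPPER) == upcase(capital, FIXED_UPPER)

# The folded value itself differs between the tables, though, so a key
# persisted under the shipped table would not match one built after a fix.
old_key = upcase(small, SHIPPED_UPPER)
new_key = upcase(small, FIXED_UPPER)
print(old_key == new_key)  # False
```

That mismatch between old and new keys is the case an opt-in flag would protect: existing callers keep producing the old values until they ask for the corrected ones.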
And how did this happen?
(People really get into Root Cause Analysis around here, or maybe that is just me!)
Funny story, I suppose, but this is really a bug in two parts!
The four Greek script characters have been around since Unicode 1.1 but they were never included in the case table for some reason. When they were added, it was through an automated process that was erroneously making the assumption that uppercase comes before lowercase, which in this case it did not.
And the four Latin script characters had the uppercase letters added in Unicode 4.1 and the lowercase letters added in Unicode 5.0, and it just looks like the distance between the upper and lower case letters confused things a bit. Not even an automated excuse, just plain old human error....
I guess I can just console myself thinking about the fact that on the bright side, like over 230 other case pairs were added that weren't wrong. And people had really already been thinking about whether it made sense in the long run to add the notion of versioning to the case table, so now the issue looks like it may be forced appropriately!
So all is not entirely lost.
But clearly Katy deserves a raise, in any case. Keep 'em coming, Katy!
This post brought to you by ῃ, ῌ, ῳ, ῼ, ⱥ, Ⱥ, ⱦ, and Ⱦ (U+1fc3, U+1fcc, U+1ff3, U+1ffc, U+2c65, U+023a, U+2c66, and U+023e, a.k.a. GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI, GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI, GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI, GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI, LATIN SMALL LETTER A WITH STROKE, LATIN CAPITAL LETTER A WITH STROKE, LATIN SMALL LETTER T WITH DIAGONAL STROKE, and LATIN CAPITAL LETTER T WITH DIAGONAL STROKE)