When casing does not need to roundtrip in .NET

by Michael S. Kaplan, published on 2005/04/04 07:14 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/04/405174.aspx


A few days ago I reminded everyone about how every Unicode character has a story, and I was talking about U+03c2 a.k.a. GREEK SMALL LETTER FINAL SIGMA. You can read about it here.

In the deep, dark past I had also talked about the meaning of "linguistic casing" on Windows. I never did talk about when/if that setting is used in the .NET Framework (I wonder if anyone noticed I palmed that card?).

I also spent some time back then talking about the problems with the one-way casing issue in Georgian, when I asked people to Get off my [lower] case, and about how people should use uppercase operations rather than lowercase ones if they are trying to do any kind of case folding.

But this is not just a trip down memory lane....

The other day I was having an email conversation with one Jeff Cooperstein, an Architect over in the Developer Division. He was working through a bunch of these issues and he noticed something odd when he was trying out casing operations with .NET:

With the InvariantCulture, why is Final Sigma the only character other than the Georgian characters that doesn’t round trip:

ς (0x03C2) ToUpper -> Σ (0x03A3) ToLower -> σ (0x03C3)

With my default culture, there are lots of examples like this – For example: 

Dz (0x01F2) ToLower -> dz (0x01F3) ToUpper -> DZ (0x01F1)
ΐ (0x0390) ToUpper -> Ϊ (0x03AA) ToLower -> ϊ (0x03CA)
ϕ (0x03D5) ToUpper -> Φ (0x03A6) ToLower -> φ (0x03C6)
ϖ (0x03D6) ToUpper -> Π (0x03A0) ToLower -> π (0x03C0) 

However, with InvariantCulture, the only one that remains is ς
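The chains Jeff describes are easy to reproduce. Here is a minimal sketch in Python rather than .NET, on the assumption that Python's str.upper() and str.lower(), which follow the Unicode casing data rather than the Windows tables, are a fair stand-in for illustration. The helper names are made up for this example. U+0390 is left out because Python's full case mapping expands its uppercase form to more than one code point.

```python
def code_points(text: str) -> str:
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def upper_then_lower(ch: str) -> str:
    """Uppercase a character, then lowercase the result."""
    return ch.upper().lower()


def lower_then_upper(ch: str) -> str:
    """Lowercase a character, then uppercase the result."""
    return ch.lower().upper()


# Characters taken from the lists in the email above.
final_sigma = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
dz_titlecase = "\u01f2"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
phi_symbol = "\u03d5"    # GREEK PHI SYMBOL
pi_symbol = "\u03d6"     # GREEK PI SYMBOL

results = {
    final_sigma: upper_then_lower(final_sigma),
    dz_titlecase: lower_then_upper(dz_titlecase),
    phi_symbol: upper_then_lower(phi_symbol),
    pi_symbol: upper_then_lower(pi_symbol),
}

for original, after in results.items():
    verdict = "round-trips" if original == after else "does NOT round-trip"
    print(f"{code_points(original)} -> {code_points(after)}: {verdict}")
```

None of the four characters comes back as itself in Python, which matches the "default culture" behavior in the email rather than the InvariantCulture behavior.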

Does that second list ring bells? Indeed, it is the same set of mappings that LCMapString applies when you pass the LCMAP_LINGUISTIC_CASING flag, the so-called linguistic casing. Jeff was spot on with what he found -- for ordinary cultures the .NET Framework will always pick up these additional mappings (something that of course works really well for casing in Turkic cultures like Turkish and Azeri). The one exception is the invariant culture's CultureInfo (think CultureInfo.InvariantCulture), where these extra mappings are not picked up.

So what about that GREEK SMALL LETTER FINAL SIGMA?

Jeff was right again -- that is the only "one-way" mapping that exists in the default casing tables on Windows and the .NET Framework. Look at the sort keys for these three characters:

U+03c3  0F 13 01 01    01 01 00
U+03c2  0F 13 01 01 0A 01 01 00
U+03a3  0F 13 01 01 12 01 01 00

If you ignore the case weights (the 0A and 12 bytes that appear only in the second and third keys, marked in PINK in the original post), then the three keys are equal. So from the standpoint of the Windows filesystem, registry keys, environment variables, and all of the other related items, these are all the same character.
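To make the "ignore the case weights" step concrete, here is a small Python sketch that works only on the three byte strings listed above. It assumes, purely from the layout of those keys, that 01 acts as a field separator and that the third field is the case weight. It does not call any Windows API, and the function name is invented for this example.

```python
# The three sort keys exactly as listed above (sigma, final sigma, capital sigma).
SORT_KEYS = {
    "U+03C3": bytes.fromhex("0F 13 01 01 01 01 00"),
    "U+03C2": bytes.fromhex("0F 13 01 01 0A 01 01 00"),
    "U+03A3": bytes.fromhex("0F 13 01 01 12 01 01 00"),
}

SEPARATOR = b"\x01"
CASE_FIELD = 2  # assumption: third 01-delimited field holds the case weight


def without_case_weights(key: bytes) -> bytes:
    """Blank out the assumed case-weight field and reassemble the key."""
    fields = key.split(SEPARATOR)
    fields[CASE_FIELD] = b""
    return SEPARATOR.join(fields)


stripped = {name: without_case_weights(key) for name, key in SORT_KEYS.items()}

for name, key in stripped.items():
    print(name, key.hex(" "))

# The full keys are all different, but the stripped keys collapse to one value.
print(len(set(SORT_KEYS.values())), len(set(stripped.values())))
```

Splitting on 01 is only safe here because none of the weights in these three keys happens to contain that byte.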

I am not worried, though.

So why do I consider the Georgian case to be a terrible bug that ought to be fixed and this Greek case to be okay?

Well, there is the fact that the Georgian Khutsuri characters are not really ones that are understood by users, whereas the final sigma is very well known. Or the fact that the Greek case does have a two-way mapping, even if not one that allows for perfect round-tripping. Add to that the fact that the meaning of the text is not destroyed by being a SMALL LETTER SIGMA instead of a SMALL FINAL SIGMA -- so there is no real linguistic loss of meaning. And then the cherry on top of this sundae is the fact that the uppercase mapping goes along well with what the filesystem and everyone else uses for case folding!
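The point about uppercase folding can be sketched in a few lines. This is Python again, standing in for the Windows and .NET casing tables on the assumption that its Unicode-based str.upper() and str.lower() behave the same way for these three characters. The function names are hypothetical.

```python
# The three sigmas: small, small final, capital.
SIGMAS = ["\u03c3", "\u03c2", "\u03a3"]


def fold_with_upper(text: str) -> str:
    """Case-fold by uppercasing, the approach recommended in the post."""
    return text.upper()


def fold_with_lower(text: str) -> str:
    """Case-fold by lowercasing, shown only for contrast."""
    return text.lower()


upper_folded = {fold_with_upper(s) for s in SIGMAS}
lower_folded = {fold_with_lower(s) for s in SIGMAS}

# Uppercasing sends all three sigmas to the capital letter.
print(len(upper_folded))
# Lowercasing leaves the final sigma distinct from the ordinary small sigma.
print(len(lower_folded))
```

Folding with uppercase treats the three sigmas as one character, which lines up with the sort key comparison, while folding with lowercase keeps the final sigma separate.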

There is a separate question as to why Windows and the .NET Framework are both not smart enough to handle this situation and put the right character in place depending on whether you are at the end of a word. For now I'll point out rather glibly that we should first fix how we handle stuff at the beginning of words before we consider trying to tackle the endings!
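For contrast, here is what position-sensitive lowercasing looks like in a runtime that does attempt it. Python's str.lower() applies the Unicode final sigma rule, so this short sketch shows the behavior that paragraph has in mind. It says nothing about how Windows or .NET would or should implement it, and the sample words are just illustrations.

```python
# Uppercase Greek words: one ending in sigma, one both starting and ending in sigma.
words = ["\u039f\u0394\u039f\u03a3", "\u03a3\u039f\u03a6\u039f\u03a3"]  # ΟΔΟΣ, ΣΟΦΟΣ

lowered = [word.lower() for word in words]

for before, after in zip(words, lowered):
    # A word-final capital sigma becomes U+03C2; elsewhere it becomes U+03C3.
    print(before, "->", after, [f"U+{ord(ch):04X}" for ch in after])
```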

 

This post brought to you by "ς" and "σ" (U+03c2 and U+03c3, a.k.a. GREEK SMALL LETTER FINAL SIGMA and GREEK SMALL LETTER SIGMA)



referenced by

2008/06/25 Seeing the tears, my heart went out to her as I asked her "Why the Long S?"

2007/06/12 The difference between 'Dangeous Characters' and 'Dangerous Minds' is the lack of Michelle Pfeiffer

2005/05/26 The last word on the FINAL SIGMA
