by Michael S. Kaplan, published on 2006/08/18 15:04 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/08/18/706383.aspx
Case differences in casing scripts (Latin, Cyrillic, Greek, Armenian, Ecclesiastical Georgian, Coptic, Glagolitic, etc.) ought to be easy.
But they're not. And not just for the reasons I have talked about in the past.
All the technical folks want is a simple set of mappings that have a 100% roundtripping capability and no change in size of the string. It is needed for the filesystem, for the NT object namespace, and so on.
But their hopes must unfortunately be dashed if those technical folks wanted their simple needs to match the needs of customers, since individual languages have their own specific preferences and expectations here.
Only some of which are supported by Windows or the .NET Framework. And dare I say it, most of them are not supported.
A great example of this can be seen in Greek, which has so many different traditions across its history from ancient to modern times that we are lucky to have sites like this one to try and wade through the issues, which go way beyond the Greek final sigma issue I have talked about previously.
Starting with ancient Greek, there are three different preferences that call for three entirely different conventions for case mapping, as described here:
And then, moving into modern times, there is the debate about the (currently out of favor but still taught and used) polytonic vs. (currently in favor and highly recommended) monotonic systems. And case is where it gets interesting for us, as described here:
Greek differs from Latin in that it capitalises letters with diacritics differently, depending on whether the entire word is in capitals (whereupon diacritics are eliminated), or the initial is capitalised only, as in the first word in a sentence or in a title (whereupon the diacritics are retained, although they appear to the left of the letter rather than above it.) Thus, polytonic ἄνθρωπος capitalises to ΑΝΘΡΩΠΟΣ, but in titlecase to Ἄνθρωπος; monotonic άνθρωπος capitalises to ΑΝΘΡΩΠΟΣ and Άνθρωπος.
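To make the two conventions in that quote concrete, here is a minimal Python sketch. The helper names (`strip_marks`, `all_caps`, `initial_cap`) are made up for illustration, and stripping every combining mark is a simplifying assumption; real Greek practice has exceptions this does not try to handle. It only shows that all-caps and initial-cap want different treatment of the same word.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose, drop every combining mark (accents, breathings), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)

def all_caps(word: str) -> str:
    """Whole word in capitals: diacritics are eliminated first."""
    return strip_marks(word).upper()

def initial_cap(word: str) -> str:
    """Only the initial is capitalized: diacritics are retained."""
    return word[:1].upper() + word[1:]

polytonic = "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"  # ἄνθρωπος
monotonic = "\u03ac\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"  # άνθρωπος

for word in (polytonic, monotonic):
    print(word, all_caps(word), initial_cap(word))
```

Both spellings collapse to the same all-caps form, which is exactly why the accents cannot be recovered afterwards.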
Even without the roundtripping requirement, it is clearly hard to decide what the default behavior should be.
And how do you balance the legitimate and illegitimate needs of roundtrip-ability with the needs of a script that wants a convention to drop the accents upon capitalization (thus losing them forever since you can't exactly get them back)?
The answer, just like it was in the post "Michael, why does ToTitleCase suck so much?", is not very well. Of course the practices for ancient texts are by and large completely ignored, but the default case mappings in modern practice don't really match the Greek expectation of dropping the accent, either.
Perhaps a simple example would help. :-)
Take the word Ρύθμιση (Regulation). The code points are:
03a1 03cd 03b8 03bc 03b9 03c3 03b7
If you run this through Windows or .NET, it will uppercase to the entirely reversible ΡΎΘΜΙΣΗ, which is:
03a1 038e 0398 039c 0399 03a3 0397
But the expectation of people in Greece is more likely to be ΡΥΘΜΙΣΗ, which is:
03a1 03a5 0398 039c 0399 03a3 0397
That second character would be expected to lose its TONOS, so that if you lowercased the uppercased string, you would get back ρυθμιση, not ρύθμιση.
Unless you created a font that would literally display U+038e without displaying the Tonos, which would give one the best of both worlds, with the only bad part being the confusability of such a solution.
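The same example can be walked through in Python as a rough sketch. Python's `str.upper()` follows the default Unicode case mappings, so it stands in here for the reversible behavior described above; it is not a claim about what Windows or .NET do internally. The helper `upper_drop_tonos` is hypothetical, and it assumes that removing the combining acute (U+0301, which the tonos decomposes to) is all the Greek convention asks for, which is a simplification.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # what the tonos becomes under NFD

def code_points(text: str) -> str:
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04x}" for ch in text)

def upper_drop_tonos(text: str) -> str:
    """Hypothetical helper: uppercase after removing the tonos."""
    decomposed = unicodedata.normalize("NFD", text)
    without_tonos = decomposed.replace(COMBINING_ACUTE, "")
    return unicodedata.normalize("NFC", without_tonos).upper()

word = "\u03a1\u03cd\u03b8\u03bc\u03b9\u03c3\u03b7"  # Ρύθμιση

default_upper = word.upper()           # keeps the tonos
greek_upper = upper_drop_tonos(word)   # drops the tonos

print(code_points(word))
print(code_points(default_upper), "->", default_upper.lower())
print(code_points(greek_upper), "->", greek_upper.lower())
```

Lowercasing the first result brings the accent back; lowercasing the second cannot, since nothing in the uppercased string records where the accent was.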
Note that there are no title case mappings to help mitigate this, so ToTitleCase is once again not useful....
And of course this example ignores the even thornier problem with what to do when it is on the first letter, but you get the idea.
The solution for ancient texts is even more elusive, especially given the many differences in user expectations.
This post really just scratches the surface. If you are interested in the area, then I highly recommend the links I pointed to, which go into even greater detail on the difficulties involved with Greek.
Now this is an area where potential improvements can be considered in the future, but there are no immediate built-in solutions available. All I can say for now is that it is in one's best interests to avoid converting Greek strings to uppercase if one wants to avoid having a bad situation in a localized application....
This post brought to you by ύ (U+03cd, a.k.a. GREEK SMALL LETTER UPSILON WITH TONOS)