Sometimes, uppercasing sucks

by Michael S. Kaplan, published on 2006/08/18 15:04 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/08/18/706383.aspx


Case differences in casing scripts (Latin, Cyrillic, Greek, Armenian, Ecclesiastical Georgian, Coptic, Glagolitic, etc.) ought to be easy.

But they're not. And not just for the reasons I have talked about in the past.

All the technical folks want is a simple set of mappings that have a 100% roundtripping capability and no change in size of the string. It is needed for the filesystem, for the NT object namespace, and so on.

But their hopes must unfortunately be dashed if those technical folks wanted their simple needs to match the needs of customers, since individual languages have their own specific preferences and expectations here.

Only some of which are supported by Windows or the .NET Framework. And dare I say it, most of them are not supported.

A great example of this can be seen in Greek, which has so many different traditions across its history, from ancient to modern times, that we are lucky to have sites like this one to try and wade through the issues, which go way beyond the Greek final sigma issue I have talked about previously.

Starting with ancient Greek, there are three different preferences that call for three entirely different conventions for case mapping, as described here:

  1. The first approach (capital subscript) is to treat uppercase as equivalent to lowercase: if a mute iota is tucked underneath a lowercase vowel, it should also be tucked underneath an uppercase vowel. This is what the mediaeval scribes did, and it turns up both in pre-Modern Western typography of Classical Greek, and in Modern Greek typography—particularly that associated with the Church.
  2. The second approach (capital adscript) is to take capitals—and all caps text in particular—as equivalent to Ancient Greek text, which after all didn't have any lower case. In Ancient Greek writing, /oːi/ was written in capitals, as ΩΙ. So in a modern all caps text, the thinking goes, the same should be done: the capital version of ῳ is ΩΙ. This is seen sometimes in the west. In that case, the iota is no longer subscript/ypogegrammeni (written underneath the letter), but adscript/prosgegrammeni (written next to the letter).
  3. The third approach (small adscript) is a compromise: the capital version of the iota subscript is adscript, written next to the letter, but not as a full-sized capital iota. It appears either as a lowercase iota (full-sized or smaller than normal), or as a small-caps iota. This is usual practice in the West, and frequent outside the Church in Greece as well.
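As a point of comparison, here is a minimal Python sketch of what Unicode's default full case mappings do with the iota subscript. Python is only a convenient stand-in here (it is not what Windows or .NET do), and the helper and variable names are mine. The default uppercase mapping follows the second approach (capital adscript), which changes the length of the string and cannot be undone by lowercasing:

```python
# U+1FF3 is GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI (omega with iota subscript)
small = "\u1ff3"

upper = small.upper()   # full uppercase mapping: capital omega + capital iota
title = small.title()   # titlecase mapping: one precomposed capital letter


def code_points(s):
    """Format a string as space-separated hex code points."""
    return " ".join(f"{ord(c):04x}" for c in s)


print(code_points(upper))          # 03a9 0399
print(code_points(title))          # 1ffc
print(code_points(upper.lower()))  # 03c9 03b9
```

One character goes in and two come out, and lowercasing the result yields a plain omega followed by a plain iota rather than the original letter. So much for "100% roundtripping" and "no change in size."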

And then, moving into modern times, there is the debate between the polytonic system (currently out of favor but still taught and used) and the monotonic system (currently in favor and highly recommended). And case is where it gets interesting for us, as described here:

Greek differs from Latin in that it capitalises letters with diacritics differently, depending on whether the entire word is in capitals (whereupon diacritics are eliminated), or the initial is capitalised only, as in the first word in a sentence or in a title (whereupon the diacritics are retained, although they appear to the left of the letter rather than above it.) Thus, polytonic ἄνθρωπος capitalises to ΑΝΘΡΩΠΟΣ, but in titlecase to Ἄνθρωπος; monotonic άνθρωπος capitalises to ΑΝΘΡΩΠΟΣ and Άνθρωπος.

Even without the roundtripping requirement, it is clearly hard to decide what the default behavior should be.

And how do you balance the legitimate and illegitimate needs of roundtrip-ability with the needs of a script that wants a convention to drop the accents upon capitalization (thus losing them forever since you can't exactly get them back)?

The answer, just like it was in the post "Michael, why does ToTitleCase suck so much?", is not very well. Of course the practices for ancient texts are by and large completely ignored, but the default case mappings in modern practice don't really match the Greek expectation of dropping the accent, either.

Perhaps a simple example would help. :-)

Take the word Ρύθμιση (Regulation). The code points are:

03a1 03cd 03b8 03bc 03b9 03c3 03b7

If you run this through Windows or .NET, it will uppercase to the entirely reversible ΡΎΘΜΙΣΗ, which is:

03a1 038e 0398 039c 0399 03a3 0397

But the expectation of people in Greece is more likely to be ΡΥΘΜΙΣΗ, which is:

03a1 03a5 0398 039c 0399 03a3 0397

That second character would be expected to lose its TONOS, so that if you lowercased the uppercased string, you would get back ρυθμιση, not ρύθμιση.

Unless you created a font that would literally display U+038e without displaying the Tonos, which would give one the best of both worlds, with the only bad part being the confusability of such a solution.

Note that there are no title case mappings to help mitigate this, so ToTitleCase is once again not useful....

And of course this example ignores the even thornier problem with what to do when it is on the first letter, but you get the idea.
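For anyone who wants to see the two behaviors side by side, here is a rough Python sketch. The default `str.upper()` produces the same reversible result described above; the helper `greek_upper_no_tonos` is a hypothetical name of my own and not any built-in API. It assumes monotonic text and simply strips U+0301 (what the TONOS decomposes to) after uppercasing, so it ignores the first-letter problem, the dialytika conventions, polytonic accents, and any non-Greek text that happens to carry an acute accent:

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # TONOS decomposes to this mark under NFD


def code_points(s):
    """Format a string as space-separated hex code points."""
    return " ".join(f"{ord(c):04x}" for c in s)


def greek_upper_no_tonos(text):
    """Hypothetical helper: uppercase, then drop the tonos (assumption: monotonic Greek only)."""
    decomposed = unicodedata.normalize("NFD", text.upper())
    stripped = decomposed.replace(COMBINING_ACUTE, "")
    return unicodedata.normalize("NFC", stripped)


word = "\u03a1\u03cd\u03b8\u03bc\u03b9\u03c3\u03b7"  # Ρύθμιση

default_upper = word.upper()              # keeps the tonos, so it is reversible
greek_style = greek_upper_no_tonos(word)  # drops the tonos, so it is lossy

print(code_points(default_upper))        # 03a1 038e 0398 039c 0399 03a3 0397
print(code_points(greek_style))          # 03a1 03a5 0398 039c 0399 03a3 0397
print(code_points(greek_style.lower()))  # 03c1 03c5 03b8 03bc 03b9 03c3 03b7
```

The last line is the whole problem in miniature: once the accent is gone, lowercasing cannot bring it back.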

The solution for ancient texts is even more elusive, especially given the many differences in user expectations.

This post really just scratches the surface, if you are interested in the area then I highly recommend the links I pointed to, which go into even greater detail on the difficulties involved with Greek.

Now this is an area where potential improvements can be considered in the future, but there are no immediate built-in solutions available. All I can say for now is that it is in one's best interests to avoid converting Greek strings to uppercase if one wants to avoid having a bad situation in a localized application....

 

This post brought to you by ύ (U+03cd, a.k.a. GREEK SMALL LETTER UPSILON WITH TONOS)


# RubenP on 18 Aug 2006 6:00 PM:

Isn't this true for French as well, that uppercasing often removes accents?

I do know that it's true for the Dutch stress mark: één, Eén, ÉÉN (used for een 'one, numeral 1' [e:n]). The stress marks here are used to distinguish it from the indefinite article een [ən] 'a, an'. It's considered bad form to use Één. Can't think of any Dutch words starting with a 'real' accent though.

Really, if you ever find a Latin script-based language containing exceptions to just about every rule you can throw at it, it's got to be named Dutch. Capitalisation, accents, hyphenation, collation, word order, heck the Dutch even violate the metric system with their own definition of pounds and ounces (not compatible with the British pounds and ounces, of course). And hardly any Dutchman/woman even notices :-)

# mlippert on 18 Aug 2006 8:06 PM:

Wow, just when I think I've heard of most of the complexities you go and blow my mind again!

So what does NTFS do? Given what you've just said I'm guessing that it drops the accents so filenames that display with accents on lowercase letters are equivalent to those without the accents (ie you can't have 2 filenames differing just by the accents on letters).

Mike

# Michael S. Kaplan on 18 Aug 2006 8:19 PM:

Hi Ruben!

It is not as true for French as it once was.... I tend to think the rule was there because of the prevailing typewriter practice that simply made them look bad with accents, and now modern typography allows things to still look good and this changed the practice?

# Michael S. Kaplan on 18 Aug 2006 8:21 PM:

Hi Mike,

Well, Windows "solves" the problem by not supporting the idea. We need people to specify those names! And they would be quite unhappy if we did not (much moreso than their unhappiness about us not following the typographic practice!)

# Pavanaja UB on 19 Aug 2006 2:44 AM:

We, the Indic community, have no case to worry about. There is no case in Indic. JK...

Regards,
Pavanaja

# Michael S. Kaplan on 19 Aug 2006 11:18 AM:

Ah, that may be true. But then we don't have to worry about chillus, so I think it all evens out in the end....

:-)

# RubenP on 19 Aug 2006 2:46 PM:

Michael,

I guessed as much for French. In French the accents are so much more 'out there', it seems like it's a bad idea to drop them in the first place. In Dutch they're mostly there for resolving occasional ambiguity (such as the two 'een's) and stress.

[Often, you don't need things like italics to put emphasis on a word in Dutch; adding stress marks on the principal syllable is enough. Unfortunately, italics are easier to access than accents on most Dutch configurations. Using an actual stress mark beats the use of things like *this* :-)]

The Dutch example for Eén probably has the same origin. Though it's odd that people still insist on Eén rather than Één. But I guess you get used to it, and the exception becomes the only acceptable thing to see.

# Mike Dimmick on 20 Aug 2006 3:38 PM:

That really makes the 'why don't we get rid of the Caps Lock' key thing (which I saw here: http://www.edbott.com/weblog/?p=1438) seem very parochial. One commenter notes that on the French keyboard layout (paraphrasing), it's not so much Caps Lock as Shift Lock.

Some people suggest using Word's casing rules to get a title in capitals - either pressing Shift+F3 to rotate between lower case, Title Case and UPPER CASE, or using the All Caps font 'effect'. Can we rely on Word to do the Right Thing here? Does it depend on the setting of the selected language?

# Michael S. Kaplan on 20 Aug 2006 3:43 PM:

Well, Unicode's rules do not handle most of this, and currently Word doesn't handle most of it, either. We simply aren't there yet in terms of all of the language rules being built in....
