"Michael, why does ToTitleCase suck so much?"

by Michael S. Kaplan, published on 2005/03/04 02:02 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/03/04/384927.aspx


The title of this post is an actual quote from email I have received on the topic, similar to mail I have been sent many times since I started posting about casing issues a few months ago. And it is one of the tamer ones I have received!

People seem to hate the TextInfo class for its ToTitleCase method.

To quote a slightly "nicer" version of the question, someone named Ruben posted the following in the suggestion box:

Perhaps an article on the problems when using things like ToTitleCase, which is at war with just about any style guide, just loves acronyms (albeit the feeling is not mutual), and breaks spelling for languages like Dutch and Gaelic (e.g., IJmuiden and Oileán na gCapall, which are perfectly regular capitalizations in their respective languages; as an illustration on why Unicode doesn't solve linguistic issues, despite many people's assumptions that it does).

You can probably see why I put the word "nicer" in quotes, since many will feel as I do that you do not have to use foul language to post text that is harsh and biting!

For the origins of attempts at this method, we will need to head into the way-back machine to look at the old VB/VBA function StrConv and its vbProperCase conversion. This function "Converts the first letter of every word in string to uppercase." It does so by defining the word breaking characters as follows:

The following are valid word separators for proper casing: Null (Chr$(0)), horizontal tab (Chr$(9)), linefeed (Chr$(10)), vertical tab (Chr$(11)), form feed (Chr$(12)), carriage return (Chr$(13)), space (SBCS) (Chr$(32)). The actual value for a space varies by country for DBCS.
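
To make that rule concrete, here is a minimal Python sketch of a vbProperCase-style conversion. It is not the VB implementation: the name proper_case is hypothetical, only the SBCS separators quoted above are handled, and lowercasing the rest of each word is an assumption that goes beyond the quoted one-line description.

```python
# Separators taken from the quoted VB documentation:
# Chr$(0), Chr$(9), Chr$(10), Chr$(11), Chr$(12), Chr$(13), Chr$(32).
SEPARATORS = frozenset(chr(code) for code in (0, 9, 10, 11, 12, 13, 32))


def proper_case(text: str) -> str:
    """Uppercase the first character after each separator.

    Assumption (not stated in the quoted text): every other
    character in the word is lowercased.
    """
    out = []
    at_word_start = True
    for ch in text:
        if ch in SEPARATORS:
            out.append(ch)
            at_word_start = True
        elif at_word_start:
            out.append(ch.upper())
            at_word_start = False
        else:
            out.append(ch.lower())
    return "".join(out)


for sample in ("taming of the shrew", "ijmuiden", "oileán na gcapall"):
    print(proper_case(sample))
```

Fed Ruben's examples, a rule like this produces "Ijmuiden" and "Oileán Na Gcapall" rather than IJmuiden and Oileán na gCapall, which is exactly the complaint.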

Note that this function shows the same qualities of international ignorance, and even though the function has an LCID parameter, the actual amount of variation between locales is pretty small.

The VB function gets a little better in VB.Net (cf: VB.Net's StrConv), in that it now has a linguistic casing option, which is great for Turkic....

And then there is the Unicode Standard, which describes title casing, and the title case property values in the Unicode Character Database, in the following excerpts:

"Because of the inclusion of certain composite characters for compatibility, such as U+01F1 "DZ" LATIN CAPITAL LETTER DZ, there is a third case, called titlecase, which is used where the first character of a word is to be capitalized. An example of such a character is: U+01F2 "Dz" LATIN CAPITAL LETTER D WITH SMALL LETTER Z. "

"The choice of which words to titlecase is language-dependent. For example, "Taming of the Shrew" would be the appropriate capitalization in English, not "Taming Of The Shrew". Moreover, the determination of what actually constitutes a word is also language-dependent. For example, l'arbre might be considered two words in French, while can't is considered one word in English."

"In most cases, the titlecase is the same as the uppercase, but not always. For example, the titlecase of U+01F1 "DZ" capital dz is U+01F2 "Dz" capital d with small z."

"There are even single words like vederLa in Italian or the name McGowan in English, which are neither upper, lower, nor titlecase. This format is sometimes called innerCaps, and is often used in programming and in Web names. Once the string "McGowan" has been uppercased, lowercased or titlecased, the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation. There are also single characters that do not have reversible mappings, such as the Greek sigmas above."

Obviously Unicode hints at the complexities of title case in languages, but it does not really do much to support it in data (a task that would obviously require dictionaries, rules, and data). This even makes an interesting interview question, for people who are looking into those. :-)
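
The quoted examples are easy to reproduce with any word-by-word title-caser. The sketch below uses Python's built-in str.title() purely as a stand-in (it is not ToTitleCase, and its word-breaking rules are Python's own) to show the language-dependence and the round-trip loss the standard describes. The helper names naive_title and round_trips are hypothetical.

```python
# Python's str.title() stands in for a naive, dictionary-free title-caser.
# It is not the .NET ToTitleCase method.

def naive_title(text: str) -> str:
    """Title-case every run of letters, with no language knowledge."""
    return text.title()


def round_trips(text: str) -> bool:
    """True if uppercasing and then title-casing gives back the original."""
    return naive_title(text.upper()) == text


samples = ["taming of the shrew", "l'arbre", "can't", "McGowan", "vederLa"]
for s in samples:
    print(f"{s!r} -> {naive_title(s)!r} (round-trips: {round_trips(s)})")

# Greek final sigma: a single character whose mapping is not reversible.
final_sigma = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA
print(final_sigma.upper().lower() == final_sigma)
```

With no dictionary and no rules, every small word gets capitalized, the apostrophe is treated as a word break whether or not the language wants one, and "McGowan" comes back as "Mcgowan".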

While the text does talk a mean game, the actual data in Unicode for title casing is limited to a few of the digraphs like DZ (U+01f1, LATIN CAPITAL LETTER DZ), which through the miracle of title casing becomes Dz (U+01f2, LATIN CAPITAL LETTER D WITH SMALL LETTER Z). In this context the Unicode data is little more than making sure that digraphs get their own say....
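
That data is easy to inspect. As a hedged illustration, the Python sketch below reads the case mappings for the DZ digraph from the Unicode Character Database bundled with the interpreter. case_forms is a hypothetical helper name, and U+01F3 (LATIN SMALL LETTER DZ) is added here to complete the set.

```python
import unicodedata


def case_forms(ch: str) -> tuple[str, str, str]:
    """Return the (uppercase, lowercase, titlecase) mappings of one character."""
    return ch.upper(), ch.lower(), ch.title()


for cp in (0x01F1, 0x01F2, 0x01F3):
    ch = chr(cp)
    forms = " ".join(f"U+{ord(form):04X}" for form in case_forms(ch))
    # The general category "Lt" marks a titlecase letter.
    print(f"U+{cp:04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}] -> {forms}")
```

All three characters share the same upper, lower, and title forms; for an ordinary letter such as "a", the title form is simply the uppercase form.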

Now let us move to the TextInfo class's ToTitleCase method, which explains itself as follows:

Generally, title casing converts the first character of a word to uppercase and converts the rest of the letters to lowercase.

The returned string might differ in length from the input string. For more information on casing, refer to the Unicode Technical Report #21 "Case Mappings," published by the Unicode Consortium (http://www.unicode.org). The current implementation preserves the length of the string; however, this behavior is not guaranteed and could change in future implementations.

Casing semantics depend on the culture in use. If using the invariant culture, the casing semantics are not culture-sensitive. If using a specific culture, the casing semantics are sensitive to that culture. Words that are selected for title casing depend on the language.

If a security decision depends on a string comparison or a case-change operation, use the InvariantCulture to ensure that the behavior will be consistent regardless of the culture settings of the system. However, the invariant culture must be used only by processes that require culture-independent results, such as system services; otherwise, it produces results that might be linguistically incorrect or culturally inappropriate.

Now, currently the only culturally different casing behavior is the same rule one sees in Turkic languages, as I described in The [Upper]Case of the Turkish İ (or: Casing, the 2nd). While the potential for richer behavior exists such as some of the cases Ruben is referring to, none of them currently happen. But the way is open in the future for such things to possibly happen.
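
For readers who have not met that rule, here is a minimal Python sketch of it. Python's own case methods are not culture-sensitive, so the Turkic mapping (dotted i pairs with İ, dotless ı pairs with I) is applied by hand. The function names and the turkic flag are hypothetical, and this illustrates the rule rather than how TextInfo implements it.

```python
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def to_upper(text: str, turkic: bool = False) -> str:
    """Uppercase text; under the Turkic rule, i maps to dotted capital I."""
    if turkic:
        text = text.replace("i", DOTTED_CAPITAL_I)
    return text.upper()


def to_lower(text: str, turkic: bool = False) -> str:
    """Lowercase text; under the Turkic rule, I maps to dotless small i."""
    if turkic:
        text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


def title_word(word: str, turkic: bool = False) -> str:
    """Uppercase the first character and lowercase the rest of one word."""
    return to_upper(word[:1], turkic) + to_lower(word[1:], turkic)


print(title_word("istanbul"))               # default mapping
print(title_word("istanbul", turkic=True))  # Turkic mapping
```

The only difference between the two calls is which capital the leading i becomes, which is about all the culture sensitivity described here.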

This would, however, be an expensive operation to get right in terms of the amount of research that would be required. The help topic is therefore at best optimistic about such work happening. It may be best to set expectations more realistically and not talk about how culturally sensitive this method is (since it is not, at least not yet!).

Perhaps we could point out how it goes along with Unicode's somewhat vague definition, so that at this point it is really just lame by the transitive theory of developing to a standard.

It makes a catchy slogan -- do you think we can put

TextInfo.ToTitleCase -- No lamer than Unicode

on a T-shirt? :-)

 

This post brought to you by "NJ", "nj", and "Nj" (U+01ca, U+01cc, and U+01cb, a.k.a. LATIN CAPITAL LETTER NJ, LATIN SMALL LETTER NJ, and LATIN CAPITAL LETTER N WITH SMALL LETTER J)
(a.k.a. the Unicode UPPERcase, LOWERcase, and TITLEcase forms of the letter)


# AC on 4 Mar 2005 8:20 AM:

Or maybe you could get:

Unicode: No lamer than Microsoft

put on a shirt, instead.

# Sebastian Redl on 4 Mar 2005 12:29 PM:

Don't you think that "If it can't be implemented properly, don't implement it." would be a good maxim to follow?

# Michael Kaplan on 4 Mar 2005 12:32 PM:

Well, from the point of view of implementing a property that exists in Unicode, it *is* implemented.

The parts that have aspirations to higher principles of cultural sensitivity, on the other hand... :-)

# Ruben on 4 Mar 2005 4:30 PM:

And you're calling *my* language harsh and biting? IMHO, neither Unicode nor ToTitleCase is lame. They're great once you know how (and when) to use them. It's the crappy documentation that accompanies them that's lame.

# Michael Kaplan on 4 Mar 2005 4:35 PM:

Heh heh heh -- ok, I guess I am harder on stuff I own. But look at how they are both documented: Unicode introduces the concept by talking about the same sort of linguistic issues that the .NET Framework hints at and that your post gives additional concrete examples of.

Both are lame in that they hint at an important issue and then fail to deliver on it.

Your post reflects that fact, and is only biting at what (in my opinion) deserves to be bitten. :-)

# Norbert on 5 Mar 2005 10:32 PM:

So the only localized behavior right now is for Turkish "i"? It should be very easy to implement a far better ToTitleCase for German - capitalize the first letter of the string passed in and return the remainder unmodified. German uses the same casing rules for titles as for normal sentences. I think the same is true for a large set of other languages.

# Michael Kaplan on 5 Mar 2005 10:35 PM:

Norbert -- that much is there now (heck, that much was in the original VB function). It is all of the other special rules for various languages that are not followed.

# Joshua Drake on 24 Mar 2008 12:18 PM:

You forgot to mention that it ignores UPPERCASE strings.

# Michael S. Kaplan on 24 Mar 2008 12:48 PM:

Well, "ignores" is a relative term here, right?

Certainly based on the VB-type "proper case" functionality they wanted to emulate, it is broken.


referenced by

2008/08/08 What's in a name?

2006/08/18 Sometimes, uppercasing sucks

2005/04/04 When casing does not need to roundtrip in .NET
