Dere are qvestions? In zat case...

by Michael S. Kaplan, published on 2005/03/10 00:15 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2005/03/10/391564.aspx


J. Daniel Smith asked about ToLower() (and ToUpper()) and some trouble he was having with them:

The comment about Turkish in the docs with regards to "i" doesn't carry a lot of weight with fellow programmers and we only care about 8 languages: English, FIGS and CJK.

One example that occurs to me is the word "Straße" in German. When upper-cased it should become "STRASSE" (no ß), but I can't seem to get code to do that. Also, being a noun, you can't lower-case this word as nouns always start with a capital in German; "straße" is wrong (unless there is a verb "strassen").

Windows and the .NET Framework mainly support simple, reversible casing -- which is to say single code point casing in which ToUpper() and ToLower() are inverse operations that can "undo" each other. As such, you cannot use either method to convert "ß" to "SS" or back.

Comparison, on the other hand, will handle this case. If you compare "ß" to "SS" with CompareString and the NORM_IGNORECASE flag in Windows or the CompareInfo.Compare method and the CompareOptions.IgnoreCase flag in the .NET Framework, the two strings will be considered equal. Because in truth, they are equal -- just a case pair apart....

This happens on all locales, not just in German -- because the "ß" (U+00df, a.k.a. LATIN SMALL LETTER SHARP S) is considered to be a simple case difference away from "SS" in the default table. Give it a try!
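The difference between simple casing and comparison can be sketched in Python. This is only an illustration, not the Windows or .NET implementation: `simple_upper` is a hypothetical helper that imitates single code point casing by refusing any mapping that changes the string length, and Python's built-in `str.casefold()` stands in for a case-insensitive comparison.

```python
def simple_upper(text: str) -> str:
    """Hypothetical helper: upper-case one code point at a time.

    A character is only replaced when its upper-case form is also a
    single code point, which imitates simple, reversible casing.
    """
    result = []
    for ch in text:
        upper = ch.upper()
        result.append(upper if len(upper) == 1 else ch)
    return "".join(result)


def equal_ignore_case(a: str, b: str) -> bool:
    """Compare two strings using Unicode case folding."""
    return a.casefold() == b.casefold()


print(simple_upper("Straße"))                  # the sharp s is left as it is
print(equal_ignore_case("Straße", "STRASSE"))  # folding treats ß like ss
```

The casing helper can never produce "STRASSE" because that would need one code point to become two, while the comparison is free to treat the two spellings as equal.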

J. Daniel went on further to ask some additional questions:

In German, there is always an alternate spelling for words with umlauts: "für" is the same as "fuer". However, the converse is not always true; not every "ue" can be replaced with "ü".

Similarly for "ß", it can always be replaced with "ss" (and must when UPPER-CASING as there is no such thing as an upper-case "ß"). But not every "ss" can be replaced with "ß".

First, I can't seem to get ToUpper() to turn "ß" into "SS".

Second, how do I correctly deal with "für"=="fuer"?

Ok, I think I took care of explaining the deal with the Sharp S. But let me add that this is not a conditional operation -- Windows is neither drawing on huge German dictionaries to avoid treating them with this sort of equivalency nor using machine reading techniques and schoolboy knowledge of German to read the text....

For the second point, you will want to look at what is known as the German Phonebook Sort -- LCID of 0x00010407. It will have all of the following equivalences in collation:

Ä == AE
ä == ae
Ö == OE
ö == oe
Ü == UE
ü == ue
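As a rough illustration of what these equivalences mean for a comparison, here is a minimal Python sketch. It is not how Windows collation works internally (real collation compares sort weights rather than rewriting strings); `PHONEBOOK_EXPANSIONS`, `phonebook_key`, and `phonebook_equal` are hypothetical names, and the sketch assumes the umlauts are stored as single precomposed code points.

```python
# Assumed expansion table, copied from the equivalences listed above.
PHONEBOOK_EXPANSIONS = {
    "Ä": "AE", "ä": "ae",
    "Ö": "OE", "ö": "oe",
    "Ü": "UE", "ü": "ue",
}


def phonebook_key(text: str) -> str:
    """Hypothetical comparison key: expand umlauts, then fold case."""
    expanded = "".join(PHONEBOOK_EXPANSIONS.get(ch, ch) for ch in text)
    return expanded.casefold()


def phonebook_equal(a: str, b: str) -> bool:
    """Treat two strings as equal when their phonebook keys match."""
    return phonebook_key(a) == phonebook_key(b)


print(phonebook_equal("für", "FUER"))  # umlaut expansion plus case folding
print(phonebook_equal("für", "fur"))   # a bare "u" is not an equivalent
```

The case folding in the key plays the role of an ignore-case option; without it, only the umlaut expansion would apply.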

You can just think of collation as the technology that will travel to where casing fears to go.... :-)

 

This post is sponsored by "Ä" (U+00c4, a.k.a. LATIN CAPITAL LETTER A WITH DIAERESIS)


# Michael Fink on Thursday, March 10, 2005 2:30 AM:

I think most German people wouldn't write "fuer" when they have umlauts available, e.g. on the keyboard. It's probably the same with the capitalized STRASSE. The SS looks so "unnatural" that some even use the lowercase ß for the SS instead. One exception of course is crossword puzzles.

# Michael Kaplan on Thursday, March 10, 2005 2:36 AM:

Yes, I have heard the same from many people. At the same time, people generally don't want to see incorrect results if they *do* use this form....

For the umlaut especially, it is more common to use it versus this other form. But there are times when people definitely prefer one over the other.

The consequences are interesting, in any case. :-)

# Henry Böhlert on Thursday, March 10, 2005 4:40 AM:

My last name is Böhlert but I usually spell it Boehlert, e.g. for electronic data processing. There are so many systems that will just mess up.

You might not always have a German keyboard and using US-International is little known or may be too much for some people's fingers.

In de-ch, 'ß' is not used at all. Which has the somewhat over-stressed drawback of not being able to distinguish between drinking "in Maßen" and drinking "in Massen". (http://dict.leo.org/?lp=ende&lang=de&searchLoc=0&cmpType=relaxed&relink=on&sectHdr=on&spellToler=std&search=in+Ma%DFen)

# Sam Jost on Thursday, March 10, 2005 6:13 AM:

Beware: If you convert ß to uppercase in German it won't *always* be changed into SS.

If I remember right there is some rule that in names ß will keep being ß even in uppercase!

Especially funny if you have a street name 'Große Straße', which will be uppercased to 'GROßE STRASSE'.
Even most Germans do not know of this rule, and quite a lot don't believe this is true :)

do you?

# J. Daniel Smith on Thursday, March 10, 2005 6:52 AM:

So if I only care about English, FIGS and CJK, I can readily use ToUpper() and ToLower()?

In my current codebase, I've carefully avoided both, using CompareNoCase() instead of upper- (or lower-) casing the two strings. I guess CompareNoCase() is more efficient, but ToUpper()/ToLower() seems to be more intuitive to some people--for one you can use operator==().

if (s1.ToLower() == s2.ToLower())

vs.

if (s1.CompareNoCase(s2) == 0)

So other than Turkish and performance, are there any differences between the two?

I think this may be the difference: if s1="für" and s2="FUER", no amount of upper/lower casing is going to turn "ü" into "UE". However, I can write

if (s1.CompareNoCase(s2, DE_Phonebook_Sort ) == 0)

and the strings will compare as equivalent. Yes?

# Michael Kaplan on Thursday, March 10, 2005 6:58 AM:

Exactly. Oh, there is also the Georgian bug (mentioned earlier in this blog) but Georgian may not be on your list, either....

But do not minimize the performance aspects -- those calls to methods on the .NET Framework allocate new strings, and the more you can avoid gratuitous allocations, the better off you are....

# Brian on Thursday, March 10, 2005 8:31 AM:

In addition to the String allocation overhead, even a little thought should reveal that the comparison has much better best and average case performance, since it can return as soon as it finds 2 different characters. In fact, its worst case is comparable to the best case when comparing the results of ToLower (but this is before you add in the allocation overhead!). It's unlikely to be a performance bottleneck, but there's no reason to write gratuitously inefficient code.
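The early-exit behaviour described here can be sketched in Python. This is an assumption-laden illustration rather than what CompareNoCase() or the .NET Framework actually does: `equal_ignore_case_early_exit` is a hypothetical function that only handles simple, one-character casing, so it will not treat "ß" and "SS" as equal.

```python
from itertools import zip_longest


def equal_ignore_case_early_exit(a: str, b: str) -> bool:
    """Compare character by character, stopping at the first mismatch.

    No lower-cased copy of either whole string is built; only the
    characters that actually differ are case-converted.
    """
    for ca, cb in zip_longest(a, b):
        if ca is None or cb is None:
            return False  # one string ended before the other
        if ca != cb and ca.lower() != cb.lower():
            return False  # first real difference: stop right here
    return True


print(equal_ignore_case_early_exit("Hello", "hELLO"))
print(equal_ignore_case_early_exit("abc", "xbc"))  # returns after one step
```

Lower-casing both strings first would always walk and copy every character before the comparison even starts.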

# Michael Kaplan on Thursday, March 10, 2005 8:36 AM:

Well, it is hard to know what to do with a question that basically says "I know you are talking about best practices but it does not seem to apply to me. Can I just do it the way I am now since although it is slow and wrong, it does not seem too slow or too wrong for my purposes?"

The answer is NO, you should not do that. Slow and wrong are not absolutes, but there is no way to know for sure if the solution that is known to be slow and wrong will not cause problems later....

# J. Daniel Smith on Thursday, March 10, 2005 11:54 AM:

What I was really looking for is ammo to use against fellow programmers (I'm trying to do the "right thing" myself): they seem to like using ToLower() to do a case-insensitive string compare.

As Brian mentioned, such code is unlikely to be a performance bottleneck, and s1==s2 is arguably clearer than CompareNoCase()==0. And nobody cares about the problem with "i" in Turkish.

I think I can get a lot further with correctness: my understanding is that there is no way to make "für" and "FUER" compare equal using ToLower(), but it can be done with CompareNoCase().

# Michael Kaplan on Thursday, March 10, 2005 11:57 AM:

Well, start with that case, and you can then let them know that it is just the tip of the iceberg. The road they are heading down will never lead to anything but hard to understand and repro bugs....

# Ruben on Thursday, March 10, 2005 1:45 PM:

So if German's got a phone book sort, why doesn't Dutch have one?

For Dutch, the difference between the normal sort and the phone book sort is that pesky IJ (again!). Standard sort treats it as I+J, phone book sort treats it as Y: Bruyn A., Bruijn B., Bruyn C. - useful, as in Dutch names, the Y and IJ generally have the same pronunciation, and people often write ÿ/y instead of ij.

If you're still reading: ij is called a "lange ij" [long ij], ei a "korte ei" [short ei], and y a "Griekse ij" [Greek y], all pronounced /Ei/. Dutch children end their ABC with X IJ Z. Was that city called Ysselstein or IJsselstein? (Depends on which city you refer to.) Yes, it hurts :-(

And there's even a 'dictionary sort', for lack of a better name, which sorts W X IJ Y Z (granted: weird, but some dictionaries and encyclopedias do). While I'm at it, any chance of StringInfo returning IJ as a single element?

Get a feeling why title casing IJ as Ij feels so unnatural, like I indicated in my earlier post? I hate it when Word does that. "Auto Correct" is what that's called. Yeah, right!

But then again, Dutch is definitely not on that list of 8 languages people seem to care about so... Either that, or customs keeps your technology up at the German border. ;-)

# J. Daniel Smith on Friday, March 11, 2005 9:37 PM:

Oh yea, I can see how that Dutch IJ (Ĳ, U+0132) could be quite pesky. I would agree that there's no way "IJ" should become "Ij".

From a software & translation point of view, once you got English+Major European+Asian language, it's probably "easy" to do any LtoR language. It's probably things like documentation, brochures, support, etc. that limit more translations.

I think it'd be cool if my company supported Dutch too; I lived in Amersfoort for 18 months and tried to learn the language, but German kept getting in the way. And the Dutch are all happy to speak English...

# Michael Kaplan on Friday, March 11, 2005 9:43 PM:

The Dutch requirement is a bit different -- and it would require a more intelligent engine than the "one character at a time" table-based one used by NLS currently.

It does not mean I do not care about Dutch, I can promise you that. But sometimes the harder it is to support a requirement the longer it takes to see it supported.

Example -- CJK text was vertical at least as often as it was horizontal, but the computer age made horizontal more appealing since it could be supported more quickly than vertical.

Vertical is now there in some apps, but still not all. Can you imagine what would happen if Japan could not make use of computers? Yikes!

Compare with right-to-left languages like Arabic which did not have such a practical alternative at hand -- they had to wait a lot longer to see their language supported in computers....

