What the %#$* is wrong with German sorting?

by Michael S. Kaplan, published on 2005/04/10 01:25 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/04/10/406880.aspx


(Apologies to those who are offended by the South Park movie scene that inspired the title of this post!)

About a month ago, J. Daniel Smith asked me something that prompted me to say "Dere are qvestions? In zat case...".

Then last week, Martin Müller asked in the microsoft.public.dotnet.internationalization newsgroup:

Recently I've stumbled across the fact that the CompareInfo for my default culture de-DE as well as for InvariantCulture considers "ss" and German "ß" (szlig) equivalent, which is not correct!

For example, calling "lassen".IndexOf("ß") yields 2 instead of 0.

CultureInfo.InvariantCulture.CompareInfo.Compare("lassen", "laßen") returns 0, which is wrong, too.

Using CompareInfo.IndexOf() without special CompareOptions gives the same incorrect results. When I use CompareOptions.Ordinal, however, IndexOf correctly returns -1 and Compare returns inequality. But CompareOptions.Ordinal cannot be combined with any other flag, so a case-insensitive comparison isn't possible this way.

This bug occurs with IndexOf and Compare of both String and CompareInfo.

Any comment on this or info when this will be fixed?

Well, I have a comment, but things are working as designed so nothing is going to be "fixed". I will explain....

In the German language, the Sharp S ("ß" or U+00df) is a lowercase letter, and it capitalizes to the letters "SS". Now Microsoft's casing tables only support simple Unicode casing, which does not include any rules that would change the size of the string such as this one. So doing a "ß".ToUpper() call will not return "SS".

(For more info on those casing rules, see CaseFolding.txt in the Unicode Character Database.)
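For contrast, here is a minimal sketch of what a length-changing case mapping looks like in a runtime that does implement full Unicode casing. It uses Python rather than .NET, only because Python's `str.upper()` applies those rules; it says nothing about how the Microsoft casing tables are built.

```python
# Python's str.upper() applies full Unicode casing, so the result can be
# longer than the input. Per the post, .NET's ToUpper() uses simple casing
# and does not perform this mapping.
sharp_s = "ß"  # U+00DF LATIN SMALL LETTER SHARP S

upper = sharp_s.upper()
print(upper)                     # SS
print(len(sharp_s), len(upper))  # 1 2
```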

But in any case, collation can be a bit more flexible. Since the Sharp S is very much a German letter and not one widely used outside of German, it is included in the default table rules used by all locales (German can be kept in the default table, where it is picked up by every locale whose own rules do not conflict with it).

But obviously on most locales, "ss" is what uppercases to "SS". Even on German, "ss" would uppercase to "SS".

So it is only logical to assume that in such a case, if

"ss".ToUpper() == "ß".ToUpper() == "SS"

then

"ss" ≅ "ß"

at least for the technical purpose of facilitating the ability to treat these other cases properly. This is why on almost all locales (including the invariant locale), "ß" looks so much like "ss".
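The reasoning can be tried out in any runtime that applies full Unicode casing. The sketch below uses Python for that reason; `same_ignoring_case` is a hypothetical helper written for this illustration, and the code does not reproduce the Windows collation tables, only the idea that two strings with the same case mapping end up looking equal.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their full case folding."""
    return a.casefold() == b.casefold()


# Both spellings share one uppercase form under full Unicode casing.
shared_upper = "ss".upper() == "ß".upper() == "SS"
print(shared_upper)                           # True

# A case-insensitive comparison built on folding treats them as equal...
print(same_ignoring_case("lassen", "laßen"))  # True

# ...while an exact, code-point comparison still tells them apart.
print("lassen" == "laßen")                    # False
```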

 

This post is brought to you by "ß" (U+00df, a.k.a. LATIN SMALL LETTER SHARP S)
And really, who else would it be? :-)


# Thomas on Sunday, April 10, 2005 4:51 AM:

That's bad. Because "lassen" means "let"; "laßen" is the plural past of "read"

"ss" ≅ "ß" might be logical, but no-one ever claimed that German is a logical language ;)

Thomas

# Michael S. Kaplan on Sunday, April 10, 2005 5:41 AM:

Ah, and if someone does a ToUpper() on both words to compare them on a system that does full Unicode casing, they will be identical. And then if that same implementation does a ToLower() on that same string once again then the difference will still be lost.

Lucky for Germans that Microsoft does not currently do full Unicode casing, huh? :-)
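A small sketch of that round trip, again in Python because its `upper()` and `lower()` follow full Unicode casing (the word list is just the pair from the comment above):

```python
words = ["lassen", "laßen"]

# Uppercase, then lowercase again. The sharp s becomes "SS" on the way up
# and plain "ss" on the way down, so it never comes back.
round_tripped = [w.upper().lower() for w in words]

print(round_tripped)  # ['lassen', 'lassen']
```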

# Peter Reiter on Sunday, April 10, 2005 7:05 AM:

Thomas,
'laßen' is now wrong and is the old orthography. Now you write 'lasen' according to the new orthography.
Therefore, "lassen".ToUpper() != "lasen".ToUpper()

There might be problems with other words though, like "es floss" (== it flowed, from 'fließen') and "das Floß" (== the raft)

# CornedBee on Sunday, April 10, 2005 2:04 PM:

> but things are working as designed
Bugs in the design are still bugs. The default collation is not supposed to be case-insensitive, so even treating characters that have the same uppercase form as equal is wrong.

I'll admit, though, that ß is a real problem.
Let's take a particularly difficult word, "fun".
Fun translates to "Spaß" or "Spass" - both are correct spelling. "Spaß" would be the de-AT version, "Spass" the de-DE or de-CH version. The reason is as follows:
The new rules for ß say that it only comes after vowels that are dragged out, or after diphthongs. (eu, au, ...)
In most Austrian dialects, "Spaß" is pronounced "shpahs", dragging out the 'a'. In most German dialects it's pronounced "shpuss", with a short 'a' that's pronounced as in "fuss".
So suddenly the spelling is different for the two languages.
Switzerland doesn't use ß at all, so the spelling can only be "Spass".

It is not entirely correct to say that ß is uppercased to "SS". Sure, nowadays it is, but not so long ago, the correct uppercase form was "SZ". There is a distinction between ß and ss, even in uppercase, although there it's no longer detectable.
As other people pointed out, uppercasing ß to SS is a true loss of information. Floß-floss is just one of many words. Straße (street) and Strasse (current head of the political party FPÖ in Austria) become the same, too. And many others. This loss of information is bound to uppercasing - is it really necessary that collation has the same loss if the string is not uppercased?

# Michael S. Kaplan on Sunday, April 10, 2005 2:37 PM:

Unfortunately, the architecture does not allow the distinction to be noted in one case and ignored in the other.

# erich on Sunday, April 10, 2005 8:39 PM:

Another example:
"in Massen" means "in huge quantities"
"in Maßen" means "within limits" (i.e. rarely)
which is more or less the opposite!
They are also pronounced differently - think of "Maasen" for the second one. A general (new) rule is that a sharp s is used after a long vowel, whereas a double s follows a short vowel.

The first is not used too often, since we have the term "Massenproduktion" for mass production, which lets us avoid this "kind of ambiguity", while the second is in common use!

I agree that ß=ss when it comes to sorting. But only for sorting, not for comparing!
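One way to picture "equal for sorting, different for comparing" is a two-level sort key: a folded form decides the main order, and the original string only breaks ties. The sketch below is a hypothetical illustration in Python, not a description of how Windows collation is built; `two_level_key` is a name made up for the example.

```python
def two_level_key(s: str) -> tuple[str, str]:
    """Hypothetical key: folded text first, original text as tie-breaker."""
    return (s.casefold(), s)


words = ["Maßen", "Masse", "Massen", "Mast"]

# Plain code-point order pushes the sharp s behind every other letter.
by_code_point = sorted(words)
print(by_code_point)  # ['Masse', 'Massen', 'Mast', 'Maßen']

# With the two-level key, "Maßen" sorts next to "Massen"...
by_two_levels = sorted(words, key=two_level_key)
print(by_two_levels)  # ['Masse', 'Massen', 'Maßen', 'Mast']

# ...yet the two words still do not compare as equal.
print(two_level_key("Massen") == two_level_key("Maßen"))  # False
```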

# Michael S. Kaplan on Sunday, April 10, 2005 10:05 PM:

Ah, but there is the rub -- sorting *is* comparing....

# Thomas on Monday, April 11, 2005 5:41 AM:

> Ah, but there is the rub -- sorting *is*
> comparing....

that's not true ;)

If you ignore uppercase/lowercase when sorting a list, it doesn't mean that AAA and aaa are identical.

One thing to add:
The new German orthography has nothing to do with this problem. You can't expect that everybody writes "Spass" instead of "Spaß" because the second might be the wrong way.
And there are a lot of old texts, too.

Imo, the ß in uppercase should be ß. Afaik the ß<->ss substitution was only for typewriters without this special char. But in times of Unicode?


Mr. Kaplan, it's a bug, not a feature ;)


Thomas

# Michael S. Kaplan on Monday, April 11, 2005 6:53 AM:

I understand you may feel this way on an emotional basis, but is it something you noticed that has negatively impacted an application?

It has been in Windows for the last decade, and many people do at least subconsciously accept the equivalence. Sorting *IS* de facto collation and vice versa -- because they are both done via the same data and the same APIs.

# J. Daniel Smith on Monday, April 11, 2005 9:46 AM:

The problem with
"ss".ToUpper() == "ß".ToUpper() == "SS"
gets back to my original question of more reasons (other than Turkish "i") to avoid calling ToUpper() (or ToLower()) in code.

Although as some of the follow-up comments indicate, that doesn't solve all the problems...

I'm starting to wrap all my strings in classes so that I get more semantic information. For example, on Windows, filenames should always be compared case-insensitively (but don't upper/lower case them, as NTFS is case-preserving). I can make operator==() do that for a Pathname class.

# Rainer Bauer on Tuesday, April 12, 2005 11:38 AM:

The German ß is an abbreviation of "sz". So theoretically, the correct uppercase would be "SZ".

# Thomas on Wednesday, April 13, 2005 10:48 AM:

Ahh... I like this topic :)

I had a look in my (old) Duden-dictionary:

shortened translation:

If you're going to write ß as a capital letter use SS
If misunderstanding is possible, use SZ (not for lower case letters)

Btw: ß is a ligature out of the old German type

http://de.wikipedia.org/wiki/Bild:Sz_modern.png


Thomas

# Carsten <c.posingies@gmx.de> on Thursday, May 05, 2005 6:09 AM:

> Fun translates to "Spaß" or "Spass" - both are correct
> spelling. "Spaß" would be the de-AT version,
> "Spass" the de-DE or de-CH version.

I beg your pardon, but I have to correct you on that matter. It seems Austrians tend to divide Germany into Bavaria and "Rest". Anyhow, "Spaß" is de-AT and de-DE. Standard German defines "Spaß" as "shpahs". Anything like "shpus" is dialect, like in Cologne ("Mir hatte' so richtisch Spass!" / "me.ah hut.te so [now it's getting complicated... for the UK folks: pronounce the following "gh" like the "ch" in "Loch Ness"] ree.gh.tish shpus"), and Berlin.

> The new rules for ß say that it only comes after vowels
> that are dragged out, or after diphthongs. (eu, au, ...)

Quite correct, so far. But please don't assume that anything that doesn't sound like Viennese is Standard German.

> In most German dialects it's pronounced "shpuss",

As I said: not in Standard German.

> so the spelling can only be "Spass".

Wrong. It's still "Spaß". The challenge starts elsewhere, namely that Standard German isn't exclusive but inclusive. CH doesn't use the "Eszet ligature" at all. They pronounce "Straße" as "strahse" but write "Strasse". They say "spahs" but spell it "Spass". This spelling is also correct in DE and AT. The other way round, AT and DE spelling is also correct (nevertheless not used) in CH.

Regarding Microsoft and the .NET-Framework, we have kind of a Gordian knot: "straße".ToUpper() /should/ result in "STRASSE", but how to deal with "STRASSE".ToLower()? Not really insolvable, but tricky.

# Michael S. Kaplan on Thursday, May 05, 2005 7:38 AM:

>>>Regarding Microsoft and the .NET-Framework, we have kind of a Gordian knot: "straße".ToUpper() /should/ result in "STRASSE", but how to deal with "STRASSE".ToLower()? Not really insolvable, but tricky.

Well, the .NET Framework would not really do anything like this until/unless they decided to implement full Unicode casing (right now we just do simple Unicode casing, minus a bunch of characters since we have not updated the table in a while).

But it is a tricky problem. Would people want *all* instances of SS to be lowercased to ß? Or would that also be confusing? This particular issue is one that is easier to do in the collation tables than the casing ones....

# David on Monday, June 20, 2005 10:25 AM:

As pointed out in an earlier post, "ß" is to be treated as "sz" and there is no capitalized version of this letter. That's how it is. Any sorting that claims to sort correctly for the German language and screws this one up is plain broken. It's not a feature, it is a bug.
Also mentioned in some comments, the spelling of "Strasse" is common, but inconsistent with the pronunciation rules of the German language. Those state that a vowel before a double consonant is to be pronounced short. That means writing "Strasse" results in pronouncing the "a" short, which gives a word that does not exist in the German language.
Replacing "ß" with lower case "sz" is the only logical approach, especially when the letter "ß" is even called "Eszett". I have no idea which braindead ignoramus came up with "ss" as a replacement for "ß". For example, writing "Straße" alternatively as "Strasze" will also not collide with the pronunciation rule as there is no double consonant and thus the "a" is pronounced longer.
German is simple and logical, unless some freaks come along and destroy this language as has happened with the new rule set (the "Schlechtschreibdeform" = "bad writing deformation").

# Michael S. Kaplan on Monday, June 20, 2005 10:32 AM:

There is certainly no shortage of passionate opinions here....

# Tanveer Badar on Thursday, December 20, 2007 9:53 AM:

I think you intended to write

For example, calling "lassen".IndexOf("ß") yields 2 instead of -1.

instead of

For example, calling "lassen".IndexOf("ß") yields 2 instead of 0.

Sorry about so many instead of.

# Michael S. Kaplan on Thursday, December 20, 2007 12:08 PM:

????

I was quoting someone else, who (as I pointed out) was mistaken anyway....


referenced by

2010/03/15 Thus the problems resist solution, and the workarounds are often inadequate

2009/07/29 Every character has a story #33: U+1e9e (CAPITAL SHARP S, Microsoft edition - Part 2)

2008/02/24 The idea has to do more than just make sense to me (aka How S-Sharp are *you* feeling today?)

2007/08/24 Every character has a story #28: U+1e9e (CAPITAL SHARP S)

2007/05/05 All right, mistakes were made #2 (What the %#$* is wrong with German Phonebook sorting?)

2005/12/14 It may seem like a bug, but it is not....

2005/11/22 More on the fabled EqualString

2005/11/13 Hungarian is even more complicated than I thought

2005/09/25 Every character has a story #15: CAPITAL SHARP S (not encoded)
