by Michael S. Kaplan, published on 2008/02/24 10:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/02/24/7864631.aspx
I am writing this blog from my own laptop waiting in the ER at the hospital (all of the quotes are from archives of old mails on my machine, not from memory!). It all happened when I was heading back after seeing a show in Ballard last night. By the time you read this I will actually be out again so there is no need to panic. Let's just say that Michael should not push himself too hard (especially when his scooter is in the shop) and leave it at that. I am sure that I am just fine and will probably explain what happened at some point. Crap like this keeps me humble, and anyone who knows me will claim that I can always use more of that....
Some time near the beginning of the month, Peter Edberg of Apple asked:
We have had several bug reports at Apple complaining that in case-insensitive string search, U+00DF "ß" matches "ss". Apparently this is due to the following line in CaseFolding.txt:
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
Isn't this an error? In the Unicode collation data, there is a secondary difference between U+00DF "ß" and "ss". The way I read the Unicode Collation Algorithm, case folding should preserve primary and secondary differences, and only eliminate differences at the tertiary level and below.
Am I misunderstanding something?
Peter is actually spot on here and as usual missing nothing. In fact, this is an issue I have talked about in relation to Windows before (long-time readers may recall Dere are qvestions? In zat case... and What the %#$* is wrong with German sorting?).
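For the curious, the effect of that CaseFolding.txt line is easy to see in anything that implements full Unicode case folding. Here is a minimal sketch in Python, whose str.casefold() applies the full ("F") mappings. The helper name caseless_equal is made up for illustration, and none of this is a claim about how Apple's or Microsoft's string search is actually implemented:

```python
# Full case folding as exposed by Python's str.casefold().
# caseless_equal is a hypothetical helper, not part of any library.
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after full Unicode case folding."""
    return a.casefold() == b.casefold()


# U+00DF folds to the two-character sequence "ss" ...
print("\u00df".casefold())
# ... so a caseless match cannot tell these spellings apart.
print(caseless_equal("Stra\u00dfe", "STRASSE"))
print(caseless_equal("Stra\u00dfe", "strasse"))
```

All three spellings end up as "strasse" after folding, which is exactly the behavior the bug reports were complaining about.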
In any case, regular reader here John Cowan pointed out:
> We have had several bug reports at Apple complaining that in case-insensitive string search, U+00DF "ß" matches "ss".
It pretty much has to. There is no way to tell (without knowing German) whether any given "SS" in text is an upcased version of "ss" or "ß". Consequently, if you want "SS" and "ss" to match, and likewise "SS" and "ß", then "ss" and "ß" will naturally match too. You could special-case this in code at the expense of making all case-folding slower.
> Isn't this an error? In the Unicode collation data, there is a secondary difference between U+00DF "ß" and "ss".
That's because "ss" is also used as a fallback when "ß" is not available. It is in effect both a secondary and a tertiary difference.
In Unicode 5.1 there will be a capital sharp S, but that is never used in running text, only in all-caps display text, and not always then. So it doesn't solve the problem, which is simply an inescapable quirk of German orthography. (The real answer is not to upcase German, but that battle is long since lost.)
Microsoft specifically avoids that weirdness of the mixed secondary/tertiary difference, in part because our architecture kind of requires it, though I suspect it would have happened anyway....
Anyway, Peter responded to John:
John,
Thanks, that makes sense. Ken Whistler (here at the UTC meeting) also just clarified this. He also indicated that a particular implementation of case-insensitive string search could choose a different approach to matching of U+00DF "ß" and "ss" without being non-conformant with Unicode (It would just not be following CaseFolding.txt).
Markus Scherer also added some interesting words to the mix:
A couple of years ago I wrote an email to DIN proposing to change DIN 5007 so that ß and ss are a tertiary difference, to make them consistent with mostly being case-different. However, their response was that they see ß more as a ligature, and ligatures sort as secondary differences in DIN 5007. The default UCA table follows DIN 5007 with respect to this as a secondary difference.
For Unicode case folding, there is really no choice: ß needs to uppercase to SS (at least for most users) which lowercases to ss. Therefore, ß and ss are in the same equivalence class, and that's how case folding is constructed.
In addition, Germans don't always understand when to use ß vs. ss, and in Switzerland ss is always used instead of ß, so it makes sense for somewhat-lenient string comparisons to equate them.
In my opinion, treating ß and ss as a case difference is the best behavior for this somewhat messy situation. (I did grow up in Germany, up to and including college.)
And there you have it!
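The equivalence class argument that John and Markus make can be sketched in a few lines of Python, whose str.upper() and str.lower() use the full case mappings. The function name upper_then_lower is just an illustration, and the point is only the shape of the argument, not any particular implementation:

```python
# Why the sharp s and "ss" land in the same class once you go through uppercase.
# upper_then_lower is a hypothetical helper used only for this illustration.
def upper_then_lower(s: str) -> str:
    """Uppercase with the full mapping, then lowercase the result."""
    return s.upper().lower()


sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S

print(sharp_s.upper())             # full uppercasing expands to two letters
print(upper_then_lower(sharp_s))   # and there is no way back to the sharp s

# Lowercasing alone keeps the two spellings distinct ...
print(sharp_s.lower() == "ss")
# ... but anything that passes through uppercase makes them collide.
print(upper_then_lower(sharp_s) == upper_then_lower("ss"))
```

Once "SS" has to match both spellings, the two lowercase spellings match each other whether anyone wanted that or not.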
Of course the view is likely going to get rockier soon, with Unicode 5.1 and the new CAPITAL SHARP S. The Sharp S, and many of the issues surrounding it, is of course something that I have been blathering about for some time, considering all of the following prior posts, just for starters....
That last one contains my recommendations for what I think Microsoft (and Windows) ought to do for both case and collation in the next version of Windows, which will be released at some unknown date after U+1e9e (CAPITAL SHARP S) is out there in the upcoming Unicode 5.1. They are:
For Microsoft, it raises some interesting questions for both collation and case for the next version of Windows.
I mean, think about the issues I have already talked about in posts like What the %#$* is wrong with German sorting? where we make ss equal to ß so that the uppercase version "SS" will sort near the ß in a sort ignoring case -- where we do things that make less linguistic sense in order to give regular results that are intuitive.
So who would expect that if U+00df is equal to ss that U+1e9e wouldn't be made equal to SS? Meaning that in the collation tables, U+00df and U+1e9e would simply be case variants, with no real choice in the matter.
And as to casing....
Now just because we make the relationship in casing does not mean we make it in collation. After all, as I have pointed out several times before, collation != case.
But on the other hand, the case table is used in order to enforce the case insensitivity in the NT object namespace and the file system. And one clear issue is that there is no good reason to allow one to put filenames differing only by the presence of U+00df and U+1e9e in the same directory. Users would either never try it or they would never expect it to work. So it is quite possible that in the next version of Windows (which only does simple casing) it may make the most sense to make the two characters case variants of each other -- to enforce reasonable use of both letters!
The collation change is kind of obvious -- what else could it be, ever?
The casing change is a bit more controversial, though, since it does not technically match Unicode.
Then again, the simple casing requirements of Windows, where the length can never change, keep SS from ever being an option there, and in a case-insensitive file system the notion of putting the lowercase and uppercase variants of the character in the same directory just feels like the wrong answer. Having these four entities:
all be the same in collation (when ignoring case distinctions) and having both the first two paired to each other and the second two paired to each other in the casing tables just makes sense -- anything else will lead to unintuitive results in normal situations -- and those variations would amount to genuine bugs from a user and a linguistic standpoint....
Now of course I am the developer owner of neither case nor collation at this point, which means that having it make sense to me is not necessarily the principal criterion for having the idea championed and eventually seeing the behavior updated in either case or collation.
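To make the casing half of that concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: simple_upper only approximates a length-preserving (simple) case table by refusing any mapping that changes the length, the pair_sharp_s switch models the suggested U+00df/U+1e9e pairing, and none of it reflects the actual Windows tables:

```python
# A rough stand-in for a simple (one character in, one character out) upcase
# table, with an optional entry pairing U+00DF with U+1E9E.
# All names here are hypothetical; this is not the Windows case table.
SHARP_S = "\u00df"
CAPITAL_SHARP_S = "\u1e9e"


def simple_upper(ch: str, pair_sharp_s: bool = False) -> str:
    """Uppercase one character without ever changing the length."""
    if pair_sharp_s and ch == SHARP_S:
        return CAPITAL_SHARP_S
    up = ch.upper()
    # Approximation: drop any full mapping that expands (such as "SS").
    return up if len(up) == 1 else ch


def upcase_name(name: str, pair_sharp_s: bool = False) -> str:
    """Apply simple_upper to every character of a name."""
    return "".join(simple_upper(c, pair_sharp_s) for c in name)


lower_name = "stra\u00dfe"
upper_name = "STRA\u1e9eE"

# Without the pairing the two names stay distinct after upcasing.
print(upcase_name(lower_name) == upcase_name(upper_name))
# With the pairing they become case variants of each other.
print(upcase_name(lower_name, True) == upcase_name(upper_name, True))
```

Note that "SS" never enters the picture: a table that cannot change the length has only the single-character pairing to offer.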
But I do still chat with the various owners in development, program management, and test from time to time.
And some of them even read this blog now and again....
So they will at least have the opportunity to have my opinion on the matter. :-)
This blog brought to you by the ever-popular ẞ (U+1e9e, aka LATIN CAPITAL LETTER SHARP S)
John Cowan on 24 Feb 2008 4:43 PM:
You might want to fix the encoding in my part of the text sometime. I don't know how that happened.
The important issue around capital sharp-S is that the uppercasing of ordinary sharp-S shouldn't be changed: the simple uppercasing should be to itself, the full uppercasing should be to SS. We need that not only for backward compatibility, but because it's what normal German text does.
After that, everything else can be natural: lowercasing capital sharp-S can provide ordinary sharp-S.
Michael S. Kaplan on 24 Feb 2008 4:58 PM:
On Windows, one-way casing does not work, and is a bad idea. Casing has to go in both directions. And stability there does allow one to add mappings if new characters are added, as they were here....
What is wrong with the encoding? It looks the same for me as it did in the email.
Michael S. Kaplan on 24 Feb 2008 6:26 PM:
One other important note is that the case insensitivity on Windows is designed to be case PRESERVING -- so that if you had the lowercase letter then it will stay that way; it will not become the uppercase one. All that the table entry will do is keep people from using both letters in the same directory, the same way that it stops "AAA" and "aaa" from both being there....
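A toy model of case-preserving but case-insensitive naming, sketched in Python. The class and its methods are hypothetical, and str.lower() stands in for the real case table only because it already maps U+1e9e to U+00df one-to-one; the NT object namespace and the file system obviously do not work this way internally:

```python
# Toy model: names are compared through a case key but stored as typed.
# CasePreservingDirectory is a made-up class; str.lower() is a stand-in
# for the real case table, not a description of it.
class CasePreservingDirectory:
    def __init__(self) -> None:
        self._entries: dict[str, str] = {}  # case key -> name as created

    def create(self, name: str) -> None:
        key = name.lower()
        if key in self._entries:
            raise FileExistsError(f"{name!r} collides with {self._entries[key]!r}")
        self._entries[key] = name  # the original spelling is preserved

    def lookup(self, name: str) -> str:
        return self._entries[name.lower()]

    def names(self) -> list[str]:
        return list(self._entries.values())


d = CasePreservingDirectory()
d.create("stra\u00dfe.txt")
try:
    d.create("STRA\u1e9eE.TXT")      # differs only by case, so it is refused
except FileExistsError as exc:
    print("refused:", exc)
print(d.names())                     # still the lowercase spelling as typed
```

The table entry only decides which names collide; it never rewrites the name that is already there.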
John Cowan on 28 Feb 2008 2:52 PM:
Your browser is too forgiving, then (as is mine, but less so).
Looking at the raw HTML, most of the sharp-S characters are mis-encoded as a C3 byte followed by "Ÿ" instead of a C3 byte followed by a 9F byte. Apparently both of us see this as a sharp-S instead of an encoding error mark followed by a LATIN CAPITAL LETTER Y WITH DIAERESIS.
But then in the quotation from me, the sharp-S is encoded with *three* bytes, EF BF BD, which is U+FFFD, the REPLACEMENT CHARACTER. I have no idea where or how that got inserted into the text.
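The byte patterns John describes are easy to reproduce in Python. This is only a sketch of the general mechanism; the last step shows one common way a U+FFFD can appear (a lone Latin-1 byte decoded as UTF-8 with replacement) and is an assumption, not a claim about what actually happened to this page:

```python
# Reproducing the byte patterns from the comment above.
sharp_s = "\u00df"

utf8 = sharp_s.encode("utf-8")
print(utf8)                          # the correct two-byte form, C3 9F

# Reading those bytes as Windows-1252 turns 9F into U+0178, the Y with
# diaeresis; writing that text back out as UTF-8 bakes the damage in.
mojibake = utf8.decode("cp1252")
print([hex(ord(c)) for c in mojibake])

# U+FFFD REPLACEMENT CHARACTER is the three bytes EF BF BD in UTF-8.
print("\ufffd".encode("utf-8"))

# Assumption for illustration: a lone Latin-1 sharp s byte (DF) decoded
# as UTF-8 with errors="replace" yields exactly one U+FFFD.
print(b"\xdf".decode("utf-8", errors="replace") == "\ufffd")
```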
My point about stability is that it's Just Wrong to convert a sharp-S into a capital sharp-S when uppercasing; it should always become SS unless the user requests otherwise.
Michael S. Kaplan on 29 Feb 2008 11:41 AM:
Okay, I think it is better now...
But in any case, Microsoft never did the upcase sharp s to SS thing in its casing table, so there is no backcompat break in what I am suggesting. And given the case-preserving nature of Windows, no existing text will be broken by the addition.
John Cowan on 4 Mar 2008 1:19 PM:
Treating standard and capital sharp-S as the same under folding is very sensible: what I am objecting to is changing sharp-S to capital sharp-S when you are explicitly converting to uppercase. If you don't convert it to SS, at least leave it alone. Capital sharp-S is a display freak, and should never appear unless explicitly asked for.
Michael S. Kaplan on 4 Mar 2008 1:32 PM:
Unless the Germans want it that way in a few years? :-)
The platform does not have the option of providing the functionality unless this entry is added -- it is like the ss=sharp s issue I have blogged about in the past -- yes, people can object, but if it is not there then what they do want in comparisons won't be able to happen. So this is a building block piece that is crucial to the functionality...