Every character has a story #33: U+1e9e (CAPITAL SHARP S, Microsoft edition - Part 2)

by Michael S. Kaplan, published on 2009/07/29 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2009/07/29/9851698.aspx


You may want to start with Part 1 of this series....

Okay, here we go. Last time some of the font stuff was covered.

It is nice when letters are given the opportunity to look good.

In the end, what can often be more important than how something looks is how it arrives, and how it behaves!

I'll divide this up into a few parts.

Part 1: INPUT

This letter, LATIN CAPITAL LETTER SHARP S, was not added to the German keyboard.

So if you wanted to input it, then you are on your own.

Good luck with that for the average user. Oh wait, the average user knows there is no such letter anyway. Never mind, the lack of a way to input the letter is okay....

Part 2: COLLATION

Well, if you go back to blogs like "Dere are qvestions? In zat case..." and the one that started it all, "What the %#$* is wrong with German sorting?", the following collation equivalence already existed:

ß == ss

Note that now, just as was the case then, I am not stating any special kind of linguistic truth.

I am simply pointing out that if you want ß to be treated as equivalent to SS when you ignore case, then you have to recognize that ss is also equivalent to SS when you ignore case. This relationship, which looks so unnatural to a native speaker, is what allows meaningful results to be returned.

Now that we have a new uppercase form of the letter, let's see where it fits.

As Peter Gibbons pointed out yesterday:

But what's more important is that the collation algorithms seem to process "ẞ" right. At least in explorer with filenames.

To make this happen, the following equivalences are needed:

ß == ss

ẞ == SS

Now with these in place, everything behaves the way people expect it to, even if the above equivalences are like fingernails on the chalkboard to folks in Germany.

Points given for intuitive results, right? :-)
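To make the two equivalences concrete, here is a minimal Python sketch. It is only an illustration of the idea, not the Windows collation tables: the `EXPANSIONS` table and the helper names `expand` and `equal_ignoring_case` are made up for this example.

```python
# Hypothetical expansion table mirroring the two equivalences above.
# This is an illustration only, not how Windows collation is implemented.
EXPANSIONS = {
    "\u00DF": "ss",  # ß == ss
    "\u1E9E": "SS",  # ẞ == SS
}


def expand(text):
    """Replace each sharp s with its two-letter equivalent."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in text)


def equal_ignoring_case(a, b):
    """Compare two strings after expansion, ignoring case."""
    return expand(a).lower() == expand(b).lower()


if __name__ == "__main__":
    for left, right in [("Straße", "STRASSE"), ("STRA\u1E9EE", "strasse"), ("\u1E9E", "\u00DF")]:
        print(left, right, equal_ignoring_case(left, right))
```

Once case is ignored, ß, ẞ, ss and SS all land in the same bucket, which is the "intuitive results" part.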

Anyway, there you have it.

Though, as I pointed out way back in "Collation != case", collation is still not case.

So now we get to part 3.

Part 3: CASE

The wheels are gonna come off the wagon a bit here, I will admit.

Basically, it seems that the casing relationship between ẞ and ß was added, but only in the linguistic tables, and only to say that ẞ lowercases to ß.

This means that two files that differ only by their use of one letter versus the other can co-exist in the same directory.
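A rough Python sketch of why the two names do not collide. Python's `str.lower()` and `str.upper()` follow the Unicode full case mappings, so they stand in here for "linguistic" casing. The `simple_upper` helper is a hypothetical approximation of one-to-one ("simple") uppercasing that is good enough for these two letters but not correct for every character, and `names_collide` is likewise invented for illustration. None of this is the actual NT case table.

```python
CAPITAL_SHARP_S = "\u1E9E"  # ẞ
SMALL_SHARP_S = "\u00DF"    # ß


def simple_upper(ch):
    """Approximate a one-to-one uppercase mapping for a single character.

    Assumption: if the full uppercase mapping expands to several
    characters (as it does for ß), treat the character as having no
    simple uppercase and return it unchanged.
    """
    up = ch.upper()
    return up if len(up) == 1 else ch


def names_collide(a, b):
    """True if two names are equal under the approximate simple uppercasing."""
    return [simple_upper(c) for c in a] == [simple_upper(c) for c in b]


if __name__ == "__main__":
    # Linguistic direction: the capital lowercases to the small letter...
    print(CAPITAL_SHARP_S.lower() == SMALL_SHARP_S)
    # ...but the small letter does not uppercase back to the capital.
    print(SMALL_SHARP_S.upper())
    # Under one-to-one uppercasing the two letters stay distinct.
    print(names_collide("stra\u00DFe.txt", "stra\u1E9Ee.txt"))
```

With a one-to-one table in which neither letter maps to the other, a name containing ß and the same name containing ẞ compare as different, so both files can sit in one directory.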

This kind of sucks in terms of intuitive results, so I am going to hope that the Part 1 issue (the inability to input the letter) can be relied on. Though that is technically an obfuscationary solution, I won't think about the potential security and spoofing issues of these letters that look so similar in so many fonts. Instead I'll focus on the lack of built-in input and the fact that it is technically someone else's problem now (as this ant in Alaska likes to frame the situation!).

I am on the record (for what it's worth) as to how I felt the situation should have been handled here:

But on the other hand, the case table is used in order to enforce the case insensitivity in the NT object namespace and the file system. And one clear issue is that there is no good reason to allow one to put filenames differing only by the presence of U+00df and U+1e9e in the same directory. Users would either never try it or they would never expect it to work. So it is quite possible that in the next version of Windows (which only does simple casing) it may make the most sense to make the two characters case variants of each other -- to enforce reasonable use of both letters!

There is still lots of time to decide, though at present I am leaning this way since it will give the most intuitive behavior for end users (even at the expense of giving slightly unintuitive results for developers).

Ah well, coulda woulda shoulda. Or whatever.

I did accidentally discover an unrelated thing, a thing that I'll talk about tomorrow....


# Mihai on 29 Jul 2009 3:33 PM:

"ẞ == SS"???

In what locale would that be?

Feels really wrong.

# John Cowan on 29 Jul 2009 4:19 PM:

Case folding should have been treated as a legacy ASCII feature in the file system, and not extended to the whole Unicode world.

# Michael S. Kaplan on 29 Jul 2009 8:01 PM:

John,

1) I disagree with you.

2) Many other people do as well.

3) This is hardly a legacy filesystem thing -- this is the core behavior of the NT object namespace and has been since NT 3.1.

4) Even if all of that was untrue, reversing behavior that has existed since 32bit Windows first started being used (1993!) is an unrealistic goal.

# Michael S. Kaplan on 29 Jul 2009 8:07 PM:

Hey Mihai,

Simple logic.

If UCase("ß") == "SS" and UCase("ss") == "SS" then that means that "ß" == "ss".

And further if UCase("ß") = "ẞ" as well, then "ẞ" = "SS".

Follow the logic trail and you'll get there too, I doubt the formal proof would be needed! :-)

# Carl on 30 Jul 2009 12:35 AM:

I don't even see why we need to have one file per filename per directory anymore. Sure, files need to have canonical names behind the scenes, but those could be UUIDs for all the user cares. Put in enough scaffolding, and it could all be hidden under a layer of pseudo-localization.

Bring back WinFS!

# Mihai on 30 Jul 2009 12:46 PM:

But why UCase("ß") = "ẞ"?

Why would a German character map to a Turkish one?

# Michael S. Kaplan on 30 Jul 2009 1:07 PM:

It is not Turkish -- that is the Capital Sharp S.

# Peter Gibbons on 27 Aug 2009 7:45 AM:

SQL Server 2008 R2 (August 2009 CTP) doesn't seem to have collations that sort "ẞ" right.

# Michael S. Kaplan on 27 Aug 2009 8:57 AM:

I didn't expect that there would be, to be honest (other than binary collations, of course). But I will inquire....


