Every character has a story #15: CAPITAL SHARP S (not encoded)

by Michael S. Kaplan, published on 2005/09/25 06:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/25/473632.aspx


Regular reader Maurits asked, in the Suggestion Box:

Can you comment on Andreas Stötzner's 2004 proposal for an upper-case ß code point, which was rejected by the Unicode consortium?

http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2888.pdf

The proposal in question underwent a great deal of (not always entirely civil) conversation on the "members only" list of Unicode....

I have also posted about the Sharp S before, on this blog, for example here.

The initial post about the proposal came from Markus Scherer of IBM:

Purely personal opinion:

I would have expected a proposal like this to see the light of day in a little less than 5 months...

Aside from few "discussions" and other curiosities, the majority of the document's samples shows clear _lowercase_ ß in otherwise uppercase text. Using normal ß in this way (like applying simple case mappings rather than full ones) is reasonably common. While German school children might at first scratch their heads about this irregularity, I am pretty sure that there is no pressure at all for introducing an uppercase variant - other than possibly by a local font vendor in search of a market.

It might be more likely for Germans to give up on ß than to add an uppercase version.

markus

http://www.daujones.com/comments_all.php?usrid=3504
http://faql.de/eszett.html
http://www.eibe-online.de/schulen/bfs_bensheim/darstellung_bensheim/FachbereichFarbtechnik.htm

Michael Everson then weighed in:

Look again. It shows capital sharp esses, though it does show small sharp esses in capital use because nothing else was available. The Duden evidence is not to be ignored.

People have been discussing this issue for a century. I think Stötzner has shown clear evidence for a capital sharp s.

Nobuyoshi Mori took a more technical approach to the analysis of the proposal:

My understanding is:
    1) Technically toupper( U+00DF ) should be defined.  It is currently defined as : toupper( U+00DF ) -->  U+00DF
    2) There are several ways to "display" an "uppercase ß" in German:
      2-1) "SS"    This is what German orthography says. It is also the most usual way to handle it.
      2-2) "ß"     This is used in exceptional cases when either there is no space for two characters, or for typographic reasons, or by ignorance of the correct orthography.
      2-3) "SZ"     This is an old variant of 2-1, only very rarely used.

The change of the current definition 1) breaks many existing Unicode implementations and data, and will cause compatibility issues.  The major issue is that the result of toupper( U+00DF ) becomes Unicode standard version dependent. 

I know huge amount of Unicode implementations and Unicode customer data which will run into problems with the suggested Standard change.  Most of the database implementations, OS and PC products, Computer language implementations such as Java, C#, etc would be some of the examples. 

...

I therefore would like to request UTC to refuse the proposal.

Mark Davis agreed but had one small correction:

See http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G33992

The Unicode data supports two types of case operations: full and simple. The simple operations are for restricted environments where the number of characters cannot be changed. For any other situations the full mappings should be used. And when a full mapping is used, toUppercase( U+00DF ) --> "SS"
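To make the simple/full distinction concrete, here is a minimal Python sketch. It assumes Python 3, whose built-in str.upper() applies the full mappings. The helper names full_upper and simple_upper_approx are hypothetical, and the second one only approximates a length-preserving simple mapping by leaving a character alone whenever its full uppercase form is longer than one code point. It is not the real simple-mapping table from the Unicode Character Database, just enough to show the two behaviors for U+00DF.

```python
def full_upper(text: str) -> str:
    """Full case mapping: the result may be longer than the input."""
    return text.upper()


def simple_upper_approx(text: str) -> str:
    """Length-preserving approximation of a simple case mapping.

    Assumption for illustration only: a character whose full uppercase
    form expands to several code points is left unchanged.
    """
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


if __name__ == "__main__":
    word = "Stra\u00dfe"  # "Straße"
    print(full_upper(word))           # STRASSE (one character longer)
    print(simple_upper_approx(word))  # STRAßE (same length as the input)
```

The length-preserving variant is the kind of operation a fixed-width buffer would need, which is the "restricted environment" case described above; everything else would normally use the full mapping.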

Markus Scherer was also unconvinced by Michael Everson's response:

I did look at the whole document including at each and every sample. Most of them are clearly lowercase ß between uppercase letters.

"People" may have been talking about it depending who "people" is. I spent my first 27 years in Germany and have never heard of any serious discussion of an uppercase ß. (Not sure I even need to qualify this with "serious".) Unless there has since then been an outcry in the population that I missed while visiting about once a year or while talking with my relatives, I don't see that this is on anyone's mind.

Real issues in discussion included the spelling of Kaiser (Keiser?) and other beloved words when the spelling reform was published.

Michael Everson responded thusly:

Which is what you would expect to find, in the absence of a more widespread availability of fonts with capital sharp esses. The evidence, and the author's arguments, suggest that the sharp ess is certainly acquiring case, and indeed has done so, whether the use of it is widespread or not. In my view, the Universal Character Set should encode such entities where they exist. They are facts.

Ken Whistler had to respond to that argument, though (I was tempted to respond myself, but I am glad he did instead, since he argued the case more convincingly):

You are getting caught up in your own rhetoric about the UCS. The Universal Character Set is *not* the Universal Encyclopedia of Writing Systems -- it is a practical attempt at an engineering solution that everyone can use for digital representation of text.

Introduction of a capital ess-tzet just because it "is a fact", and despite the manifest evidence of overwhelming German practice and implementation to the contrary, while utterly ignoring all the kinds  of implementation problems that would result -- just hinted at by Nobu -- is just foolish.

The problem you are trying to deal with, namely the appearance of a majuscule design in some fonts for an ess-tzet in an all-uppercase context, can be dealt with by other techniques, specific to fonts and to word-processing systems (if there even proves to be a demand for it, which I doubt, given Markus' testimony). It does not require a muley insistence that because somebody shows in some context that it *might* be treated as a distinct uppercase letter, that that resolves all issues and makes it obvious that separate encoding is required in the Unicode Standard for this "thing".

I am getting *really* impatient with the kind of rhetorical stance you have taken here. It is not your job nor the job of the UTC nor WG2 to reform the German writing system. And it certainly is not in the UTC's interest to introduce, on spec, a dubious new German character, without a demonstrated need, but with a horrendous downside potential for screwing up German casing implementations. My prediction is that the UTC is quite likely to turn this one down flat, without a single member in favor of it.

As I said, much more convincing.... :-)

But it looks like everyone dug their heels in; Michael responded to Ken:

As a student of the world's writing systems, I maintain that what I said is true. The evidence, and  the author's arguments, suggest that the sharp ess is certainly acquiring case, and indeed has done so, whether the use of it is widespread or not.

This may be an issue for some, or many, or most, current implementations. That's a concern for industry today. My work on the Universal Character Set, as you know, looks to tomorrow.

Not being a complete idiot, my response to Stötzner on this particular character is "Get Germany and Austria behind the proposal."

But facts are facts. You recently wrote a piece where you acknowledged that many of Unicode/10646's current "mistakes" will one day be purged. A hack for casing sharp-ess would seem to be one such. Stötzner's Weise, Weisse, Weiße casing to WEISE, WEISSE, WEISSE/WEIßE is a problem German implementations have to deal with now. I strongly suspect that the "solutions" are not AT ALL uniform or satisfactory to the Mr Whites out there. A capital sharp-ess would allow a consistent solution, and would, in my view, be superior to some sort of smart-font hack where a sharp-ess preceded by a capital letter would take on a different shape. That is not very portable, and, if I may remind you, from the 10646 side we are concerned with data preservation and transfer, not just implementation by big companies.

Yes, these are philosophical differences in the two standards, but they are there nonetheless.

>I am getting *really* impatient with the kind of  rhetorical stance you have taken here. It is not your job nor the job of the UTC nor WG2 to reform the German writing system.

No, the Germans have been looking at that themselves. In 1902 they did, and Stötzner is doing it again today. That's also a fact.

>And it certainly is not in the UTC's interest to introduce, on spec, a dubious new German character, without a demonstrated need, but with a horrendous downside potential for screwing up German casing implementations.

I wouldn't encode the character on foot of this one proposal either. But there is a case to be made for this character, and it would be wrong to reject the proposal out of hand.
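The Weise, Weisse, Weiße example quoted above is easy to reproduce. The following is a small Python sketch, assuming Python 3 and its built-in full case mapping in str.upper(); the function name group_by_uppercase is hypothetical. It only illustrates the collision under that one mapping and says nothing about how any particular German implementation behaves.

```python
from collections import defaultdict


def group_by_uppercase(words):
    """Group words by their full uppercase form (str.upper())."""
    groups = defaultdict(list)
    for word in words:
        groups[word.upper()].append(word)
    return dict(groups)


if __name__ == "__main__":
    words = ["Weise", "Weisse", "Wei\u00dfe"]  # the last one is "Weiße"
    for upper, originals in group_by_uppercase(words).items():
        print(upper, "<-", originals)
    # "Weisse" and "Weiße" both end up under "WEISSE",
    # so the uppercase form alone cannot tell them apart.
```

Lowercasing the result does not recover the original either: under the same mapping, "WEISSE" goes back to "weisse" regardless of which spelling it came from.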

Others such as Asmus Freytag and Benson Margulies also chimed in agreeing that an answer of 'proposal insufficient' seemed best at this point.

John Hudson then mentioned:

By the way, I met Andreas Stötzner at the recent ATypI conference in Prague, and am familiar with his journal. He is an intelligent and reasonable man, and I doubt if he would be insistent about the encoding of a Capital Double S if the text encoding and processing impact were explained to him. He has documented, in an admirably thorough way, a development in *some* German typography, which needs to be addressed at some level of text encoding or display. It is not obvious that the best way to do this is to encode a new character.

There was a bit more discussion, but it kind of petered out without any real sense of consensus.

Shortly thereafter, at the November 2004 UTC meeting in Cupertino, CA, a bunch of discussion ensued, but in the end Consensus 22 happened:

[101-C22] Consensus: The UTC concurs with Stoetzner that Capital Double S is a typographical issue. Therefore the UTC believes it is inappropriate to encode it as a separate character.

and it was added to the Rejected Characters list with the following comment:

LATIN CAPITAL LETTER DOUBLE S (existence as character not demonstrated; would cause casing problems for legacy German data)

I probably would not have worded the consensus in just that way, but the end result would have been the same....

Three months later, a thread came up on the Unicode List about the Sharp S and uppercasing it, which mainly dwelt on issues other than adding the character. So I will spare everyone the further conversation. :-)


# Mathias Raacke on 25 Sep 2005 6:53 AM:

I'm German and I can't remember any text where I would have needed an uppercase version of ß. It would not make much sense. So in my opinion, it's not necessary to change anything in Unicode. The "SS" alternative would be best, but I think compatibility with existing Unicode implementations is more important.

# CornedBee on 25 Sep 2005 9:49 AM:

As a native of Austria, I can honestly claim that 99% of the cases I've seen used SS, not some sort of uppercase ß. And I still have an old version of DKT (which is effectively the same game as Monopoly) which writes every single street name as -STRASZE (capitalization of Straße, street).
That particular game is over thirty years old, though.

# Chris Nahr on 26 Sep 2005 2:56 AM:

The varying SS/SZ interpretations are due to the murky history of sharp s. The voiceless s was originally a single letter in Middle High German, written as z or z with a tail. When Germany standardized on the Roman alphabet and Fraktur writing/printing, sharp s was disambiguated from lowercase z by prepending a long s. The combination looks like this: ß.

So SZ is typographically correct but SS is the correct sound... well, except that a double s should shorten the preceding vowel but a sharp s doesn't necessarily do this!

Personally I think it would be neat if Unicode capitalized ß to a single glyph that contains two capital S, perhaps written closer together than usual. That would be a workable compromise...

# Suzanne McCarthy on 26 Sep 2005 10:34 AM:

Great story, Mike, thanks.

Suz

# Jonas Grumby on 26 Sep 2005 11:30 AM:

Discussions about that character come up pretty frequently on comp.lang.c++.* (given that its troubles aren't really specific to C++). My guess is because it's one of the trickier bits you have to deal with while supporting just western European text. See http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/5943768f1cf35f5d/5f047b0e543ace52?lnk=st&q=%C3%9F&rnum=1&hl=en#5f047b0e543ace52 for an example.

# Maurits [MSFT] on 26 Sep 2005 12:12 PM:

Wow.

So there are three interpretations of ß (oversimplifying:)

1. ß is a character in its own right
2. ß is a ligature of "medial s"/"terminal s": ss
3. ß is a ligature of "medial s"/z: sz

Hmmm... why not make each of these interpretations their own code point?
1. ß-proper (existing code point)
2. ß-ss (new code point)
3. ß-sz (new code point)

Then "ß-proper".ToUpper() can be this new proposed "big ß"
"ß-ss".ToUpper() can be "SS"
and
"ß-sz".ToUpper() can be "SZ"

It is true that there is no convenient reverse mapping for the latter two cases - but this is typical of ligatures, and no particular cause for alarm. For example, "ﬁ".ToUpper().ToLower() != "ﬁ".ToLower()
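The round-trip point about ligatures can be checked with a short Python sketch. This assumes Python 3's str.upper() and str.lower(), which use full case mappings, and it uses the single-code-point ligature U+FB01 (LATIN SMALL LIGATURE FI). The helper name round_trip_differs is hypothetical, and the result may not match what .NET's ToUpper() and ToLower() do.

```python
def round_trip_differs(text: str) -> bool:
    """True if upper-then-lower gives a different result than lower alone."""
    return text.upper().lower() != text.lower()


if __name__ == "__main__":
    samples = {
        "fi ligature (U+FB01)": "\ufb01",
        "sharp s (U+00DF)": "\u00df",
        "plain ASCII fi": "fi",
    }
    for label, text in samples.items():
        print(label, round_trip_differs(text))
```

Under these mappings the ligature and the sharp s both fail to round-trip, while the two-letter ASCII string does.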

# Michael S. Kaplan on 26 Sep 2005 2:50 PM:

Given the past history, you will have a hard time finding people to be interested in *that* idea, Maurits!

# Scott on 11 Oct 2009 10:56 PM:

Thank you very much for your diligent documentation of the various viewpoints and issues.



referenced by

2011/08/26 Every character has a story #34: LATIN LETTER T WITH CEDILLA (U+0162/U+0163)

2009/07/28 Every character has a story #32: U+1e9e (CAPITAL SHARP S, Microsoft edition - Part 1)

2008/05/15 A celebration of the LATIN CAPITAL LETTER SHARP S

2008/04/15 Kind of ironic how Germany seems so okay with Capital *Letter* punishment, huh?

2008/02/24 The idea has to do more than just make sense to me (aka How S-Sharp are *you* feeling today?)

2007/08/24 Every character has a story #28: U+1e9e (CAPITAL SHARP S)

2007/05/03 Every character has a story #26: CAPITAL SHARP S (might be encoded?)
