The last word on the FINAL SIGMA

by Michael S. Kaplan, published on 2005/05/26 00:01 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/05/26/421987.aspx


Back at the beginning of April, I wrote about the one scenario where casing does not need to roundtrip in .NET -- the Greek final sigma.

Anyway, the day before yesterday I got an email from someone who had been reading my blog and was looking at all of the one-way mappings that are in the linguistic tables (accessed with the LCMAP_LINGUISTIC_CASING flag, which I have discussed previously). He was wondering why that FINAL SIGMA could not be put into the linguistic tables since it is a one-way mapping.

A fair question, one I thought worthy of a post. :-)

If you are a native speaker of Greek, then you know that both ς (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA) and σ (U+03c3, a.k.a. GREEK SMALL LETTER SIGMA) do indeed uppercase to Σ (U+03a3, a.k.a. GREEK CAPITAL LETTER SIGMA). But if we moved the ς mapping into the linguistic table, then suddenly ς would never be uppercased by the CharUpper/CharUpperBuff functions, nor by the default call to the LCMapString function with the LCMAP_UPPERCASE flag (that is, one made without LCMAP_LINGUISTIC_CASING).
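To make that concrete, here is a rough sketch in Python rather than against the Win32 API. It leans on Python's built-in str.upper(), which follows the Unicode case mappings and not the Windows tables, so treat it only as an illustration. The helper upper_skipping is made up for this example: it imitates a default table that has no entry for ς.

```python
FINAL_SIGMA = "\u03c2"    # ς GREEK SMALL LETTER FINAL SIGMA
SMALL_SIGMA = "\u03c3"    # σ GREEK SMALL LETTER SIGMA
CAPITAL_SIGMA = "\u03a3"  # Σ GREEK CAPITAL LETTER SIGMA


def upper_skipping(text, skipped):
    """Uppercase one character at a time, leaving `skipped` characters alone.

    Hypothetical helper: it stands in for a default casing table from
    which the mappings for `skipped` have been removed.
    """
    return "".join(ch if ch in skipped else ch.upper() for ch in text)


# "σας" contains both lowercase sigmas: σ at the start, ς at the end.
word = SMALL_SIGMA + "\u03b1" + FINAL_SIGMA

# Both sigmas map to the same capital letter.
print(FINAL_SIGMA.upper() == CAPITAL_SIGMA)  # True
print(SMALL_SIGMA.upper() == CAPITAL_SIGMA)  # True

# With the mapping present, the whole word is uppercased.
print(word.upper())

# With the mapping missing from the default table, ς is left behind.
print(upper_skipping(word, {FINAL_SIGMA}))
```

The last line prints a word whose first two letters are capitals and whose final letter is still ς, which is the Greek equivalent of the HELLo result described below.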

Obviously that would not be a good thing.

Try to imagine how you would feel if attempting to uppercase the string hello came out as HELLo. Wouldn't you consider it a bug? Especially if it used to come out with the HELLO you were expecting? You might be thinking about telling the platform GooDBYE, if you know what I mean.

Of course, ideally the lowercasing functions would notice whether the Σ was at the end of a word and then choose ς or σ accordingly. But LCMapString does not really look beyond the character level here, so until it does, that is not really an option.

A more sophisticated application might of course work to provide results beyond the character boundary. I do not envy such programs, though; the boundary for them becomes quite fuzzy if you have non-Greek characters after the ς. Does that count as a new word or doesn't it? That is the kind of question where an API can never win -- no matter which way it goes, there will be some people who do not like the answer.
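For what it is worth, a context-aware lowercase along those lines might look something like the Python sketch below. Everything in it is an assumption made for illustration: the function name lower_with_final_sigma is invented, and the rule it applies (a Σ becomes ς only when a letter precedes it and no letter follows it) is just one of several defensible choices. It is not what LCMapString does.

```python
FINAL_SIGMA = "\u03c2"    # ς
SMALL_SIGMA = "\u03c3"    # σ
CAPITAL_SIGMA = "\u03a3"  # Σ


def lower_with_final_sigma(text):
    """Lowercase `text`, picking ς for a Σ that ends a run of letters.

    Assumed rule: Σ becomes ς only if the previous character is a letter
    and the next character is not a letter (or there is no next character).
    """
    out = []
    for i, ch in enumerate(text):
        if ch == CAPITAL_SIGMA:
            prev_is_letter = i > 0 and text[i - 1].isalpha()
            next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
            if prev_is_letter and not next_is_letter:
                out.append(FINAL_SIGMA)
            else:
                out.append(SMALL_SIGMA)
        else:
            out.append(ch.lower())
    return "".join(out)


upper_word = CAPITAL_SIGMA + "\u0391" + CAPITAL_SIGMA  # "ΣΑΣ"

# At the end of the string, or before punctuation, the last Σ becomes ς.
print(lower_with_final_sigma(upper_word))
print(lower_with_final_sigma(upper_word + "!"))

# The fuzzy case: a Latin letter directly follows, so this rule picks σ.
print(lower_with_final_sigma(upper_word + "x"))
```

The last call is the one where somebody will be unhappy: this sketch treats any following letter as part of the same word, Greek or not, and a rule that only counted Greek letters would give the other answer.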

Anyway, that is why the uppercasing of ς is not confined to the linguistic table: there are too many cases where the results simply would not make sense, at least not as things are implemented currently....

 

This post brought to you by "ς" (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA)
A character that wonders whether Unicode would have been simpler if it did not exist as an independent entity, and fonts could then decide whether to make it a "final" form or not....


# Maurits on Thursday, May 26, 2005 1:42 PM:

I suppose English has the same problem with medial s's?

# Michael S. Kaplan on Thursday, May 26, 2005 3:27 PM:

Well, we don't sort or case those differently...

# Maurits on Thursday, May 26, 2005 3:53 PM:

Sorry, I meant "ſ"
LATIN SMALL LETTER LONG S
http://www.fileformat.info/info/unicode/char/017f/index.htm

AKA "medial s"

Looks like Greek and German both have problems with multiple "s"-es - luckily English abandoned the long s prior to computers becoming popular.

# Michael S. Kaplan on Thursday, May 26, 2005 3:59 PM:

Ah yes, *that* character.

In Windows today, we do not case it at all. In Windows tomorrow there are interesting conversations about whether to put it into the linguistic table or not. Admittedly these conversations have more heat and less light, but we are working toward a conclusion....

# Maurits on Thursday, May 26, 2005 5:10 PM:

From a mathematical point of view (if I may...)
Define U to be the ToUpper operator.
Define L to be the ToLower operator.

The naive expectation of a typical user is that UL and LU will be the identity operator. As far as I can make out, this is unfixably broken (or at least made much more difficult) by such things as ligatures and medial forms. So I find it reasonable to expect that UL != LU for these problem characters.

However, I would like to count on U == ULU for everything... and L == LUL for everything. In other words, though U and L are not strictly inverses, it would be nice if they were at least stable.

In particular, I would like to see L(U("ſ")) = "s", and L(U("ß")) = "ss" - or perhaps "sz". I suppose I'm asking for a way to escape the problem cases by pushing things through a s.LowerCase().UpperCase() operator...

Is this sensible?

# Michael S. Kaplan on Thursday, May 26, 2005 5:18 PM:

Well, I won't say your expectations are not sensible. But they do not match the current behavior, which does only imple Unicode casing with no extra context rules.

# Michael S. Kaplan on Thursday, May 26, 2005 5:19 PM:

Replace imple with simple. :-)

# Michael S. Kaplan on Thursday, May 26, 2005 5:33 PM:

Oh, I do slightly disagree that a user who is sophisticated enough to call an API is one who we would class as "naive" :-)

# Maurits on Thursday, May 26, 2005 7:26 PM:

I think we're at cross purposes over the term "user".

I meant "user" as in someone who:
* Installs an application I write
* Puts text in a box
* Chooses Format | Case | UPPERCASE
* Thinks "Hmmm, no, I don't like that"
* Chooses Format | Case | lowercase
* Thinks "Hey, what happened to my capital D in Donald" or "Hey, what happened to my ß in straße" or ...
* Immediately calls me to inform me that my application corrupted their data

I probably shouldn't have used the word "user" - as an API writer, you probably hear "user" and think "application developer." I meant to say "end user". :)

# Michael S. Kaplan on Thursday, May 26, 2005 7:45 PM:

Unfortunately, the answer is the same there -- the API does not handle either of those cases. Certainly we have no way of supporting proper casing anyway, but we do not currently handle strings that would increase the size of the buffer....

If you need support like that, you have to build it yourself -- we are very low tech here. :-)

# Maurits on Friday, May 27, 2005 4:41 PM:

On this capitalization note, I'd like you to know that I've created an Evil Small-Caps Test for browsers:
http://www.geocities.com/mvaneerde/small-caps.html
(The "evil" is a nod to Ian Hickson's "evil" CSS test)

referenced by

2009/07/30 I know I'll Never say Never... again, at least

2008/06/25 Seeing the tears, my heart went out to her as I asked her "Why the Long S?"

2007/09/14 How do I feel about lstrcmpi? I think it blows....

2007/06/12 The difference between 'Dangeous Characters' and 'Dangerous Minds' is the lack of Michelle Pfeiffer

2005/06/24 LCMapString's *other* job
