Seeing the tears, my heart went out to her as I asked her "Why the Long S?"

by Michael S. Kaplan, published on 2008/06/25 13:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/06/25/8652525.aspx


Over the last few years, quite a few of my blogs have mentioned the LCMAP_LINGUISTIC_CASING flag for LCMapString:

There are indeed others too -- these are just the ones I remembered off the top of my head.

Of these, the most important two, in my opinion, are What does "linguistic casing" mean? (which explains what the flag does, conceptually) and İn tıtlıng thıs ınclusıon ın re: the ınterests of Turkısh İSVs, am İ just tryıng to buıld İ's and ı's ınto the tıtle of thıs daıly contrıbutıon to SİAO (SıaO), amıgo? (which gives the actual one-way mappings that the flag adds to the casing table).

These one-way mappings really tend to be kept out of the default casing table, except for the Greek final sigma (for reasons I explain in The last word on the FINAL SIGMA), and there are really good reasons for this -- because of the destructive way that people use casing.

Not just the destructive ways that people within Microsoft use it, e.g. No Regex in the Unicode room! and 'The 44' (*not* 'The 4400'), but even outside of Microsoft.

For some reason people feel it makes sense to do case-insensitive comparisons by changing the case and then comparing.

Even though this is slower and even though it is destructive to the original string, people like to do this anyway.
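To make the problem concrete, here is a minimal sketch in Python. It is an illustration only: Python's str.upper() and str.lower() follow the Unicode Character Database rather than the Windows casing tables, and the helper names equal_by_upper and equal_by_lower are made up for this example. It shows that once a one-way mapping is involved, "change the case and compare" gives a different answer depending on which direction you convert.

```python
# Illustration only: Python's casing data comes from the Unicode Character
# Database, not from the Windows tables behind LCMapString.
# equal_by_upper / equal_by_lower are hypothetical helper names.

def equal_by_upper(a: str, b: str) -> bool:
    """Compare two strings after uppercasing both (throws away the originals' case)."""
    return a.upper() == b.upper()


def equal_by_lower(a: str, b: str) -> bool:
    """Compare two strings after lowercasing both."""
    return a.lower() == b.lower()


DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# Uppercasing maps both the dotless i and the plain i to U+0049, so they "match".
print(equal_by_upper(DOTLESS_I, "i"))

# Lowercasing leaves both untouched, so they do not match.
print(equal_by_lower(DOTLESS_I, "i"))
```

The two helpers disagree on the same pair of strings, which is the kind of surprise a one-way mapping brings to code that compares by converting.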

Of course in managed code, the distinction between passing LCMAP_LINGUISTIC_CASING and not passing it does not exist, unless you use the invariant casing support like I mentioned in Comparing Unicode file names the right way.

Which is not to say that all is nirvana now. There are a few bugs still, and there are some entries missing. For example:

The upside is that even though people might for example expect ſ (U+017F, aka LATIN SMALL LETTER LONG S) to become S (U+0053, aka LATIN CAPITAL LETTER S) in an uppercasing operation, it won't on Windows or .NET. Not just in the default table (which no one would really want, even if they think they would), but also not in the LCMAP_LINGUISTIC_CASING tables, where having it might have been nice.
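For comparison, the expectation people have does match what the Unicode Character Database says. The sketch below uses Python, whose str.upper() and str.lower() draw on that data, so it says nothing about what LCMapString or .NET return. It only shows why the mapping is one-way: the long s goes up to S, but S comes back down as the ordinary s.

```python
# Illustration only: this is Python's Unicode-based casing, not the Windows
# default table and not the LCMAP_LINGUISTIC_CASING table.

LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S

upper = LONG_S.upper()  # what the Unicode data gives as the uppercase form
back = upper.lower()    # lowercase it again to attempt a round trip

print(f"upper: U+{ord(upper):04X}")  # code point of the uppercase result
print(f"back:  U+{ord(back):04X}")   # code point after lowercasing again
print("round trip preserved the long s:", back == LONG_S)
```

The round trip does not give the original character back, which is why a mapping like this is risky in a default table that people use for comparisons.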

This even pops up in bizarre places, like Word's uppercasing conversion or HTML/CSS text transformation support.

Though of course many of them suffer from the same issue inherent in not passing LCMAP_LINGUISTIC_CASING -- which is that even if the mappings existed, they wouldn't have been seen anyway!

When I think about it all, Microsoft did something really awful to the word "linguistic" with the LCMAP_LINGUISTIC_CASING flag, in doing (and not doing) so many decidedly non-linguistic things....

 

This blog brought to you by ſ (U+017f, aka LATIN SMALL LETTER LONG S)

