Get off my [lower] case! (or: Casing, the 1st)

by Michael S. Kaplan, published on 2004/12/02 00:01 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2004/12/02/273619.aspx


"Trying a case the second time is like eating yesterday morning’s oatmeal." -- Lloyd Paul Stryker

Of course, when Stryker said this he was referring to trying cases in court. But believe it or not, the quote applies just as well to alphabetical casing operations!

For reasons that surpass my understanding, many developers run code that uppercases the result of a lowercasing operation (or vice versa) as a way of taking any one-way mappings out of the equation. They usually do this to try to match the NTFS filesystem, which does not seem to them to have any problems with one-way mappings. That is just a little bit dumb, since the filesystem works by simple uppercasing alone. Because of that, the only true one-way mappings that make these developers' code behave differently from the filesystem are the lowercasing ones.

The problem is most acute in one particular situation (one that seems to come up frequently when people combine the incorrect technique above with extensive test passes). Basically, Georgian has spent many years in the Unicode Standard with only two scripts encoded, even though three scripts exist. As of Unicode 4.1, the plan is to finally add that third script and provide a two-way case mapping between the two older scripts (the uppercase Khutsuri and the newly added lowercase Nuskhuri), leaving the modern (Mkhedruli) script completely caseless.

Now, Mkhedruli has always been caseless according to Unicode, so they are fine. We at Microsoft are not so fortunate: ever since Windows 2000, Microsoft has had a one-way mapping from uppercase Khutsuri to Mkhedruli, but no converse mapping from the caseless Mkhedruli to Khutsuri. No worries for NTFS (which only uppercases), but developers who run through two casing operations will get incorrect results back.
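To make the failure concrete, here is a minimal sketch in Python. The two tables are a hypothetical toy model of the behavior described above (a lowercasing entry with no matching uppercasing entry); they are not the real Windows casing tables, and the names `TOY_LOWER`, `TOY_UPPER`, `toy_lower` and `toy_upper` are made up for illustration.

```python
# Toy model only: two tiny tables imitating the one-way mapping described
# in the post. These are NOT the actual Windows casing tables.
KHUTSURI_AN = "\u10A0"   # GEORGIAN CAPITAL LETTER AN
MKHEDRULI_AN = "\u10D0"  # GEORGIAN LETTER AN

# Hypothetical lowercasing table: includes the one-way Khutsuri -> Mkhedruli entry.
TOY_LOWER = {KHUTSURI_AN: MKHEDRULI_AN, "A": "a"}

# Hypothetical uppercasing table: nothing maps Mkhedruli back to Khutsuri.
TOY_UPPER = {"a": "A"}


def toy_lower(s: str) -> str:
    """Lowercase character by character using the toy table."""
    return "".join(TOY_LOWER.get(ch, ch) for ch in s)


def toy_upper(s: str) -> str:
    """Uppercase character by character using the toy table."""
    return "".join(TOY_UPPER.get(ch, ch) for ch in s)


name = KHUTSURI_AN + "a"

# What an uppercase-only filesystem would compare against.
upper_only = toy_upper(name)

# What the "lowercase first, then uppercase" approach produces.
round_trip = toy_upper(toy_lower(name))

# The Khutsuri letter survives uppercasing alone, but the round trip
# has silently turned it into the Mkhedruli letter.
print(upper_only == round_trip)  # False
```

With uppercasing alone the Khutsuri letter is left untouched. Once the lowercasing step has run, there is no table entry to bring it back, so the two results no longer match.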

(Obviously, removing the bogus mapping some time between now and when we have to do the new mapping for Unicode 4.1 would be a good thing, for both Windows and the .NET Framework. More on that when I know more....)

The moral of the story? If you want to mimic the filesystem, then skip the step that the filesystem skips -- no lowercasing!
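As a rough sketch of what "no lowercasing" could look like in code, the Python below compares two names by uppercasing only. It rests on assumptions: Python's `str.upper()` uses the Unicode data shipped with the interpreter, not the casing table stored on an NTFS volume, so results may differ from a real volume. The helper names `simple_upper` and `same_name` are hypothetical, and skipping mappings that would change the string length is just one way to approximate a one-character-to-one-character table.

```python
def simple_upper(s: str) -> str:
    """Uppercase each character on its own, keeping the original
    character whenever the mapping would expand to several characters."""
    out = []
    for ch in s:
        up = ch.upper()
        # Keep one-to-one mappings only; leave expanding ones untouched.
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


def same_name(a: str, b: str) -> bool:
    """Compare two names by uppercasing only, with no lowercasing step."""
    return simple_upper(a) == simple_upper(b)


print(same_name("ReadMe.txt", "README.TXT"))  # True
print(same_name("a.txt", "b.txt"))            # False
```

The point of the design is simply that there is one casing pass in one direction, so a lowercasing-only mapping never gets a chance to change the result.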

Future "casing" columns will talk about other fun and un-fun casing factoids, issues, problems, and features in Windows and the .NET Framework.


# Eusebio Rufian-Zilbermann on Thursday, December 02, 2004 6:44 AM:

Very interesting. Yesterday I was trying to explain why upper/lowercasing operations are a bad idea and better avoided whenever possible. Now I can add Georgian to the list of reasons, along with Turkish dotted I's :)

# Norman Diamond on Tuesday, December 21, 2004 7:02 PM:

NTFS only does uppercasing? Which uppercasing? If I understand correctly, the question of whether e with acute accent uppercases to plain E or to E with acute accent depends on the country where the uppercasing is performed. Is it possible to transport a hard disk from one country to another? Or to share it across an international network? What happens to the uppercased filenames?

# Michael Kaplan on Tuesday, December 21, 2004 10:07 PM:

I am only dealing with casing of Unicode characters, as defined by the Microsoft casing tables. As I point out in another post, if the "linguistic casing" flag is not passed then no one-way mappings are seen anyway.

But the case mapping info is stored with the hard drive format operation, so even future casing table changes would not impact this....

# Norman Diamond on Tuesday, December 21, 2004 11:48 PM:

12/21/2004 10:07 PM Michael Kaplan

> As I point out in another post

Sorry I didn't see it yet.

> case mapping info is stored with the hard
> drive format operation

Hmm. Then if a USB hard drive is formatted by a system running Windows XP under one country's locale and then attached to a system running Windows XP under a different country's locale, that NTFS partition keeps its filenames reliably. I'd like to make a few comparisons.

Under Windows NT4 and 9x, if a filename had been written by a German system it might be inaccessible by a Japanese system and vice-versa. It was possible for each system to write files to the same partition, and then no version of Windows Explorer would be able to delete all of the filenames and no version of Scandisk would be able to adjust (sometimes called "fix") all of the filenames. Actually this could also happen just between US and Japanese, since US Windows installers created filenames containing a single-byte character that looks roughly like "1/2". But I mostly saw that in FAT.

I guess you're saying that if NTFS were used, then sometimes it would not be possible for an executing Windows system to create a file in its own language, because the stored mapping information would prevent it. But you're not quite saying that, because my experience is with different language characters and here we should only be talking about different casing rules. But even when it's casing, I guess you're saying that sometimes an executing Windows system will be unable to create a file in its own national version of a language, for example a filename containing E accent acute when the NTFS partition had been formatted under a locale that doesn't accent capital vowels.

What happens when a network is involved? I've seen clients create files in network shares using the code page of the client instead of the server, and then the server can't access its own file. With NTFS this would be impossible? Or only when it's a casing issue it would be impossible?

# Michael Kaplan on Wednesday, December 22, 2004 12:02 AM:

No, you have misunderstood.

I am referring to NT-based platforms, and to NTFS.

Different machines with different locale settings will not get different results. It's the same casing table. Thus there is no issue with locale-specific behavior, as there is no locale-specific casing on Windows other than the Turkic casing rules, which are never applied to NT filesystems.

# Norman Diamond on Wednesday, December 22, 2004 1:44 PM:

> I am referring to NT-based platforms

So was I when mentioning XP and NT4. Sorry I mentioned 9x one time. Sorry I mentioned FAT one time too.

> different locale settings will not get
> different results

Then my previous questions revive. When e accent acute is uppercased in some countries it becomes E accent acute but in some countries it becomes plain E. Which uppercasing does NTFS do? And whichever it does, is the resulting file accessible when the disk is moved or shared to a machine whose locale does the other kind of uppercasing? But, umm...

> there is no locale-specific casing on
> Windows other than the Turkic

Then how do you contend with national variants of French etc.?

# Michael Kaplan on Wednesday, December 22, 2004 1:50 PM:

How do I contend with national variants of French casing *personally*? I don't.

Windows does not, either. See my post from the next day that talks about the so-called "linguistic" results, which is where this sort of thing might go if it were in there.

But it's not there, so it's not an issue.

Of course anyone is free to create a filesystem that is incompatible with changes in user or system settings.... but I will stick with NTFS.

# Norman Diamond on Thursday, December 23, 2004 5:28 PM:

> How do I contend with national variants of
> French casing *personally*?

I meant how does Microsoft make Windows contend with them.

> Windows does not, either. See my post from
> the next day

OK I read it now. (Hadn't read it before because, although aware of Turkish, I hadn't needed to worry about it, though was glad that Windows contended with it.)

I think I do see the answer in that posting. You have cases where Windows sometimes doesn't boot. So I guess that in varying French locales, although (I guess) problems will not be as severe as failing to boot, the answer is that Windows will have problems.

> I will stick with NTFS

Sure. So if a program running under one locale uppercases a filename and creates that file in NTFS, and then a program running under a slightly different locale uppercases the same filename to a different result and then can't open the file, tough. I'm a bit surprised, but I understand.

In order to find the answer to one of my questions (when NTFS uppercases a filename does e accent acute convert to E accent acute or does it convert to plain E), I guess I'll have to do some experimenting.

# Michael Kaplan on Thursday, December 23, 2004 7:18 PM:

No need to try -- NTFS does not handle this case, nor does Windows. If there is an acute in one, the acute stays in the other, in all locales.

# Norman Diamond on Thursday, December 23, 2004 7:57 PM:

> nor does Windows

You mean that e accent acute uppercases to E accent acute even in locales for French in countries that don't use accents on uppercase letters? Don't you get complaints from those countries?

# Michael Kaplan on Thursday, December 23, 2004 8:19 PM:

Yes, that is what I mean. And no, we do not get complaints, either in France or Canada (of the two, Canada has a more active community of people looking at standards).
