More on case insensitivity and its intuitivality

by Michael S. Kaplan, published on 2006/06/05 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/06/05/617447.aspx

I realized I needed to actually point out a little more information here to paint the full picture....

On the surface this would suggest that Apple's OS only handles the casing of ASCII letters, but that would be misleading due to several related facts:

Now I do not have a Mac or anything, but assuming the first two points are 100% true and the third point is just a little misleading, based on the stuff Geoff noted that I talked about yesterday (i.e. the case insensitivity is incomplete, rather than supported fully by some file systems and not at all by others)...

#1 The OS X file system is potentially much cooler than NTFS on Windows since it does Unicode normalization.

#2 The OS X file system is potentially much lamer than NTFS on Windows if the only case insensitivity is ASCII A-Z ~ a-z.

#3 The OS X file system potentially mitigates Point #2 above for the Latin script since it fully decomposes everything and the "cased" portion of every Latin script letter is handled in a much smaller table than Windows has to cover the same area.

#4 NTFS on Windows has a slightly higher maintenance burden: because it has all of the characters in its casing tables, it would potentially have to update those tables more often to handle new Unicode characters that are added.

#5 NTFS on Windows sees an effective mitigation in the fact that Unicode no longer adds precomposed characters that can be decomposed into already existing sequences that are canonically equivalent, meaning this advantage would not apply to future versions.

#6 If the OS X casing truly only handles ASCII A-Z then there are actually 3.6 buttloads of characters that it is not properly handling across other cased scripts from Cyrillic to Greek to Coptic to Armenian to Georgian. Note: since I honestly do not know if this is the case, I hesitate to say whether Apple is lame on this point or not -- they may be just fine. Does anyone know?

#7 NTFS on Windows is technically not case insensitive at its lowest levels, so in theory it has the same problem Geoff noted with some of the tools that are available on the Mac. The difference is that architecturally there is no way to get at this mode from within Win32, so Windows has a more complete wrapper that avoids the inconsistency that can be seen on Apple's platform....
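
To make points #2, #3 and #6 concrete, here is a minimal Python sketch. It is not Apple's algorithm or NTFS's; the function names are made up for illustration, and Python's standard `unicodedata.normalize` and `str.casefold` stand in for whatever tables a real file system would carry. It compares an "ASCII-only casing after full decomposition" rule with full Unicode case folding.

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition (Unicode Normalization Form D)."""
    return unicodedata.normalize("NFD", s)


def ascii_fold(s: str) -> str:
    """Lowercase ASCII A-Z only; every other code point passes through."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def same_name_ascii_only(a: str, b: str) -> bool:
    """Hypothetical comparison: decompose, then fold only ASCII letters."""
    return ascii_fold(nfd(a)) == ascii_fold(nfd(b))


def same_name_full(a: str, b: str) -> bool:
    """Hypothetical comparison: decompose, then apply Unicode case folding."""
    return nfd(nfd(a).casefold()) == nfd(nfd(b).casefold())


pairs = [
    ("\u00c9", "\u00e9"),            # É / é: decomposes to E or e + U+0301
    ("\u018f", "\u0259"),            # Ə / ə: no decomposition exists
    ("РОССИЙСКОГО", "российского"),  # Cyrillic: base letters are not ASCII
]
for upper, lower in pairs:
    print(f"{upper!a} vs {lower!a}:",
          same_name_ascii_only(upper, lower),
          same_name_full(upper, lower))
```

Under the ASCII-only rule only the É/é pair matches, which is the mitigation described in #3; the schwa and Cyrillic pairs match only with full case folding, which is the gap #6 worries about.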

Ok, those are all the useful conclusions I was able to think of at the moment. :-)

This post brought to you by Ə and ə (U+018F and U+0259, a.k.a. LATIN CAPITAL LETTER SCHWA and LATIN SMALL LETTER SCHWA)

FWIW, I just tried naming two files Ə and ə. OS X correctly gave me the error that "ə" was already taken. I'm not sure if you've had a chance to look at http://developer.apple.com/technotes/tn/tn1150.html but it is very informative (all about the HFS Plus format, the default format for OS X formatted volumes and the only truly supported boot volume format; UFS is somewhat supported, with severe limitations).

I also tried РОССИЙСКОГО and российского and I correctly got the error about the file name being taken.

I can't test Armenian or Georgian because I don't have a font for either, so I have no way to tell if the uppercase conversion is working.

To address these points,

1. I don't know that storing filenames normalized is such a good idea. In the case of Form D, as used by HFS+, you're liable to make strings longer. This means that a filename you think is valid may be invalid because it turns out to be too long when normalized. It could also cause a problem of not getting back the same filename you saved, possibly causing you to not recognize what you just wrote. I would argue that normalization should be applied when doing comparisons (i.e. does the file already exist?), but that the normalized form should not necessarily be what gets stored.

6. Judging by the algorithm presented at http://developer.apple.com/technotes/tn/tn1150.html#StringComparisonAlgorithm it appears as though Apple has realized that cases exist outside ASCII.

7. The case-sensitive nature of filesystems such as NTFS is certainly available from Win32. If you look up CreateFile, you'll see the flag FILE_FLAG_POSIX_SEMANTICS has the following meaning: "Indicates that the file is to be accessed according to POSIX rules. This includes allowing multiple files with names, differing only in case, for file systems that support such naming. Use care when using this option because files created with this flag may not be accessible by applications written for MS-DOS or 16-bit Windows."
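
Going back to point 1, the length problem is easy to see in a small Python sketch. This is only an illustration: the 255 UTF-16 code unit limit is my reading of TN1150 and should be treated as an assumption, `fits_after_nfd` is a made-up helper, and Python's standard NFD is not necessarily identical to the decomposition variant HFS+ actually applies.

```python
import unicodedata

# Assumed limit: 255 UTF-16 code units per name (my reading of TN1150).
ASSUMED_LIMIT = 255


def utf16_units(s: str) -> int:
    """Number of UTF-16 code units needed to store s."""
    return len(s.encode("utf-16-le")) // 2


def fits_after_nfd(name: str, limit: int = ASSUMED_LIMIT) -> bool:
    """True if the name still fits once converted to Form D."""
    return utf16_units(unicodedata.normalize("NFD", name)) <= limit


name = "\u00e9" * 200  # 200 precomposed U+00E9 characters
print(utf16_units(name))                                # 200
print(utf16_units(unicodedata.normalize("NFD", name)))  # 400
print(fits_after_nfd(name))                             # False
```

A name that looks comfortably under the limit when composed doubles in size once every character is split into a base letter plus a combining accent.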

Hey - check your posting format. Google Reader is displaying some of the markup from your posts. I'm seeing things like:

PFONT face=TahomaYesterday in 'STRONGA HREF="/michkap/archive/2006/06/04/616904.aspx"

The formatting for your posts is completely lost. All I see is a single line of wrapped text.

To everyone: case insensitivity is now (IIRC as of Windows XP SP2, possibly earlier) enforced by default, not just for files but for all named objects, regardless of subsystem and flags passed. It was a security issue. When case sensitivity was a settable flag, it was a bit too easy to hijack object names by exploiting the (undocumented) order in which names differing only in case were searched.

I tried a few more experiments in HFS+, and concur that it does indeed recognize case outside ASCII. I tried several examples from the Latin Extended Unicode block, and HFS+ consistently normalized the characters and was appropriately case-insensitive. (My testing was on OS version 10.3.9.)

Gabe, the round trip issue seems to have been specifically handled by Apple. I.e., there's a huge table (http://developer.apple.com/technotes/tn/tn1150table.html) that shows suggested sequences and whatnot. Some ranges in Unicode were also changed in how they are stored in HFS+ to more closely match HFS encodings (and to survive round trips). But remember, these technotes are mostly for people developing tools that work directly with the FS, like disk repair utilities or drivers for other platforms.

Yes, Rosyna, I see they addressed the round trip issue, but I didn't see anything that actually explained how. For example, let's say you have the filename U+00E9. It will be stored decomposed as U+0065 U+0301. When you go to read the filename back in, it will be different.

Of course they may automatically recompose the filename back into U+00E9, but then you would get back the wrong thing if you wrote the filename decomposed in the first place.
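
Here is the round trip problem as a small Python sketch. `stored_form` is a made-up stand-in for a file system that keeps names in Form D, not anything from Apple's documentation, and it uses Python's standard NFD.

```python
import unicodedata


def stored_form(name: str) -> str:
    """Hypothetical stand-in for a file system that stores names in Form D."""
    return unicodedata.normalize("NFD", name)


def same_name(a: str, b: str) -> bool:
    """Compare by canonical equivalence instead of code point by code point."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


composed = "\u00e9"     # U+00E9
decomposed = "e\u0301"  # U+0065 U+0301

read_back = stored_form(composed)
print(read_back == composed)                 # False: not what was written
print(read_back == decomposed)               # True: the decomposed sequence
print(stored_form(decomposed) == read_back)  # True: both spellings collapse
print(same_name(read_back, composed))        # True: equal once normalized
```

Once both spellings collapse to the same stored sequence, the file system can no longer tell which one the caller wrote, so any program that compares what it reads back with what it saved has to normalize both sides first.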

Well, it does work with actual accents... Using true HFSX:

http://img230.imageshack.us/img230/1387/picture1dm1.png

I've tried a couple of French accents also and it worked just fine.