How [case-]insensitive (apologies to Frank Sinatra)

by Michael S. Kaplan, published on 2005/01/16 03:19 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/16/353873.aspx


Tor Lillqvist noted that in some of my previous entries on casing (cf: Get off my [lower] case! (or: Casing, the 1st) and The [Upper]Case of the Turkish İ (or: Casing, the 2nd)) I made some hints about the casing table and NTFS. He goes on to ask:

The "Get off my [lower] case" entry mentions the issue a bit, but on the other hand from that one gets the impression that the case-insensitiveness of NTFS would be handled by code in the Windows kernel.

If I have understood correctly, in fact there is this (hidden from casual view) file called $UpCase on *each* NTFS volume that defines the case-insensitiveness of file names on *that* volume. The file contains an uppercase mapping from the characters in the BMP to *single-character* uppercase ones. (Thus the German ess-zet in a filename doesn't map to SS, etc.)

This file is created when the volume is formatted, and I guess never modified even if newer Windows versions are installed or whatever? What would be interesting to know is what determines the contents of this file, and how it has changed in various NT versions, whether the mapping depends on the locale of the user (or default locale of the machine) doing the formatting, etc. I tried asking for this in the microsoft.public.win2000.file_system group, but no replies...

Well, the summary of what happens does indeed refer to the magical hidden file 10, also known as $UpCase. You can actually find out lots about the internals by searching on Google (having had a chance to look at (a) those resources, (b) available specs on NTFS, and (c) NT source code, I can state pretty definitively that the external resources are easier to understand!).

Anyway, the file ($UpCase) contains a copy of the very same uppercase information that NLS defines, burned into the drive at the time the drive is formatted.

The NLS uppercasing table is only occasionally updated and it is really only updated when new code points are added to Unicode, which is fairly rare for letters that have case mapping.

What Tor notes about the single code point only mapping is true, and it is really true for the NLS casing in general (and not just the NTFS casing).
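To make the single code point restriction concrete, here is a minimal sketch in Python. It approximates a one-to-one uppercase table over the BMP from Python's own Unicode data; that is an assumption for illustration only, not the NLS table and not the contents of any real $UpCase file. The helper names `build_simple_upcase_table` and `upcase_name` are made up for this example.

```python
def build_simple_upcase_table():
    """Approximate a BMP-only, one-code-unit-to-one-code-unit uppercase table.

    Assumption: Python's str.upper() stands in for the real casing data.
    Any character whose uppercase form is longer than one code unit
    (such as the German ess-zet) is simply left mapped to itself.
    """
    table = list(range(0x10000))          # start with the identity mapping
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue                      # surrogates never have a case mapping
        up = chr(cp).upper()
        if len(up) == 1 and ord(up) < 0x10000:
            table[cp] = ord(up)           # keep only single-character results
    return table


def upcase_name(name, table):
    """Uppercase a file name one character at a time using the table."""
    return "".join(
        chr(table[ord(ch)]) if ord(ch) < 0x10000 else ch
        for ch in name
    )


table = build_simple_upcase_table()
folded = upcase_name("straße.txt", table)   # ess-zet survives unchanged
full = "straße.txt".upper()                 # full Unicode casing expands it
```

With a table like this, `folded` keeps the ess-zet as a single character, while the full Unicode uppercasing in `full` expands it to two letters, which is the difference Tor describes.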

The locale settings on the machine have no effect on this setting, ever.

Looking at the file across many versions (just a quick glance, I could be missing details):

Looking ahead:

I am a huge proponent of finishing up the Georgian fix (removing the bogus lowercase mapping), which will of course have no effect on the filesystem (which only keeps that uppercase mapping anyway).

I am also a fan of picking up those new characters that have been added to Unicode since the last update to the table (since the apparent partial case sensitivity of NTFS gets more and more visible over time as we miss more and more new letters that do have a case mapping). This would let us get in stuff like U+01bf (ƿ, a.k.a. LATIN LETTER WYNN) that is not there now but which ought to be mapped to U+01f7 (Ƿ, a.k.a. LATIN CAPITAL LETTER WYNN).

Now this last one is pretty much the main reason why NTFS stores the $UpCase table as it does, independent of what the NLS tables do: you have to be able to trust that the files you save on a disk will be available tomorrow or when you update your operating system. It still does make the change more controversial, especially with people who do not use the characters in question (since we only look truly stupid to the people who actually use these letters)....
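A small Python sketch of why a per-volume table matters. The two tables here are hypothetical stand-ins (an ASCII-only table, and the same table with the wynn mapping added), and `fold` and `same_file` are made-up names; none of this is the actual NTFS or NLS data.

```python
def fold(name, table):
    """Uppercase each character through the table; unmapped ones stay as-is."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in name)


def same_file(a, b, table):
    """Two names refer to the same file if they fold to the same string."""
    return fold(a, table) == fold(b, table)


# Hypothetical table burned in at format time: ASCII letters only.
old_table = {cp: ord(chr(cp).upper()) for cp in range(0x80)}

# Hypothetical updated table: same data plus U+01BF -> U+01F7 (wynn).
new_table = dict(old_table)
new_table[0x01BF] = 0x01F7

lower_name = "\u01bfynn.txt"   # starts with LATIN LETTER WYNN
upper_name = "\u01f7ynn.txt"   # starts with LATIN CAPITAL LETTER WYNN

distinct_under_old = not same_file(lower_name, upper_name, old_table)
collide_under_new = same_file(lower_name, upper_name, new_table)
```

Under the old table the two names are different files, so a volume could legitimately hold both. If the table under that volume were swapped for the new one, the two names would suddenly collide, which is the kind of surprise a table stored on the volume avoids.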

Now on my wish list would be an API that provides the file system casing results -- not just for NTFS but for any file system, even case sensitive ones like CDFS. But it is unclear whether such a function even could exist given all of the differences between file systems. But I guess that's why we call it a wish list. :-)

1 - Also known as Windows 2000
2 - Also known as Windows XP
3 - Also known as Windows Server 2003

 

This post brought to you by "ѝ" and "Ѝ" (U+045d a.k.a. CYRILLIC SMALL LETTER I WITH GRAVE and U+040d a.k.a. CYRILLIC CAPITAL LETTER I WITH GRAVE)
Neither of the sponsors is in the NLS casing tables either, and they seem a little bitter about that!


# Tor Lillqvist on 16 Jan 2005 3:29 PM:

Yes, an API to find out the exact case mapping that would be used for a certain file name in a certain directory would be nice. Unfortunately, I doubt whether it would be feasible to implement it. For local file systems, sure, but considering CIFS, it would require CIFS enhancements, wouldn't it?

# Michael Kaplan on 16 Jan 2005 4:40 PM:

Indeed, that is true.

Ok, how about a scaled back wish list item -- a casing function that would properly handle the NTFS case, using the $UpCase table on a specified drive? Obviously that exists for code deep down in the NTFS redirector, so exposing it is not impossible....

# Jonathan Wilson on 17 Jan 2005 2:04 AM:

This begs the question as to why NTFS (and win32 in general including the long filename extensions to FAT) didn't do what just about every unix filesystem since the dawn of time has done and have full case sensitivity with filenames.
It would store a filename with the exact case that the program making the file I/O call requested.
So if the program called the file DoCuMeNt.TxT that is what would be stored to the disk.

# Michael Kaplan on 17 Jan 2005 4:50 AM:

Jonathan, it is, it is. This functionality is just what is used to do the testing. But case is preserved as people type it in for the long file name.

# Michael Grier on 17 Jan 2005 1:36 PM:

The reason that it's infeasible to make these changes for existing volumes is that it would involve recreating the indices (since they're stored in the canonicalized case as well as the preserved case, as Michael points out).

Maybe someone can convince the FS folk to add a command to resync the casing tables with the current OS but then what happens when you take that volume down to an older OS and mount it?

I've always meant to experiment with a volume that had some wacky case conversions stored in $UpCase to see how much system code broke due to its assumption that the kernel's notion of case insensitive has to match the filesystem's.

# Michael Kaplan on 17 Jan 2005 2:04 PM:

Yes, I would never suggest making these changes for existing volumes....

Though the question that comes up is this: if the NLS tables (which is to say the OS tables) are updated, as they were between NT 3.51 and NT 4.0, then what is the result?

This is something I would like to see tested out, so we can see what the impact is. Given how different our case tables are from Unicode's, even for characters that can be in common use for customers....

# Jonathan Wilson on 18 Jan 2005 5:50 AM:

I know NTFS and FAT32 and such do preserve case.
But if they were properly case sensitive (such that Document.txt and document.txt were different files like they are on every unix filesystem I have had experience with), there would be no reason to need to worry about case in the filesystem.
The filesystem (or its drivers, OS layers etc) wouldn't need to care that U+00C3 and U+00E3 are a matched pair of capital and lowercase letters. It would be up to the application and the user to deal with any issues to do with filename casing.

I do understand that doing this now after all this time would never be feasible, but I am curious as to why win32 is (as far as I am aware) the only OS/filesystem that preserves the case of filenames but is not specifically Case Sensitive where the case of filenames matters to the OS and to the file I/O layers.

# Paul Sanders on 29 Jan 2009 3:17 PM:

OS X has a case insensitive filesystem, I notice.  Not sure how long this has been the case (sic), but I like it!  It's how humans' brains work (well, mine anyway).

# bobince on 8 Aug 2009 7:13 PM:

Do you know what case tables are used internally for ordering subkeys (and uniqueness) in Registry hives (ie. in the li/ri/lf/lh records)? I've noticed they're always in a case-insensitive order, but there is no NLS data stored in the hive structure as far as I can see, so it's presumably not using the same strategy as NTFS.


