by Michael S. Kaplan, published on 2005/01/16 03:19 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/16/353873.aspx
Tor Lillqvist noted that in some of my previous entries on casing (cf: Get off my [lower] case! (or: Casing, the 1st) and The [Upper]Case of the Turkish İ (or: Casing, the 2nd)) I made some hints about the casing table and NTFS. He goes on to ask:
The "Get off my [lower] case" entry mentions the issue a bit, but on the other hand from that one gets the impression that the case-insensitiveness of NTFS would be handled by code in the Windows kernel.
If I have understood correctly, in fact there is this (hidden from casual view) file called $UpCase on *each* NTFS volume that defines the case-insensitiveness of file names on *that* volume. The file contains an uppercase mapping from the characters in the BMP to *single-character* uppercase ones. (Thus the German ess-zet in a filename doesn't map to SS, etc.)
This file is created when the volume is formatted, and I guess never modified even if newer Windows versions are installed or whatever? What would be interesting to know is what determines the contents of this file, and how it has changed in various NT versions, whether the mapping depends on the locale of the user (or default locale of the machine) doing the formatting, etc. I tried asking for this in the microsoft.public.win2000.file_system group, but no replies...
Well, the summary of what happens does indeed refer to the magical hidden file 10, also known as $UpCase. You can actually find out lots about the internals by searching on Google (having had a chance to look at (a) those resources, (b) available specs on NTFS, and (c) NT source code, I can state pretty definitively that the external resources are easier to understand!).
Anyway, the file ($UpCase) contains a copy of the very same uppercase information that NLS defines, burned into the drive at the time the drive is formatted.
The NLS uppercasing table is only occasionally updated and it is really only updated when new code points are added to Unicode, which is fairly rare for letters that have case mapping.
What Tor notes about the single code point only mapping is true, and it is really true for the NLS casing in general (and not just the NTFS casing).
The locale settings on the machine have no effect on this setting, ever.
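To make the single code point restriction concrete, here is a minimal Python sketch of a table shaped like the one described above: one uppercase entry per BMP code point, with any character whose full uppercase form is longer than one code point left mapped to itself. It is built from Python's own Unicode data, which is an assumption for illustration only; that data is newer than the NLS table and will not match it entry for entry. The names build_single_code_point_upcase, upcase_name and same_name are hypothetical helpers, not Windows APIs.

```python
BMP_SIZE = 0x10000  # one table entry per BMP code point


def build_single_code_point_upcase():
    """Return a list where table[cp] is the single-code-point uppercase of cp.

    Assumption: Python's Unicode data stands in for the real casing table.
    """
    table = list(range(BMP_SIZE))  # start with the identity mapping
    for cp in range(BMP_SIZE):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code units have no case mapping
        upper = chr(cp).upper()
        # Keep only one-to-one mappings; 'ß'.upper() is 'SS', so ß stays ß.
        if len(upper) == 1 and ord(upper) < BMP_SIZE:
            table[cp] = ord(upper)
    return table


def upcase_name(name, table):
    """Uppercase a name one code point at a time through the table."""
    return "".join(
        chr(table[ord(ch)]) if ord(ch) < BMP_SIZE else ch for ch in name
    )


def same_name(a, b, table):
    """Compare two names the way a table-driven, case-insensitive lookup would."""
    return upcase_name(a, table) == upcase_name(b, table)


UPCASE = build_single_code_point_upcase()

print(same_name("Readme.TXT", "README.txt", UPCASE))
print(same_name("straße", "STRASSE", UPCASE))
```

Under these assumptions the first comparison matches and the second does not, because ß has no single-code-point uppercase form and so is never folded to SS.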
Looking at the file across many versions (just a quick glance, I could be missing details):
Looking ahead:
I am a huge proponent of finishing up the Georgian fix (removing the bogus lowercase mapping), which will of course have no effect on the filesystem (which only keeps that uppercase mapping anyway).
I am also a fan of picking up those new characters that have been added to Unicode since the last update to the table (since the apparent partial case sensitivity of NTFS gets more and more visible over time as we miss more and more new letters that do have a case mapping). This would let us get in stuff like U+01bf (ƿ, a.k.a. LATIN LETTER WYNN) that is not there now but which ought to be mapped to U+01f7 (Ƿ, a.k.a. LATIN CAPITAL LETTER WYNN).
Now this last one is pretty much the main reason why NTFS stores the $UpCase table as it does, independent of what the NLS tables do: you have to be able to trust that the files you save on a disk will be available tomorrow or when you update your operating system. It still does make the change more controversial, especially with people who do not use the characters in question (since we only look truly stupid to the people who actually use these letters)....
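A small Python sketch of why a per-volume snapshot matters. It assumes a hypothetical older table that lacks the WYNN mapping and a newer one that has it (both derived from Python's Unicode data, not from the real NLS or $UpCase contents), and counts how many distinct names two existing files have under each table. The helper names snapshot_table, name_key and distinct_names are made up for this example.

```python
BMP_SIZE = 0x10000


def snapshot_table(missing=()):
    """Build a single-code-point uppercase table.

    Code points listed in 'missing' are left unmapped, to imitate an older
    table that predates those letters. This is an illustration only.
    """
    table = list(range(BMP_SIZE))
    for cp in range(BMP_SIZE):
        upper = chr(cp).upper()
        if len(upper) == 1 and ord(upper) < BMP_SIZE:
            table[cp] = ord(upper)
    for cp in missing:
        table[cp] = cp  # this letter has no case mapping in the old table
    return table


def name_key(name, table):
    """The uppercased form used to decide whether two names are the same."""
    return "".join(
        chr(table[ord(ch)]) if ord(ch) < BMP_SIZE else ch for ch in name
    )


def distinct_names(names, table):
    """How many different files these names refer to under a given table."""
    return len({name_key(n, table) for n in names})


# Hypothetical table written to the volume at format time (no WYNN mapping).
volume_table = snapshot_table(missing=[0x01BF])
# Hypothetical table shipped by a later OS version (WYNN mapping added).
newer_table = snapshot_table()

# Two files that can coexist on the volume as formatted.
existing = ["\u01bf.txt", "\u01f7.txt"]

print(distinct_names(existing, volume_table))  # distinct under the volume's table
print(distinct_names(existing, newer_table))   # would collide under the newer one
```

With the volume's own table the two names stay distinct; if the file system switched to the newer table, they would collapse into one name and one of the files would become unreachable by name. Keeping the table on the volume avoids that.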
Now on my wish list would be an API that provides the file system casing results -- not just for NTFS but for any file system, even case sensitive ones like CDFS. But it is unclear whether such a function even could exist given all of the differences between file systems. But I guess that's why we call it a wish list. :-)
1 - Also known as Windows 2000
2 - Also known as Windows XP
3 - Also known as Windows Server 2003
This post brought to you by "ѝ" and "Ѝ" (U+045d a.k.a. CYRILLIC SMALL LETTER I WITH GRAVE and U+040d a.k.a. CYRILLIC CAPITAL LETTER I WITH GRAVE)
Neither of the sponsors are in the NLS casing tables either, and they seem a little bitter about that!
# Tor Lillqvist on 16 Jan 2005 3:29 PM:
# Michael Kaplan on 16 Jan 2005 4:40 PM:
# Jonathan Wilson on 17 Jan 2005 2:04 AM:
# Michael Kaplan on 17 Jan 2005 4:50 AM:
# Michael Grier on 17 Jan 2005 1:36 PM:
# Michael Kaplan on 17 Jan 2005 2:04 PM:
# Jonathan Wilson on 18 Jan 2005 5:50 AM:
# Paul Sanders on 29 Jan 2009 3:17 PM:
OS X has a case insensitive filesystem, I notice. Not sure how long this has been the case (sic), but I like it! It's how humans' brains work (well, mine anyway).
# bobince on 8 Aug 2009 7:13 PM:
Do you know what case tables are used internally for ordering subkeys (and uniqueness) in Registry hives (ie. in the li/ri/lf/lh records)? I've noticed they're always in a case-insensitive order, but there is no NLS data stored in the hive structure as far as I can see, so it's presumably not using the same strategy as NTFS.