Comparing Unicode file names the right way
by Michael S. Kaplan, published on 2005/10/17 00:31 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/10/17/481600.aspx
Michael Brown asked the following question in the microsoft.public.win32.programmer.international:
What is the correct way to compare two unicode filenames to see if they would end up referring to the same file? I'm trying to merge several filename lists that may have the same file name but with different capitalisation, and want to avoid duplicates. The files may or may not exist on the system, which makes things a little more complicated. CompareString looks to be a good start, but there's lots of different possibilities for the options and I have no idea which configuration would be correct for comparing two filenames.
Any hints appreciated!
If you are a regular reader, then you know why (as I read his question) I realized that my team and I have a big job ahead of us....
CompareString, which I am quick to champion as an awesome way to do all sorts of things, is not a good start here -- it is most assuredly the worst start possible if one is trying to mimic the filesystem's rules. Think of the following contrived situation, a directory that contains all of the following files in it:
Now, this is something perfectly legal for the file system (in fact, one can randomly intersperse capital letters in there as well to appear to violate the rules of case sensitivity, but violating the rules of de facto identical appearance seems more impressive to me!). It is not a new topic here (many will remember the post Normalization as Obfuscation in C# where I made every single identifier act as a different underlying character encoding of the same basic string), but I thought it would be a good idea to point out the issue as it applies to file systems too....
(Anyone with memory of their probability and statistics course work will be able to calculate the exact number of permutations that the file system will allow in this simple 10 character string, and that is even before other characters are added to the mix that look similar to us or to CompareString but not to the OS -- something I could post about some other day if there was interest)
Now, by the rules of Unicode decomposition, the rules of Unicode normalization (where the first string is Form C and the last is Form D), and the rules of the Win32 CompareString function, all of these strings are identical, and (assuming that you have a new enough browser) they do in fact look the same.
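To make the Form C versus Form D point concrete, here is a minimal sketch in Python (a stand-in language, not anything from the original post). The file name is hypothetical, chosen only because it contains "õ"; the directory listing from the post is not reproduced here.

```python
import unicodedata

# Hypothetical file name; "õ" is U+00F5 in its precomposed (Form C) spelling.
name_nfc = unicodedata.normalize("NFC", "S\u00f5me file.txt")
name_nfd = unicodedata.normalize("NFD", name_nfc)

# Form D splits "õ" into "o" (U+006F) followed by COMBINING TILDE (U+0303),
# so the two spellings differ in length and in their code points.
print([hex(ord(c)) for c in name_nfc[:3]])
print([hex(ord(c)) for c in name_nfd[:3]])

# A binary comparison sees two different names...
binary_equal = name_nfc == name_nfd
# ...while a comparison that normalizes first sees a single name.
normalized_equal = unicodedata.normalize("NFC", name_nfd) == name_nfc
print(binary_equal, normalized_equal)
```

A comparison that honors Unicode equivalence collapses the two spellings into one, while anything that looks at the raw code units keeps them apart.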
One could of course argue that the file system ought to consider possibly respecting the Unicode rules, but that is an argument for another day (and yes, it is a true 'My dear boy...' situation if ever there was one!).
The argument for today is that the CompareString function is not the way to compare file names, and I think the above example debunks the notion nicely. :-)
For the file system, an "UpCase and Binary Compare" approach is the best way to go to mimic the behavior, and this rule should stand for at least as long as the file system is not using Unicode normalization or a comparison method that tries to respect the equivalences it creates.
For the uppercasing operation, you can use CharUpper, CharUpperBuff, or LCMapString with the LCMAP_UPPERCASE flag (and without the LCMAP_LINGUISTIC_CASING flag!). Binary comparisons are easily done with functions like memcmp and wmemcmp, of course.
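As a rough illustration of "UpCase and Binary Compare", here is a sketch in Python rather than the Win32 calls named above. Everything in it is an assumption made for illustration: the helper names `upcase_simple`, `same_file_name`, and `merge_name_lists` are hypothetical, and Python's case data is only an approximation of whatever table the file system actually uses, so individual characters may well be treated differently on a real volume.

```python
def upcase_simple(name: str) -> str:
    """Uppercase one character at a time without ever changing the length.

    Assumption: this approximates a one-to-one upcase table. Characters whose
    full uppercase form is longer (such as "ß" -> "SS") are left unchanged,
    and so are characters outside the Basic Multilingual Plane.
    """
    out = []
    for ch in name:
        up = ch.upper()
        if len(up) == 1 and ord(ch) <= 0xFFFF and ord(up) <= 0xFFFF:
            out.append(up)
        else:
            out.append(ch)
    return "".join(out)


def same_file_name(a: str, b: str) -> bool:
    """Upcase both names, then compare them code point for code point.

    No normalization is applied, so Form C and Form D spellings stay distinct.
    """
    return upcase_simple(a) == upcase_simple(b)


def merge_name_lists(*lists):
    """Merge name lists, keeping the first spelling seen of each distinct name."""
    seen = set()
    merged = []
    for names in lists:
        for name in names:
            key = upcase_simple(name)
            if key not in seen:
                seen.add(key)
                merged.append(name)
    return merged


# Names differing only in case collapse; Form C and Form D spellings do not.
print(same_file_name("ReadMe.TXT", "readme.txt"))
print(same_file_name("S\u00f5me.txt", "So\u0303me.txt"))
print(merge_name_lists(["a.txt", "B.txt"], ["A.TXT", "c.txt"]))
```

The design choice worth noting is what the sketch leaves out: there is no locale, no linguistic casing, and no normalization step, which is exactly why it treats the Form C and Form D spellings as two different names.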
Now even when I talked about the FAT/FAT32 oddness on Windows 2000, I proved that the underlying file system allowed such differences, even if a higher-level process in the Win32 API was following different rules. Obviously the behavior of the higher level process was considered to be a bug and it was fixed in later versions, so the underlying OS behavior clearly has rules even if every once in a while somebody unknowingly uses a function like CompareString when they oughtn't. :-)
This post brought to you by "õ" (U+00f5, a.k.a. LATIN SMALL LETTER O WITH TILDE)
# Mihai on Monday, October 17, 2005 5:27 AM:
Only that (if my memory serves me well), there is a way to enable POSIX compatibility for the file system.
This means case-sensitive file system.
And yes, the documentation for the setting was accompanied by a lot of warnings :-) It can break a lot of things!
# Andreas Magnusson on Monday, October 17, 2005 9:29 AM:
Wow, cool, I was just thinking about this actual thing for an application I'm currently writing! Although I was leaning towards the binary upper/lower-case comparison even before I read it.
# Nick Lamb on Monday, October 17, 2005 9:40 AM:
The underlying NTFS design is POSIX style (case sensitive, no forbidden characters), but the Windows fs driver is case-insensitive using a simple fix case and compare style approach iirc. There is a small additional overhead for this.
The really /huge/ overhead comes from Win32, which implements a huge number of additional confusing rules (e.g. no colons), the enforcement of which, as Michael has sort-of observed, varies considerably. On top of the actual Win32 API there are some extra rules which seem to be observed only by a few UI components like Explorer (e.g. no runs of punctuation). You can skip all of Win32's rules by using NT APIs instead, or by going through a different subsystem (e.g. POSIX).
It's easy these days to find Windows systems where 3rd party software ported from Unix has created files that Explorer can list but won't otherwise deal with, or where other 3rd party software can't open certain filenames that appear legitimate in Explorer... and that's before even mentioning removable media.
The end result is that the initial question is best answered as "try to avoid doing that". The Apache HTTPd and Microsoft's IIS groups have wrestled extensively with this problem on Windows for security reasons. They now both have solutions that work well enough for their product, but it took a lot of work to get there. Still, Michael's solution is a good first approximation if correctness isn't vital.
Meanwhile if as a developer you find a Windows API that seems to be in a world of its own when it comes to filenames, please report that through the proper channels.
# Nick Lamb on Monday, October 17, 2005 10:54 AM:
"no forbidden characters"
Of course it goes without saying that there's an implicit "... except for U+0000 aka NUL" here. U+002f slash isn't actually a forbidden character, but obviously no POSIX system is going to let you sneak the path separator into a filename, at least not a real one. U+2044 and U+2215 should be allowed, probably even getting past Explorer if anyone would like to try.
# J. Daniel Smith on Monday, October 17, 2005 11:48 AM:
How about a Compare() method on System.IO.FileSystemInfo that does this the right way?
# Mike Dunn on Monday, October 17, 2005 1:03 PM:
When using LCMapString to do the uppercasing, what's the right LCID to use? LOCALE_INVARIANT?
The CharUpperBuff docs say it "uses the language driver for the current language selected by the user at setup or by using Control Panel" which sounds to me like a language-sensitive operation that would have the same problems as using LCMAP_LINGUISTIC_CASING.
# Maurits on Monday, October 17, 2005 2:57 PM:
> no POSIX system is going to let you sneak the path separator into a filename, at least not a real one
Back in the Mac OS 9 days, I had an NT server with Services for Macintosh serving an AFP file share. Macs on the LAN were always copying files up to the share that had slashes in the file names.
Mac OS 9 (and previous) uses : for the path separator... is that why : is disallowed in Windows filenames?
From the Windows alert box:
A filename cannot contain any of the following characters:
\ / : * ? " < > |
\ and / are path separators...
* and ? are "Find" metacharacters...
" is the quote character for filenames with spaces...
<, >, and | are the indirection operators...
which leaves : as the odd one out. A nod to Macs?
# Maurits on Monday, October 17, 2005 3:03 PM:
# Andy on Monday, October 17, 2005 3:35 PM:
Does anybody know any compression/archiving tools that handle such filenames?
I'd love to create a folder with a bunch of these 'special' filenames myself and then 'zip'-it to create a single file with a simple name and send it to some of our developers.
ZIP files don't seem to be applicable here, so what is?
# Mike Dunn on Monday, October 17, 2005 5:03 PM:
The colon also separates the filename from the stream name. If C: is NTFS then this:
means the stream named "bar" in the file named "foo.txt"
# Michael S. Kaplan on Monday, October 17, 2005 6:24 PM:
"When using LCMapString to do the uppercasing, what's the right LCID to use? LOCALE_INVARIANT?"
Funny you should ask that -- there will be the answer in a post tomorrow! :-)
# Michael S. Kaplan on Monday, October 17, 2005 6:34 PM:
Hey J. Daniel!
"How about a Compare() method on System.IO.FileSystemInfo that does this the right way? "
Well, we are talking about unmanaged code at the moment, right? :-)
For managed code OrdinalIgnoreCase is good enough for now, at least until the FS exposes better methods....
# Michael S. Kaplan on Monday, October 17, 2005 7:31 PM:
# J. Daniel Smith on Tuesday, October 18, 2005 9:52 AM:
In <3 weeks, there's not going to be the big gulf between managed & unmanaged code (I'm talking about VS2005 & C++/CLI).
So if you show us the unofficial 100% correct "static int Compare(FileSystemInfo a, FileSystemInfo b)" in managed code, it will soon be fairly straight-forward to use it from unmanaged code.
(Of course, this assumes that you're OK doing such a thing...there may be other reasons for staying completely in unmanaged code).
# Michael S. Kaplan on Tuesday, October 18, 2005 10:37 AM:
Since Whidbey was frozen long ago and is now frozen rock solid for new features, and since managed code would use OrdinalIgnoreCase here, I am not sure what you are getting at, J. Daniel?
# J. Daniel Smith on Tuesday, October 18, 2005 11:28 AM:
Yes, I realize Whidbey is frozen harder than the Antarctic ice pack…I thought the "static" function taking two arguments made things clear; I guess not. Sorry.
Show us the exact C# code for your FileNameCompare() utility function; that way there can be no confusion as to the proper technique. My preference would be to take stronger-typed FileSystemInfo parameters (rather than just strings)…and also to indicate that eventually such code should perhaps be part of that class.
# Michael S. Kaplan on Tuesday, October 18, 2005 3:51 PM:
Ah, the reasons I am resistant to *that* particular path are the subject of another blog entry, coming soon! :-)