Identifying illegal characters before they cross the border?

by Michael S. Kaplan, published on 2006/11/03 05:32 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/11/03/941420.aspx


Developer Lei Tan asked a question that comes up fairly often these days:

Is there a SDK document to tell what characters are not valid for file name, i.e., cannot be used in APIs to create directory, file, etc. When I try to rename a file to, say “<”, I got a tooltip saying “file name cannot have \/:*?”<>|”. Is this a complete list of invalid chars for file/dir name?

Josh Poley (a man with the fascinating address book title of EMULATION NINJA! I'd love to see the CSP on this one, and plan to ask about the possibility of a job title like LINGUISTIC NINJA myself...) pointed out the MSDN topic entitled Naming a File which I'll get back to in a moment, and Michael Grier gave a concise reality check on the notion of the hopes of a "complete" answer:

The answer is per-volume and there does not seem to be any programmatic way to get this information.

So the basic answer is “no” if you’re serious about “complete”.

In practice, files can’t have slashes in them and if you use the win32 APIs ... they also can’t have trailing spaces and a bunch of other canonicalization rules apply.

Now this answer points out the core issue, which is that any attempt to capture all the rules will always be massively incomplete, possibly incorrect, and certainly misleading to the bulk of people who read it. And that is mainly because there is no simple function one can call to query a particular volume and ask what its rules are!

"But Michael," you may complain, "What about .NET's Path.GetInvalidPathChars and Path.GetInvalidFileNameChars?"

To you (if you are one of those people!) I have to suggest that you read the topics, both of which are very clear about the fact that:

The array returned from this method is not guaranteed to contain the complete set of characters that are invalid in file and directory names. The full set of invalid characters can vary by file system. For example, on Windows-based desktop platforms, invalid path characters might include ASCII/Unicode characters 1 through 31, as well as quote ("), less than (<), greater than (>), pipe (|), backspace (\b), null (\0) and tab (\t). 

The problem here is a general and hopeful tendency to try to help developers by providing an answer, knowing that the answer is basically incomplete but only noting this fact in remarks that may never be seen and which are certainly not as visible as the methods themselves. It may often serve, but it is never really a complete answer.
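To make that gap concrete, here is a minimal Python sketch of the kind of check people build from these partial lists. It uses only the characters from the tooltip in the question, the 0 through 31 range from the quoted remarks, and the trailing-space rule mentioned earlier. The names find_known_bad_chars and looks_ok_for_win32 are hypothetical, invented for this illustration, and the result is a heuristic: a True answer means "not known to be bad", never "valid on this volume".

```python
# Characters named in the Explorer tooltip quoted in the question.
TOOLTIP_CHARS = set('\\/:*?"<>|')
# Characters 0 through 31, per the quoted documentation remarks.
CONTROL_CHARS = {chr(i) for i in range(32)}


def find_known_bad_chars(name: str) -> list:
    """Return, in sorted order, the characters in name that the partial list flags."""
    known_bad = TOOLTIP_CHARS | CONTROL_CHARS
    return sorted({ch for ch in name if ch in known_bad})


def looks_ok_for_win32(name: str) -> bool:
    """Heuristic only: False means "known bad", True means "not known to be bad"."""
    if not name or find_known_bad_chars(name):
        return False
    # Win32 canonicalization does not preserve trailing spaces, so the
    # name you would get is not the name you asked for.
    return not name.endswith(" ")


if __name__ == "__main__":
    for candidate in ["report.txt", "a<b>.txt", "report.txt ", "a\uff0fb"]:
        print(repr(candidate), looks_ok_for_win32(candidate))
```

Note that a name containing U+FF0F passes this check, which says nothing about whether a given volume, or a layer above it, will accept it.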

"But Michael," you may point out, not being dissuaded by your last attempt, "can't you just try and create a file with the characters and see what succeeds, then delete the file?"

To you (if you are one of those people!) I would recommend taking a look at the "attempt at codifying the uncodifiable" Naming a File topic that our Emulation Ninja pointed out, and then adding to it all of the additional issues like security (what if you have no create permissions, or even worse, no delete permissions, and you are littering the user's system with many files?).
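For the record, the create-and-delete idea looks roughly like the Python sketch below. The function name probe_name is hypothetical, and the sketch assumes the caller supplies a directory on the volume of interest. Each failure branch corresponds to one of the issues just listed.

```python
import os
import tempfile


def probe_name(directory: str, name: str) -> bool:
    """Try to create and then delete name in directory; True only if both worked."""
    path = os.path.join(directory, name)
    try:
        # O_EXCL makes the call fail rather than clobber an existing file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except (OSError, ValueError):
        # Rejected, but we cannot tell why: a bad name, no create
        # permission, or a file that already exists all end up here.
        return False
    os.close(fd)
    try:
        os.remove(path)
    except OSError:
        # Created but not deletable: we have just littered the volume.
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(probe_name(scratch, "plain.txt"))
```

Even a True result proves little. A layer that silently canonicalizes the name (dropping a trailing space, say) can create and delete a file whose name is not the one you asked about, and a name containing a path separator is interpreted as a path rather than tested as a name.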

So now that I have said that the attempts to either write functions or documentation to capture the problem won't solve it, what do I think would solve it?

Well, if you could go back in time and actually require a method to be implemented in the actual file system drivers that returns the rules and/or validates a particular name, and then at higher levels create a Win32 API function that adds to this list any additional rules that it would apply to such a creation, then the problem might have been possible to solve. Of course, trying to create such a requirement now will not help very much, since one can always access older volumes over a network and thus there is no 100% coverage guarantee anyway.

So we will limp along with partial solutions....

You can probably guess what I think about text from Naming a File like

"Use any character in the current code page for a name, including Unicode characters, except characters in the range of 0 (zero) through 31, or any character that the file system does not allow. A name can contain characters in the extended character set (128–255)."

and it will therefore save me some time not having to talk about it. :-)

Just kidding. I'll post another time about all of my locale and case related concerns here....

 

This post brought to you by ／ and ＼ (U+ff0f and U+ff3c, a.k.a. FULLWIDTH SOLIDUS and FULLWIDTH REVERSE SOLIDUS)


# alaw on 3 Nov 2006 8:33 AM:

Your comments about trying to identify all the 'bad' characters are familiar. With ASP.NET, an Anti Cross Site Scripting library was created recently which tries to address this - by starting with EVERYTHING being invalid, and then adding the valid characters.

Here is the link, and I think a 1.5 version will come out soon with many more capabilities

http://www.microsoft.com/downloads/details.aspx?familyid=9a2b9c92-7ad9-496c-9a89-af08de2e5982&displaylang=en

# Adam on 3 Nov 2006 8:55 AM:

Could you try to create a "FileInfo" for the requested full path? That is supposed to return an ArgumentException() if the filename "is empty, contains only white spaces, or contains invalid characters." If that doesn't know what characters are valid for the volume you're trying to create the file on, but it is filesystem-specific (as you seem to imply), how does it know when to throw that exception?

Although....how would this work with filesystems that allow filenames that consist only of whitespace (i.e., all Unix filesystems) - does it throw the exception or not? Or filesystems that allow a colon in the middle of a path? (The FileInfo constructor is supposed to throw a NotSupportedException() in this case)

Or is the .NET documentation incorrect/incomplete here?

# Adam on 3 Nov 2006 9:14 AM:

Actually... according to the Wikipedia comparison of file systems[0], FAT and NTFS all actually allow chars 1-31, as well as < >, etc... (but not ':' for NTFS) as filename characters.

So, if the set of invalid characters varies by file system, and all the underlying file systems for .NET on NT allow these characters, how come these characters are *not* allowed in .NET on NT?

Is it MSDN or Wikipedia at fault here?

[0] http://en.wikipedia.org/wiki/Comparison_of_file_systems

# Michael S. Kaplan on 3 Nov 2006 11:54 AM:

Hi Adam,

So you see the problems once you go down the path of trying to capture this in documentation? :-)

# Nick Lamb on 3 Nov 2006 4:01 PM:

Adam, Wikipedia is telling you what is permitted by the filesystem itself, but .NET is usually run on top of Win32 and Win64, which in turn are mostly based on the Win16 APIs, which inherit various semantics from MS-DOS in the 1980s.

Each layer (.NET, Win32, NT kernel VFS, NTFS driver) has its own restrictions in addition to those imposed by lower layers, and the Win32 layer in particular has a lot of restrictions intended as backwards compatibility or (especially in shell related code) as ill conceived "ease of use" improvements.

Applications written to use other subsystems (where they exist) or the NT system calls directly, have a much broader selection of file names available to them.

