How are the file names encoded?

by Michael S. Kaplan, published on 2006/09/10 11:53 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/09/10/748699.aspx


So, the question that Dixon asked was:

Can you tell me how Windows XP encodes its filename/directory name? Is it already UTF-8?

(I assume we are talking about NTFS here)

It is definitely not in UTF-8.

Furthermore, it is not in UCS-2, since you can have a filename with a supplementary character in it.
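To make the UCS-2 point concrete, here is a minimal Python sketch (the helper names `utf16_code_units` and `to_surrogate_pair` are made up for illustration, not part of any Windows API). It shows that a supplementary character such as U+10485 takes two 16-bit code units in UTF-16, which is exactly what UCS-2 has no way to express:

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


kha = "\U00010485"  # OSMANYA LETTER KHA
print([f"{u:04X}" for u in utf16_code_units(kha)])       # ['D801', 'DC85']
print([f"{u:04X}" for u in to_surrogate_pair(0x10485)])  # ['D801', 'DC85']
```

A file name containing this one character therefore occupies two 16-bit slots in the name.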

And it isn't in UTF-16, since it allows any sequence of unsigned short values, which are not limited to valid Unicode characters.

So in one sense you could call it UTF-16 Plus, since it basically adds a whole bunch of characters. Though it is obviously less cool than actually using UTF-16, so perhaps it would be better to think of it as more of a UTF-16 Minus?

Or, even better, we just keep in mind that it isn't really a true Unicode encoding, just one that supports a lot of Unicode's characteristics, features, and properties while not really having a larger understanding of it....
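One way to see the gap between "any sequence of unsigned shorts" and real UTF-16 is to check a sequence for unpaired surrogates. The following is only a sketch in Python: `is_well_formed_utf16` is a hypothetical helper, it works on a plain list of integers rather than on anything read from NTFS, and it checks surrogate pairing only, not every other way a value could fail to be a valid character.

```python
def is_well_formed_utf16(units: list[int]) -> bool:
    """Return True if every surrogate in `units` is part of a proper pair."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if not 0 <= unit <= 0xFFFF:
            raise ValueError("not a 16-bit value")
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed at once by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no high surrogate in front of it.
            return False
        i += 1
    return True


print(is_well_formed_utf16([0xD801, 0xDC85]))  # True: a proper surrogate pair
print(is_well_formed_utf16([0x0041, 0xD801]))  # False: lone high surrogate
```

A name like the second one is the sort of sequence that a strict UTF-16 consumer would reject, even though it is just a list of 16-bit values.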

 

This post brought to you by 𐒅 (U+10485, a.k.a. OSMANYA LETTER KHA)


# Adam on 10 Sep 2006 6:54 PM:

"[NTFS] allows any sequence of unsigned short values [as a filename]"

Interesting. I've heard that NTFS and the NT kernel have much looser restrictions on things like filename and path length than the Win32 layer above them - so it wouldn't surprise me to know that they allow more characters also.

But I'd be surprised to find out that even 0x0000, 0x005c (\) and 0x002f (/) are allowed at the kernel/NTFS level. Even Unix systems disallow the bytes 0x00 and 0x2f in filenames.

However, I'm not sure about 0x0001 - 0x001F, 0x002a (*), 0x003f (?) and 0x007c (|). Do you have any idea which of these are allowed in filenames at the kernel/fs level? I know Win32 (and .NET) will barf on them all, but it'd be interesting to have an idea of which layer those restrictions are fundamentally at.

# Michael S. Kaplan on 10 Sep 2006 8:53 PM:

There are specific characters that are disallowed at various levels for reasons that have very little to do with Unicode support, of course. I am not sure which ones are under Win32 and which are lower, though....

# Tom Gewecke on 11 Sep 2006 11:23 AM:

Can you provide the URL of any detailed explanation of the XP filename encoding system? It would be useful for me to compare it with the Mac system. Various compatibility issues arise when transferring files, some of which depend on normalization (OS X must have everything decomposed).

# Michael S. Kaplan on 11 Sep 2006 12:32 PM:

I don't know of a URL, though I am working on a post to do here eventually and I can point you at that when it is done. :-)

NTFS does no Unicode normalization at all, which can definitely cause compatibility issues if you are transferring files to or from any system that does normalize, such as OS X.
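As a rough illustration of why that matters, here is a small Python sketch using the standard `unicodedata` module. The file name and the helper `same_name_after_normalization` are invented for the example, and plain NFD is used here as a stand-in for whatever decomposition the other system applies, which is an assumption:

```python
import unicodedata


def same_name_after_normalization(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after converting both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "caf\u00e9.txt"                            # precomposed U+00E9
decomposed = unicodedata.normalize("NFD", composed)   # "e" followed by U+0301

print(composed == decomposed)                               # False
print(len(composed), len(decomposed))                       # 8 9
print(same_name_after_normalization(composed, decomposed))  # True
```

Two names that look identical on screen can therefore be distinct to a file system that compares code units without normalizing.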

# Tom Gewecke on 11 Sep 2006 2:09 PM:

Thanks! The Mac info is found here, in case it's of interest:

http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties

# Matt Seitz on 30 Nov 2006 2:47 AM:

If NTFS allows any unsigned short, would it be more accurate to say that NTFS does not do any encoding?  Should one instead say it is the Win32 subsystem which encodes and decodes characters as UTF-16, and then stores them in and reads them from the raw NTFS file name buffer?

# Michael S. Kaplan on 30 Nov 2006 3:34 AM:

Well, perhaps. I doubt I could convince anyone to update the documentation to say it that way, though. :-)


referenced by

2006/12/05 Validation of Unicode text is growing up

2006/09/24 NTFS and Unicode?
