NTFS and Unicode?

by Michael S. Kaplan, published on 2006/09/24 15:40 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/09/24/769540.aspx


A recent question I received via email from a colleague who preferred to remain anonymous on the blog:

Hope everything is going well with you first of all...

May I ask for your help on an NTFS technical question? I'm currently involved in a discussion of some CIFS/NTFS compatibility issues and am wondering which Windows release was the first to support UTF-16 and characters beyond the BMP?

Based on http://en.wikipedia.org/wiki/NTFS, it is Windows 2000, but I'm not quite sure if that's official or really correct.

Would you please let me know if you have the info handy or point me to one of the public documents available at Microsoft web sites? (I was trying to do web search but I wasn't really able to find the info from www.microsoft.com...)

Thanks very much in advance for your help and hope this isn't a trade secret that I'm asking for...

Well, since as far as I know I don't know any trade secrets about NTFS, we are probably safe on that count, at least! Just to make sure, I'll stick to stuff that anyone can verify themselves if they want.... :-)

Of course there is the info I just put up in this blog post for starters, and I'll go a step further and make it clear that you could use high surrogate and low surrogate code units in NT even before they were actually defined, since no current or past incarnation of NT disallows unassigned code points.
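For anyone who wants to see what those code units look like, here is a minimal Python sketch of the standard UTF-16 arithmetic that splits a code point beyond the BMP into a high and a low surrogate. The function name `to_surrogates` is made up for illustration, and nothing here is NTFS code; it only shows the two 16-bit values that would end up in a file name.

```python
def to_surrogates(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP use surrogate pairs")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# U+10400 (DESERET CAPITAL LETTER LONG I) as an example
high, low = to_surrogates(0x10400)

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM)
encoded = chr(0x10400).encode("utf-16-be")
```

A file system that stores names as plain 16-bit units never has to know that these two values belong together, which is why such names could be created long before the pair had any assigned meaning.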

The Wikipedia article is really quite misleading on this score with its text:

File names are stored in Unicode (encoded as UTF-16, although limited to the Basic Multilingual Plane in early versions before Windows 2000).

Well, I'll point out that whoever wrote this bit either confused NTFS with Active Directory (which actually was limited on this point until Windows XP/Server 2003, which is when surrogate code units first received weight) or simply doesn't understand NTFS and did not test creating such files on NT 4.0 or earlier.

In my ideal world, a future version of NTFS would actually (optionally) take into account both characters defined in Unicode and also Unicode normalization, but as far as I know there isn't anyone planning such a thing yet.
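To illustrate what taking normalization into account would mean, here is a small Python sketch using the standard `unicodedata` module. The helper name `same_after_nfc` is hypothetical, and this is a model of the comparison only, not a description of anything NTFS does today.

```python
import unicodedata


def same_after_nfc(a, b):
    """Compare two names after normalizing both to NFC (illustrative only)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u0419"          # CYRILLIC CAPITAL LETTER SHORT I
decomposed = "\u0418\u0306"  # CYRILLIC CAPITAL LETTER I + COMBINING BREVE

# Compared code unit by code unit, these are two different names
different_raw = composed != decomposed

# A normalization-aware comparison would treat them as the same name
same_normalized = same_after_nfc(composed, decomposed)
```

Both strings look identical on screen, but a comparison that works on raw code units sees two distinct names.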

So if I absolutely had to describe NTFS in terms of a Unicode version, I'd say it uses a very early version of Unicode and allows anything it believes to be an unassigned code point, for forward compatibility. :-)

 

This post brought to you by / (U+002f, a.k.a. SOLIDUS)


# Carl on 24 Sep 2006 4:59 PM:

"nornalizattion"? That's a j○ke, right?

# Michael S. Kaplan on 24 Sep 2006 5:22 PM:

Whoops, that was just a typo. Now *your* comment is a normalization joke, what with that confusable character and all. :-)

# Michael Dunn_ on 25 Sep 2006 4:22 AM:

You can go and fix up the misleading text in that Wikipedia article yourself. :)

# Michael S. Kaplan on 25 Sep 2006 4:59 AM:

Hi Mike,

In theory I could. But I am not really a Wikipedia contributor. Plenty of regular readers here are, so someone else may take a stab at it if they feel strongly. :-)

# Sergei on 25 Sep 2006 9:13 AM:

Beta versions of Vista regard these files as different:
file 1: U+0419
file 2: U+0418 U+0306

and these two too (encoded in UTF-16, of course):
file 1: U+10400
file 2: U+10428
I was hoping that the last example might work in Vista, because the NTFS file "$UpCase" in Vista is different from the one in Windows XP.

# Michael S. Kaplan on 25 Sep 2006 11:05 AM:

Nope, normalization still does not happen, and the updated casing table is limited to the BMP....
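A rough way to picture why a BMP-only casing table cannot match those two Deseret names: the Python sketch below upper-cases a name one UTF-16 code unit at a time. It is a hypothetical model, with Python's own `str.upper` standing in for the table; the function name `upcase_units` is invented and this is not the actual $UpCase data or algorithm.

```python
def upcase_units(name):
    """Upper-case a name one UTF-16 code unit at a time (hypothetical model)."""
    data = name.encode("utf-16-le")
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data), 2)]
    result = []
    for unit in units:
        if 0xD800 <= unit <= 0xDFFF:
            # A surrogate on its own has no case mapping, so it passes through
            result.append(unit)
            continue
        upper = chr(unit).upper()
        # Keep only one-to-one mappings that fit in a single code unit
        if len(upper) == 1 and ord(upper) <= 0xFFFF:
            result.append(ord(upper))
        else:
            result.append(unit)
    return result


# BMP letters differing only in case compare equal after upcasing
bmp_match = upcase_units("\u0439") == upcase_units("\u0419")

# U+10400 and U+10428 are a case pair, but per-unit upcasing cannot see that
deseret_match = upcase_units("\U00010400") == upcase_units("\U00010428")
```

In this model `bmp_match` comes out true and `deseret_match` false, since each Deseret letter is stored as two surrogate units that a per-unit table leaves untouched.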

# WikiServerGuy on 25 Sep 2006 4:43 PM:

I'm a bit surprised Wikipedia is used as a reference; don't you guys have an internal wiki or something? IIRC Ward Cunningham himself works for the company. If there's no wiki I would have thought that there would be an internal website that had info on it or something...

BTW, great post! :)

# Michael S. Kaplan on 25 Sep 2006 5:21 PM:

The original question came from someone outside of Microsoft who wanted to remain nameless....
