Our non-Unicode heritage

by Michael S. Kaplan, published on 2006/07/25 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/07/25/676295.aspx


Apologies for the small George Carlin riff in italics below, it is based on the Civil War bit he did during his New Jersey HBO special back in the early 90s. I lack the budget to have Mr. Carlin do a Podcast saying this bit, so please use your imagination to get the full effect!

The first version of Windows (1.0) shipped back in 1985, and it didn't have all that much in the way of impressively compelling international support. There were other good reasons for nobody to buy it, so most people probably didn't notice the lack.

Anyway, about seven years after the first version was released, seven years later, Windows NT 3.1 was shown at the PDC in San Francisco. And it supported Unicode.

Not so you'd really notice it, of course.

Just sort of 'on paper.'

Of course now, fourteen years later, and Microsoft is planning on shipping Windows Vista, a fully Unicode operating system.

But not so you'd really notice it.

Because we still have these components that don't support Unicode.

Components who figure a code page is a really keen way to encode.

And the developers study the encoding carefully, and they try to improve on the strategies and the tactics to increase the component's utility. In case we have to go through writing new non-Unicode support some time. [sarcasm]

In fact, some of these components actually get used in top of the line applications and they go out and shoot for the moon with the features they provide.

You know what I say? Use live ammunition, would you please?

That was fun. :-)

Anyway, let's get down to it.

One of those components, I mentioned briefly in this post: wininet.dll.

It came up because we had an interest in changing the defaults for the NtfsAllowExtendedCharacterIn8dot3Name setting, documented as:

Specifies whether the characters from the extended character set, including diacritic characters, can be used in short file names using the 8.3 naming convention on NTFS volumes.

Value   Meaning
0       On NTFS volumes, file names using the 8.3 naming convention are limited to the standard ASCII character set (minus any reserved values).
1       On NTFS volumes, file names using the 8.3 naming convention may use extended characters.

This entry does not exist in the registry by default. You can add it by using the registry editor Regedit.exe.
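For anyone who would rather check the value from code than from Regedit.exe, here is a minimal sketch. It assumes the value lives under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem (the quoted topic above does not give the key path), and the function name is made up for illustration. On anything other than Windows it simply returns None.

```python
import sys

# Assumption: the value sits under this key, next to the other NTFS settings.
FILESYSTEM_KEY = r"SYSTEM\CurrentControlSet\Control\FileSystem"
VALUE_NAME = "NtfsAllowExtendedCharacterIn8dot3Name"


def read_extended_char_setting():
    """Return the stored value, or None if it is absent or this is not Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # standard library, but only available on Windows

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, FILESYSTEM_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except FileNotFoundError:
        # The entry does not exist by default, so "missing" is a normal answer.
        return None


print(read_extended_char_setting())
```

A result of None is the expected answer on a machine where nobody (and nothing) has ever added the entry.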

Of course what is not mentioned in that informational topic is that years ago it was decided that this value should be set anytime the default system locale was Chinese, Japanese, or Korean (and unset anytime it wasn't).

There are several problems here --

  1. It is mildly inconvenient, as we are trying to reduce the number of dependencies on the system locale.
  2. We are also (generally) trying to reduce odd interactions like this one, which are so hard to track.
  3. The old logic actually stomps on the preferences of anyone who actually uses the setting, any time the locale is changed.
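The third problem is easier to see in a toy model. The sketch below is not the actual Windows code; it assumes "unset" means the value is removed, uses a plain dictionary in place of the registry, and the function name and locale tags are invented for illustration.

```python
CJK_LANGUAGES = {"zh", "ja", "ko"}
VALUE_NAME = "NtfsAllowExtendedCharacterIn8dot3Name"


def apply_old_locale_rule(settings, system_locale):
    """Toy version of the old rule: set for CJK system locales, unset otherwise."""
    language = system_locale.split("-")[0].lower()
    updated = dict(settings)
    if language in CJK_LANGUAGES:
        updated[VALUE_NAME] = 1
    else:
        # Assumption: "unset" is modeled here as deleting the value.
        updated.pop(VALUE_NAME, None)
    return updated


# A user on a non-CJK system deliberately turns the setting on...
settings = {VALUE_NAME: 1}
# ...and later changes the system locale to another non-CJK locale.
settings = apply_old_locale_rule(settings, "de-DE")
print(VALUE_NAME in settings)  # False: the user's choice is gone
```

The rule never looks at what was there before, so any change of system locale silently replaces the user's choice.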

After talking with various partners and knowledgeable people in the file system and the various markets, we tried just setting it always and being done with it. Unicode had been around for some time, maybe it was "time to cut the cord" (the exact words of one of the file system architects).

In fact, if you have Beta 2 of Vista then that is what you have on your install.

Everything was going great until we found out that one several-year-old baby still had the cord attached. :-(

The wininet cache (which basically caches everything that various processes, including IE, fetch when accessing the internet) does not support Unicode, since wininet.dll doesn't (wininet supports a Unicode interface that converts anything you throw at it, but that is more or less it).

Now for a page on the web it would not be too noticeable; after all, if a cache item of an internet access cannot be reached, then it just wouldn't get used -- you just go right to the internet. Unfortunately if you have a user name that isn't on your default system code page then the path to the cache itself is broken. So you fail even trying to get to it to fail -- so basically you lose Internet Explorer.
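To make the failure concrete, here is a small sketch of what happens to a path when it is forced through a code page. It is only an illustration: the path and the helper name are made up, and Python's "replace" error handling stands in for the lossy conversion a non-Unicode component would perform.

```python
def through_code_page(path, code_page):
    """Round-trip a path through a code page, replacing what does not fit."""
    return path.encode(code_page, errors="replace").decode(code_page)


# Hypothetical profile path for a user whose name is written in kanji.
profile = "C:\\Users\\田中\\cache"

# On a Japanese system code page the name survives the round trip.
print(through_code_page(profile, "cp932") == profile)  # True

# On a Western European code page the name collapses to question marks,
# so the resulting path no longer points at the real directory.
print(through_code_page(profile, "cp1252"))  # C:\Users\??\cache
```

Once the user name itself has been mangled, every file under that directory is unreachable through the converted path, which is why the whole cache goes rather than a single entry.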

Oops.

Anyway, no worries, even though no beta customers had reported the problem, there was no sense waiting for a report -- clearly this was a big enough regression that it had to be fixed.

The change has been reverted for future builds, so that wininet.dll's lack of Unicode support (and, incidentally, a bit of the Windows non-Unicode heritage!) is preserved for another version.

Though I suppose it means that there aren't a whole lot of Windows user names off the default system code page that are used on CJK system locale machines. Or if there are then those customers probably don't try to use IE much. Since they are as broken in the prior versions as they will be in the new version.

And of course the people who use that NtfsDisable8dot3NameCreation setting to block the creation of short file names are probably not going to be too happy either if they have long user names or names with characters off the default system code page, for roughly the same reason.

In the end I am not really too worried about it since both ANSI support and short file name support on NTFS are there for backwards compatibility. So I suppose the overlap is consistent enough that people are not hitting this particular bug much.

But it is a story that I have been shaking my head about since the problem was identified....

This post brought to you by ඤ (U+0da4, a.k.a. SINHALA LETTER TAALUJA NAASIKYAYA)


# Random Reader on 25 Jul 2006 4:09 AM:

From this I infer that the wininet cache only uses short file names; why is that still the case today?  I would have expected that to change for the NT line at some point, especially if the IE team is responsible for it.

Is it just a case of it not being a priority to change, or is there something "bigger" behind it?

# Michael S. Kaplan on 25 Jul 2006 7:51 AM:

Yep, short names is their business. Hard to know why, exactly (when they had to install on Win9x the reason was more obvious).

# Ben Cooke on 25 Jul 2006 1:58 PM:

I did try to answer this for myself, but I can't find any easy way to get the short filename of a file under Windows XP without writing code. (Side note: I'm finally running Windows XP! I've been using 2000 as my primary OS for years now. My Windows XP install is so fresh that I don't have a compiler installed yet.)

If you have the setting on to store short filenames in the system code page, what does it store if your filename consists entirely of characters that aren't in the system code page? Does it end up just called ~1? :)

Also, does toggling that option when there are already files stored with short names confuse NTFS, or does it track for each file the character encoding used for its short filename?

(Side note 2: Back on my old system I actually had NtfsDisable8dot3NameCreation on for a bit, but I realised how often I actually use the short filenames as shortcuts. I quite often run "c:\progra~1", for example. It's too bad that you can't have the creation of the entries in the FS disabled but just have the OS (or even just the shell!) preserve the illusion that they are there. I realise that'd get very ugly very quickly, though...)

# Michael S. Kaplan on 25 Jul 2006 2:09 PM:

From a CMD prompt, dir /x will do the trick here....

The algorithm actually fills in other characters that fit within the ASCII range if 8.3 is limited to ASCII.

