Sometimes, you have to keep it in ASCII

by Michael S. Kaplan, published on 2006/04/30 19:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/04/30/587267.aspx


Just over a week ago I was posting about Unicode? Zip don't need no stinking Unicode!

Well, as Heath points out in this post, Palm has re-released their update, sans the non-ASCII characters in file names.

Given the current limitations in ZIP and the additional complications I pointed out in that earlier post, I suppose I could hardly blame them....

It is funny how Microsoft can't really win in this kind of situation -- if we do nothing, then we are branded a bunch of ignorant provincials for the limitations in the platform; if we provide any kind of extension to compression, then we are evil for ignoring existing standards in favor of our proprietary solutions.

Of course those two groups are (mostly) different people, so I guess it makes sense.

But the fact that we need to keep things in ASCII is more than a little disappointing. :-(

 

This post brought to you by "®" and "™" (U+00ae and U+2122, a.k.a. REGISTERED SIGN and TRADE MARK SIGN)


# orcmid on 1 May 2006 1:47 AM:

I have to deal with this in supporting a legacy interface that uses octets for character strings and is code-page specific.  

The only thing I can figure out for the reference implementation of a Unicode/XML-oriented repository is to identify and store the code-page-based material verbatim, employ a Unicode translation where I can, and do the best guessing I can when material deposited via a client using one code page is retrieved by a client using a different code page.  

I don't want to think about ordering of query results at all, and may punt in the reference implementation, with lots of caveats in the code.

I suspect multiple code pages will be rare in actual use, but I want a solution for the reference implementation.


# mpz on 2 May 2006 8:48 PM:

I think Windows could have defaulted to interpreting filenames inside zips as UTF-8, falling back to the legacy code page when the name isn't valid UTF-8 (this is what IRC clients have recently started doing, which eases the transition since IRC doesn't specify a charset). The transition would have been even faster if Windows had also bundled a zip compressor that stored the filenames in UTF-8, although I can see that would have raised some complaints from certain shareware program authors *cough*.
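The fallback scheme described above can be sketched in a few lines. This is a minimal illustration, not anyone's shipping implementation; the choice of cp1252 as the legacy code page is an assumption here (a real tool would use the system's actual ANSI or OEM code page):

```python
def decode_zip_filename(raw: bytes, legacy_codepage: str = "cp1252") -> str:
    """Try UTF-8 first; fall back to a legacy code page if the bytes
    are not valid UTF-8 (the scheme described in the comment above)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy_codepage)

# A name stored as UTF-8 round-trips correctly...
print(decode_zip_filename("läsmig.txt".encode("utf-8")))
# ...while a legacy-encoded name fails UTF-8 validation (0xE4 is a UTF-8
# lead byte with no continuation byte after it) and takes the fallback path.
print(decode_zip_filename("läsmig.txt".encode("cp1252")))
```

Both calls print `läsmig.txt`, which is the point: the same decoder handles old and new zips without any stored charset tag.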

I don't think any of the Windows ZIP programs do this at the moment, but if you zip up a file whose name contains non-ASCII characters on Linux, the name becomes UTF-8 automatically, since filenames in Linux distros these days are exclusively UTF-8.

This is a worthwhile "upgrade" to the zip format (let's face it, zip isn't going anywhere, so we might as well make it *work* as well as it can), and IMHO it doesn't count as embrace-and-extend, since the current use of unspecified character sets in filenames is already a huge mess that UTF-8 would clear up.

# Adam on 3 May 2006 12:56 PM:

mpz> "I can see that [encoding filenames with UTF-8] would have raised some complaints from certain shareware program authors *cough*."

Why? If filenames are encoded with CP_OEMCP or CP_ACP, but the actual code pages used are not stored in the zip file, then zip libraries already have to be aware that any zip file they get could have filenames stored in any code page.

Also, if any existing Windows systems are set up so that their code page is 65001 (UTF-8), then they'll already be producing zips with UTF-8 filenames.

Producing zips with UTF-8 filenames won't break anything that isn't already broken, and trying to decode as UTF-8 first is pretty robust as it's really unlikely that you'll get a valid UTF-8 stream with non-ASCII chars by accident.
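Adam's robustness claim is easy to check empirically: an isolated high byte from a legacy code page is almost never a well-formed UTF-8 sequence, so sniffing rarely misfires. A quick sketch (the helper name and sample strings are mine, purely for illustration):

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Report whether a byte string is a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 'é' in cp1252 is the single byte 0xE9 -- in UTF-8 that byte is a lead
# byte that demands a continuation byte, so the sequence is rejected.
print(is_valid_utf8("café.txt".encode("cp1252")))  # False
# Pure ASCII is valid under both interpretations, so nothing breaks.
print(is_valid_utf8(b"readme.txt"))                # True
```

Accidental false positives need several consecutive high bytes in exactly the right pattern, which is why UTF-8-first decoding is safe in practice.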

To be really safe, if MS announced an intention to only produce UTF-8 ZIPs starting a year from now, that ought to give all ZIP companies plenty of time to get UTF-8 sniffing decompressors out there. In a year's time, all new zip programs start producing UTF-8 encoded zips, in 2 the problem would be pretty much solved.

# mpz on 6 May 2006 12:44 PM:

"then the zip libraries have to be aware that any zip file they get could have filenames stored in any codepage. "

They are not aware of that. They simply operate under the rules of the old world ("ANSI"), where you didn't have to care about anything and it was the user's problem if (s)he tried to extract a zip file that contained characters from a different code page.

If zips with UTF-8 filenames suddenly started appearing, some of these shareware authors would most certainly first blame whoever started supporting them, because they simply cannot fathom the need for characters outside the selected legacy code page.

That doesn't mean I'm opposed to UTF-8. On the contrary, I *want* to see the change, and the sooner we start the migration the sooner it will be over. ZIP decompressors should simply test whether the filename is a valid UTF-8 string, and if not, decode it with the legacy code page. And of course compressors should start storing the filename as UTF-8, *always*.
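As a historical footnote, roughly what mpz asks for did later land in the format itself: the ZIP specification gained a general-purpose flag (bit 11) marking an entry name as UTF-8, and Python's standard `zipfile` module, for example, sets it automatically when a name can't be encoded in the legacy cp437 code page. A small demonstration:

```python
import io
import zipfile

buf = io.BytesIO()

# Writing an entry with a non-ASCII name...
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("läsmig.txt", b"hello")

# ...stores the name as UTF-8 and sets flag bit 11 (0x800) to say so.
with zipfile.ZipFile(buf) as zf:
    info = zf.infolist()[0]
    print(bool(info.flag_bits & 0x800))  # True
    print(info.filename)                 # läsmig.txt
```

Readers that honor the flag decode the name as UTF-8 with no guessing at all; the sniffing scheme above is only needed for older archives that predate it.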

referenced by

2008/05/13 WinZip, the [long awaited] Unicode edition!!!
