It's not that they're putting the Pressure on Windows, but maybe the Pressure.Net? :-)

by Michael S. Kaplan, published on 2012/01/10 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/01/10/10255017.aspx


The other day when I blogged about how If someone blathers on about how Windows supports Unicode, you can suggest they just ZIP it, if you like!, I didn't tell the whole story.

I focused on Windows, and a peculiar engineered schizophrenia that is Windows compressed folders, the most non-Unicode piece of Windows outside of stuff that was yanked out of the product years ago.

But if you look to Windows 8, there is a component shipping with it that is aspiring to do more.

I'm referring to the .Net Framework, whose version 4.5 has added the new ZipArchive Class to the venerable (for .Net, at least!) System.IO.Compression Namespace.

Done right, too -- using UTF-8 so they can support whatever kind of characters you want -- anything in Unicode!

The code was in no way connected to the code Windows licensed for the Compressed Folders feature.

And you know how Windows isn't allowed to provide a programmatic way to get into those compressed folders?

Well everything you do in .Net is something programmatic!

On top of all of that, you can install the Microsoft® .NET Framework® 4.5 Developer Preview - Full on Server 2008, Windows 7, or Server 2008 R2 as well as just using it on the Developer Preview of Windows 8.

Amazing!

There must be a catch though, right?

Don't worry, there is.

The catch is most easily described by comparing/contrasting the CP_ACP code page-based support of ZIP in Windows and the UTF-8-based support of ZIP in the Developer Preview of .Net 4.5.

These two technologies, much like the two encodings supporting them, have one thing in common.

And when I say one thing, I mean 127 things.

That's right -- what they have in common, all they have in common, is ASCII.
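To make the overlap concrete, here is a minimal sketch in Python (not .NET, purely for illustration). It assumes code page 932 as a stand-in for a Japanese CP_ACP; the helper name `same_bytes` is made up for this example.

```python
# Compare how a file name is encoded under a legacy code page and under UTF-8.
# cp932 is used here as an assumed stand-in for a Japanese CP_ACP.

ASCII_RANGE = "".join(chr(i) for i in range(128))

def same_bytes(text: str, code_page: str = "cp932") -> bool:
    """Return True when the code page and UTF-8 produce identical bytes."""
    return text.encode(code_page) == text.encode("utf-8")

ascii_matches = same_bytes(ASCII_RANGE)   # the 7-bit range agrees
japanese_matches = same_bytes("あ.txt")    # anything beyond it does not
```

Names that stay inside the 7-bit range come out byte-for-byte identical either way, which is why an ASCII-only ZIP file works everywhere and everything else is a gamble.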

So although .NET is doing the right thing here, it is essentially only doing the right thing if either:

Otherwise, I suppose bugs like System.IO.ZipArchive zipped only UTF-8 encoding are just going to be par for the course in markets like Japan, where even though all that JIS X 0213 and IVS and Emoji are important, lots of people are still happy within their own code page, at least.

This suggests one of the big problems of having no user interface:

You have no user interface!

Damn.

So we still have this flaw in Windows, though this design means that for many customers it is too easy to be dragged down in .Net, too.

Now of course there is not too much of a sign that .Net wouldn't try to help here, to at least bridge the scenario a little.

Or maybe they could take advantage of some of that new managed side by side stuff and provide a couple of context menu items for zipping and unzipping files? Though that approach would also be fraught with peril and confusion for many people in the long run, unless they literally took over the ZIP handling for always, even though they localize into just ~10% of the languages of Windows.

They would be saviors, but not everyone would fully appreciate their largess.

Perhaps including Windows....

Or maybe Windows could do this same thing themselves and use the managed code ZipArchive Class. I mean, they know more about shell handlers than anyone in the universe, and they have a much larger Extent Of Localization. Hooking up .NET here might be cheaper than fixing the ZIP problem they have never fixed yet.

They could take the Notepad approach to Unicode support:

Not perfect, but certainly it has been good enough for text files in Notepad since the Windows 2000 beta!

Or they could just fix the bug they have now....

Okay enough spitballing, you get the idea.

The current work highlights both the best and the worst of two different business units within Microsoft.

Looking at that bug report I mentioned above, the needs are expressed pretty clearly:

Greg,
Here is what Mr./Ms. Kamegawa wrote back (with some edit by me). Hope it helps.
I think you've already answered most of the issue, but could you please recap it and/or add any comment?
(Personally I would agree with him/her. Whatever the ZIP file specification is, it seems like the de facto standard in Japan has been MBCS ZIP files. For a very long time. It is great that the .NET API uses UTF-8 for globalization, but I'm afraid the lack of a viable option to create MBCS ZIP files would make the .NET API practically useless in Japan.)
Thanks.

---
[Summary]
System.IO.Compress.ZipArchive stores MBCS file names in UTF-8.
Windows Explorer can't handle UTF-8 ZIP files.
So the ZIP files compressed by System.IO.Compress.ZipArchive are not extracted correctly by Windows Explorer.
Windows Explorer should support extracting UTF-8 ZIP files. Otherwise System.IO.Compress.ZipArchive should support storing MBCS file names. And the latter seems more practical.

[Background]
Most Japanese users use Windows Explorer to extract ZIP files. Japanese file names are used so frequently in Japan that the incompatibility between Windows Explorer and the ZIP files compressed by System.IO.Compress.ZipArchive is unacceptable.

[Repro Steps]
1. Create Japanese named files in c:\temp\.
2. Compress them into ziptest.zip using the code below.
--- Begin ---
using (var zip = new ZipArchive(@"c:\temp\あ\ziptest.zip", ZipArchiveMode.Create)){
var files = new DirectoryInfo(@"c:\temp").GetFiles("*.*");
Array.ForEach(files, x => zip.CreateEntryFromFile(x.FullName, x.Name));
}
--- End ---
3. Extract ziptest.zip using Windows Explorer.
-> The Japanese file names are corrupted as Greg depicted before.

[Expected Behavior]
Twofold:
1. Make Windows Explorer support extracting UTF-8 ZIP files.
2. Make System.IO.Compress.ZipArchive support compressing ZIP files using MBCS or the system locale.

The best would be 1. But I don't think it possible to make all the widespread Windows versions (i.e. XP/Vista/7/2003/2008/2008 R2) support UTF-8 ZIP files.

So I would like to ask for 2. It will provide the maximum compatibility with Windows including the legacy-but-widely-used versions of Windows.

I could hardly say more here than this.
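For readers without the Developer Preview handy, the same mismatch can be sketched in Python, whose standard `zipfile` module also stores non-ASCII names as UTF-8 and sets general purpose bit 11. This is only an illustration under assumptions: cp932 stands in for what a code-page-only extractor on a Japanese system would use, and `first_entry_raw_name` is a hypothetical helper that peeks at the first local file header.

```python
import io
import struct
import zipfile

def first_entry_raw_name(zip_bytes: bytes):
    """Read the flags and raw name bytes from the first local file header."""
    signature, _version, flags = struct.unpack_from("<IHH", zip_bytes, 0)
    if signature != 0x04034B50:
        raise ValueError("not a local file header")
    (name_len,) = struct.unpack_from("<H", zip_bytes, 26)
    return flags, zip_bytes[30:30 + name_len]

# Write a one-entry archive in memory with a Japanese file name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("あ.txt", b"hello")

flags, raw_name = first_entry_raw_name(buffer.getvalue())
utf8_flag_set = bool(flags & 0x0800)

# What a reader that ignores bit 11 and assumes the code page would show.
seen_by_code_page_reader = raw_name.decode("cp932", errors="replace")
```

The bytes on disk are correct UTF-8 and the flag says so; the corruption only appears when a reader ignores the flag and decodes with its own code page.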

So no matter how you look at it, the new ZipArchive Class of the Microsoft® .NET Framework® 4.5 Developer Preview - Full is certainly putting the pressure on them to do the right thing, huh?

Or at least the Pressure.Net!


Mike Dimmick on 10 Jan 2012 8:43 AM:

.NET strings are all Unicode, all the time. If you need to convert to some serialization format, you need to specify the Encoding you're going to use. You might be implementing some dynamic website that creates a zip file on the fly, and you might want to set the filename of the contained files according to the user's preferred language, rather than the selected locale of the server. This isn't that unlikely, I've certainly seen websites where you can select multiple documents to download as a single ZIP file, and the filenames could well depend on what language the user agent asked for.

My suggestion - which I've just posted as a comment - is to add new overloaded constructors that take an Encoding specifying what character set the filenames are in.  

You could encounter ZIP files that were created on one system, then updated on another with a different character set, but these would be pretty rare. You'd probably want to prevent .NET creating one, of course!

cheong00 on 10 Jan 2012 8:23 PM:

Actually the NTFS driver will convert any filename you specified for creating new file to Unicode when it writes to the filename entry. So it's pretty pointless waste of effort for making ZipArchive support changing filename to MBCS and then actually have filesystem driver change it back on the fly.

Michael S. Kaplan on 10 Jan 2012 10:29 PM:

Ah, you are missing the point of the plan -- it is to support CP_ACP and failing if there are characters outside of it (working like Notepad does)....

HomeCloset on 10 Jan 2012 11:37 PM:

As for Windows, the suggested approach like Notepad would be great. In this way people will be able to migrate to Unicode gradually.

If the current situation is left as it is, people in Japan will not use ZipArchive at all. Neither will they migrate to Unicode.

Random832 on 11 Jan 2012 7:33 AM:

The zip format does have a flag - which I hope that the ZipArchive class is setting - to specify if filenames are in UTF-8 or not. (The standard is vague on what "not" means, so it can just be whichever of ACP/OEMCP the compressed folders app does now). That's better than the notepad method (text files, of course, contain no metadata)

HomeCloset on 11 Jan 2012 8:44 AM:

ZipArchive class sets the flag, reportedly. Of course the notepad method should also set the flag when saving in UTF-8 after presenting the "magic UI".
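A rough sketch of the Notepad-style approach discussed in this thread, written in Python for illustration only. Everything named here is hypothetical: `encode_entry_name` is not an API of any library, and `confirm_unicode` is a stand-in for the "magic UI" that asks whether saving as Unicode is acceptable.

```python
from typing import Callable, Tuple

UTF8_NAME_FLAG = 0x0800  # general purpose bit 11

def encode_entry_name(
    name: str,
    code_page: str,
    confirm_unicode: Callable[[str], bool],
) -> Tuple[bytes, int]:
    """Prefer the legacy code page; move to UTF-8 only when it cannot fit."""
    try:
        # The name fits in the code page: store it that way, flag clear.
        return name.encode(code_page), 0
    except UnicodeEncodeError:
        # Stand-in for the prompt; refusing means the save fails.
        if not confirm_unicode(name):
            raise
        return name.encode("utf-8"), UTF8_NAME_FLAG
```

With this shape, archives that only contain names from the local code page stay readable by older extractors, and the flag is set only on the entries that genuinely need UTF-8.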

Peter Gibbons on 11 Jan 2012 8:59 AM:

I would appreciate if Microsoft would extend the support for codepages in the ZipArchive Class beyond Japanese MBCS! How many times do we receive ZIP files from e.g. Russia encoded in their codepages! Even Microsoft AppLocale doesn't help with this because it only affects the ANSI code page, not the more often used OEM code page, and it doesn't work for 64 Bit programs either. The same goes the other way around. Even nowadays some Russians can't cope with Unicode in ZIP most times because of Windows' lack of support for it. It is a hassle whenever I want to send them ZIP files with Russian filenames in them.

Final cut: _Please_ support _all encodings_, especially UTF-8, OEM, ANSI AND MBCS, when creating AND unpacking ZIP files with System.IO.Compress.ZipArchive !!! Powershell will be our friend with this.

cheong00 on 12 Jan 2012 9:59 PM:

Oh, I was commenting on their demand for ZipArchive MBCS support. It'd make more sense for Windows compressed folder feature to add Unicode support, or suggest them to use any of the alternatives which provides handling to Unicode-filename containing ZIP files if it's not going to change.

HomeCloset on 13 Jan 2012 10:15 AM:

cheong00,

It is fine and really nice to add Unicode support to Windows compressed folder. However -

Will Microsoft push the Unicode ZIP feature update to Windows XP via Windows Update as a critical update? Unless it happens, ZIP API without MBCS support could make no sense.

Even pure bug fixes are not to be released these days for Windows XP.

Joshua on 13 Jan 2012 12:51 PM:

Meh. We had to dump .zip for .7z for format exchange due to massive incompatibility between .zip versions.

Dave Bacher on 29 Jan 2012 1:13 PM:

Here's the standard:

www.pkware.com/.../APPNOTE.TXT

So let's read what it says:

-----
APPENDIX D - Language Encoding (EFS)
------------------------------------
The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437.  This limits storing
file name characters to only those within the original MS-DOS range of values
and does not properly support file names in other character encodings, or
languages. To address this limitation, this specification will support the
following change.
-----

So, either the file is CP437 or it is UTF8, according to the standard.  You get to pick one or the other.  Now, they do add some ambiguity later:

-----
Applications may choose to supplement this file name storage through the use
of the 0x0008 Extra Field.  Storage for this optional field is currently
undefined, however it will be used to allow storing extended information
on source or target encoding that may further assist applications with file
name, or file content encoding tasks.  Please contact PKWARE with any
requirements on how this field should be used.

The 0x0008 Extra Field storage may be used with either setting for general
purpose bit 11.  Examples of the intended usage for this field is to store
whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC.  Similarly, other
commonly used character encoding (code page) designations can be indicated
through this field.  Formalized values for use of the 0x0008 record remain
undefined at this time.  The definition for the layout of the 0x0008 field
will be published when available.  Use of the 0x0008 Extra Field provides
for storing data within a ZIP file in an encoding other than IBM Code
Page 437 or UTF-8.
-----

Now despite this, out in the wild you'll find plenty (looking at you, FidoNet) of files that were exchanged and have directory entries in multiple code pages, without any indication anywhere in the file as to which encoding was used.  As files moved around through FidoNet -- and that was a very real use case, back in 1995, for Microsoft -- bulletin boards would often add additional files, kind of an advertisement for the systems that the file had passed through, as well as some standardized control files.  

Anyway, if I'm zipping up a file, here's what I want out of the library:

* Zip32 until/unless there's a file that needs Zip64. (2.04g version marker)

* US-ASCII 8.3 names, using only upper case letters, numbers, tilde and a single period.

* Some synthesized short name (file0000.000 is fine) based off the above rule, when the file name doesn't match.

* InfoZip's UTF8 extension -- regardless of if the name actually randomly complied with the above or not -- with the actual file name.

The reason I want to produce files following that pattern is that every archiver on the planet can read that file.  Sure, the user might get some bad file names -- but that's better than it crashing the application that tries to read it (as the GP11 technique is known to) and still getting a bad file name if the archiver writes UTF8 names using CP437 (heart, brick, double left .txt?)
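The writing rules above could be sketched roughly as follows, in Python for illustration. The helper names are hypothetical. Two assumptions are mine: "a single period" is read as exactly one period separating base name and extension, and the synthesized name is upper-cased so it obeys the same rule. The extra field layout follows the Info-ZIP Unicode Path field (header ID 0x7075: version byte, CRC-32 of the legacy name, then the UTF-8 name).

```python
import re
import struct
import zlib

# Upper case letters, digits and tilde, 8.3 shape, exactly one period.
_DOS_83 = re.compile(r"[A-Z0-9~]{1,8}\.[A-Z0-9~]{1,3}")

def legacy_name(name: str, index: int) -> str:
    """Keep a compliant 8.3 name, otherwise synthesize one from the index."""
    if _DOS_83.fullmatch(name):
        return name
    return f"FILE{index:04d}.000"

def unicode_path_extra(real_name: str, legacy: str) -> bytes:
    """Build an Info-ZIP Unicode Path extra field carrying the real name."""
    utf8_name = real_name.encode("utf-8")
    crc = zlib.crc32(legacy.encode("ascii")) & 0xFFFFFFFF
    body = struct.pack("<BI", 1, crc) + utf8_name  # version 1, name CRC
    return struct.pack("<HH", 0x7075, len(body)) + body

short = legacy_name("あ.txt", 7)
extra = unicode_path_extra("あ.txt", short)
```

An archiver that knows the extra field shows the real name; one that does not still sees a harmless ASCII name instead of garbage.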

When reading the files, I would look for the InfoZip extensions.  If I found them, I would use them and ignore the original name in the entry.  Then things get complicated.

The GP11 technique was a bad idea, and crashes older archivers.  It evolved over time, but just because the bit is set, you can't assume every directory entry you read is UTF8 because some archivers will add entries to the end of the directory without clearing the bit.  Likewise, just because the bit is clear, you can't assume the entries aren't UTF8 -- because Java wrote UTF8 without setting GP11, then after they figured out it was crashing archivers, they changed it to setting bit 11 -- and then later asked PkWare to add it to the standard.

And so if either GP11 is set or if there is a Java metadata.xml file, you want to try UTF8 processing on the name.  If the UTF8 processing fails, then you want to use CP437.

If GP11 is clear, then either you assume CP437 (and are wrong a lot of the time), or you try to use a heuristic like Notepad uses (and are also wrong a lot of the time).  You can let the application specify -- if you really think the person reading the zip file has a clue how code pages work.
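The reading side of that logic might look something like this Python sketch. `decode_entry_name` is a hypothetical helper, the `fallback` parameter is the "let the application specify" option, and the Java metadata hint is left out for brevity.

```python
UTF8_NAME_FLAG = 0x0800  # general purpose bit 11 (GP11)

def decode_entry_name(raw: bytes, gp_flags: int, fallback: str = "cp437") -> str:
    """Decode a ZIP entry name using the flag-then-fallback order above."""
    if gp_flags & UTF8_NAME_FLAG:
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # The bit was set but the bytes are not valid UTF-8.
            return raw.decode("cp437")
    # Bit clear: CP437 per the standard, or whatever the caller believes.
    return raw.decode(fallback)
```

A caller who knows the archive came from a Russian DOS-era tool could pass `fallback="cp866"`; a caller with no idea gets the CP437 reading and, quite often, the wrong name.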

If I were designing the library, I would probably just provide the raw bytes parallel to the CP437 or UTF8 decoded file name.  I'd process the file according to the standard (which says those two cases), and then I'd provide the byte data so that if someone wanted to process data that way outside the standard, they'd have that option.

That's where the nightmare is, though -- from Microsoft (and your Vendor's) viewpoint, you have users downloading zip files made on random websites with random software.  If the shell fails to open it, then they blame Microsoft -- not the site that wrote the bad file in the first place.  Also despite the fact that PkZip says that CP437 is "the correct" code page, version 4 was the first one that tried to convert anything down to 437 -- older versions just memcpy'd whatever the File I/O stuff returned.

The issue is that Zip is the most widely used archive format on the planet, and people depend on their old files being able to open with newer software.  And so any time you're providing a general purpose Zip library, that library must be compatible with the down-level files.  By contrast, System.IO.Packaging.ZipPackage only needs to be compliant with the ISO OPC spec, not with some PkZip 1.04-era file downloaded from a BBS that hasn't been in operation for over a decade.

Incidentally:

The GP11 UTF8 technique became standard in September of 2006, and so Vista would have been the first version of Windows that could use that (and that format has known back-compat issues)

The InfoZip extension became standard in September of 2007, and so that would be post-launch for Vista.

The Zip support in .NET 4.5 was being written (at one time) by an intern, who had a blog on MSDN talking about it (didn't bookmark it)

And yeah, I can definitely see using 7z, RAR, etc. instead of Zip if you have a choice.

Another real good way to get compatibility is to use InfoZip's DLL -- InfoZip has excellent compatibility with other archivers, and the license is compatible with most commercial software.  For example, IBM uses it -- a lot.

DooBeDoo on 1 Mar 2012 1:08 AM:

In Windows 8 Consumer Preview you still get the old error message when you try to zip files with non-CP_ACP characters in the filenames :-((

