It's not that they're putting the Pressure on Windows, but maybe the Pressure.Net? :-)

by Michael S. Kaplan, published on 2012/01/10 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/01/10/10255017.aspx

I focused on Windows, and a peculiar engineered schizophrenia that is Windows compressed folders, the most non-Unicode piece of Windows outside of stuff that was yanked out of the product years ago.

But if you look to Windows 8, there is a component shipping with it that is aspiring to do more.

Done right, too -- using UTF-8 so they can support whatever kind of characters you want -- anything in Unicode!

The code was in no way connected to the code Windows licensed for the Compressed Folders feature.

And you know how Windows isn't allowed to provide a programmatic way to get into those compressed folders?

The catch is most easily described by comparing/contrasting the CP_ACP code page-based support of ZIP in Windows and the UTF-8-based support of ZIP in the Developer Preview of .Net 4.5.

These two technologies, much like the two encodings supporting them, have one thing in common.

That's right -- what they have in common, all they have in common, is ASCII.
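To make the "all they have in common is ASCII" point concrete, here is a small Python sketch. It assumes code page 932 as the CP_ACP of a Japanese system (an illustrative choice, not something stated above), and the helper name `name_bytes` is made up for this example.

```python
# Compare how a legacy code page and UTF-8 encode the same file names.
# "cp932" stands in for a Japanese system's CP_ACP; that choice is an assumption.

def name_bytes(name):
    """Return (code page bytes, UTF-8 bytes) for a file name."""
    return name.encode("cp932"), name.encode("utf-8")

ascii_acp, ascii_utf8 = name_bytes("readme.txt")
kana_acp, kana_utf8 = name_bytes("あ.txt")

# An ASCII-only name is byte-for-byte identical in both encodings.
print(ascii_acp == ascii_utf8)

# A non-ASCII name is not, so a reader that guesses the wrong encoding
# sees different characters than the writer intended.
print(kana_acp.hex(" "), "|", kana_utf8.hex(" "))
```

Only names that stay inside ASCII survive a round trip between a CP_ACP-based writer and a UTF-8-based reader, or the reverse.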

So although .NET is doing the right thing here, it's essentially only doing the right thing if either:

Otherwise, I suppose bugs like "System.IO.ZipArchive zipped only UTF-8 encoding" are just going to be par for the course in markets like Japan, where, for all that JIS X 0213 and IVS and Emoji are important, lots of people are still happy within their own code page, at least.

So we still have this flaw in Windows, though this design means that for many customers it is too easy to be dragged down in .Net, too.

Now of course there is not too much of a sign that .Net wouldn't try to help here, to at least bridge the scenario a little.

Or maybe they could take advantage of some of that new managed side by side stuff and provide a couple of context menu items for zipping and unzipping files? Though that approach would also be fraught with peril and confusion for many people in the long run, unless they literally took over the ZIP handling for always, even though they localize into just ~10% of the languages of Windows.

Or maybe Windows could do this same thing themselves and use the managed code ZipArchive Class. I mean, they know more about shell handlers than anyone in the universe, and they have a much larger Extent Of Localization. Hooking up .NET here might be cheaper than fixing the ZIP problem they have never fixed.

Not perfect, but certainly it has been good enough for text files in Notepad since the Windows 2000 beta!

The current work highlights both the best and the worst of two different business units within Microsoft.

Looking at that bug report I mentioned above, the needs are expressed pretty clearly:

Greg,
Here is what Mr./Ms. Kamegawa wrote back (with some edit by me). Hope it helps.
I think you've already answered most of the issue, but could you please recap it and/or add any comment?
(Personally I would agree with him/her. Whatever the ZIP file specification is, it seems like the de facto standard in Japan has been MBCS ZIP files. For a very long time. It is great that the .NET API uses UTF-8 for globalization, but I'm afraid the lack of a viable option to create MBCS ZIP files would make the .NET API practically useless in Japan.)
Thanks.

---
[Summary]
System.IO.Compress.ZipArchive stores MBCS file names in UTF-8.
Windows Explorer can't handle UTF-8 ZIP files.
So the ZIP files compressed by System.IO.Compress.ZipArchive are not extracted correctly by Windows Explorer.
Windows Explorer should support extracting UTF-8 ZIP files. Otherwise System.IO.Compress.ZipArchive should support storing MBCS file names. And the latter seems more practical.

[Background]
Most Japanese users use Windows Explorer to extract ZIP files. Japanese file names are used so frequently in Japan that the incompatibility between Windows Explorer and the ZIP files compressed by System.IO.Compress.ZipArchive is unacceptable.

[Repro Steps]
1. Create Japanese named files in c:\temp\.
2. Compress them into ziptest.zip using the code below.
--- Begin ---
using (var zip = new ZipArchive(@"c:\temp\あ\ziptest.zip", ZipArchiveMode.Create))
{
    var files = new DirectoryInfo(@"c:\temp").GetFiles("*.*");
    Array.ForEach(files, x => zip.CreateEntryFromFile(x.FullName, x.Name));
}
--- End ---
3. Extract ziptest.zip using Windows Explorer.
-> The Japanese file names are corrupted as Greg depicted before.

[Expected Behavior]
Twofold:
1. Make Windows Explorer support extracting UTF-8 ZIP files.
2. Make System.IO.Compress.ZipArchive support compressing ZIP files using MBCS or the system locale.

The best would be 1. But I don't think it possible to make all the widespread Windows versions (i.e. XP/Vista/7/2003/2008/2008 R2) support UTF-8 ZIP files.

So I would like to ask for 2. It will provide the maximum compatibility with Windows including the legacy-but-widely-used versions of Windows.

Here's the standard:

www.pkware.com/.../APPNOTE.TXT

So let's read what it says:

-----

APPENDIX D - Language Encoding (EFS)
------------------------------------

The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encodings, or languages. To address this limitation, this specification will support the following change.

-----

So, either the file is CP437 or it is UTF8, according to the standard. You get to pick one or the other. Now, they do add some ambiguity later:

-----

Applications may choose to supplement this file name storage through the use of the 0x0008 Extra Field. Storage for this optional field is currently undefined, however it will be used to allow storing extended information on source or target encoding that may further assist applications with file name, or file content encoding tasks. Please contact PKWARE with any requirements on how this field should be used.

The 0x0008 Extra Field storage may be used with either setting for general purpose bit 11. Examples of the intended usage for this field is to store whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC. Similarly, other commonly used character encoding (code page) designations can be indicated through this field. Formalized values for use of the 0x0008 record remain undefined at this time. The definition for the layout of the 0x0008 field will be published when available. Use of the 0x0008 Extra Field provides for storing data within a ZIP file in an encoding other than IBM Code Page 437 or UTF-8.

-----
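The CP437-or-UTF-8 choice is carried per entry by general purpose bit 11 (mask 0x800). A rough way to see it in practice is Python's standard `zipfile` module, used here purely as a stand-in writer; it is not the .NET ZipArchive code discussed above. `zipfile` leaves the bit clear for pure-ASCII names and stores anything else as UTF-8 with the bit set. The helper name `name_flags` is hypothetical.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11, the "EFS" flag from Appendix D

def name_flags(names):
    """Write empty entries to an in-memory ZIP and report bit 11 per name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    # Reopen the same buffer and read the flags back from the directory.
    with zipfile.ZipFile(buf) as zf:
        return {zi.filename: bool(zi.flag_bits & UTF8_FLAG) for zi in zf.infolist()}

flags = name_flags(["readme.txt", "あ.txt"])
# The ASCII name maps to False and the Japanese name to True.
```

An archiver that ignores the bit will read the second name as CP437 (or as its own code page) and show garbage instead.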

Now despite this, out in the wild you'll find plenty (looking at you, FidoNet) of files that were exchanged and have directory entries in multiple code pages, without any indication anywhere in the file as to which encoding was used. As files moved around through FidoNet -- and that was a very real use case, back in 1995, for Microsoft -- bulletin boards would often add additional files, kind of an advertisement for the systems that the file had passed through, as well as some standardized control files.

Anyway, if I'm zipping up a file, here's what I want out of the library:

* Zip32 until/unless there's a file that needs Zip64. (2.04g version marker)

* US-ASCII 8.3 names, using only upper case letters, numbers, tilde and a single period.

* Some synthesized short name (file0000.000 is fine) based off the above rule, when the file name doesn't match.

* InfoZip's UTF8 extension -- regardless of if the name actually randomly complied with the above or not -- with the actual file name.
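A minimal sketch of that writer policy, again using Python's `zipfile` as a stand-in. The extra field layout follows the Info-ZIP Unicode Path field (header ID 0x7075) as documented in APPNOTE: a version byte of 1, the CRC-32 of the name stored in the header, then the UTF-8 name. The function names and the `FILE0000.000` numbering scheme are hypothetical, and Zip64 handling and the version marker are left to `zipfile`'s defaults.

```python
import io
import re
import struct
import zipfile
import zlib

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path extra field
# Upper case letters, digits, tilde, and at most one period (8.3 shape).
PLAIN_83 = re.compile(r"[A-Z0-9~]{1,8}(\.[A-Z0-9~]{1,3})?")

def short_name(name, index):
    """Keep a name that already fits the 8.3 rule, else synthesize one."""
    if PLAIN_83.fullmatch(name):
        return name
    return "FILE%04d.000" % index  # hypothetical numbering scheme

def unicode_path_field(header_name, real_name):
    """Build the 0x7075 field: version 1, CRC-32 of header name, UTF-8 name."""
    payload = struct.pack("<BL", 1, zlib.crc32(header_name))
    payload += real_name.encode("utf-8")
    return struct.pack("<HH", UNICODE_PATH_ID, len(payload)) + payload

def add_entry(zf, real_name, data, index):
    """Store data under an ASCII 8.3 name, with the real name in the extra field."""
    header_name = short_name(real_name, index)
    info = zipfile.ZipInfo(header_name)
    # The field is attached even when the real name already fits 8.3.
    info.extra = unicode_path_field(header_name.encode("ascii"), real_name)
    zf.writestr(info, data)

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    add_entry(zf, "あ.txt", b"hello", 0)
    add_entry(zf, "README.TXT", b"hi", 1)
```

Because the header name is plain ASCII, bit 11 stays clear, and an archiver that does not know the extra field still sees a harmless 8.3 name.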

The reason I want to produce files following that pattern is that every archiver on the planet can read that file. Sure, the user might get some bad file names -- but that's better than it crashing the application that tries to read it (as the GP11 technique is known to) and still getting a bad file name if the archiver writes UTF8 names using CP437 (heart, brick, double left .txt?)

When reading the files, I would look for the InfoZip extensions. If I found them, I would use them and ignore the original name in the entry. Then things get complicated.

The GP11 technique was a bad idea, and crashes older archivers. It evolved over time, but just because the bit is set, you can't assume every directory entry you read is UTF8 because some archivers will add entries to the end of the directory without clearing the bit. Likewise, just because the bit is clear, you can't assume the entries aren't UTF8 -- because Java wrote UTF8 without setting GP11, then after they figured out it was crashing archivers, they changed it to setting bit 11 -- and then later asked PkWare to add it to the standard.

And so if either GP11 is set or if there is a Java metadata.xml file, you want to try UTF8 processing on the name. If the UTF8 processing fails, then you want to use CP437.

If GP11 is clear, then either you assume CP437 (and are wrong a lot of the time), or you try to use a heuristic like Notepad uses (and are also wrong a lot of the time). You can let the application specify -- if you really think the person reading the zip file has a clue how code pages work.
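The reading order described above could be sketched roughly as follows. The function name and its parameters are hypothetical: `unicode_path` is the name from an Info-ZIP extra field if one was found, `java_hint` stands for "the archive looks Java-made", and `fallback` is whatever code page the application chooses to guess when GP11 is clear.

```python
def decode_entry_name(raw, flag_bits, unicode_path=None,
                      java_hint=False, fallback="cp437"):
    """Pick a file name for one ZIP directory entry.

    raw is the name exactly as stored in the entry (bytes).
    """
    # 1. An Info-ZIP Unicode Path name, when present, replaces the entry name.
    if unicode_path is not None:
        return unicode_path

    # 2. GP11 set (or a Java-style archive): try UTF-8, fall back to CP437.
    if flag_bits & 0x800 or java_hint:
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw.decode("cp437")

    # 3. GP11 clear: assume CP437, or use the caller's guess instead.
    return raw.decode(fallback, errors="replace")
```

For the Shift-JIS bytes `82 A0`, this yields the intended kana only when the caller passes `fallback="cp932"`; with the default it produces two accented Latin letters, which is the "wrong a lot of the time" case.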

If I were designing the library, I would probably just provide the raw bytes parallel to the CP437 or UTF8 decoded file name. I'd process the file according to the standard (which says those two cases), and then I'd provide the byte data so that if someone wanted to process data that way outside the standard, they'd have that option.
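As an illustration of that API shape, here is a sketch layered on Python's `zipfile`. That module does not expose the stored bytes directly, so the sketch re-encodes the decoded name; this relies on `zipfile` decoding names as UTF-8 when bit 11 is set and as CP437 otherwise, and assumes no `metadata_encoding` override was passed when opening the archive. `EntryName` and `entry_names` are hypothetical names.

```python
import io
import zipfile
from dataclasses import dataclass

@dataclass
class EntryName:
    decoded: str     # name decoded per the standard (CP437 or UTF-8)
    raw: bytes       # the bytes stored in the directory entry
    utf8_flag: bool  # general purpose bit 11

def entry_names(zf):
    """Return the decoded name and the raw name bytes for every entry."""
    out = []
    for zi in zf.infolist():
        utf8 = bool(zi.flag_bits & 0x800)
        # Re-encoding with the codec zipfile used recovers the stored bytes
        # (assumption: default decoding, no metadata_encoding override).
        raw = zi.orig_filename.encode("utf-8" if utf8 else "cp437")
        out.append(EntryName(zi.orig_filename, raw, utf8))
    return out

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("readme.txt", b"")
with zipfile.ZipFile(buf) as zf:
    names = entry_names(zf)
```

A caller who knows the archive came from a Japanese system can then run `entry.raw.decode("cp932")` on its own, without the library having to guess.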

That's where the nightmare is, though -- from Microsoft's (and your vendor's) viewpoint, you have users downloading zip files made on random websites with random software. If the shell fails to open it, then they blame Microsoft -- not the site that wrote the bad file in the first place. Also, despite the fact that PkZip says that CP437 is "the correct" code page, version 4 was the first one that tried to convert anything down to 437 -- older versions just memcpy'd whatever the File I/O stuff returned.

The issue is that Zip is the most widely used archive format on the planet, and people depend on their old files being able to open with newer software. And so any time you're providing a general purpose Zip library, that library must be compatible with the down-level files. By contrast, System.IO.Packaging.ZipPackage only needs to be compliant with the ISO OPC spec, not with some PkZip 1.04-era file downloaded from a BBS that hasn't been in operation for over a decade.

Incidentally:

The GP11 UTF8 technique became standard in September of 2006, and so Vista would have been the first version of Windows that could use that (and that format has known back-compat issues).

The InfoZip extension became standard in September of 2007, and so that would be post-launch for Vista.

The Zip support in .NET 4.5 was being written (at one time) by an intern, who had a blog on MSDN talking about it (didn't bookmark it).

And yeah, I can definitely see using 7z, RAR, etc. instead of Zip if you have a choice.

Another real good way to get compatibility is to use InfoZip's DLL -- InfoZip has excellent compatibility with other archivers, and the license is compatible with most commercial software. For example, IBM uses it -- a lot.