Zipping up Unicode file PATHs

by Michael S. Kaplan, published on 2006/12/07 08:21 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/12/07/1232365.aspx

Now I have talked about the general issue with ZIP previously in Zipping up Unicode file names. The specific issue Kecia brings up, which arises if you have a user name outside the default system code page, is arguably much more worrisome.

As Mihai commented, there is room for future expansion in the ZIP format, which is good, although it seems like no one wants to bite the bullet and take on the backcompat problems (more on this another day).

I do have some good news, which was reported by Rostislav in response to Kecia's post:

Ok, so we are wimpy too about the backcompat thing. But note that this particular feature works around the problem that Kecia mentioned in relation to the off-CP_ACP user name (the change will work around any off-CP_ACP characters in the path, in fact!).

This post brought to you by "Ž" (U+017d, a.k.a. LATIN CAPITAL LETTER Z WITH CARON)

According to Appendix D of the current spec (http://www.pkware.com/business_and_developers/developer/popups/appnote.txt):

"The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encodings, or languages. To address this limitation, this specification will support the following change.

If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification. The Unicode Standard is published by the The Unicode Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files is expected to not include a byte order mark (BOM)."
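To make the bit 11 rule concrete, here is a minimal sketch in Python. The helper name decode_zip_name and the constant UTF8_NAME_FLAG are made up for illustration, and the round trip relies on my understanding that CPython's zipfile module sets bit 11 by itself when a name is not pure ASCII; treat that as an assumption about one implementation, not something the spec promises for every tool.

```python
import io
import zipfile

# General purpose bit 11, counting from bit 0 (hypothetical constant name).
UTF8_NAME_FLAG = 0x0800


def decode_zip_name(raw: bytes, flag_bits: int) -> str:
    """Decode a stored file name the way the quoted rule describes."""
    if flag_bits & UTF8_NAME_FLAG:
        return raw.decode("utf-8")
    # Bit 11 unset: fall back to the original ZIP encoding, IBM Code Page 437.
    return raw.decode("cp437")


# Round trip through the standard library with a name that needs the flag.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("\u017d.txt", b"hello")

with zipfile.ZipFile(buffer) as archive:
    info = archive.infolist()[0]

# Assumption: zipfile set bit 11 on write because the name is not ASCII.
print(info.filename, bool(info.flag_bits & UTF8_NAME_FLAG))
```

The same two bytes decode to very different names depending on the flag, which is why a reader has to check it before touching the name.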

Not a particularly clever decision, because older software not looking for GP bit 11 will write out a UTF-8 encoded filename as if it were CP437 or whatever the current codepage is, depending on whether it obeys the rule above.
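A small Python sketch of what that looks like in practice, assuming an older tool that ignores bit 11 and simply decodes the stored bytes with a legacy code page. The file name is just this post's sponsor letter; the variable names are illustrative.

```python
# The name as a UTF-8 aware tool would store it (bit 11 set).
stored = "\u017d.txt".encode("utf-8")

# An older tool that never looks at bit 11 decodes the same bytes
# with whatever legacy code page it assumes.
as_cp437 = stored.decode("cp437")
as_cp1252 = stored.decode("cp1252")

print(stored.hex(" "))  # the raw bytes of the stored name
print(as_cp437)         # what a CP437-only reader would show
print(as_cp1252)        # what a reader assuming Windows-1252 would show
```

Neither result contains "Ž"; the two UTF-8 bytes each become a separate, unrelated character.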

It then produces some drivel about '"modified-UTF-8" (JAVA) or UTF-8-MAC'. Wikipedia mentions these non-conformant non-UTFs at http://en.wikipedia.org/wiki/UTF-8#Java and http://en.wikipedia.org/wiki/UTF-8#Mac_OS_X. Clearly these do need to be indicated in the ZIP file or certainly handled by the unzipping tool as I'd guess that Java is already generating ZIP files with these malformed file names. The Java errors are basically due to a strict UTF-16 to UTF-8 conversion without understanding surrogate pairs; the OS X issue leads to problems like the one you mentioned at http://blogs.msdn.com/michkap/archive/2006/07/19/670674.aspx. However, it appears it's more a problem with the OS X file APIs - that filenames are supplied to the file system in sort-of NFD - than with the files themselves. Conversion to NFC might be necessary if the Windows file APIs are performing ordinal comparison and if the common dialogs present the filename in NFC.
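For the Java half of that, a rough Python sketch of the surrogate-pair behavior may help. The function name is hypothetical, and it only imitates the one quirk described above (each UTF-16 surrogate encoded as its own three-byte sequence); Java's modified UTF-8 also treats U+0000 specially, which is not modeled here.

```python
def encode_supplementary_like_java(ch: str) -> bytes:
    """Encode one character, splitting supplementary ones into surrogates first."""
    code_point = ord(ch)
    if code_point < 0x10000:
        return ch.encode("utf-8")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    # "surrogatepass" lets Python emit the otherwise forbidden sequences.
    return b"".join(chr(unit).encode("utf-8", "surrogatepass") for unit in (high, low))


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
standard = clef.encode("utf-8")
six_bytes = encode_supplementary_like_java(clef)

try:
    six_bytes.decode("utf-8")
    strict_decodes = True
except UnicodeDecodeError:
    strict_decodes = False

print(standard.hex(" "), "|", six_bytes.hex(" "), "|", strict_decodes)
```

A strict UTF-8 decoder rejects the six-byte form, so an unzipping tool would need to handle it deliberately rather than by accident.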

Don't know about .zip, but the .rar archive format is able to handle filenames with Unicode characters [different than the current Windows codepage]. I used it at the time I made several tests with the Romanian ș and ț characters inside filenames in XP and Vista.

An example is this http://www.secarica.ro/misc/abcd.rar or this http://www.secarica.ro/misc/abcd_no_caps.rar. The first example is interesting if extracted on XP, because there are two file pairs with the same filenames but different caps that can coexist in the same directory.

Cristi

Mike Dimmick, there's nothing invalid or non-conforming about the UTF-8 that OS X requires for file names. It's just that file names must be in the NFD variant and only the NFD variant. And it's a complete non-issue when dealing with file APIs, since paths are horrible and not recommended, and the APIs that don't take paths take UTF-16 file names (or file names in which the encoding of the bytes in the string is purely an implementation issue).
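To illustrate the NFD point, here is a short Python sketch using the standard unicodedata module. It uses plain Unicode NFD, which is only an approximation of what OS X actually stores (the "sort-of NFD" mentioned earlier), and the file name is an arbitrary example.

```python
import unicodedata

nfc_name = "\u017d.txt"                             # precomposed Z WITH CARON
nfd_name = unicodedata.normalize("NFD", nfc_name)   # "Z" + COMBINING CARON

# Both forms are perfectly valid UTF-8; they are just different bytes.
print(nfc_name.encode("utf-8").hex(" "))
print(nfd_name.encode("utf-8").hex(" "))

# An ordinal (code unit by code unit) comparison sees two different names...
print(nfc_name == nfd_name)
# ...until both sides are brought to the same normalization form.
print(unicodedata.normalize("NFC", nfd_name) == nfc_name)
```

So the bytes are well-formed either way; the mismatch only shows up when one side compares names ordinally without normalizing first.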

I absolutely agree that WinZip should fix this immediately. It's quite true that they keep releasing new versions without significant advances, whilst ignoring something as fundamental (to me) as Unicode support. It baffles me slightly that applications are still being released without this. Monoglot programmers are, in fact, generally pretty ignorant of the whole Unicode issue.

My particular problem is that a backup program I use (SecondCopy) uses the zip format to archive backed-up files. As a result, Unicode filenames either don't get copied, or end up "lost" inside the zip file with screwed-up names. This is obviously extremely dangerous in backup software.

I'm now looking for alternative backup software.

"Don't use Unicode filenames" is simply no longer acceptable.