Zipping up Unicode file PATHs

by Michael S. Kaplan, published on 2006/12/07 08:21 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/12/07/1232365.aspx


Kecia asked:

We understand there is a known issue with winzip where it doesn’t work if certain Unicode characters are in the extract path (since the path to the user’s temp directory contains the user name, this issue occurs when a user name contains the offending Unicode characters).

  • Has anyone shipped with this issue (vs. opting to use winrar or some other app)?
  • If you did ship with it, how bad was the issue out in the field (how many people see it, call PSS, etc).
  • Does anyone know a way to work around it?

Now I have talked about the general issue with ZIP previously in Zipping up Unicode file names. The specific issue Kecia brings up, where the user name contains characters outside the default system code page, is arguably much more worrisome.

As Mihai commented, there is room for future expansion in the ZIP format, which is good, although it seems like no one wants to bite the bullet and take on the backcompat problems (more on this another day).

I do have some good news, which was reported by Rostislav in response to Kecia's post:

 "compressed folders" in XP have this limitation too. Due to the feedback, we've fixed a part of the problem in Vista - now you can extract to a multilingual path. Still you can't have files with multilingual names in the archive.
-Ros

Ok, so we are wimpy too about the backcompat thing. But note that this particular feature works around the problem that Kecia mentioned in relation to the off-CP_ACP user name (in fact, the change will work around any off-CP_ACP characters in the path!).
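To make "off-CP_ACP" a little more concrete, here is a minimal Python sketch of the kind of check a non-Unicode program implicitly fails. It assumes cp1252 as a stand-in for the system default code page; the helper name `fits_code_page` and both sample paths are hypothetical and not taken from WinZip or Windows.

```python
def fits_code_page(path: str, code_page: str = "cp1252") -> bool:
    """Return True if every character of path is representable in code_page."""
    try:
        path.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical temp directories; only the user name differs.
ascii_path = r"C:\Users\kecia\AppData\Local\Temp"
off_cp_path = r"C:\Users\Мария\AppData\Local\Temp"  # Cyrillic is not in cp1252

if __name__ == "__main__":
    for candidate in (ascii_path, off_cp_path):
        print(candidate, fits_code_page(candidate))
```

A program that only ever calls the non-Unicode file APIs can reach the first path but has no way to spell the second one, which is why the user name alone is enough to break extraction to the temp directory.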

Progress, I say. And I'll talk more about why/how in an upcoming post....

 

This post brought to you by "Ž" (U+017d, a.k.a. LATIN CAPITAL LETTER Z WITH CARON)


# Mike Dimmick on 7 Dec 2006 11:08 AM:

According to Appendix D of the current spec (http://www.pkware.com/business_and_developers/developer/popups/appnote.txt):

"The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437.  This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encodings, or languages. To address this limitation, this specification will support the following change.

If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding.  If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification.  The Unicode Standard is published by the The Unicode Consortium (www.unicode.org).  UTF-8 encoded data stored within ZIP files is expected to not include a byte order mark (BOM)."

Not a particularly clever decision, because older software not looking for GP bit 11 will write out a UTF-8 encoded filename as if it were CP437 or whatever the current codepage is, depending on whether it obeys the rule above.
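As a rough illustration of the bit 11 behavior described above, the following Python sketch uses the standard `zipfile` module to write one entry into an in-memory archive and read its general purpose flags back. It also shows what a reader that ignores the flag would display. The helper names `name_flags` and `misread_as_cp437` are made up for this example, and it says nothing about how any particular ZIP tool behaves.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name and comment are UTF-8


def name_flags(filename: str) -> int:
    """Write one entry to an in-memory archive and return its flag bits."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(filename, b"data")
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        return zf.infolist()[0].flag_bits


def misread_as_cp437(filename: str) -> str:
    """Show the name an old reader would see if it ignored bit 11."""
    return filename.encode("utf-8").decode("cp437")


if __name__ == "__main__":
    for name in ("plain.txt", "Ž.txt"):
        flagged = bool(name_flags(name) & UTF8_FLAG)
        print(name, "bit 11 set:", flagged, "| misread:", misread_as_cp437(name))
```

Python's writer only sets the flag when the name is not pure ASCII, so archives with plain names stay byte-compatible with older readers.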

It then produces some drivel about '"modified-UTF-8" (JAVA) or UTF-8-MAC'. Wikipedia mentions these non-conformant non-UTFs at http://en.wikipedia.org/wiki/UTF-8#Java and http://en.wikipedia.org/wiki/UTF-8#Mac_OS_X. Clearly these do need to be indicated in the ZIP file or certainly handled by the unzipping tool as I'd guess that Java is already generating ZIP files with these malformed file names. The Java errors are basically due to a strict UTF-16 to UTF-8 conversion without understanding surrogate pairs; the OS X issue leads to problems like the one you mentioned at http://blogs.msdn.com/michkap/archive/2006/07/19/670674.aspx. However, it appears it's more a problem with the OS X file APIs - that filenames are supplied to the file system in sort-of NFD - than with the files themselves. Conversion to NFC might be necessary if the Windows file APIs are performing ordinal comparison and if the common dialogs present the filename in NFC.
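Both effects mentioned here can be sketched in a few lines of Python. The first part compares the NFC and NFD forms of the same visible name, which differ under ordinal comparison. The second part encodes a supplementary character surrogate by surrogate, the way a conversion that does not understand surrogate pairs would. The function `surrogate_wise_utf8` is a hypothetical helper written for this illustration; it is not Java's actual encoder and it ignores modified UTF-8's special handling of U+0000.

```python
import unicodedata


def surrogate_wise_utf8(text: str) -> bytes:
    """Encode text as UTF-8, but treat each UTF-16 surrogate separately."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair and encode each half alone.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    nfc = unicodedata.normalize("NFC", "Ž.txt")
    nfd = unicodedata.normalize("NFD", "Ž.txt")
    print(len(nfc), len(nfd), nfc == nfd)  # same glyphs, different code units

    clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
    print(clef.encode("utf-8").hex(), surrogate_wise_utf8(clef).hex())
```

Normalizing both sides to the same form before comparing names avoids the NFC/NFD mismatch; the surrogate-wise bytes, on the other hand, are rejected by a strict UTF-8 decoder and need dedicated handling.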

# Mihai on 7 Dec 2006 1:42 PM:

Half nice :-)

Now, if we would also get some Unicode in msi ...

# Michael S. Kaplan on 7 Dec 2006 1:58 PM:

I agree. :-)

# Dean Harding on 7 Dec 2006 5:34 PM:

This particular bug has nothing to do with the ZIP file format; it's just a hangover from the fact that Winzip is a non-Unicode program. I'm not really sure why they keep issuing new versions of Winzip - there never seem to be any new features (or worth)...

# Cristian Secară on 7 Dec 2006 8:51 PM:

Don't know about .zip, but the .rar archive format is able to handle filenames with Unicode characters [different than the current Windows codepage]. I used it at the time I made several tests with the Romanian ș and ț characters inside filenames in XP and Vista.

An example is this http://www.secarica.ro/misc/abcd.rar or this http://www.secarica.ro/misc/abcd_no_caps.rar . The first example is interesting if extracted on XP, because there are two file pairs with same filenames but different caps that can coexist in same directory.

Cristi

# Rosyna on 7 Dec 2006 9:28 PM:

Mike Dimmick, there's nothing invalid or non-conforming about the UTF-8 that OS X requires for file names. It's just that file names must be in the NFD variant and only the NFD variant. And it's a complete non-issue when dealing with file APIs. Namely since paths are horrible and not recommended, and the APIs that don't take paths take UTF-16 file names (or file names in which the encoding of the bytes in the string is purely an implementation issue).

# David Pritchard on 4 Feb 2007 8:45 AM:

I absolutely agree that WinZip should fix this immediately. It's quite true that they keep releasing new versions without significant advances, whilst ignoring something as fundamental (to me) as Unicode support. It baffles me slightly that applications are still being released without this. Monoglot programmers are, in fact, generally pretty ignorant of the whole Unicode issue.

My particular problem is that a backup program I use (SecondCopy) uses the zip format to archive backed-up files. As a result, Unicode filenames either don't get copied, or end up "lost" inside the zip file with screwed-up names. This is obviously extremely dangerous in backup software.

I'm now looking for alternative backup software.

"Don't use Unicode filenames" is simply no longer acceptable.



referenced by

2012/01/04 If someone blathers on about how Windows supports Unicode, you can suggest they just ZIP it, if you like!

2008/05/13 WinZip, the [long awaited] Unicode edition!!!
