Zipping up Unicode file names

by Michael S. Kaplan, published on 2005/05/10 20:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/10/416181.aspx


Let's create the following filenames:

(they can be empty or have data in them)

And then try to zip them up with your favorite program (I'll use WinZip, you can use anything you like here).

The zip will fail, in the case of WinZip with the following error:

---------------------------
WinZip
---------------------------
Error: No files were found for this action that match your criteria - nothing to do. (C:\TEMP\TEMP.zip).
---------------------------
OK   Help  
---------------------------

And then if you choose to look at the error log you will see why you had zero files instead of the four you asked it to zip up:

Action: Add (and replace) files Include subfolders: yes Save full path: no
Include system and hidden files: yes
"C:\TEMP\aß?de???.txt" is not a valid file name and was skipped
"C:\TEMP\????????.txt" is not a valid file name and was skipped
"C:\TEMP\????????.txt" is not a valid file name and was skipped
Warning: name not matched: C:\TEMP\????????.txt
"C:\TEMP\aß?de???.txt" is not a valid file name and was skipped
"C:\TEMP\????????.txt" is not a valid file name and was skipped
"C:\TEMP\????????.txt" is not a valid file name and was skipped
Warning: name not matched: C:\TEMP\????????.txt
"C:\TEMP\???????.txt" is not a valid file name and was skipped
Warning: name not matched: C:\TEMP\???????.txt
"C:\TEMP\aß?de???.txt" is not a valid file name and was skipped
Warning: name not matched: C:\TEMP\aß?de???.txt
Error: No files were found for this action that match your criteria - nothing to do. (C:\TEMP\TEMP.zip)

Your mileage may vary if your default system code page supports one of these filenames, and those question marks with best fit mappings for the Greek names will probably give the clues as to what is going on here.

The ZIP format is fine with Unicode data in files, but is not so fine with the filenames themselves being off of the default system code page.

Curses, foiled again!
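The question marks in the log can be reproduced in miniature. The following is a minimal sketch, assuming the Greek name quoted later in the comments and code page 1252 as the system default; `to_code_page` is a hypothetical helper, not anything WinZip exposes. Python's codec replaces every unmappable character with "?" and does no best-fit mapping, so it will not reproduce the partial "aß?de???" result shown above.

```python
# Sketch: what a non-Unicode program ends up with when a file name
# is forced through a single-byte code page.
GREEK_NAME = "αβγδεζηθ.txt"  # assumed example name, taken from the comments


def to_code_page(name: str, code_page: str = "cp1252") -> str:
    """Round-trip a name through a code page, replacing what does not fit."""
    return name.encode(code_page, errors="replace").decode(code_page)


lossy = to_code_page(GREEK_NAME)
print(lossy)                # ????????.txt
print(lossy == GREEK_NAME)  # False: the converted name no longer matches the file
```

Since "?" is not a legal file name character on Windows, a name converted this way can never match a real file, which fits the "not a valid file name" and "name not matched" lines in the log.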

Now one could work around this by using the short file names, but this would have a negative impact on being able to use them in the ZIP file:

I think we need to have someone look into an extension to the ZIP format....

 

This post brought to you by "Ž" (U+017d, a.k.a. LATIN CAPITAL LETTER Z WITH CARON)
(which is unfortunately not a zippable file name character on most code pages)


# Maurits [MSFT] on 10 May 2005 6:45 PM:

Interesting - I've submitted this to WinZip.
Another question occurs to me.

Suppose John zips up a file containing some eight-bit characters in his code page.

Then he emails the file to Mischa who has a different code page.

Mischa extracts the file.

What does he see? Are the eight-bit characters in the file name just whatever those eight-bit numbers happen to map to in his code page?

I wonder if there's any kind of ../.. vulnerability possible here.

# Michael S. Kaplan on 10 May 2005 6:55 PM:

Yes, you are right -- the cross-codepage issues are kind of scary here -- since you could have files that you would be unable to open on some platforms? Yikes!

I don't think you will find the other kind of vulnerability here though -- the path separators and related characters are pretty firmly embedded in the ASCII range....

# Maurits [MSFT] on 10 May 2005 6:59 PM:

../.. is something to watch out for if Unicode filenames are allowed, though. It's easy to imagine a malicious zip file containing a "filename" of "../../../../../../boot.ini" - where the .'s or "/"'s are obfuscated with overlong Unicode escapes.

# Dean Harding on 10 May 2005 7:06 PM:

Plus, if you used the short filenames, you'd not be able to expand them back to their originals after unzipping them, making it kind of useless anyway (it'd be better to manually rename to something in English I reckon).

Perhaps an extension to WinZip (or whatever) could be to encode the Unicode characters using utf-7 or maybe even using the same scheme as IDNs or something? That way, you'd still be able to unzip with another program, just not with the friendly filenames.
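A rough sketch of the UTF-7 idea, using Python's built-in "utf-7" codec. The helper names `ascii_safe` and `restore` are hypothetical, and nothing here reflects how WinZip or any other archiver actually stores names.

```python
def ascii_safe(name: str) -> bytes:
    """Encode a Unicode file name into pure ASCII bytes using UTF-7."""
    return name.encode("utf-7")


def restore(stored: bytes) -> str:
    """Recover the original Unicode name from its UTF-7 form."""
    return stored.decode("utf-7")


stored = ascii_safe("αβγδεζηθ.txt")
print(stored)           # ASCII-only bytes, readable by any tool
print(restore(stored))  # the original Greek name
```

One caveat with this approach: UTF-7 uses a base64 alphabet that includes "/", so a real scheme would need to make sure an encoded name is never mistaken for a path.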

# Michael S. Kaplan on 10 May 2005 7:07 PM:

Hmmmm.... I am not sure if that would even work. Plus, there are no escapes that are legal here anyway? It currently fails any attempt to use the characters, and all of the cases I know of with relative paths start with the highest path, don't they?

# Michael S. Kaplan on 10 May 2005 7:10 PM:

Previous comment was to Maurits, not Dean. :-)

Dean, UTF-7 might actually do the trick here....

# MGrier on 10 May 2005 9:04 PM:

We noticed this problem when doing the packaging and deployment support for VS6.

The net result is that this is why _A_UTF8 was added to the fci/fdi headers for CABs; we just utf8 encoded the filenames in the CAB as it was already 8-bit clean. (I claim that the cab code owner added it at my request but maybe someone else beat me to the request...)

For ZIPs we had a similar issue but I believe that ZIPs for JARs were already specified as being either 7-bit clean or UTF-8 encoded already. I think we forced a fix in the Java Package Manager to deal with UTF-8 encoded class names but these still tended to have problems on Win9x since the underlying registry APIs couldn't deal with characters in key or value names not representable in the system's CP_ACP.

# Michael S. Kaplan on 10 May 2005 9:24 PM:

Yep, that's the problem -- now the next step is to solve it for everyone. :-)

# pb on 11 May 2005 2:15 AM:

Actually NTFS-conformance problems are very common with Win32 software. Many do not understand that NTFS is Unicode (and then I haven't mentioned ADS).

The scariest part is that most file managers are also completely ignorant of Unicode. This can be funny when you have an English locale and you try to copy a file with e.g. Cyrillic or just East European characters in its name.

Another typical area of problems is with email clients.

# Alexey Chernjayeff on 11 May 2005 6:49 AM:

I tried to compress these files with RAR and 7-Zip (http://www.rarsoft.com/ http://www.7-zip.org/)...
They work well; all files in both archives can be extracted with their original names. So it seems to be a problem with the zip format only.

# Michael S. Kaplan on 11 May 2005 8:19 AM:

Alexey, that is great news! None of those formats are in the default install of WinZip, but I believe they can be plugged in if you have them. Thanks for posting this.

# Maurits [MSFT] on 11 May 2005 11:05 AM:

There is unfortunately some ambiguity within Zip files as to the character set that the filename is stored in. Whenever a file can be stored in the OEM character set (which is the character set originally used by MS-DOS), WinZip does so for compatibility with other Zip utilities. In this case, it marks the Zip file as made by MS-DOS, since the original Zip utilities were DOS utilities that used the OEM character set for filenames.

One thing to note is that on NT-based systems such as Windows NT, Windows 2000, and Windows XP, file and folder names are stored as Unicode (a multi-byte character set that includes characters from various ANSI and OEM code pages). Such files and folders can only be processed by Unicode-aware applications. WinZip is not a Unicode-aware application at this time. For this reason, WinZip can only display and process characters that exist in your current codepage. If your current code page is English 1252, for example, you will not be able to see or process files that have Chinese characters. On the other hand, if your current code page is set to Traditional Chinese - 950, you should see and be able to process files with Chinese characters but you may not be able to see or process files and folders whose names contain Greek characters.

We are looking into how to handle this better in a future version of WinZip.

Please let me know if you have any further questions.

--Chuck Campbell, WinZip Technical Support
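The rule described in that reply (a name can be processed only if every character exists in the current code page) can be approximated with a small check. This is a sketch only: `can_represent` is a hypothetical helper, the file names are made up, and Python's codec tables stand in for the Windows code pages.

```python
def can_represent(name: str, code_page: str) -> bool:
    """Return True if every character of the name exists in the code page."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


checks = [
    ("αβγδεζηθ.txt", "cp1252"),  # Greek name, Western European code page
    ("αβγδεζηθ.txt", "cp1253"),  # Greek name, Greek code page
    ("中文.txt", "cp1252"),       # Chinese name, Western European code page
    ("中文.txt", "cp950"),        # Chinese name, Traditional Chinese code page
]
for name, cp in checks:
    print(cp, name, can_represent(name, cp))
```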

# Bob Burger on 11 May 2005 2:26 PM:

It works just fine in Mac OS X. The Mac uses UTF-8, which would be the ideal choice for encoding filenames in the existing zip format.

When I brought the zip file onto my Windows box, WinZip thought αβγδεζηθ.txt was Î±Î²Î³Î´ÎµÎ¶Î·Î¸.txt. You can see the UTF-8 encoding in there, mapped to 1252.

Too bad Windows doesn't have a UTF-8 API. It already has a UTF-16 API. I wish I could use the ANSI APIs but tell the system to use UTF-8 as my current encoding scheme.
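The effect Bob describes is easy to reproduce: take the UTF-8 bytes of the name and read them as if they were code page 1252. A minimal sketch, with `utf8_seen_as` as a hypothetical helper name:

```python
def utf8_seen_as(name: str, code_page: str = "cp1252") -> str:
    """Show how a UTF-8 encoded name looks when read in a single-byte code page."""
    return name.encode("utf-8").decode(code_page)


garbled = utf8_seen_as("αβγδεζηθ.txt")
print(garbled)  # Î±Î²Î³Î´ÎµÎ¶Î·Î¸.txt

# No information is lost here, so the original name can be recovered.
print(garbled.encode("cp1252").decode("utf-8"))  # αβγδεζηθ.txt
```

Each Greek letter becomes two bytes in UTF-8 (0xCE followed by 0xB1 through 0xB8), and each byte shows up as its own character in 1252.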

# Maurits [MSFT] on 11 May 2005 2:49 PM:

The Unicode exploit I was thinking of is if the "../" check is only placed against the raw string, and not the normalized form.

For example if the string is stored in UTF-8, a [c0][af] byte sequence will not trigger the path validator. On the other hand it may very well be normalized during the extraction routine - boom, a vulnerability.

Such vulnerabilities are not new to the Zip world - see http://secunia.com/advisories/8781 for an example of path interpretation vulnerabilities (though admittedly this one is not Unicode-related.)
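To make the [c0][af] case concrete, here is a sketch of the mismatch between a check on the raw bytes and a decoder that tolerates overlong forms. `lenient_decode` is a deliberately sloppy, hypothetical decoder written for this illustration; it is not taken from any real extraction tool. Python's own UTF-8 codec is strict and rejects the sequence.

```python
def lenient_decode(data: bytes) -> str:
    """Hypothetical sloppy UTF-8 decoder that accepts overlong 2-byte forms."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                                  # plain ASCII byte
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):
            # 2-byte form, with no check that the shortest encoding was used
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("sequence not handled by this sketch")
    return "".join(out)


def strict_rejects(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


raw = b"..\xc0\xaf..\xc0\xafboot.ini"
print(b"../" in raw)         # False: a check on the raw bytes sees nothing
print(lenient_decode(raw))   # ../../boot.ini
print(strict_rejects(raw))   # True: a strict decoder refuses the input
```

The safe order is to decode strictly first and validate the resulting path afterwards, so the check and the extraction routine look at the same string.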

# Mihai on 11 May 2005 3:52 PM:

The ZIP file format specs are here:
http://www.pkware.com/company/standards/appnote/
A quote:
"The current Header ID mappings defined by PKWARE are:
...
0x0008 Reserved for future Unicode file name data (PFS)"

So, it seems something is there already. But someone should bite the bullet and use it. Probably the backward-compatibility is a problem, but sooner or later someone should do it :-)
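For comparison, here is how Python's standard `zipfile` module handles such names today. Note the assumption: this sketch does not use the reserved 0x0008 header quoted above. As far as I know it relies on a different mechanism that appeared in a later revision of the specification, a general purpose flag bit (0x800) that marks the file name as UTF-8. `zip_roundtrip` is a hypothetical helper name.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: file name is UTF-8


def zip_roundtrip(names):
    """Write empty entries to an in-memory zip, then read back name and flag."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    with zipfile.ZipFile(buf) as zf:
        return [(info.filename, bool(info.flag_bits & UTF8_NAME_FLAG))
                for info in zf.infolist()]


for name, is_utf8 in zip_roundtrip(["plain.txt", "αβγδεζηθ.txt"]):
    print(name, is_utf8)
```

ASCII names are stored without the flag, so older tools read them as before; only names that need it are marked as UTF-8.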

# Dean Harding on 11 May 2005 9:48 PM:

The latest version of WinZip has a new compression algorithm that is not compatible with older versions or with other ZIP-compatible products. When you select it, it just says "files compressed with this scheme will not be able to be opened in older versions of WinZip or other products". So there's no reason why they couldn't do the same with Unicode file names.

Ah well, no point bitching about here, I guess :)

# Michael S. Kaplan on 11 May 2005 10:31 PM:

If I combine the info from Chuck Campbell obtained by Maurits, the info on the ZIP file format provided by Mihai, the info on the way backcompat has been treated in the past by WinZip, and the info from Alexey on other compression formats that support the names now, there are clearly all of the pieces in place to make this feature work in a future version of WinZip (if they choose to do it!).

# Serge Wautier on 12 May 2005 2:48 PM:

Of course WinZip et al can zip your Unicode data. But MichKa discovered that the names of zipped files cannot be Unicode!

referenced by

2012/01/04 If someone blathers on about how Windows supports Unicode, you can suggest they just ZIP it, if you like!

2008/05/13 WinZip, the [long awaited] Unicode edition!!!

2006/12/07 Zipping up Unicode file PATHs

2006/04/22 Unicode? Zip don't need no stinking Unicode!
