Unicode? Zip don't need no stinking Unicode!

by Michael S. Kaplan, published on 2006/04/22 13:48 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/04/22/581356.aspx


I have talked about the limitations in ZIP before in the post Zipping up Unicode file names, but Heath has pointed out a new and interesting wrinkle in the problem in his post Update for the Palm Treo 700w Available, with Problems.

Now Heath may seem to some to be some kind of lightning rod for Unicode Lame List stories, but he isn't -- he is just a smart developer who is finding himself thrown into bad software situations that he did not design....

In this case we see the biggest problem with not using Unicode -- the basic problem of deciding what code page to use. It is probably not so much that zipfldr.dll is specifically using cp437 and cp1252, it is that it is using CP_OEMCP and CP_ACP.
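To make the mismatch concrete, here is a minimal sketch in Python of what happens when a name is written under one code page and read back under the other. It assumes the ANSI code page is cp1252 and the OEM code page is cp437 (typical for a US English system), and the file name is a made-up stand-in, not the real name of the Palm update. It illustrates the general problem only; it is not what zipfldr.dll literally does.

```python
# Assumption: CP_ACP resolves to cp1252 and CP_OEMCP to cp437,
# as on a typical US English Windows system.
ANSI = "cp1252"
OEM = "cp437"

# Hypothetical file name containing TRADE MARK SIGN and REGISTERED SIGN.
name = "Treo\u2122 700w\u00ae Update.exe"

# Written with the ANSI code page: both signs fit in one byte each.
raw = name.encode(ANSI)

# Read back with the OEM code page: same bytes, different characters.
garbled = raw.decode(OEM)

# cp437 has no slot for either sign, so encoding directly to OEM fails.
try:
    name.encode(OEM)
    oem_can_encode = True
except UnicodeEncodeError:
    oem_can_encode = False

print(repr(raw))
print(garbled)
print(oem_can_encode)
```

The ASCII letters survive the trip untouched, which is exactly why this kind of bug goes unnoticed until someone puts a character above 0x7F in a file name.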

What causes such a mistake to not get noticed, though? I mean, it is pretty unnatural to be using both constants, isn't it?

As luck (or unluck) would have it, they are not. The problem starts with the Shell folks, who are using funky macros wrapped around funky shlwapi wrappers like SHAnsiToUnicode and SHUnicodeToAnsi. I call them funky because they are. They are also quite consistent in always using CP_ACP underneath.

And as for the rest of the problem, it looks like the CP_OEMCP is coming from the fact that it is a console app that is running things so that some of the translations are happening in this different context....

How smart is Palm feeling for putting ™ and ® in the filename, at this point? No wonder they took the update down. :-)

Clearly we'll need to see people using ASCII file names until people move up to Unicode. Code pages are just too damn confusing!

 

This post brought to you by "®" and "™" (U+00ae and U+2122, a.k.a. REGISTERED SIGN and TRADE MARK SIGN)


# Heath Stewart on 22 Apr 2006 4:20 PM:

I don't think it's that there's any console app involved - there shouldn't be, since zipfldr.dll is just a shell extension server DLL; I think it's just that they are interpreting the file names as DOS file names and using the OEM code page.

What else is interesting regarding code pages is that when I came to your page the (R) and (TM) glyphs weren't displaying correctly. I actually filed a bug last night on Community Server because the new post page uses ISO-8859-1 while the page content sends the Content-Type HTTP header with UTF-8. In this case, Internet Explorer apparently ignored the Content-Type header and automatically chose Windows 1252. I changed the encoding to UTF-8 and it appears correctly. How lame.

For my post you linked to I actually used the appropriate HTML entities where defined, and coded the entities myself otherwise. It's a big hassle to have to fix-up my HTML but hopefully I won't have to with a future update to Community Server.

PS: When I was younger I was also a lightning rod for major accidents. Before you know it, I'll have a white stripe of hair and go running every time a storm brews. (Anyone know the reference?)

# Heath Stewart on 22 Apr 2006 6:57 PM:

Oh yeah, and don't forget to add to the lame Unicode support you mentioned about my blog from http://blogs.msdn.com/michkap/archive/2005/10/08/478479.aspx. You should add that to the "Unicode Lame List" category, too.

# Michael S. Kaplan on 23 Apr 2006 1:56 AM:

Good idea on the post category for that other post. :-)

All I know is that the DLL does not ever use the OEMCP -- so someone is putting that particular interpretation on it....

I can't repro the encoding problem of the pages, though -- everything seems to display fine here, for me at least. I wonder why?

# Mihai on 24 Apr 2006 12:56 PM:

<<In this case, Internet Explorer apparently ignored the Content-Type header and automatically chose Windows 1252.>>

A good reason to always add the meta "text/html; charset=utf-8" to web pages.
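For anyone wondering what the wrong guess looks like on screen, here is a small sketch in Python, assuming the page bytes really are UTF-8 and the browser decodes them as Windows-1252 instead. The sample text is just the two signs from the post, not the actual page content.

```python
# The page is served as UTF-8: each sign becomes a multi-byte sequence.
page_bytes = "\u00ae and \u2122".encode("utf-8")

# Decoded as declared, the signs come back intact.
as_utf8 = page_bytes.decode("utf-8")

# Decoded as Windows-1252, every byte turns into its own character.
as_1252 = page_bytes.decode("cp1252")

print(as_utf8)
print(as_1252)
```

Each sign turns into two or three unrelated characters, which is the usual telltale of UTF-8 being read as a single-byte code page.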

# Yuhong Bao on 12 Mar 2009 9:04 PM:

"All I know is that the DLL does not ever use the OEMCP"

Except it does, I looked at the zipfldr.dll imports and it imports OemToCharBuffA and CharToOemA.


referenced by

2008/05/13 WinZip, the [long awaited] Unicode edition!!!

2006/04/30 Sometimes, you have to keep it in ASCII
