MSI Databases and Unicode?

by Michael S. Kaplan, published on 2005/10/08 00:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/10/08/478479.aspx

You see, we wanted a nice easy setup that would install the keyboards people would create without making them buy a product.

I figured after all of the experience writing complex ACME setup scripts for the Office 97 Developer Edition Setup Wizard that I would be able to write a simple setup that has an easy job to do. And indeed it was -- I used the MSI API functions and managed to write something that would do the setups. Of course people who have Orca installed would regularly complain about the fact that it fails validation, but that is mainly because it is missing a validation table (of the 20 people who have mentioned the validation problem to me, only one person indicated that they noticed this).

Now MSKLC lets you create a keyboard description that can have in it any character in Unicode (minus a few things like the single quote character), so I ran headlong into the fact that the Windows Installer does not support Unicode.

Heath, that reminds me -- when is the Windows Installer going to support Unicode natively? And I don't mean UTF-7/UTF-8!

Until then, Heath's post about using code pages is a great reference to how to best handle the situation. Though I wish that best practices did not involve keeping everything in ASCII.

On the Windows platform (Win9x excluded, of course!), everything is UTF-16 LE, and anything that is not has to convert. Fewer conversions are always better, both for the sake of performance and to reduce the risk of data corruption (a risk that comes with every conversion, since any one of them can be done incorrectly).

Also, in most cases people who convert non-Unicode systems to UTF-8 do so in order to avoid going through the real update and review that is needed to make sure that the support is solid....

On OS X all the String stuff is UTF-16 (well, OK, CFString hides its internal format; it may or may not be UTF-16 depending on how it was created, for performance reasons), but all the low-level path stuff is UTF-8 since it has to remain POSIX compatible. However, higher-level types like FSRef (which reference files without using paths at all, as God intended) may have a completely human-unreadable path when it gets to the low-level stuff.

Ah, the wonders of abstraction, how I love thee.

Then again, I do like that Windows' max path name length is 32,767 characters compared to OS X's extremely, extremely lame limit of 1024 bytes (FSRefs do not suffer this limit as long as you never convert them to a path).

What is the Unicode version of CreateDirectory? The docs say "To extend this limit to 32,767 wide characters, call the Unicode version of the function and prepend "\\?\" to the path." but they don't link to the Unicode version of the function. BLAH!
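For what it's worth, the Unicode version is the "W" export, CreateDirectoryW; CreateDirectory itself is a macro that resolves to CreateDirectoryA or CreateDirectoryW depending on whether UNICODE is defined. Below is a minimal Python sketch of the "prepend \\?\" step the docs describe. The helper names `to_extended_path` and `create_directory_w` are hypothetical, the UNC handling assumes the usual `\\?\UNC\server\share` form, and the actual API call is only attempted on Windows.

```python
import ntpath
import sys

# The prefix that tells the wide-character file APIs to skip MAX_PATH parsing.
EXTENDED_PREFIX = "\\\\?\\"


def to_extended_path(path: str) -> str:
    """Return an absolute Windows path in extended-length form (hypothetical helper)."""
    if path.startswith(EXTENDED_PREFIX):
        return path  # already in extended-length form
    if not ntpath.isabs(path):
        # The prefix disables relative-path resolution, so require an absolute path.
        raise ValueError("extended-length paths must be absolute: %r" % path)
    # Normalize separators and '..' ourselves; the prefix turns off that parsing.
    normalized = ntpath.normpath(path)
    if normalized.startswith("\\\\"):
        # UNC path: \\server\share\dir becomes \\?\UNC\server\share\dir
        return EXTENDED_PREFIX + "UNC\\" + normalized[2:]
    return EXTENDED_PREFIX + normalized


def create_directory_w(path: str) -> None:
    """Call the Unicode CreateDirectoryW export through ctypes (Windows only)."""
    if sys.platform != "win32":
        raise OSError("CreateDirectoryW is only available on Windows")
    import ctypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateDirectoryW.argtypes = [ctypes.c_wchar_p, ctypes.c_void_p]
    kernel32.CreateDirectoryW.restype = ctypes.c_int
    if not kernel32.CreateDirectoryW(to_extended_path(path), None):
        raise ctypes.WinError(ctypes.get_last_error())


print(to_extended_path("C:\\Temp\\keyboards"))
```

Note that each component of the path still has its own length limit; the prefix only lifts the limit on the path as a whole.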

I am not sure that is really needed though, is it? I mean, given that every way into querying and modifying the database is Unicode already, storing as Unicode would be just as easily possible as not converting except when working on Win9x and calling the OS....

In other words, there are two ways to keep up Win9x support -- one that sacrifices functionality on newer platforms, and one that does not. I wish they would choose the one that does not....

This is one of my long-time grievances.

Code pages are not enough. How is one supposed to create a Hindi installer?
And even if UTF-8 is supported, it does not help because:
- It is not documented
- It is not supported by most tools (I tried InstallShield, Visual Studio Installer 1.1, and WiX (Windows Installer XML))
- The cab file format does not support any kind of Unicode, so localized file names are beyond reach. I know it is not recommended, but if a customer wants it, sometimes the answer "you have to change your code" is not good enough.
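To make the Hindi point above concrete, here is a small Python sketch that tries to store a Devanagari string in each of the usual Windows ANSI code pages. The list of code pages and the helper name `code_pages_that_can_store` are assumptions made for illustration, and Python's codecs stand in for whatever conversion a real authoring tool would do.

```python
# The ANSI code pages commonly used as Windows system code pages (assumed list).
WINDOWS_ANSI_CODE_PAGES = [
    "cp874", "cp932", "cp936", "cp949", "cp950",
    "cp1250", "cp1251", "cp1252", "cp1253", "cp1254",
    "cp1255", "cp1256", "cp1257", "cp1258",
]


def code_pages_that_can_store(text):
    """Return the code pages from the list that can encode every character of text."""
    usable = []
    for name in WINDOWS_ANSI_CODE_PAGES:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character has no slot in this code page
        usable.append(name)
    return usable


hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word "Hindi" in Devanagari
print(code_pages_that_can_store("Setup"))  # plain ASCII fits in every code page
print(code_pages_that_can_store(hindi))    # Devanagari fits in none of them
```

With no code page to pick, the only options left are a Unicode encoding or keeping the strings out of the database entirely.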

And backward compatibility should be achieved by going from Unicode to ANSI, not by losing functionality on the newest platforms. Especially since the non-Unicode platforms (W98/Me) are already in "Extended Support," which will end in less than one year (June 30, 2006).

Actually, the cabinet format specification says filenames are in UTF-8. In any case, you can rename files after extraction, and it's very common to see cabinets containing mangled filenames (the next-to-latest PSDK setup comes to mind).

> people who convert non-Unicode systems to UTF-8 do so in order to avoid going through the real update

Say what?

I've found that UTF8 is in many cases BETTER than UTF16.

Take any primarily-Latin-alphabet text document you may have lying around.

Make a UTF8 version.
Make a UTF16 version.

Check out the UTF16 version in a hex editor.

EVERY OTHER BYTE IS ZERO!

UTF8, on the other hand, is much more space-efficient for primarily Latin text.
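The hex editor experiment above can be reproduced in a few lines of Python. This is only a sketch: the sample strings and the helper name `encoded_sizes` are made up for illustration, and the second sample is there to show that the size advantage depends on the script.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8 and in UTF-16 LE (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
    }


latin = "Windows Installer"
utf16 = latin.encode("utf-16-le")

# For ASCII-range text, the high byte of every UTF-16 LE code unit is zero.
print(all(byte == 0 for byte in utf16[1::2]))

print(encoded_sizes(latin))

# Devanagari needs three bytes per character in UTF-8 but only two in UTF-16.
print(encoded_sizes("\u0939\u093f\u0928\u094d\u0926\u0940"))
```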

UTF16, it is true, is usually two-bytes-per-character. A comfortable rule. And the exceptions are rare enough that they can effectively be thought of as "freak cases."

UTF8, on the other hand, forces the issue much sooner. There are many much more frequently used characters that require multibyte encoding in UTF8.
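A quick way to see where each encoding stops being one unit per character is to count code units directly. The sketch below is illustrative only; the helper name `units_per_character` and the four sample characters are arbitrary choices.

```python
def units_per_character(ch):
    """Return (UTF-8 bytes, UTF-16 code units) needed for a single character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2  # two bytes per code unit
    return utf8_bytes, utf16_units


samples = {
    "LATIN CAPITAL LETTER A": "A",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "DEVANAGARI LETTER HA": "\u0939",
    "GRINNING FACE (outside the BMP)": "\U0001F600",
}

for name, ch in samples.items():
    print(name, units_per_character(ch))
```

Only the last sample, which lies outside the Basic Multilingual Plane, needs a surrogate pair in UTF-16, while UTF-8 goes multibyte as soon as the text leaves ASCII.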

This could be a GOOD thing. Embrace the challenge, rise to meet it.

Finally...

UTF-8 is DIRECTLY COMPATIBLE with iso-8859-1. If you have mostly-Western-European data, you can serve it as UTF-8. Your UTF-8 data consumers will be 100% fine. Your ISO-8859-1 data consumers will be 99% fine.

By contrast, if you serve it as UTF16, the ISO-8859-1 data consumers will have nasty \0's at every other character.
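Here is a rough Python sketch of what an ISO-8859-1 consumer would actually see in each case. The helper name `as_latin1_consumer` and the sample strings are hypothetical. One caveat worth stating: the byte-for-byte match holds only for the ASCII range, so accented characters are where the "99%" runs out.

```python
def as_latin1_consumer(data: bytes) -> str:
    """Decode bytes the way a consumer that assumes ISO-8859-1 would."""
    return data.decode("iso-8859-1")


ascii_text = "Setup complete"
accented = "caf\u00e9"  # "cafe" with an acute accent on the e

# ASCII-only text served as UTF-8 arrives unchanged.
print(as_latin1_consumer(ascii_text.encode("utf-8")) == ascii_text)

# A non-ASCII character served as UTF-8 shows up as two wrong characters.
print(repr(as_latin1_consumer(accented.encode("utf-8"))))

# The same ASCII text served as UTF-16 LE has a NUL after every character.
print(repr(as_latin1_consumer(ascii_text.encode("utf-16-le"))))
```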

Michael, it's true that MSIs are not CABs, but MSIs may contain, and MSMs and MSPs are extremely likely to contain, CAB files. The names in the CABs are the same as the File table primary keys. See my blog entry on what's in a patch at http://blogs.msdn.com/heaths/archive/2005/09/01/459561.aspx for more information.

Also, while it would be nice to see Unicode as the default, if it is even supported in MSI, MSI must be backward compatible, having started on both 9x and NT, though MSI 3.0 and newer will only run on NT. At some point hopefully this'll happen as 9x support is dropped. One can hope!

I must also agree with Maurits about size, and I do often profess that size does matter for patches. We want them to be as small as possible for the customers' benefit as well as to keep our bandwidth requirements down and our throughput high. CABs will compress data just fine, but a patch transform that contains binary data doesn't always compress well, or at all.

"Unofficially, MSI databases do support UTF-7 and UTF-8"

I have no problem with UTF-8, this is only to store some strings, not for processing.
Selecting UTF-8 vs UTF-16 is a long talk, full of "depends."
Backward compatibility is also a good reason.
So, I am not asking about dropping code pages. Just support Unicode officially, update the tools, update the doc, and let me choose case by case.

About the cabs (needed for MSMs, MSPs and MSTs), I went through the doc a while ago and found nothing about UTF-8, but I might be wrong.

Main point: just make Unicode officially supported, and test all tools with Unicode. That's all :-)

Two years have passed since and Windows Installer still lacks Unicode support. Is there any development still going on or are we really stuck with one of the key components of an operating system lagging that much behind its own environment?