Custom code pages?

by Michael S. Kaplan, published on 2006/07/05 04:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/07/05/656283.aspx

There is an old expression about a person being like a dog with a bone -- meaning that they really want to keep at something in particular, and they don't want to let it go.

We have something similar here in this blog, and it is regular reader Ivan Petrov, who is really looking for a way to extend code pages, and he has asked the question in various ways in the past. They are interesting questions, and the issues tied up in them are also interesting, though the answers are not always what he might like. :-(

As I have mentioned before, though, code pages are never going to be enough. The answer here, the only answer, is Unicode. Code pages that are not the "default system code pages" of some locale see very limited use, which makes most of the code pages on Windows of pretty limited utility anyway, so the ability to add more such code pages would be of even more limited use.

Certainly the migration of legacy data into Unicode is an interesting scenario, and one that in a past life as a consultant I have often worked on, providing short term and long term solutions. The need to do such transformations conformant to various national and industrial standards has existed for quite some time, and will likely continue to exist.

Hell, if you think about many of IBM's EBCDIC code pages, they were often provided by IBM to meet customer needs for such mappings. It is a business for everyone, really....

Does that make it important to support custom code pages on Windows at some point?

Well, maybe -- if a compelling case can be made for such data migration -- I have heard rumor that IBM has in the past produced EBCDIC code pages for individual customers, so clearly many companies have felt such pressures before.

More important than opening up other functionality that does not exist today on Windows, in areas where such opening up has already been happening, such as the work with fonts, keyboards, and locales?

Well, maybe not -- since mappings such as the ones on the IBM site or on Tex's are hardly out of the reach of developers today....

The big use that we all have for code pages is actually as repertoire definitions -- because in this world where it is hard to know what letters are important to a language, such a standard is a good indication of what someone felt was a reasonable subset!
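To make the "repertoire definition" idea concrete, here is a minimal Python sketch that treats a single-byte code page as a set of characters and checks a string against it. The helper names `repertoire` and `missing_from` are made up for this illustration, and it relies on the code page tables that ship with Python's codecs rather than on anything in Windows.

```python
def repertoire(codec: str) -> set:
    """Return the set of characters a single-byte codec can represent."""
    chars = set()
    for b in range(256):
        try:
            chars.add(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            pass  # this byte is unassigned in the code page
    return chars


def missing_from(text: str, codec: str) -> set:
    """Return the characters of text that fall outside the codec's repertoire."""
    return set(text) - repertoire(codec)


if __name__ == "__main__":
    # Russian text fits in Windows-1251; the letter U+04F8 does not.
    print(missing_from("Привет", "cp1251"))
    print(missing_from("\u04F8", "cp1251"))
```

Used this way, the code page answers the question "which letters did someone think mattered for this market?" without being used as an encoding for any actual data.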

In the meantime, the owner of encodings on Windows is likely to have thoughts on the issue (though his first answer to just about any question about code pages is Use Unicode!), so you can ask Shawn if you are curious. :-)

This post brought to you by Ӹ (U+04f8, a.k.a. CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS)

I don't know about Ivan specifically, but most people who want support for other character encodings are looking to process existing data. From that perspective "Use Unicode" may be hindsight, or even anachronism, and either way it is not a useful answer.

On other modern platforms the OS provides a comprehensive translation API; on Win32 the best on offer seems to be the MultiByte/WideChar family, which is pretty miserable even before considering that it is limited to a small subset of the existing encodings.

So, what's GIFT doing to deliver "industry-leading, high-quality, extensible APIs" in this space? What if anything will be delivered in Vista?

Hi Nick -- asked and answered, dude. No new support in Vista. But WideCharToMultiByte and MultiByteToWideChar are comprehensive in the support of codepages that have been used to support hundreds of thousands of apps run by millions of users of older versions of Windows....

Hi ML -- Nothing is wrong with UTF-8.

And nothing is wrong with UTF-32.

And nothing is wrong with UTF-16.

They are all forms of Unicode. And different platforms have different defaults; each one of those platforms has the right to choose that default without needing to re-choose in 15 years. :-)
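As a small illustration of "all forms of Unicode", the Python sketch below encodes one code point (the Ӹ from the end of the post) in each of the three forms and decodes it back. The big-endian variants are chosen here only so that no byte order mark appears in the output; that choice is an assumption of this example, not a statement about any platform's default.

```python
ch = "\u04F8"  # CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS

# Same code point, three encoding forms.
forms = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, raw in forms.items():
    # Each form round-trips to the identical character.
    assert raw.decode(name) == ch
    print(name, raw.hex())
```

The byte sequences differ (two, two, and four bytes respectively for this character), but the character they identify is the same in every case.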

> I have heard rumor that IBM has in the past produced EBCDIC code pages for individual customers

That's a very mainframe-ey thing to do. When your customer pays several million dollars for your solution, producing a custom encoding for them makes sense. But when you pay $100 (or whatever it is in US$) for Windows, it doesn't make so much sense.

And that's basically what Ivan is asking. I mean, like Michael says, for anything but data migration, you should use Unicode - and I don't think there will be many arguments against that. So that just leaves data migration. Now, if your "custom" codepage is a simple 1-1 mapping of bytes to Unicode codepoints, then you don't need anything in the OS to do it - it's pretty trivial with a bit of perl script or something. If it's more complex (for example, a multi-byte codepage) then there's nothing really "general" that the OS can provide anyway, because the codepage could be coded any old way at all...
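For the simple 1-1 case, a sketch of what such a script might look like, in Python rather than perl. Everything about the code page here is hypothetical: the name `CUSTOM_HIGH` and its three assignments are invented for the example, and a real migration would load the table from whatever mapping file the standard or vendor publishes.

```python
# Hypothetical single-byte code page: ASCII in 0x00-0x7F,
# plus a few invented assignments in the high half.
CUSTOM_HIGH = {
    0x80: "\u04F8",  # CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
    0x81: "\u04F9",  # CYRILLIC SMALL LETTER YERU WITH DIAERESIS
    0x82: "\u20AC",  # EURO SIGN
}

REPLACEMENT = "\uFFFD"  # used for bytes the code page leaves unassigned


def build_table(high: dict) -> list:
    """Build a 256-entry byte -> character table."""
    table = [chr(i) for i in range(0x80)]
    table += [high.get(i, REPLACEMENT) for i in range(0x80, 0x100)]
    return table


DECODE_TABLE = build_table(CUSTOM_HIGH)
ENCODE_TABLE = {c: i for i, c in enumerate(DECODE_TABLE) if c != REPLACEMENT}


def decode_custom(data: bytes) -> str:
    """Map each byte to its Unicode character via table lookup."""
    return "".join(DECODE_TABLE[b] for b in data)


def encode_custom(text: str) -> bytes:
    """Map characters back to bytes; raises KeyError outside the repertoire."""
    return bytes(ENCODE_TABLE[c] for c in text)


if __name__ == "__main__":
    legacy = b"Price: 5\x82, letter: \x80"
    print(decode_custom(legacy))
```

The whole conversion is a table lookup per byte, which is why nothing in the OS is needed for it. A multi-byte encoding would need real state handling instead, and no generic table format would cover every such design.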

Yes, it is possible to add custom codepages, and no, it isn't a good idea. The 50000 range is designed with custom codepages in mind. With very little reversing you can figure out that you need a DLL which exports a single function called NlsDllCodePageTranslation, whose prototype is documented indirectly. The drawback of 50000-range codepages is that they only work in Win32. The whole low level, such as filesystems and the kernel in general, will be completely oblivious to your custom codepage. Michael has already noted this about ISCII codepages (which are in the 50000 range).