On not being well served by the mantra "must support Unicode"

by Michael S. Kaplan, published on 2011/11/30 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/11/30/10242902.aspx

Yesterday, I was having an interesting conversation, one that has given me pause.

We were talking about Unicode, and the need for components in the OS to do a better job of directly embracing it.

This is obviously nothing new around these parts, but a new twist was interjected into the conversation.

You see, the components we were talking about were consistently calling the "W" suffixed Unicode functions all the time -- either explicitly or because Windows has been compiled with /DUNICODE and /D_UNICODE for years now.

However, at certain critical bottlenecks, they had two requirements added:

the characters had to be on the CP_ACP, and
the characters had to be on the CP_OEMCP

The net effect of these two requirements was a system default locale dependency, which resulted in many serious limitations, the most important of which were:

anywhere from hundreds to tens of thousands of characters required by national standards like JIS X 2004, HKSCS, and the all-important GB18030 were often blocked;
all of the "Unicode only" locales that could never be system locales were blocked completely.
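To make the bottleneck concrete, here is a minimal sketch in Python rather than Win32 code. It assumes a US English system locale, with Python's "cp1252" codec standing in for CP_ACP and "cp437" standing in for CP_OEMCP. The helper names fits_code_page and survives_bottleneck are hypothetical, made up for this illustration, and are not Windows APIs.

```python
# Illustration only: Python codecs stand in for the system code pages.
# Assumption: US English system locale, so CP_ACP ~ cp1252 and CP_OEMCP ~ cp437.

def fits_code_page(text, code_page):
    """Return True if every character in text can be encoded in code_page."""
    try:
        text.encode(code_page, errors="strict")
        return True
    except UnicodeEncodeError:
        return False


def survives_bottleneck(text, acp="cp1252", oemcp="cp437"):
    """Apply both requirements: the text must fit the ACP and the OEMCP."""
    return fits_code_page(text, acp) and fits_code_page(text, oemcp)


if __name__ == "__main__":
    # Hindi is one of the "Unicode only" locales: no ANSI or OEM code page
    # contains Devanagari, so it fails under every system locale.
    for sample in ["café", "€5", "日本語", "हिन्दी"]:
        print(repr(sample), survives_bottleneck(sample))
```

Switching both parameters to "cp932" (a Japanese system locale) lets the Japanese sample through, but the Hindi sample is still blocked, since no choice of system locale can help a Unicode-only locale.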

Now the fact that they were mistaken was not a surprise; people have been making this mistake for almost two decades.

The shocker (for me) was how long it took them to understand and accept that some characters were not on any ACP or OEMCP at all. And that "Unicode-only" locales, first introduced during the early betas of NT 5.0 (aka Windows 2000), even existed.

I first wrote "Code pages are really not enough...." and "Why ACP != OEMCP (usually)" over six years ago, but the very real consequences of the destruction of text by flags like ES_OEMCONVERT were simply new information to some people who would never blink at swearing by the need for Unicode support.
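A rough way to picture that destruction: ES_OEMCONVERT converts the text from the Windows character set to the OEM character set and back again. The sketch below only approximates that round trip with Python codecs. It assumes cp1252 and cp437 as the two code pages and substitutes "?" for anything that does not fit, whereas the real conversion uses best-fit mappings, so the actual damage on Windows will look different. The function name oem_round_trip is hypothetical.

```python
# Approximation only: real Windows conversions use best-fit mappings,
# while this sketch replaces every character that does not fit with "?".
# Assumption: CP_ACP ~ cp1252 and CP_OEMCP ~ cp437 (US English system locale).

def oem_round_trip(text, acp="cp1252", oemcp="cp437"):
    """Push text through ACP, then ACP -> OEMCP -> ACP, and return the result."""
    # First bottleneck: only characters on the ANSI code page get through.
    ansi_text = text.encode(acp, errors="replace").decode(acp)
    # Second bottleneck: convert to the OEM code page and back again.
    return ansi_text.encode(oemcp, errors="replace").decode(oemcp)


if __name__ == "__main__":
    for sample in ["naïve", "€uro", "日本"]:
        print(repr(sample), "->", repr(oem_round_trip(sample)))
```

Text that sits on both code pages comes back unchanged, which is exactly why this kind of bug can go unnoticed for so long on a developer's own machine.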

So I've decided that the mantra of making sure components "must support Unicode" is insufficient.

I'll need to make sure it's always clear that a system default locale/CP_ACP/CP_OEMCP dependency is just as bad, and perhaps even worse. Because removing a code page dependency can be more involved than just compiling the code differently.

Sometimes a lot more involved....

MGetz on 30 Nov 2011 7:55 AM:

This would explain why many core OS components don't support anything outside of the BMP, and why there has never been any serious movement to change that (something I've found odd for years now...).

Michael S. Kaplan on 30 Nov 2011 8:22 AM:

Not necessarily -- since "Unicode" usually means UTF-16 for MS platforms, you can fit supplementary characters just fine. The bug I talk about here in this blog blocks many things, including characters more common than "outside the BMP" ones....
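A small sketch of the distinction being drawn here, in Python and purely for illustration: a supplementary character fits in UTF-16 as a surrogate pair, while a perfectly ordinary BMP character can still be absent from a code page. The helper names are hypothetical, and cp1252 is only an assumed example of an ACP.

```python
# Illustration only: UTF-16 holds supplementary characters as surrogate pairs,
# while a code page can reject even common BMP characters.

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def on_code_page(ch, code_page="cp1252"):
    """Return True if ch can be encoded in code_page (cp1252 is an assumption)."""
    try:
        ch.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    supplementary = "\U00020000"  # CJK ideograph outside the BMP
    devanagari_ha = "\u0939"      # BMP character used in Hindi
    print([hex(u) for u in utf16_code_units(supplementary)])
    print([hex(u) for u in utf16_code_units(devanagari_ha)], on_code_page(devanagari_ha))
```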

Mihai on 30 Nov 2011 10:08 PM:

In my book, the two requirements almost negated the whole effort to be Unicode. I often compare "Unicode support" to a pipe: you might have a 1000-mile pipe that can (in theory) move 1000 gallons of water per second, but if you have one single one-foot bottleneck that can only pass 20 gallons per second, then your pipe is not a 1000 gallons per second pipe, because (as end user) I will never be able to get that flow rate.

Yuhong Bao on 4 Dec 2011 5:05 PM:

Reminds me of this:

bugzilla.mozilla.org/show_bug.cgi

Michael S. Kaplan on 5 Dec 2011 6:55 AM:

Mihai -- I agree with you to a point, though it doesn't negate all of the support that is allowed for the full range of Unicode....
