What do you get when you put a Hebrew on top of a Russian? (aka What lies beneath can bite you on the ass)

by Michael S. Kaplan, published on 2008/10/01 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/10/01/8971160.aspx

So it was actually a couple of days ago (in the blog titled What a tangled web we weave when a KLID from an HKL we must receive) that I painted a picture that would cause any normal, sane developer charged with working with keyboard layouts to either run out of the room screaming or to collapse into despair, convinced that Microsoft designed all of this exclusively to make their lives more difficult.

Lucky for us that there are so few normal, sane developers around these parts! :-)

Now as a responsible blog author, I really need to inject some calm into the situation.

I mean, just because I'm unofficial doesn't mean I'm supposed to hit the fire alarm, after all....

With that said, the blog you are reading now is not going to reassure; that will happen in a blog later this week.

For now I am going to turn up the heat.

Did you know that every time you add a keyboard layout to Windows, it stores a bunch of information?

Let's tabulate some of it:

The LANGID (either inherited from the KLID or the one you choose) that will usually provide both the big and abbreviated strings that will go in the language bar (the LOCALE_SLANGUAGE and the first two letters of the LOCALE_SABBREVLANGNAME);
The KLID (Keyboard Layout Identifier) that points to the registry key that will get other stuff, like the DLL name and the string that identifies the actual keyboard layout;
The ANSI code page to associate with text input by the keyboard in non-Unicode Windows applications (gleaned from information in the fsCsb member of the FONTSIGNATURE from the locale in the LOWORD of the KLID);
The OEM code page to associate with text input by the keyboard in non-Unicode console applications (gleaned from information in the fsCsb member of the FONTSIGNATURE from the locale in the LOWORD of the KLID);
Some other stuff not relevant to the current discussion that I'll talk about some other time when it won't distract from the point I'm trying to make in a bit.
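To make the "language on top, KLID underneath" split concrete, here is a minimal sketch in Python. The helper names (`langid_from_klid`, `describe_input_language`) and the three-entry table are made up for illustration; a real system pulls the code pages from locale data rather than from a hard-coded dictionary, and this is not how WIN32K.SYS actually does it.

```python
# Illustrative table only: three locales and their default ANSI/OEM code pages.
# Windows gets these from locale data, not from a dict like this one.
CODE_PAGES_BY_LANGID = {
    0x0409: ("en-US", 1252, 437),
    0x0419: ("ru-RU", 1251, 866),
    0x040D: ("he-IL", 1255, 862),
}


def langid_from_klid(klid: str) -> int:
    """Return the LOWORD of an 8-hex-digit KLID string such as '00000419'."""
    if len(klid) != 8:
        raise ValueError("a KLID is written as exactly 8 hex digits")
    return int(klid, 16) & 0xFFFF


def describe_input_language(top_langid: int, klid: str) -> dict:
    """Hypothetical helper: pair the displayed language with the layout's code pages."""
    shown_as, _, _ = CODE_PAGES_BY_LANGID[top_langid]
    # The code pages come from the locale in the LOWORD of the KLID,
    # not from the language shown in the language bar.
    layout_locale, acp, oemcp = CODE_PAGES_BY_LANGID[langid_from_klid(klid)]
    return {
        "shown_as": shown_as,
        "layout_locale": layout_locale,
        "acp": acp,
        "oemcp": oemcp,
    }


# Hebrew on top of the Russian layout.
print(describe_input_language(0x040D, "00000419"))
```

Note that only the LOWORD matters here, so a KLID such as 00010409 still lands on 0x0409.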

Now you'll notice that on top of the keyboard is a language, and it shows up everywhere.

But underneath, there is this other language. And that is the one that seems to control a whole bunch of the underlying behavior of the keyboard layout.

This setup mostly makes sense -- after all, if you put the Hebrew language on top of the Cyrillic keyboard layout on a system with an en-US system locale, then you would not be very well-served by running your non-Unicode application's input through code page 1252 and your console applications through code page 437 (as the default system locale might recommend).

Just as you would not be well-served by running your non-Unicode application's input through code page 1255 and your console applications through code page 862 (as the user's chosen "language atop" might recommend).

The best locale to base the decision off is the one associated with the layout itself.
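A quick way to see why the choice matters is to take the bytes a Cyrillic layout would hand to a non-Unicode application and decode them with each of the three candidate ANSI code pages. This is only a sketch using Python's built-in codecs; the labels and the `decode_with_each` helper are invented for the example.

```python
# Bytes a non-Unicode app would receive for Russian text typed on a Cyrillic layout.
RAW = "привет".encode("cp1251")

# The three candidate ANSI code pages from the Hebrew-on-Russian scenario.
CANDIDATES = {
    "system locale (en-US)": "cp1252",
    "language atop (Hebrew)": "cp1255",
    "layout's locale (Russian)": "cp1251",
}


def decode_with_each(raw: bytes) -> dict:
    """Decode the same bytes under every candidate code page."""
    return {
        label: raw.decode(codec, errors="replace")
        for label, codec in CANDIDATES.items()
    }


for label, text in decode_with_each(RAW).items():
    print(f"{label:28} -> {text}")
```

Only the code page tied to the layout's own locale round-trips the text; the other two produce Western European or Hebrew letters from the very same bytes.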

What a wonderful coincidence that this is the actual design!

Uh oh, I was trying to make people less at ease in this blog. I need to point out why this is bad, not why it is good.

Well, remember the central message of What a tangled web we weave when a KLID from an HKL we must receive -- that it is hard to get a KLID if you have an HKL.

But all of this information inside the input language that does all of this work and makes all of these decisions is KLID-based!!!

And it is very hard to get, just like the KLID, only a little harder, since you need a little more information once you are exhausted from getting the KLID.

This means that if you have a non-Unicode legacy application and you need to convert the input to Unicode, you are not given the best information to do the conversion.

And then there has been the occasion that the data has been wrong, like in Vista prior to a hotfix and SP1, leading to the problems covered in Double Secret ANSI, part 1 (Somewhere between ANSI and Unicode) and Double Secret ANSI, part 2 (the brokenest one yet, sorry 'bout that!).

Why are USER (keyboards) and GDI (fonts) tied together here so tightly, you might wonder? Well, most of the code that does this work is in WIN32K.SYS, and these two components help each other out.

Now wait a minute -- as bad as that bug is, we fixed it. How is that supposed to cause unease?

Well, let's try this on for size....

The bug might not be 100% fixed just yet. (insert evil laugh here?)

And then there is the actual flaw to discuss.

Remember Understanding (and explaining) why English is everywhere from a couple of years back?

It explains why/how most locales have the regular old KBDUS.DLL vanilla US English keyboard available.

Do you see the problem yet?

I'll give you a hint -- the KLID is 00000409.

Second hint -- the underlying ANSI code page (ACP) is 1252.

Last hint -- the underlying OEM code page (OEMCP) is 437.

Now 437 is okay if you are in pure English, but it kind of sucks for dealing with almost any other language supported by code page 1252 (which is why most of the locales that have an ACP of 1252 have an OEMCP of 850 and why most of them wish they had an OEMCP of 858 -- so that their language would be supported well in the console).
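The gap between 437, 850 and 858 is easy to check with Python's built-in codecs. The sample strings and the `coverage` helper below are just assumptions picked for illustration, not an exhaustive comparison of the three code pages.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Sample strings chosen for illustration only.
SAMPLES = {
    "English": "resume",
    "Portuguese": "não",
    "price in euros": "5 €",
}
OEM_CODECS = ("cp437", "cp850", "cp858")


def coverage(samples=SAMPLES, codecs=OEM_CODECS) -> dict:
    """Map each sample label to the OEM code pages that can represent it."""
    return {
        label: [codec for codec in codecs if can_encode(text, codec)]
        for label, text in samples.items()
    }


for label, codecs in coverage().items():
    print(f"{label}: {codecs}")
```

Plain English fits everywhere, the Portuguese word needs 850 or 858, and the Euro sign needs 858, which reuses the slot (0xD5) that 850 gives to the dotless i.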

But the alternate keyboard layout that exists for most locales does not provide good support for even the locales that it could, in the console.

And there is no workaround for this other than using existing layouts or defining your own layouts based on a locale that would actually be useful.

Yuck.

And what about the fact that most IMEs and text based TIPs have the underlying 00000409 keyboard underneath them?

Double yuck.

Alright, that's enough for now. My heart is not in the task of trying to make people panic. I'll end this trilogy on a much happier note, soon. :-)

This blog brought to you by এ (U+098f, aka BENGALI LETTER E)

Yuhong Bao on 20 Nov 2010 2:17 PM:

"most of them wish the had an OEMCP of 858 -- so that their language would be supported well in the console"

Well, IBM, instead of creating codepage 858, decided to modify codepage 850 with the same modification. So depending on your version of PC-DOS or OS/2, the same character can be displayed as either a dotless i or a Euro sign.

Michael S. Kaplan on 20 Nov 2010 7:33 PM:

For those still working in PC-DOS or OS/2, that is....

Yuhong Bao on 2 Dec 2010 11:08 AM:

So why was codepage 858 never considered as an OEMCP for ANY locale? The real Turkish locale actually uses codepage 857, BTW.

Michael S. Kaplan on 2 Dec 2010 1:51 PM:

Because we don't ever change OEMCP values after they are set, and these were set before 858 was added to Windows setup.

Yuhong Bao on 2 Nov 2012 2:26 AM:

That is why I said ANY locale. I know there were plenty of new locales added in Win2000, for example.

