A bit about the WM_IME_CHAR message

by Michael S. Kaplan, published on 2006/01/24 10:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/24/516693.aspx

Ben Bryant asked in the microsoft.public.win32.programmer.international newsgroup:

I am getting mixed messages about what code page is given to a WM_IME_CHAR handler in an ANSI build. I would like to assume it's the default system locale code page (GetACP), but is there a keyboard input code page setting since you can change your keyboard input language without rebooting?

The WM_IME_CHAR message is (unlike the WM_UNICHAR message) not always going to be Unicode. Unfortunately, the code page is the default ANSI code page of the system, which is returned by the GetACP function. And the only real workaround for this is to use the Unicode messages or at least the Unicode IMM functions when using the IME.

Note that the functions support Unicode even on Win9x, as per the topic The Input Method Editor and Unicode:

Windows 98/Me, Windows NT/2000/XP: Windows supports a Unicode interface for the IME, in addition to the ANSI interface originally supported. Windows 98/Me supports all the Unicode functions except ImmIsUIMessage. Also, all the messages in Windows 98/Me are ANSI based. Since Windows 98/Me does not support Unicode messages, applications can use ImmGetCompositionString to receive Unicode characters from a Unicode-based IME on Windows 98/Me.

There are two issues involved with Unicode handling and the IME. One is that the Unicode versions of IME routines return the size of a buffer in bytes rather than 16-bit Unicode characters, and the other is the IME normally returns Unicode characters (rather than DBCS) in the WM_CHAR and WM_IME_CHAR messages.

Use RegisterClassW to cause the WM_CHAR and WM_IME_CHAR messages to return Unicode characters in the wParam parameter rather than DBCS characters. This is only available under Windows NT; it is stubbed out in Windows 95/98/Me.

Of course, running on the NT-based platforms, even if you deal with the mess of the non-Unicode version of WM_IME_CHAR, it is still better than the packed double message that WM_CHAR handling would receive -- that is a true nightmare.

But if you have a Unicode application, you will be much better off. Or even a non-Unicode application that ignores the content of WM_IME_CHAR and just uses it as a trigger to call ImmGetCompositionString....
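For the "use it as a trigger" approach, here is a minimal sketch of what the call might look like. The helper names ime_bytes_to_units and get_ime_result_string are made up for illustration; the only real API calls are ImmGetContext, ImmGetCompositionStringW, and ImmReleaseContext. Note that the function reports sizes in bytes, not characters, and that the returned string is not null-terminated. The Win32 part is guarded so the byte-count arithmetic can be compiled anywhere.

```c
#include <stddef.h>

/* Convert a byte count, as reported by ImmGetCompositionStringW, into a
   count of 16-bit UTF-16 code units. Error codes (negative values such as
   IMM_ERROR_NODATA) and zero both yield 0. */
size_t ime_bytes_to_units(long byte_count)
{
    if (byte_count <= 0)
        return 0;
    return (size_t)byte_count / 2;
}

#ifdef _WIN32
#include <windows.h>
#include <imm.h>
#include <stdlib.h>

/* Hypothetical helper: fetch the IME result string as UTF-16, even when
   the window itself is ANSI. Returns a malloc'ed, null-terminated buffer
   (caller frees) or NULL; *out_units receives the length in code units. */
WCHAR *get_ime_result_string(HWND hwnd, size_t *out_units)
{
    WCHAR *text = NULL;
    HIMC himc = ImmGetContext(hwnd);

    *out_units = 0;
    if (himc == NULL)
        return NULL;

    /* First call: NULL buffer, so the return value is the size in BYTES. */
    LONG bytes = ImmGetCompositionStringW(himc, GCS_RESULTSTR, NULL, 0);
    size_t units = ime_bytes_to_units(bytes);

    if (units > 0) {
        text = (WCHAR *)malloc((units + 1) * sizeof(WCHAR));
        if (text != NULL) {
            /* Second call: the buffer length is also given in bytes. */
            LONG got = ImmGetCompositionStringW(himc, GCS_RESULTSTR,
                                                text, (DWORD)bytes);
            units = ime_bytes_to_units(got);
            text[units] = L'\0';   /* the API does not terminate it */
            *out_units = units;
        }
    }

    ImmReleaseContext(hwnd, himc);
    return text;
}
#endif
```

You would call something like this from whichever message you pick as the trigger (WM_IME_COMPOSITION with GCS_RESULTSTR set in lParam is the usual place) and link against imm32.lib.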


This post brought to you by "U" (U+0055, a.k.a. LATIN CAPITAL LETTER U)

# Ben Bryant on 24 Jan 2006 6:26 PM:

Thanks, ur da man! I was just noticing how WM_UNICHAR gives a handy UTF-32 code point. I suppose you probably said this somewhere but does the wide WM_CHAR support supplementary code points?

# Michael S. Kaplan on 24 Jan 2006 9:39 PM:

No, it would come in as two separate UTF-16 WM_CHAR messages....
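To make that concrete, here is a rough sketch of how a handler could stitch the two code units back into one code point. The utf16_joiner type and utf16_joiner_feed function are hypothetical names for this example, not anything in the Windows headers; you would feed it the low 16 bits of wParam from each wide WM_CHAR. Unpaired surrogates are silently dropped in this sketch, which may not be the policy you want.

```c
#include <stdint.h>

/* State carried between messages: a high surrogate waiting for its
   partner, or 0 when nothing is pending. */
typedef struct {
    uint16_t pending_high;
} utf16_joiner;

/* Feed one UTF-16 code unit. Returns 1 and stores a complete code point
   in *out, or returns 0 when more input is needed (or when a stray low
   surrogate was dropped). */
int utf16_joiner_feed(utf16_joiner *j, uint16_t unit, uint32_t *out)
{
    if (unit >= 0xD800 && unit <= 0xDBFF) {
        /* High surrogate: remember it (replacing any stale one). */
        j->pending_high = unit;
        return 0;
    }

    if (unit >= 0xDC00 && unit <= 0xDFFF) {
        /* Low surrogate: only meaningful right after a high surrogate. */
        if (j->pending_high == 0)
            return 0;
        *out = 0x10000u
             + (((uint32_t)j->pending_high - 0xD800u) << 10)
             + ((uint32_t)unit - 0xDC00u);
        j->pending_high = 0;
        return 1;
    }

    /* Ordinary BMP code unit: discard any stale high surrogate. */
    j->pending_high = 0;
    *out = unit;
    return 1;
}
```

Start with a zero-initialized utf16_joiner per window (or per input stream) and keep it alive across messages.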

# Ben Bryant on 28 Jan 2006 11:12 AM:

Still having trouble believing it is the default ANSI code page of the system (ACP). I would expect the characters to be nulled out if they did not exist in the ACP but I am getting garble, so I am still guessing it is actually the WM_INPUTLANGCHANGE code page. Using an input language not supported by the ACP would suggest that in a non-Unicode program I need to handle WM_IME_COMPOSITION and call ImmGetCompositionStringW (N.B. the W on the end of that).

# Michael S. Kaplan on 28 Jan 2006 12:58 PM:

There is actually a great deal of overlap in lead bytes, and some glyphs appear in several of the code pages -- and thus you can easily just get garbled text in that case. It is the ACP that is used.

But I would definitely recommend calling the Unicode IME functions -- in fact I did recommend exactly that! :-)

# Ben Bryant on 28 Jan 2006 2:19 PM:

Thanks again for responding. Well, there is not much overlap in Arabic and Latin-1 outside of ASCII, but my customer is typing on an Arabic on-screen keyboard and getting European accented characters on US English Windows. If the IME service was doing a Unicode to ANSI conversion behind the scenes to supply WM_IME_CHAR I wouldn't get "overlap," I'd expect to get replacement chars (question marks) or nothing. At any rate, I do not see any reason to ever see garbled text even if there was overlap, so I do not see your point. In fact, the garble that I am seeing is the reason I think it is not the ACP, it is Arabic encoding which I am treating as the Windows-1252 ACP.

# Michael S. Kaplan on 28 Jan 2006 4:18 PM:

Unfortunately, two things happen:

1) In some cases, other characters fit on the other code page for those same bytes, and

2) In other cases, esp. on other EA code pages, you will get other ideographs a lot of the time....
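As a small illustration of the first point (and of the symptom described above), here is a sketch showing how one and the same byte reads as an Arabic letter under Windows-1256 and as an accented Latin letter under Windows-1252. It does not settle which code page the system hands to WM_IME_CHAR; it only shows why a mismatch produces plausible-looking garble rather than question marks. The two decode functions are made-up names and cover only a slice of each code page: bytes 0xA0..0xFF for Windows-1252 and bytes 0xC1..0xD6 for Windows-1256.

```c
#include <stdint.h>

/* Partial Windows-1252 decoder: bytes 0xA0..0xFF coincide with
   U+00A0..U+00FF. Returns 0 for bytes outside the covered slice. */
uint32_t decode_cp1252_partial(uint8_t b)
{
    return (b >= 0xA0) ? (uint32_t)b : 0;
}

/* Partial Windows-1256 decoder: bytes 0xC1..0xD6 map to the Arabic
   letters U+0621..U+0636. Returns 0 for bytes outside the covered slice. */
uint32_t decode_cp1256_partial(uint8_t b)
{
    if (b >= 0xC1 && b <= 0xD6)
        return 0x0621u + (uint32_t)(b - 0xC1);
    return 0;
}
```

For example, byte 0xC7 decodes to U+0627 (ARABIC LETTER ALEF) with the Windows-1256 slice and to U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) with the Windows-1252 slice.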

As usual, keeping it all Unicode is the best way to go. :-)
