Not spacing out on spacing forms of characters

by Michael S. Kaplan, published on 2007/05/18 09:43 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/05/18/2710448.aspx


Chris asks:

I have a question about ToUnicodeEx and its behavior when detecting dead-keys. It wants to put the spacing-version of the accent into the string buffer, but I am looking for a way to get the non-spacing equivalent. Is there a system call somewhere that I missed, or some mapping table that maps the spacing accents to their non-spacing equivalents?

Thanks in advance,

Chris

This question is an interesting one because the answer involves data that is for the most part not included in the actual dead keys.

Both ToUnicode and ToUnicodeEx work off the data that is in the keyboard layout. So when the docs say that a return value of -1 means

The specified virtual key is a dead-key character (accent or diacritic). This value is returned regardless of the keyboard layout, even if several characters have been typed and are stored in the keyboard state. If possible, even with Unicode keyboard layouts, the function has written a spacing version of the dead-key character to the buffer specified by pwszBuff. For example, the function writes the character SPACING ACUTE (0x00B4), rather than the character NON_SPACING ACUTE (0x0301).

the statement about spacing characters is true simply because the spacing character is the only data that is actually in the dead key chain.

(Of course, that passage also says some other weird things about the keyboard state, which I'll talk about another day!)

All hope is not lost, however.

Chris can actually use Unicode normalization to get the right answer.

You see, U+00b4 has a compatibility decomposition, which means that normalizing to either Form KC or KD will return U+0020 U+0301.
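As a quick illustration of that decomposition, here is a small sketch in Python, using the standard unicodedata module as a stand-in for NormalizeString (the choice of Python is an assumption for the example, not something the original question used). It normalizes U+00B4 under all four forms; only the compatibility forms should split it apart.

```python
import unicodedata

acute = "\u00b4"  # ACUTE ACCENT, the spacing form

# Normalize under each of the four Unicode normalization forms and
# collect the resulting code points.
results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, acute)
    results[form] = [f"U+{ord(c):04X}" for c in normalized]
    print(form, results[form])

# NFC  ['U+00B4']
# NFD  ['U+00B4']
# NFKC ['U+0020', 'U+0301']
# NFKD ['U+0020', 'U+0301']
```

The canonical forms (C and D) leave the character alone because the decomposition is tagged as a compatibility one, which is why Form KC or KD is needed here.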

So all you have to do is take the dead key value that is placed in the buffer when -1 is returned and call NormalizeString with Form KC or KD (or string.Normalize if you are in managed code), and if it returns a space followed by a character, then that character is the non-spacing version....

Now there are some cases where the dead key table (which by convention would usually be based on the spacing version, like U+00b4) is instead based on the non-spacing character from the start. This could either be a mistake in the keyboard layout authoring or it could be that no spacing version actually exists.

In a case like that, the normalizing call will return the original string -- so when the call does not appear to succeed, it may have in fact succeeded in its task of giving the non-spacing version.

And either way you end up with the non-spacing version....
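Putting both cases together, the whole approach might be sketched roughly like this. The sketch is in Python with unicodedata standing in for NormalizeString, and the helper name nonspacing_form is hypothetical; it assumes the input is the single character that ToUnicodeEx wrote to the buffer when it returned -1.

```python
import unicodedata

def nonspacing_form(dead_key_char: str) -> str:
    """Return the non-spacing (combining) form of a dead key character.

    Hypothetical helper: if the compatibility decomposition is a space
    followed by a combining mark, return that mark; otherwise return
    the input unchanged, as the fallback case above describes.
    """
    decomposed = unicodedata.normalize("NFKD", dead_key_char)
    if (len(decomposed) == 2
            and decomposed[0] == " "
            and unicodedata.combining(decomposed[1]) != 0):
        return decomposed[1]
    return dead_key_char

# Spacing ACUTE ACCENT maps to COMBINING ACUTE ACCENT.
print(f"U+{ord(nonspacing_form(chr(0x00B4))):04X}")  # U+0301

# A layout that already stores the combining mark gets it back as-is.
print(f"U+{ord(nonspacing_form(chr(0x0301))):04X}")  # U+0301
```

A caller that wants to be sure it really ended up with a combining mark could check unicodedata.combining on the result, since a character with no such decomposition would also come back unchanged.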

 

This post brought to you by ´ (U+00b4, a.k.a. ACUTE ACCENT)

