IMEs? They have it easy....

by Michael S. Kaplan, published on 2004/12/20 03:13 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/20/327248.aspx

Yes, I said it. IMEs (Input Method Editors) have it easy. And I will say that even though I have only ever built them myself from the samples in the Platform SDK or the ones already in Windows. Even though I have only ever really worked at building keyboards, which cover such a small fraction of characters compared to IMEs that it makes me look like the kid on the beach building sand castles next to Buckingham Palace.

Nevertheless, I will make the claim. Because I am not talking about the design or the development part of it (which obviously has just as much chance to be intuitive/useful or not as any other project involving user interface and user interaction). I am talking about it from a data management and data usage standpoint. And the questions that the data can answer.

With an IME, an attempt is made to take the small number of keys found on the regular keyboard and map them to up to some subset of the entire set of CJK ideographs, Kana, Bopomofo, Hangul, and Jamo characters in Unicode. The basis of the mapping varies, depending on language and user preference. It could be based on pronunciation, on the code point number, on count of strokes, or on radical. The IME offers its best match, and the user can then commit the choice and it will be entered into the application.

And here is where we get to the part that makes me (as the owner of collation support in Windows and the .NET Framework) jealous, that makes me say the IME folks have it easy.

Because if that first choice is not what the user was looking for, then they get a list of alternate candidates that meet the same criteria as that keystroke or set of keystrokes. The list can be ordered by some collected data that tells the IME which candidate is more likely to be the right choice.

Ameliorated are the problems I discussed a few days ago about ideographs that have more than one pronunciation, because they can all be there, an entry for each pronunciation.

Mitigated are the problems I mentioned about pronunciations that can apply to many different characters, because each additional character can show up in the candidate list.

And Gone is the need to answer the question of equality that is so central to the CompareString and LCMapString APIs, the CompareInfo and SortKey classes -- because the question is no longer "are they equal, my liege?" or "which is ordered first, sire?". Instead, it's "what's on the list, dude?"
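To make the "what's on the list?" question concrete, here is a minimal sketch of a candidate lookup. It is not how any Windows IME is implemented; the names `LEXICON`, `build_index`, and `candidates` are hypothetical, and the usage counts are made-up numbers standing in for whatever collected data a real IME would use. Note that 端 appears under two readings, one entry per pronunciation.

```python
from collections import defaultdict

# (reading, candidate, usage count). The counts are invented for illustration;
# a real IME would draw them from corpus data or the user's history.
LEXICON = [
    ("はし", "橋", 120),
    ("はし", "箸", 45),
    ("はし", "端", 80),
    ("はた", "端", 10),   # same character, second reading: just another entry
    ("あめ", "雨", 200),
    ("あめ", "飴", 30),
]


def build_index(lexicon):
    """Group candidates by reading so a lookup is a single dictionary access."""
    index = defaultdict(list)
    for reading, candidate, count in lexicon:
        index[reading].append((candidate, count))
    return index


def candidates(index, reading):
    """Return every candidate for a reading, most frequently used first."""
    entries = index.get(reading, [])
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return [candidate for candidate, _count in ranked]


if __name__ == "__main__":
    idx = build_index(LEXICON)
    print(candidates(idx, "はし"))
```

Nothing here ever has to decide whether two strings are equal or which one sorts first in any linguistic sense; it only has to produce a list and let the user pick.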

Of course I was immediately reminded by colleagues that this is only cooler when it is the question that one is wanting to ask. If Jessica needs an order for a list of strings or Wendy needs to answer the question of equality or Molly needs to build indexes for her database, then the question that the IME works so hard to answer is not nearly as cool.

I was also reminded of something else I know but sometimes forget (which is good, because the remembering part humbles me a bit -- there are many people who feel I need that!). A question's coolness cannot be judged solely by the ease or even the possibility of a good answer. Some of the coolest questions in the world do not have answers yet. Some have answers that seem much simpler or even dumber than the original questions. Some are brilliant even if the initial question seems at first glance to be dumb or trite.

So why am I jealous of the IME folks? Because getting a satisfactory answer to their question is a more tractable problem for Korean, Japanese, and Chinese (both Mandarin and Cantonese), when compared to the questions that the technologies I work on ask.

For me, a function that is smart enough to order multiple characters with the same pronunciation is easy -- I just plug in the rules for whatever mechanism acts as the tie-breaker. However, the function to take a character with multiple possible pronunciations and choose the "right" one for a pronunciation-based sort is a lot harder. Under current art, one needs to add one's own pronunciation data à la Ruby (or other annotation mechanism).
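A rough sketch of both halves of that paragraph, under loud assumptions: this is not the CompareString or CompareInfo implementation, `DEFAULT_READING`, `sort_key`, and `pronunciation_sort` are hypothetical names, the readings are toneless pinyin, and code point order is used as the tie-breaker purely because it needs no extra data. The easy part is the tie-break; the hard part shows up with 行, which can be read "xing" or "hang", so the caller has to supply the reading as an annotation or accept a default guess.

```python
# Hypothetical default readings (toneless pinyin), for illustration only.
# 行 gets a single default even though it has more than one reading.
DEFAULT_READING = {"马": "ma", "妈": "ma", "行": "xing"}


def sort_key(entry):
    """Build a sort key from (text, annotation).

    annotation is a caller-supplied reading (the Ruby-style data), or None
    to fall back on the default table, which may guess wrong.
    """
    text, annotation = entry
    if annotation is not None:
        reading = annotation
    else:
        # Characters missing from the table simply stand in for themselves.
        reading = "".join(DEFAULT_READING.get(ch, ch) for ch in text)
    # Primary key: the reading. Tie-breaker: code point order.
    return (reading, tuple(ord(ch) for ch in text))


def pronunciation_sort(entries):
    """Sort (text, annotation) pairs by reading, then by tie-breaker."""
    return sorted(entries, key=sort_key)


if __name__ == "__main__":
    entries = [("马", None), ("行", None), ("妈", None), ("行", "hang")]
    for text, annotation in pronunciation_sort(entries):
        print(text, annotation)
```

The two 行 entries land at opposite ends of the result solely because one of them carried an annotation; without it, the function has no way to know which reading was meant.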

Perhaps surrounding text could provide the context, if it exists -- but what if it does not? Also, a machine being able to choose the "right" pronunciation based on such context is really the first half of the machine translation problem -- to know how to treat the data, the machine must first be able to in some sense understand what is meant.

Are there answers? Well, not in Windows or the .NET Framework today. But there is an understanding of the desired functionality. There may even be thoughts about avenues of attempts at solution. One day.

But at the very least, I thought that this post might make a good quick introduction to the problem.

> With an IME, an attempt to take the small
> number of keys found on the regular keyboard

That's the theory. In Unix systems it did that, in most Japanese Windows systems it did that, and then things get more complicated. This is something that has been discussed many times in Usenet newsgroups and I think you would likely be aware, but this makes me wonder. You know that the regular keyboard has a character ; to the right of l, when shifted it is +, and to the right of that is :, and when shifted it is *. Etc. Japanese Windows 2000 and XP were hacked so that, even though they get installed thinking they have a US layout keyboard underneath the IME, they still work with the regular keyboard underneath the IME. But if the user ever changes a keyboard operation, then this hack gets lost, and then even Japanese Windows 2000 and XP suffer from the same problems as US Windows of all versions with real or "global" IMEs. Microsoft's meaning to the word "global" is US, because its "global" IMEs don't work right if the actual physical keyboard is regular or German or French, they only work if it's US. The US versions of Windows 2000 and XP include real IMEs but they suffer the same defect. In US Windows 2000 and XP (and in Japanese Windows 2000 and XP if the user has ever changed a keyboard setting) then the user has to replace the lower level keyboard driver instead of just setting the layout properly. And Microsoft made it really really hard to get to the necessary lower level keyboard driver. Once the user finds all the necessary steps, Microsoft even asserts that the Japanese 106 driver isn't compatible with the laptop's built-in keyboard.

> and map them to up to some subset of [...]
> characters in Unicode.

I wonder if all of Microsoft's IMEs operated in Unicode, including those for MS-DOS and Windows 95 etc. But IMEs for Unix traditionally didn't, because Unicode didn't even exist yet.

By the way if Unicode provided the same degree of backwards compatibility for US and European character encodings as it does for Japanese, I'll bet you wouldn't even have heard of Unicode. For you it isn't a bad joke, but if it had been done equitably then it would be one even for you. Sure we have to contend with Unicode, but a ton of existing databases are not going to be converted for it.

I do understand the difficulties to which you refer, but I think that for the most part these have much more to do with documentation than with functionality (since it is possible to get the layout that matches your hardware, it's just not explained very well how). Of course they only affect Japanese and to a lesser extent Korean, not all IME languages.

This whole discussion is of course a "hijacking" of what *this* post is about since I was really not talking about this type of issue in keyboards or IMEs or TIPs.

I will look into covering these issues in a future post.

As for the unicode backcompat, that clearly has nothing whatsoever to do with this topic and is politically charged enough that I'd rather not see it here just now, as we do not need to fight a Japanese-US war based on differing perceptions (also with the people in Japan who felt that JIS X 0213 should be treated as a repertoire and not as a code page?).

12/20/2004 6:03 PM Michael Kaplan

> (since it is possible to get the layout that
> matches your hardware, it's just not
> explained very well how)

For the "Global" IMEs for Windows 9x, it requires hacking the registry in a way that was discovered and posted to Usenet by someone I don't recall, and not explained at all by Microsoft.

For Windows 2000 and XP, I don't recall seeing any explanations, but even if there are any, EVEN MICROSOFT DID NOT FOLLOW THEM. Microsoft hacked the installs of Japanese Windows 2000 and Japanese Windows XP to work around that misdesign (which works for a while) without even being aware that there is a designed way to work around the misdesign.

> As for the unicode backcompat, that clearly
> has nothing whatsoever to do with this topic

Well, I intended it to point out the extent to which it was a misstatement to say that IMEs convert keyboard input to Unicode. IMEs were converting keyboard input to character sets before Unicode existed and they continue to target those character sets.

> and is politically charged enough

OK sure it became political, but it started out technical, because the fact was that all those tons of data already existed and continue to exist.

For your first set of issues, I believe you are mistaken on some points, but THIS IS NOT THE PLACE to vent about it. I will talk about it in a future post, where it can be on-topic.

All but the Win9x issue, about which I have neither knowledge nor opinion. :-)

It is not a mis-statement to talk about the intention of IMEs on Windows, which work exclusively with Unicode (since the OS does). Even a third party IME will end up using Unicode even if it is using a legacy code page, since the conversion always happens....

The final set of issues, your opinions on Han unification, I will not cover. You are welcome to do so in your own blog, but I am not interested in that type of discussion here, certainly not in hijacking unrelated threads! :-(