Why is there no pronunciation-based sort for Japanese?

by Michael S. Kaplan, published on 2004/12/27 03:03 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/27/332618.aspx

When people start looking at East Asian languages, they notice that most of the regions have a sort based on pronunciation: Korea has a sort based on the Hangul pronunciation of the Hangul and Hanja codepoints, Taiwan has one based on the pronunciation in Bopomofo order, and China has one based on the Pinyin pronunciation. They notice that there is one major region missing from this list -- Japan. They wonder why Japanese is not given the benefits of such a sort. Isn't the Japanese market important to Microsoft?

The short answer is that there is a pronunciation-based sort in Windows. Simply pass any Hiragana or Katakana to CompareString or LCMapString/LCMAP_SORTKEY and you will see everything collate properly. It even works in all locales; one does not even need to pass the Japanese LCID, 0x0411, to see it happen. The world is in the proper ｱアあｲイいｳウうｴエえｵオお order (in the traditional AIUEO order, Halfwidth Katakana followed by Fullwidth Katakana followed by Hiragana). What more could one want?
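To make that ordering concrete, here is a rough Python sketch that reproduces it for plain kana. It is not how Windows implements collation; it only imitates the result. The helper names (`kana_sort_key` and friends) are made up for illustration, and the sketch assumes simple kana input: it ignores small kana, voicing marks, the prolonged sound mark, and everything else a real collation table has to handle.

```python
import unicodedata

# Code point ranges (assumption: only these blocks matter for the sketch).
HALFWIDTH_KATAKANA = range(0xFF66, 0xFFA0)   # ｦ .. ﾟ
FULLWIDTH_KATAKANA = range(0x30A1, 0x30F7)   # ァ .. ヶ


def _script_rank(ch):
    """Tie-breaker: halfwidth katakana, then fullwidth katakana, then the rest."""
    code = ord(ch)
    if code in HALFWIDTH_KATAKANA:
        return 0
    if code in FULLWIDTH_KATAKANA:
        return 1
    return 2


def _to_hiragana(ch):
    """Fold fullwidth katakana onto hiragana (the blocks are 0x60 apart)."""
    code = ord(ch)
    if code in FULLWIDTH_KATAKANA:
        return chr(code - 0x60)
    return ch


def kana_sort_key(text):
    """Two-level key: the sound first, the script only as a tie-breaker."""
    # NFKC turns halfwidth katakana into fullwidth katakana.
    folded = unicodedata.normalize("NFKC", text)
    sounds = tuple(_to_hiragana(ch) for ch in folded)
    scripts = tuple(_script_rank(ch) for ch in text)
    return (sounds, scripts)


shuffled = list("あいうえおアイウエオｱｲｳｴｵ")
print("".join(sorted(shuffled, key=kana_sort_key)))
# -> ｱアあｲイいｳウうｴエえｵオお
```

The point of the two-level key is that the script difference only matters when the sounds are otherwise identical, which is what gives the interleaved order instead of three separate blocks.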

Of course the answer to that question is in the long answer -- people want to know how to get the Kanji (the Han ideographs) to sort in this order, too.

The answer to this question is that there is no such sort. To explain why, let's look at how the Korean/Chinese/Taiwanese regional sorts are done. In all three of them, there are often ideographs that have multiple potential pronunciations, based on context (just as exists in English for words like Polish the language versus polish the furniture cleaner). This would make pronunciation-based sorts impossible, except that the most common pronunciation is determined and that is the one used when multiple pronunciations exist.

Admittedly this is not a perfect solution, but short of a computer that can actually read the text, there is not much more that can be done (although I am sure one could imagine interesting dictionary-based ways to approximate things -- I have, and they fall under the heading of 'clever' even when they are not really practical).
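The "pick the most common pronunciation" approach described above can be sketched in a few lines of Python. Everything here is illustrative: the `READINGS` table is a tiny hypothetical one using toneless Pinyin, and which reading is listed first (and so treated as the most common) is my assumption, not data taken from any real sort.

```python
# Hypothetical table: each ideograph maps to its readings, with the one
# assumed to be most common listed first (toneless Pinyin for simplicity).
READINGS = {
    "行": ["xing", "hang"],
    "长": ["chang", "zhang"],
    "乐": ["le", "yue"],
    "银": ["yin"],
}


def default_reading(ch):
    """Return the first listed reading, or the character itself if unknown."""
    return READINGS.get(ch, [ch])[0]


def pronunciation_key(text):
    """Sort key built from one fixed reading per character, ignoring context."""
    return tuple(default_reading(ch) for ch in text)


print(sorted(["行", "长", "乐"], key=pronunciation_key))
# -> ['长', '乐', '行']

# The imperfection: 银行 (bank) is read "yin hang", but a context-free
# table still files 行 under its default reading.
print(pronunciation_key("银行"))
# -> ('yin', 'xing')
```

The second call shows the trade-off: the key is predictable, but it is wrong for any word where the less common reading is the one actually used.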

In Japanese, however, there are three different types of pronunciations, called readings (on, kun, and nanori), and individual Kanji can have one, two, or all three of these (and in most cases at least the first two). They can also have more than one of each! The third reading type (nanori) is for names, and there is in most cases no way to know what it is without being told (this is in fact how phonebooks work -- someone giving the pronunciation in Kana to the phone company or list creator).

Given all of that, there is no way to even guess what the most common pronunciation is, even if the data were available, without giving users results that seem wrong or confusing to them. Even though one could craft an algorithm that makes intelligent guesses at which type of reading is meant, there is no way to make something at least as likely to be correct as the sorts for the other East Asian languages, especially given that what is probably the most common need for such a sort (lists of names) would require a separate field for the pronunciation.

And this is indeed the best solution for such a situation -- a separate field containing the pronunciation. It works quite well, and I would encourage any application that wants to do a pronunciation-based sort to try this approach.
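As a minimal sketch of the separate-field idea, assuming each record simply carries a kana reading next to the Kanji name (the `Contact` class and its field names are hypothetical), the sort just uses the reading as its key:

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str     # as displayed, usually Kanji
    reading: str  # pronunciation in Hiragana, supplied by whoever entered it


contacts = [
    Contact("山田", "やまだ"),
    Contact("鈴木", "すずき"),
    Contact("田中", "たなか"),
]

# Sorting on the Kanji orders by code point, which says nothing about sound.
by_name = sorted(contacts, key=lambda c: c.name)
print([c.name for c in by_name])
# -> ['山田', '田中', '鈴木']

# Sorting on the reading gives the order a phonebook would use.
by_reading = sorted(contacts, key=lambda c: c.reading)
print([c.name for c in by_reading])
# -> ['鈴木', '田中', '山田']
```

Plain string comparison of Hiragana is only an approximation here; a real application would hand the reading field to a proper collation function rather than compare code points.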

In theory, this is something an application can do when a name is typed when the IME mode is based on pronunciation; this is the one time that the pronunciation information is present without it being queried separately -- during the composition phase. As far as I know, this is not something that is done right now (if I am mistaken feel free to let me know!). It would be exceedingly difficult to do with the IME APIs and Windows messages as they are (and it is nearly impossible in the .NET Framework since the appropriate events are not even exposed).

> In theory, this is something an application
> can do when a name

a name or anything else

> is typed when the IME mode is based on
> pronunciation; this is the one time that the
> pronunciation information is present without
> it being queried separately -- during the
> composition phase.

Agreed.

> As far as I know, this is not something that
> is done right now

I have read that in some cases it is possible to reconvert a Kanji string after mistakenly converting an undesired Kanji string with the same pronunciation. I've never been able to do that myself. As a guess, it might be something particular to Word 2002 and later plus IME 2002 and later, maybe or maybe not related to the "Natural" IME. But even when I had Word 2002 temporarily installed on a machine, I wasn't able to do reconversions myself. Also, if I correctly recall what I read, reconversion is no longer possible after the document is closed and reopened, because the pronunciation is not stored.

By the way some characters have more than 10 readings.

In both Windows XP and Windows Server 2003, the help for the Japanese IME includes keystrokes that can be used to revert a converted string or character to a reading, which is what I think you are talking about.

I was thinking of an application that would do this behind the scenes so if you type in a name it would get the pronunciation and stick it in a "reading" field. No extra work required from the user!

One of the things you will need to be aware of if you are going to use the IME input as a reading for the kanji is that you're going to have times when that input is not correct.

Sometimes for place and people names, or other words that don't want to come up nicely with the IME, it's easier to type in alternative readings for a kanji and select it from the list.

Take for example '真乗院', a minor temple in Tokyo. It doesn't come up in the IME when you input 'しんじょういん(shinjouin)', but typing in the readings for the individual kanji 'ma', 'noru' (delete the ru after you select the kanji) and 'in' does work. But now 'Manoin' would sort completely differently than 'Shinjouin.'

Little IME tricks like this could play havoc if you don't give people an opportunity to correct their kanji readings.

Chris -- I agree, by all means!

But a solution that put it in and let you override it is superior to one that forces you to always type it, even though it may well be what you just typed in order to enter the string in the first place.
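A rough sketch of that "fill it in, but let the user override it" idea, in Python. The `NameEntry` class is hypothetical, and the step that actually captures the typed kana from the IME is assumed rather than shown, since that is the hard part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameEntry:
    text: str                                # the converted string, e.g. Kanji
    captured_reading: str                    # kana typed during composition (capture not shown)
    corrected_reading: Optional[str] = None  # set only if the user fixes it

    @property
    def reading(self):
        """Prefer the user's correction; fall back to what the IME saw."""
        return self.corrected_reading or self.captured_reading


# The temple from the comment above, entered one kanji at a time.
temple = NameEntry("真乗院", "まのいん")
print(temple.reading)   # -> まのいん (would sort under "ma")

# The user notices and corrects the reading field.
temple.corrected_reading = "しんじょういん"
print(temple.reading)   # -> しんじょういん (now sorts under "shi")
```

Keeping the captured value and the correction as separate fields means the default costs the user nothing, while a wrong guess can still be fixed.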

Don't forget the fourth method of reading: "special" readings. These are basically arbitrary. For example, the characters for "one" and "day," 一日, can be, and not infrequently are, read using on-readings for both characters: "ichi nichi." In this case it expresses the idea of a length of time one day long. However, when you're using it to express the idea of "first day" (e.g., the first day of the month), you pronounce it "tsuitachi." This bears no relation to the on or kun readings of either character. Moreover, it's a single word; you can't, for example, assign the "tsui" to 一 and the "tachi" to 日; the reading is valid only for the two characters in combination.