by Michael S. Kaplan, published on 2004/12/27 03:03 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/27/332618.aspx
(From the Suggestion Box)
When people start looking at East Asian languages, they notice that most of the regions have a sort based on pronunciation: Korea has a sort based on the Hangul pronunciation of the Hangul and Hanja codepoints, Taiwan has one based on the pronunciation in Bopomofo order, and China has one based on the Pinyin pronunciation. They notice that there is one major region missing from this list -- Japan. They wonder why Japanese is not given the benefits of such a sort. Isn't the Japanese market important to Microsoft?
(I have been asked this very question, sometimes that very way, in email)
There are two answers to this question, one long and one short.
The short answer is that there is a pronunciation-based sort in Windows. Simply pass any Hiragana or Katakana to CompareString or LCMapString/LCMAP_SORTKEY and you will see everything collate properly. It even works in all locales; one does not even need to pass the Japanese LCID, 0x0411, to see it happen. The world is in the proper ｱアあｲイいｳウうｴエえｵオお order (in the traditional AIUEO order, Halfwidth Katakana followed by Fullwidth Katakana followed by Hiragana). What more could one want?
Of course the answer to that question is in the long answer -- people want to know how to get the Kanji (the Han ideographs) to sort in this order, too.
The answer to this question is that there is no such sort. To explain why, let's look at how the Korean/Chinese/Taiwanese regional sorts are done. In all three of them, there are often ideographs that have multiple potential pronunciations, based on context (just as happens in English with words like Polish the language versus polish the furniture cleaner). This would make pronunciation-based sorts impossible, except that the most common pronunciation is determined ahead of time, and that is the one used whenever multiple pronunciations exist.
Admittedly this is not a perfect solution, but short of a computer that can actually read the text, there is not much more that can be done (although I am sure one could imagine interesting dictionary-based ways to approximate things -- I have, and they fall under the heading of 'clever' even when they are not really practical).
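The "pick the most common pronunciation and always use it" approach described above can be sketched in a few lines. This is only an illustration, not how Windows builds its tables: the `PRIMARY_READING` table, the choice of which reading counts as "most common", the toneless Pinyin spellings, and both function names are assumptions made up for the example.

```python
# Hypothetical table: ONE chosen reading per ideograph, even when the
# character has several. The choices below are assumptions for illustration.
PRIMARY_READING = {
    "行": "xing",   # also read "hang" in other contexts
    "长": "chang",  # also read "zhang"
    "乐": "le",     # also read "yue"
    "白": "bai",
}

def reading_key(text):
    """Build a sort key from the single chosen reading of each character.

    Characters missing from the table fall back to themselves, so they
    still sort deterministically (by code point).
    """
    return tuple(PRIMARY_READING.get(ch, ch) for ch in text)

def sort_by_primary_reading(words):
    """Sort words by the chosen reading, ignoring context entirely."""
    return sorted(words, key=reading_key)

if __name__ == "__main__":
    print(sort_by_primary_reading(["行", "长", "乐", "白"]))
```

Note that 行 always sorts under "xing" here, even in a word where a reader would say "hang". That is exactly the imperfection mentioned above: the result is predictable, but it is not always what a person reading the text aloud would expect.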
Now let's look at the situation with Japanese.
There are three different types of pronunciations, called readings (on, kun, and nanori), and individual Kanji can have one, two, or all three of these (and in most cases at least the first two). They can also have more than one of each! The third reading type (nanori) is for names, and there is in most cases no way to know what it is without being told (this is in fact how phonebooks work -- someone gives the pronunciation in Kana to the phone company or list creator).
Given all of that, there is no way to even guess what the most common pronunciation is, even if the data were available, without giving users results that seem wrong or confusing to them. Even though one could craft an algorithm that makes intelligent guesses at which type of reading is meant, there is no way to make something at least as likely to be correct as the sorts for the other East Asian languages, especially given that what is probably the most common need for such a sort (lists of names) would require a separate field for the pronunciation.
And this is indeed the best solution for such a situation -- a separate field containing the pronunciation. It works quite well, and I would encourage any application that wants to do a pronunciation-based sort to try this method.
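As a rough sketch of the separate-field approach, the example below keeps the display name and its Kana reading side by side and sorts on the reading only. The `Contact` type, the `kana_key` helper, and the sample names are all hypothetical. The key is also just an approximation: it folds Halfwidth and Fullwidth Katakana to Hiragana and then compares by code point, which is not the same as the collation that CompareString performs on Windows (on Windows, one would presumably hand the reading field to CompareString or LCMapString instead).

```python
import unicodedata
from dataclasses import dataclass

@dataclass
class Contact:
    name: str     # display form, typically Kanji
    reading: str  # pronunciation in Kana, supplied by the person entering it

def kana_key(reading: str) -> str:
    """Approximate sort key: fold all Kana variants to Hiragana."""
    # NFKC turns Halfwidth Katakana into Fullwidth Katakana.
    folded = unicodedata.normalize("NFKC", reading)
    # Katakana U+30A1..U+30F6 sit exactly 0x60 above Hiragana U+3041..U+3096.
    return "".join(
        chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c
        for c in folded
    )

def sort_by_reading(contacts):
    """Sort on the reading field; the Kanji name plays no part in the order."""
    return sorted(contacts, key=lambda c: kana_key(c.reading))

if __name__ == "__main__":
    people = [
        Contact("山田", "やまだ"),
        Contact("東", "ひがし"),
        Contact("佐藤", "サトウ"),
        Contact("東", "あずま"),
    ]
    for person in sort_by_reading(people):
        print(person.name, person.reading)
```

The two 東 entries show why the field is needed: the written name is identical, yet one person is Higashi and the other Azuma, and only the supplied reading can put each in the right place.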
In theory, this is something an application can do when a name is typed when the IME mode is based on pronunciation; this is the one time that the pronunciation information is present without it being queried separately -- during the composition phase. As far as I know, this is not something that is done right now (if I am mistaken feel free to let me know!). It would be exceedingly difficult to do with the IME APIs and Windows messages as they are (and it is nearly impossible in the .NET Framework since the appropriate events are not even exposed).
This post brought to you by "ㄎ" (U+310e, a.k.a. BOPOMOFO LETTER K)