Supporting a pronunciation based sort for East Asian languages...

by Michael S. Kaplan, published on 2005/12/07 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/12/07/500827.aspx

It has been nearly a year since I asked the question "Why is there no pronunciation-based sort for Japanese?" somewhat rhetorically (since I answered the question!).

But since then I have gotten the question a few more times, as people ask how they can do it.

I kind of answered that in the earlier post:

Given all of that, there is no way to even guess what the most common pronunciation is, even if the data were available, without giving users results that seem wrong or confusing to them. Because even though one could craft an algorithm that could make intelligent guesses at which type of reading is meant, there is no way to make something at least as likely to be correct as the other East Asian languages, especially given that what is probably the most common need for such a sort (lists of names) would require a separate field for the pronunciation.

And this is indeed the best solution for such a situation -- a separate field containing the pronunciation. It works quite well, and I would encourage any application that wants to do a pronunciation-based sort to try this method.

In theory, this is something an application can do when a name is typed when the IME mode is based on pronunciation; this is the one time that the pronunciation information is present without it being queried separately -- during the composition phase. As far as I know, this is not something that is done right now (if I am mistaken feel free to let me know!). It would be exceedingly difficult to do with the IME APIs and Windows messages as they are (and it is nearly impossible in the .NET Framework since the appropriate events are not even exposed).

The best solution is right in there -- have a separate field with the pronunciation in it. That pronunciation may be:

Kana-based for Japanese
Bopomofo-based for Taiwan
Latin-script-based for PRC (for either Mandarin or Cantonese)
Hangul-based for Korea

And then you can do the sorting based on this alternate field rather than the display string.
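
A minimal sketch of that idea in Python might look like the following. The Contact type, its reading field, and the sort_by_reading helper are hypothetical names made up for illustration, and plain code point comparison of the kana stands in for a real locale-aware collation, so treat the ordering as approximate rather than as a proper Japanese sort.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    display: str  # the string shown to the user (kanji, in this example)
    reading: str  # hypothetical separate pronunciation field (kana here)


def sort_by_reading(contacts):
    """Order contacts by the pronunciation field, not the display string.

    Assumption: simple code point order on the reading is good enough for
    a demonstration. An entry with no reading falls back to its display
    string, and the display string also breaks ties between equal readings.
    """
    return sorted(contacts, key=lambda c: (c.reading or c.display, c.display))


contacts = [
    Contact("田中", "たなか"),
    Contact("鈴木", "すずき"),
    Contact("佐藤", "さとう"),
    # 東海林 can be read しょうじ or とうかいりん; the stored reading
    # is what decides where this entry lands in the list.
    Contact("東海林", "しょうじ"),
]

for c in sort_by_reading(contacts):
    print(c.display, c.reading)
```

The point of the sketch is only that the sort key comes from the stored reading, which the user supplied, rather than from any guess made from the display string.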

If one is using a pronunciation-based IME then this will seem inconvenient at times (after all, you may have just typed the same string to find the candidate you wanted!) and would definitely be inconvenient when you are actually typing Kana or Hangul, but there are also times when the pronunciation string may be very different, and then no duplication would be happening.

The real question that comes into play now is how visible the pronunciation string should be in a user interface.

Clearly, for the situation where a single Kanji/Han/Hanja ideograph has multiple well-known pronunciations, an ordered list likely does not need to include the pronunciation, since the context is probably sufficient without it.

In the case of Japanese names where a nanori reading may be completely unrelated to any of the generally known readings, having the pronunciations available and perhaps even visible is likely a lot more crucial.

As I think about address book type user interfaces, the issue of how best to intuitively place that information (and how to not have it around when it would not be useful) becomes interesting. Perhaps a furigana type solution would be the most intuitive for Japanese users?

Which had me wondering whether there were any languages outside of East Asia where a pronunciation sort would be used. Anyone know of any? :-)

This post brought to you by "ㄎ" (U+310e, a.k.a. BOPOMOFO LETTER K)

# anonymous on 7 Dec 2005 4:44 AM:

What about Excel? Excel has allowed storing the phonetic information behind a cell in Japanese since Excel 97. By default, Excel sorts using this phonetic information.

# Michael S. Kaplan on 7 Dec 2005 8:55 AM:

Very true -- and Excel of course is "heavy" enough that adding one more attribute to a cell for pronunciation is not so hard to do.

But they still have the same problem when it comes to the desired pronunciation not matching the one used to type in a name, and I have heard complaints about the visibility of the pronunciation in the past....

# Mihai on 7 Dec 2005 12:57 PM:

If you are talking about the UI of applications,
the UI used for MS Word index entries seems ok (when Japanese is enabled).

# Michael S. Kaplan on 7 Dec 2005 1:01 PM:

They do not use furigana, do they?

I wonder how that UI would work for the other East Asian languages.... or in Excel. Interesting, I was not thinking about the Word scenario here before.

# Mihai on 7 Dec 2005 2:08 PM:

Yes, they do use furigana (and the user has to type it :-).
And I guess it will also work for other languages, if Word knows how to sort the thing in the pronunciation field. In theory you could probably put numbers in there :-)

There are only two applications that I know can do this: MS Word (starting with Word 2000) and FrameMaker. In FrameMaker I have used this to sort Thai (unsupported). Save files as MIF (SGML), find all the index markers, sort them, then update the index markers with "pronunciation" (looking like "1234412" :-)
Then FrameMaker was able to generate properly sorted Thai indexes :-)

referenced by

2007/09/24 A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

2006/01/03 'Acceptable' Japanese sort order?
