And what about the Japanese (Unicode) sort?

by Michael S. Kaplan, published on 2004/12/14 23:53 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/14/307152.aspx

Although I got no public comments about it, seven different people contacted me privately (by email or via the "Contact" link) asking what the answer was to Andrea's question about the Japanese (Unicode) sort.

(I'm not sure why no one asked in the public comments. I must be very intimidating or something.)

It's about the same as the answer about Korean (but the Yen sign U+00a5 is used instead of the Won sign, for obvious reasons). We also move the HORIZONTAL BAR (U+2015) to sort by the KATAKANA-HIRAGANA PROLONGED SOUND MARK (U+30fc), for similarly unremarkable historical reasons. Otherwise the sort is identical to the default sort, a fact that makes it quite fundamentally useless for Japanese data.

In case anyone looks here without looking at "it", please do go look at "it". I do understand the codepoint of the single-byte yen sign in Japanese (non-Unicode) character sets. Regarding sorting, the number of meaningful Japanese sort orderings does not magically go down to 1 if you use Unicode encoding instead of more common character sets (ANSI code page 932 or other), and Windows does not have a sort ordering that would match my local phone book.

Yes, that is hopefully true, since the so-called Japanese Unicode sort was removed due to its uselessness.

The actual Japanese sort in Windows today is hopefully not in the phone book order, or that would be a messed up phone book. I will probably relay a conversation I had with a different person at a conference about THAT sort on another day. :-)

12/21/2004 12:38 PM Nick Lange

> does anyone have a url for a good japanese
> sort algorithm?

I doubt it very much. After posting that Windows doesn't have a sort ordering that would match my local phone book, I belatedly remembered that it isn't even possible to define a sort ordering that would match a phone book even without me being listed in it. And I belatedly remembered that I have even posted that fact in Raymond Chen's blog...

Anyway, there are a few standard sort orderings, but all of them are unsuitable for use in human displays. They are only suitable for use in internal operations such as storing and retrieving keys in databases or hash tables or symbol tables and such things.

I once knew someone named Kanbe. The Kanji of his name were the same as for the city Kobe. Consider sorting the names Kanbe, Kimura, and Kobe. The exact same Kanji for Kanbe and Kobe must be listed both before and after Kimura. The only way to sort them properly is to also have the pronunciations recorded, use the pronunciations as the primary sort key, and use additional secondary keys including the actual names that are going to be displayed or printed.
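
The Kanbe/Kimura/Kobe case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` type and its `reading` and `display` fields are hypothetical, the Kanji spellings are filled in for the example, and hiragana code point order is used as a rough stand-in for kana order (a real collation would treat voiced marks, small kana and the like more carefully).

```python
from typing import NamedTuple


class Entry(NamedTuple):
    reading: str  # pronunciation in hiragana (hypothetical field)
    display: str  # the name as it will be displayed or printed


def sort_key(entry: Entry) -> tuple:
    # Primary key: the pronunciation. Secondary key: the displayed name,
    # so that entries with the same reading still order deterministically.
    return (entry.reading, entry.display)


entries = [
    Entry("こうべ", "神戸"),  # Kobe
    Entry("きむら", "木村"),  # Kimura
    Entry("かんべ", "神戸"),  # Kanbe, same Kanji as Kobe
]

ordered = sorted(entries, key=sort_key)
for e in ordered:
    print(e.reading, e.display)
```

Sorting on `display` alone could never produce this order, because the two 神戸 entries compare equal and so cannot sit on opposite sides of 木村.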

The first name of a former colleague is Yukie but someone read her name and called her Sachie (of course he really read and called her by full name, properly starting with her family name). There are thousands like this.

Note that the solutions currently used for Korean and Chinese are very much based on the model of "take the most common pronunciation." This solution is routinely rejected for Japanese, a point that I will actually be exploring in a future posting to the blog (cf: http://blogs.msdn.com/michkap/articles/271003.aspx#329682 ).

:-)

12/22/2004 12:25 AM Nick Lange

> my ketai has fields for both the reading and
> the kanji

Yes, so do databases and hand-written paper forms for all kinds of purposes etc. The printed phone book doesn't. (If you only have a keitai then I don't know if you're supposed to be entitled to a printed phone book.)

Sorry for picking on phone books so much. There are other situations too where furigana aren't printed but would have helped some readers if they had been printed, but if those situations require lists to be sorted then they often turn out to be phone books.