And what about the Japanese (Unicode) sort?

by Michael S. Kaplan, published on 2004/12/14 23:53 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/14/307152.aspx


Although I got no public comments about it, seven different people contacted me privately (by email or via the "Contact" link) asking me what the answer was to Andrea's question about the Japanese (Unicode) sort.

(I'm not sure why no one asked in the public comments. I must be very intimidating or something.)

It's not a very exciting answer, for what it's worth.

It's about the same as the answer about Korean (but the Yen sign U+00a5 is used instead of the Won sign, for obvious reasons). We also move the HORIZONTAL BAR (U+2015) to sort by the KATAKANA-HIRAGANA PROLONGED SOUND MARK (U+30fc), for similarly unremarkable historical reasons. Otherwise the sort is identical to the default sort, a fact that makes it quite fundamentally useless for Japanese data.


# Jake on 17 Dec 2004 8:31 AM:

I was not intimidated. I just did not want to try and make you answer the question if there was some political reason that you could not.

# Norman Diamond on 19 Dec 2004 9:47 PM:

In case anyone looks here without looking at "it", please do go look at "it". I do understand the codepoint of the single-byte yen sign in Japanese (non-Unicode) character sets. Regarding sorting, the number of meaningful Japanese sort orderings does not magically go down to 1 if you use Unicode encoding instead of more common character sets (ANSI code page 932 or other), and Windows does not have a sort ordering that would match my local phone book.

# Michael Kaplan on 19 Dec 2004 10:27 PM:

Yes, that is hopefully true, since the so-called Japanese Unicode sort was removed due to its uselessness.

The actual Japanese sort in Windows today is hopefully not in the phone book order, or that would be a messed up phone book. I will probably relay a conversation I had with a different person at a conference about THAT sort on another day. :-)

# Nick Lange on 21 Dec 2004 12:38 PM:

Ok, I'll bite.. does anyone have a url for a good japanese sort algorithm? I can imagine the pain of such an algorithm, having to look up readings first... but maybe someone has done said work already?

# Norman Diamond on 21 Dec 2004 5:03 PM:

12/21/2004 12:38 PM Nick Lange

> does anyone have a url for a good japanese
> sort algorithm?

I doubt it very much. After posting that Windows doesn't have a sort ordering that would match my local phone book, I belatedly remembered that it isn't even possible to define a sort ordering that would match a phone book even without me being listed in it. And I belatedly remembered that I have even posted that fact in Raymond Chen's blog...

Anyway, there are a few standard sort orderings, but all of them are unsuitable for use in human displays. They are only suitable for use in internal operations such as storing and retrieving keys in databases or hash tables or symbol tables and such things.

I once knew someone named Kanbe. The Kanji of his name were the same as for the city Kobe. Consider sorting the names Kanbe, Kimura, and Kobe. The exact same Kanji for Kanbe and Kobe must be listed both before and after Kimura. The only way to sort them properly is to also have the pronunciations recorded, use the pronunciations as the primary sort key, and use additional secondary keys including the actual names that are going to be displayed or printed.

The first name of a former colleague is Yukie but someone read her name and called her Sachie (of course he really read and called her by full name, properly starting with her family name). There are thousands like this.
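
The Kanbe / Kimura / Kobe case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` type and the hiragana readings are made up for the example, and plain code point comparison of hiragana is used as a rough stand-in for a real kana collation (it does not handle voiced marks, small kana, or katakana the way a proper Japanese sort would).

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical record: the reading is stored next to the displayed name."""
    reading: str  # pronunciation in hiragana (furigana)
    display: str  # the name as it will be shown or printed


entries = [
    Entry("こうべ", "神戸"),  # Kobe
    Entry("きむら", "木村"),  # Kimura
    Entry("かんべ", "神戸"),  # Kanbe, written with the same Kanji as Kobe
]


def sort_key(entry: Entry) -> tuple:
    # Primary key: the pronunciation. Secondary key: the displayed name.
    # Assumption: code point order is "good enough" for these three readings.
    return (entry.reading, entry.display)


by_reading = sorted(entries, key=sort_key)

# Sorting on the displayed Kanji alone cannot separate the two 神戸 entries,
# so they end up adjacent instead of on either side of 木村.
by_display = sorted(entries, key=lambda entry: entry.display)

for entry in by_reading:
    print(entry.reading, entry.display)
```

With the reading as the primary key, the same Kanji lands both before and after Kimura, which is exactly what a sort on the displayed text alone cannot do.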

# Nick Lange on 22 Dec 2004 12:25 AM:

Agreed, so I guess the best thing to do is just butcher the actual readings and just sort from the first match in a lookup. (joke)
Although my ketai has fields for both the reading and the kanji... probably how most systems here work.

# Michael Kaplan on 22 Dec 2004 12:46 AM:

Note that the solutions currently used for Korean and Chinese are very much based on the model of "take the most common pronunciation." This solution is routinely rejected for Japanese, a point that I will actually be exploring in a future posting to the blog (cf: http://blogs.msdn.com/michkap/articles/271003.aspx#329682 ).

:-)
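
For contrast, here is a minimal sketch of the "take the most common pronunciation" model applied to the names from the comment above. The lookup table is hypothetical and holds just two entries; it is not how Windows or any real collation table is built, and code point order again stands in for a real kana ordering.

```python
# Hypothetical table: exactly one "most common" reading per written form.
MOST_COMMON_READING = {
    "神戸": "こうべ",  # the city reading wins; the surname reading かんべ is lost
    "木村": "きむら",
}


def guessed_key(name: str) -> str:
    # Fall back to the written form itself when no reading is known.
    return MOST_COMMON_READING.get(name, name)


# Intended as Kanbe-san, Kimura-san, and Kobe, but the text alone cannot say so.
names = ["神戸", "木村", "神戸"]

guessed_order = sorted(names, key=guessed_key)
print(guessed_order)
```

Both 神戸 entries now sort after 木村, so the person named Kanbe is filed under Kobe. That is the kind of result that makes the single-reading approach hard to accept for Japanese names.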

# Norman Diamond on 22 Dec 2004 1:19 PM:

12/22/2004 12:25 AM Nick Lange

> my ketai has fields for both the reading and
> the kanji

Yes, so do databases and hand-written paper forms for all kinds of purposes etc. The printed phone book doesn't. (If you only have a keitai then I don't know if you're supposed to be entitled to a printed phone book.)

Sorry for picking on phone books so much. There are other situations too where furigana aren't printed but would have helped some readers if they had been printed, but if those situations require lists to be sorted then they often turn out to be phone books.

# Nick Lange on 22 Dec 2004 3:33 PM:

nice... After my current contract is up, I'd like to get into more multilingual programming projects... can't wait.

referenced by

2008/05/07 Four exceptions to prove the rule

2006/09/23 The modern solution to the problem of Traditional Spanish in Vista

2006/08/12 You think that's bad? Just wait, it gets worse...

2006/01/03 'Acceptable' Japanese sort order?

2005/12/28 Getting rid of your extra yen

2005/11/01 I WON to talk about the YEN

2005/10/12 I'd rather call it the path separator
