'Acceptable' Japanese sort order?

by Michael S. Kaplan, published on 2006/01/03 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/03/508579.aspx


Jeremy White asked me:

Hi there! I've been searching the web to find acceptable Japanese sort orders. I really couldn't find one, and your blog was the closest I could come to dealing with the issue. You might well want to make this a post -- I'm sure I'm not the only one who's come across it?

What is an acceptable Japanese sort order?

Katakana and Hiragana sorted in AIUEO order, interleaved with Katakana first.

Then Kanji, sorted in stroke/radical order. Or some sort of phonetic order?

Then Romaji.

Any sort of double or half-width attribute stripped out before sorting.

Does the Unicode sort produce acceptable results? I'm sure they're not ideal, but I'm wondering if they are acceptable (and perhaps what they are).

Thank you

I do not know of any Japanese sort that does exactly what is being suggested here. I have had many posts that talk about sorting Japanese, such as the following:

There are obviously problems coming up with a sort order that is intuitive for the average Japanese user, and some of the difficulties are only hinted at by the above posts. But I can talk about a few of the specific issues that Jeremy raises to explain why there are problems with some of them....

I have talked previously about the problems with building a pronunciation-based sort, and you can look at several of the links above.

A radical/stroke based sort is an interesting one to consider for some contexts (like dictionaries), but it may not be what people commonly expect on computers.

I do not know of any sort that puts Romaji last, since the Latin script does not usually come after Han ideographs, and there is really no way to distinguish between English and Romaji at the script level.

It is likely best to ignore neither width nor Kana distinctions -- they will be sorted properly by Windows, and if you pass the ignore flags then the ordering of strings that differ only in those ways would be arbitrary rather than deterministic....

Now the Unicode order will not be intuitive to anyone; it is arbitrary and has very little to do with the way humans would look at any language (including Japanese).
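To make that concrete, here is a minimal Python sketch of what raw code point order does to kana. The sample strings are my own illustration, and it relies only on the fact that Python compares strings by code point.

```python
# Hiragana, fullwidth katakana, and halfwidth katakana forms of A and I
# (illustrative sample data only).
samples = ["ア", "い", "ｱ", "あ", "イ", "ｲ"]

# Python's default string comparison is by code point, so sorted()
# shows the raw Unicode order rather than any linguistic order.
by_code_point = sorted(samples)

for s in by_code_point:
    print(f"U+{ord(s):04X} {s}")
```

All of the hiragana come out first, then all of the fullwidth katakana, then the halfwidth forms, so the three ways of writing A never end up next to each other.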

Of course the best results may happen by just passing MAKELCID(MAKELANGID(LANG_JAPANESE, SUBLANG_JAPANESE_JAPAN), SORT_JAPANESE_XJIS) and going from there? This has the advantage of at least sorting consistently with the rest of the Japanese sorting that is happening on the platform....
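For the curious, the arithmetic behind those macros can be sketched in Python. The constant values are the ones from the Windows SDK headers as I understand them, and the lowercase function names are just hypothetical stand-ins for the C macros, not a real API.

```python
# Values as defined in the Windows SDK headers (assumed here, not imported).
LANG_JAPANESE = 0x11
SUBLANG_JAPANESE_JAPAN = 0x01
SORT_JAPANESE_XJIS = 0x0


def make_lang_id(primary: int, sub: int) -> int:
    """Stand-in for MAKELANGID: sublanguage in the top 6 bits, primary in the low 10."""
    return ((sub << 10) | primary) & 0xFFFF


def make_lcid(lang_id: int, sort_id: int) -> int:
    """Stand-in for MAKELCID: sort ID placed above the 16-bit language ID."""
    return ((sort_id << 16) | lang_id) & 0xFFFFFFFF


lcid = make_lcid(
    make_lang_id(LANG_JAPANESE, SUBLANG_JAPANESE_JAPAN),
    SORT_JAPANESE_XJIS,
)
print(hex(lcid))
```

Since SORT_JAPANESE_XJIS is zero, the result is simply the familiar Japanese (Japan) locale identifier, 0x0411, which is the value you would hand to a function such as CompareString or LCMapString.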

 

This post brought to you by "ｼ" (U+ff7c, a.k.a. HALFWIDTH KATAKANA LETTER SI)


# Rosyna on 3 Jan 2006 6:54 AM:

Hmm, シ is SHI; it just happens to be in the same location as SI. Because of this you'll hear the Japanese say "Tackshi" for Taxi and that's just really, really weird.

But the sort order I see here is always phonetic. But that's the problem with Japanese, 月 and 月 are the same character, but I typed gatsu (month) to get the first and tsuki (moon) to get the second. This is why I don't envy people that have to face or solve this issue. The iPod sort order even seems whacked for Japanese.

# Michael S. Kaplan on 3 Jan 2006 1:03 PM:

Phonemic is definitely what is preferred -- but once you get into Kanji it gets a lot harder to do (for the multiple readings per ideograph, mainly!).

# Nick Lamb on 3 Jan 2006 8:55 PM:

"But that's the problem with Japanese, 月 and 月 are the same character, but I typed gatsu (month) to get the first and tsuki (moon) to get the second."

When I say "lead" (IPA: /lɛd/ a noun) to the computer, the speech recognition system momentarily knows that I didn't mean "lead" (IPA: /liːd/ a verb) but this is instantly lost in the text-processing software to which it is connected. Subsequently there is no way to treat these two words separately, because from a text processing point of view they are indistinguishable.

# Rosyna on 4 Jan 2006 10:41 AM:

Yeah, phonetic is what is used here in Japan, even for Latin spellings. For example, if I wanted to find songs by the artist "angela" (which I'm trying to do, trying to find a place that sells her I/O CD with no luck) I'd look under the あ's (ア's, depending on the store). Which is really, really weird when looking for something like the new Neon Genesis Evangelion CDs which are under the え’s (エ’s) for "Evangelion". Go figure.

Nick Lamb, yeah, many text-to-speech implementations do the same. They look at the word usage for lead. However, this fails miserably for some words like row which can be pronounced two ways for the same usage.

# Michael S. Kaplan on 4 Jan 2006 10:47 AM:

This was actually the reasoning in that "IMEs? They have it easy.... " post I referenced -- they solve the multiple pronunciations issue by simply including all the pronunciations, but it is harder for collations since the item can only be sorted in one place....

# Rosyna on 4 Jan 2006 6:22 PM:

Ah yes. Those hosers do have it easy. *They* get context. Sorting has no context whatsoever (especially when sorting file names). It's bad enough in Japan that people don't know how to pronounce others' names without a hint (on official forms and many unofficial ones you have to write your name in kanji *and* the phonetic spelling of your name).

Reminds me of Mount Fuji, which can be pronounced different ways depending on the person saying it.

# Michael S. Kaplan on 5 Jan 2006 4:41 PM:

Is that a dialect issue? Or do different kinds of people address mountains differently? :-)

# Paul A Houle on 3 May 2010 1:52 PM:

The solution is elementary, dear Watson.

If you want to sort a list of named entities in Japanese, the right thing to do is convert to furigana or romaji, then order the way you'd do it in English.

In many Japanese IS systems, there are separate text fields for the "normal" vs "furigana" writings, and sort is done on the furigana.

Now you might say it's a real PITA to always make people write text twice, and I say, no problem; we're in the age of large-scale knowledge based systems, and it's just a problem in statistical language translation. It's much simpler than most cases, because there's little or no "semantic" or "syntactic" gap between kanji and kana writings. (The meaning is exactly the same, and the order of the symbols is preserved.)

If I had a sufficient corpus of Japanese text in both kanji and kana forms I'm sure I could make something that reads kanji phonetically better than I do in a week or so.
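
The separate reading field approach described in the comment above can be sketched as follows. The names and readings are made-up sample records, and sorting hiragana by code point only approximates gojūon order (small and voiced kana have their own code points), so treat this as an illustration rather than a proper collation.

```python
# (display form, hiragana reading) pairs; hypothetical sample records.
people = [
    ("山田", "やまだ"),
    ("鈴木", "すずき"),
    ("青木", "あおき"),
    ("田中", "たなか"),
]

# Sort on the reading field instead of on the kanji themselves.
by_reading = sorted(people, key=lambda record: record[1])
sorted_names = [name for name, _reading in by_reading]
print(sorted_names)
```

The kanji never take part in the comparison; only the stored reading decides where each record lands.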


referenced by

2007/09/24 A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

2007/09/21 A&P of Sort Keys, part 10 (aka I've kana wanted to start talking about Japanese)
