by Michael S. Kaplan, published on 2006/01/03 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/03/508579.aspx
Jeremy White asked me:
Hi there! I've been searching the web to find acceptable Japanese sort orders. I really couldn't find one, and your blog was the closest I could come to dealing with the issue. You might well want to make this a post -- I'm sure I'm not the only one who's come across it?
What is an acceptable Japanese sort order?
Katakana and Hiragana sorted in AIUEO order, interleaved with Katakana first.
Then Kanji, sorted in stroke/radical order. Or some sort of phonetic order?
Then Romaji.
Any sort of double or half-width attribute stripped out before sorting.
Does the Unicode sort produce acceptable results? I'm sure they're not ideal, but I'm wondering if they are acceptable (and perhaps what they are).
Thank you
I do not know of any Japanese sort that does exactly what is being suggested here. I have had many posts that talk about sorting Japanese, such as the following:
There are obviously problems coming up with a sort order that is intuitive for the average Japanese user, and some of the difficulties are only hinted at by the above posts. But I can talk about a few of the specific issues that Jeremy raises to explain why there are problems with some of them....
I have talked previously about the problems with building a pronunciation-based sort, and you can look at several of the links above.
A radical/stroke based sort is an interesting one to consider for some contexts (like dictionaries), but it may not be what people commonly expect to see on computers.
I do not know of any sort that puts Romaji last, since the Latin script does not usually come after Han ideographs, and there is really no way to distinguish between English and Romaji at the script level.
It is likely best to ignore neither width nor Kana distinctions -- they will be sorted properly by Windows, and if you pass the ignore flag then the ordering of strings that differ only in those ways would be arbitrary rather than deterministic....
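To make the "arbitrary rather than deterministic" point concrete, here is a minimal Python sketch. It does not call Windows at all; the helpers fold_kana, ignoring_key and tiebreak_key are hypothetical names made up for illustration, and the code point tie-break is just one possible choice, not the order Windows uses.

```python
import unicodedata


def fold_kana(text: str) -> str:
    """Fold away width (via NFKC) and the hiragana/katakana distinction."""
    folded = unicodedata.normalize("NFKC", text)
    # Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana twins.
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in folded
    )


def ignoring_key(text: str) -> str:
    # Assumption: stands in for "pass the ignore flags"; distinctions vanish.
    return fold_kana(text)


def tiebreak_key(text: str):
    # Distinctions are kept as a last-resort tie-breaker (code point order here).
    return (fold_kana(text), text)


# Halfwidth katakana, hiragana, and fullwidth katakana spellings of "kana".
words = ["\uff76\uff85", "\u304b\u306a", "\u30ab\u30ca"]

# All three compare equal once the distinctions are ignored, so a stable
# sort simply hands back whatever order the input happened to be in.
print(sorted(words, key=ignoring_key))
print(sorted(reversed(words), key=ignoring_key))

# With the tie-breaker the result no longer depends on the input order.
print(sorted(words, key=tiebreak_key))
print(sorted(reversed(words), key=tiebreak_key))
```

The first two lines print the three strings in opposite orders, while the last two print the same order both times.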
Now the Unicode order will not be intuitive to anyone; it is arbitrary and has very little to do with the way humans would look at any language (including Japanese).
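A small Python sketch shows what plain code point order does to a handful of characters. The sample list is my own choice for illustration, and Python's default string comparison is used here only because it happens to compare by code point.

```python
# "a", hiragana A, katakana A, the kanji U+4E9C, and halfwidth katakana A.
items = ["\u4e9c", "\uff71", "a", "\u30a2", "\u3042"]

# Python compares str values code point by code point, so sorted() with no
# key gives the raw Unicode order.
by_code_point = sorted(items)

for ch in by_code_point:
    print(f"U+{ord(ch):04X} {ch}")
```

Latin comes first, then Hiragana, then Katakana, then the ideograph, and the halfwidth Katakana letter lands after all of them, far away from its fullwidth counterpart.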
Of course the best results may happen by just passing MAKELCID(MAKELANGID(LANG_JAPANESE, SUBLANG_JAPANESE_JAPAN), SORT_JAPANESE_XJIS) and going from there? This has the advantage of at least sorting consistently with the rest of the Japanese sorting that is happening on the platform....
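For anyone curious what that macro expression evaluates to, here is a rough Python sketch that mirrors the bit layout of the MAKELANGID and MAKELCID macros from the Windows headers. The function names make_lang_id, make_lcid and compare_japanese are hypothetical Python stand-ins, and the CompareStringW wrapper is only an assumption about how one might call it through ctypes on Windows.

```python
import sys

# Values as defined in the Windows SDK headers.
LANG_JAPANESE = 0x11
SUBLANG_JAPANESE_JAPAN = 0x01
SORT_JAPANESE_XJIS = 0x0


def make_lang_id(primary: int, sub: int) -> int:
    """Mirror of MAKELANGID: sublanguage in the bits above the low 10."""
    return (sub << 10) | primary


def make_lcid(lang_id: int, sort_id: int) -> int:
    """Mirror of MAKELCID: sort ID in the bits above the 16-bit LANGID."""
    return (sort_id << 16) | lang_id


JAPANESE_XJIS_LCID = make_lcid(
    make_lang_id(LANG_JAPANESE, SUBLANG_JAPANESE_JAPAN), SORT_JAPANESE_XJIS
)


def compare_japanese(a: str, b: str) -> int:
    """Return -1, 0 or 1 from CompareStringW (Windows only, illustrative)."""
    if sys.platform != "win32":
        raise OSError("CompareStringW is only available on Windows")
    import ctypes

    # Flags are 0, so no width or Kana distinctions are ignored.
    result = ctypes.windll.kernel32.CompareStringW(
        JAPANESE_XJIS_LCID, 0, a, -1, b, -1
    )
    if result == 0:
        raise OSError("CompareStringW failed")
    return result - 2  # CSTR_LESS_THAN=1, CSTR_EQUAL=2, CSTR_GREATER_THAN=3


print(hex(JAPANESE_XJIS_LCID))  # 0x411
```

On Windows, something like functools.cmp_to_key(compare_japanese) could then be handed to sorted() as its key.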
This post brought to you by "ｼ" (U+ff7c, a.k.a. HALFWIDTH KATAKANA LETTER SI)
# Paul A Houle on 3 May 2010 1:52 PM:
The solution is elementary, Dear Watson.
If you want to sort a list of named entities in Japanese, the right thing to do is convert to furigana or romaji, then order the way you'd do it in English.
In many Japanese IS systems, there are separate text fields for the "normal" vs "furigana" writings, and sort is done on the furigana.
Now you might say it's a real PITA to always make people write text twice, and I say, no problem; we're in the age of large-scale knowledge based systems, and it's just a problem in statistical language translation. It's much simpler than most cases, because there's little or no "semantic" or "syntactic" gap between kanji and kana writings. (The meaning is exactly the same, and the order of the symbols is preserved.)
If I had a sufficient corpus of Japanese text in both kanji and kana forms I'm sure I could make something that reads kanji phonetically better than I do in a week or so.
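The separate reading field approach described in this comment could be sketched roughly as follows. The Entry type, its field names, and the sample names are all made up for illustration, and comparing hiragana readings by code point is only an approximation of a real kana collation (it does not treat voiced or small kana the way a Japanese sort would).

```python
from typing import NamedTuple


class Entry(NamedTuple):
    display: str  # what the user sees, usually containing kanji
    reading: str  # furigana typed in alongside it, here in hiragana


entries = [
    Entry("\u5c71\u7530", "\u3084\u307e\u3060"),  # Yamada
    Entry("\u9234\u6728", "\u3059\u305a\u304d"),  # Suzuki
    Entry("\u9752\u6728", "\u3042\u304a\u304d"),  # Aoki
    Entry("\u4f50\u85e4", "\u3055\u3068\u3046"),  # Sato
]

# Sort on the reading only; the display form never takes part in the compare.
by_reading = sorted(entries, key=lambda entry: entry.reading)

for entry in by_reading:
    print(entry.display, entry.reading)
```

This lists Aoki, Sato, Suzuki, Yamada, which is the order their readings take in the kana syllabary.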