Knock knock! Who's there? Kana! Kana Who?

by Michael S. Kaplan, published on 2005/06/01 02:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/06/01/423711.aspx


You Kana wonder how we order Japanese strings? :-)

Some time yesterday, one of the testers on the Shell team was curious about how collation works for the Japanese alphabet. The discussion was an interesting one, so I thought I would post a summary of all the information we talked about (with some examples for each interesting distinction) here.

Note that this behavior relates to what is done on Windows (as well as SQL Server, Office, Windows CE, Active Directory, and every Microsoft product that either calls our APIs or uses our data). Your mileage on other platforms may certainly vary!

Ok, on Windows, the Japanese Kana all sort in an implementation of the GoJuOn order, with the following principles:

When you combine all these rules together, the order you get for the vowels would be:

ｧァｱアぁあ㋐ｨィｲイぃい㋑ｩゥｳウぅうヴゔ㋒ｪェｴエぇえ㋓ｫォｵオぉお㋔

And then the other important things to note (changes in red 1 June 2005 7:50am):

ｧァｱアぁあ㋐ｨィｲイぃい㋑ｩゥｳウぅうヴゔ㋒ｪェｴエぇえ㋓ｫォｵオぉお㋔

In other words, everything on the same line below can be made to seem equal; everything on a different line cannot.

HALFWIDTH KATAKANA LETTER SMALL A; KATAKANA LETTER SMALL A; HALFWIDTH KATAKANA LETTER A; KATAKANA LETTER A; HIRAGANA LETTER SMALL A; HIRAGANA LETTER A; CIRCLED KATAKANA A
HALFWIDTH KATAKANA LETTER SMALL I; KATAKANA LETTER SMALL I; HALFWIDTH KATAKANA LETTER I; KATAKANA LETTER I; HIRAGANA LETTER SMALL I; HIRAGANA LETTER I; CIRCLED KATAKANA I
HALFWIDTH KATAKANA LETTER SMALL U; KATAKANA LETTER SMALL U; HALFWIDTH KATAKANA LETTER U; KATAKANA LETTER U; HIRAGANA LETTER SMALL U; HIRAGANA LETTER U; KATAKANA LETTER VU; HIRAGANA LETTER VU; CIRCLED KATAKANA U
HALFWIDTH KATAKANA LETTER SMALL E; KATAKANA LETTER SMALL E; HALFWIDTH KATAKANA LETTER E; KATAKANA LETTER E; HIRAGANA LETTER SMALL E; HIRAGANA LETTER E; CIRCLED KATAKANA E
HALFWIDTH KATAKANA LETTER SMALL O; KATAKANA LETTER SMALL O; HALFWIDTH KATAKANA LETTER O; KATAKANA LETTER O; HIRAGANA LETTER SMALL O; HIRAGANA LETTER O; CIRCLED KATAKANA O
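If you want to check the characters against their names, here is a minimal Python sketch that rebuilds the vowel string from the Unicode character names listed above. It only uses the standard `unicodedata` module; the helper name `vowel_row` is made up for illustration, and the sketch does not call any Windows collation API, it just reproduces the listing.

```python
import unicodedata

def vowel_row(vowel):
    """Return the characters for one vowel, in the order listed above."""
    names = [
        f"HALFWIDTH KATAKANA LETTER SMALL {vowel}",
        f"KATAKANA LETTER SMALL {vowel}",
        f"HALFWIDTH KATAKANA LETTER {vowel}",
        f"KATAKANA LETTER {vowel}",
        f"HIRAGANA LETTER SMALL {vowel}",
        f"HIRAGANA LETTER {vowel}",
    ]
    if vowel == "U":
        # The VU letters appear in the U row, just before the circled form.
        names += ["KATAKANA LETTER VU", "HIRAGANA LETTER VU"]
    names.append(f"CIRCLED KATAKANA {vowel}")
    return "".join(unicodedata.lookup(name) for name in names)

# Concatenate the five rows in a-i-u-e-o order.
order = "".join(vowel_row(v) for v in "AIUEO")

for ch in order:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Printing the code points makes the halfwidth and fullwidth forms easy to tell apart, which is handy when a font renders them almost identically.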

How do the rules for the flags affect all this? Well....

Now obviously Windows file names are "case insensitive", but we do not consider the "small" Kana and the "regular" Kana to be a case pair (no one does, usually including native speakers). So you can have both of them in file names in the same directory, but you cannot use both in otherwise identical names in (for example) an Active Directory installation. In fact, since all four flags are passed for AD, you cannot use any of the letters within the colored groups together in the same AD namespace.

Ignoring something with these flags in this context means "treat them all as equal" -- which means you will have a non-deterministic ordering any time you have a big list with many of these variants comparing as equal. In my opinion, a deterministic order is always better, and not just because I try to be an orderly guy. :-)

But your mileage may vary, of course!
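To see why "treat them all as equal" leaves the order up in the air, here is a rough Python sketch. It is not the Windows implementation: the hypothetical `fold_kana` helper stands in for ignoring width and kana type by applying NFKC normalization and then shifting katakana onto hiragana. Because Python's sort is stable, the result of the "ignoring" sort simply mirrors the input order; adding the original string as a tie-breaker pins the order down (by code point here, which is not the order Windows uses).

```python
import unicodedata

def fold_kana(text):
    """Approximate 'ignore width and kana type' (an assumption, not the Windows algorithm)."""
    # NFKC folds halfwidth (and circled) katakana to their fullwidth letters.
    folded = unicodedata.normalize("NFKC", text)
    # Katakana U+30A1..U+30F6 sit exactly 0x60 above the matching hiragana.
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in folded
    )

# Halfwidth katakana A, fullwidth katakana A, hiragana A, in two input orders.
input_a = ["\uff71", "\u30a2", "\u3042"]
input_b = ["\u3042", "\u30a2", "\uff71"]

# All three compare as equal, so the output order depends on the input order.
loose_a = sorted(input_a, key=fold_kana)
loose_b = sorted(input_b, key=fold_kana)

# A tie-breaker on the original string gives one order for both inputs.
strict_a = sorted(input_a, key=lambda s: (fold_kana(s), s))
strict_b = sorted(input_b, key=lambda s: (fold_kana(s), s))

print(loose_a == loose_b)    # the two "ignoring" sorts disagree
print(strict_a == strict_b)  # the tie-broken sorts agree
```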

Now the Kanji are not sorted in pronunciation order, because as I mentioned back in December of last year, there is no pronunciation-based sort for Japanese on Windows. But if you have entered the pronunciation information and are sorting by it (the way that, for example, an address book might choose to do) then this order will be respected. Note that name readings (nanori'yomi) are sometimes (perhaps often) entirely individual and do not match any of the kun'yomi or on'yomi with which a given ideograph may be commonly associated. So such a feature makes a lot of sense if you know how all the names are pronounced; if not (for example, in a large company address book) you may want an alternate way to search for names that you may know only by characters and not by pronunciation.
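A small sketch of the address book idea, in Python: each entry carries a separately entered reading, and the sort key is the reading rather than the Kanji. The `Entry` type and the sample names are made up for illustration, and plain code point comparison of hiragana is only a stand-in for a real Kana collation (it happens to follow the GoJuOn order for these simple readings).

```python
from typing import NamedTuple

class Entry(NamedTuple):
    name: str     # how the name is written (Kanji)
    reading: str  # pronunciation entered by hand, in hiragana

# Hypothetical sample data with common readings.
entries = [
    Entry("田中", "たなか"),
    Entry("鈴木", "すずき"),
    Entry("佐藤", "さとう"),
    Entry("山田", "やまだ"),
]

# Sort on the reading; the Kanji themselves never take part in the comparison.
by_reading = sorted(entries, key=lambda e: e.reading)

for entry in by_reading:
    print(entry.reading, entry.name)
```

A lookup by written form would need its own index on `name`, since the reading-based order says nothing about where a given Kanji string lands.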

 

This post brought to you by "ヰ" (U+30f0, a.k.a. KATAKANA LETTER WI)


# Philip Newton on 1 Jun 2005 7:18 AM:

> we do not consider the "small" Kana and the
> "regular" Kana to be case pair (no one does,
> usually including native speakers)

That makes sense to me since, for example, shuu "state, province" and shiyuu "private ownership" are phonemically different but differ in (kana) writing "only" through big-vs-little kana: しゅう vs しゆう.

# Michael S. Kaplan on 1 Jun 2005 7:39 AM:

Robert --- you are correct, I did mess that one up. I think I will post the correction to that one....

# Nicholas Allen on 1 Jun 2005 9:52 AM:

Using the kanji kurikaeshi (々) in hiragana looks weird. There are separate repetition symbols for use in hiragana and katakana (in voiced and unvoiced variants) as well as a double kana repeat (again voiced+unvoiced).

# Michael S. Kaplan on 1 Jun 2005 11:26 AM:

Good point, Nicholas, although Windows still respects the placement and does its best to follow the apparent intent....

So, good technically even if not so good linguistically. :-)

# Eusebio Rufian-Zilbermann on 1 Jun 2005 12:37 PM:

The prolonged sound mark is equivalent to actually adding a lowest-weight variation of the corresponding vowel (usually ア after -a sounds, イ after -e or -i sounds, and ウ after -o or -u sounds). For example, the word sensei could be written in Katakana as センセイ or as センセー. This replacement is why ぎー sorts before ぎぎ. The "trick" of duplicating and then subtracting a bit works in most cases, but it doesn't work when you're prolonging a vowel sound itself: if you sort アア and アー, the prolonged sound mark sorts after the corresponding letter and not before. Try them with a phonetic sort in Winword ':)
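The replacement described here can be sketched in Python, under some loud assumptions: the vowel mapping is the one given in this comment, the vowel is read off the last letter of the preceding character's Unicode name, and the "lowest-weight variation" part is left out entirely, so アー and アア come out identical instead of slightly different. The function name `expand_prolonged` is made up, and this is not how Windows does it.

```python
import unicodedata

PROLONGED = "\u30fc"  # KATAKANA-HIRAGANA PROLONGED SOUND MARK

# Vowel kana substituted for the mark, following the mapping in the comment:
# -a gives a, -i and -e give i, -u and -o give u.
KATAKANA_VOWEL = {"A": "ア", "I": "イ", "E": "イ", "U": "ウ", "O": "ウ"}
HIRAGANA_VOWEL = {"A": "あ", "I": "い", "E": "い", "U": "う", "O": "う"}

def expand_prolonged(text):
    """Replace each prolonged sound mark with a vowel chosen from the previous kana."""
    out = []
    for ch in text:
        if ch == PROLONGED and out:
            # Kana names end in their vowel, e.g. "HIRAGANA LETTER GI" ends in "I".
            name = unicodedata.name(out[-1], "")
            table = HIRAGANA_VOWEL if name.startswith("HIRAGANA") else KATAKANA_VOWEL
            vowel = table.get(name[-1:]) if "LETTER" in name else None
            # Leave the mark alone when no vowel applies (e.g. after the letter N).
            out.append(vowel if vowel else ch)
        else:
            out.append(ch)
    return "".join(out)

print(expand_prolonged("センセー"))
print(expand_prolonged("ぎー") < "ぎぎ")
```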

# Michael S. Kaplan on 1 Jun 2005 2:30 PM:

The implementation is not completely perfect in that sense -- we always show a slight difference between the two cases (but one that you can easily choose to ignore if you want to).

referenced by

2010/02/17 Knock knock! Who's there? Kana! Kana Who? I Kana got something wrong!

2007/09/21 A&P of Sort Keys, part 10 (aka I've kana wanted to start talking about Japanese)

2007/04/01 I Kana understand you, could you repeater that? (Part 2)

2007/03/29 I Kana understand you, could you repeater that? (Part 1)

2006/09/19 Put in on my Tab, please

2006/01/03 'Acceptable' Japanese sort order?

2005/07/20 More on sort elements
