A&P of Sort Keys, part 10 (aka I've kana wanted to start talking about Japanese)

by Michael S. Kaplan, published on 2007/09/21 04:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/21/5027338.aspx


Previous posts in this series:

Today's post is going to be a first look at some of the Japanese support that is there in Windows....

Well, not a first look, since this post and the seven others it links to have talked about it already. :-)

In general, given the challenges faced in trying to handle a Kanji sort correctly, at present (and for the last decade!) only the Kana are handled (but anyone could add the kana for the pronunciation in a database and use that additional column for collation). This will have to do until a more intelligent type of pronunciation-based Kanji sort is given....
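To make the "additional column" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, the column names (`name`, `yomi`) and the three city rows are made up for illustration; they are not from any real schema. It also leans on SQLite's default binary ordering, which for plain hiragana only roughly follows the expected kana order, so treat it as a stand-in for a real kana-aware collation rather than a replacement for one.

```python
import sqlite3

# Hypothetical table: "name" holds the Kanji spelling, "yomi" holds the
# pronunciation written out in hiragana (entered by hand or by some tool).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, yomi TEXT)")
conn.executemany(
    "INSERT INTO cities (name, yomi) VALUES (?, ?)",
    [
        ("東京", "とうきょう"),  # Tokyo
        ("大阪", "おおさか"),    # Osaka
        ("京都", "きょうと"),    # Kyoto
    ],
)

# Sort on the reading column instead of the Kanji column.
# SQLite's default collation compares the UTF-8 bytes, which is only an
# approximation of a proper kana sort (an assumption of this sketch).
by_reading = [row[0] for row in conn.execute("SELECT name FROM cities ORDER BY yomi")]

# For contrast: ordering by the Kanji itself just follows code point order.
by_kanji = [row[0] for row in conn.execute("SELECT name FROM cities ORDER BY name")]
```

The point is only that the Kanji column is never asked to sort itself; the kana column carries the ordering.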

Also, kana sorts properly in all locales, which can come in handy.

So we take a nice word like ramen (a loan word from Chinese) and look at it in katakana, narrow katakana, and hiragana (using LCMapString/LCMapStringEx to do the various conversions, of course).
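If you want to play along without a Windows box handy, the three spellings can be approximated in portable Python. To be clear, this is not LCMapString/LCMapStringEx; it is a rough sketch that uses the fixed offset between the katakana and hiragana blocks and the Unicode compatibility mappings for the halfwidth forms. The function names are my own, and the real API covers more cases than this does.

```python
import unicodedata

def katakana_to_hiragana(text):
    """Shift katakana U+30A1..U+30F6 down by 0x60 into the hiragana block.

    Everything else (including the prolonged sound mark U+30FC) is left alone.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        out.append(chr(cp - 0x60) if 0x30A1 <= cp <= 0x30F6 else ch)
    return "".join(out)

# Reverse table built from the halfwidth block U+FF61..U+FF9F: each halfwidth
# character's NFKC form is its fullwidth counterpart, so invert that mapping.
_FULL_TO_HALF = {
    unicodedata.normalize("NFKC", chr(cp)): chr(cp) for cp in range(0xFF61, 0xFFA0)
}

def to_halfwidth_katakana(text):
    """Map fullwidth katakana (and a few CJK punctuation marks) to halfwidth.

    NFD first, so voiced kana split into base + combining mark; note that NFD
    also decomposes unrelated characters such as accented Latin letters.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(_FULL_TO_HALF.get(ch, ch) for ch in decomposed)

def to_fullwidth_katakana(text):
    """Fold halfwidth katakana back to fullwidth via NFKC.

    NFKC applies other compatibility foldings too, so this is broader than
    a pure width conversion.
    """
    return unicodedata.normalize("NFKC", text)

ramen = "ラーメン"
hiragana = katakana_to_hiragana(ramen)
narrow = to_halfwidth_katakana(ramen)
```

Here `hiragana` and `narrow` hold the other two spellings of the word, and `to_fullwidth_katakana(narrow)` round-trips back to the original.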

Note that the word would usually be spelled using katakana, so the other forms are just illustrative for us:

ラーメン   22 42 22 02 22 35 22 80 01 01 01 ff 03 05 02 c4 c4 c4 c4 ff ff 01 00
ﾗｰﾒﾝ      22 42 22 02 22 35 22 80 01 01 01 ff 03 05 02 c4 c4 c4 c4 ff c4 c4 c4 c4 ff 01 80 17 06 03 00
らーめん  22 42 22 02 22 35 22 80 01 01 01 ff 03 05 02 ff ff 01 80 17 06 03 00

Clearly all three of them will sort near each other with identical primary weights.
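For anyone who would rather check that than squint at hex, here is a small sketch that splits the three sort keys above into their fields. It assumes the usual layout of a sort key in this series: five weight fields (primary, diacritic, case, special, punctuation) separated by 01 bytes, with a 00 terminator. The field names and the helper function are mine, and the naive split relies on 01 never appearing inside a weight, which holds for these three keys.

```python
FIELD_NAMES = ("primary", "diacritic", "case", "special", "punctuation")

def split_sort_key(hex_text):
    """Split a hex dump of a sort key into its 01-separated weight fields."""
    data = bytes.fromhex(hex_text)  # whitespace between bytes is ignored
    if not data or data[-1] != 0x00:
        raise ValueError("sort key is expected to end with a 00 terminator")
    fields = data[:-1].split(b"\x01")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError("expected %d fields, got %d" % (len(FIELD_NAMES), len(fields)))
    return dict(zip(FIELD_NAMES, fields))

# The three sort keys exactly as listed above, in the same order.
KEYS = {
    "katakana": "22 42 22 02 22 35 22 80 01 01 01 ff 03 05 02 c4 c4 c4 c4 ff ff 01 00",
    "narrow katakana": "22 42 22 02 22 35 22 80 01 01 01 ff 03 05 02 "
                       "c4 c4 c4 c4 ff c4 c4 c4 c4 ff 01 80 17 06 03 00",
    "hiragana": "22 42 22 02 22 35 22 80 01 01 01 ff 03 05 02 ff ff 01 80 17 06 03 00",
}

parsed = {label: split_sort_key(key) for label, key in KEYS.items()}

# Collect the distinct values seen in each field across the three spellings.
distinct = {
    name: {fields[name] for fields in parsed.values()} for name in FIELD_NAMES
}

for name in FIELD_NAMES:
    print(name, [value.hex(" ") for value in sorted(distinct[name])])
```

A field with a single distinct value is shared by all three spellings; any field with more than one is where the keys start to disagree.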

But some interesting havoc is being wreaked here in both the special weights and punctuation weights areas, definitely worthy of some investigation and discussion....

The order being striven for is something I talked about a little bit in Knock knock! Who's there? Kana! Kana Who?, which you may have seen before. And indeed, if you look at the sort keys, this was accomplished for the simple example, though at first glance it does not look as rigorous as it maybe could.

Now I could cheat and give it to you by looking at the source and the data and explaining it, but I think we should do it the interesting hacking kind of way, don't you? :-)

 

This post brought to you by ⑽ (U+247d, a.k.a. PARENTHESIZED NUMBER TEN)



