Knock knock! Who's there? Kana! Kana Who? I Kana got something wrong!

by Michael S. Kaplan, published on 2010/02/17 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/02/17/9964066.aspx


Sometimes I look back at prior blogs from this Blog and am really quite happy with what they say.

Other times, I am less impressed.

You may have the same feeling if you are a regular reader. :-)

And still other times I feel great about one at the time, but later have reason not to feel as great.

Today I am going to talk about one of the times from that third category.

The blog? From over four and a half years ago (Knock knock! Who's there? Kana! Kana Who?).

This blog involved a huge discussion with a lot of different people about the terminological, technical, linguistic, and collationary features of Kana in Japanese, and was finally reviewed by people both here and in Japan.

Unfortunately, it was kinda wrong when it described the exact weight differences between the various Kana.

The question came from Miles, who took on the role of the child pointing out that the emperor was wearing no clothes:

We have some bugs with respect to sorting and matching katakana small letters. I’ve been trying to figure out what the desired sort/match semantics should be, but that is not trivial.

  1. according to http://blogs.msdn.com/michkap/archive/2005/06/01/423711.aspx katakana small letters should compare equal to their non small equivalents when NORM_IGNORECASE is used.
  2. but calling CompareStringEx with various flags I get the following

NORM_IGNORECASE:                        U+30e3  <       U+30e4
LINGUISTIC_IGNORECASE:               U+30e3  <       U+30e4
NORM_LINGUISTIC_CASING:         U+30e3  <       U+30e4
NORM_IGNORENONSPACE:            U+30e3  =       U+30e4
LINGUISTIC_IGNOREDIACRITIC:     U+30e3  <       U+30e4

So, at least on Win2k8, it looks like Katakana Small letters are diacritic variants of their respective big equivalents.

Which of these two behaviors should SQL try to imitate ?

Thanks,
Miles

Oops.

That was my first thought.

Now the middle test of the five he tried was not relevant; NORM_LINGUISTIC_CASING has no impact whatsoever on Kana (it fixes the Turkic problem described here).
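For anyone who wants to repeat Miles's experiment, here is a rough sketch in Python using ctypes. The flag values are the ones from winnls.h, and CompareStringEx is the real kernel32 export, but the helper names (compare, run_all) are made up for illustration. Miles's mail does not say which locale he used, so passing None (LOCALE_NAME_USER_DEFAULT) is an assumption. The call only runs on Windows.

```python
import ctypes
import sys

# Flag values as defined in the Windows SDK header winnls.h.
FLAGS = {
    "NORM_IGNORECASE": 0x00000001,
    "LINGUISTIC_IGNORECASE": 0x00000010,
    "NORM_LINGUISTIC_CASING": 0x08000000,
    "NORM_IGNORENONSPACE": 0x00000002,
    "LINGUISTIC_IGNOREDIACRITIC": 0x00000020,
}

# CompareStringEx returns CSTR_LESS_THAN (1), CSTR_EQUAL (2) or
# CSTR_GREATER_THAN (3); a return value of 0 means the call failed.
SYMBOL = {1: "<", 2: "=", 3: ">"}


def compare(a, b, flags):
    """Hypothetical helper: compare two strings with CompareStringEx."""
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    fn = kernel32.CompareStringEx
    fn.argtypes = [
        ctypes.c_wchar_p,  # lpLocaleName (None = LOCALE_NAME_USER_DEFAULT)
        ctypes.c_uint32,   # dwCmpFlags
        ctypes.c_wchar_p,  # lpString1
        ctypes.c_int,      # cchCount1 (-1 = null-terminated)
        ctypes.c_wchar_p,  # lpString2
        ctypes.c_int,      # cchCount2 (-1 = null-terminated)
        ctypes.c_void_p,   # lpVersionInformation
        ctypes.c_void_p,   # lpReserved
        ctypes.c_ssize_t,  # lParam
    ]
    fn.restype = ctypes.c_int
    result = fn(None, flags, a, -1, b, -1, None, None, 0)
    if result == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return SYMBOL[result]


def run_all(a="\u30e3", b="\u30e4"):
    """Compare KATAKANA LETTER SMALL YA with KATAKANA LETTER YA per flag."""
    return {name: compare(a, b, value) for name, value in FLAGS.items()}


if sys.platform == "win32":
    for name, symbol in run_all().items():
        print(f"{name}: U+30e3 {symbol} U+30e4")
```

The output depends on the Windows version and the locale in effect, so treat the results in Miles's mail as what he saw on Win2k8 rather than something this sketch guarantees.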

The other results made no sense though. Not if that blog of mine, the one everyone reviewed, the one I reviewed, the one no one has questioned in nearly five years, was right.

Better check this one out....

First let's look at the weights for the six Kana A's, plus the circled Katakana A:

U+ff67 HALFWIDTH KATAKANA LETTER SMALL A 22 02 01    01 01 c4 ff 02 c4 ff c4 ff 01 00
U+30a1 KATAKANA LETTER SMALL A           22 02 01    01 01 c4 ff 02 c4 ff ff    01 00
U+ff71 HALFWIDTH KATAKANA LETTER A       22 02 01    01 01 ff 02 c4 ff c4 ff    01 00
U+30a2 KATAKANA LETTER A                 22 02 01    01 01 ff 02 c4 ff ff       01 00
U+3041 HIRAGANA LETTER SMALL A           22 02 01    01 01 c4 ff 02 ff ff       01 00
U+3042 HIRAGANA LETTER A                 22 02 01    01 01 ff 02 ff ff          01 00
U+32d0 CIRCLED KATAKANA A                22 02 01 ee 01 01 ff 02 c4 ff ff       01 00

Hmmm.

Circled Katakana looks to be a diacritic (DW) difference, none of them look to be a case (CW) difference, and all of them duke it out in the special (SW) area.
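To make that reading of the dump easier to check, here is a small Python sketch that splits each sort key from the list above into its fields. It assumes the usual layout from the A&P of Sort Keys series: the 01 bytes separate the UW, DW, CW and SW fields, and a final 00 terminates the key. The function names are hypothetical, and the byte values are copied straight from the list.

```python
# Sort keys copied from the list above, keyed by code point.
SORT_KEYS = {
    "U+FF67": "22 02 01 01 01 c4 ff 02 c4 ff c4 ff 01 00",  # halfwidth katakana small a
    "U+30A1": "22 02 01 01 01 c4 ff 02 c4 ff ff 01 00",     # katakana small a
    "U+FF71": "22 02 01 01 01 ff 02 c4 ff c4 ff 01 00",     # halfwidth katakana a
    "U+30A2": "22 02 01 01 01 ff 02 c4 ff ff 01 00",        # katakana a
    "U+3041": "22 02 01 01 01 c4 ff 02 ff ff 01 00",        # hiragana small a
    "U+3042": "22 02 01 01 01 ff 02 ff ff 01 00",           # hiragana a
    "U+32D0": "22 02 01 ee 01 01 ff 02 c4 ff ff 01 00",     # circled katakana a
}

FIELDS = ("UW", "DW", "CW", "SW")


def split_sort_key(hex_key):
    """Split a sort key into UW/DW/CW/SW, assuming 01 separators and a 00 end."""
    parts = bytes.fromhex(hex_key).split(b"\x01")
    if len(parts) != 5 or parts[4] != b"\x00":
        raise ValueError(f"unexpected sort key layout: {hex_key}")
    return dict(zip(FIELDS, parts[:4]))


def differing_fields(code_point_a, code_point_b):
    """Return the names of the fields in which two listed characters differ."""
    key_a = split_sort_key(SORT_KEYS[code_point_a])
    key_b = split_sort_key(SORT_KEYS[code_point_b])
    return [name for name in FIELDS if key_a[name] != key_b[name]]


if __name__ == "__main__":
    for code_point, hex_key in SORT_KEYS.items():
        fields = split_sort_key(hex_key)
        print(code_point, {name: value.hex(" ") for name, value in fields.items()})
    print(differing_fields("U+30A1", "U+30A2"))  # small vs. regular katakana A
    print(differing_fields("U+32D0", "U+30A2"))  # circled vs. regular katakana A
```

Under those assumptions, the small and regular Katakana A differ only in the SW field, and the circled form differs from the regular one only in DW.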

Now of course NORM_IGNOREWIDTH will muck with some of the information in here if you pass that flag; it will make the first and second items on the list look identical, and also the third and fourth.

Of course these seem to behave differently than other fullwidth characters and halfwidth counterparts, like I described back in A&P of Sort Keys, part 7 (aka You're very thin now, but I can still recognize you).

But my hope that something had changed in Windows 7 was dashed; the behavior is the same, and it turns out that A&P of Sort Keys, part 7 also has some minor differences to explain in regard to Kana behaving differently in weights than other differently "widthed" characters do.

Plus perhaps A&P of Sort Keys, part 10 (aka I've Kana wanted to start talking about Japanese) was a bit too confident about that first blog....

The really interesting thing about both "bugs" in the blog, which I discovered during this deeper dive for the blog you are reading now, is that neither is borne out by the actual raw values in the weight tables; the difference happens in the way the raw weights are read in the "special case" of Kana.

My review, which mainly consisted of comparing the relative sorting of the characters and the raw weights, was complete enough to give me confidence but not complete enough for me to have deserved said confidence....


jepoykun on 17 Feb 2010 7:46 AM:

Nice

