Query collation source data?

by Michael S. Kaplan, published on 2005/06/30 10:50 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/06/30/434223.aspx


Recently someone asked a question via the Contact link:

Hi, I was wondering if you know of a way to get the stroke count of a Chinese character? This function seems to be in Word (Chinese index), and also in .NET (string sorting according to stroke count and Bopomofo).

But I can't seem to find any API functions exposed to retrieve the stroke count or the Bopomofo?

Of course they could have asked in the Suggestion Box and suggested it as a topic, but I suspect they were not considering that at the time. But it is a question I have been itching to answer anyway, so I'll take this as a bit of serendipity. :-)

The quick answer:

You can't.

The fast-paced answer that is a bit less of a sprint:

Although the collation tables for the various CJK languages are based on source data that has the various pronunciations and stroke counts, the source data is gone after the weights are assigned.

The long answer:

I'll tell you a secret -- that original source data is not just gone from the product, it is actually gone from this plane of existence, as far as I have been able to gather. No one seems to know where it is anymore, and the person who did a lot of the work is also gone. Access to the source code would only get you those weights. And the collation APIs work quite well at using that data in every single way except getting the original source back....

Maybe it will show up somewhere someday; people find stuff all the time. And obviously it can be reconstructed by working from the same original standards -- it is based on a known entity.

There is one exception to this: Korean, where Hanja sorting is based on the Hangul pronunciation. In the case of Korean the pronunciation is fairly intrinsic and easy to get to, since the Hangul is right there. The source is built in and fairly easy to reconstruct using the information I mention in my explanation of why NORM_IGNORENONSPACE makes Korean text sort in apparently random order. In this one case, the data is right there.

The same is not true of (for example) Bopomofo, since the actual Bopomofo script is not sorted in with the Han ideographs; that would not meet user expectations. It is kind of a shame in a way -- it would be cool to have the Bopomofo interlaced as one might do in a Bopomofo dictionary or address book. But it's not what users expect, which really does rule out such plans. And it is not true of Pinyin, which is of course Latin letters (if they started sorting in the middle of Han text, the natural order of the universe would seem fragmented!). It is kind of true for stroke-based sorts, since obviously one can just count the strokes and work backwards, but that is hardly a function in the API....

Now I actually have been doing a step better than reconstructing that data -- I have been working to try to expand it, for future versions. And of course to make sure it is kept, since it is the secular equivalent of the 'Collation Holy Scripture'. There are no current plans to expose the source data directly, though it may be something worth thinking about at some point. I'll consider this request as another data point in that decision, when/if something is decided. :-)

 

This post brought to you by "ㄅ" (U+3105, a.k.a. BOPOMOFO LETTER B)


# J on 30 Jun 2005 1:04 PM:

Hi, maybe I'm missing something obvious as I'm no specialist, but: (assuming the maximum stroke count is 40) prepare 40 strings, each containing a character with N strokes (where N is the string #). Sort the target character against them. Its position should be the stroke count?

The strings can be precomputed, one can do binary search on them, and it would probably be necessary to store both the first and the last character of a certain stroke count to actually know what count the target character belongs to.

I hope this is not obviously dumb, I'm no expert :)
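J's scheme can be sketched as follows -- a toy model, not a working recipe: an invented weight table stands in for the real collation comparison, and the characters that begin each stroke bucket are simply assumed to be known in advance.

```python
import bisect

# Toy stand-in for the collation API: in reality you would compare characters
# with something like CompareString under a stroke-count sort locale. Here the
# sequential weights the post describes are faked with an invented table.
TOY_WEIGHTS = {ch: w for w, ch in enumerate("abcdefghij")}

def collate_key(ch):
    return TOY_WEIGHTS[ch]

# Precomputed references: the FIRST character of each "stroke count" bucket.
# In this toy model the buckets for counts 1..4 start at 'a', 'c', 'f', 'i'.
BUCKET_STARTS = ["a", "c", "f", "i"]

def stroke_count(ch):
    """Binary-search the bucket starts to see which bucket ch falls into."""
    keys = [collate_key(c) for c in BUCKET_STARTS]
    # The number of bucket starts sorting at or before ch is the bucket index.
    return bisect.bisect_right(keys, collate_key(ch))

print(stroke_count("d"))  # 'd' sorts inside the bucket that starts at 'c' -> 2
```

The binary search itself works mechanically; the catch is that the BUCKET_STARTS list is exactly the "break point" data discussed in the reply below.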

# Michael S. Kaplan on 30 Jun 2005 6:51 PM:

Well, you aren't missing anything. But it won't work, because the existing weights were generated by putting all the characters in the right order and then just giving them sequential numbers. So unless you know where the "break points" are between each stroke count or pronunciation, you cannot decide what bucket things fall into.

Now data like what those "break points" are might be cool to think about, but it would definitely be a lot of work to do....
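The point about sequential weights can be made concrete with a toy example (all data here is invented): two source tables that draw the stroke boundary in different places collapse to identical weights once the characters are put in order, so the boundary cannot be recovered from the weights alone.

```python
# Two invented source tables over the same three characters; they agree on
# the sort order but disagree on where the 1-stroke/2-stroke boundary falls.
table_a = {"x": 1, "y": 1, "z": 2}   # x and y have 1 stroke; z has 2
table_b = {"x": 1, "y": 2, "z": 2}   # only x has 1 stroke

def assign_weights(table):
    """Sort the characters and hand out sequential weights, discarding strokes."""
    return {ch: w for w, ch in enumerate(sorted(table), start=1)}

# Both tables yield {"x": 1, "y": 2, "z": 3}: the stroke data is gone.
print(assign_weights(table_a) == assign_weights(table_b))  # True
```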
