by Michael S. Kaplan, published on 2005/06/30 10:50 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/06/30/434223.aspx
Recently someone asked a question via the Contact link:
Hi, I was wondering if you know of a way to get the stroke count of a Chinese character? This function seems to be in Word (Chinese index), and also in .NET (string sort according to stroke and bopomofo).
But I can't seem to find any API functions exposed to retrieve the stroke count or bopomofo?
Of course they could have asked in the Suggestion Box and suggested it as a topic, but I suspect they were not considering that at the time. But it is a question I have been itching to answer anyway, so I'll take this as a bit of serendipity. :-)
The quick answer:
You can't.
The fast-paced answer that is a bit less of a sprint:
Although the collation tables for the various CJK languages are based on source data that has the various pronunciations and stroke counts, the source data is gone after the weights are assigned.
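To make that concrete, here is a toy sketch in Python. It is not how the Windows collation tables are actually built; the assign_weights function and the tie-break by code point are assumptions made up for illustration. It only shows why ordinal weights derived from stroke counts cannot be turned back into the stroke counts themselves.

```python
# Toy model only: assign_weights and its tie-break rule are hypothetical,
# not the real Windows table-building process.

def assign_weights(stroke_counts):
    """Map each character to an ordinal weight, ordered by stroke count,
    then by code point to break ties."""
    ordered = sorted(stroke_counts, key=lambda ch: (stroke_counts[ch], ord(ch)))
    return {ch: weight for weight, ch in enumerate(ordered, start=1)}

# Source data with real stroke counts for five simple characters.
source = {"一": 1, "二": 2, "三": 3, "口": 3, "四": 5}

# Deliberately different (wrong) counts that happen to preserve the same order.
other = {"一": 1, "二": 2, "三": 4, "口": 4, "四": 6}

weights = assign_weights(source)

# Both inputs collapse to identical weights, so the weights alone
# cannot tell you which set of stroke counts produced them.
same = weights == assign_weights(other)
```

The weights still sort the characters correctly, which is all a collation API needs. What they no longer carry is the number that produced the order.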
The long answer:
I'll tell you a secret -- that original source data is not just gone from the product, it is actually gone from this plane of existence, as far as I have been able to gather. No one seems to know where it is anymore, and the person who did a lot of the work is also gone. Access to the source code would only get you those weights. And the collation APIs work quite well at using that data in every single way except getting the original source back....
Maybe it will show up somewhere; people find stuff all the time. And obviously it can be reconstructed by working from the same original standards -- it is based on a known entity.
There is one exception to this: Korean, where Hanja sorting is based on the Hangul pronunciation. In the case of Korean the pronunciation is therefore fairly intrinsic and easy to get to, since the Hangul is right there. The source is built in and fairly easy to reconstruct using the information I mention in my explanation of why NORM_IGNORENONSPACE makes Korean text sort in apparently random order. In this one case, the data is right there.
The same is not true of (for example) Bopomofo, since the actual Bopomofo script is not sorted in with the Han ideographs; interleaving them would not meet user expectations. Kind of a shame in a way; it would be cool to have the Bopomofo interlaced as one might do in a Bopomofo dictionary or address book. But it's not what the users expect, which does really rule our plans in such cases. And it is not true of Pinyin, which is of course Latin letters (if they started sorting in the middle of Han text the natural order of the universe would seem fragmented!). It is kind of true for stroke-based counts since obviously one can just count the strokes and work backwards, but that is hardly a function in the API....
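For anyone who needs stroke counts today, one possible workaround is to skip the collation tables entirely and read a separate public data source. The sketch below assumes the Unicode Consortium's Unihan data files, which use tab-separated lines of code point, field name, and value, and which include a kTotalStrokes field. This is not the source data the Windows tables were built from, and the load_field helper is a hypothetical name made up here. A few sample lines are embedded so the sketch runs on its own.

```python
import io

# Sample lines in the Unihan tab-separated layout (code point, field, value).
# In practice these would come from the downloaded Unihan text files.
SAMPLE = """\
# comment lines start with '#'
U+4E00\tkTotalStrokes\t1
U+4E8C\tkTotalStrokes\t2
U+56DB\tkTotalStrokes\t5
U+4E00\tkMandarin\tyī
"""

def load_field(lines, field):
    """Return {character: raw value} for one field (hypothetical helper)."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, name, value = line.split("\t", 2)
        if name == field:
            # "U+4E00" -> the character with that code point
            table[chr(int(code[2:], 16))] = value
    return table

# A value may list more than one count separated by spaces; take the first.
strokes = {
    ch: int(value.split()[0])
    for ch, value in load_field(io.StringIO(SAMPLE), "kTotalStrokes").items()
}

readings = load_field(io.StringIO(SAMPLE), "kMandarin")
```

The counts in such a file are not guaranteed to match the ordering Windows uses for stroke sorts, so treat this as a lookup table of its own rather than a way to recover the collation source.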
Now I actually have been doing a step better than reconstructing that data -- I have been working to try to expand it, for future versions. And of course to make sure it is kept, since it is the secular equivalent of the 'Collation Holy Scripture'. There are currently no plans to expose the source data directly, though it may be something worth thinking about at some point. I'll consider this request as another data point in that decision, when/if something is decided. :-)
This post brought to you by "ㄅ" (U+3105, a.k.a. BOPOMOFO LETTER B)