How bad does it need to be in order to be not good enough, anyway?

by Michael S. Kaplan, published on 2007/11/22 00:01 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2007/11/22/6462768.aspx


It has been at least a good 29 months since I posted "Is it Macau or is it Macao?", which (among other things) pointed out how a primarily Traditional Chinese locale was basically using the Simplified Chinese sorting data provided by standardization bodies in China.

And I pointed out how that was not perfect, but people seemed to think it was "good enough" or at least they were not complaining loudly enough to have anyone want to look into the matter....

Since that time I have been asked by several different people if I could quantify the difference.

The engineer in me agrees with the sentiment -- it feels unsatisfying to pick the wrong answer for the mere reason that it will be "as right as it can be without doing actual work" since that feels too much like intentionally making a bad landing.

And even the non-engineer reads about it and the question occurs to them -- how close is the answer?

How bad is it able to be and still fit within that "good enough" label?

So, looking at the total stroke data provided by standards bodies in China for all 70,195 ideographs in Unicode 5.0 and comparing it with the 54,195 ideographs for which stroke count data was provided by Taiwan standards bodies, how different are those 54,194 ideographs?

Well, 9,768 (or 18%) of them have different stroke counts between the two standards.

Here are the summary totals for the amount of difference:

Stroke Count Difference    Ideograph Count
1                          9,045
2                          675
3                          44
4                          2
5                          1
6                          1
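
A minimal sketch of how a tally like this could be produced, assuming each source has been loaded into a dictionary mapping code point to total stroke count. The names `taiwan`, `china`, and `diff_histogram` are hypothetical, and the only per-character data used is the four extreme cases quoted further down in this post, not the full data sets. The last lines simply re-add the summary totals above to check the 9,768 and 18% figures.

```python
from collections import Counter

def diff_histogram(taiwan, china):
    """Tally absolute stroke-count differences for code points in both mappings."""
    common = taiwan.keys() & china.keys()
    hist = Counter(abs(taiwan[cp] - china[cp]) for cp in common)
    hist.pop(0, None)  # keep only code points whose counts differ
    return dict(sorted(hist.items()))

# Illustrative data only: the four extreme cases quoted in this post.
taiwan = {0x272F0: 19, 0x28F71: 24, 0x27055: 19, 0x25F22: 20}
china = {0x272F0: 13, 0x28F71: 29, 0x27055: 15, 0x25F22: 16}

print(diff_histogram(taiwan, china))  # {4: 2, 5: 1, 6: 1}

# Cross-check of the summary table: the rows should add up to 9,768,
# which is about 18% of the 54,194 ideographs compared.
summary = {1: 9045, 2: 675, 3: 44, 4: 2, 5: 1, 6: 1}
differing = sum(summary.values())
percent = round(100 * differing / 54194)
print(differing, percent)  # 9768 18
```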

Scary numbers, huh?

Well, it scared me a bit. And made me really wonder how we decide what looks like it is close enough....

It might be interesting to run the numbers and see how much difference it would make in the actual order of characters if the Macao data were sorted using the Traditional Chinese numbers. Perhaps that would be an interesting topic for another day....
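
A rough sketch of what such a comparison might look like, assuming a deliberately simplified sort key of (total stroke count, code point). Real collation tables are more involved than this, and the helper names `stroke_order` and `moved` are hypothetical. The data is again limited to the four characters quoted in this post.

```python
def stroke_order(counts):
    """Order code points by total stroke count, breaking ties by code point value."""
    return sorted(counts, key=lambda cp: (counts[cp], cp))

def moved(order_a, order_b):
    """Code points whose position differs between two orderings of the same set."""
    pos_b = {cp: i for i, cp in enumerate(order_b)}
    return [cp for i, cp in enumerate(order_a) if pos_b[cp] != i]

# Illustrative data only: the four extreme cases quoted in this post.
taiwan = {0x272F0: 19, 0x28F71: 24, 0x27055: 19, 0x25F22: 20}
china = {0x272F0: 13, 0x28F71: 29, 0x27055: 15, 0x25F22: 16}

tw_order = stroke_order(taiwan)
cn_order = stroke_order(china)
changed = moved(tw_order, cn_order)

print([f"U+{cp:04X}" for cp in tw_order])
print([f"U+{cp:04X}" for cp in cn_order])
print([f"U+{cp:04X}" for cp in changed])  # ['U+27055', 'U+272F0']
```

Even in this tiny sample two of the four characters swap places, because the Taiwan data ties them at 19 strokes while the China data separates them.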

For now, here are those numbers from the very bottom of the list, for your enjoyment:

Biggest unmatched stroke counts

Unicode code point    Taiwan stroke count    China stroke count
U+272F0               19                     13
U+28F71               24                     29
U+27055               19                     15
U+25F22               20                     16

Makes you wonder how well the data represents the actual counts, let me tell you....

Well, here they are, with the PMingLiU-ExtB provided ideograph on the left and the SimSun-ExtB provided glyph on the right:

And everyone who is curious can now look at these very extreme cases in each direction and decide what they think is the source of the difference -- linguistic preference, orthographic choice, typographic tradition, creeping errors in standards, or whatever.

Anyone in country want to take a crack at this one?

And maybe, using the 18% figure, people in Maca[o|u] can decide how bad things have to be before they decide things are not good enough. :-)

 

This post brought to you by 𧋰, 𨽱, 𧁕, and 𥼢 (U+272f0, U+28f71, U+27055, and U+25f22, four Extension B CJK ideographs)


# Jeroen Ruigrok van der Werven on Thursday, November 22, 2007 3:51 AM:

Since I am geared towards Japanese I would count the strokes as follows:

U+272F0 13 (even if I would count the normal hook -| as 2 I still only reach 16, I wonder how Taiwanese make that 19)

U+28F71 30 (although I'd say 29 for the one on the left, since the top mountain-like kanji has the center stroke pulled through down to the left, whereas the one on the right clearly has it disconnected, the left one's two bottom mountain-like kanji also seem disconnected but may in fact be intended to be drawn through as well, as such the glyph's design is not representing the strokes accurately in my opinion)

U+27055 19

U+25F22 16

I am a bit amazed to be honest...

# Michael S. Kaplan on Thursday, November 22, 2007 4:08 AM:

Yes, it may be slightly unfair to compare the glyphs in the two fonts given that the count could very possibly come from a very different reference glyph -- though it seems like the best I could possibly do here under the circumstances....

None of the four characters are in JIS X 0213, though it would have been interesting to get their official counts if they were. :-)

# Andrew West on Thursday, November 22, 2007 6:44 AM:

>> U+272F0 19 13

There is something wrong with this one, as however you count, it can only be 13. Maybe 19 is for a variant form of the character with two insect radicals at the bottom (extra six strokes) instead of one, but that would be a non-unifiable variant.

>> U+28F71 24 29

I make this 26 or 29, depending on whether the middle stroke of the 山-like component continues through or breaks. I can't get 24 however I count.

>> U+27055 19 15

These counts are correct based on different orthographic conventions for the components that may be broken or joined horizontally, such as the grass radical at the top and the two hands at the bottom (PRC orthographic practice is to join the strokes). Probably the majority of off by one differences are due to the different ways of writing the grass radical (joined as 3 strokes, or broken as 4 strokes).

>> U+25F22 20 16

I can get both these counts, and any number of in-between counts depending on how you interpret the middle component as being written, how you write the grass radical and whether there is one vertical stroke skewering the whole character or whether it is broken into several vertical strokes. This is a good example of why ideographic stroke counting is an art not a science.

The problem with designing localised fonts for CJK-B is that ISO/IEC 10646 does not provide a multi-column chart showing the different source glyphs for each character, as it does for CJK, CJK-A and CJK-C, so there is nowhere convenient to look to in order to resolve or explain these differences.

However, there is an IRG document that does provide the source glyphs for all of CJK-B, but it is so huge that it is not available online, and I have not yet seen it ... though I'm hoping to get hold of it soon, and if I do I'll check to see what the PRC and Taiwan source glyphs for these characters look like.

# Michael S. Kaplan on Thursday, November 22, 2007 9:02 AM:

For most of the ones that are off by just a single stroke it is much easier to see what is going on -- I checked the data on the ones that seem more questionable (I am never above assuming a bug on my part!) but did not find one.

Though you are right, it is likely that working from their source would make things easier.... :-)

# Cheong on Thursday, November 22, 2007 10:45 PM:

I live in Hong Kong, and the stroke count for the 4 characters would be 13, 29, 16 and 18 respectively for me.

And I don't actually use the glyphs to count, because it's generally known that the glyphs displayed are not the same as how we'd actually write the characters. (For example, I think we'd write the middle stroke in the last character in 2 parts, so that middle stroke gets a count of 2 instead of 1. On the other hand, the lower horizontal stroke of the third character is displayed as 2 strokes in the glyph, but when we write it it'd become a single horizontal stroke, so the stroke gets a count of 1 instead of 2.)

# Cheong on Thursday, November 22, 2007 10:56 PM:

When I was in secondary school, we heavily used a reference book named 「同音字彙」. It was primarily created for looking up words that have the same Cantonese pinyin (i.e. are pronounced the same), but the stroke count reference in the book is also accurate (by Hong Kong people's definition, at least). :)

I have no idea whether the book is still in production, though.

# Bruce Rusk on Friday, November 23, 2007 2:28 PM:

I suspect that some of this discrepancy arises from two causes.

As other comments have pointed out (e.g. Andrew West's), there are systematic differences in the ways strokes are counted because of typographic conventions. For example, the radical cao 艹 is counted as four strokes (two crosses) in Taiwan and three (one horizontal line with two vertical lines crossing it) in the PRC. That difference alone could account for thousands of the one-stroke differences.

Other parts of characters are written differently in the two regions; the element 并 (in U+27055) is written with its lower portion either as 开 or in two halves, as in the display above. This means it could count as either 6 or 8 strokes.

The second reason is that the PRC stroke count data may have been produced in an automated way that counted some elements automatically, even if they were written in a way that takes a non-standard number of strokes. Thus in U+28F71, the element on the left, 阜, is written with 8 strokes but is semantically equivalent to 阝, which in most PRC dictionaries is counted as two strokes (though it counts as three in Taiwan). If the PRC data were generated by identifying elements in a character and adding them up, and 阜 was treated as 阝, this could explain the huge difference on this character.

But U+272F0 is just wrong.

# Andrew West on Friday, November 23, 2007 7:46 PM:

I think Bruce is quite right about U+28F71. If you count the radical 阜 as if it were 阝 (the former is the archaic form of the latter), giving it 3 strokes (as per Taiwan convention), and count the righthandside as 21 strokes (as per Kangxi Dictionary and kRSUnicode) you do indeed get the Taiwan count of 24. I wonder if "24" is not a mistake caused by confusion with its next door neighbour U+28F70 𨽰 which is the same character but with the 阝-form radical, and which should thus have a count of 24.
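
The arithmetic in the two comments above can be written out as a small sketch. The stroke values are the ones quoted in this thread (8 for 阜 as written, 3 for 阝 under the Taiwan convention, 2 under the PRC convention, 21 for the right-hand side); the variable names are hypothetical.

```python
# Strokes for the right-hand side of U+28F71, per the Kangxi Dictionary
# and kRSUnicode figure quoted above.
RESIDUAL_STROKES = 21

# Stroke counts for the left-hand radical under each reading quoted in the thread.
RADICAL_STROKES = {
    "阜 as written": 8,
    "阝 (Taiwan convention)": 3,
    "阝 (PRC convention)": 2,
}

totals = {form: n + RESIDUAL_STROKES for form, n in RADICAL_STROKES.items()}
for form, total in totals.items():
    print(f"{form}: {total}")
# 阜 as written: 29
# 阝 (Taiwan convention): 24
# 阝 (PRC convention): 23
```

The first total matches the China count of 29 for U+28F71, and the second matches the Taiwan count of 24.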

# Michael S. Kaplan on Friday, November 23, 2007 10:58 PM:

I suppose in response to the U+272F0 comment from Bruce I should express shock and outrage at the notion that any government's provided information could ever be faulty.

I'm kind of tired right now though, maybe another time....

# Michael S. Kaplan on Friday, November 23, 2007 11:01 PM:

Hey Andrew -- of course U+28f70 is not on the Taiwan list. :-(

# Michael S. Kaplan on Saturday, November 24, 2007 12:28 AM:

Additional info -- on the simplified list, the stroke count for U+28f70 is given as 23....

# Andrew West on Sunday, November 25, 2007 10:55 AM:

Simplified stroke count for U+28f70 of 23 sounds right to me, as 阝 is two strokes in the PRC.

The fact that the Taiwan list has U+28f71 but not U+28f70 makes me suspicious, as the former is the unusual, archaic form not given in the Kangxi Dictionary, whereas the latter is found in the Kangxi Dictionary. Makes me think that the character count was intended for U+28f70, but mistakenly given to U+28f71.

# Chris on Friday, November 30, 2007 10:53 AM:

I just want to drive home the point alluded to by a few of the commenters: it's wrong to assume that these stroke count values should be the same! The fact that there are discrepancies should come as no surprise at all, because in Unicode the same "character" can have multiple variant forms.

# Andrew West on Sunday, December 02, 2007 1:40 PM:

Looking at the draft CJK-B multi-column charts (IRG N1381) that I have just got hold of throws some light on why the Taiwan stroke counts for U+28F71 and U+272F0 are wrong.

The Taiwan source glyph for U+28F71 is shown with the ordinary 阝 radical (i.e. like U+28F70) rather than the archaic 阜 form of the radical (see "http://www.babelstone.co.uk/Blog/Images/IRG_N1381_2296.jpg"), which would give it a stroke count of "24" rather than the expected "29".

The Taiwan source glyph for U+272F0 is shown with two 虫 "insect" radicals instead of the expected single radical (see "http://www.babelstone.co.uk/Blog/Images/IRG_N1381_1840.jpg"), which would give it a stroke count of "19" rather than the expected "13".

In both cases the Taiwan source glyph is wrong, and it is these wrong source glyphs that seem to have been used as the source for the Taiwan stroke count data. I discuss the case of U+272F0 in more detail in my latest blog post ("http://babelstone.blogspot.com/2007/12/cjk-b-case-study-1-u272f0.html").

# Michael S. Kaplan on Sunday, December 02, 2007 1:46 PM:

I was going to do a follow-up post tomorrow, but I doubt it would have been better than yours, Andrew. :-)

# Andrew West on Sunday, December 02, 2007 5:42 PM:

I'm sorry if I spoiled your follow-up, but I'm sure you have a different, probably more interesting take on the subject than me -- my post is probably too detailed for anyone but the most dedicated CJK/Unicode geeks. I wasn't going to blog on the subject originally, but there was just too much information to put into the comments to someone else's blog. But in the end I am glad that I did, as my first ever blog post (http://babelstone.blogspot.com/2005/11/tibetan-extensions-1-astrological.html) was in response to one of your posts, and this turns out to have been my 61st post, so after a full cycle of blogging I am back where I started (if you follow what I mean).

# Michael S. Kaplan on Sunday, December 02, 2007 5:51 PM:

No worries at all, I think I will do a follow-up on some additional aspects of what happens next that your post made me think of/wonder about....

But the details were incredible, and make me wonder where to get my own copy of the multicolumn data!

# Andrew West on Tuesday, December 04, 2007 9:48 AM:

And finally, U+25F22 𥼢 has a Taiwan stroke count of "20" because its Taiwan source glyph is actually the same as U+25F52 𥽒 (in the Kangxi Dictionary U+25F52 is under 米 plus 14 = 20 strokes).

# Michael S. Kaplan on Wednesday, December 05, 2007 1:08 PM:

Forensic strokology? Awesome!


