Every character has a story #31: U+272f0 from CJK Extension B, an ideograph that proves that every rose has its thorn! (aka It wasn't my fault, but [from the Windows standpoint] it was because of me....)

by Michael S. Kaplan, published on 2007/12/03 09:31 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/12/03/6643180.aspx


Yes, the end of the title is an allusion to a late 80s Poison power ballad based on a Bret Michaels love affair that did not work out (due to a philandering lady, in that case).

The other day, I posted How bad does it need to be in order to be not good enough, anyway? and I was focusing on the differences between Traditional and Simplified stroke data that was being used for stroke-based collation, and wondering about the net effect in Macao (a Traditional Chinese-using region for which Microsoft uses the Simplified Chinese collation, as I discussed in Is it Macau or is it Macao?).

But Andrew West was looking at some of these more extreme cases of stroke count differences (as you can tell from the comments), in particular 𧋰 (U+272f0), and managed to prove in his own blog (his post CJK-B Case Study #1 : U+272F0) that every ideograph also has a story!

He was worried that he posted too much detail to be very interesting:

I'm sorry if I spoiled your follow-up, but I'm sure you have a different, probably more interesting take on the subject than me -- my post is probably too detailed for anyone but the most dedicated CJK/Unicode geeks. I wasn't going to blog on the subject originally, but there was just too much information to put into the comments to someone else's blog.

though speaking personally, I disagree. The only thing that would keep me from doing such a post myself here is that I lack the knowledge/wherewithal to do so....

Luckily I can simply link to him, instead! :-)

From Andrew's "case study" post:

I guess that once the Taiwan source glyph is corrected and the Taiwan stroke count data is amended it should be the end of the story, but the one thing that nags at me (as is the case with so many characters which only have a single Taiwan source reference) is what is the ultimate source of this character and which texts is it used in ?

For Microsoft, it raises an interesting problem if/when the reference glyph is fixed....

Okay, let's say they do fix the reference glyph, and subsequently, the stroke count.

What does Microsoft do?

Note that our Traditional Chinese font that includes U+272f0 does not have this problem. We did not pick up the incorrect glyph, possibly because the font foundry realized the same thing Andrew did and did not want to perpetuate the mistake, but then did not tell us, either (not to imply that there is or isn't a definite mechanism for doing so). Or perhaps there is a separate quality issue in the font itself?

So either way, at this point it is just an anomaly in the sorting table, a known bug with no official communication on the change yet, but we expect at some point there might be such communication.

Since we are literally based on a standard in this case, no change could even be considered until it is known through official sources.

In a total stroke based collation such as this one, the difference between 13 and 19 is pretty huge, so one assumes that eventually the change would have to be picked up.
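To make "pretty huge" a bit more concrete, here is a minimal sketch of a total-stroke sort. It is not the actual Windows collation; the `sort_by_strokes` and `position` helpers, the tie-break on code point, and the handful of neighboring characters are all assumptions for illustration. Only the two counts for U+272f0 (13 and 19) come from the discussion above.

```python
# Minimal sketch of a total-stroke-based sort (NOT the real Windows tables).
# Primary key: total stroke count. Tie-break: code point (an assumption).

TARGET = "\U000272F0"  # the ideograph discussed in this post

# Illustrative stroke counts for a few common characters.
BASE = {"一": 1, "木": 4, "森": 12, "樹": 16, "鬱": 29}


def sort_by_strokes(chars, strokes):
    """Order characters by (stroke count, code point)."""
    return sorted(chars, key=lambda c: (strokes[c], ord(c)))


def position(count):
    """Index of U+272F0 in the sorted list when it is given `count` strokes."""
    table = {**BASE, TARGET: count}
    return sort_by_strokes(list(table), table).index(TARGET)


if __name__ == "__main__":
    for count in (13, 19):
        print(count, "strokes -> position", position(count))
```

Even in this six-character list the ideograph jumps past 樹 when the count changes; in a table covering tens of thousands of ideographs, a six-stroke difference would move it past a very large number of characters.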

But even a change to one code point could cause index corruption in a database, which means a new major version would be required for the character.
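The index corruption point can be shown with a toy example. This is only a sketch under stated assumptions: the "index" is a plain sorted Python list searched with `bisect`, the `sort_key`, `build_index`, and `find` helpers are hypothetical, and which of the two counts is the old or the new one is chosen arbitrarily here.

```python
import bisect

TARGET = "\U000272F0"

# Two versions of a (hypothetical) stroke table differing in one code point.
OLD = {"森": 12, "樹": 16, "鬱": 29, TARGET: 19}
NEW = {**OLD, TARGET: 13}


def sort_key(ch, strokes):
    """Collation key: (stroke count, code point)."""
    return (strokes[ch], ord(ch))


def build_index(chars, strokes):
    """Persisted order of rows, as computed when the index was built."""
    return sorted(chars, key=lambda c: sort_key(c, strokes))


def find(index, ch, strokes):
    """Binary search that trusts `index` to be ordered under `strokes`."""
    keys = [sort_key(c, strokes) for c in index]
    i = bisect.bisect_left(keys, sort_key(ch, strokes))
    return i < len(keys) and index[i] == ch


index = build_index(OLD, OLD)          # built under the old table
found_old = find(index, TARGET, OLD)   # same table: row is found
found_new = find(index, TARGET, NEW)   # changed table: row is silently missed

if __name__ == "__main__":
    print("old table:", found_old, "| new table:", found_new)
```

The row is still physically in the index, but a search that assumes the new ordering looks in the wrong place and never sees it, which is why the comparison rules cannot change underneath existing data without a version bump.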

And before rushing in to fix U+272f0 (and, as Andrew mentioned, it is not clear where it is needed), we have to consider the bigger problems in the other 9,767 differences and with Extension B in general.

How reliable is the rest of the data? And how many additional problems are already fixed in the font that ships even in Vista but are not fixed in the collation tables since those tables are based on a standard working from what amounts to a completely different set of reference glyphs?

I always tended to think of pronunciation-based sorts as being more worrisome technically, since an ideograph can have multiple pronunciations and by putting a stake in the ground for a version and saying that one pronunciation is the most common, we have to allow for the fact that over time things change, and in the future the most common pronunciation might be different. We had several such changes for Hanja in the Korean collation in Vista, for example.

But now it seems like we have to look at stroke count data with the same careful eye, never knowing when future corrections would come in based on bugs....

Maybe an automated program should be run over all of the characters in PMing-ExtB, counting strokes and comparing against the stroke data in the standard, and then figuring out where other bugs might be.

But I imagine getting resources for such a review would be a challenge, and the notion of assuming the font is always right here is also flawed -- the font and data could both be wrong, after all.
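The comparison half of such a review is the easy part, and could look something like the sketch below. It assumes the hard part (somehow deriving a stroke count per glyph from the font) has already produced a table; the `stroke_discrepancies` function, both input tables, and which side holds which count for U+272f0 are hypothetical and chosen only for illustration.

```python
def stroke_discrepancies(font_counts, standard_counts, min_gap=1):
    """List characters whose two stroke counts differ by at least `min_gap`.

    Returns (code point label, font count, standard count, gap) tuples,
    largest gap first. A row only says the two sources disagree; it does
    not say which one is wrong (both could be).
    """
    rows = []
    # Only characters present in both tables can be compared.
    for ch in sorted(font_counts.keys() & standard_counts.keys(), key=ord):
        gap = abs(font_counts[ch] - standard_counts[ch])
        if gap >= min_gap:
            label = f"U+{ord(ch):04X}"
            rows.append((label, font_counts[ch], standard_counts[ch], gap))
    return sorted(rows, key=lambda row: -row[3])


# Hypothetical sample data; real tables would cover all of Extension B.
FONT = {"森": 12, "樹": 16, "\U000272F0": 13}
STANDARD = {"森": 12, "樹": 16, "\U000272F0": 19}

if __name__ == "__main__":
    for row in stroke_discrepancies(FONT, STANDARD):
        print(row)
```

Sorting by gap puts the most extreme disagreements at the top, which is where a human reviewer with limited time would presumably want to start.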

The engineer in me has a hard time dealing with the fact that there are an unknown number of mistakes here some of which could perhaps be ferreted out, and the linguist-wannabe does not feel much better about that (though he is less convinced of the overall usefulness of a total-stroke-based collation and is thus less troubled by anomalies).

Plus, there is nothing to say that there are not also mistakes on the Simplified side too. More worries and more resources needed (and this one troubles that linguist-wannabe a bit more since the stroke count/stroke order based sort has in theory a bit more utility, though the notion that there are millions of people who would know the correct order to draw these ideographs they have never seen from millennia ago is also suspect!).

It is a mess, to be sure. Inevitably I am back to Andrew again, and his intro text from the post:

The CJK Unified Ideographs Extension B [13MB] block that was added to Unicode/10646 in 2001 comprises 42,711 characters, and it is no secret that there are many problems with this huge collection of mostly quite rare characters, including hundreds of cases of unifiable characters that have been erroneously encoded separately and even a handful of completely duplicate characters. There is enough material to keep a dedicated CJK-B blogger busy for years to come, but I certainly don't want to go down that particular path.

I worry more about the ones that cause implementation issues like U+272f0 will, but even so I would be just as worried about having to go down that path as he, perhaps more. Technically I worry more for my successors who own the area, though I do feel partially responsible since the errors of the Taiwanese standard based on errors in Unicode/10646 were perpetuated into Windows on my watch.

Should I feel worse that it was literally my request to the subsidiaries to provide the additional data I would need to extend the tables?

(They had requested us to extend them and had been refused for a long time based on technological issues that I figured out workarounds for.)

Well, either way I do feel worse. It wasn't my fault, but from the Windows standpoint it was because of me....

 

This post brought to you by 𧋰 (U+272f0, an Extension B CJK ideograph causing me to lose a bit of sleep!)


# ReallyEvilCanine on 3 Dec 2007 2:08 PM:

I started trying to figure out how to get 19 strokes out of a 13-stroke glyph and the first thing I thought was "Someone who doesn't know how to write CJK counted <i>all</i> the lines," but that only got me halfway to the magic number. Fortunately I had a bottle of Caol Ila at home. As I continued to ponder the question (not having Andrew's reference materials) I continued to sate my thirst. It took about four hours and half a bottle but I finally saw 19 strokes. Half an hour later it was up to 22 and, I think, 31 by the time I went to bed.

In all seriousness, how does one argue in favour of Unicode over GB-18030 when this sort of thing (extended CJK) was supposed to have been sorted in 3.0... and in 4.0... and in 5.0...

# Michael S. Kaplan on 3 Dec 2007 3:35 PM:

GB-18030 has the same potential for errors with these characters, as it is an IRG based bug and both rely on the IRG -- as a standard, it [GB-18030] is a Chinese-specific encoding of Unicode!

# ReallyEvilCanine on 3 Dec 2007 4:56 PM:

/Potential/ for error, yes, but arguing from the Chinese side is easier: "At least it's /our/ language. Who better to encode it properly?" Considering the top-down hierarchy of everything there, changes can be effected remarkably quicker than through the consortium should they so choose. Don't get me wrong -- I've always been a Unicode flag-waver. I find the position of Devil's Advocate suits me well, and these are reasonable concerns.

BTW, could you perhaps see your way to reactivating at least the <i> HTML italics?

# Michael S. Kaplan on 3 Dec 2007 6:25 PM:

Ah, but then they would break their own stated policies on stability. :-)

Which they could I am sure choose to do, but currently they are not taking that track....

This Community Server install does not let me support HTML tags....

# Daniel Cheng on 5 Dec 2007 4:21 AM:

The Taiwan government has standard fonts available for download at http://www.edu.tw/EDU_WEB/EDU_MGT/MANDR/EDU6300001/bbs/1-4-2/kai.htm?open


referenced by

2008/05/07 Four exceptions to prove the rule
