On distinctions that are primarily with [and without] difference

by Michael S. Kaplan, published on 2007/02/15 06:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/02/15/1667070.aspx

If you read here regularly, you might remember that late last year I posted "If you add enough characters to a sort, intuitive distinction can suffer".

A few people have (since that time) reported as a bug the behavior in Vista that I was describing in that post.

Not so much for Korean, which has pretty much always had this behavior (not to mention that pure Hangul usage is much more common than Hanja, and certainly more common than "K2" Hanja!).

And not so much for Japanese (which did not have as many characters to add so the problem was not really so acute).

But Chinese, well that's another story.

To review the change for Chinese:

Pinyin (pronunciation) sort - Each unique Pinyin pronunciation (sound plus tone value) has its own primary weight, with the stroke count and order used as a secondary weight to break ties. This covers over 62,000 ideographs.

Stroke count sort for Simplified Chinese - Each unique stroke count and order is given a primary weight, with the underlying code point value used as a secondary weight to break ties. This covers over 70,000 ideographs.

Bopomofo (pronunciation) sort - Each unique Bopomofo pronunciation plus total stroke value is given its own primary weight, with code point order used to provide secondary distinctions and break ties. This covers over 48,000 ideographs.

Stroke count order for Traditional Chinese - Since only 46 unique total stroke counts existed in the data and no information about stroke order was provided, half of the primary weight is based on the stroke count and the other half (plus sometimes the diacritic weight) is used to help break ties. This covers over 54,000 ideographs.

Now, for lack of a better term, I'll call all four of these orders "the Korean style sort", since this is the way that sorting has worked for Korean in Windows (with the Hangul pronunciation giving the primary weight and other factors affecting the secondary weight).

To give an example of a typical report, say you have the following five filenames (the stroke count of each character is given in parentheses after the string):

杜成義 (7, 6, 13)
杜德偉 (7, 15, 11)
李玟 (7, 8)
李翊君 (7, 11, 7)
李聖傑 (7, 13, 12)

Now, since a PRC stroke count sort (LCID 0x00020804) in Vista considers the stroke count and order to be the primary weight and the code point to be the secondary weight, the order ends up being the following:

杜成義 (7, 6, 13)
李玟 (7, 8)
李翊君 (7, 11, 7)
李聖傑 (7, 13, 12)
杜德偉 (7, 15, 11)

rather than the XP order, which is:

李玟 (7, 8)
李翊君 (7, 11, 7)
李聖傑 (7, 13, 12)
杜成義 (7, 6, 13)
杜德偉 (7, 15, 11)

Since the Vista order can cause identical ideographs to not sort together (here the two 杜 names are split apart by the three 李 names), this can clearly look like a bug, even though the nature of the sort is reasonable (and no one actually complained about the plan, or about the sorts themselves, during the several years they were available in the beta).
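To make the difference concrete, here is a minimal Python sketch of the two comparisons, using the stroke counts from the lists above. It is a simplified model and not the actual Windows collation tables: stroke order is ignored, and the XP-style key simply assumes that ideographs with equal stroke counts fall in code point order (which happens to hold for 李 and 杜). The function names are made up for illustration.

```python
# Stroke counts exactly as listed in the post for the five filenames.
STROKES = {
    "杜": 7, "成": 6, "義": 13, "德": 15, "偉": 11,
    "李": 7, "玟": 8, "翊": 11, "君": 7, "聖": 13, "傑": 12,
}

names = ["杜成義", "杜德偉", "李玟", "李翊君", "李聖傑"]


def vista_style_key(s):
    """'Korean style' model: all primary weights are compared first,
    and the secondary weights only matter if every primary weight ties."""
    primary = tuple(STROKES[ch] for ch in s)   # stroke count only (simplified)
    secondary = tuple(ord(ch) for ch in s)     # code point as tie-breaker
    return (primary, secondary)


def xp_style_key(s):
    """Unique-primary model: every ideograph gets its own primary weight.
    Assumption: stroke count first, then code point, per character."""
    return tuple((STROKES[ch], ord(ch)) for ch in s)


vista_order = sorted(names, key=vista_style_key)
xp_order = sorted(names, key=xp_style_key)

print("Vista-style:", vista_order)
print("XP-style:   ", xp_order)
```

In the first key, 杜 and 李 are indistinguishable at the primary level (both have 7 strokes), so the second character decides the order before the code point of the first character is ever consulted. In the second key, the first character is fully resolved before the second one is looked at, which keeps identical ideographs together.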

Collation has a lot more to do with users' intuitive expectations than with any kind of reasonable argument one might make about the logic behind a sort!

So it turns out that there are some folks out there who actually don't like the "Korean style sort" here and actually prefer having every ideograph given a unique primary weight which would let identical ideographs sort together rather than potentially separate when compared against many other ideographs with identical stroke counts.

Trying to come up with a solution leads to some unique challenges given the original goals of the implementation:

Providing linguistically appropriate weights for all of the new ideographs
Giving default weights to all of Unicode 5.0
Making the sort key size as small as feasible

So if now (that we have shipped) is the time for thought experiments about whether some future version could get away from the "Korean style sort" kind of plan, then deciding which one or more of these goals to change or shift becomes an interesting question, since rolling back any one of them can be considered a legitimate regression of existing functionality.

The first goal is a really hard one to give up given how long people were wanting the weights there, and how glad people were to see them put in. Though perhaps in practice some things could be done (it is unlikely that all 70,000+ ideographs are actually used in China so perhaps some effort to be somewhat bigger than the old table while still smaller than every ideograph would make sense if the right subset could be found).

On the other hand, maybe that second point is the one to rethink, on a per sort basis -- I mean, how crucial is it to give the right weights to (e.g.) Ethiopic when the locale being used is Chinese? Perhaps that is the idea to question here, no matter how long it has been the way things work.

The easiest answer is of course in that third point -- a longer sort key would give all the room needed to make the distinction. It is in fact how support for Extension A and Old Hangul sorting was added in XP. But it feels to me like a cop-out of sorts. It is as if the answer to every technical problem were to just throw more resources at it. And since longer sort keys could lead to bigger indexes and such, it could actually be expensive for customers in some big database situations.
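A rough back-of-the-envelope check shows why unique primary weights and small sort keys pull against each other. The 16-bit width used here is an assumption for illustration only (the post says no more than that the primary weight has two halves); the ideograph count is the "over 70,000" figure quoted above for the Simplified Chinese stroke sort.

```python
# Hypothetical figures for illustration; only the 70,000 comes from the post.
IDEOGRAPH_COUNT = 70_000        # "over 70,000 ideographs" (lower bound)
ASSUMED_PRIMARY_BITS = 16       # assumption: a two-byte primary weight

# How many distinct primary weights the assumed width can express.
available_weights = 2 ** ASSUMED_PRIMARY_BITS

# Smallest number of bits that can give every ideograph its own value.
bits_needed = (IDEOGRAPH_COUNT - 1).bit_length()

fits = IDEOGRAPH_COUNT <= available_weights

print(f"available primary weights: {available_weights}")
print(f"bits needed for unique weights: {bits_needed}")
print(f"fits in assumed width: {fits}")
```

Under that assumption the ideographs alone would not fit, even before any weights are set aside for the rest of Unicode, which is why something has to give: fewer ideographs, fewer other scripts, or a longer key.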

Then there are a whole host of other creative ideas like new special flags, creatively combining some of these ideas, and so on.

To be honest none of this will go anywhere without prototyping them all and giving the results to folks so they can see what they would be getting into. Because like it or not there is almost certainly some important East Asian study of the Amharic language whose computer systems will explode due to the oversized sort keys. All of which is a geekily poetic way to point out that you can't swing a cat in these parts without hitting someone's implementations, and that we'll be damned whether we do something or not, in someone's eyes.

And of course all of this is overthinking the problem at the moment, since perhaps in the end the best choice will be to leave things as they are. It will be years before we really know whether the few reports that have come in actually represent an issue to be addressed. Luckily there is time to deal with this iterative process.... :-)

This post is brought to you by 〇 and ㅇ (U+3007 and U+3147, a.k.a. IDEOGRAPHIC NUMBER ZERO and HANGUL LETTER IEUNG)


referenced by

2007/09/24 A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)
