If you add enough characters to a sort, intuitive distinction can suffer

by Michael S. Kaplan, published on 2006/11/01 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/11/01/916477.aspx

I have decided that I need to channel the spirit of the father of comedian Emo Phillips and juxtapose a few concepts that have come up in this blog over the last few years. :-)

First, near the beginning of the blog in November of 2004, I explained the answer I give to people when they ask me "why is my Korean text in random order?". It boiled down to secondary distinctions being used to tell the difference between two Hanja that both have the same Hangul pronunciation. This issue actually increases a bit in Vista, where over 20,000 "K2" Hanja were added to the Korean sorting tables, but it has in any case existed for as long as the Hangul sort has existed in Windows.

Then in May of last year when I talked about A few of the gotchas of CompareString, I went through secondary distinctions and the impact of NORM_IGNORENONSPACE a bit more (among other things).

Last December I decided to focus a bit more on the whole issue when I answered in a bit more detail the question What's a secondary distinction?, and really dug into how when it is a real linguistic feature (rather than a case like Korean Hangul/Hanja where it is principally done for the sake of making room in the weight table), it is a feature that a native reader of a language will understand even if they cannot fully describe it, and which a naive user will expect while being even less able to explain it well.

Then yesterday I took that naive case and combined it with this post, showing it doing things that are perhaps less intuitive (in the post, "àèìòù" < "äëïöü" but "àèìòù " > "äëïöü"), and so showed how even what is intuitive can easily become non-intuitive. In case you read that post and want to avoid getting hung up on the space, you can put a period or any symbol there instead; you will get the same results.
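For anyone who wants to poke at that result, here is a minimal sketch of a two-level comparison in Python. It is not how Windows builds its sort keys: the weights are invented for illustration (the code point of the base letter stands in for a primary weight, and the two numbers in SECONDARY are arbitrary apart from placing grave before diaeresis, which matches the result quoted above). The names toy_sort_key and toy_compare are hypothetical.

```python
import unicodedata

# Toy secondary weights for two combining marks. The values are made up;
# only their relative order (grave before diaeresis) matters here.
SECONDARY = {
    "\u0300": 1,  # COMBINING GRAVE ACCENT
    "\u0308": 2,  # COMBINING DIAERESIS
}

def toy_sort_key(text):
    """Return (primary weights, secondary weights) for a string."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):
        if ch in SECONDARY and secondary:
            # A combining mark only changes the secondary weight
            # of the base character that precedes it.
            secondary[-1] = SECONDARY[ch]
        else:
            # Assumption: the code point stands in for a primary weight.
            primary.append(ord(ch))
            secondary.append(0)
    return (primary, secondary)

def toy_compare(a, b):
    """Compare all primary weights first, then all secondary weights."""
    ka, kb = toy_sort_key(a), toy_sort_key(b)
    return (ka > kb) - (ka < kb)

print(toy_compare("àèìòù", "äëïöü"))   # -1: primaries tie, secondaries decide
print(toy_compare("àèìòù ", "äëïöü"))  # 1: the extra character wins at the primary level
```

The point of the sketch is the ordering of the levels: the accents are only consulted when every primary weight in both strings has tied, so one extra character with a primary weight of its own settles the comparison before the accents get a say.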

In Vista, where over 20,000 Hanja were added to the Korean locale's sort, this large number is dwarfed by the additions to other locales, such as the over 37,000 Han added to the Taiwanese stroke count sort or the over 28,000 Han added to the Bopomofo sort. Or the over 41,000 Han added to the PRC's Pinyin-based sort table, or even the almost 50,000 Han added to the PRC stroke count table. Remembering that the primary weights have just a single 16-bit WORD per sort element to deal with, and that this space is shared with every other script, there simply isn't enough room to fit everything and give them all a primary weight. So, like with Korean, the effort was made to put like ideographic elements together, as follows:

Korean Hangul sort - Each unique Hangul syllable has a primary weight; all Hanja that share that Hangul syllable's pronunciation are given a unique secondary weight under that Hangul syllable, covering nearly 40,000 characters when you look at all of the Hangul plus all of the Hanja.
Pinyin (pronunciation) sort for China - Each unique Pinyin pronunciation (sound plus tone value) has its own primary weight; the stroke count and order are used as a secondary weight to break ties, covering over 62,000 ideographs when all is said and done.
Stroke count sort for China - Each unique stroke count and order (I'll talk more about these another day) is given a primary weight, with the underlying code point value used to break ties and provide the secondary weight, for a total of over 70,000 ideographs.
Bopomofo sort for Taiwan - Each unique Bopomofo pronunciation plus total stroke value is given its own primary weight, with code point order used to provide secondary distinctions and break ties (Bopomofo pronunciation alone could not be used, as this led to only 1,370 unique pronunciations, some sharing over 500 ideographs!). This led to over 48,000 ideographs being covered.
Stroke count order for Taiwan - Since only 46 unique total stroke counts existed in the data and no information about stroke order was provided, half of the primary weight was based on the stroke count and the other half (plus sometimes the diacritic weight) is used to help break ties. The total number of ideographs covered is over 54,000. Of the five listed here this is the least satisfying in terms of data provided to help make useful primary and secondary distinctions, and some thought is being given to how to make those distinctions more meaningful in the future, since there are times that the order will be less intuitive. For most uses, however, the sort will still be of use, especially if the NORM_IGNORENONSPACE flag is not used.
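To make the first scheme in the list concrete, here is a rough Python sketch of the Korean-style layout: one primary weight per Hangul syllable, with the Hanja that share that reading hanging off it via secondary weights. Everything in it is an assumption for illustration. The table holds just five Hanja, the weight values are arbitrary, placing the Hangul syllable ahead of its Hanja is my own choice for the sketch, and the names HANJA_BY_READING, build_weights and sort_key are hypothetical. The ignore_secondary parameter is only a stand-in for the general idea of dropping secondary distinctions, not a model of what NORM_IGNORENONSPACE actually does inside CompareString.

```python
# Hypothetical toy table: Hangul syllable -> a few Hanja read with that syllable.
HANJA_BY_READING = {
    "가": ["家", "歌", "可"],
    "남": ["男", "南"],
}

def build_weights(table):
    """Map every character to a (primary, secondary) weight pair."""
    weights = {}
    for primary, syllable in enumerate(sorted(table), start=1):
        # Assumption: the Hangul syllable itself gets secondary weight 0.
        weights[syllable] = (primary, 0)
        for secondary, hanja in enumerate(table[syllable], start=1):
            # Each Hanja reuses the syllable's primary weight.
            weights[hanja] = (primary, secondary)
    return weights

WEIGHTS = build_weights(HANJA_BY_READING)

def sort_key(text, ignore_secondary=False):
    """Primary weights first; secondary weights only break ties."""
    primary = [WEIGHTS[ch][0] for ch in text]
    secondary = [] if ignore_secondary else [WEIGHTS[ch][1] for ch in text]
    return (primary, secondary)

words = ["南", "歌", "가", "男", "家", "남"]
print(sorted(words, key=sort_key))
# ['가', '家', '歌', '남', '男', '南']

# Seven characters, but only two primary weights consumed.
print(len(WEIGHTS), len({p for p, _ in WEIGHTS.values()}))  # 7 2
```

The last line is the whole reason for the design: a 16-bit WORD allows at most 65,536 primary values, shared across every script, so a table of over 70,000 ideographs cannot give each one its own. The flip side is that once the secondary weights are ignored, every Hanja in the sketch compares equal to its Hangul syllable.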

This only scratches the surface of the different ways one might wish to collate CJK ideographs, and in each case the struggle to determine what makes up a primary and what makes up a secondary distinction is not a trivial one. The results will not always be as intuitive as a native speaker/reader of a language might expect, a fact that will drive future versions to work to get even better information about expected cause for distinctions....

This post brought to you by 〇 (U+3007, a.k.a. IDEOGRAPHIC NUMBER ZERO)
