If you add enough characters to a sort, intuitive distinction can suffer

by Michael S. Kaplan, published on 2006/11/01 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/11/01/916477.aspx

I have decided that I need to channel the spirit of the father of comedian Emo Philips and juxtapose a few concepts that have come up in this blog over the last few years. :-)

First, near the beginning of the blog in November of 2004, I explained the answer I give to people when they ask me "why is my Korean text in random order?". It boiled down to secondary distinctions being used to tell the difference between two Hanja that both have the same Hangul pronunciation. That issue actually increases a bit in Vista, where the work to add over 20,000 "K2" Hanja to the Korean sorting tables took place, but it has in any case existed for as long as the Hangul sort has existed in Windows.

Then in May of last year, when I talked about "A few of the gotchas of CompareString", I went through secondary distinctions and the impact of NORM_IGNORENONSPACE a bit more (among other things).

Last December I decided to focus a bit more on the whole issue when I answered in more detail the question "What's a secondary distinction?". There I really dug into how, when it is a real linguistic feature (rather than a case like Korean Hangul/Hanja where it is principally done for the sake of making room in the weight table), it is a feature that a native reader of a language will understand even if they cannot fully describe it, and which a naive user will expect while being even less able to explain it well.
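To make the idea of a secondary distinction concrete, here is a minimal Python sketch of a two-level comparison. Everything in it is an assumption for illustration: the `WEIGHTS` table and its numbers are invented and are not the real Windows sort weights, and `compare` is a hypothetical helper, not CompareString. The `ignore_secondary` switch is only loosely analogous to what NORM_IGNORENONSPACE does.

```python
# Hypothetical weight table: character -> (primary, secondary).
# The numbers are invented for illustration; they are NOT Windows sort weights.
WEIGHTS = {
    "a": (10, 0),
    "à": (10, 1),  # same primary weight as "a", different secondary weight
    "ä": (10, 2),
    "b": (11, 0),
}


def sort_key(text):
    """Build a two-level key: all primary weights first, then all secondary."""
    primary = tuple(WEIGHTS[ch][0] for ch in text)
    secondary = tuple(WEIGHTS[ch][1] for ch in text)
    return (primary, secondary)


def compare(left, right, ignore_secondary=False):
    """Return -1, 0, or 1, consulting secondary weights only on a primary tie."""
    left_key, right_key = sort_key(left), sort_key(right)
    if ignore_secondary:
        # Keep only the primary level, so diacritic differences vanish.
        left_key, right_key = left_key[:1], right_key[:1]
    return (left_key > right_key) - (left_key < right_key)
```

In this sketch "à" and "ä" differ only at the secondary level, so `compare("à", "ä")` is decided by the diacritic, while `compare("äa", "ab")` is decided by the primary difference between "a" and "b" before the diacritic is ever looked at.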

Then yesterday, when I took that naive case and combined it with this post, and showed it doing things that are perhaps less intuitive (in the post "àèìòù" < "äëïöü" but "àèìòù " > "äëïöü"), I showed how even what is intuitive can easily become non-intuitive. In case you read that post and want to avoid getting hung up on the space, you can put a period or any symbol there instead; you will get the same results.

In Vista, where over 20,000 Hanja were added to the Korean locale's sort, this large number is dwarfed by the additions to other locales, such as the over 37,000 Han added to the Taiwanese stroke count sort or the over 28,000 Han added to the Bopomofo sort. Or the over 41,000 Han added to the PRC's Pinyin-based sort table, or even the almost 50,000 Han added to the PRC stroke count table. And remembering that the primary weights have just a single 16-bit WORD per sort element to deal with, and that this space is shared with every other script, there simply isn't enough room to fit everything and give them all a primary weight. So, as with Korean, the effort was made to put like ideographic elements together, as follows:

This only scratches the surface of the different ways one might wish to collate CJK ideographs, and in each case the struggle to determine what makes up a primary and what makes up a secondary distinction is not a trivial one. The results will not always be as intuitive as a native speaker/reader of a language might expect, a fact that will drive future versions to work to get even better information about expected cause for distinctions....
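The room problem behind all of this is easy to check with quick arithmetic. The Python sketch below assumes only what is stated above, a single 16-bit WORD for the primary weight, and uses the rounded counts from this post ("over" and "almost" figures treated as round numbers). The names `PRIMARY_SLOTS`, `ADDED_HAN`, and `share_of_primary_space` are made up for the illustration.

```python
# A 16-bit WORD can hold at most 2**16 distinct primary weight values.
PRIMARY_SLOTS = 2 ** 16  # 65,536

# Rounded counts quoted in this post; "over" and "almost" are approximations.
ADDED_HAN = {
    "Korean (K2 Hanja)": 20_000,
    "Taiwanese stroke count": 37_000,
    "Bopomofo": 28_000,
    "PRC Pinyin": 41_000,
    "PRC stroke count": 50_000,
}


def share_of_primary_space(count):
    """Fraction of the 16-bit space needed if every item got its own primary weight."""
    return count / PRIMARY_SLOTS


for table, count in ADDED_HAN.items():
    print(f"{table}: {share_of_primary_space(count):.0%} of the 16-bit space")
```

Even the smallest addition would eat close to a third of a space that every other script also has to share, which is why so many of these ideographs end up distinguished at the secondary level rather than getting a primary weight of their own.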


This post brought to you by 〇 (U+3007, a.k.a. IDEOGRAPHIC NUMBER ZERO)


referenced by

2007/09/24 A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

2007/02/15 On distinctions that are primarily with [and without] difference
