A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

by Michael S. Kaplan, published on 2007/09/24 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/24/5085893.aspx


Previous posts in this series:

Now that I have been talking about collation in Windows across ten separate blog posts, I thought it might make sense to talk about the characters in Unicode that take up more space in the standard than any others -- ideographs.

Whether you call them Han or Hanja or Kanji, they are all basically Chinese characters used in either Chinese, Korean, or Japanese.

The collation of these characters was not developed in a vacuum, but the sources used in its creation were neither simple nor uncomplicated.

The collation story is in fact kind of a messy one, due to many different factors:

But the goal is quite simple:

  1. In the default table, put all of the ideographs after almost everything else in Unicode -- first regular CJK, then Extension A, then Extension B, in code point order for each section.
  2. For each specific East Asian language, put the relevant ideographs in the expected order for the expected sort in question.
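To make the first goal concrete, here is a minimal sketch of that default ordering. It is not how the Windows tables are actually built; it only illustrates "ideographs last, block by block, in code point order". The block ranges are those of the Unicode Standard, while the function name `default_ideograph_key` and the use of the raw code point as a stand-in weight for non-ideographs are assumptions made for illustration.

```python
# Block ranges as published in the Unicode Standard, listed in the order
# the default table is described as using: regular CJK, then A, then B.
CJK_BLOCKS = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs ("regular CJK")
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]


def default_ideograph_key(ch):
    """Hypothetical sort key: non-ideographs first, then each CJK block."""
    cp = ord(ch)
    for rank, (lo, hi) in enumerate(CJK_BLOCKS):
        if lo <= cp <= hi:
            # Ideographs sort after everything else, grouped by block,
            # and by code point within the block.
            return (1, rank, cp)
    # Assumption: the code point stands in for the real weight of
    # every non-ideographic character.
    return (0, 0, cp)


sample = ["\U00020000", "\u3400", "\u4e00", "a", "\u4e01"]
ordered = sorted(sample, key=default_ideograph_key)
```

Note that Extension A has lower code points than the regular CJK block, so a plain code point sort would put it first; the explicit block rank is what keeps regular CJK ahead of it.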

There are many different collations across the various locales, and I have talked about various issues in many different posts, from Why is there no pronunciation-based sort for Japanese? to Supporting a pronunciation based sort for East Asian languages... to Is it Macau or is it Macao? to 'Acceptable' Japanese sort order? and more.

The simple fact is that trying to order over 70,000 items is going to be complicated, though hopefully as intuitive as it can be....

Now prior to Vista there were several specific problems in the tables:

Though even after addressing all of these problems, there was a problem (in some people's minds) starting with Vista -- an issue I hypothetically discussed in If you add enough characters to a sort, intuitive distinction can suffer and then more directly in On distinctions that are primarily with [and without] difference. That latter post even had a nice high-level narrative of several of the various East Asian sorts.

In the next post I'll dig in a bit and provide some examples with different sort keys across different locales....

 

This post brought to you by ⑾ (U+247e, a.k.a. PARENTHESIZED NUMBER ELEVEN)


