A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

by Michael S. Kaplan, published on 2007/09/24 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/24/5085893.aspx

Previous posts in this series:

Part 0: The empty string sorts the same in every language
Part 1: The law of the letter -- e.g. Latin < Greek < Cyrillic
Part 2: The string that won? Didn't have a mark on him!
Part 3: Should you let a string make it's case? If so, Y?
Part 4: It isn't a race but let's make an EXCEPTION and cross the Finnish line
Part 5: EXPANSIONing your horizons
Part 6: Relax, be calm, and deCOMPRESS if you are feeling out of sorts
Part 7: You're very thin now, but I can still recognize you
Part 8: You can often think of ignoring weights as a form of ignorance
Part 9: Not always transitive, but punctual and punctuating
Part 10: I've kana wanted to start talking about Japanese

Now that I have been talking about collation in Windows across ten separate blog posts, I thought it might make sense to talk about the characters in Unicode that take up more space in the standard than any others -- ideographs.

Whether you call them Han or Hanja or Kanji, they are all basically Chinese characters used in either Chinese, Korean, or Japanese.

The approach to collating these characters was not created in a vacuum, but the sources it drew on were neither simple nor uncomplicated.

The collation story is in fact kind of a messy one, due to many different factors:

The tables were mostly not updated for multiple versions of Windows despite the fact that more and more characters were coming into general use;
Most of the characters that were not added to the tables had some weight, just not the one to put them in the correct order;
Some of the characters actually had no weight, with the predictable results thereof;
In the case of pronunciation-based sorts, the "most common" pronunciation of some characters actually changed over the course of the last 10+ years.

But the goal is quite simple:

In the default table, put all of the ideographs after almost everything else in Unicode -- first regular CJK, then Extension A, then Extension B, in code point order for each section.
For each specific East Asian language, put the relevant ideographs in the expected order for the expected sort in question.
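To make the first of those goals a bit more concrete, here is a minimal sketch in Python of the default ordering described above. It is not the actual Windows table or anything close to it: the block ranges are simply the Unicode block boundaries, the names `CJK_BLOCKS` and `default_ideograph_key` are made up for illustration, and "after almost everything else" is simplified to "after everything else".

```python
# Illustrative only: hypothetical names, not the real Windows sorting tables.
# Each entry is (first code point, last code point, rank within the ideographs).
CJK_BLOCKS = [
    (0x4E00, 0x9FFF, 0),    # CJK Unified Ideographs ("regular CJK")
    (0x3400, 0x4DBF, 1),    # CJK Unified Ideographs Extension A
    (0x20000, 0x2A6DF, 2),  # CJK Unified Ideographs Extension B
]


def default_ideograph_key(ch):
    """Return a sort key that places ideographs last, section by section."""
    cp = ord(ch)
    for first, last, rank in CJK_BLOCKS:
        if first <= cp <= last:
            # Ideographs: ordered by section, then by code point within it.
            return (1, rank, cp)
    # Assumption: everything else sorts first, by code point, purely to keep
    # the sketch short. Real collation weights are far more involved.
    return (0, 0, cp)


chars = ["\U00020000", "\u3400", "\u4E00", "A", "\u9FA5"]
ordered = sorted(chars, key=default_ideograph_key)
print([f"U+{ord(c):04X}" for c in ordered])
```

The point of the sketch is that Extension A lands after regular CJK even though its code points are lower, so the default order is not plain code point order across the whole range, only within each section.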

There are many different collations across the various locales, and I have talked about various issues in many different posts, from "Why is there no pronunciation-based sort for Japanese?" to "Supporting a pronunciation based sort for East Asian languages..." to "Is it Macau or is it Macao?" to "'Acceptable' Japanese sort order?" and more.

The simple fact is that trying to order over 70,000 items is going to be complicated, though hopefully as intuitive as it can be....

Now prior to Vista there were several specific problems in the tables:

Missing ideographs
Some overlap between the language-specific table and the extras meant to be put at the end
A few mistakes
A few changes in official source data (like for most common pronunciation)
Missing support of the expected repertoire in several national standards.

Even after addressing all of these problems, though, there was a problem (in some people's minds) starting with Vista -- an issue I hypothetically discussed in "If you add enough characters to a sort, intuitive distinction can suffer" and then more directly in "On distinctions that are primarily with [and without] difference". That latter post even had a nice high-level narrative of several of the various East Asian sorts.

In the next post I'll dig in a bit and provide some examples with different sort keys across different locales....

This post brought to you by ⑾ (U+247e, a.k.a. PARENTHESIZED NUMBER ELEVEN)


referenced by

2008/08/21 A&P of Sort Keys, part 14: The Hangul is really getting OLD

2007/10/09 A&P of Sort Keys, part 13 (About the function that is too lazy to get it right every time)

2007/10/08 A&P of Sort Keys, part 12 (aka Han sorts first!)
