A&P of Sort Keys, part 12 (aka Han sorts first!)

by Michael S. Kaplan, published on 2007/10/08 10:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/08/5270854.aspx


Previous posts in this series:

Sorry for the small vacation from the series, but being in Cleveland for a week kind of threw off the schedule a bit. Hopefully if you had been following along you are still around. :-)

This post is not going to be about Han Solo and whether he shot first in the Star Wars scene with the bounty hunter Greedo. If you are here for that, see http://www.hanshootsfirst.org/ or the Wikipedia article about the issue.

Or you can see the Recursion shot first! article for an amusing photo of George Lucas wearing a t-shirt that says Han Shot First while talking to Harrison Ford.

This article, unlike those, is (sort of) about Chinese ideographs, which is what I promised in the last article, though with a minor diversion into Korean.

Now if you want to be absolutely technical, Chinese ideographs (or ideograms if you prefer) actually aren't ideographs at all; they are logographs (or logograms if you prefer), since they represent morphemes (words) and not ideas. But that is neither here nor there for the purposes of this article.

And Unicode (which calls them ideographs) is not about linguistics, it is about scripts, anyway. It is how they get all into symbols without feeling like they have to take a shower afterward :-)

In Korean, the Han are called Hanja, and on Windows they tend to sort after their most common Hangul pronunciation, something that has been true for quite a long time. Of course there is an occupational hazard here since the "most common pronunciation" can tend to change over time, but Hanja are not nearly as common in Korean as Hangul (they are even less common in North Korea, a place for which Windows ships with no locale of its own for Wassenaar Arrangement reasons)....

Speaking of which, there is not a whole lot Microsoft can do to support the North Korean view on Hangul compared to the South Korean one, as I discussed in Traditional versus modern sorts, and which would make for a fascinating technical problem if it were not such a stifling political one....

It is also a little known fact that pre-Vista, on Windows the Hanja (along with the Hangul) tend to sort before all of the other scripts (something not true of CJK Extension A or CJK Extension B, but true of most of the rest in the table). I mentioned this years ago in Unlike LCMapString, the sort keys for English characters precede the sort keys for Korean, where I also explained that this order change does not happen in .NET.

In Vista, the new sort version was an opportunity to fix this inconsistency (which was particularly painful for its effect on the .NET Compact Framework, which used the built-in CE tables and thus would have different results than the regular .NET Framework).

It is particularly jarring even in Windows if you aren't expecting it (as seen in the sort key values for these items, mostly from the top of the table):

U+ffa1   ﾡ    HALFWIDTH HANGUL LETTER KIYEOK
U+3131   ㄱ   HANGUL LETTER KIYEOK
U+3200   ㈀   PARENTHESIZED HANGUL KIYEOK
U+3260   ㉠   CIRCLED HANGUL KIYEOK
U+ac00   가   HANGUL SYLLABLE KIYEOK A
U+320e   ㈎   PARENTHESIZED HANGUL KIYEOK A
U+326e   ㉮   CIRCLED HANGUL KIYEOK A
U+4f3d   伽   CJK UNIFIED IDEOGRAPH
U+4f73   佳   CJK UNIFIED IDEOGRAPH
U+0041   A    LATIN CAPITAL LETTER A

Some of you may notice a U+ffa1 that faces the opposite direction of the other KIYEOK values -- no font seems to want to claim this reversed character, which is very weird. But it is right in Arial Unicode MS so there is some random font GDI is getting a glyph from sometimes that is just wrong....

Anyway, looking at the sort keys, in English vs. in Korean:

en-US  ᄀ     80 02 01 01 06 01 01 00 (or 52 02 01 01 06 01 01 00 on Server 2003)
       ㄱ    80 02 01 01 07 01 01 00 (or 52 02 01 01 07 01 01 00 on Server 2003)
       ㈀    80 02 01 0a 01 01 01 00 (or 52 02 01 0a 01 01 01 00 on Server 2003)
       ㉠    80 02 01 0c 01 01 01 00 (or 52 02 01 0c 01 01 01 00 on Server 2003)
       가    80 03 01 01 01 01 00    (or 52 03 01 01 01 01 00 on Server 2003)
       ㈎    80 03 01 0a 01 01 01 00 (or 52 03 01 0a 01 01 01 00 on Server 2003)
       ㉮    80 03 01 0c 01 01 01 00 (or 52 03 01 0c 01 01 01 00 on Server 2003)
       伽    9f 60 01 01 01 01 00 
       佳    9f 96 01 01 01 01 00
       A     0e 02 01 01 12 01 01 00

ko-KR  ᄀ     0e 02 01 01 06 01 01 00
       ㄱ    0e 02 01 01 07 01 01 00
       ㈀    0e 02 01 0a 01 01 01 00
       ㉠    0e 02 01 0c 01 01 01 00
       가    0e 03 01 01 01 01 00
       ㈎    0e 03 01 0a 01 01 01 00
       ㉮    0e 03 01 0c 01 01 01 00
       伽    0e 03 01 41 01 01 01 00
       佳    0e 03 01 43 01 01 01 00
       A     80 02 01 01 12 01 01 00

Notice the changes that literally swap the first letter in the DEFAULT table (A) with the first character in the Korean table in the Korean sort? It is easy if you live in the USA and English is your native language to claim that the script positions are arbitrary and use that as the basis to object to this re-ordering in Korean, but I can understand the original desire to put Korean first in a Korean sort (as someone who happens to be from a place that has always been listed first and a person whose last name never shows up first or last in an alphabetized list!).
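For anyone who wants to see the swap rather than squint at hex, here is a minimal Python sketch. It assumes only that sort keys are compared byte by byte (which is how Windows sort keys are meant to be compared) and it copies the Vista-era values straight out of the two tables above. The names SORT_KEYS and ordered are made up for this illustration; nothing here calls into Windows.

```python
# Sort key bytes copied from the en-US (Vista values) and ko-KR tables above.
# Keys are labeled by code point so the listing does not depend on font rendering.
SORT_KEYS = {
    "en-US": {
        "U+ffa1": "80 02 01 01 06 01 01 00",
        "U+3131": "80 02 01 01 07 01 01 00",
        "U+3200": "80 02 01 0a 01 01 01 00",
        "U+3260": "80 02 01 0c 01 01 01 00",
        "U+ac00": "80 03 01 01 01 01 00",
        "U+320e": "80 03 01 0a 01 01 01 00",
        "U+326e": "80 03 01 0c 01 01 01 00",
        "U+4f3d": "9f 60 01 01 01 01 00",
        "U+4f73": "9f 96 01 01 01 01 00",
        "U+0041": "0e 02 01 01 12 01 01 00",
    },
    "ko-KR": {
        "U+ffa1": "0e 02 01 01 06 01 01 00",
        "U+3131": "0e 02 01 01 07 01 01 00",
        "U+3200": "0e 02 01 0a 01 01 01 00",
        "U+3260": "0e 02 01 0c 01 01 01 00",
        "U+ac00": "0e 03 01 01 01 01 00",
        "U+320e": "0e 03 01 0a 01 01 01 00",
        "U+326e": "0e 03 01 0c 01 01 01 00",
        "U+4f3d": "0e 03 01 41 01 01 01 00",
        "U+4f73": "0e 03 01 43 01 01 01 00",
        "U+0041": "80 02 01 01 12 01 01 00",
    },
}


def ordered(locale):
    """Return the code point labels in the order their sort keys compare."""
    # bytes.fromhex ignores the spaces; bytes objects compare
    # lexicographically as unsigned values, like memcmp.
    keys = {cp: bytes.fromhex(hx) for cp, hx in SORT_KEYS[locale].items()}
    return sorted(keys, key=keys.get)


if __name__ == "__main__":
    for locale in ("en-US", "ko-KR"):
        print(locale, " < ".join(ordered(locale)))
```

With the values above, U+0041 comes out first for en-US and last for ko-KR, while the Hangul and Han entries keep the same relative order in both.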

Notice also how the Hangul behave the same way fairly consistently in every case, even if the Hanja only follow along for Korean.

And finally, did you see the change in Server 2003? Although no substantive comment or bug or spec came with the change, it happened in the same checkin as the one I mentioned in Every character has a story #29: U+1000^H^H^H^H0f40, (TIBETAN or MYANMAR LETTER KA, depending on when you ask). Unlike the parts I discussed in that post, the changes here have very little to do with Unicode conformance; they relate to a desire to not have Hangul sort in the same weight range that is shared among all of the East Asian collations (because even though Hangul does not mean much within Chinese or Japanese, it can still be unsettling if you have Hangul within your data and you see it showing up randomly inside your Han or your Kanji).

And I know that the upset is not just my assumption, since after this change was reverted in Vista (we could not afford that much space in the weight table to support such a complete non-scenario), we received several internal bug reports from people who had such data (either accidentally or as a set of test data) who were curious about the strange interspersion of Korean....

BONUS INFO (feel free to file it under "How to be incompatible with everybody!"):

SQL Server has managed to skip all of this shifting around done in Server 2003 and Vista since its tables are based on the Win2000 off-Beta data (as I explained in detail here); on the negative side of that is the fact that SQLCLR results between SQL Server and the Common Language Runtime are completely inconsistent in every possible way (since the managed tables are based on Server 2003 plus losing the "Korean first" code)....

 

This post brought to you by ⑫ (U+246b, a.k.a. CIRCLED NUMBER TWELVE)



