by Michael S. Kaplan, published on 2006/03/12 13:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/12/549951.aspx
Although the model for collation would be simpler if it never changed, the fact is that changes do happen, so it is important to capture that change.
I won't talk about the Spanish case today, though that one is interesting for other reasons -- stay tuned for a future blog post. :-)
But there are several other interesting ones to ponder....
Georgian is a good example -- there are four letters that do not appear in modern use but can appear in older documents. So the 'modern' sort puts these four characters at the end of the alphabet rather than interspersing them in the traditional order.
Those four characters are:
U+10f1 ჱ GEORGIAN LETTER HE
U+10f2 ჲ GEORGIAN LETTER HIE
U+10f3 ჳ GEORGIAN LETTER WE
U+10f4 ჴ GEORGIAN LETTER HAR
These are of course the modern Mkhedruli Georgian characters; in theory you would also want to handle the Khutsuri and Nuskhuri in a similar way (all three scripts discussed here).
In practice, though, a modern sort's handling of archaic characters inside of script subranges used only in archaic contexts is, to say the least, questionable. In my opinion, at least. :-)
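To make the difference between the two Georgian orderings concrete, here is a minimal Python sketch. It is not the Windows implementation. It assumes that the 33 modern Mkhedruli letters sort in code point order (U+10D0 through U+10F0), and that the traditional alphabet places HE after ZEN, HIE after NAR, WE after TAR, and HAR after XAN. The names `ARCHAIC_AFTER`, `modern_order`, `traditional_order`, and `sort_key` are made up for illustration.

```python
# Modern Mkhedruli letters U+10D0..U+10F0, assumed to sort in code point order.
MODERN = [chr(cp) for cp in range(0x10D0, 0x10F1)]

# Archaic letter -> the letter it follows in the traditional alphabet.
# These placements are an assumption of this sketch, not taken from Windows.
ARCHAIC_AFTER = {
    "\u10f1": "\u10d6",  # HE after ZEN
    "\u10f2": "\u10dc",  # HIE after NAR
    "\u10f3": "\u10e2",  # WE after TAR
    "\u10f4": "\u10ee",  # HAR after XAN
}

def modern_order():
    """'Modern' sort: the four archaic letters go at the end."""
    return MODERN + list(ARCHAIC_AFTER)

def traditional_order():
    """'Traditional' sort: the archaic letters are interspersed."""
    follows = {anchor: archaic for archaic, anchor in ARCHAIC_AFTER.items()}
    order = []
    for letter in MODERN:
        order.append(letter)
        if letter in follows:
            order.append(follows[letter])
    return order

def sort_key(order):
    """Build a key function that ranks strings letter by letter."""
    rank = {letter: i for i, letter in enumerate(order)}
    # Characters outside the table sort after every Georgian letter.
    return lambda s: [rank.get(ch, len(rank) + ord(ch)) for ch in s]

letters = ["\u10d7", "\u10f1", "\u10d6"]  # TAN, HE, ZEN
print(sorted(letters, key=sort_key(modern_order())))       # ZEN, TAN, HE
print(sorted(letters, key=sort_key(traditional_order())))  # ZEN, HE, TAN
```

A real collation would also need weights for case, punctuation, and the older scripts; this only shows how moving four letters changes the result.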
These two sorts are supported by Windows -- 0x0437 for the traditional sort and 0x10437 for the modern one.
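The two values differ only in the bits above the low 16. In the documented LCID layout the low 16 bits hold the language identifier and the next 4 bits hold the sort identifier, so a small Python sketch can pull the pieces apart. The function name `split_lcid` is hypothetical; on Windows itself the equivalent macros are `LANGIDFROMLCID` and `SORTIDFROMLCID`.

```python
def split_lcid(lcid):
    """Split an LCID into (language identifier, sort identifier)."""
    lang_id = lcid & 0xFFFF        # low 16 bits: language identifier
    sort_id = (lcid >> 16) & 0xF   # next 4 bits: sort identifier
    return lang_id, sort_id

for lcid in (0x0437, 0x10437):
    lang_id, sort_id = split_lcid(lcid)
    print(f"LCID 0x{lcid:08x}: language 0x{lang_id:04x}, sort {sort_id}")
```

Both LCIDs share language identifier 0x0437 (Georgian); only the sort identifier changes, from 0 to 1.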
Now if you look at Korean Jamo you have a different situation.
The original ordering was first described by Choi Sejin (Wikipedia link) in the year 1527; it goes something like this:
ㄱ ㄴ ㄷ ㄹ ㅁ ㅂ ㅅ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ ㅏ ㅑ ㅓ ㅕ ㅗ ㅛ ㅜ ㅠ ㅡ ㅣ
Now this ordering was created before several other innovations such as the "double consonants" were added to the language, which creates a real question for where the new Jamo should be added to 'alphabetical order.'
Is it most faithful to Choi Sejin's classical ordering (which is highly respected) to put the new Jamo at the end, or to intersperse them in the appropriate places next to the Jamo that they relate to?
An interesting question, one with many linguistic, philosophical, and historical issues tied up with it.
As an aside, one could perhaps argue that the whole LVT -- leading/vowel/trailing -- mechanism used in discussions about Jamo/Hangul collation is an artifact of implementations -- and that the reason that ᄀ (U+1100) and ᆨ (U+11a8) look the same is that they are the same -- note that Choi Sejin's order did not include two separate letters here to handle whether a consonant was leading or trailing.
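For readers who want to see the LVT split in practice, the following Python sketch uses only the standard `unicodedata` module. It shows that Unicode encodes the leading and trailing KIYEOK as two separate characters, and that a leading consonant, a vowel, and a trailing consonant compose into a single precomposed syllable under NFC normalization.

```python
import unicodedata

lead, vowel, trail = "\u1100", "\u1161", "\u11a8"

# Same letter shape, two code points with two different character names.
print(unicodedata.name(lead))    # HANGUL CHOSEONG KIYEOK
print(unicodedata.name(trail))   # HANGUL JONGSEONG KIYEOK

# L + V + T composes into one precomposed Hangul syllable under NFC...
syllable = unicodedata.normalize("NFC", lead + vowel + trail)
print(f"U+{ord(syllable):04X}")  # U+AC01

# ...and NFD takes the syllable apart into the same three Jamo again.
print(unicodedata.normalize("NFD", syllable) == lead + vowel + trail)  # True
```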
Ok, back to that other interesting question. :-)
In South Korea, the decision was made to do the interspersing, a choice which one could argue has a more linguistic basis (on the other hand one could make the same argument for the phonemic decision in Lithuanian!). In North Korea, on the other hand, the decision was made to put most of the new Jamo at the end.
Which of course means that this involves not only linguistic, philosophical, and historical issues, but political issues as well....
Now since the 11,172 modern Hangul Syllables are actually built from these Jamo, this "small" question would have a marked impact on the sorting of Hangul. Not being a native speaker/writer/reader of Korean I cannot say for sure, but I do wonder how easy it is to work with one order if one learned with the other....
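A rough Python sketch of why the Jamo order carries through to the syllables: the 11,172 syllables are laid out arithmetically as 19 leading consonants x 21 vowels x 28 trailing slots (including "no trailing consonant"), starting at U+AC00, so each syllable can be split into its three indexes and re-ranked. The Unicode layout itself is standard. The `END_ORDER` string, which puts the double consonants and ieung after hieuh, is an assumption made for illustration, and the sketch ignores any differences in vowel or trailing consonant order. The helper names are hypothetical.

```python
S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # 19 * 21 * 28 == 11172

def decompose(syllable):
    """Return the (leading, vowel, trailing) indexes of a Hangul syllable."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, trail = divmod(rest, T_COUNT)
    return lead, vowel, trail

# Leading consonants in Unicode order (double consonants interspersed).
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
# Assumed alternative order with the newer Jamo moved to the end.
END_ORDER = "ㄱㄴㄷㄹㅁㅂㅅㅈㅊㅋㅌㅍㅎㄲㄸㅃㅆㅉㅇ"

def interspersed_key(syllable):
    return decompose(syllable)

def end_key(syllable):
    lead, vowel, trail = decompose(syllable)
    return END_ORDER.index(LEADS[lead]), vowel, trail

syllables = ["\ub098", "\uae4c", "\uac00"]    # na, kka, ga
print(sorted(syllables, key=interspersed_key))  # ga, kka, na
print(sorted(syllables, key=end_key))           # ga, na, kka
```

Every syllable that starts with a double consonant moves as a block, which is why such a "small" change in Jamo order reshuffles thousands of syllables.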
In Windows, only the option that intersperses the Jamo is supported. At the present time there are also too many political issues tied up in the question to allow any other option to be chosen.
Though I admit to some curiosity about how recognizable the other ordering would be in practice to a child, or to an adult, in South Korea. Would it be as jarring as the Lithuanian collation ("Y sorts just after I" rather than after X) would be to a native English speaker, only more so since it affects such a greater number of characters?
This post brought to you by "ᆨ" (U+11a8, a.k.a. HANGUL JONGSEONG KIYEOK)
(as distinguished from "ᄀ" U+1100 a.k.a. HANGUL CHOSEONG KIYEOK, of course!)
# Michael S. Kaplan on 12 Mar 2006 1:20 PM:
pm on 7 Oct 2008 3:51 AM:
How about Spanish? I don't speak Spanish. All I want to find out is which sort (traditional or modern) is the most common.
Michael S. Kaplan on 7 Oct 2008 9:10 AM:
Hey pm -- did you look at the fourth comment down the page? It links to a blog that talks about Traditional vs. Modern Spanish....