by Michael S. Kaplan, published on 2006/07/22 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/07/22/674270.aspx
(Apologies to Mel Brooks for the borrowing of the song from History of the World, Pt 1!)
Back when I posted about Traditional versus modern sorts, I mentioned that:
As an aside, one could perhaps argue that the whole LVT -- leading/vowel/trailing -- mechanism used in discussions about Jamo/Hangul collation is an artifact of implementations -- and that the reason that ᄀ (U+1100) and ᆨ (U+11a8) look the same is that they are the same -- note that Choi Sejin's order did not include two separate letters here to handle whether a consonant was leading or trailing?
If you look at the keyboard that is used in the Korean IMEs that ship with Windows/Office (shown here for the base state and the shift state):
it seems pretty clear that I was right. Native speakers of Korean do not distinguish the leading and trailing consonants, so as long as the input method knows what it is doing, you just enter your Jamo and it figures out where things go in an intelligent manner.
So let's look at how that works. Take, for example, 삫 (U+c0ab, a.k.a. HANGUL SYLLABLE SSANGPIEUP I HIEUH). If we take this precomposed Hangul syllable and convert it to Normalization Form D, we get:
삫
or
U+1108 U+1175 U+11c2
or
HANGUL CHOSEONG SSANGPIEUP
HANGUL JUNGSEONG I
HANGUL JONGSEONG HIEUH
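If you want to check that decomposition yourself, here is a minimal sketch using Python's standard unicodedata module (my choice of tool for illustration, not anything Windows-specific):

```python
import unicodedata

syllable = "\uc0ab"  # the precomposed Hangul syllable 삫

# Normalization Form D splits the syllable into its conjoining Jamo.
decomposed = unicodedata.normalize("NFD", syllable)

for ch in decomposed:
    print(f"U+{ord(ch):04x}  {unicodedata.name(ch)}")

# Form C puts the pieces back together into the single code point.
recomposed = unicodedata.normalize("NFC", decomposed)
```

The loop prints one line per Jamo, with the code point followed by its Unicode character name.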
Now if I want to type this on the IME, I would type Q:
followed by l (that is a lowercase L):
followed by a g:
So, in a weird way the user is kind of typing in Jamo, although a simplified form of Jamo that does not distinguish ᄒ (U+1112, HANGUL CHOSEONG HIEUH) from ᇂ (U+11c2, HANGUL JONGSEONG HIEUH). It is up to the IME to take what is typed and figure out what is meant.
Of course this raises the question -- why couldn't Unicode have encoded things this way? :-)
You can also look at the characters in progress while you are typing; if you stop after the first or the second keystroke you are given ㅃ (U+3143) or 삐 (U+c090), respectively. Clearly, the IME, while always expecting Jamo from the user, is never outputting combining Jamo....
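To make the "IME figures it out" idea concrete, here is a rough sketch of what the composition could look like for this one syllable. The three-entry key table and the compose_preview function are hypothetical names of my own; the only thing borrowed from the standard is the Hangul syllable arithmetic (base 0xAC00, 21 vowels, 28 trailing slots). A real IME handles far more cases than this.

```python
S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28

# Hypothetical fragment of a two-set key table. Each key maps to
# (compatibility Jamo, leading index, vowel index, trailing index);
# None means the Jamo cannot fill that slot.
KEYS = {
    "Q": ("\u3143", 8, None, None),   # ㅃ, leading only
    "l": ("\u3163", None, 20, None),  # ㅣ, vowel
    "g": ("\u314e", 18, None, 27),    # ㅎ, leading OR trailing: the IME decides
}

def compose_preview(keys):
    """Return what the user would see after each keystroke (one syllable only)."""
    lead = vowel = None
    trail = 0
    previews = []
    for key in keys:
        jamo, l_idx, v_idx, t_idx = KEYS[key]
        if lead is None and l_idx is not None:
            lead = l_idx
            previews.append(jamo)  # a lone consonant shows as a compatibility Jamo
            continue
        if lead is not None and vowel is None and v_idx is not None:
            vowel = v_idx
        elif vowel is not None and trail == 0 and t_idx is not None:
            trail = t_idx  # same key as a leading HIEUH, but used as trailing here
        else:
            raise ValueError(f"this sketch cannot place key {key!r}")
        previews.append(chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT + trail))
    return previews

print(compose_preview("Qlg"))
```

Note that nothing in the output is a conjoining Jamo: the in-progress consonant is a compatibility Jamo, and everything after that is a precomposed syllable.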
In any case, and now we get to why I arbitrarily chose this particular Hangul syllable, let's look at the sort key for this code point in XP:
23 11 01 01 01 01 00
Let's look at a particular Old Hangul syllable made up of the following sequences that are legal according to the table from the OpenType site that lists the 121 legal Old Hangul sequences:
1107 1109 1110 116d 1161 1175 11b8 11ba 11ae
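For anyone who does not read Jamo code points by sight, a small sketch (again leaning on Python's unicodedata module, which is my assumption about what you have handy) lists the character names and shows that the nine code points fall into three leading, three vowel, and three trailing Jamo:

```python
import unicodedata

code_points = "1107 1109 1110 116d 1161 1175 11b8 11ba 11ae".split()
jamo = [chr(int(cp, 16)) for cp in code_points]
names = [unicodedata.name(ch) for ch in jamo]

for cp, name in zip(code_points, names):
    print(f"U+{cp}  {name}")

# The second word of each name is CHOSEONG (leading), JUNGSEONG (vowel),
# or JONGSEONG (trailing).
roles = [name.split()[1] for name in names]
```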
If you have a font that will compose the combining Jamo into Old Hangul syllables (which is I admit no mean feat -- the shaping support exists in XP SP2 and Vista but it only works if you have a font and that is anything but easy), it will look something like one of these, depending on the font style:
(The fonts above are using the Gulim, Batang, Dotum, and Gungsuh styles, respectively)
On an unrelated but unfortunate note: although Notepad has no problem properly treating the whole syllable as a single unit for the purposes of cursor navigation and selection, WordPad would only select or move past the syllable when the first Jamo was selected (literally requiring nine clicks of the arrow key to move past the syllable). This is true in Vista as well, not just XP SP2. Hmmm....
Of course if you do not have such an Old Hangul font, it will look more like:
ᄇᄉ툐ᅡᅵᆸᆺᆮ
as I discussed in Theory vs. practice for Korean text collation.
And in case that was not enough of a blocker, I was unable to make any of the IMEs I had available to me (on any version of Windows) type the Old Hangul syllable. Thus, the problem with a "smart" IME is obvious when you want it to do something that it is not smart enough to do. :-)
Now in any case, if you look at the sort key for this syllable, it is
23 11 ff 37 ff 26 ff 58 01 01 01 01 00
Compare that to the one we got earlier:
23 11 01 01 01 01 00
and this Old Hangul syllable will basically sort after this precomposed Hangul syllable.
Now the pieces of the weight that fit after the 0xff sentinels are for the Leading, Vowel, and Trailing pieces of the syllable. For better or worse, this particular syllable (created via the process of choosing random long entries from that OpenType appendix and putting them together Mr. Potato Head style) will sort after U+c0ab.
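Since sort keys are meant to be compared byte by byte, the ordering can be checked with nothing more than the two byte strings quoted above. A small sketch follows; reading each byte after a 0xff sentinel as one extra weight is my assumption based on the description here, not a documented layout.

```python
modern_key = bytes.fromhex("23 11 01 01 01 01 00")
old_key = bytes.fromhex("23 11 ff 37 ff 26 ff 58 01 01 01 01 00")

# Python compares bytes objects lexicographically, like a binary compare.
old_sorts_after = modern_key < old_key

# Index of the first byte where the two keys differ.
first_diff = next(
    i for i, (a, b) in enumerate(zip(modern_key, old_key)) if a != b
)

# Assumption: each byte that follows a 0xff sentinel is one extra weight
# (Leading, Vowel, Trailing, in that order).
extra_weights = [old_key[i + 1] for i, b in enumerate(old_key[:-1]) if b == 0xFF]

print(old_sorts_after, first_diff, [f"{w:02x}" for w in extra_weights])
```

The keys share their first two bytes; the comparison is decided at the third byte, where the first 0xff sentinel is larger than the 0x01 separator.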
So, did I have a point here? Well, I guess you could say that Old Hangul appears to be difficult at the moment, and the exact source of the solution seems to be elusive since it involves help from both typographers and creators of IMEs.
There are over 5,000 Level 1 Old Hangul syllables according to recent documents I have seen, and in theory there are many, many more, so a generative model seems ideal here: a smart IME that knows a bit more about how to put the Jamo together (in this case, any time one types an L after an L, a V after a V, or a T after a T, the IME should just keep on composing)....
This post brought to you by 삫 (U+c0ab, a.k.a. HANGUL SYLLABLE SSANGPIEUP I HIEUH).
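The "keep on composing" rule is simple enough to sketch. The version below works on conjoining Jamo that are already marked as leading, vowel, or trailing (which sidesteps the hard part a real IME faces, namely deciding the role of each consonant), and the function names are hypothetical. It only shows the grouping rule: an L after an L, a V after an L or a V, and a T after a V or a T all stay in the same syllable.

```python
def jamo_role(ch):
    """Classify a conjoining Jamo by its code point range in U+1100..U+11FF."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF:
        return "T"
    raise ValueError(f"not a conjoining Jamo: U+{cp:04x}")

# (previous role, current role) pairs that continue the current syllable.
KEEP_COMPOSING = {("L", "L"), ("L", "V"), ("V", "V"), ("V", "T"), ("T", "T")}

def group_syllables(text):
    """Split a run of conjoining Jamo into syllable-sized groups."""
    groups = []
    prev = None
    for ch in text:
        role = jamo_role(ch)
        if groups and (prev, role) in KEEP_COMPOSING:
            groups[-1] += ch
        else:
            groups.append(ch)  # anything else starts a new syllable
        prev = role
    return groups

old = "".join(chr(cp) for cp in (0x1107, 0x1109, 0x1110, 0x116D, 0x1161,
                                 0x1175, 0x11B8, 0x11BA, 0x11AE))
modern = "\u1108\u1175\u11c2"  # the decomposed form of U+c0ab
groups = group_syllables(old + modern)
print([len(g) for g in groups])
```

The nine-Jamo Old Hangul sequence stays together as one group, and the leading consonant that follows it starts the next syllable.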