We're off on the road to Korea! We certainly do get around...

by Michael S. Kaplan, published on 2006/07/22 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/07/22/674270.aspx


(Apologies to Mel Brooks for the borrowing of the song from History of the World, Pt 1!)

Back when I posted about Traditional versus modern sorts, I mentioned that:

As an aside, one could perhaps argue that the whole LVT -- leading/vowel/trailing -- mechanism used in discussions about Jamo/Hangul collation is an artifact of implementations -- and that the reason that ᄀ (U+1100) and ᆨ (U+11a8) look the same is that they are the same -- note that Choi Sejin's order did not include two separate letters here to handle whether a consonant was leading or trailing?

If you look at the keyboard that is used in the Korean IMEs that ship with Windows/Office (shown here for the base state and the shift state):

it seems pretty clear that I was right. Native speakers of Korean do not distinguish the leading and trailing consonants, so as long as the input method knows what it is doing, you just enter your Jamo and it figures out where things go in an intelligent manner.

So let's look at how that works. Take for example 삫 (U+c0ab, a.k.a. HANGUL SYLLABLE SSANGPIEUP I HIEUH). Now if we take this precomposed Hangul syllable and convert it to Normalization Form D, we get:

삫

or

U+1108 U+1175 U+11c2

or

HANGUL CHOSEONG SSANGPIEUP
HANGUL JUNGSEONG I
HANGUL JONGSEONG HIEUH
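The decomposition above can be reproduced in a few lines. This is a minimal sketch in Python (not the tool used for the post): it asks the standard `unicodedata` module for Form D, and then redoes the same thing with the arithmetic decomposition from the Unicode Standard. The helper name `decompose_hangul` is made up for illustration.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11,172 precomposed syllables


def decompose_hangul(ch):
    """Arithmetically split one precomposed syllable into combining Jamo.

    Hypothetical helper name; the arithmetic itself is the standard one.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_BASE + s_index // (V_COUNT * T_COUNT)
    vowel = V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    parts = [lead, vowel]
    if trail != T_BASE:  # T_BASE itself means "no trailing consonant"
        parts.append(trail)
    return "".join(chr(p) for p in parts)


syllable = "\uc0ab"
nfd = unicodedata.normalize("NFD", syllable)

# Print each code point of the Form D result with its character name.
for ch in nfd:
    print(f"U+{ord(ch):04x} {unicodedata.name(ch)}")

# The library result and the hand-rolled arithmetic should agree.
print(nfd == decompose_hangul(syllable))
```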

Now if I want to type this on the IME, I would type Q:

 

followed by l (that is a lowercase L):

followed by a g:

So, in a weird way, the user is kind of typing in Jamo, although a simplified form of Jamo that does not distinguish ᄒ (U+1112, HANGUL CHOSEONG HIEUH) from ᇂ (U+11c2, HANGUL JONGSEONG HIEUH). It is up to the IME to take what is typed and figure out what is meant.

Of course this begs the question -- why couldn't Unicode have encoded things this way? :-)

You can also look at the characters in progress while you are typing; if you stop after the first or second keystroke, you are given ㅃ (U+3143) or 삐 (U+c090), respectively. Clearly, the IME, while always expecting Jamo from the user, is never outputting combining Jamo....
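To make the "figure out what is meant" part concrete, here is a toy sketch in Python of what an IME might do for this one syllable. It is in no way the actual Windows IME logic. The tables cover only the three keys used above, every name in it (`KEY_TO_JAMO`, `compose_in_progress`, and so on) is invented for illustration, and it ignores everything a real IME has to handle, such as multiple syllables or moving a trailing consonant to the next syllable when a vowel follows.

```python
# Keys from the walkthrough above, mapped to compatibility Jamo
# (U+3143, U+3163, U+314E). Only these three keys are modeled.
KEY_TO_JAMO = {"Q": "\u3143", "l": "\u3163", "g": "\u314e"}

# Position-specific indexes used by the Hangul syllable arithmetic.
# Note that the same typed consonant (U+314E) appears in both the
# leading and the trailing table; the user never chooses between them.
LEAD_INDEX = {"\u3143": 8, "\u314e": 18}
VOWEL_INDEX = {"\u3163": 20}
TRAIL_INDEX = {"\u314e": 27}


def render(lead, vowel, trail):
    """Text shown for the syllable in progress."""
    if vowel is None:
        return lead  # a lone consonant is shown as compatibility Jamo
    code = 0xAC00 + (LEAD_INDEX[lead] * 21 + VOWEL_INDEX[vowel]) * 28
    if trail is not None:
        code += TRAIL_INDEX[trail]
    return chr(code)  # always a precomposed syllable, never combining Jamo


def compose_in_progress(keys):
    """Return what is displayed after each keystroke of ONE syllable."""
    lead = vowel = trail = None
    shown = []
    for key in keys:
        jamo = KEY_TO_JAMO[key]
        if lead is None and jamo in LEAD_INDEX:
            lead = jamo
        elif lead is not None and vowel is None and jamo in VOWEL_INDEX:
            vowel = jamo
        elif vowel is not None and trail is None and jamo in TRAIL_INDEX:
            trail = jamo
        else:
            raise ValueError(f"this toy cannot place key {key!r}")
        shown.append(render(lead, vowel, trail))
    return shown


for text in compose_in_progress("Qlg"):
    print(text, " ".join(f"U+{ord(c):04x}" for c in text))
```

The position of each keystroke, not the key itself, decides whether a consonant becomes leading or trailing, which is the simplification the keyboard layout relies on.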

In any case, and now we get to why I arbitrarily chose this particular Hangul syllable, let's look at the sort key for this code point in XP:

23 11 01 01 01 01 00

Let's look at a particular Old Hangul syllable made up of the following sequences that are legal according to the table from the OpenType site that lists the 121 legal Old Hangul sequences:

1107 1109 1110 116d 1161 1175 11b8 11ba 11ae

If you have a font that will compose the combining Jamo into Old Hangul syllables (which is I admit no mean feat -- the shaping support exists in XP SP2 and Vista but it only works if you have a font and that is anything but easy), it will look something like one of these, depending on the font style:

(The fonts above are using the Gulim, Batang, Dotum, and Gungsuh styles, respectively)

On an unrelated but unfortunate note: although Notepad has no problem properly treating the whole syllable as a single unit for the purposes of cursor navigation and selection, WordPad would only select or move past the syllable when the first Jamo was selected (literally requiring nine clicks of the arrow key to move past the syllable). This is true in Vista as well, not just XP SP2. Hmmm....

Of course if you do not have such an Old Hangul font, it will look more like:

ᄇᄉ툐ᅡᅵᆸᆺᆮ

as I discussed in Theory vs. practice for Korean text collation.

And in case that was not enough of a blocker, I was unable to make any of the IMEs I had available to me (on any version of Windows) type the Old Hangul syllable. Thus, the problem with a "smart" IME is obvious when you want it to do something that it is not smart enough to do. :-)

Now in any case, if you look at the sort key for this syllable, it is

23 11 ff 37 ff 26 ff 58 01 01 01 01 00

Compare that to the one we got earlier:

23 11 01 01 01 01 00

and this Old Hangul syllable will basically sort after this precomposed Hangul syllable.

Now the pieces of the weight that fit after the 0xff sentinels are for the Leading, Vowel, and Trailing pieces of the syllable. For better or worse, this particular syllable (created via the process of choosing random long entries from that OpenType appendix and putting them together Mr. Potato Head style) will sort after U+c0ab.
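The comparison of the two sort keys can be checked mechanically. The sketch below, in Python, assumes that sort keys are compared byte by byte from the left, and that each 0xff sentinel is followed by exactly one weight byte (that second assumption is only a reading of the two keys shown here, not a documented format). The helper name `sentinel_weights` is made up.

```python
# The two sort keys quoted above, as raw bytes.
modern_key = bytes.fromhex("23 11 01 01 01 01 00")
old_hangul_key = bytes.fromhex("23 11 ff 37 ff 26 ff 58 01 01 01 01 00")


def first_difference(a, b):
    """Index of the first byte where the two keys differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


def sentinel_weights(key):
    """Bytes that directly follow each 0xff sentinel (assumed layout)."""
    return [key[i + 1] for i in range(len(key) - 1) if key[i] == 0xFF]


# Python compares bytes objects lexicographically, byte by byte.
print(modern_key < old_hangul_key)

# The keys share the 23 11 prefix and diverge at the first sentinel.
print(first_difference(modern_key, old_hangul_key))

# The assumed Leading, Vowel, and Trailing weights of the Old Hangul key.
print([f"{w:02x}" for w in sentinel_weights(old_hangul_key)])
```

Under those assumptions, the 0xff sentinel is what pushes the Old Hangul syllable after the precomposed one: it is compared against the 0x01 that ends the first weight level of the shorter key.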

So, did I have a point here? Well, I guess you could say that Old Hangul appears to be difficult at the moment, and the exact source of the solution seems to be elusive since it involves help from both typographers and creators of IMEs.

There are over 5,000 Level 1 Old Hangul syllables according to recent documents I have seen, and in theory there are many, many more, so a generative model seems ideal here. With a smart IME that knows a bit more about how to put the Jamo together (in this case, any time one types an L after an L, a V after a V, or a T after a T, the IME should just keep on composing)....
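A rough sketch of that "keep on composing" rule, in Python: classify each combining Jamo as L, V, or T by its position in the Hangul Jamo block (U+1100 to U+11FF), then merge consecutive Jamo of the same class into one run. The function names are invented for illustration, the ranges ignore the extended Jamo blocks added to Unicode later, and nothing here checks whether a run is one of the legal Old Hangul sequences from the OpenType table.

```python
from itertools import groupby


def jamo_class(ch):
    """Classify a combining Jamo by its range in the Hangul Jamo block."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"  # leading consonants (choseong)
    if 0x1160 <= cp <= 0x11A7:
        return "V"  # vowels (jungseong)
    if 0x11A8 <= cp <= 0x11FF:
        return "T"  # trailing consonants (jongseong)
    raise ValueError(f"U+{cp:04x} is not a combining Jamo")


def group_runs(text):
    """Merge consecutive Jamo of the same class into (class, run) pairs."""
    return [
        (cls, [f"{ord(c):04x}" for c in run])
        for cls, run in groupby(text, key=jamo_class)
    ]


def is_single_syllable(text):
    """True if the runs form exactly L+ V+ or L+ V+ T+."""
    classes = [cls for cls, _ in group_runs(text)]
    return classes in (["L", "V"], ["L", "V", "T"])


# The nine-Jamo Old Hangul sequence used earlier in this post.
old_hangul = "".join(
    chr(int(cp, 16))
    for cp in "1107 1109 1110 116d 1161 1175 11b8 11ba 11ae".split()
)

for cls, run in group_runs(old_hangul):
    print(cls, " ".join(run))
print(is_single_syllable(old_hangul))
```

An IME built this way would still need the list of legal sequences (or a font that can shape arbitrary ones) before the result is of much use, which is the typographer half of the problem.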

 

This post brought to you by 삫 (U+c0ab, a.k.a. HANGUL SYLLABLE SSANGPIEUP I HIEUH).


# Dean Harding on 23 Jul 2006 7:28 PM:

My girlfriend is Korean, and she's been teaching me Hangul. It's an awesome language - one of the most logical written-languages you can think of. It's completely phonetic! It took me about 3 days to learn how to read it (though it's taking much longer to learn how to speak it!!)

You're also right that Koreans don't distinguish between the leading consonant and the trailing consonants. I didn't even know they had different names.

By the way, where did you get those pictures of the keyboard? I've been trying to get an on-screen Korean keyboard so I can learn to type it as well but I guess I didn't really look very hard :)

# Michael S. Kaplan on 23 Jul 2006 8:26 PM:

Hi Dean,

I actually cheated -- I used MSKLC but I filled in the jamo myself and then took a screenshot and trimmed it to just the area with the keys.

Or if you meant to ask where I got the contents of the layout itself, I think I technically own the DLLs for the layouts. :-)

# Dean Harding on 23 Jul 2006 9:24 PM:

Oh no, I just meant the picture... thanks anyway, it'll be a great help to me for learning the layout :-)

# Michael S. Kaplan on 23 Jul 2006 10:03 PM:

Oh okay, I probably should have put the punctuation in too, in that case. :-)

# Dean Harding on 24 Jul 2006 9:24 AM:

Thank you muchly!

One other thing I noticed, on XP SP2 at work, the U+1108 U+1175 U+11c2 sequence appears as three separate characters, but in Vista they're composed - has anything changed there? I've got complex scripts turned on in XP SP2, so it should be the same, right? Or maybe there just exists different fonts in Vista...

# Michael S. Kaplan on 24 Jul 2006 10:48 AM:

Hmmm.... some work was done in Vista to do the composing when you have text with the combining Jamo. But it is limited to characters that actually exist in the font unless you find a font that supports Old Hangul composition/shaping....

# Dave Smith on 10 Aug 2006 1:44 PM:

BTW, the "We're off on the road to x! We certainly do get around..." songs were, I believe, originally from the Bing Crosby/Bob Hope "Road to" movies (Road to Morocco, etc.).  Mel Brooks just 'borrowed' it.

Cheers,
Dave

