A&P of Sort Keys, part 14: The Hangul is really getting OLD

by Michael S. Kaplan, published on 2008/08/21 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/08/21/8883467.aspx

Previous posts in this series before the long unexplained hiatus:

Part 0: The empty string sorts the same in every language
Part 1: The law of the letter -- e.g. Latin < Greek < Cyrillic
Part 2: The string that won? Didn't have a mark on him!
Part 3: Should you let a string make it's case? If so, Y?
Part 4: It isn't a race but let's make an EXCEPTION and cross the Finnish line
Part 5: EXPANSIONing your horizons
Part 6: Relax, be calm, and deCOMPRESS if you are feeling out of sorts
Part 7: You're very thin now, but I can still recognize you
Part 8: You can often think of ignoring weights as a form of ignorance
Part 9: Not always transitive, but punctual and punctuating
Part 10: I've kana wanted to start talking about Japanese
Part 11: It's not like ideographic sorts were developed idiopathically
Part 12: Han sorts first!
Part 13: About the function that is too lazy to get it right every time

I am not going to try to explain the long unexplained hiatus the series went through, because that would ruin the whole unexplained nature of it. Odds are that 99.99% of the people reading this don't care, anyway. :-)

So one thing I have skipped thus far is the handling for Old Hangul (even when I was talking about Hangul in prior blogs in the series and before it).

The reason was simple.

Though I'm not going to tell you that yet, either. I'll do it near the end.

Now first I'll start with an excerpt from The Unicode Standard (3.12: Conjoining Jamo Behavior):

Hangul Syllable Composition

The following algorithm describes how to take a sequence of canonically decomposed characters D and compose Hangul syllables. Hangul composition and decomposition are summarized here, but for a more complete description, implementers must consult Unicode Standard Annex #15, “Unicode Normalization Forms.” Note that, like other non-jamo characters, any combining mark between two conjoining jamos prevents the jamos from composing.

First, define the following constants:
SBase = AC00₁₆
LBase = 1100₁₆
VBase = 1161₁₆
TBase = 11A7₁₆
SCount = 11172
LCount = 19
VCount = 21
TCount = 28
NCount = VCount * TCount

1. Iterate through the sequence of characters in D, performing steps 2 through 5.

2. Let i represent the current position in the sequence D. Compute the following indices, which represent the ordinal number (zero-based) for each of the components of a syllable, and the index j, which represents the index of the last character in the syllable.

    LIndex = D[i] - LBase
    VIndex = D[i+1] - VBase
    TIndex = D[i+2] - TBase
    j = i + 2

3. If either of the first two characters is out of bounds (LIndex < 0 OR LIndex ≥ LCount OR VIndex < 0 OR VIndex ≥ VCount), then increment i, return to step 2, and continue from there.

4. If the third character is out of bounds (TIndex ≤ 0 or TIndex ≥ TCount), then it is not part of the syllable. Reset the following:

    TIndex = 0
    j = i + 1

5. Replace the characters D[i] through D[j] by the Hangul syllable S, and set i to be j + 1.

    S = (LIndex * VCount + VIndex) * TCount + TIndex + SBase

Example. The first three characters are

    U+1111 ᄑ hangul choseong phieuph
    U+1171 ᅱ hangul jungseong wi
    U+11B6 ᆶ hangul jongseong rieul-hieuh

Compute the following indices:

    LIndex = 17
    VIndex = 16
    TIndex = 15

Replace the three characters as follows:

    S = [(17 * 21) + 16] * 28 + 15 + SBase
       = D4DB₁₆
       = 퓛
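
For anyone who would rather read code than prose, here is a minimal Python sketch of the composition steps quoted above. The function name `compose_hangul` and the choice to work on a list of integer code points are my own assumptions for illustration; only the constants and the arithmetic come from the excerpt.

```python
# Constants as defined in the excerpt from section 3.12.
SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
LCOUNT, VCOUNT, TCOUNT = 19, 21, 28


def compose_hangul(d):
    """Compose L V [T] jamo runs in a list of code points into syllables.

    Hypothetical helper: takes and returns a list of integer code points.
    """
    out = []
    i = 0
    while i < len(d):
        l_index = d[i] - LBASE
        # Treat a missing second character as out of bounds.
        v_index = d[i + 1] - VBASE if i + 1 < len(d) else -1
        if not (0 <= l_index < LCOUNT and 0 <= v_index < VCOUNT):
            out.append(d[i])  # not the start of a syllable, copy as-is
            i += 1
            continue
        # Treat a missing third character as "no trailing consonant".
        t_index = d[i + 2] - TBASE if i + 2 < len(d) else 0
        j = i + 2
        if t_index <= 0 or t_index >= TCOUNT:
            t_index = 0
            j = i + 1
        out.append((l_index * VCOUNT + v_index) * TCOUNT + t_index + SBASE)
        i = j + 1
    return out


# The worked example from the excerpt: U+1111 U+1171 U+11B6.
print([hex(cp) for cp in compose_hangul([0x1111, 0x1171, 0x11B6])])  # ['0xd4db']
```

Characters that do not start an L V pair are copied through unchanged, which is how I read "increment i, return to step 2" for a sequence that is being rewritten in place.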

Okay, there you go.

Notice that one could in theory create an Old Hangul syllable and go through the same kind of algorithm, slightly modified.

In fact the Old Hangul state machine on Windows kind of does.

If you take that theoretical character I mentioned in We're off on the road to Korea! We certainly do get around..., that looks like this:

And is made up of the following Jamo:

1107 1109 1110 116d 1161 1175 11b8 11ba 11ae

Okay, great. So what do we do here?

Well, it runs through a state machine that finds as its closest estimation 삫 (U+c0ab, aka HANGUL SYLLABLE BBIH, aka U+1108 U+1175 U+11c2).
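
The claim that U+c0ab is U+1108 U+1175 U+11c2 can be checked by running the composition arithmetic from section 3.12 in reverse. This is only a sketch: the function name `decompose_hangul` is hypothetical, and the constants are the ones from the Unicode excerpt earlier in this post.

```python
# Constants from the Unicode 3.12 excerpt.
SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
VCOUNT, TCOUNT, SCOUNT = 21, 28, 11172
NCOUNT = VCOUNT * TCOUNT  # 588 syllables per leading consonant


def decompose_hangul(s):
    """Split a precomposed Hangul syllable code point into its jamo."""
    s_index = s - SBASE
    if not 0 <= s_index < SCOUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = LBASE + s_index // NCOUNT
    vowel = VBASE + (s_index % NCOUNT) // TCOUNT
    t_index = s_index % TCOUNT
    # A trailing index of 0 means the syllable has no trailing consonant.
    return [lead, vowel] + ([TBASE + t_index] if t_index else [])


print([hex(cp) for cp in decompose_hangul(0xC0AB)])
# ['0x1108', '0x1175', '0x11c2']
```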

Now I happen to think this example kind of points out a flaw in the state machine, but we will run with it. :-)

The machine does two things for each part of the syllable (Lead, Vowel, Trailing):

It figures out the highest index number (as defined in section 3.12 of the Unicode Standard on Jamo composition) and stores it;
It stores extra information for each third of the Jamo sequence, for later.

Thus it will have a notion, when it is done, of both the modern Hangul syllable it is closest to, and also the extra information that will later end up in the sort key. This sort key basically becomes:

23 11 ff 37 ff 26 ff 58 01 01 01 01 00

and thus the extra information became the 0x37, the 0x26, and the 0x58 in the end of the weight there.
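
To make the layout of those bytes easier to see, here is a small sketch that pulls the extra information back out of the sort key shown above. It rests on assumptions that are mine and not documented behavior: that the bytes before the first 01 form the primary weight, that the first two of them (23 11) belong to the closest modern syllable, and that each ff byte introduces one byte of extra information. The name `split_sort_key` is hypothetical.

```python
SORT_KEY = "23 11 ff 37 ff 26 ff 58 01 01 01 01 00"


def split_sort_key(text):
    """Return (base weight bytes, list of extra bytes) for the key above.

    Assumed layout: <2 base bytes> (ff <extra>)* 01 ... 00
    """
    data = bytes.fromhex(text)
    primary = data[:data.index(0x01)]  # everything before the first 01
    base, rest = primary[:2], primary[2:]
    extras = [rest[k + 1] for k in range(0, len(rest), 2) if rest[k] == 0xFF]
    return base, extras


base, extras = split_sort_key(SORT_KEY)
print(base.hex(), [hex(b) for b in extras])  # 2311 ['0x37', '0x26', '0x58']
```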

If you want to see the data that drives this state machine you can find it here. Or here are the small number of relevant entries at the end since that is a big file:

0x1107    5
    0x1107 0x00 0x07 0x00 0x00 0x2c    0x04    ; U+1107
    0x1109 0x01 0x00 0x00 0x00 0x00    0x01    ; U+1107,1109
    0x1110 0x01 0x08 0x14 0x1b 0x37    0x00    ; U+1107,1109,1110
    0x110f 0x01 0x08 0x14 0x1b 0x3a    0x00    ; U+1107,110f
    0x1112 0x01 0x08 0x14 0x1b 0x3d    0x00    ; U+1107,1112

0x116d    4
    0x116d 0x00 0x00 0x0c 0x00 0x24    0x03    ; U+116d
    0x1161 0x01 0x00 0x0c 0x1b 0x25    0x01    ; U+116d,1161
    0x1175 0x01 0x00 0x0c 0x1b 0x26    0x00    ; U+116d,1161,1175
    0x1165 0x01 0x00 0x0c 0x1b 0x29    0x00    ; U+116d,1165

0x11b8    10
    0x11b8 0x00 0x00 0x00 0x11 0x51    0x09    ; U+11b8
    0x11ae 0x01 0x00 0x00 0x11 0x52    0x00    ; U+11b8,11ae
    0x11af 0x01 0x00 0x00 0x00 0x00    0x01    ; U+11b8,11af
    0x11c1 0x01 0x00 0x00 0x11 0x54    0x00    ; U+11b8,11af,11c1
    0x11b7 0x01 0x00 0x00 0x11 0x55    0x00    ; U+11b8,11b7
    0x11b8 0x01 0x00 0x00 0x11 0x56    0x00    ; U+11b8,11b8
    0x11ba 0x01 0x00 0x00 0x00 0x00    0x01    ; U+11b8,11ba
    0x11ae 0x01 0x00 0x00 0x12 0x58    0x00    ; U+11b8,11ba,11ae
    0x11bd 0x01 0x00 0x00 0x12 0x59    0x00    ; U+11b8,11bd
    0x11be 0x01 0x00 0x00 0x12 0x5a    0x00    ; U+11b8,11be
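
Here is a rough Python sketch of how rows like these could drive the two steps described above. It is my reading of the excerpt, not the actual Windows code. I am assuming that the third, fourth, and fifth numeric columns are the lead, vowel, and trailing indices, that the column after them is the extra weight, and that matching is longest-sequence-first. The rows are keyed by the jamo sequence in each row's comment, and only the rows needed for the example are included.

```python
SBASE, VCOUNT, TCOUNT = 0xAC00, 21, 28

# jamo sequence -> (lead index, vowel index, trail index, extra weight)
# Values copied from the table excerpt; the column meanings are assumed.
TABLE = {
    (0x1107,): (0x07, 0x00, 0x00, 0x2C),
    (0x1107, 0x1109): (0x00, 0x00, 0x00, 0x00),
    (0x1107, 0x1109, 0x1110): (0x08, 0x14, 0x1B, 0x37),
    (0x116D,): (0x00, 0x0C, 0x00, 0x24),
    (0x116D, 0x1161): (0x00, 0x0C, 0x1B, 0x25),
    (0x116D, 0x1161, 0x1175): (0x00, 0x0C, 0x1B, 0x26),
    (0x11B8,): (0x00, 0x00, 0x11, 0x51),
    (0x11B8, 0x11BA): (0x00, 0x00, 0x00, 0x00),
    (0x11B8, 0x11BA, 0x11AE): (0x00, 0x00, 0x12, 0x58),
}


def closest_syllable(jamo):
    """Return (closest modern syllable, extra weights) for a jamo list."""
    lead = vowel = trail = 0
    extras = []
    i = 0
    while i < len(jamo):
        # Find the longest run starting at i that has a table row.
        for n in range(len(jamo) - i, 0, -1):
            key = tuple(jamo[i:i + n])
            if key in TABLE:
                break
        else:
            raise ValueError(f"no table row starts with U+{jamo[i]:04X}")
        l, v, t, extra = TABLE[key]
        # Keep the highest index seen for each third of the syllable.
        lead, vowel, trail = max(lead, l), max(vowel, v), max(trail, t)
        extras.append(extra)
        i += n
    return SBASE + (lead * VCOUNT + vowel) * TCOUNT + trail, extras


seq = [0x1107, 0x1109, 0x1110, 0x116D, 0x1161, 0x1175, 0x11B8, 0x11BA, 0x11AE]
syllable, extras = closest_syllable(seq)
print(hex(syllable), [hex(e) for e in extras])
# 0xc0ab ['0x37', '0x26', '0x58']
```

Under those assumptions the sketch lands on the same U+c0ab and the same three extra bytes mentioned above, which at least suggests the column reading is plausible.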

And now we get to a slightly less contrived case, namely the various doubled and tripled conjoining Jamo both in Unicode now and the new ones being added (now in Stage 6 of the approval process) that I discuss in Using a character proposal for a 'repertoire fence' extension. They will be in these three subranges in an upcoming version:

29 in Old Hangul initial consonants (in Hangul Jamo Extended-A block: A960..A97F)
23 in Old Hangul medial vowels (in Hangul Jamo Extended-B block: D7B0..D7FF)
49 in Old Hangul final consonants (also in Hangul Jamo Extended-B block: D7B0..D7FF)

So these ones that were constructed now are meant to exist on their own, and you can even see the leading and trailing consonants in the proposal:

HX124 한글 초성 비읍-시옷-티읕
HANGUL CHOSEONG PIEUP-SIOS-THIEUTH

HX335 한글 종성 비읍-시옷-디귿
HANGUL JONGSEONG PIEUP-SIOS-TIKEUT

And in the not-yet-official data for Unicode as:

A972;HANGUL CHOSEONG PIEUP-SIOS-THIEUTH;Lo;0;L;;;;;N;;;;;
D7E7;HANGUL JONGSEONG PIEUP-SIOS-TIKEUT;Lo;0;L;;;;;N;;;;;
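
Those two lines use the usual semicolon-separated UnicodeData layout, so checking which of the new blocks each one lands in takes only a few lines. This is a sketch: the `BLOCKS` table uses just the two block ranges mentioned above, and `parse_entry` is a hypothetical helper that reads only the first three fields.

```python
# Block ranges as given in this post for the new Old Hangul jamo.
BLOCKS = {
    "Hangul Jamo Extended-A": range(0xA960, 0xA980),  # A960..A97F
    "Hangul Jamo Extended-B": range(0xD7B0, 0xD800),  # D7B0..D7FF
}


def parse_entry(line):
    """Return (code point, name, general category, block name or None)."""
    fields = line.split(";")
    cp = int(fields[0], 16)
    block = next((name for name, r in BLOCKS.items() if cp in r), None)
    return cp, fields[1], fields[2], block


for line in (
    "A972;HANGUL CHOSEONG PIEUP-SIOS-THIEUTH;Lo;0;L;;;;;N;;;;;",
    "D7E7;HANGUL JONGSEONG PIEUP-SIOS-TIKEUT;Lo;0;L;;;;;N;;;;;",
):
    print(parse_entry(line))
```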

though the vowel (which would be HANGUL JUNGSEONG YO-A-I) you cannot find there, which would suggest that OpenType has at least one Jamo vowel sequence defined for Old Hangul that neither the existing Unicode standard nor any proposal from Korea lists!

Oops?

I wonder whose bug that is?

Anyway, back to the point -- it would be easy to add an entry to the table for the new characters that will be added to Unicode based on the proposal at any time, assuming that the Jamo exist. Thus all of these new characters can be kept backwards compatible with the old sequences, though it is likely that the order might not be the same between what the Korean proposal suggested and what is there now, which means Microsoft gets to decide what order it wants to be compatible with at whatever point these characters are added....

Either with itself or with whatever order the standard suggests.

I'll leave off with why this one annoyed me so much.

The Old Hangul support I showed a bit in We're off on the road to Korea! We certainly do get around... has existed in a font that was created and made publicly available for download around the time of the Korean version of Office 2000, though the download was taken down at the request of the folks in Korea, who have been unhappy with the model for conjoining Jamo in Unicode that eventually led (over half a decade later) to the proposal I reference in Using a character proposal for a 'repertoire fence' extension.

One of the reasons the proposal was eventually accepted was that there were no existing implementations of the fully conjoining model available. Though in this case the reason that it wasn't available was that the folks doing the proposing made sure it wouldn't be.

Talk about killing your parents and then asking the court for leniency on the grounds that you're an orphan!

I don't look forward to an input method to support the input of these characters, though I will talk more fully about this another time....

This blog brought to you by ⑭ (U+246d, aka CIRCLED NUMBER FOURTEEN)

Robbie on 21 Aug 2008 4:11 AM:

Well, I was wondering why windows desktop search protocol was enforing hangul so this article is interesting.

Guess you have to lower your 99.99% now ;).

Great article by the way.


referenced by

2010/04/21 If no one supported the OLD Old proposal, jumping in to support the NEW Old proposal may not make sense…

2008/10/09 Making a point without explaining the whole point of the point? *That* is the point!
