by Michael S. Kaplan, published on 2008/08/21 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/08/21/8883467.aspx
Previous posts in this series before the long unexplained hiatus:
I am not going to try to explain the long unexplained hiatus the series went through, because that would ruin the whole unexplained nature of it. Odds are that 99.99% of the people reading this don't care, anyway. :-)
So one thing I have skipped thus far is the handling for Old Hangul (even when I was talking about Hangul in prior blogs in the series and before it).
The reason was simple.
Though I'm not going to tell you that yet, either. I'll do it near the end.
Now first I'll start with an excerpt from The Unicode Standard (3.12: Conjoining Jamo Behavior):
Hangul Syllable Composition
The following algorithm describes how to take a sequence of canonically decomposed characters D and compose Hangul syllables. Hangul composition and decomposition are summarized here, but for a more complete description, implementers must consult Unicode Standard Annex #15, “Unicode Normalization Forms.” Note that, like other non-jamo characters, any combining mark between two conjoining jamos prevents the jamos from composing.
First, define the following constants:
SBase = AC00₁₆
LBase = 1100₁₆
VBase = 1161₁₆
TBase = 11A7₁₆
SCount = 11172
LCount = 19
VCount = 21
TCount = 28
NCount = VCount * TCount
1. Iterate through the sequence of characters in D, performing steps 2 through 5.
2. Let i represent the current position in the sequence D. Compute the following indices, which represent the ordinal number (zero-based) for each of the components of a syllable, and the index j, which represents the index of the last character in the syllable.
LIndex = D[i] - LBase
VIndex = D[i+1] - VBase
TIndex = D[i+2] - TBase
j = i + 2
3. If either of the first two characters is out of bounds (LIndex < 0 OR LIndex ≥ LCount OR VIndex < 0 OR VIndex ≥ VCount), then increment i, return to step 2, and continue from there.
4. If the third character is out of bounds (TIndex ≤ 0 or TIndex ≥ TCount), then it is not part of the syllable. Reset the following:
TIndex = 0
j = i + 1
5. Replace the characters D[i] through D[j] by the Hangul syllable S, and set i to be j + 1.
S = (LIndex * VCount + VIndex) * TCount + TIndex + SBase
Example. The first three characters are
U+1111 ᄑ hangul choseong phieuph
U+1171 ᅱ hangul jungseong wi
U+11B6 ᆶ hangul jongseong rieul-hieuh
Compute the following indices:
LIndex = 17
VIndex = 16
TIndex = 15
Replace the three characters as follows:
S = [(17 * 21) + 16] * 28 + 15 + SBase
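For anyone who would rather read code than steps, here is a minimal Python sketch of the composition algorithm quoted above. The function name compose_hangul and the choice to pass code points as a list of integers are my own assumptions for illustration; the constants and arithmetic come straight from the excerpt. The only thing added is bounds checking at the end of the sequence, which the quoted steps leave implicit.

```python
# Constants from the excerpt of The Unicode Standard quoted above.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_hangul(d):
    """Compose modern Hangul syllables from a list of code points.

    Hypothetical helper: a sketch of the quoted steps, not a full
    normalizer (it ignores everything except L V and L V T runs).
    """
    out = []
    i = 0
    while i < len(d):
        l_index = d[i] - L_BASE
        # Treat "no next character" as out of bounds.
        v_index = d[i + 1] - V_BASE if i + 1 < len(d) else -1
        if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
            out.append(d[i])  # not the start of a syllable, copy through
            i += 1
            continue
        t_index = d[i + 2] - T_BASE if i + 2 < len(d) else 0
        j = i + 2
        if t_index <= 0 or t_index >= T_COUNT:
            t_index = 0  # third character is not a trailing consonant
            j = i + 1
        out.append((l_index * V_COUNT + v_index) * T_COUNT + t_index + S_BASE)
        i = j + 1
    return out


# The worked example: U+1111 U+1171 U+11B6
print([hex(cp) for cp in compose_hangul([0x1111, 0x1171, 0x11B6])])
# prints ['0xd4db']
```

Working the example by hand gives (17 * 21 + 16) * 28 + 15 = 10459, and 10459 + AC00₁₆ = D4DB₁₆, which is 퓛.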
Okay, there you go.
Notice that one could in theory create an Old Hangul syllable and go through the same kind of algorithm, slightly modified.
In fact the Old Hangul state machine on Windows kind of does.
If you take that theoretical character I mentioned in We're off on the road to Korea! We certainly do get around..., that looks like this:
And is made up of the following Jamo:
1107 1109 1110 116d 1161 1175 11b8 11ba 11ae
Okay, great. So what do we do here?
Well, it runs through a state machine that finds as its closest estimation 삫 (U+c0ab, aka HANGUL SYLLABLE BBIH, i.e. ssangpieup + i + hieuh, aka U+1108 U+1175 U+11c2).
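To check which modern Jamo that closest estimation is made of, the composition arithmetic can be run in reverse. This is a small sketch under the same constants as the Unicode excerpt; decompose_hangul is a hypothetical helper name of mine, not anything from Windows.

```python
# Same constants as in the quoted Unicode excerpt.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588
S_COUNT = L_COUNT * N_COUNT          # 11172


def decompose_hangul(s):
    """Split a precomposed Hangul syllable into its conjoining Jamo.

    Hypothetical helper for illustration; returns [s] unchanged when
    s is not a precomposed syllable.
    """
    s_index = s - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [s]
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    # t_index == 0 means "no trailing consonant".
    return [lead, vowel] + ([T_BASE + t_index] if t_index else [])


print([hex(cp) for cp in decompose_hangul(0xC0AB)])
# prints ['0x1108', '0x1175', '0x11c2']
```

The indices that fall out are 8, 20 and 27 (0x08, 0x14, 0x1b), the highest vowel and trailing indices the modern syllable block has.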
Now I happen to think this example kind of points out a flaw in the state machine, but we will run with it. :-)
The machine does two things for each part of the syllable (Lead, Vowel, Trailing):
Thus it will have a notion, when it is done, of both the modern Hangul syllable it is closest to, and also the extra information that will later end up in the sort key. This sort key basically becomes:
23 11 ff 37 ff 26 ff 58 01 01 01 01 00
and thus the extra information became the 0x37, the 0x26, and the 0x58 in the end of the weight there.
If you want to see the data that drives this state machine you can find it here. Or here are the small number of relevant entries at the end since that is a big file:
0x1107 0x00 0x07 0x00 0x00 0x2c 0x04 ; U+1107
0x1109 0x01 0x00 0x00 0x00 0x00 0x01 ; U+1107,1109
0x1110 0x01 0x08 0x14 0x1b 0x37 0x00 ; U+1107,1109,1110
0x110f 0x01 0x08 0x14 0x1b 0x3a 0x00 ; U+1107,110f
0x1112 0x01 0x08 0x14 0x1b 0x3d 0x00 ; U+1107,1112
0x116d 0x00 0x00 0x0c 0x00 0x24 0x03 ; U+116d
0x1161 0x01 0x00 0x0c 0x1b 0x25 0x01 ; U+116d,1161
0x1175 0x01 0x00 0x0c 0x1b 0x26 0x00 ; U+116d,1161,1175
0x1165 0x01 0x00 0x0c 0x1b 0x29 0x00 ; U+116d,1165
0x11b8 0x00 0x00 0x00 0x11 0x51 0x09 ; U+11b8
0x11ae 0x01 0x00 0x00 0x11 0x52 0x00 ; U+11b8,11ae
0x11af 0x01 0x00 0x00 0x00 0x00 0x01 ; U+11b8,11af
0x11c1 0x01 0x00 0x00 0x11 0x54 0x00 ; U+11b8,11af,11c1
0x11b7 0x01 0x00 0x00 0x11 0x55 0x00 ; U+11b8,11b7
0x11b8 0x01 0x00 0x00 0x11 0x56 0x00 ; U+11b8,11b8
0x11ba 0x01 0x00 0x00 0x00 0x00 0x01 ; U+11b8,11ba
0x11ae 0x01 0x00 0x00 0x12 0x58 0x00 ; U+11b8,11ba,11ae
0x11bd 0x01 0x00 0x00 0x12 0x59 0x00 ; U+11b8,11bd
0x11be 0x01 0x00 0x00 0x12 0x5a 0x00 ; U+11b8,11be
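To make the excerpt a little less opaque, here is a rough Python sketch of walking those entries. Everything about the column meanings is my guess from staring at the rows, not documented behavior: I am assuming the first column is the Jamo, the sixth is the extra weight, and the last is the number of entries that follow as continuations of that prefix. The field names and the functions find and extra_weights are hypothetical, and the real Windows state machine surely does more than this.

```python
from typing import NamedTuple

# Rows copied from the excerpt above (comments dropped).
TABLE_TEXT = """\
0x1107 0x00 0x07 0x00 0x00 0x2c 0x04
0x1109 0x01 0x00 0x00 0x00 0x00 0x01
0x1110 0x01 0x08 0x14 0x1b 0x37 0x00
0x110f 0x01 0x08 0x14 0x1b 0x3a 0x00
0x1112 0x01 0x08 0x14 0x1b 0x3d 0x00
0x116d 0x00 0x00 0x0c 0x00 0x24 0x03
0x1161 0x01 0x00 0x0c 0x1b 0x25 0x01
0x1175 0x01 0x00 0x0c 0x1b 0x26 0x00
0x1165 0x01 0x00 0x0c 0x1b 0x29 0x00
0x11b8 0x00 0x00 0x00 0x11 0x51 0x09
0x11ae 0x01 0x00 0x00 0x11 0x52 0x00
0x11af 0x01 0x00 0x00 0x00 0x00 0x01
0x11c1 0x01 0x00 0x00 0x11 0x54 0x00
0x11b7 0x01 0x00 0x00 0x11 0x55 0x00
0x11b8 0x01 0x00 0x00 0x11 0x56 0x00
0x11ba 0x01 0x00 0x00 0x00 0x00 0x01
0x11ae 0x01 0x00 0x00 0x12 0x58 0x00
0x11bd 0x01 0x00 0x00 0x12 0x59 0x00
0x11be 0x01 0x00 0x00 0x12 0x5a 0x00
"""


class Entry(NamedTuple):
    # All field names are guesses; the post does not document the columns.
    jamo: int
    is_continuation: int
    l_index: int
    v_index: int
    t_index: int
    extra_weight: int
    descendants: int  # assumed: number of following rows under this prefix


TABLE = [Entry(*(int(f, 16) for f in line.split()))
         for line in TABLE_TEXT.splitlines()]


def find(jamo, start, end):
    """Look for jamo among sibling rows in [start, end), skipping subtrees."""
    i = start
    while i < end:
        if TABLE[i].jamo == jamo:
            return i
        i += 1 + TABLE[i].descendants
    return None


def extra_weights(jamos):
    """Greedily match the longest listed prefix at each position and
    collect the assumed extra-weight column of the final row reached."""
    weights = []
    i = 0
    while i < len(jamos):
        node = find(jamos[i], 0, len(TABLE))
        if node is None:
            i += 1  # Jamo not in this excerpt, skip it
            continue
        i += 1
        while i < len(jamos):
            first = node + 1
            child = find(jamos[i], first, first + TABLE[node].descendants)
            if child is None:
                break
            node = child
            i += 1
        weights.append(TABLE[node].extra_weight)
    return weights


seq = [0x1107, 0x1109, 0x1110, 0x116d, 0x1161, 0x1175, 0x11b8, 0x11ba, 0x11ae]
print([hex(w) for w in extra_weights(seq)])
# prints ['0x37', '0x26', '0x58']
```

Under those assumptions the walk lands on the same three values that show up in the sort key above, which is at least some evidence that the guess about the columns is not too far off.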
And now we get to a slightly less contrived case, namely the various doubled and tripled conjoining Jamo both in Unicode now and the new ones being added (now in Stage 6 of the approval process) that I discuss in Using a character proposal for a 'repertoire fence' extension. They will be in these three subranges in an upcoming version:
So these ones that were constructed now are meant to exist on their own, and you can even see the leading and trailing consonants in the proposal:
HX124 한글 초성 비읍-시옷-티읕
HANGUL CHOSEONG PIEUP-SIOS-THIEUTH
HX335 한글 종성 비읍-시옷-디귿
HANGUL JONGSEONG PIEUP-SIOS-TIKEUT
And in the not-yet-official data for Unicode as:
A972;HANGUL CHOSEONG PIEUP-SIOS-THIEUTH;Lo;0;L;;;;;N;;;;;
D7E7;HANGUL JONGSEONG PIEUP-SIOS-TIKEUT;Lo;0;L;;;;;N;;;;;
though the vowel (which would be HANGUL JUNGSEONG YO-A-I) you cannot find there, which would suggest that OpenType has at least one Jamo vowel sequence defined for Old Hangul that neither the existing Unicode standard nor any proposal from Korea lists!
I wonder whose bug that is?
Anyway, back to the point -- it would be easy to add an entry to the table at any time for the new characters that will be added to Unicode based on the proposal, assuming that the Jamo exists. Thus all of these new characters can be kept backwards compatible with the old sequence, though it is likely that the order might not be the same between what the Korean proposal suggested vs. what is there now, which means Microsoft gets to decide what order it wants to be compatible with at whatever point these characters are added....
Either with itself or with whatever order the standard suggests.
I'll leave off with why this one annoyed me so much.
The Old Hangul support I showed a bit in We're off on the road to Korea! We certainly do get around... has existed in a font that was created and made publicly available for download around the time of the Korean version of Office 2000, though the download was taken down at the request of the folks in Korea, who have been unhappy with the model for conjoining Jamo in Unicode that eventually led (over half a decade later) to the proposal I reference in Using a character proposal for a 'repertoire fence' extension.
One of the reasons the proposal was eventually accepted was that there were no existing implementations of the fully conjoining model available. Though in this case the reason that it wasn't available was that the folks doing the proposing made sure it wouldn't be.
Talk about killing your parents and then asking the court for leniency on the grounds that you're an orphan!
I don't look forward to an input method to support the input of these characters, though I will talk more fully about this another time....
This blog brought to you by ⑭ (U+246d, aka CIRCLED NUMBER FOURTEEN)
Robbie on 21 Aug 2008 4:11 AM:
Well, I was wondering why windows desktop search protocol was enforing hangul so this article is interesting.
Guess you have to lower your 99.99% now ;).
Great article by the way.