by Michael S. Kaplan, published on 2005/09/14 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/14/463569.aspx
Last February and then again the day before yesterday I talked a bit about how Korean was encoded twice in the Unicode standard.
I thought it might be worthwhile to quote an interesting bit that Ken Whistler posted on the Unicode List the other day, when he pointed out that it was actually encoded four times, not two!
He posted it to explain why such things make usage more complicated (for the benefit of some people who mistakenly thought it set some kind of precedent for re-encoding that they might emulate). That complexity is in turn one of the forces that keep the two forms our collation does support separate. But his text applies to Unicode in general, and was quite interesting, so I will repost it here:
[Korean] wasn't encoded once in the standard -- it was encoded *FOUR* times.
Doubt me? Examine the standard:
- Encoding #1: U+1100..U+11F9, as combining jamos [these are the Korean Jamo I referred to - MSK]
- Encoding #2: U+AC00..U+D7A3, as preformed syllables [these are the Hangul syllables I referred to - MSK]
- Encoding #3: U+3131..U+318E, as compatibility jamos
- Encoding #4: U+FFA0..U+FFDC, as halfwidth jamos
Representing the *same* Korean text is done distinctly for each of those encodings.
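To make the list above concrete, here is a minimal Python sketch (an illustration, not part of the quoted post) that spells the single syllable 가 in each of the four encodings. It assumes only the standard `unicodedata` module; the names `encodings`, `code_points`, and `nfc` are made up for this example.

```python
import unicodedata

# The syllable "가" (GA) spelled once in each of the four encodings.
encodings = {
    "conjoining jamo": "\u1100\u1161",     # CHOSEONG KIYEOK + JUNGSEONG A
    "precomposed syllable": "\uac00",      # HANGUL SYLLABLE GA
    "compatibility jamo": "\u3131\u314f",  # HANGUL LETTER KIYEOK + HANGUL LETTER A
    "halfwidth jamo": "\uffa1\uffc2",      # HALFWIDTH HANGUL LETTER KIYEOK + A
}


def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def nfc(text):
    """Apply canonical composition (Unicode Normalization Form C)."""
    return unicodedata.normalize("NFC", text)


syllable = encodings["precomposed syllable"]
for label, text in encodings.items():
    # Only canonical equivalence is tested here; compatibility mappings
    # (NFKC/NFKD) are deliberately left out of this sketch.
    canonically_equal = nfc(text) == nfc(syllable)
    print(f"{label:22} {code_points(text):15} canonical match: {canonically_equal}")
```

Only the first two spellings compare equal after canonical normalization; the compatibility and halfwidth forms stay distinct unless something converts them explicitly.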
And hey, sorting Encoding #2 is easy, because all the syllables are laid out in the collation order, so binary works just fine. Sound familiar?
But sorting *Korean* in Unicode is a bloody, awful nightmare with edge cases galore, because the encoding is such a mess to begin with. If you are dealing with any data originating from encoding #3 or #4, you have to put in place transducers to convert representation, or get only partially correct results. [the Microsoft implementation chooses the latter of these two choices - MSK] And even for encoding #1 and #2, which are meant to work with each other and which have canonical equivalence relations built in, you *still* have funky edge cases because the combining jamos are more expressive than the preformed syllables (which don't cover ancient Hangul), and you can't depend just on the binary order of the preformed syllables -- which was one of the big reasons for creating them in the first place. [it is entirely possible that these complications are some of the additional reasons that the two are not generally mixed in practice by native speakers who deal with both types of characters - MSK]
Making encodings *more* complex does not make them simpler to process.
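A minimal Python sketch of the sorting trouble described above, assuming only the standard `unicodedata` module. The sample strings and the `nfc` helper are hypothetical, chosen just to show the effect; this is not how any particular collation implementation works.

```python
import unicodedata


def nfc(text):
    """Apply canonical composition (Unicode Normalization Form C)."""
    return unicodedata.normalize("NFC", text)


# Precomposed syllables (encoding #2): plain code point order works.
ga, na, da = "\uac00", "\ub098", "\ub2e4"  # 가, 나, 다
assert sorted([da, ga, na]) == [ga, na, da]

# The same "나", but spelled with conjoining jamo (encoding #1).
na_jamo = "\u1102\u1161"  # CHOSEONG NIEUN + JUNGSEONG A
mixed = [da, ga, na_jamo]

# Binary order puts the jamo spelling first, because U+1102 < U+AC00.
binary_order = sorted(mixed)

# Composing to NFC before comparing restores the expected order
# for modern syllables.
normalized_order = sorted(mixed, key=nfc)

# Ancient Hangul has no precomposed form, so NFC cannot help:
# the sequence stays as jamo and still sorts ahead of every syllable.
old = "\u1140\u1161"  # CHOSEONG PANSIOS + JUNGSEONG A
old_stays_decomposed = nfc(old) == old
with_old = sorted([ga, da, old], key=nfc)

print([nfc(s) for s in binary_order])
print([nfc(s) for s in normalized_order])
print(old_stays_decomposed, with_old[0] == old)
```

Normalizing first handles the modern syllables, but the ancient Hangul sequence still lands ahead of 가 in binary order, which is unlikely to be where a dictionary would place it. That is the kind of edge case that only a real collation table can handle.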
I won't make any additional comment beyond the above; I think the issues described (and why it is good that Microsoft carefully avoids most of them anyway!) kind of speak for themselves.... :-)
This post brought to you by "ㄱ" (U+3131, a.k.a. HANGUL LETTER KIYEOK)