Every rule has an exception that proves the rule, even in Unicode!!!

by Michael S. Kaplan, published on 2013/11/18, original URI: http://blogs.msdn.com/b/michkap/archive/2013/11/18/10468856.aspx


Every rule has an exception that proves the rule.

Even in The Unicode Standard!!!

You might be wondering what rule in particular I have in mind.

It is the somewhat famous "Unicode cannot re-encode scripts once they are already encoded in the Standard" rule.

You know, the rule that is cited every single time a re-encode proposal is sent to the Unicode Technical Committee.

Even this rule has its very own exception.

I can give you that exception in one word.

Korean.

From a completely technical standpoint, Korean has been encoded in Unicode FIVE TIMES THAT STILL EXIST.

How's THAT for an exception? 😏😏;-)

I think it qualifies!

Allow me to briefly go over the five encodings and their consequences for Unicode, Microsoft, Korea, North Korea, and many others.

  1. Halfwidth Jamo (U+FFA0..U+FFDC) -- largely for backwards compatibility with prior standards and now used mostly ornamentally; it could theoretically be used to encode all Korean text that exists, in either Modern or Old Hangul (though no one uses it for this);
  2. Compatibility Jamo (U+3131..U+318E) -- primarily used now by Korean Input Method Editors (IMEs) for incomplete Hangul Syllable input;
  3. Modern Hangul Syllables (U+AC00..U+D7A3) -- can be used to represent all Modern Hangul;
  4. Conjoining Jamo (original version) (U+1100..U+11F9) -- the original way to represent ALL Modern and Old Hangul decomposed into individual Jamo;
  5. Conjoining Jamo (accepted version) (U+1100..U+11FF, U+A960..U+A97F, U+D7B0..U+D7FF) -- the currently accepted way to represent ALL Modern and Old Hangul decomposed into individual Jamo.
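
To make the list a bit more concrete, here is a minimal sketch (not from the original post) that spells one Modern Hangul syllable in several of these encodings using Python's standard `unicodedata` module. The sample syllable U+D55C and the variable names are my own choices for illustration, and the halfwidth and compatibility spellings are simply letter-by-letter equivalents I picked by hand.

```python
import unicodedata

# Encoding #3: one precomposed Modern Hangul Syllable (U+D55C).
syllable = "\uD55C"

# Encodings #4/#5: canonical decomposition (NFD) yields conjoining jamo.
conjoining = unicodedata.normalize("NFD", syllable)
print([f"U+{ord(c):04X}" for c in conjoining])  # ['U+1112', 'U+1161', 'U+11AB']

# Canonical composition (NFC) turns the jamo sequence back into the syllable.
recomposed = unicodedata.normalize("NFC", conjoining)

# Encoding #2: the same three letters as Compatibility Jamo (chosen by hand).
compatibility = "\u314E\u314F\u3134"

# Encoding #1: the same three letters as Halfwidth Jamo (chosen by hand).
halfwidth = "\uFFBE\uFFC2\uFFA4"

# Canonical normalization leaves #1 and #2 alone: they are not canonically
# equivalent to the syllable, only compatibility-related to other jamo.
for label, text in [("compatibility", compatibility), ("halfwidth", halfwidth)]:
    same = unicodedata.normalize("NFC", text) == text
    print(label, [unicodedata.name(c) for c in text], "unchanged by NFC:", same)
```

The point of the sketch is only that the same letters live at several unrelated code points, and that canonical normalization connects just the syllable and conjoining-jamo forms.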

Now encoding #4 is tied up with Unicode Normalization. It is also the implementation that Microsoft formerly supported in Uniscribe and OpenType (until it was intentionally removed!), which I think might still be supported in collation, and which was never supported in the IME.

And encoding #5 is what Microsoft currently supports across everything it provides in its latest version, and what Unicode Normalization does not support.
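
One observable facet of this split can be sketched in Python, again with the standard `unicodedata` module. This is my own illustration, not the author's, and it may not capture everything meant by "does not support": it only shows that a modern jamo sequence composes under NFC while a sequence starting with a jamo from the later U+A960..U+A97F block is left untouched. It assumes a Python build whose Unicode data includes that block.

```python
import unicodedata

# Modern conjoining jamo: leading HIEUH + vowel A + trailing NIEUN.
modern = "\u1112\u1161\u11AB"

# A leading consonant from the later-added U+A960..U+A97F block plus vowel A
# (sample code points chosen for illustration).
extended = "\uA960\u1161"

# NFC composes the modern sequence into one precomposed syllable.
composed = unicodedata.normalize("NFC", modern)
print(len(modern), "->", len(composed))  # 3 -> 1

# NFC has nothing to compose the extended sequence into, so it is unchanged.
unchanged = unicodedata.normalize("NFC", extended)
print(unchanged == extended)  # True

# The extended jamo also carries no decomposition mapping of its own.
mapping = unicodedata.decomposition("\uA960")
print(repr(mapping))  # ''
```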

Exactly what does it ultimately mean if we are left with this strange state? Is anyone hurt by it?

Is anyone specifically hurt by the missing pieces here?

I refuse to link to earlier blogs that tried to support all of these things without contradictions or problems, since no one wanted to go that way....

