Everyone: Repeat after me... We need both cases of Cherokee *in* the Unicode BMP!

by Michael S. Kaplan, published on 2013/11/05 06:10 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2013/11/05/10463141.aspx

For the most part, I am retired from my former Unicode duties, or at least 75-80% of them.

I mean I still go to the one meeting in Redmond, and (usually) speak at the IUC (when I haven't broken my hip or whatever).

But there are times when I wish I was still more involved.

Like now.

It seems there is a proposal being discussed at the UTC this week to add uppercase letters for Cherokee (proposal here) (warning: that link is to a 1 MB PDF).

Basically, they want to add the uppercase forms.

But that is a nightmare, from my point of view.

My concern is that the new uppercase forms are (in the proposal, at least) in the SMP rather than the BMP, which is where the current letters are.

This is a mild nuisance for collation, since it will only work for the Cherokee user locale, and it will take an extra half day for the owning team to add to the table.

But it is a much bigger nuisance for me (as owner of the two Cherokee keyboards that we would have to update).

Well, the Cherokee Nation keyboard would be pretty easy, now that I think about it.

On the other hand, the Cherokee Phonetic keyboard layout, already the most complicated keyboard layout we ship, would now be more than six times as complicated, rather than the twice as complicated it would be if all of the characters were in the BMP.

Actually, that's a lie. It is completely impossible if any of them are in the SMP. The Cherokee Phonetic keyboard layout would be impossible. Impossible!

I freely admit it is a side effect of the way that even the most complicated dead key keyboard layouts work. The end of a dead key chain must be a single UTF-16 code point...
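To make that constraint a bit more concrete, here is a minimal Python sketch. It is only an illustration of the BMP versus SMP difference, not how the keyboard layout tooling is implemented. The helper names are hypothetical, and since the proposed code points are not listed here, U+10400 (a Deseret letter that already lives in the SMP) stands in for a hypothetical SMP Cherokee capital.

```python
def utf16_code_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for one character."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
    return len(ch.encode("utf-16-le")) // 2


def fits_dead_key_result(ch: str) -> bool:
    """Hypothetical check: can this character be the end of a dead key chain,
    assuming the result must be exactly one UTF-16 code unit?"""
    return utf16_code_units(ch) == 1


cherokee_a = "\u13A0"        # CHEROKEE LETTER A, in the BMP
smp_stand_in = "\U00010400"  # DESERET CAPITAL LETTER LONG I, in the SMP (stand-in only)

for label, ch in (("BMP", cherokee_a), ("SMP", smp_stand_in)):
    print(label, f"U+{ord(ch):04X}", utf16_code_units(ch), fits_dead_key_result(ch))
```

Anything in the SMP needs a surrogate pair, which is two code units, so under the stated assumption it cannot be what a dead key chain produces.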

And given the choice between twice as complicated and impossible, the winner should be obvious.

There really is very little prior usage to go by, unless you count when they disunified Coptic from Greek, or added the third Georgian case; but both of those situations were kept in the BMP, even though the Georgian was put in a new spot, kind of like they did for Korean now that I think about it.

Perhaps we will have a leg to stand on, and some precedents! 😏😏😏

Sigh. Maybe I need to jump into the conversation here. Or at least get the people who are involved with Unicode on the case.

Think that this blog is a great passive/aggressive way to go? 😏😏😏

Steven R. Loomis on 5 Nov 2013 6:21 AM:

"on the case" …nice…

At least you didn't write it in *drum roll* all caps…

Michael S. Kaplan on 5 Nov 2013 7:14 AM:

Sometimes you get carried by the words...

Anubhav Chattoraj on 5 Nov 2013 9:12 AM:

Microsoft needs to move away from the "one-to-one correspondence between keys and inputtable symbols (except when there are dead keys)" keyboard model. The day Windows treats every keyboard layout as an IME, I will be a happy man.

(On Linux, IBus sure seems to be moving in that direction.)

Michael S. Kaplan on 5 Nov 2013 10:23 AM:

And they did it! ;-)

Piotr Dobrogost on 5 Nov 2013 10:43 AM:

Inadequate MS implementation driving changes in the world wide standard? I hope not.

Charlie on 5 Nov 2013 11:03 AM:

And who did what? :-(

Joshua on 5 Nov 2013 11:55 AM:

Just one more piece of evidence why UTF-16 was a terrible thing to build Windows on. As soon as UCS-2 was found to be not full coverage and wchar_t couldn't be relied on to hold all codes, everything should have been retooled. Yes I know UTF-8 hadn't been invented yet, but UTF-1 had been invented and published.

Michael S. Kaplan on 5 Nov 2013 12:25 PM:

I guess sometimes the tail wags the dog! ;-)

Doug Ewell on 5 Nov 2013 3:30 PM:

"The end of a dead key chain must be a single UTF-16 code point..."

Wait, I thought you just said that was a limitation of MSKLC, not Windows, and that there was hope it would soon be removed from MSKLC.

tex texin on 5 Nov 2013 11:22 PM:

Methinks you should fix the MSKLC to not care about BMP vs. SMP... This is 2013... ;-)

Azarien on 6 Nov 2013 6:25 AM:

I have the impression that Unicode (or is it ISO) puts too much pressure on "prove the existing usage" of proposed characters, forgetting that it's not 1991 anymore, and people often cannot use a character *because* it's not in Unicode.

Anon on 6 Nov 2013 7:32 AM:


Personally, I hate needing to use a non-ASCII character from Unicode because it seems like less than 0.1% of the world can see any given glyph.

As an example, in Chrome, all of the non-alphanumerics on this blog are not visible, regardless of encoding or font selection; in IE, none of the non-alphanumerics are visible unless 'Auto-select' is toggled in the Encoding settings (even then, there's no evidence that the visible characters have any relationship to what was originally entered); on Android, they're all Android-head-emoji.

The corollary to "If it isn't there, people won't use it" is "If no one can see it properly, there's no reason to bother looking for it."

Doug Ewell on 6 Nov 2013 11:12 AM:

@Azarien: There does need to be an "existing use" requirement, so that everyone who's ever invented a notational system, or (like me) an alphabet, doesn't insist that it be formally encoded. The committees have already come a long way in this regard—in 1988, the criterion was publication in contemporary books and newspapers; today we have Phaistos, and new symbols invented to fill gaps in sets.

Proof of existing use doesn't seem to be part of the Cherokee issue anyway. According to the proposal, it wasn't completely understood when the original letters were encoded that uppercase forms even existed.

referenced by

2015/01/26 There's nothing small about Cherokee -- yet....
