Behind the return of the Unicode IME

by Michael S. Kaplan, published on 2006/07/22 16:09 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/07/22/675085.aspx

Nick Lamb is a regular reader who often keeps me on my toes.

In response to my recent post Return of the Unicode IME, Nick commented [line breaks inserted by me]:

Can users expect the final version to go beyond the BMP?

Also it seems short-sighted to skip characters that are unused but might perhaps be allocated later. The principle of forwards compatibility seems to argue in favour of allowing these characters despite the modest increase in file size.

It makes sense not to allow the user to enter U+FFFE since it does not and will never exist, but why not allow e.g. U+0620 which might some day get allocated?

These are some really excellent thoughts, and since they represent decisions about this sample that I really have to sort out before I proceed, I thought it might make sense to talk about them right here at SIAO. :-)

For the first version, I actually had written a small C# program that walked through the Unicode range from 0x0000 to 0xffff and used the previously discussed .NET 2.0 CharUnicodeInfo class to get all of the code points that were neither UnicodeCategory.OtherNotAssigned nor UnicodeCategory.Control.
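A rough Python rendering of that kind of walk might look like the sketch below. It is not the original C# program (which is not shown here); it assumes Python's `unicodedata` module as a stand-in for CharUnicodeInfo, where general category `Cn` corresponds to OtherNotAssigned and `Cc` to Control. The function name is made up for illustration, and the exact set it produces depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# General categories to skip: Cn = unassigned (OtherNotAssigned), Cc = control.
SKIPPED = {"Cn", "Cc"}


def listable_code_points(first=0x0000, last=0xFFFF):
    """Yield code points in [first, last] that are neither unassigned nor controls.

    Note that this filter keeps private-use (Co) and surrogate (Cs) code points,
    since only the two categories above are excluded.
    """
    for cp in range(first, last + 1):
        if unicodedata.category(chr(cp)) not in SKIPPED:
            yield cp


if __name__ == "__main__":
    bmp = list(listable_code_points())
    print(f"{len(bmp)} BMP code points kept")
```

Passing `last=0x10FFFF` would extend the same walk to the supplementary planes.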

The issue of supplementary characters was actually the one that delayed the post by over a week, interestingly enough.

The first version of the IME had every single assigned Unicode code point from 0x0000 to 0x10ffff, minus the C0 and C1 controls.

The file was over 7 MB in size, which seemed a little bigger than I wanted to start with!

So I started by just trimming out the Plane 15 and Plane 16 Private Use Area characters, as it did seem like these could be left out for now and I figured they could always be added back later.

I rejected the idea of being "consistent" in my PUA treatment and getting rid of everything in UnicodeCategory.PrivateUse. Because despite the bias I seem to show in posts like The PUA? P.U.! and Keeping out the Undesirables, I am not trying to punish people who are using it....

Anyway, this made a file that was about 3 MB in size, which seemed okay (I had already decided to provide a zipped version of the file anyway by that point).

But then I started using the file that included these Plane 1, 2, and 14 characters....

What I noticed was how annoying it was to have so many of the 1### and 2### characters not automatically being inserted, since there were a ton of potential characters that were actually 1#### and 2#### characters -- it really made the whole IME unusable for some significant pieces of the Basic Multilingual Plane!

I realized that the Plane 15 and 16 PUA that I had stripped out before I even tried the IME would have the same type of problem, adding a whole bunch of f#### and 10#### type characters to the mix.
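To make the collision concrete, here is a small Python sketch (an illustration only, not part of the IME) that finds four-digit BMP entries which are also the first four digits of a longer supplementary entry. It assumes entries are typed as bare hex digits, zero-padded to at least four; the function names are hypothetical.

```python
def hex_key(cp):
    """Format a code point the way it would be typed: at least four hex digits."""
    return f"{cp:04X}"


def shadowed_bmp_keys(code_points):
    """Return four-digit keys that are also the first four digits of a longer key."""
    keys = {hex_key(cp) for cp in code_points}
    long_prefixes = {k[:4] for k in keys if len(k) > 4}
    return sorted(k for k in keys if len(k) == 4 and k in long_prefixes)


if __name__ == "__main__":
    sample = [0x0041, 0x1000, 0x10001, 0x2000, 0x20000, 0xF000, 0xF0000]
    print(shadowed_bmp_keys(sample))  # ['1000', '2000', 'F000']
```

Every key in that result is one the IME can no longer insert automatically after the fourth digit, because a fifth digit might still be coming.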

So anyway, for the initial beta, I decided to take them all out. I am thinking about the issue and what to do though -- I am leaning toward putting them all under a special prefix like maybe S so you would type S10001 to get LINEAR B SYLLABLE B038 E inserted.
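As a sketch of how such a prefix could remove the ambiguity, the Python below maps typed text to a character. This is an assumption-laden illustration, not the IME's table format: the `S` prefix is only the one floated above, the function name is hypothetical, and rejecting lone surrogates is a choice made for this example.

```python
import string
import unicodedata

SUPPLEMENTARY_PREFIX = "S"  # hypothetical; the post only floats "maybe S"


def parse_entry(typed):
    """Map typed text such as '0041' or 'S10001' to a character, or None."""
    if typed.upper().startswith(SUPPLEMENTARY_PREFIX):
        # Prefixed entries carry five or six hex digits (planes 1 through 16).
        digits, low, high, lengths = typed[1:], 0x10000, 0x10FFFF, (5, 6)
    else:
        # Unprefixed entries are always exactly four hex digits (the BMP).
        digits, low, high, lengths = typed, 0x0000, 0xFFFF, (4,)
    if len(digits) not in lengths:
        return None
    if not all(ch in string.hexdigits for ch in digits):
        return None
    cp = int(digits, 16)
    if not low <= cp <= high or 0xD800 <= cp <= 0xDFFF:
        return None  # out of range for this form, or a lone surrogate
    return chr(cp)


if __name__ == "__main__":
    ch = parse_entry("S10001")
    print(unicodedata.name(ch))  # LINEAR B SYLLABLE B038 E
```

Because `S` is not a hex digit, an unprefixed entry can be committed as soon as its fourth digit arrives, which is the behavior the bare five-digit entries broke.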

This is of course not a perfect solution since it means more to type, and an even bigger file.

Another solution that occurred to me was having separate files for the separate planes, but that solution seemed somewhat unsatisfying to me, and the idea of one for the BMP and one for everything else was only slightly better; it still seemed like the wrong way to go.
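For what it is worth, the per-plane split itself is mechanically simple, since a code point's plane is just its value shifted right by 16 bits. A minimal Python sketch follows, with hypothetical function names and no claim about how the actual table files would be laid out:

```python
from collections import defaultdict


def plane_of(cp):
    """Return the plane number (0 through 16) of a code point."""
    return cp >> 16


def split_by_plane(code_points):
    """Group code points into a dict keyed by plane number."""
    planes = defaultdict(list)
    for cp in code_points:
        planes[plane_of(cp)].append(cp)
    return dict(planes)


if __name__ == "__main__":
    sample = [0x0041, 0x10001, 0x20000, 0xE0001, 0xF0000, 0x100000]
    for plane, members in sorted(split_by_plane(sample).items()):
        print(plane, [f"{cp:04X}" for cp in members])
```

The BMP-versus-everything-else variant would just test `plane_of(cp) == 0` instead of keeping one bucket per plane.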

Anyone have any thoughts here on either solution? And on what might be the best character to use if the prefix idea is used? Or another idea entirely?

The other issues he raises are all kind of related, and they make me want to misquote Ernie Hudson from Congo and say "The Unicode tribe has many levels of 'unassigned'. A code value isn't unassigned until UTC completely stops assigning characters." :-)

What I mean is that as far as I can recall there is no specific Unicode property that distinguishes between e.g. 0x0620 (not yet allocated) and 0xFFFF (perpetually reserved), so a bit more work would have to go in to allow one but not the other (assuming the list is captured somewhere).

And then there are those "kinda" reserved slots: there is no way to fill them with characters from the actual block, but when real estate gets tight enough they *might* be used for something unrelated... that is not captured anywhere, really. The code point assignments I am referring to here would likely be very controversial in both the UTC and in WG2. It makes me less eager to try and sort through and categorize the UnicodeCategory.OtherNotAssigned characters....

If nothing else I'd hate to end up having battles with people about the issue. :-)

So, thoughts? I am definitely interested in feedback here for the next update of the input method....

This post brought to you by 𐀁 (U+10001, a.k.a. LINEAR B SYLLABLE B038 E)

# Tom Gewecke on 22 Jul 2006 4:57 PM:

How about using the prefix 0 (zero) for the BMP?

# Michael S. Kaplan on 22 Jul 2006 5:06 PM:

How about using the prefix 0 (zero) for the BMP?

That is an interesting possibility, Tom.

Though I would hate to lose the ability to type four things for the BMP code points (plus if it ever does support plane 15 there are many 0f### chars there -- leading to a whole bunch of confusion for all of the 0fXX chars in the BMP).

# Michael S. Kaplan on 24 Jul 2006 6:03 PM:

Any other thoughts? C'mon people, this is *your* beta!

# Tom Gewecke on 25 Jul 2006 9:35 AM:

Is there any way to make the IM require a return or a space or other character to mark the end of the number sequence, so nothing would happen until the user typed that?

# Michael S. Kaplan on 25 Jul 2006 9:38 AM:

That is how it was originally working, but I wanted to make it like the original Unicode IME where that was not required.

Which is not to say I am completely wed to that idea -- do you think it would be better to require the commit step to insert the character?

# Tom Gewecke on 25 Jul 2006 10:32 AM:

Yes, I think that a commit step would be fine. In OS X, which has a similar IM, you have to input both surrogates to go beyond the BMP, which is of course terrible.

# John Cowan on 21 Jan 2008 1:52 PM:

The Unicode property that U+FFFF has and U+0620 does not is, of course, Noncharacter_Code_Point in file http://www.unicode.org/Public/UNIDATA/PropList.txt . This list should be stable.


referenced by

2008/06/21 Back to Sri Lanka (conceptually)

2008/01/21 Behold the Table Driven Text Service, Part 0 (You have to start somewhere!)
