The Letter Police can EAT MY SHORTS!

by Michael S. Kaplan, published on 2009/07/22 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2009/07/22/9844827.aspx

Regular reader from way back (and perhaps still regular reader!) Jan Kučera asked over in the Suggestion Box:

Very good question, that first one. The second one freaks me out a little though. :-)

This Word-specific extension to Uniscribe's own forays into the same feature in Thai is something I often knock for two specific reasons:

Now I kind of indirectly discussed some of my own problems with this scenario in Why international test is an art (and why there are few fine artists), where I mentioned the challenges in supporting Thai in MSKLC due to the need to be able to display individual letters on keyboard keys that are for most scenarios illegal in isolation (an illegality enforced by Uniscribe).

Personally I am opposed to this feature in any language for all of the reasons given, and am thankful for the times I can turn it off.

(And of course the other option is to put in a SPACE or a NO-BREAK SPACE and use it as the "base" character; this is how I solved the problem for Thai in MSKLC, since there is no way to turn off the feature in Uniscribe like there is for Word.)

This is also the workaround for the Visual Studio case or really any attempt at being the LETTER POLICE one may run across....
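For anyone who wants to see that workaround in code form, here is a minimal sketch in Python. It is not how MSKLC does it; the function name with_base and the choice of general categories Mn, Mc and Me as the test for "combining mark" are assumptions made purely for illustration.

```python
import unicodedata

NO_BREAK_SPACE = "\u00a0"

# General categories treated as "combining" in this sketch (an assumption):
# Mn = nonspacing mark, Mc = spacing combining mark, Me = enclosing mark.
MARK_CATEGORIES = ("Mn", "Mc", "Me")


def with_base(text, base=NO_BREAK_SPACE):
    """Return text with a base character prepended if it starts with a mark.

    with_base is a hypothetical helper: it only inspects the first code
    point and leaves everything else in the string untouched.
    """
    if text and unicodedata.category(text[0]) in MARK_CATEGORIES:
        return base + text
    return text


if __name__ == "__main__":
    # THAI CHARACTER SARA I (U+0E34) is a nonspacing mark, so it gets a base.
    print([hex(ord(c)) for c in with_base("\u0e34")])
    # THAI CHARACTER KO KAI (U+0E01) is a letter, so it is returned as-is.
    print([hex(ord(c)) for c in with_base("\u0e01")])
```

Passing base=" " gives the plain SPACE variant instead. Whether a given renderer or application then accepts the sequence is a separate question that the sketch does not answer.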

And if I could plead with anyone wanting to implement either smart or "unsmart" versions of this feature? Please don't! :-)

If you want to have input smarts around a specific language, put it into an input method, not into an application. Three reasons:

1. Development Cost: The complex language rules only have to be implemented once, not by every application. Don't expect application developers to get the rules right; they are complex and need computational linguists to implement and test.

2. User Experience: You get consistent input behaviour across all applications on the system. See point 1.

3. Language Support: You can add another input method to support other languages for a given script relatively easily, but fixing incorrect assumptions about how the script is used in multiple applications can be very difficult and take a long time.

We have a long way to go in terms of input method support and consistent behaviour across applications.

A couple of other examples where Office has tried too hard include:

* Blocking Latin script diacritic combinations it doesn't know about (goodbye Africa!) - finally fixed in Office 2007

* Automatic (and incorrect) transformation of medial sigma in Polytonic Greek texts

Hi Michael!

I'm still here, sure :) thanks for getting into this.

Couple of thoughts:

1. In consumer applications like Word I am willing to accept that some "base character" should be used for users creating instructional/educational materials. But what would you suggest in this case? I am not allowed to enter either SPACE or NO BREAK SPACE followed by the combining mark in Word. (Not sure if it is correct, but I also tried to use ZWJ and it did not help either.)

(Wouldn't it be more helpful if the input system added such "base" automatically instead of blocking the input completely? ...just asking)

2. Not sure if I got this one - is it possible to turn this POLICE off in Word?

3. Maybe off topic, but I've found that Office applications also automatically transform two consecutive syllables into a consonant+syllable pair (e.g. பப into ப்ப), which, though it is usually the much more likely intended combination, makes it really tricky to enter words like கட்டடம், especially into an Excel cell... I wonder if there is any list available of these "input enhancements", or if they are based on some international recommendations...

3. Writing code that works with character or string tables, for example, requires using raw combining characters without any helper base. I don't know about others, but using 'ௌ' instead of '\u0bcc' makes for much more readable code, at least for me. Finally we have a new text editor in VS 2010 which no longer has these POLICIES. Cool :]

4. Well, sure, I had no intention of implementing any of these "features". My interest was actually more the other way. So if I got it correctly, I can't turn off this POLICY for any of the available input methods, even as a developer... right?

Thanks and have a nice weekend!