Short-sighted text processing #1: Uniscribe filters nothing

by Michael S. Kaplan, published on 2010/12/18 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/12/18/10106789.aspx


It was just the other day that regular reader Random832 commented on my post "I think MaxLength needs protection to assure safer text":

Who decided that cutting off your string and beeping at you was good UI, anyway? Surely it'd be better to just not let the user submit the form until they delete enough stuff to fit [and provide interactive feedback about the limit, a la twitter]

Well, I'm only half-serious - i'm sure this was easier with the limitations programmers had to work with in 1983 - but surely no new programs should be using this "feature".

Now there's a question.

It is in fact an excellent point, though.

I mean, there is bold talk about it: one of the possible indications for "complex script support" is indeed "Filters out illegal character combinations" -- and this is called out for Thai, specifically. GoGlobal goes a bit further in its Complex Scripts FAQ:

Why do we need to filter illegal character combinations?
Since Thai syllables consist of a consonant optionally followed by one vowel and/or one tone mark, some character combinations (e.g., two vowel marks in succession) are nonsensical. Thus, one of the tasks of complex script enabling is to filter out or disallow illegal character combinations.
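
To make the FAQ's rule concrete, here is a minimal Python sketch of that kind of check. It is a deliberately simplified model (the code point ranges and the name `is_plausible_thai` are assumptions chosen for illustration, and leading vowels, MAITAIKHU, THANTHAKHAT and friends are ignored), not the logic of any Windows component:

```python
# Simplified model of the FAQ's rule; NOT Uniscribe's or the EDIT control's logic.
THAI_CONSONANTS = range(0x0E01, 0x0E2F)                    # KO KAI .. HO NOKHUK
THAI_COMBINING_VOWELS = {0x0E31, *range(0x0E34, 0x0E3B)}   # above/below vowel signs
THAI_TONE_MARKS = range(0x0E48, 0x0E4C)                    # MAI EK .. MAI CHATTAWA


def is_plausible_thai(text: str) -> bool:
    """Return False if a combining vowel or tone mark lacks a legal predecessor."""
    prev = None  # one of "consonant", "vowel", "tone", or None
    for ch in text:
        cp = ord(ch)
        if cp in THAI_CONSONANTS:
            prev = "consonant"
        elif cp in THAI_COMBINING_VOWELS:
            # A vowel sign needs a consonant directly before it.
            if prev != "consonant":
                return False
            prev = "vowel"
        elif cp in THAI_TONE_MARKS:
            # A tone mark may follow a consonant or a single vowel sign.
            if prev not in ("consonant", "vowel"):
                return False
            prev = "tone"
        else:
            # Anything else (Latin text, spaces, other Thai letters) resets state.
            prev = None
    return True
```

Under these assumptions a consonant followed by MAI EK passes, while a bare MAI EK, two vowel signs in a row, or two tone marks in a row are all flagged.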

Interestingly, it makes me think of the prototypical example of this behavior:

  1. Open Notepad
  2. Switch to the Thai Kedmanee keyboard
  3. Type the "J" key, which tries to type U+0E48, aka THAI CHARACTER MAI EK

Every time you hit the key, the computer will beep and insert nothing.

But can I tell you a secret?

You have to promise that you won't tell anyone. It is kind of embarrassing.

Uniscribe isn't doing that.

Seriously.

I can put a bunch of those characters in a row just fine, in text, in an application other than Notepad. And Uniscribe will display them.

I can even paste lines of them into Notepad:

or here:

่่่่่่่่่่่่่่่่่่่่่่่่
่่่่่่่่่่่่่่่่่่่่่่่่
่่่่่่่่่่่่่่่่่่่่่่่่
่่่่่่่่่่่่่่่่่่่่่่่่

and guess what? There's no problem with doing it.
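
If you want to convince yourself that nothing about the text itself forbids this, a small Python sketch using only the standard `unicodedata` module will do. The run length of 24 and the four lines are arbitrary choices for the example:

```python
import unicodedata

MAI_EK = "\u0e48"

# The character is an ordinary nonspacing combining mark.
print(unicodedata.name(MAI_EK))      # THAI CHARACTER MAI EK
print(unicodedata.category(MAI_EK))  # Mn

# Build a few lines made of nothing but tone marks, with no base character.
line = MAI_EK * 24
text = "\n".join([line] * 4)

# The string round-trips through UTF-8 untouched: storage does not object.
assert text.encode("utf-8").decode("utf-8") == text
```

Whether such a run is sensible Thai is a separate question from whether it is legal text, and only the second one is being checked here.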

The code that "filters" these characters sits in code called by the EDIT control that checks for two things:

If both are, while you are typing, this code that is not in Uniscribe itself will fail the attempt to insert the text, and it will beep.

Obviously that doesn't work so well for text that is already present (how do you scold someone for illegal text already typed?), so in that case Uniscribe will just do as it is told. And it will of course include the "empty circle" that implies a missing base character.
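
The split between "refuse while typing" and "accept whatever is already there" can be sketched in a few lines of Python. Everything here is hypothetical: `SketchEditBuffer`, its `reject_invalid` flag, and the one-rule validity test are stand-ins invented for illustration, assuming the control's conditions amount to something like "rejection is switched on" and "the new character would have no base". This is not the EDIT control's actual code:

```python
def _tone_mark_has_base(text: str) -> bool:
    """Hypothetical check: a trailing Thai tone mark must follow a base."""
    if not text or not (0x0E48 <= ord(text[-1]) <= 0x0E4B):
        return True  # last character is not a tone mark; nothing to check
    if len(text) < 2:
        return False  # tone mark with nothing before it
    prev = ord(text[-2])
    # Accept a Thai consonant or an above/below vowel sign as the predecessor.
    return 0x0E01 <= prev <= 0x0E2E or prev == 0x0E31 or 0x0E34 <= prev <= 0x0E3A


class SketchEditBuffer:
    """Toy model of a control that filters keystrokes but not pasted text."""

    def __init__(self, reject_invalid: bool = True) -> None:
        self.reject_invalid = reject_invalid  # assumed per-script switch
        self.text = ""
        self.beeps = 0

    def type_char(self, ch: str) -> bool:
        """Insert one typed character, or 'beep' and insert nothing."""
        if self.reject_invalid and not _tone_mark_has_base(self.text + ch):
            self.beeps += 1
            return False
        self.text += ch
        return True

    def paste(self, s: str) -> None:
        """Pasted text bypasses the typing-time check entirely."""
        self.text += s
```

In this toy model, typing U+0E48 into an empty buffer bumps `beeps` and leaves `text` empty, while pasting the very same character simply appends it, which is the asymmetry described above.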

Now there are several problems inherent in this direction for the text processing engine to go, and I am going to get into that more tomorrow.

But I wanted to start by saying that it is a limited number of dumb controls, the ones that screw with the input stream while you are typing, that are doing the work here -- Uniscribe filters nothing.

There are some folks who will like upcoming parts to this series, so I hope that (for example) Andrew West and Martin Hosken are around. Because both of them, and a few folks like them, are gonna like this one....


Andrew West on 18 Dec 2010 4:48 PM:

I'm always around, and if the upcoming parts are anything like this one then I am sure I will like them.  I never knew about this edit control issue before, and it pained me to test clicking on U+0E48 in BabelMap and hear the edit control beep at me (the edit control is the Achilles heel of BabelMap, and something that I have long wanted to replace, but you can get round it by selecting UCN or NCR mode before clicking on U+0E48, and then reselecting character mode afterwards).  Why would anyone write such evil code?

Michael S. Kaplan on 18 Dec 2010 11:31 PM:

If it's any consolation, this one spot is the only bit of code in the entire Windows code base that uses SCRIPT_PROPERTIES->fRejectInvalid, which means it is unlikely that anyone else is doing it (it's not like Uniscribe gave directions to help others do it anyway!)....

Doug Ewell on 20 Dec 2010 8:00 AM:

> Uniscribe isn't doing that.

Uniscribe is a rendering engine. Why would anyone suspect it is filtering input?

Michael S. Kaplan on 20 Dec 2010 9:13 AM:

Yet filtering illegal character combinations has long been pointed out as one of the central bullet points regarding complex scripts that Uniscribe is designed to deal with....

Doug Ewell on 20 Dec 2010 10:50 AM:

On input?

Michael S. Kaplan on 20 Dec 2010 10:57 AM:

Hey, I'm not defending it in the blog that points out it is not true. But without proof it was listed as one of five points requiring Uniscribe support. The docs are clear on this [falsehood]....

Michael S. Kaplan on 20 Dec 2010 11:10 AM:

See for example the About Complex Scripts topic in MSDN -- note that there are now *six* points, including the one I recommended back in 2005, and Uniscribe relates to an implementation of all of them....


referenced by

2011/08/15 If you change the behavior of typing sequences you should never type, is it a bug?

2011/04/28 The Sally Kimball Addition To The Dead Keys Conundrum: An Encyclopedia Brown Mystery

2011/01/06 Short-sighted text processing #6: OpenType and Apple and OpenType

2011/01/05 Short-sighted text processing #5: PU[A]! That pad THAI is pretty spicy....

2011/01/04 Short-sighted text processing #4: Squeezing every bit of text you possibly can out of MacOffice 2011

2010/12/30 Short-sighted text processing #3: The Protcols of the EDIT for i18n

2010/12/20 Short-sighted text processing #2: Getting hurt while playing on the bleeding edge
