It does not always pay to be compatible

by Michael S. Kaplan, published on 2005/10/07 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/10/07/478127.aspx


A few months ago, someone emailed me about some trouble they were having with an Arabic keyboard that they had created with MSKLC. They were confused that none of the letters ever seemed to shape. I asked them to send me the .KLC file so I could take a look. I went ahead and loaded it up in MSKLC on my machine:

It looked okay, although frankly it looked pretty similar to the Arabic 101 keyboard that has been in Windows.

But then I ran a tool that showed the code points instead of the characters:

Yikes! They were using characters in the Arabic Presentation Forms (A and B). Suddenly it became clear.
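A check like that is easy to sketch. The snippet below is not the tool mentioned above, just a minimal Python illustration; the helper names `is_presentation_form` and `dump_code_points` are made up for this example. It lists each code point in a string and flags anything that falls in the two Arabic Presentation Forms blocks:

```python
import unicodedata

# Block ranges as defined by the Unicode Standard.
PRESENTATION_BLOCKS = (
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B (the block also holds U+FEFF)
)

def is_presentation_form(ch):
    """Return True if ch lies in one of the presentation-form blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRESENTATION_BLOCKS)

def dump_code_points(text):
    """Return one 'U+XXXX NAME' line per character, flagging presentation forms."""
    lines = []
    for ch in text:
        name = unicodedata.name(ch, "<unnamed>")
        flag = "  <-- presentation form" if is_presentation_form(ch) else ""
        lines.append(f"U+{ord(ch):04X} {name}{flag}")
    return lines

# One real letter followed by one pre-shaped letter.
for line in dump_code_points("\u0628\ufe91"):
    print(line)
```

Two characters that look nearly identical in a keyboard layout view come out with very different names here, which is what gives the problem away.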

You see, Arabic is written in a script that shapes: a letter changes its appearance depending on the letters around it. So let us take U+0628 (ARABIC LETTER BEH). By itself, it looks like this:

ب

But things change when you combine it. Let us say it is at the beginning of the word, followed by U+062a (ARABIC LETTER TEH):

بت

That character on the far right is the BEH -- see how it looks different now?

Ok, let us say that it's surrounded by two different letters, say preceded by U+062e (ARABIC LETTER KHAH) and followed by U+062a (ARABIC LETTER TEH):

خبت

See that BEH in the middle there?

And now to round things out, let us say that the BEH is at the end, say after U+062e (ARABIC LETTER KHAH):

خب

Well, these four forms are known as the ISOLATED, INITIAL, MEDIAL, and FINAL forms.
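To make the four positions concrete, here is a deliberately naive Python sketch. It assumes every letter in the word joins on both sides, which holds for BEH, TEH, and KHAH but not for every Arabic letter, and the function name `positional_forms` is hypothetical. Real shaping is done by the font and the shaping engine, not by counting positions like this:

```python
import unicodedata

def positional_forms(word):
    """Label each letter ISOLATED, INITIAL, MEDIAL, or FINAL by position.

    Assumption: every letter in `word` connects to both of its neighbors.
    """
    last = len(word) - 1
    result = []
    for i, ch in enumerate(word):
        if last == 0:
            form = "ISOLATED"   # a single letter standing alone
        elif i == 0:
            form = "INITIAL"    # first letter of the word
        elif i == last:
            form = "FINAL"      # last letter of the word
        else:
            form = "MEDIAL"     # joined on both sides
        result.append((unicodedata.name(ch), form))
    return result

# KHAH + BEH + TEH, the three-letter example from above.
for name, form in positional_forms("\u062e\u0628\u062a"):
    print(f"{name}: {form}")
```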

Now back in the days before fonts were smart enough to do this sort of shaping, many legacy standards were built by actually encoding every possible form for each letter, thus:

   U+fe8f (ARABIC LETTER BEH ISOLATED FORM)

   U+fe91 (ARABIC LETTER BEH INITIAL FORM)

   U+fe92 (ARABIC LETTER BEH MEDIAL FORM)

   U+fe90 (ARABIC LETTER BEH FINAL FORM)
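The Unicode character database records how each of these relates to the real letter. As a small illustration (Python's standard `unicodedata` module; the helper name `describe` is made up for this example), each of the four forms carries a compatibility decomposition that points back at U+0628:

```python
import unicodedata

def describe(cp):
    """Return (character name, decomposition field) for a code point."""
    ch = chr(cp)
    return unicodedata.name(ch), unicodedata.decomposition(ch)

# The four BEH presentation forms listed above.
for cp in (0xFE8F, 0xFE91, 0xFE92, 0xFE90):
    name, decomposition = describe(cp)
    print(f"U+{cp:04X} {name} -> {decomposition}")
```

The decomposition field contains a tag such as `<initial>` followed by `0628`, meaning "this is BEH, already shaped into one particular form".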

By combining the correct form of BEH with the correct form of KHAH or TEH, you can sometimes make something that looks right (other times, the way the letters shape will cause these presentation forms to look not quite right). Add to that problem the need for nearly four times the number of letters, plus the need to train people to type the correct form depending on where the letter falls in the word, and it is just a nightmare.

If you do not know the language this can be hard to conceptualize. So let us try doing it with the Latin script, using cursive writing. It is even more complicated, due to the wide variety of attachment points.

b

ob

oba

ba

and so on -- to support English alone in such cases would easily require 1000 or more glyphs, and you would have a really hard time writing without constantly picking the right letters.

So, the rule with presentation forms is that they are "pre-shaped" and thus do not need to be shaped again. But, as with cursive Latin, the exact attachment points may not always be the same, so it is best to use the real Arabic letters rather than the presentation forms. It will save you from needing to remember every form before you type a letter.... because as the title says, it does not always pay to be compatible -- especially when one is being compatible with a hard-to-use legacy standard....
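If you are already stuck with text that was typed using presentation forms, one possible way to get back to the real letters is compatibility normalization. This is a minimal Python sketch and not something the original keyboard author did; the function name `to_real_letters` is hypothetical. Note that NFKC also folds many other compatibility characters (fullwidth letters, ligatures, and so on), so it is a blunt instrument:

```python
import unicodedata

def to_real_letters(text):
    """Map compatibility characters, including Arabic presentation forms,
    back to their ordinary equivalents using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)

# KHAH INITIAL FORM + BEH MEDIAL FORM + TEH FINAL FORM, all pre-shaped.
preshaped = "\ufea7\ufe92\ufe96"
plain = to_real_letters(preshaped)

# Show the code points of the result.
print(" ".join(f"U+{ord(ch):04X}" for ch in plain))
```

The result is the plain KHAH, BEH, TEH sequence, which the font can then shape properly on its own.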

 

This post brought to you by "ب" (U+0628, a.k.a. ARABIC LETTER BEH)


# Jonathan on 9 Oct 2005 4:00 AM:

Actually, a better example is U+062D ARABIC LETTER HAH, whose forms are even more different (or was it MEEM, I forget).

# Michael S. Kaplan on 9 Oct 2005 8:01 AM:

There are several different choices that are possible here, of course....


