Keeping it simple, with complex scripts

by Michael S. Kaplan, published on 2005/01/17 03:16 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/17/354279.aspx

Let's face it, sometimes a language has complex typographical issues in it. If it were all easy to do then typography would not be such a complex and lucrative field, one that requires both technical understanding of language and artistry, which I do not myself have. But the more I learn about the issues that folks in MS Typography (a part of the GIFT team I am on!) deal with, the more respect I have for them and the folks who support the rendering of these fonts.

There is a help topic on MSDN that captures the essence of the complexity entitled About Complex Scripts, which starts off with a very succinct summary:

A complex script has at least one of the following attributes:

Allows bidirectional rendering.
Has contextual shaping.
Has combining characters.
Has specialized word-breaking and justification rules.
Filters out illegal character combinations.

Now the above is great if you are familiar with a language that uses the Hebrew, Thai, or Arabic script, but is not so useful if you only know about a language like English; it therefore goes on and gives other examples, some of which a wider range of people can identify with:

Bidirectional rendering refers to the script's ability to handle text that reads both left-to-right and right-to-left. For example, in the bidirectional rendering of Arabic, the default reading direction for text is right-to-left, but for some numbers, it is left-to-right. Processing a complex script must account for the difference between the logical (keystroke) order and the visual order of the glyphs. In addition, processing must properly deal with caret movement and hit testing. The mapping between screen position and a character index for, say, selection of text or caret display requires knowledge of the layout algorithms.

Contextual shaping occurs when a script's characters change shape depending on the characters that surround them. This occurs in English cursive writing when a lowercase "l" changes shape depending on the character that precedes it such as an "a" (connects low to the "l") or an "o" (connects high). Arabic is a script that exhibits contextual shaping.

Combining characters or ligatures are characters that join into one character when placed together. One example is the "ae" combination in English; it is sometimes represented by a single character. Arabic is a script that has many combining characters.

Specialized word break and justification refers to scripts that have complex rules for dividing words between lines or justifying text on a line. Thai is such a script.

Filtering out illegal character combinations occurs when a language does not allow certain character combinations. Thai is such a script.
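
Two of the attributes described above, bidirectionality and combining characters, show up in the Unicode character properties themselves. As a minimal sketch (using Python's standard unicodedata module; the describe helper and the sample string are made up for illustration and are not part of the MSDN topic), this lists the bidirectional class and canonical combining class of each code point:

```python
import unicodedata

def describe(text):
    """Return (code point, bidi class, combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.combining(ch))
        for ch in text
    ]

# Hypothetical sample: a Latin letter, an Arabic letter,
# an Arabic-Indic digit, and a combining acute accent.
sample = "a\u0628\u0661\u0301"

for row in describe(sample):
    print(row)
```

The Latin letter reports class L, the Arabic letter reports AL (right-to-left), and the Arabic-Indic digit reports AN, which is the property behind the "numbers go the other way" behavior. The accent reports NSM with a non-zero combining class, marking it as something that attaches to a preceding base character.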

I really like this text (and have for as long as I have known about it), especially the part about contextual shaping. I have personally seen it make a difference in understanding to people who were having trouble grokking the nature of complex scripts and who really did not have the time to learn a language like Thai or Arabic.

Those same people who would actually rail on about how the Arabic is "ridiculous" and "unnecessarily intricate" for having a letter change form depending on its position in a word stopped thinking so when they realized that they had been doing the same sort of thing in cursive writing of their own language since they were children.

People even start to get impressed about cursive fonts like Script MT Bold¹ when they realize how much work it is to deal with all of those contextual differences. And they understand why the legacy characters in Arabic Presentation Forms-A and Arabic Presentation Forms-B that encode up to the four different Arabic forms may not be the best way to represent Arabic text, especially when they imagine an analogous compatibility Latin block with multiple forms of many ordinary letters in the English alphabet.
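
To make the presentation forms concrete, here is a small sketch (Python's standard unicodedata module; the base_letter helper is a name invented for this example) showing that the four encoded forms of the Arabic letter BEH all fold back to the one nominal letter U+0628 under NFKC compatibility normalization:

```python
import unicodedata

# The four presentation forms of ARABIC LETTER BEH (U+0628)
# from the Arabic Presentation Forms-B block.
beh_forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

def base_letter(ch):
    """Fold a presentation form back to its nominal letter via NFKC."""
    return unicodedata.normalize("NFKC", ch)

for form in beh_forms:
    print(
        f"U+{ord(form):04X}",
        unicodedata.name(form),
        unicodedata.decomposition(form),
        "->",
        f"U+{ord(base_letter(form)):04X}",
    )
```

Text stored as the nominal letter leaves the choice of isolated, final, initial, or medial shape to the rendering engine, which is the job the presentation forms try to do by hand.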

And suddenly people kind of like that there are controls (like RichEdit) and libraries (like Uniscribe and GDI+) and operating systems (like Windows) that deal with most of these issues automatically. There are some basic conventions upon which RichEdit, Uniscribe, and GDI+ largely agree:

When moving through text with an arrow key, move through one text element at a time, since the user often thinks of them as a single character.
When selecting text, select entire text elements rather than pieces of them.
When forced to break lines, try to break at word boundaries; if that is not possible then at least try to break at text element boundaries so that the integrity of what the user thinks of as a character is not destroyed.
When hitting the DELETE button to delete the text in front of the cursor, delete the entire text element.
When hitting the BACKSPACE button to delete behind the cursor, usually delete just the code point since the user may have typed it in that way and may be unhappy in the case of typos to lose multiple code points (though more sophisticated processors like RichEdit will properly delete surrogate pairs in their entirety since they were almost certainly not typed separately).
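
The DELETE and BACKSPACE conventions above can be sketched in a few lines. This is a deliberately simplified model, not what RichEdit or Uniscribe actually do: it assumes a text element is just a base character plus any combining marks that follow it, whereas real text element rules (Unicode's grapheme cluster rules, for instance) cover many more cases. All function names are invented for the example.

```python
import unicodedata

def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def text_elements(s):
    """Group each base character with the combining marks that follow it."""
    elements = []
    for ch in s:
        if elements and is_mark(ch):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements

def delete_forward(s, pos):
    """DELETE at code point index pos: remove the whole text element."""
    end = pos + 1
    while end < len(s) and is_mark(s[end]):
        end += 1
    return s[:pos] + s[end:]

def backspace(s, pos):
    """BACKSPACE at code point index pos: remove only one code point."""
    if pos <= 0:
        return s
    return s[:pos - 1] + s[pos:]

word = "e\u0301x"  # "e" + COMBINING ACUTE ACCENT + "x"
print(text_elements(word))            # two elements, not three
print(repr(delete_forward(word, 0)))  # the accented "e" goes as a unit
print(repr(backspace(word, 2)))       # only the accent is removed
```

Python strings are indexed by code point, so the surrogate pair case mentioned above does not arise here; in a UTF-16 API it would need its own check.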

Obviously when applications are consistent that helps the user to understand behavior on the computer, and when the operating system provides a library like GDI+ or Uniscribe it helps developers to do all of this without having to program all of the individual behaviors in their own code. So everybody wins!

Well, there are the occasional hiccups, like the one I described a few days ago in 'CharNext(ch) != ch+1, a lot of the time', behavior which affected many browsers including IE, FireFox, and Opera (heads up to the FireFox people, they would have gotten me to abandon IE on the spot if I knew they were so in tune with international issues that this worked well without someone like me even prompting -- I get downright ornery about poor/inconsistent international support! But alas, you missed your chance...).

The other place I tend to notice problems is in applications like DVArchive, which I use to communicate with a small bank of ReplayTV units (don't know what they are? think Tivo). It is a formerly open source project (now closed source) which reportedly runs on any platform that runs Java 1.4.2. Though this is of course where cross-platform projects (whether open source or closed) run into trouble -- they tend to avoid platform-specific features, and the complexity of libraries like Uniscribe almost certainly causes them to be platform-specific. I even miss simple keyboard shortcuts that work in every other application but fail in this one. :-(

But that last paragraph is fairly offtopic and is a complex issue of another type, so I will move back to the relatively simpler area of complex scripts, now. I'll give more on my thoughts about cross platform applications another time. :-)

There are people who object to the term complex scripts. The reasons vary, but I will mention two categories of those reasons, briefly.

I have known people in East Asia who deal with ideographic languages and who see the term as decreasing emphasis on their languages, which is really not the case since line breaking is still an issue with ideographic languages. In fact, support of these languages with proper line breaking led to some of the first technological attempts at solutions on Windows for complex scripts in general. People may not always lump them in with the conventional notion of complex scripts, but it is a simple fact that ideographic languages will not look as good when Uniscribe is not enabled to help their rendering....

I have also known people who speak one of the languages that is impacted (such as Arabic or Hindi or Tamil) who think of it as being somehow insulting to call their language "complex". But I can promise that no insult is intended -- this is just a recognition that some languages use scripts that require more effort to support correctly. I find certain complex scripts to be the most aesthetically pleasing, enough that I am a little afraid to learn how to read a language like Thai, since not knowing how to read it allows even something as mundane as a grocery list to appear beautiful to me.

Thanks to the work that has been done on the platform, support of complex scripts can often be as simple as you need it to be!

1 - If you do not have the Monotype font Script MT Bold on your machine, this will not look like script to you, and the demonstration did not work for you. Sorry!

This post brought to you by " ุุ" (U+0e38, a.k.a. THAI CHARACTER SARA U)
(Note that THAI CHARACTER SARA U is illegal to start a line with, so attempting to type it by itself in Notepad will fail since it is an illegal sequence -- try copying from this page and pasting it somewhere to see whether your application handles it right!)
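
For the curious, here is a rough sketch of why SARA U cannot stand at the start of a line: it is a nonspacing combining mark, so it needs a base character in front of it. The check below uses Python's standard unicodedata module; starts_with_combining_mark is a name made up for this example, and it only illustrates the character property, not the actual input rules that Notepad or Uniscribe apply to Thai.

```python
import unicodedata

def starts_with_combining_mark(text):
    """True if the first code point is a mark (category Mn, Mc, or Me)."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")

sara_u = "\u0e38"               # THAI CHARACTER SARA U on its own
ko_kai_sara_u = "\u0e01\u0e38"  # THAI CHARACTER KO KAI followed by SARA U

for text in (sara_u, ko_kai_sara_u):
    print([unicodedata.name(ch) for ch in text], starts_with_combining_mark(text))
```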

# Jonathan on 17 Jan 2005 2:08 AM:

The main problem with the term "complex script" is that it's not obvious to laymen - which is why I like the actual CPL text: "complex script and right-to-left languages", which is more obvious, at least for my case (Hebrew).

Regarding "Complex" as an insult - It will insult those who are looking for something to be insulted by. For example, I know a native Chinese person (from PRC) who calls the Chinese used in Taiwan "complicated", instead of the official, politically-correct "Traditional". Or maybe it's insulting to Taiwan people?

# Leons Petrazickis on 17 Jan 2005 7:53 AM:

How about 'intricate'? Turning 'intricate' into an insult is a challenge.:)

# Raymond Chen on 17 Jan 2005 2:33 PM:

("intricate" can have the connotation of "unnecessarily complicated")

How about "rich", "sumptuous", "subtle", "mature", "refined" or simply "advanced"...

# Doug on 13 Feb 2005 9:13 AM:

How about "get over it, your script doesn't use 26 plain letters".

# Doug on 13 Feb 2005 9:14 AM:

I'm surprised to read that one of the characteristics considered to define a "complex script" is combining characters.

Latin and Greek both can -- and in some languages/usages, must -- use combining accents with letters. And even English uses ligatures for "fli", "fi", etc.

And Microsoft defines Latin and Greek as "simple scripts".

# Koji Ishii on 28 Nov 2007 2:27 PM:

Which would you call Japanese, simple or complex?

From your definition, it looks like Japanese is complex due to its "word break and justification." Sometimes with "combining characters." I'm not sure if glyph substitution in vertical writing is "contextual shaping," since it's not contextual, but is shaping.

Maybe Japanese is in between simple and complex; not as complex as bidirectional languages, but also not as simple as Latin languages. I hope you find a good place to classify these languages too, so that they are not forgotten in MSDN.

# Tanveer Badar on 16 Dec 2007 9:05 AM:

"Filtering out illegal character combinations occurs when a language does not allow certain character combinations. Thai is such a script."

Similar rules exist for Urdu too. To list a few

1- No word shall begin with 'ھ' (U+06BE), ء (U+0674)، ے (U+06D2) or ڑ (U+0691).

2- Only certain combinations of 'ھ' are allowed with other characters like ب، پ، ت، ٹ، ج، چ، د، ڈ، ک، گ۔

And the list is not exhaustive and may not be accurate, I don't remember all the combinations at the moment.

3- This character is not a substitute for 'ہ' (U+0647) as sometimes seen in modern writings when people are not so linguistically aware.

# Michael S. Kaplan on 16 Dec 2007 11:16 AM:

The filtering only happens for Thai, AFAIK -- What is illegal in Urdu may not be illegal in the other 80+ languages that use the Arabic script.....

# Tanveer Badar on 28 Dec 2007 6:42 PM:

I was giving example of the language I am rather fluent in.

# Hermann Weber on 27 Mar 2011 10:10 AM:

Going through your blog all day long trying to find a solution for my situation, I am still not sure if this can be done at all:

Let's assume I have a Thai sentence (which doesn't use a space between words) which does not include Zero-Width-Word-Breakers, is Uniscribe capable of telling me the different words in it?

I wouldn't have dreamt that this might be possible at all with MS-only techniques, but when I used Ctrl and the arrow keys in Notepad to go to the next word, Notepad did jump to the next word in this sentence, and I wonder how it does it.

Is this a Uniscribe feature?

# Clearlycrystal on 28 Oct 2011 12:03 AM:

Michael, the people who were raised speaking a language do not consider their language complex. But for people who are learning to read and write the language it is complex.

