Short-sighted text processing #5: PU[A]! That pad THAI is pretty spicy....

by Michael S. Kaplan, published on 2011/01/05 07:01 -05:00, original URI:

Previous blogs in this series:

This part of the series is going to talk about a slightly different issue than the stuff I mentioned in Parts 1 and 2.

It is about the way that Thai was supported until its support in Unicode became more widespread, and the way that Lao is still supported in a lot of places while Lao Unicode support is finally gaining more ground.

Some of this will apply to other scripts as well, but Thai and Lao are the two languages/scripts where there seems to be a lot of it.

Now some of the interesting official support for Thai, if you go back to Windows versions up through Server 2003, took advantage of two specific OpenType tables: GSUB (the Glyph Substitution Table) and GPOS (the Glyph Positioning Table). And they did it in an interesting way, one that pissed off some people.

They used some slots in the PUA (private use area) of Unicode.

 Now these characters are quite similar to some actual characters in the Thai Unicode block, but small differences help a lot when you are trying to support a complex script!
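To make "PUA" concrete: the Private Use Area in the Basic Multilingual Plane runs from U+E000 through U+F8FF, and every code point in it has the general category Co. Here is a minimal Python sketch that checks both. The helper name `is_bmp_pua` is mine, and U+F700 is just an arbitrary PUA code point picked for illustration, not a claim about which slots the fonts used.

```python
import unicodedata

BMP_PUA_FIRST = 0xE000
BMP_PUA_LAST = 0xF8FF

def is_bmp_pua(ch: str) -> bool:
    """Return True if ch is a single code point in the BMP Private Use Area."""
    return BMP_PUA_FIRST <= ord(ch) <= BMP_PUA_LAST

# U+0E48 THAI CHARACTER MAI EK is a real, encoded Thai tone mark.
print(is_bmp_pua("\u0e48"), unicodedata.category("\u0e48"))  # False Mn

# U+F700 is an arbitrary PUA code point (illustration only).
print(is_bmp_pua("\uf700"), unicodedata.category("\uf700"))  # True Co
```

The point of the category check is that Unicode itself says nothing about what a Co code point means; only a private agreement (here, a particular font) does.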

Others had previously mentioned this, like Peter Constable when he worked for SIL, in docs like Use of the Unicode Private Use Areas by Software Vendors and Handling of PUA Characters in Microsoft Software.

Now as it turns out, there are some interesting consequences of this.

You see, some time after Server 2003 shipped (I'd have to ask someone like Simon or Judy to know exactly when), this kind of use of the PUA was determined to be really bad. It just got many people very upset and unhappy.

From a language/script support standpoint, OpenType can substitute and position glyphs even if they are not in the font's CMAP and don't have code points, so there was no special reason that they had to be in the PUA.
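A toy model of that point, with plain Python dicts standing in for the real tables. Everything here is an assumption made for illustration: the glyph names are invented, and a real GSUB lookup is far richer than a pair-to-glyph dict. What it shows is that the alternate glyph is reachable through substitution alone, with no cmap entry and no code point.

```python
# Toy cmap: code point -> glyph name (names are invented).
CMAP = {
    0x0E1B: "po_pla",  # U+0E1B THAI CHARACTER PO PLA, a consonant with a tall ascender
    0x0E48: "mai_ek",  # U+0E48 THAI CHARACTER MAI EK, a tone mark
}

# Toy GSUB: (previous glyph, current glyph) -> alternate glyph.
# "mai_ek.left" appears nowhere in CMAP, so it has no code point.
GSUB = {("po_pla", "mai_ek"): "mai_ek.left"}

def shape(text):
    """Map characters to glyphs, then apply the toy contextual substitution."""
    glyphs = [CMAP[ord(ch)] for ch in text]
    shaped = []
    for i, glyph in enumerate(glyphs):
        previous = glyphs[i - 1] if i > 0 else None
        shaped.append(GSUB.get((previous, glyph), glyph))
    return shaped

print(shape("\u0e1b\u0e48"))             # ['po_pla', 'mai_ek.left']
print("mai_ek.left" in CMAP.values())    # False
```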

Though that may have made it slightly easier for some people doing their own private versions of Thai support. Especially people using our fonts on non-Microsoft platforms, which may be why it was not such a priority. Idle speculation on my part, though other areas that used the PUA like some of the CJK cases were left in when this one was not; perhaps this blog could have been part of the anti-Microsoft conspiracy theory series? :-)

Anyway, I digress....

There was a huge push to remove as much PUA as possible, so these little guys went.

So you could look at fonts from XP and look at the same font in Windows 7 and see that they were removed:


On the whole, most people consider this to be a good thing.

Though some legacy issues have popped up!

Do you remember my Documented, schmockumented! It's still kind of from a few years back that talked about how you could get Unicode characters from glyph IDs? In it I mentioned:

any time the glyph ID values would actually have been obtained via other means (like the GSUB glyph substitution table or via more advanced features like VERT for vertical writing), no characters will be found.

This turns out to not always be the case when the technique for getting characters is a bit more sophisticated (like perhaps Office's UCSCRIBE wrapper around Uniscribe) and/or when the thing being substituted is not (like perhaps these Thai PUA mappings).
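Here is a hypothetical sketch of why the PUA mappings change that picture. It is not a description of how UCSCRIBE or Uniscribe actually work; the glyph names and the U+F700 slot are invented. It only shows that once an alternate glyph also sits in the cmap at a PUA code point, going backwards from a substituted glyph does find a character, and that character is the PUA one.

```python
# Old-style font: the alternate form is ALSO encoded at a PUA code point
# (U+F700 is a made-up slot for illustration).
OLD_CMAP = {0x0E48: "mai_ek", 0xF700: "mai_ek.left"}

# New-style font: the alternate form exists only as an unencoded glyph.
NEW_CMAP = {0x0E48: "mai_ek"}

def glyphs_to_text(glyphs, cmap):
    """Reverse-map glyph names to characters; None where no character exists."""
    reverse = {glyph: chr(cp) for cp, glyph in cmap.items()}
    return [reverse.get(glyph) for glyph in glyphs]

shaped = ["mai_ek.left"]  # what a substitution would have produced

print(glyphs_to_text(shaped, OLD_CMAP) == ["\uf700"])  # True
print(glyphs_to_text(shaped, NEW_CMAP))                # [None]
```

If whatever comes back from the old-style lookup gets stored in a document, the document now depends on that one font's private agreement.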

Thus leading to a recent bug report:

I have a Thai customer who has recently migrated from Server 2003 to Server 2008 using Office 2003 and has come across what appears to be some form of Unicode related OS backwards compatibility issue I am looking for some input on.

If they open an _MS Word_ document in Server 2008 with Thai characters that was created and saved in Server 2003, they see that certain Thai characters (those with accents above/below letters) are substituted with square characters.

If they save the document as Rich Text Format and open this in Server 2008, these special characters are substituted by Japanese characters instead.

Creating a new document in Server 2008 is absolutely fine – the issue only occurs when opening documents created in Server 2003 and then opened in Server 2008.

After investigating, it was determined that the bytes that looked like square characters (notdef glyphs) and Japanese characters were actually those PUA code points, the ones that in the older versions of the Thai fonts were these alternate forms, and now they aren't anything (except they happen to be in that area which is often used for CJK).
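If you wanted to check some text for this yourself, a rough sketch might look like the following. It assumes you already have the document's text as a Python string (getting it out of a Word file is a separate problem), the function name `find_private_use` is mine, and the sample string uses an arbitrary PUA code point rather than a known Thai one.

```python
import unicodedata

def find_private_use(text):
    """Return (index, 'U+XXXX') for every private-use code point in text."""
    hits = []
    for index, ch in enumerate(text):
        # General category Co covers the BMP PUA and the two supplementary PUA planes.
        if unicodedata.category(ch) == "Co":
            hits.append((index, f"U+{ord(ch):04X}"))
    return hits

# Thai consonant, an arbitrary PUA code point, Thai vowel sign.
sample = "\u0e1b\uf700\u0e32"
print(find_private_use(sample))  # [(1, 'U+F700')]
```

This only finds the code points; deciding which real Thai character each one was standing in for requires knowing what the old font put in that slot.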

In the end, the vicious, fast move to try to get rid of the PUA is only partially to blame; really whoever was doing actual physical substitutions bears the bulk of the responsibility....


referenced by

2011/06/24 An irresistible force walks into an immovable object (aka the Thai that binds us)

2011/01/06 Short-sighted text processing #6: OpenType and Apple and OpenType
