Providing more information is the best way to assure correct information is received

by Michael S. Kaplan, published on 2010/09/16 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/09/16/10062966.aspx

I thought a small amount of information on PDF and Unicode might make sense as a bit of introductory material for some of the blogs I have already written. :-)

I owe a great deal of the centralization of the Unicode question in PDF to information from Leonard Rosenthol of Adobe, since although I knew about most of this, I was often missing the correct terms and never thought about them in such a good framework of comparison until after communicating with him. Anything incorrect here is probably my fault, though!

Now there are three basic ways for Unicode data to be in a PDF. And really any one, or two, or all three of these methods can be present in the same PDF.

The first method is storage of the font encoding information itself (in TTF & OTF). Any effort to properly display Unicode data will include this information, which can make it one of the "easiest" pieces of information to include if you are an application that is doing the work to use the font information to display the text. If you are not such an application creating PDF then it is not something you would likely have - but if you are doing the work to render information in an application then it can be some easily available information to include.

Reverse engineering from that information is obviously quite possible, though I like to think of it as being a way to extract essentially equivalent text, since there are cases -- from Unicode normalization and the way choices are made to favor composed glyphs, as well as other such mappings -- that the text you would get back from using only this data may not be exactly what was originally in the document. This is similar to the issue I pointed out in Documented, schmockumented! It's still kind of cool.... but easier since one has more information available to do the reversal if one has this information.
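To make the "essentially equivalent text" point concrete, here is a minimal sketch in Python. It assumes (and this is only an assumption for illustration) that the font favored a single precomposed glyph for a letter plus combining accent, so that reversing from the glyph gives back the composed code point. NFC normalization stands in for that font-driven choice; a real extractor would be working from the font's own tables.

```python
import unicodedata

# Text as originally typed: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
original = "e\u0301"

# Assumption: the font drew this pair with one precomposed glyph, so reversing
# from the glyph yields U+00E9. NFC is used here only to simulate that result.
recovered = unicodedata.normalize("NFC", original)

# The two strings look the same on screen but are different code point sequences.
same_code_points = original == recovered

# They are canonically equivalent, which is what "essentially equivalent" means here.
equivalent = (unicodedata.normalize("NFC", original)
              == unicodedata.normalize("NFC", recovered))

print([hex(ord(c)) for c in original])
print([hex(ord(c)) for c in recovered])
print(same_code_points, equivalent)
```

A search tool that compares raw code points would miss the match, while one that normalizes both sides first would find it.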

The distinction should not matter in most cases, especially if you have the same fonts as were used in creating the original PDFs. But in cases where you don't have the exact same fonts there can be subtle differences that might matter to you.

Although this is the easiest way to store "Unicode" data in the PDF, since it is for the most part just embedding the work you had to do anyway to render, it is not always the easiest way to extract text. It can be easy to get it wrong when you do that extraction.

The second method is the ToUnicode table, which is what it sounds like -- the actual Unicode data, with regular pointers (indexes into the text) so the data in the PDF information is directly referencing Unicode data. Depending on what data is stored, you can choose what data will be able to be extracted rather directly.
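A rough sketch of what reading such a table can look like, in Python. The `beginbfchar`/`endbfchar` section with `<code> <UTF-16BE hex>` pairs is the general shape of a ToUnicode CMap, but the fragment below is made up for illustration, the helper names `parse_bfchar` and `extract_text` are hypothetical, and a real reader would also have to handle `bfrange` sections, stream decoding, and so on.

```python
import re

# Hypothetical ToUnicode fragment; the codes and mappings are invented.
SAMPLE_CMAP = """\
3 beginbfchar
<0003> <0020>
<0024> <0041>
<0010> <00660069>
endbfchar
"""

SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text):
    """Map each character code to the Unicode string it stands for."""
    table = {}
    for section in SECTION.findall(cmap_text):
        for src, dst in PAIR.findall(section):
            # The destination is UTF-16BE and may hold more than one character.
            table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return table

def extract_text(codes, table):
    """Turn a run of character codes into text; unmapped codes become U+FFFD."""
    return "".join(table.get(code, "\ufffd") for code in codes)

table = parse_bfchar(SAMPLE_CMAP)
text = extract_text([0x24, 0x03, 0x10], table)
print(repr(text))
```

Note that code 0x0010 maps to two characters ("fi"), which is how a single ligature glyph can still give back the text the PDF writer chose to store.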

Now this gives a great opportunity to work around some of the potential problems with the font encoding information, obviously (since you are putting actual Unicode text in there you are choosing what goes there). Of course a PDF writer can put in information with the same kinds of transformations as well, but it is the opportunity to do better that is interesting.

I have seen examples where a document would use Arabic compatibility characters -- for more on these see blogs like It Does Not Always Pay to be Compatible and Getting out of dodge (or at least out of the compatibility range!) -- and the PDF's font encoding information would have the same thing, but the ToUnicode table would have the real Unicode Arabic characters. This sort of thing is a great way to help with searching algorithms that scan PDFs so is not an unreasonable way to go here depending on the ultimate goal of the PDF itself.
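As a small illustration of that Arabic case, here is a sketch in Python. It uses NFKC normalization as a stand-in for whatever mapping a given PDF writer actually applies when it fills in the ToUnicode table; that substitution is my assumption, not a description of any particular tool.

```python
import unicodedata

# What the document (and the font encoding information) might contain:
# U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM, a compatibility character.
shown = "\uFEFB"

# What a search-friendly ToUnicode entry could hold instead: the regular
# Arabic letters. NFKC is used here only to derive them for the example.
searchable = unicodedata.normalize("NFKC", shown)

for ch in searchable:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

A search for the ordinary letters lam and alef would then match the extracted text, even though the visible glyph came from the compatibility range.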

This is not the most common way to go though -- the thing you will most often see in a ToUnicode table is the original text. If you have it anyway, it is easy to add.

The only "hit" to storing both is the size increase of the PDF. As Leonard specifically suggested to me, "...they require the least amount of work as the information is associated with the font itself and the page content need not be impacted. For most things, these should be sufficient..."

The third method is the ActualText tag. This is the magical way to provide easy shortcuts to the exact data to get back when there are contextual glyphs in the PDF.

Now obviously you can in theory get the same information from either of the previous two methods (which is good because they are the most common!), but doing so is harder in documents with many contextual glyphs. Only some of these glyphs are also in Unicode, and when they are, they are in most cases compatibility characters. Most contextual glyphs are things like the conjuncts you might see in Sanskrit, and so on: special forms of text that can often represent explicit choices on the part of a font.

A great example of this is something I talked about in Which form to use if the form keeps changing?. Judicious use of the ActualText tag can effectively preserve these distinctions without requiring special processing to get the actual text back, something that can be required if you rely on the ToUnicode table to try to get the information back.

When you think about cases like the "Ferrari" scenario of Acrobat PDF: the Yugo vs. the BMW vs. the Ferrari (for example), it is easy to see how not including good ActualText can make it easy to mess up cases like glyph reordering. I like to think of the ActualText (when it is available, which is comparatively rare since it does require extra work to create well) as a great list of shortcuts to get information that would otherwise be more difficult to get well.
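For a sense of what an ActualText "shortcut" looks like, here is a minimal sketch in Python that wraps some glyph-showing operators in a marked-content sequence carrying the logical text. The `/Span << /ActualText ... >> BDC ... EMC` shape and the UTF-16BE string with a leading FEFF marker follow the general PDF conventions; the helper name `actual_text_span` and the glyph codes in the `Tj` operand are hypothetical and exist only for this example.

```python
def actual_text_span(text, content_ops):
    """Wrap content operators in a marked-content span carrying ActualText."""
    # Hex string: a byte order mark (FEFF) followed by the text in UTF-16BE.
    payload = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{payload}> >> BDC\n{content_ops}\nEMC"

# Devanagari "ki": stored logically as KA (U+0915) then VOWEL SIGN I (U+093F),
# but the vowel sign is drawn to the left of KA, so the glyph order is reversed.
logical_text = "\u0915\u093F"

# Hypothetical glyph codes, in visual order (vowel sign glyph first).
glyph_ops = "<00420015> Tj"

span = actual_text_span(logical_text, glyph_ops)
print(span)
```

A reader that honors ActualText can return the two characters in their logical order directly, rather than having to undo the reordering from the glyph run.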

And the bugs that appear in even the most sophisticated tools tend to bear me out on this opinion; things are incorrect often enough that including this data more often would be one of the best ways to get good results back. Though this does assume that PDF readers will make use of the information (not all of them do!), my general feeling is that providing more information is the best way to assure correct information is received.

Now, if you have PDF reading tools that can look at all of this information you can do a lot of forensic PDF work to determine the actual cause of many of the bugs that exist in PDFs today, especially for complex script cases. But forensic PDF work is really only interesting to two groups of people:

I tend to find both such processes to be very noble goals whenever they are identified, though unfortunately both tend to be a lot rarer than I would like....

The problem with Leonard’s diagnosis is that you are not dealing with two character-encoding environments within a PDF, one that’s broken (the default) and one that works (ActualText). In your ActualText tag, which is in fact meant for drop caps and the like, you are using the same character encoding at work everywhere else in the PDF.

A PDF that has trouble with decompositions or complex scripts will have that same trouble inside ActualText because the problem is endemic to PDF. There isn’t a little panic room inside PDF you can hide complex text in.

Adobe, in one guise or another, has known about this for years but failed to fix it, in part because fixing it is horrendously complicated by even my estimation. PDFs work reliably in Latin-script languages, Japanese and Chinese (horizontal text only), and hugely simplified RTL text – and in no other scripts.