by Michael S. Kaplan, published on 2010/09/06 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/09/06/10057561.aspx
Thinking back to July 6th's "Beauty isn't only glyph deep" and July 16th's "Beauty isn't only glyph deep, even for Microsoft", one of my major concerns was how truly awful the story is for PDF with complex scripts.
After talking to some really knowledgeable people when it comes to PDF, I thought I'd fill in some of the holes about the intricacies of the format and some of the plans for the future....
As this is an area that a lot of people care about, I thought it would be good to do that update.
The update you are about to read. :-)
It started when Eric got me in touch with Leonard, who knows more about PDF than just about anyone out there you might know.
He is the one who (for example) reminded me of something I already knew but seldom really think about: that Adobe Acrobat is not a tool for creating PDFs. It is at least half a dozen different tools for creating PDFs, under a wide range of different circumstances.
There is a definite bias toward my definition of "high quality" if you want to group these different implementations that way -- just look at the main Acrobat UI, say. Or the add-in for Microsoft Office products like Word. These will always do better than the lowest level, which is little more than a print driver, primarily for use with applications that Adobe knows nothing about and which really only support printing. In which case there is an "Acrobat" solution even for them.
But proclaiming the three of them (that print driver and the Adobe add-in for Microsoft Word and the main Acrobat UI) to all be "Acrobat PDF creation solutions" is akin to looking at a Yugo and a BMW and a Ferrari, and just calling them all cars.
Sure, you are right, they are all cars. But the difference between them in regard to quality and performance is so staggering that you may as well not put them in the same category for most purposes.
Now I could have told you about that first category -- I mean, sometimes all that will be sent to a printer is glyph ids, and features that Windows has, like the one in "Documented, schmockumented! It's still kind of cool....", are only useful in the simple cases, but that printer driver could indeed use them to get string data back in those simple cases. And thus even our "Acrobat Yugo" PDF writer can theoretically get string data in the PDF if it wants.
Note that some PDF writers and readers won't even do that, but doing it won't help you in this Indic case anyway. It works great with English and other simple cases, though!
Some might object to my characterization of the Adobe Word add-in as the "Acrobat BMW" PDF writer, in contrast to the main UI from Adobe as the "Acrobat Ferrari" PDF writer. I might have objected myself, up until about a month ago.
And then, with some helpful hints (e.g. that Acrobat can load an XPS file), I took that 19,669 word Tolkāppiyam (தொல்காப்பியம்) document, and created two different PDFs from it.
One with the BMW, and one with the Ferrari.
The BMW produces a 450kb PDF file that still cannot find the word. Copy/paste to notepad shows that
செந்தமிழ்
becomes
செசெந்தமிமி ழ்
And yes that space is in there, as are the incorrect characters.
I think I may wait for the BMW recall. It created a huge number of these errors.
Now the Ferrari tells a different story.
When I create an XPS file from the .DOCX, it has 100% text fidelity. But when I run that through the Ferrari (creating the PDF in Acrobat from the XPS), what it creates is a 648kb file that still cannot find the word.
In this case
செந்தமிழ்
became
ெசந்தமிழ்
Note this one is much closer: it is just the U+0b9a U+0bc6 that became U+0bc6 U+0b9a, a small reordering issue, multiplied by many other similar occurrences throughout the document. Note that these are cases where rendering technologies like Uniscribe do the reordering, since
ச + ெ = செ
Well, it should at least. :-)
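To make the reordering concrete, here is a minimal Python sketch that lists the code points of the two strings. The helper name `describe` and the variable names are mine, purely for illustration; the code points are the ones from the example above. It also shows why a search for the correctly typed word fails against the extracted text even though both contain exactly the same characters.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# Logical (storage) order: consonant CA first, then vowel sign E.
logical = "\u0b9a\u0bc6\u0ba8\u0bcd\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
# What came back from the PDF: vowel sign E first, then CA (visual order).
extracted = "\u0bc6\u0b9a\u0ba8\u0bcd\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

for left, right in zip(describe(logical), describe(extracted)):
    marker = "" if left == right else "   <-- differs"
    print(f"{left:32} | {right}{marker}")

# Same characters overall, but the search for the real word still fails.
print(sorted(logical) == sorted(extracted))
print(logical in extracted)
```

Only the first two rows differ, which is the whole bug in miniature.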
Okay, so it turns out that all four of the PDF creation options I have mentioned in these three blogs:
are all currently inadequate for Tamil and obviously some other complex scripts (though perhaps #4 will work even better for scripts that do not specifically do glyph reordering).
In the meantime, XPS does offer the full lossless fidelity that I really wish PDF did, and it (the Open XML Paper Specification) is an open, cross-platform solution.
In the back of my mind I am imagining taking stuff like the thoughts behind "Documented, schmockumented! It's still kind of cool...." and finding a way to try to recover the text in all of the broken PDFs created by these tools.
This Forensic Text Recovery is of course largely fantasy without a lot more internal knowledge of all of these writers, but a guy can dream!
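For the narrow "Ferrari" case, where the only damage is a pre-base vowel sign landing in front of its consonant, a very rough Python sketch of what such recovery might look like follows. Everything here is an assumption on my part: the function name `reorder_prebase` is hypothetical, it only knows about the three Tamil vowel signs that are drawn to the left of the consonant (U+0BC6, U+0BC7, U+0BC8), and it assumes the whole input is consistently in visual order. It would do nothing useful for the duplicated and inserted characters in the "BMW" output, and it would damage text that is already in the correct order.

```python
import re

# Tamil vowel signs rendered to the LEFT of the consonant they follow in storage.
PREBASE_SIGNS = "\u0bc6\u0bc7\u0bc8"
# Tamil consonants occupy U+0B95..U+0BB9 (the range has some unassigned gaps).
CONSONANTS = "\u0b95-\u0bb9"

# A pre-base sign immediately followed by a consonant is in visual order.
_VISUAL_PAIR = re.compile(f"([{PREBASE_SIGNS}])([{CONSONANTS}])")

def reorder_prebase(text):
    """Swap 'sign + consonant' back to 'consonant + sign'.

    Assumes text is consistently in visual order; do not run it on
    text that is already in logical order.
    """
    return _VISUAL_PAIR.sub(r"\2\1", text)

extracted = "\u0bc6\u0b9a\u0ba8\u0bcd\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # from the PDF
expected = "\u0b9a\u0bc6\u0ba8\u0bcd\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"   # the real word

print(reorder_prebase(extracted) == expected)
```

A real tool would also have to cope with the two-part vowel signs and with documents that mix correct and incorrect runs, which is exactly where the internal knowledge of each writer would be needed.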
I'll talk about more issues in this sphere another day, in another blog....
Joe Clark on 6 Sep 2010 7:38 AM:
And, as I told you before, issues of text direction and conjuncts are barely known at all at Adobe. I told you that at the time you were complaining that all the people you asked (who didn’t know the answer to the problem) didn’t know the answer. So I can understand how you’d forget.
You can offer up any weirdo MS file format as an alternative to PDF if you want, but nothing will come of it, just as nothing came of Silverlight. (Few users, if any, want Microsoft formats.) The solution is to repair the PDF specification so it works for something more bizarre to an American developer than English, Japanese, or Hebrew. This will not happen in our lifetimes.
Michael S. Kaplan on 6 Sep 2010 8:07 AM:
Um, ignoring the Word bug, if PDF created directly by other Office 2007/2010 products works with full fidelity, then there is a solution already -- using PDF.
And it was an Adobe employee who recommended to me that using XPS (created by Microsoft) in Adobe Acrobat gives better results than Adobe's own Office add-in does. So there are others outside of Microsoft who see value in the fidelity of XPS. I am not saying it is a replacement -- I want Adobe to fix the bugs in Acrobat, all of them. But in the meantime there is a way to store data with 100% fidelity that even Adobe can use later. There is something significant there....
Not to mention the cross-platform products for XPS are hardly all coming from Microsoft, either. So XPS clearly has its uses to some people.
And of course Silverlight has nothing to do with any of this (though Silverlight 4 supports complex scripts even better than InDesign due to known bugs there in Indic and other scripts -- maybe Adobe will find that loading Silverlight in Flash using some new conversion gives better international support THERE too? <grin>).
Naga Ganesan on 6 Sep 2010 1:27 PM:
Hi Mike,
Is it only Tamil, Hindi, ... having the problem of extracting the original text from PDF?
What about Thai and Korean - do these scripts' PDF yield the text correctly?
N. Ganesan
Michael S. Kaplan on 6 Sep 2010 2:25 PM:
I have only been looking at the Indics, so far. I suspect Thai might also have problems. But Korean might be okay here. Of course until it is tried your guess is as good as mine. :-)
Aaron on 7 Sep 2010 11:45 AM:
So what you're saying is that XPS is like the Caterham R500. It's faster than a Ferrari around a track, much cheaper, and you can order it as a kit!