Beauty isn't only glyph deep

by Michael S. Kaplan, published on 2010/07/06 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/07/06/10031112.aspx

For the official record, this particular blog post is my own opinion. I am speaking only for myself.

More specifically I am not speaking for Microsoft, Unicode, Adobe, Google, Quark, INFITT, Tamil Nadu, Malaysia, Singapore, Sri Lanka, or India. Though I suspect that over 50% of them might [at least privately] agree with me on over 50% of what I am about to say.

PDF is Adobe’s for all intents and purposes (and intensive purposes, a fair misuse of language under the circumstances!) ubiquitous contribution to the world: a document format that can be read anywhere and still look the same.

It is so ubiquitous that Apple (which as we know is gunning for Adobe's Flash these days) supports PDF just fine. I can see the ad campaign now:

I mean looking good is not a bad thing. Looking good both constantly and consistently can be a great thing -- think of Cher, for example!

For example, take some Tamil text. Having been in India at a huge Tamil conference recently and having a backpack of books I was given, I have a lot of Tamil text to choose from right now.

I have the Microsoft fonts in Windows 7 that let me render it properly. In creating a document in Word, I can use the fonts and get proper looking Tamil. If I embed the font data then I can send the document to you so that you can see them properly too, even in earlier versions, even if I used some of the newly added symbols or the TAMIL FA without the dotted circle on the left.

Or I can not embed the fonts and hope that you have them. Since you probably don’t have the newer font on those down-level platforms, you will probably just get occasional square boxes (null glyphs) and inappropriate dotted circles. The 100% fidelity and portability is lost, though you do still have the actual data that you can perhaps find a font for or use for other purposes (like searching archival documents and so on).

To Word in this case, the underlying data is what is most important. What it looks like is guaranteed only if you go to some extra effort, and that is secondary to content fidelity.

MICROSOFT WORD: How you look is fine and dandy but what’s on the inside is what really counts.

Okay, so I won’t be doing any ad copy for Microsoft. I’m trying to prove a point, not build a PR campaign!

For Tamil (and other Indic text) in a PDF, the visual forms are stored as FONT GLYPHID values or related goo. How everything looks is the most important piece, and the underlying data is not. That is what PDF writers store so they can get 100% portability with PDF documents.

And although I have been told you can optionally choose to store the underlying text in the PDF as well, this can (roughly) double the size of the PDF and is neither the default option in Adobe Acrobat nor a requirement in PDF writers that they do this (the vast majority of free and also most paid PDF writing tools do not).
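To make the difference concrete, here is a minimal Python sketch of a page stored only as glyph IDs versus one that also carries a glyph-to-character mapping. The glyph numbers, the `GLYPH_TO_TEXT` table, and the `extract_text` helper are all hypothetical, invented for illustration; real fonts assign their own glyph IDs and real PDF readers are far more involved.

```python
# Hypothetical glyph IDs for an imaginary Tamil font (assumption, not real data).
GLYPH_TO_TEXT = {
    41: "\u0b95",  # glyph that draws TAMIL LETTER KA
    12: "\u0bbe",  # glyph that draws TAMIL VOWEL SIGN AA
    77: "\u0bae",  # glyph that draws TAMIL LETTER MA
}

REPLACEMENT = "\ufffd"  # what a reader is left with when no character is known


def extract_text(glyph_ids, glyph_to_text=None):
    """Return the text recoverable from a run of glyph IDs."""
    if glyph_to_text is None:
        # Only the look was stored: there is nothing to search or copy.
        return REPLACEMENT * len(glyph_ids)
    return "".join(glyph_to_text.get(g, REPLACEMENT) for g in glyph_ids)


page = [41, 12, 77]  # the glyph run the page draws

with_mapping = extract_text(page, GLYPH_TO_TEXT)
without_mapping = extract_text(page)

ka = "\u0b95"
print(ka in with_mapping)     # the search for KA can succeed
print(ka in without_mapping)  # the search has nothing to match against
```

The page looks identical in both cases; only the first one can be searched or pasted as text.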

Note that some people I respect claim this is either not possible to do or very hard to do (find?) in Acrobat. Either they are missing something obvious, or it’s hard, or it isn’t there. Other people I respect claim it is possible.

Okay, so Adobe would probably put out a cease and desist if that were an advertising campaign I was doing. Don't worry, I’m not!

Not entirely as an aside and just to show it is possible in the format even if Adobe and most everyone else don’t do it, if you use the “Save as PDF” feature in Microsoft Word, you always get the data stored in there, too. This is why you can always find data in a PDF created by Word, even if it is complex script data like Indic language text made up of conjuncts and reordered vowels and such.
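Why reordered vowels make the stored text matter can be sketched in a few lines of Python. The Tamil syllable "ko" is encoded as KA followed by VOWEL SIGN O, but it is drawn with part of the vowel to the left of the consonant. The `visual` string below is an assumption standing in for what a tool might recover if it mapped the drawn glyphs back to characters one by one; it is not a description of what any particular PDF writer does.

```python
import unicodedata

KA = "\u0b95"            # TAMIL LETTER KA
VOWEL_SIGN_O = "\u0bca"  # TAMIL VOWEL SIGN O

# Logical (Unicode) order: consonant first, vowel sign second.
logical = KA + VOWEL_SIGN_O

# The vowel sign canonically decomposes into a left part and a right part.
left_part, right_part = (
    chr(int(cp, 16)) for cp in unicodedata.decomposition(VOWEL_SIGN_O).split()
)

# Visual order as drawn on the page: left part, consonant, right part.
visual = left_part + KA + right_part


def contains(haystack, needle):
    """Substring search after NFC normalization of both sides."""
    def nfc(s):
        return unicodedata.normalize("NFC", s)
    return nfc(needle) in nfc(haystack)


found_in_logical = contains("x" + logical + "y", logical)
found_in_visual = contains("x" + visual + "y", logical)

print(unicodedata.name(left_part), "/", unicodedata.name(right_part))
print(found_in_logical, found_in_visual)
```

Even with normalization, the visually ordered string does not match the query, which is why a writer that stores the original character sequence produces a PDF you can actually search.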

I've heard in the past some people complaining how PDFs created with Word's feature were bigger than using Adobe's, and they assumed it was a Microsoft bug.

Well yes, if containing the actual data is a true flaw in your mind then yes it's a bug.

PDF in Microsoft Word: It’s big-boned, not fat². Except you can actually count on it³!

Bet you may have wondered why PDFs from Microsoft Office tend to be bigger (remember that size thing I mentioned earlier?) -- now you know! :-)

Even really popular guys who have tons of adoring fans (e.g. Steve Jobs) and who hate Adobe's Flash in some ways love Adobe's PDF and support it. Now I am not Steve Jobs here, nor do I want to be, but I'm just trying to say that there are people out there who feel PDF is just fine.

But I know I’m not the only person who hates the lack of search-ability or clipboard-ability of a lot of the text of various scripts of interest in the Unicode Standard, which is available up on the web in PDF form (but whose PDFs end up storing FONT GLYPHID values that cannot be retrieved as characters or searched; even for the corporate members of Unicode, who if memory serves have gotten a cool unlocked version at times, it still puts crap in the clipboard).

Go on, try to find a Ka (க, TAMIL LETTER KA) in Chapter 9 -- there are several of them. Unicode kept the file size down to 2.5 MB by using tools that don't save the data, but made one of the more obvious tasks one might want to accomplish impossible. You can search for the English text quite handily though.

One of the great things about Microsoft's XPS format (beyond the tantrums it raised, I found those kind of priceless) is that it always does have the text. I won't say it was Adobe's stubbornness about 100% extract-ability that led to XPS, but if it wasn't then it should have been, because that is one really compelling damn reason. Why on earth would one be proud of a feature that is just for show? I mean if everything gets archived as PDF and PDF turns a lot of text into crap, then that means that a whole lot of people are hanging on to a lot of crap.

Put another way, who wants to go down in history for being famous for creating a way to archive books that can be as unsearchable as the original books, sitting on shelves getting dusty (you can't search those either except by looking at them)?

Hey how about Google Books? Google is archiving the whole world's books, right? Clearly content must be important to them.

After you've given up in disgust at Google Books, keep in mind that it may be the fault of PDF -- depends on what they are archiving, I guess. Anyone from Google know the story here? I know that they search PDFs, perhaps that is what they archive?

But either way, even Google has set the bar pretty low in their intent here in terms of indexing the world's books -- whether by choice or design -- and they are hardly going out of their way to admit it.

Getting back to Unicode, they do it right with the code charts - I can find letters there, and those are PDFs too. In case you thought no one else was ever doing it right.

And I know that India’s print publishing industry, which is actually on the rise (as opposed to the US print publishing industry that is being devastated by online journalism), is dependent on two things:

Most of them do not use Unicode since only Quark >= 7.0 supports Unicode and even that doesn’t support Indic text.

There are literally thousands of newspapers, journals, newsletters, and zines that are almost entirely made up of text in Tamil or Hindi or Marathi or Telugu or Malayalam or Bengali or Assamese or Oriya or some other language of India.

And if they use Unicode? Then they are all an unsearchable mess of content that may be lost to us forever if you are trying to look for information within them (as sophisticated as Google is I doubt they will ever get to the point of being able to find random glyphid values in PDFs when one requests a search for text!).

And if they don't use Unicode? Just as bad, though maybe if you know the encoding you can search for the specific thing you want.

How do the results from either company serve Unicode's goals? The president of Unicode works for Google, and the vice-chair of the Unicode Technical Committee (UTC) works for Adobe. The two of them are active members of both the UTC and the Unicode Editorial Committee, and their employers both pay USD$18,000 annually to be full members of the Unicode Consortium.

I'll stop for a moment if you want a chance to fume about the irony of some of the biggest members in Unicode putting so much money into supporting a standard that they may not be supporting in some of the largest projects they are hailed for producing for the world.

I have been getting an earful at Tamil Internet 2010 from people who still feel compelled to use TAB or TAM or TSCII or TACE 16 or a font hack for publishing. For them, the ability to search their own content later in their known hack matters more than the risk of others not being able to see anything, and not being able to search all content over the entire internet is mostly not yet on their radar (since it can't be done now anyway). So thousands of newsletters are incompatible with each other, and produce content that betrays its birthright for a bowl of portability.

And I would argue that the truly astounding number of PDFs that grows every day across almost all of South Asia that will not be able to be searched for their native content are decidedly NON-portable. No matter what PDF stands for.

That includes bringing those free PDF writers forward. A hard problem, to be sure, but they created it themselves, and gave it to the planet. I think they owe it to the planet to make moves toward solutions.

...to organize the world's information and make it universally accessible and useful.

I was pointedly asked in Coimbatore whether Unicode could force a member company like Adobe or Google or Quark to do better here, and I pointed out they could not (and that Quark is not a member of Unicode anyway). The member company must voluntarily choose to do the right thing.

I presented on this very topic in Tamil Nadu, have had frank conversations with some of the most important people in Microsoft India while here, and have been very candid in conversations with everyone (also including lots of users of software) about what we could have done better in the past, what we still aren’t doing as well now as we could, and what I wish we would start doing.

And I could probably hurt my career at Microsoft a bit with a too brutally honest blog about my thoughts on Windows Phone 7 not supporting Silverlight 4, when clearly Apple's iOS4 is supporting Tamil text in the browser, probably using the fonts they licensed from us (I have written that very blog four times now while sitting in a hotel room in India and deleted it every time - I'm learning the dangers of blogging angry!).

But I can see the next steps here, everyone can. Microsoft's fault here is just timing. And we are making progress here. And we are trying to do lots of things better.

And as I pointed out, if you author in Microsoft Word (2010 as is, or 2007 with the Export to PDF download) then Adobe's PDF shortcomings are covered. I wouldn't hazard a guess on the numbers of PDFs that are saved correctly but it doesn't look very huge yet.

I'm not trying to sell copies of Word here when I say this, because I know not everyone affected in India is necessarily in a position to buy the latest version of Word. I'd be happier if Adobe stepped up here and made PDF 100% portable, including searchability of complex scripts. PDF writers that don’t store the original text should be declared non-conformant to a new PDF 2 that Adobe is publicly committed to supporting for the sake of true portability.

It is one of the responsibilities of ubiquity -- to help make those who use it work better.

COIMBATORE: A debate here on which of the following mediums – cinema, television or print – had greater responsibility to protect and develop Tamil ended with the verdict that the print medium comprising newspapers, novels, magazines and new media was the best to carry out this task.

If that is true, then we need media that can be printed, digitized, stored, and archived -- not just the look of the content, but the content itself.

That is at most recent count over 1.18 billion reasons for Adobe and Google and Quark to do better here.

1 - Think Billy Crystal for this one.
2 - Think Eric Cartman for this one.
3 - Unlike Eric Cartman, I mean.

wiki.services.openoffice.org/.../Pdf_Import_Extension

www.forest.impress.co.jp/.../pdfimportext10.html

and it is free, open and works with some other languages than English.

Microsoft's handling of Unicode and of East Asian double byte character sets is less than desirable with a PDF in Word. Take Japanese for example: PDFs in Word get confused on older PDFs that were encoded with double byte JIS (Shift JIS) en.wikipedia.org/.../Shift_JIS versus newer PDFs encoded using Unicode. In Word I know of no way to force PDF encoding to use Shift-JIS or EUC-JP etc. The problem continues to be businesses who refuse to upgrade the way PDFs are created in the first place. This kind of falls in line with the crux of your blog article.

My point was only that importing and editing inside of MS Word is simply an afterthought, and "big boned" is the perfect wording, but I would say it has a long way to go for languages other than English. MS doesn't seem to be in a hurry to fix this either.

I'll still take PDFs versus everything being in a 100 MB "word" document before there were PDFs.

Good article.

I am not particularly worried about Japanese in this context as Japanese has its own powerful lobby to get what it wants -- market share and market demand (support for legacy JPN code pages is there specifically at their request, in fact!). So it is not "non-English" support that is the crux here so much as "complex script" support.

But I am interested in whether OpenOffice's PDF facility supports complex scripts or not. Does it? If it does that provides more pressure on other companies to do the right thing (if not then they should fix that!)....

Pretty shocking to read that there are tools these days which can turn a document containing text into a PDF and end up with a file where said text cannot be copied or searched properly.

Mac OS X's built-in save to PDF feature works as expected for க.

Searching and copying both function just fine with the resulting pdf.

Talking about Word 2007 and PDF, I encountered a strange issue last week.

One of my clients requires all documents be sent to them in PDF format, and my company stores all our documents in DOC format. So I have to save them as PDF.

It happens that one of the documents is a heavily edited manual with change tracking enabled and lots of non-accepted changes (not sure if it matters). The 6.4MB file has about 129k characters and 143 pages in DOC format. When using other tools, it converts to a 4.8MB, 143-page PDF file.

But when you use Word 2007's SaveAs function (our company uses genuine Office Enterprise 2007 with the "Microsoft SaveAs PDF or XPS Add-in for 2007 Microsoft Office programs" and has all the latest patches applied; it doesn't have any third party plugins installed other than the SAP 10 scanner, if that counts), you'll see Word hang for about 3 minutes and generate a 10.5MB, 546-page PDF file. After the conversion, the page count displayed in Word 2007 is 546 pages as well. When you attempt to close Word at this point, Word freezes even if you've not done any editing. If you force it to close, the next time you open Word it'll offer to recover the file, but the recovery block (I don't remember exactly what it is called, but it's something like that) later turns out to be corrupted.

I wonder if this bug only exists under special condition(s). We have other, even more heavily edited documents with non-accepted tracked changes and more pages that convert just fine. That's really strange...

Well PDF really is a pre-print format not an archival format.

And there are so many places PDF generation can go awry. The target version of PDF used by the PDF generator can have an impact. As can the fonts used. And few applications make use of the text embedding features in PDF. So all told ... not an ideal format for information transfer or archiving.

Even English can be a problem to search: the use of ligatures and more advanced OpenType features in publishing software can lead to English language PDFs that are difficult to search.
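As a small illustration of the ligature point, here is a Python sketch. It assumes the text extracted from a PDF contains the single ligature character U+FB01 (LATIN SMALL LIGATURE FI) where the author typed "fi"; the sample sentence is made up. Compatibility normalization (NFKC) is one way a search tool could fold such ligatures back to plain letters, though it does nothing for glyphs that have no character mapping at all.

```python
import unicodedata

# Hypothetical text as extracted from a PDF: "office" and "file" were set
# with the single-character fi ligature, U+FB01.
extracted = "Send the of\ufb01ce \ufb01le today."

# A plain search for the word as typed does not match the ligature form.
plain_hit = "office" in extracted

# NFKC normalization replaces the ligature with the letters "f" + "i".
folded = unicodedata.normalize("NFKC", extracted)
folded_hit = "office" in folded

print(plain_hit, folded_hit)
print(folded)
```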

One thing about all this that I find fascinating is that were this sort of thing happening to a language like Japanese or Russian or something, it wouldn't be me covering it, it would be all the tech sites that love to rip on these companies (like the SlashGizCrunchBoy crowd). It is so disturbing to me to think about how one day this will be huge and someone will remember to dig up a link to this post where I pointed out the problem five or ten years earlier....

Ah, the magic phrase that no one I have ever talked about the problem with has ever used:

TAGGED PDF

Thanks, Joe! :-)

Agree on the Quark thing but I really can't move an entire industry segment within a country myself just because they're doing something wrong -- especially if they would have to pay to get the fix for the problem....

We hold these truths to be self-evident, that all software is created equal, that it is endowed by its Creator with certain unalienable Rights, that among these are compatibility, searchability and the pursuit of fidelity. That to secure these rights, software companies are incorporated among geeks, deriving their just powers from the consent of their customers, That whenever any Form of software becomes destructive of these ends, it is the Right of the customers to make them alter it or to migrate from it, and to find new companies, laying their requirements foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that companies long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that customers are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such companies, and to provide new Guards for their future searchability and fidelity.

:-)

//But I am interested in whether OpenOffice's PDF facility supports complex scripts or not. Does it? //

Converting to a PDF file from OpenOffice.org Writer is a simple task, and complex scripts are supported for that purpose. PDFs created this way can also be viewed without problems in PDF viewers.

For editing (not full-fledged, I understand), PDF files can be opened in OpenOffice.org Draw with the "PDF Import extension" installed. However, it appears that complex-script text loses the Unicode code points that are second level implemented.

Here's my screenshot sample (using a Tamil one-liner you now know well, as seen in one of your earlier blogs) on my Ubuntu 10.04 (Linux) platform:

sites.google.com/.../tamil-OOow-PDF-OOdraw.png .

The top of the 3 is the raw content in Open Office Writer. The middle is the pdf (made from OOo-writer) as viewed in (Evince) Document Viewer. The bottom most is the edit / view of the same PDF in OOo-Draw.

In OOo-Draw with pdf import extension, the opened texts are placed in text blocks and are editable.

So the problem is that complex-script contents in PDF are not editable. Hopefully the powers that be will come up with solutions!

(I had made the PDF with "tagged pdf" option)

K. Sethu

(Colombo, Sri Lanka)