Beauty isn't only glyph deep, even for Microsoft

by Michael S. Kaplan, published on 2010/07/16 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/07/16/10038461.aspx


It was earlier this month that I wrote the blog entitled Beauty isn't only glyph deep.

In it, I made serious accusations about what I consider the irresponsible behavior of the technological efforts of Google and Adobe.

And I stand by those accusations.

Not on behalf of Microsoft (who I didn't do this for) as a Microsoft employee (who I wasn't writing this as), but as one of the humans who depends on the quality of the work that both of these companies do. Just like almost everyone else out there, really.

I even pointed out how great it was that the "Save as XPS" and "Save as PDF" functionality in Microsoft Word 2007 and 2010 works around some of the shortcomings of these two companies. This was not to make people go buy copies of Word or Office, as I said -- it was to suggest that these other companies wake up and start doing their damn job.

I didn't mention it at the time, but it puts the whole effort Adobe put out to make Microsoft drop the "Save as PDF" functionality in the box for Office 2007 in a very different light -- like why they would fight a PDF done right in favor of their own poor quality add-in that does it wrong. Think about it for a moment, but not too long; I can get myself angry every time I think about it.

And when the official Google Blog can talk about their commitment to the digital humanities and say things like

We’re proud of our own Google Books digitization effort, having scanned over 12 million books in more than 400 languages, comprising over five billion pages and two trillion words.

I want to tell them to sit down and shut up until they create something in Google Books that can SEARCH in the languages in question in Unicode, since in so many cases it can't.

Now if you were reading here or were pointed at that Beauty isn't only glyph deep blog, you saw it. A few people even pointed others to it. As blogs go it was slightly more popular than the average blog (not that I have average blogs, but you know what I mean).

But no one else really saw it. And no one is talking about it.

It bothers me a little (as I pointed out in a comment to that blog) that the "SlashGizCrunchBoy" crowd didn't jump in with opinions, and to be honest it is disappointing to me that no one seems to care when big companies treat emerging markets this way. But clearly they are too busy covering the latest "iPhone vs. Android" or "Apple vs. Adobe" or "When will AT&T's Exclusive iPhone deal end" stories as they play out their "geek version of paparazzi" fest. The color of a Jobs-ian bowel movement after he recovered from his surgery would get better coverage from most of these sites/people than the fundamental flaw in almost every PDF document created in some parts of the world or the fact that Google is archiving books in a format akin to cottage cheese for much of the planet's scripts.

Though that is the world I chose to live in; if I wanted the Thurrotts and Scobles and Foleys of the world to be covering me, I'd be writing about very different topics. And this would be a very different Blog.

However, in a way I am glad that no one else cared.

Because it turns out that there is a bug in Microsoft's "Save as PDF" functionality, which causes some of the Unicode text in the Tagged PDF that Microsoft Word creates to be incorrect.

For small samples of text, everything works -- like my own documents and presentations I tested the functionality on before writing Beauty isn't only glyph deep in the first place.

But then if I have a larger bit of text, the corruption sets in.

My friend N. Ganesan actually sent me a Word document entitled Tholkaappiyam.doc which, though I am not sure, I think may be the actual Tolkāppiyam (தொல்காப்பியம்), albeit with modern Tamil rather than the ancient form of the script used when தொல்காப்பியம் was first written millennia ago.

And it was in converting this 19,669 word document that I first saw the problem.

You see, if I took the whole document then I saw a bunch of corruption, but single words were fine (even when those same single words were corrupted as part of the larger document).

You can try it yourself, I'll give you the first paragraph to try it out with:

சிறப்புப்பாயிரம்
வட வேங்கடம் தென் குமரி
ஆயிடைத்
தமிழ் கூறும் நல் உலகத்து
வழக்கும் செய்யுளும் ஆயிரு முதலின்
எழுத்தும் சொல்லும் பொருளும் நாடிச்

Now if you put the first word into Word 2007 or 2010, and then save it as PDF, the PDF will contain the correct underlying text.

But if you take as much as the whole paragraph and do the same thing, the text still looks right as PDF but the underlying text becomes:

சிறப்புப்பாயிரம்
வட வவங்கடம் தென் குமரி
ஆயிடடத்
ெமிழ் கூறும் நல் உலகத்து
வழக்கும் தசய்யுளும் ஆயிரு முெலின்
எழுத்தும் தசால்லும் தபாருளும் நாடிச்

which has a somewhat baffling set of mistakes in it, as you can probably see just by looking at the visual changes and noticing the illegal sequences caused by letters improperly placed as some of the other letters were modified.
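
If you would rather not eyeball it, here is a rough Python sketch of two checks one could run on text copied out of such a PDF. The function names are mine (hypothetical), and the "illegal sequence" test is only a heuristic: it flags a combining mark that starts the text or follows something that is neither a letter nor another mark, which is far from a full validation of Tamil syllable structure.

```python
import unicodedata


def diff_code_points(expected, actual):
    """List (index, expected code point, actual code point) where two strings differ.

    Assumes the strings line up position by position; zip() stops at the
    shorter one, so any extra trailing characters are not reported.
    """
    return [
        (i, f"U+{ord(e):04X}", f"U+{ord(a):04X}")
        for i, (e, a) in enumerate(zip(expected, actual))
        if e != a
    ]


def dangling_marks(text):
    """Heuristic: find combining marks (vowel signs, virama) with no base before them."""
    hits = []
    prev = None
    for i, ch in enumerate(text):
        is_mark = unicodedata.category(ch).startswith("M")
        # A mark is accepted if the previous character is a letter or another mark.
        prev_ok = prev is not None and unicodedata.category(prev)[0] in "LM"
        if is_mark and not prev_ok:
            hits.append((i, f"U+{ord(ch):04X}"))
        prev = ch
    return hits


# The second line of the paragraph, as typed and as extracted from the PDF.
print(diff_code_points("வட வேங்கடம் தென் குமரி", "வட வவங்கடம் தென் குமரி"))
# The fourth line as extracted, which begins with a vowel sign.
print(dangling_marks("ெமிழ் கூறும் நல் உலகத்து"))
```

The output uses U+XXXX notation rather than the characters themselves, so the results stay readable even when the text being inspected is not.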

And if you do the entire document then the PDF has this text that is even more different:

ஓிநப்புப்தா஦ி஧ம்
஬ட வ஬ங்஑டம் த஡ன் கு஥ரி
ஆ஦ிடடத்
஡஥ிழ் கூறும் ஢ல் உன஑த்து
஬஫க்கும் தஓய்ப௅ளும் ஆ஦ிபே ப௃஡னின்
஋ழுத்தும் தஓால்ற௃ம் ததாபேளும் ஢ாடிச்

Now, even the first letter of that first word is wrong and overall there are even more problems and lots of code points that are still in the Tamil range that aren't even defined.
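
The "not even defined" part is easy to confirm for yourself. A minimal sketch, assuming Python's bundled Unicode database is recent enough for your purposes: it reports every character that falls inside the Tamil block (U+0B80 through U+0BFF) but has the general category Cn, meaning unassigned. The function name is hypothetical.

```python
import unicodedata

# The Tamil block in Unicode: U+0B80..U+0BFF.
TAMIL_BLOCK = range(0x0B80, 0x0C00)


def unassigned_tamil(text):
    """Return (index, code point) for Tamil-block characters that are unassigned."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in TAMIL_BLOCK and unicodedata.category(ch) == "Cn"
    ]


# The third line of the paragraph as extracted from the full-document PDF.
print(unassigned_tamil("ஆ஦ிடடத்"))
```

Anything this reports is a code point that no conforming Tamil text should contain, which is a much stronger signal than a merely unlikely sequence of letters.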

Summary: the bigger it gets, the worse it gets.

Friend Saranya helped me verify that the bug was in the PDF itself, and not a bug in either the Adobe Reader I was using or the copy/paste process itself (either one of which would still have been a bug -- just a very different one). Her help here was invaluable because I needed to know who to blame to know what to write about this time.

Now it was not my intent to deceive anybody with that initial blog, but now knowing that my claims gave Microsoft more credit than it (than we) deserved, I am writing this blog. And I will update the older one to include a pointer to this one.

Because I don't slam competitor products often but if I do so and I am to stay credible, I must be just as willing to be hard on Microsoft when they do things wrong as I am on Adobe or Google or anyone else.

With the bonus being that I can get potentially more done with people in Microsoft than I can with those other companies, since I don't have a SlashDot-esque PR nightmare to hand Adobe or Google (given that antennas and iPhone recall requests and store employees making fun of iPhone vs. Evo and such need to take up all the column inches).

But at least inside Microsoft, I have the fact that the engineers who do this work know that the entire effort is undermined by bugs like this, and the people who do the work really want things to be correct. I have no proof in the case of these other companies because they have not ever reported that these problems widely exist or that a solution is in the works.

My only hope is that these other companies have someone just like me within them, someone generally considered to be T.U.T.F. who gets very stubborn about inadequate support of core scenarios that affect language quality.

Because frankly the whole world deserves something better than this. From all of the tech companies -- including Adobe, Google, and Microsoft....


Daniel on 16 Jul 2010 8:03 AM:

So in reality Mac OS X also fails miserably when taking real text rather than single characters.

The generated PDF on first sight looks the same but if you try to copy from it, you end up with garbage:

!ற#$#பா'ர) வட ேவ-கட) ெத1 2ம4 ஆ'ைட7 த89 :;) ந= உலக7@ வழB2) ெசDEF) ஆ'G HதI1 எK7@) L=M) NGF) நாOP

And that same garbage is what you can search for.

You search for the digit 7 and it highlights த்.

People affected by this sure have my sympathy.

Joe Clark on 16 Jul 2010 1:47 PM:

Representation of scripts with conjuncts in tagged PDF is, at best, barely possible. The eternal issue of using a sequence of underlying characters that result in a single visible character (with later characters affecting previous ones) has not been resolved.

One tiny outpost of the PDF demimonde knows about this but has chosen to ignore it. The rest are in complete ignorance.

Michael S. Kaplan on 16 Jul 2010 10:11 PM:

The Microsoft Word implementation, if one ignores this one bug, is sound -- proving it is possible. But the bug proves that even when one is trying to do it right one can have problems....

Joe Clark on 17 Jul 2010 11:00 AM:

Yes, but not every method of producing a conjunct character (even decomposed diacritics on Latin) will work in tagged PDF.

Michael S. Kaplan on 17 Jul 2010 11:13 AM:

I have looked at the text stream and the glyphs and the connection between them, and while individual connections are weak in points where there is no good way to pick a "place" to say you are, that is just like when in the middle of a conjunct and there is no good "place" to put the cursor. It does not mean the text is unavailable.

Do you have an example where the text will not be able to be there, though?

Also (FWIW) the XPS text is completely available and does not have this issue. XPS may not be a PDF killer, but it is a PDF "reminder that you can do better when you put forth the effort" tool. :-)

Henrik Holmegaard, technical writer on 4 Nov 2010 4:17 AM:

> My only hope is that these other companies have someone just like me within them, someone generally considered to be T.U.T.F. who gets very stubborn about inadequate support of core scenarios that affect language quality.

If you want comment, then it does not make much sense to censure comment at the start. If you want to talk to yourself, though, you're welcome :-).

/hh

Michael S. Kaplan on 4 Nov 2010 5:08 AM:

Comments are always welcome!


referenced by

2010/09/29 Dotting the t's and crossing the i's is more work than that, PDF edition

2010/09/06 Acrobat PDF: the Yugo vs. the BMW vs. the Ferrari

2010/07/06 Beauty isn't only glyph deep
