Now that the excitement of PyCon 2009 is over, it is time for me to finish this brief series of blog posts on watermarking PDF files. In the first post I outlined how GraphicsMagick and Adobe Reader proved essential to the project for their ability to produce correct PDF files and then help me verify their correctness. The second post showed how an image can be applied as a watermark using the pdftk PDF Toolkit utility. The resulting watermark, after some margins had been added using a Python script, looked rather attractive:
Watermarked page (click for PDF). The light blue design is a PDF file that pdftk resizes and centers on the target document.
My last challenge was that, on certain pages, the watermark we were using had to be turned upside down. “Simple,” I thought, “I'll use pdftk to turn the watermark over before applying it!” I just had to process the watermark image with the letter S (“south”), which tells pdftk to rotate the image by 180°, and then use the result as the watermark:
$ pdftk arecibo2.pdf cat 1S output arecibo3.pdf
$ pdftk in.pdf background arecibo3.pdf output wmark3.pdf
Upside-down watermark (click for PDF). Whoops! After turning the watermark upside down, pdftk lost the ability to properly center it.
Actually, the situation with my real customer's documents was even worse than in this artificial example: the real watermark was a bit smaller, and positioned differently, such that it actually got moved entirely off the page thanks to this misplacement bug in pdftk! So I did not even start my investigation with the knowledge that the watermark was misplaced; for all I knew, it was simply missing from the output file.
That is why I got to spend an afternoon with some coffee and the PDF specification (thank you, Adobe web site!) examining the raw files for myself to discover what had happened.
Actually, pdftk proved indispensable even while I was investigating its own misbehavior! It was its uncompress command that made the internals of the PDF files visible to my text editor. The important difference turned out to be in a simple instruction, which you can see for yourself if you use pdftk to uncompress the two PDF files that are linked to from the above images, and then use diff -u to compare them:
 q
-q 8.52 0 0 8.52 208.06 85.16 cm /Xi0 Do Q
+q -8.52 0 0 -8.52 574.26 877.16 cm /Xi0 Do Q
 Q
Here, diff uses the leading - and + characters to indicate that the first long line has been removed and replaced with the second. The six numbers in front of the cm operator are a transformation matrix that specifies how the watermark image is scaled and where it is placed on the page. What has happened is that the offsets for the upside-down watermark are wrong, and the error corresponds exactly to the margins that we added in the previous blog post!
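If you would rather script the comparison than run pdftk and diff by hand, here is a minimal Python sketch of the same idea. The helper names (uncompress_command, changed_lines) are my own inventions for illustration, not part of pdftk. The demonstration at the bottom runs on the two content-stream fragments quoted above, so it needs neither pdftk nor the PDF files.

```python
import difflib


def uncompress_command(src, dst):
    """Build the pdftk call that rewrites src with uncompressed streams."""
    return ["pdftk", src, "output", dst, "uncompress"]


def changed_lines(old_text, new_text):
    """Return only the removed (-) and added (+) lines of a unified diff."""
    diff = list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm="", n=0
    ))
    body = diff[2:]  # skip the "---" and "+++" file header lines
    return [line for line in body if line.startswith(("+", "-"))]


# The two fragments from the uncompressed files, as quoted in the post.
upright = "q\nq 8.52 0 0 8.52 208.06 85.16 cm /Xi0 Do Q\nQ\n"
flipped = "q\nq -8.52 0 0 -8.52 574.26 877.16 cm /Xi0 Do Q\nQ\n"

for line in changed_lines(upright, flipped):
    print(line)
```

To use this on real files, you would first run each list returned by uncompress_command through subprocess.run, and then read the results with open(path, encoding="latin-1") so that any remaining binary bytes do not trip up the decoder.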
The problem quickly became clear, once I drew some boxes on graph paper and did some sums. Very nearly every PDF file in existence has a bounding box whose lower-left coordinates are (0,0). But in the previous blog post in this series, we quickly added margins to our PDF by adjusting these coordinates outward so that they lay at the very uncommon values of (-10,-10). You can check this by running the wonderful pdfinfo command against the arecibo2.pdf file yourself:
$ pdfinfo -box arecibo2.pdf
...
Page size:      43 x 93 pts
...
CropBox:        -10.00 -10.00 33.00 83.00
...
When we ask pdftk to apply a watermark that is right side up, all is well. But when we apply a watermark upside down, pdftk forgets to apply a negative sign to our negative bounding box coordinates, and they wind up shifting the image in the opposite direction.
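The graph-paper sums can be repeated in a few lines of Python. This is only a sketch built from the numbers quoted in this post: the two matrices come from the diff and the box comes from the pdfinfo output. The function names are mine. A PDF matrix [a b c d e f] maps a point (x, y) to (a·x + c·y + e, b·x + d·y + f), so pushing two opposite corners of the CropBox through each matrix shows where the watermark lands.

```python
def apply_cm(matrix, point):
    """Map a point through a PDF transformation matrix [a b c d e f]."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


def placed_box(matrix, box):
    """Return (left, bottom, right, top) of box after the transformation."""
    llx, lly, urx, ury = box
    x0, y0 = apply_cm(matrix, (llx, lly))
    x1, y1 = apply_cm(matrix, (urx, ury))
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


CROP_BOX = (-10.0, -10.0, 33.0, 83.0)            # from pdfinfo -box
UPRIGHT = (8.52, 0, 0, 8.52, 208.06, 85.16)      # the "-" line of the diff
FLIPPED = (-8.52, 0, 0, -8.52, 574.26, 877.16)   # the "+" line of the diff

good = placed_box(UPRIGHT, CROP_BOX)
bad = placed_box(FLIPPED, CROP_BOX)
shift = (bad[0] - good[0], bad[1] - good[1])

print("upright:", [round(v, 2) for v in good])
print("flipped:", [round(v, 2) for v in bad])
print("shift:  ", [round(v, 2) for v in shift])
```

With these numbers the upside-down watermark comes out about 170 points too far up and to the right. That is roughly twice the 10-unit margin multiplied by the 8.52 scale factor, which is what you would expect if the margin offset were applied with the wrong sign.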
It took me more than an hour of crazy experiments before I realized that I could work around the bug quite simply by turning the document upside down with pdftk, applying the watermark, and then turning the document upright again! I had spent time looking for a complicated solution that would fool pdftk into doing the right thing, when all I needed to do was to step back and figure out how to avoid the whole problem of upside down watermarks in the first place. How? With upside down documents!
$ pdftk in.pdf cat 1S output in-u.pdf
$ pdftk in-u.pdf background arecibo2.pdf output wmark4-u.pdf
$ pdftk wmark4-u.pdf cat 1N output wmark4.pdf
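If you need to do this to more than one document, the three steps are easy to wrap in a small Python function. The sketch below is one way it might look. The function names and the temporary file names are hypothetical, and the pdftk arguments are copied from the commands above, so like them it only handles page 1 of the document.

```python
import os
import subprocess
import tempfile


def workaround_commands(document, watermark, output, workdir):
    """Return the three pdftk calls: flip, watermark, flip back."""
    flipped = os.path.join(workdir, "flipped.pdf")  # hypothetical temp name
    marked = os.path.join(workdir, "marked.pdf")    # hypothetical temp name
    return [
        ["pdftk", document, "cat", "1S", "output", flipped],
        ["pdftk", flipped, "background", watermark, "output", marked],
        ["pdftk", marked, "cat", "1N", "output", output],
    ]


def watermark_upside_down(document, watermark, output):
    """Run the workaround, keeping intermediate files in a temp directory."""
    with tempfile.TemporaryDirectory() as workdir:
        for command in workaround_commands(document, watermark, output, workdir):
            subprocess.run(command, check=True)  # stop at the first failure
```

Calling watermark_upside_down("in.pdf", "arecibo2.pdf", "wmark4.pdf") would then run the same three commands, assuming pdftk is on your PATH.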