Adding margins to PDF watermarks

Date:	15 March 2009
Tags:	computing, document-processing, python

This is the second article in my series on adding “watermark” images to PDF files: images that sit behind any text and graphics that were already on the page. Last week I outlined the first two lessons that I learned while developing this watermark process: first, always use Adobe Acrobat to verify that your toolchain is creating valid PDFs, and second, the version of GraphicsMagick that currently comes with Debian unstable produces better PDF files than the version of ImageMagick that Debian ships.

Then I digressed with a blog entry on a slightly different topic, nested list comprehensions in Python, because I happened to write one while creating the image we will use as our sample watermark. It shows the famous Arecibo space message, and is a tiny image of only 23×73 pixels that looks like this when enlarged:

[Image: the Arecibo message, enlarged]

The basic watermarking process itself is very simple thanks to a wonderful tool that I discovered called pdftk (short for “PDF toolkit”) which, as usual, Debian has already packaged for me. It can rotate documents, extract pages, concatenate several files together, and help fill out PDF forms from data in a file. Of particular interest here is its ability to either “stamp” an image on top of each page of a document, or to place one in the background as a watermark.

The watermark image itself has to be a PDF file — pdftk does not deal in any other file formats — which is why I needed GraphicsMagick to convert the Arecibo image into a PDF in the first place. Putting the two steps together, one has a primitive but workable process for using a PNG image as a watermark:

$ gm convert arecibo.png arecibo.pdf
$ pdftk in.pdf background arecibo.pdf output wmark1.pdf

Hefty watermark. A first attempt at watermarking results in a huge watermark that reaches both to the top and bottom edges of the page.

As you can see, pdftk automatically adjusts the size of the watermark image to reach precisely to the edges of the page being marked — which is a huge favor given the difficulty I would have had in resizing the watermark myself to match the page size of the input file. But, in the above case, the result seems less than perfectly attractive; watermarks usually sit tidily near the center of a page, rather than running all the way against its edges.
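To see why the mark ends up touching the top and bottom of the page but not the sides, here is a rough sketch of the arithmetic. It rests on assumptions of mine rather than anything stated above: that the converted image is 23×73 points (one point per pixel), that in.pdf uses US Letter pages of 612×792 points, and that pdftk scales the watermark uniformly until it fits. The fit_scale function is a hypothetical helper, not part of pdftk.

```python
def fit_scale(mark_w, mark_h, page_w, page_h):
    """Largest uniform scale at which the mark still fits inside the page."""
    return min(page_w / float(mark_w), page_h / float(mark_h))

# Assumed sizes, in PDF points (1/72 inch).
MARK_W, MARK_H = 23, 73          # Arecibo image, one point per pixel
PAGE_W, PAGE_H = 612.0, 792.0    # US Letter

scale = fit_scale(MARK_W, MARK_H, PAGE_W, PAGE_H)
scaled_w = MARK_W * scale
scaled_h = MARK_H * scale

print("scale factor: %.2f" % scale)
print("scaled mark:  %.1f x %.1f points" % (scaled_w, scaled_h))
```

Under those assumptions the height is the limiting dimension, so the mark fills all 792 points vertically while staying roughly 250 points wide.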

Clearly, we want to add some margins to the watermark. And though margins are easy to add to some image formats — they would be simple to add to the arecibo.png file that we are using in this example — in actual practice I need to support watermarks that might be in vector formats like SVG or EPS. While I could go through each possible input format and contrive some way of adjusting its margins, it would obviously be much more convenient to convert everything to PDF first, and then add margins directly to the PDFs.

I used Debian's apt-cache search command to look for additional tools that might help me (which is how I found pdftk in the first place!) and found an old command called pdfcrop that was part of the texlive series of packages; it supports a --margins option with which whitespace can be added around a PDF file. But I found that it would often refuse to process a perfectly good PDF file, failing with a horribly uninformative error message like:

Error: Cannot move `tmp-pdfcrop-10631.pdf' to `out.pdf'!

I tried to investigate the error message, but discovered that pdfcrop is actually a Perl script that writes LaTeX macros which are then run against the target PDF file. And it was last updated in 2004. I have, alas, elected not to make it part of my toolchain.

Then I discovered that Python itself has a quite serviceable PDF package named pyPdf, with the bonus that it is written in pure Python and therefore requires no external libraries! Thanks to its ability to adjust the “bounding box” that defines the edges of an image in PDF coordinates, adding margins was as simple as loading the image, doing some addition and subtraction, and then saving the result. To add modest 10-point margins to the Arecibo message, for example, we can create this wmargins.py script:

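from pyPdf import PdfFileWriter, PdfFileReader

MARGIN = 10  # points to add on every side

# Open the watermark PDF and take its only page.
pdf = PdfFileReader(open('arecibo.pdf', 'rb'))
p = pdf.getPage(0)

# Push the corners of every page box outward by MARGIN points.
for box in (p.mediaBox, p.cropBox, p.bleedBox, p.trimBox, p.artBox):
    box.lowerLeft = (box.getLowerLeft_x() - MARGIN,
                     box.getLowerLeft_y() - MARGIN)
    box.upperRight = (box.getUpperRight_x() + MARGIN,
                      box.getUpperRight_y() + MARGIN)

# Write the adjusted page to a new file and close it explicitly.
output = PdfFileWriter()
output.addPage(p)
out_file = open('arecibo2.pdf', 'wb')
output.write(out_file)
out_file.close()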
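The “addition and subtraction” in that script can be tried without pyPdf at all. The sketch below uses plain tuples in place of pyPdf's box objects; expand_box is a hypothetical helper of my own, and the starting box of (0, 0, 23, 73) assumes the converted image is one point per pixel with its origin at the lower left.

```python
def expand_box(box, margin):
    """Return a (llx, lly, urx, ury) box grown by `margin` on every side."""
    llx, lly, urx, ury = box
    # Lower-left corner moves down and left, upper-right moves up and right.
    return (llx - margin, lly - margin, urx + margin, ury + margin)

# Assumed bounding box of the Arecibo PDF, in points.
arecibo_box = (0, 0, 23, 73)

padded_box = expand_box(arecibo_box, 10)
padded_w = padded_box[2] - padded_box[0]
padded_h = padded_box[3] - padded_box[1]

print(padded_box)
print("%d x %d points" % (padded_w, padded_h))
```

Each dimension grows by twice the margin, so the 23×73 image ends up on a 43×93 page whose lower-left corner sits at negative coordinates.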

You can test this yourself by installing pyPdf in a convenient temporary directory with virtualenv, running the above script, then calling pdftk on the result:

$ virtualenv vpython
$ vpython/bin/easy_install pyPdf
$ vpython/bin/python wmargins.py
$ pdftk in.pdf background arecibo2.pdf output wmark2.pdf

Watermark with margins. Margins prevent the watermark from reaching the page edges, which allows the blocks of text to assume the role of defining the visual shape of the page.
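A 10-point margin sounds tiny, but it is scaled up along with the image. As a rough estimate, and again assuming a 23×73-point image, a US Letter page of 612×792 points, and a uniform scale-to-fit by pdftk (all assumptions of mine), the margin that finally appears at the top and bottom of the page can be worked out like this; on_page_margin is a hypothetical helper.

```python
def on_page_margin(mark_w, mark_h, margin, page_w, page_h):
    """Size of the margin after the padded mark is scaled to fit the page."""
    padded_w = mark_w + 2 * margin
    padded_h = mark_h + 2 * margin
    # Uniform scale-to-fit, as assumed for pdftk's background operation.
    scale = min(page_w / float(padded_w), page_h / float(padded_h))
    return margin * scale

margin_pts = on_page_margin(23, 73, 10, 612.0, 792.0)
margin_inches = margin_pts / 72.0

print("margin on page: %.1f points (%.2f inches)" % (margin_pts, margin_inches))
```

Under those assumptions the modest 10 points becomes about 85 points, a little under 1.2 inches, at the top and bottom of a Letter page.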

All pretty simple, right? Well, it turns out that there was one final complication — and that, before I was finished, I actually wound up spending more than an hour reading the PDF specification in order to understand what, exactly, was going wrong! But that will be the topic for my last blog post in this series. Stay tuned.