This is the second article in my series
on adding “watermark” images to PDF files:
images that sit behind any text and graphics already on the page.
Last week I outlined
the first two lessons that I learned
while developing this watermark process:
first, always use Adobe Acrobat
to verify that you are creating valid PDFs in your toolchain,
and second, the version of GraphicsMagick
that currently comes with Debian unstable
produces better PDF files
than the version of ImageMagick that Debian ships.
Then I digressed with a blog entry on a slightly different topic,
nested list comprehensions in Python,
because I happened to write one
while creating the image we will use as our sample watermark.
It shows the famous Arecibo space message,
and is a tiny image
of only 23×73 pixels
that looks like this when enlarged:

[Image: the Arecibo message, enlarged]

The basic watermarking process itself is very simple
thanks to a wonderful tool that I discovered
called pdftk
(short for “PDF toolkit”)
which, as usual, Debian has already packaged for me.
It can rotate documents,
extract pages,
concatenate several files together,
and help fill out PDF forms from data in a file.
Of particular interest here is its ability either to “stamp” an image
on top of each page of a document,
or to place one in the background as a watermark.
The watermark image itself has to be a PDF file —
pdftk does not deal in any other file formats —
which is why I needed GraphicsMagick
to convert the Arecibo image into a PDF in the first place.
Putting the two steps together,
one has a primitive but workable process
for using a PNG image as a watermark:
$ gm convert arecibo.png arecibo.pdf
$ pdftk in.pdf background arecibo.pdf output wmark1.pdf
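
If you would rather drive these two steps from Python than from the shell, the following is a minimal sketch of how that might look. The function names and the choice to name the intermediate PDF after the image are my own assumptions, not part of either tool; the only things taken from the commands above are the `gm convert` and `pdftk ... background ... output ...` invocations, plus pdftk's `stamp` operation mentioned earlier.

```python
import os.path
import subprocess


def watermark_commands(image, source_pdf, output_pdf, mode="background"):
    """Return the two argv lists that convert the image and apply it."""
    if mode not in ("background", "stamp"):
        raise ValueError("mode must be 'background' or 'stamp'")
    # Assumption for illustration: the intermediate PDF sits next to the
    # image and shares its base name (arecibo.png -> arecibo.pdf).
    watermark_pdf = os.path.splitext(image)[0] + ".pdf"
    return [
        ["gm", "convert", image, watermark_pdf],
        ["pdftk", source_pdf, mode, watermark_pdf, "output", output_pdf],
    ]


def apply_watermark(image, source_pdf, output_pdf, mode="background"):
    """Run both steps in order, stopping at the first command that fails."""
    for argv in watermark_commands(image, source_pdf, output_pdf, mode):
        subprocess.check_call(argv)
```

Calling `apply_watermark("arecibo.png", "in.pdf", "wmark1.pdf")` should then be equivalent to typing the two commands by hand, provided both `gm` and `pdftk` are on the path.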

Hefty watermark.
A first attempt at watermarking results in a huge watermark
that reaches both the top and bottom edges of the page.
As you can see,
pdftk automatically adjusts the size of the watermark image
to reach precisely to the edges of the page being marked —
which is a huge favor
given the difficulty I would have had
in resizing the watermark myself
to match the page size of the input file.
But, in the above case,
the result seems less than perfectly attractive;
watermarks usually sit tidily near the center of a page,
rather than running all the way against its edges.
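
To see why the unpadded watermark ends up touching the top and bottom of the page, here is a rough sketch of the arithmetic. It rests on several assumptions that are mine rather than anything documented above: that the scaling keeps the watermark's aspect ratio, that the target page is US Letter (612 by 792 points), and that the converted image measures one point per pixel, so 23 by 73 points.

```python
def fit_scale(box_width, box_height, page_width, page_height):
    """Largest uniform scale at which the box still fits inside the page."""
    return min(page_width / float(box_width),
               page_height / float(box_height))


# Assumed sizes, not measured from the article's files: a US Letter page
# and a watermark whose PDF box is 23 x 73 points.
scale = fit_scale(23, 73, 612, 792)
scaled_width = 23 * scale
scaled_height = 73 * scale

print("scale factor: %.2f" % scale)
print("watermark on page: %.1f x %.1f points" % (scaled_width, scaled_height))
```

Under these assumptions the tall, narrow image is limited by the page height, so it fills the page from top to bottom while staying well inside the left and right edges.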
Clearly, we want to add some margins to the watermark.
And though margins are easy to add to some image formats —
they would be simple to add
to the arecibo.png file that we are using
in this example —
in actual practice I need to support watermarks
that might be in vector formats like SVG or EPS.
While I could go through each possible input format
and contrive some way of adjusting its margins,
it would obviously be much more convenient
to convert everything to PDF first,
and then add margins directly to the PDFs.
I used Debian's apt-cache search command
to look for additional tools that might help me
(which is how I found pdftk in the first place!)
and found an old command called pdfcrop
that was part of the texlive series of packages;
it supports a --margins option
with which whitespace can be added around a PDF file.
But I found that it would often refuse to process
a perfectly good PDF file
with a horribly uninformative error message like:
Error: Cannot move `tmp-pdfcrop-10631.pdf' to `out.pdf'!
I tried to investigate the error message,
but discovered that pdfcrop is actually a Perl script
that writes LaTeX macros
which are then run against the target PDF file.
And it was last updated in 2004.
I have, alas, elected not to make it part of my toolchain.
Then I discovered that Python itself
has a quite serviceable PDF package named pyPdf,
with the bonus that it is written in pure Python
and therefore requires no external libraries!
Thanks to its ability to adjust the “bounding box”
that defines the edges of an image in PDF coordinates,
adding margins was as simple as loading the image,
doing some addition and subtraction,
and then saving the result.
To add modest 10-point margins to the Arecibo message,
for example, we can create this wmargins.py script:
from pyPdf import PdfFileWriter, PdfFileReader

# Load the one-page watermark PDF.
pdf = PdfFileReader(open('arecibo.pdf', 'rb'))
p = pdf.getPage(0)

# Grow every page box outward by 10 points on each side.
for box in (p.mediaBox, p.cropBox, p.bleedBox,
            p.trimBox, p.artBox):
    box.lowerLeft = (box.getLowerLeft_x() - 10,
                     box.getLowerLeft_y() - 10)
    box.upperRight = (box.getUpperRight_x() + 10,
                      box.getUpperRight_y() + 10)

# Write the enlarged page out as a new PDF.
output = PdfFileWriter()
output.addPage(p)
output.write(open('arecibo2.pdf', 'wb'))
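
The "addition and subtraction" in that loop can be tried out without any PDF library at all. The sketch below uses a hypothetical helper, `add_margins`, which is not part of pyPdf, and assumes a starting box of 23 by 73 points with its lower-left corner at the origin.

```python
def add_margins(lower_left, upper_right, margin):
    """Grow a box, given as two (x, y) corners in points, on every side."""
    (x0, y0), (x1, y1) = lower_left, upper_right
    return (x0 - margin, y0 - margin), (x1 + margin, y1 + margin)


# Assumed starting box: 23 x 73 points with its origin at (0, 0).
new_lower_left, new_upper_right = add_margins((0, 0), (23, 73), 10)
new_width = new_upper_right[0] - new_lower_left[0]
new_height = new_upper_right[1] - new_lower_left[1]

print("corners: %s to %s" % (new_lower_left, new_upper_right))
print("size: %d x %d points" % (new_width, new_height))
```

The lower-left corner moves to negative coordinates while the image itself stays where it was, so the box grows to 43 by 93 points around unchanged content.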
You can test this yourself by installing pyPdf
in a convenient temporary directory with virtualenv,
running the above script, then calling pdftk on the result:
$ virtualenv vpython
$ vpython/bin/easy_install pyPdf
$ vpython/bin/python wmargins.py
$ pdftk in.pdf background arecibo2.pdf output wmark2.pdf

Watermark with margins.
Margins prevent the watermark from reaching the page edges,
which allows the blocks of text to assume the role
of defining the visual shape of the page.
All pretty simple, right?
Well, it turns out that there was one final complication —
and that, before I was finished, I actually wound up spending
more than an hour reading the PDF specification
in order to understand what, exactly, was going wrong!
But that will be the topic for my last blog post in this series.
Stay tuned.