The September 2009 issue of Python Magazine

The September issue of Python Magazine appeared on the web late last week and only now, as a new week has started, am I finally sitting down to announce it! The articles range from technically heavy development topics to high-level thoughts about the whole Python community, with plenty in between.

I have to say that our prettiest article this month is “Using Python to Create Beautiful Documents” by Yusdi Santoso, who shares the basic secrets to document generation that he learned when building the EuroPython 2009 brochure using a Python program. Traditional typesetting and computer typography were both interests of mine when I was growing up, so it was fun to read Yusdi's introduction to using ReportLab to generate PDF documents. I look forward to his follow-up article that we will soon be publishing, on the specific techniques that he used in creating the EuroPython booklet.

The other technical articles are an introduction to using SOAP in Python; a guide to displaying objects in a Mac OS X GUI created with PyObjC; an article introducing Python's own built-in Tkinter GUI toolkit; and a small excursion of my own that attempts to explain the popular “trick” (well, it really confused me the first time I saw it!) of defining a decorator using a pair of nested functions. I should confess that my own article contains what is probably this issue's biggest mistake, as pointed out quite promptly by alert reader Emanuel Woiski: in the code sample that is the whole crux of my example, I somehow managed to omit one of the most crucial lines. The complete sample, with the missing line restored, reads:

def log(function):
    def log_wrapper(*args):
        print "called %s%s" % (
            function.__name__, tuple(args)
            )
        return function(*args)
    return log_wrapper

I suppose I will now need remedial cut-and-paste training of some sort.
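For readers on Python 3, here is a minimal sketch of the same nested-function pattern. It is not the code from the magazine article: the use of functools.wraps, the keyword-argument passthrough, and the add() example function are my own additions for illustration.

```python
import functools


def log(function):
    """Outer function: receives the function being decorated."""

    @functools.wraps(function)  # keep the original name and docstring
    def log_wrapper(*args, **kwargs):
        # Inner function: runs on every call, reports the positional
        # arguments, then hands off to the original function.
        print("called %s%r" % (function.__name__, args))
        return function(*args, **kwargs)

    # Returning the wrapper is what makes "@log" replace the function.
    return log_wrapper


@log
def add(a, b):
    """Hypothetical example function."""
    return a + b


result = add(2, 3)  # prints: called add(2, 3)
```

The outer function runs once, at decoration time; the inner one runs on every call. Forgetting either return statement leaves the decorated name bound to None or makes every call return None, which is why a single omitted line matters so much in a sample this short.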

Finally, the issue is rounded out by three articles that move back from Python coding and step out to wider vantage points. Justin Lilly provides an excellent guide to customizing your Vim setup so that it becomes a powerful Python integrated development environment. Steve Holden muses about why diversity is so difficult and reveals some of the recent goings-on surrounding the diversity statement that the Python Software Foundation has been working on. And my own editorial seeks to point any Python Magazine readers who do not yet have a strong connection with the wider community in the direction of greater engagement with the world of Python.

All in all, I think the issue is a nice mix of fact, experience, and opinion. Please consider subscribing if you would like to hear more about what people are doing with Python, and how. I enjoy reading it; so might you.

Posted in Computing, Document processing, Python | 3 Comments »

Applying PDF watermarks upside down

Now that the excitement of PyCon 2009 is over, it is time for me to finish this brief series of blog posts on watermarking PDF files. In the first post I outlined how GraphicsMagick and Adobe Reader proved essential to the project for their ability to produce correct PDF files and then help me verify their correctness. The second post showed how an image can be applied as a watermark using the pdftk PDF Toolkit utility. The resulting watermark, after some margins had been added using a Python script, looked rather attractive:

Watermarked page. The light blue design is a PDF file that pdftk resizes and centers on the target document.

My last challenge was that, on certain pages, the watermark we were using had to be turned upside down. “Simple,” I thought, “I'll use pdftk to turn the watermark over before applying it!” I just had to process the watermark image with the letter S (“south”), which tells pdftk to rotate the image by 180°, and then use the result as the watermark:

$ pdftk arecibo2.pdf cat 1S output arecibo3.pdf
$ pdftk in.pdf background arecibo3.pdf output wmark3.pdf
Upside-down watermark. Whoops! After turning the watermark upside down, pdftk lost the ability to properly center it.

Posted in Computing, Document processing | 2 Comments »

Adding margins to PDF watermarks

This is the second article in my series on adding “watermark” images to PDF files, which sit behind any text and graphics that were already on the page. Last week I outlined the first two lessons that I learned while developing this watermark process: first, always use Adobe Acrobat to verify that you are creating valid PDFs in your toolchain, and second, the version of GraphicsMagick that currently comes with Debian unstable produces better PDF files than the version of ImageMagick they ship.

Then I digressed with a blog entry on a slightly different topic, nested list comprehensions in Python, because I happened to write one while creating the image we will use as our sample watermark. It shows the famous Arecibo space message, and is a tiny image of only 23×73 pixels that looks like this when enlarged:

Arecibo message

The basic watermarking process itself is very simple thanks to a wonderful tool that I discovered called pdftk (short for “PDF toolkit”) which, as usual, Debian has already packaged for me. It can rotate documents, extract pages, concatenate several files together, and help fill out PDF forms from data in a file. Of particular interest here is its ability to either “stamp” an image on top of each page of a document, or to place one in the background as a watermark.

The watermark image itself has to be a PDF file — pdftk does not deal in any other file formats — which is why I needed GraphicsMagick to convert the Arecibo image into a PDF in the first place. Putting the two steps together, one has a primitive but workable process for using a PNG image as a watermark:

$ gm convert arecibo.png arecibo.pdf
$ pdftk in.pdf background arecibo.pdf output wmark1.pdf
Hefty watermark. A first attempt at watermarking results in a huge watermark that reaches both to the top and bottom edges of the page.

As you can see, pdftk automatically adjusts the size of the watermark image to reach precisely to the edges of the page being marked — which is a huge favor given the difficulty I would have had in resizing the watermark myself to match the page size of the input file. But, in the above case, the result seems less than perfectly attractive; watermarks usually sit tidily near the center of a page, rather than running all the way against its edges.
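The resizing can be worked out with a little arithmetic. The sketch below is only a model of what the output suggests, not pdftk's actual code: it assumes that the watermark is scaled uniformly until it just fits inside the page, and that the target is a US Letter page of 612×792 points. The function names are hypothetical.

```python
def fit_scale(width, height, page_width, page_height):
    """Largest uniform scale at which width x height fits the page."""
    return min(page_width / width, page_height / height)


def fitted_size(width, height, page_width, page_height):
    """Size of the artwork after scaling it to fit the page."""
    scale = fit_scale(width, height, page_width, page_height)
    return (width * scale, height * scale)


# Assumed page size: US Letter, in PDF points (1/72 inch).
LETTER = (612.0, 792.0)

# The Arecibo image is 23 x 73 pixels; only its proportions matter here.
w, h = fitted_size(23, 73, *LETTER)

# The image is tall and narrow, so the height is the limiting side:
# it comes out roughly 249.5 points wide and 792 points tall.
print("%.1f x %.1f points" % (w, h))
```

Under these assumptions the watermark touches the top and bottom edges while leaving wide bands of empty page to its left and right, which matches the first attempt shown above.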

Clearly, we want to add some margins to the watermark. And though margins are easy to add to some image formats — they would be simple to add to the arecibo.png file that we are using in this example — in actual practice I need to support watermarks that might be in vector formats like SVG or EPS. While I could go through each possible input format and contrive some way of adjusting its margins, it would obviously be much more convenient to convert everything to PDF first, and then add margins directly to the PDFs.

I used Debian's apt-cache search command to look for additional tools that might help me (which is how I found pdftk in the first place!) and found an old command called pdfcrop that was part of the texlive series of packages; it supports a --margins option with which whitespace can be added around a PDF file. But I found that it often would refuse to process a perfectly good PDF file with a horribly uninformative error message like:

Error: Cannot move `tmp-pdfcrop-10631.pdf' to `out.pdf'!

I tried to investigate the error message, but discovered that pdfcrop is actually a Perl script that writes LaTeX macros which are then run against the target PDF file. And it was last updated in 2004. I have, alas, elected not to make it part of my toolchain.

Then I discovered that Python itself has a quite serviceable PDF package named pyPdf, with the bonus that it is written in pure Python and therefore requires no external libraries! Thanks to its ability to adjust the “bounding box” that defines the edges of an image in PDF coordinates, adding margins was as simple as loading the image, doing some addition and subtraction, and then saving the result. To add modest 10-point margins to the Arecibo message, for example, we can create this wmargins.py script:

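from pyPdf import PdfFileWriter, PdfFileReader

# Read the single-page watermark produced by "gm convert".
pdf = PdfFileReader(open('arecibo.pdf', 'rb'))
p = pdf.getPage(0)

# Push every page box outward by 10 points on each side, so that
# the artwork keeps its size while the page around it grows.
for box in (p.mediaBox, p.cropBox, p.bleedBox, p.trimBox, p.artBox):
    box.lowerLeft = (box.getLowerLeft_x() - 10,
                     box.getLowerLeft_y() - 10)
    box.upperRight = (box.getUpperRight_x() + 10,
                      box.getUpperRight_y() + 10)

# Save the enlarged page as the new watermark file.
output = PdfFileWriter()
output.addPage(p)
output.write(open('arecibo2.pdf', 'wb'))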

You can test this yourself by installing pyPdf in a convenient temporary directory with virtualenv, running the above script, then calling pdftk on the result:

$ virtualenv vpython
$ vpython/bin/easy_install pyPdf
$ vpython/bin/python wmargins.py
$ pdftk in.pdf background arecibo2.pdf output wmark2.pdf
Watermark with margins. Margins prevent the watermark from reaching the page edges, which allows the blocks of text to assume the role of defining the visual shape of the page.
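The addition and subtraction inside wmargins.py can be checked without any PDF library at all. This is a sketch in plain Python 3 with hypothetical helper names; it represents a box as a (lower-left x, lower-left y, upper-right x, upper-right y) tuple and assumes that the converted Arecibo image occupies a box of 23×73 points with its corner at the origin, which may not match what GraphicsMagick actually writes.

```python
def add_margins(box, margin):
    """Return a box pushed outward by `margin` points on every side."""
    llx, lly, urx, ury = box
    return (llx - margin, lly - margin, urx + margin, ury + margin)


def box_size(box):
    """Width and height of a box."""
    llx, lly, urx, ury = box
    return (urx - llx, ury - lly)


# Assumed bounding box of the converted image, in points.
original = (0, 0, 23, 73)

padded = add_margins(original, 10)   # (-10, -10, 33, 83)
width, height = box_size(padded)     # 43 x 93 points

# The artwork itself is unchanged, so it now fills only part of
# the height of its own page: 73 / 93, or roughly 78 percent.
share = 73 / height
print(padded, (width, height), round(share, 2))
```

Because the watermark is resized as a whole page, the untouched artwork ends up proportionally smaller once the enlarged box is fitted to the target page, which is where the visible margins come from.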

All pretty simple, right? Well, it turns out that there was one final complication — and that, before I was finished, I actually wound up spending more than an hour reading the PDF specification in order to understand what, exactly, was going wrong! But that will be the topic for my last blog post in this series. Stay tuned.

Posted in Computing, Document processing, Python | No Comments »

GraphicsMagick saved the day

I had never heard of GraphicsMagick until yesterday, when I discovered that the venerable, if clunky, ImageMagick suite was ruining one of my customer's print jobs by producing invalid PDF files. This is actually the third major failure that this particular project has encountered because of flaws in standard open-source document tools. In this and my next two blog posts, I will outline the bugs that I have encountered, in the hopes of saving some future reader the time that it took me to track them down.

But I will begin the series rather simply, with the first two lessons that I learned during the project:

1. Always verify PDF correctness with Adobe Acrobat.

The trusty Xpdf viewer, with which I have viewed PDF files for years, turns out to have a remarkable ability to decipher and display even somewhat damaged PDF files. That's a great feature — if someone else has produced the PDF, and you just need to read it, whether it's damaged or not. But this “feature” becomes a problem if you have just produced a PDF, and want to know about any errors in the file before your customer does!

In this situation, Adobe's Acrobat Reader should be your viewer of choice. Not only is it probably the software that your customers will be using anyway, but it is — and this seems intentional on Adobe's part — a very stringent interpreter. The error it displays for a corrupt PDF, I must admit, is among the least-informative I have seen this month:

There was a problem reading this document (14)

But the information that this does yield is invaluable: your customer will not be able to see or print this PDF until you find the bug in your toolchain and fix the result. It is, of course, objectionable and problematic to have to install a closed-source product on what might otherwise be a completely clean development system; but I have found it irreplaceable for its ability to show me problems before my customers see them.

Relying on Xpdf alone cost me more time than you might imagine. The tool chain I built for my customer generates several intermediate PDF files, and it turns out that the error was happening fairly early in the process, so that one of the first files produced, and therefore each subsequent PDF, was invalid and could not be opened with Acrobat. But because I was viewing them all with Xpdf, I spent many minutes looking at the final step in the chain and wondering why that PDF tool was often dying, when the PDFs it was consuming looked so good in my viewer!

2. Avoid ImageMagick 6.3.7 when producing PDFs; try GraphicsMagick instead!

More recent versions of ImageMagick might not have any problem producing PDF files from PNG images, but the ImageMagick that currently ships with Debian unstable, version 6.3.7, seems to have routine problems trying to turn some of my customer's PNG files into valid PDFs. To avoid having to compile, install, and maintain my own version of ImageMagick, I cast around for an alternative, and was startled when Google brought me to the GraphicsMagick project! Here was ImageMagick done right: instead of creating dozens of commands on your system, as though this were the 1970s, GraphicsMagick defines a single gm binary with multiple sub-commands:

$ gm convert watermark.png watermark.pdf

Check out their web site for more great features; but I'm simply happy that the PDFs it has produced so far have been clean, correct, and consistent.
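Returning to the first lesson for a moment: when a toolchain writes several intermediate PDFs, even a crude automated check after each step can point at the stage where things first go wrong. The sketch below is emphatically not a validator and is nowhere near as strict as Acrobat. It only confirms that a file starts with the "%PDF-" header and has an "%%EOF" marker near its end, so it catches empty or truncated output and little else; it might well have missed the ImageMagick problem described above. The function names are hypothetical.

```python
import os


def looks_like_pdf(path, tail_bytes=1024):
    """Cheap structural check: PDF header present, EOF marker near the end."""
    with open(path, 'rb') as f:
        head = f.read(5)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Re-read only the last `tail_bytes` bytes of the file.
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    return head == b'%PDF-' and b'%%EOF' in tail


def first_suspect(paths):
    """Return the first file in pipeline order that fails the check."""
    for path in paths:
        if not looks_like_pdf(path):
            return path
    return None
```

Calling first_suspect() with the intermediate files listed in the order they are produced returns the earliest one that fails, or None if all of them pass. A passing result says very little, so the final word still belongs to a strict viewer.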

A question for my readers: is there a good open-source PDF checker somewhere that is at least as stringent as Adobe Acrobat? Leave a comment below if you have a suggestion; such a tool would have made this project considerably easier!

Posted in Computing, Document processing | 5 Comments »