The Python language and poetry

If programming languages were poets, which poet would the Python language be?

Obviously, Python would be e. e. cummings: the poet for whom whitespace was most truly significant!

Posted in Computing, Python  |  No Comments »

The May 2009 issue of Python Magazine

The May issue of Python Magazine is now online! Obviously things fell a bit behind schedule this month, as I am writing this five days into the month of June. But with all of the excitement surrounding the publisher's flagship annual conference last month — php|tek 2009 in Chicago — it was certainly better that the magazine slip late than that delays interfere with an event which drew a couple of hundred people to learn more about quality software development.

What can you look forward to reading? This issue worked out as a balance between community news and several sophisticated technical articles on using Python better. The technical articles include:

(more...)
Posted in Computing, Python  |  1 Comment »

The April issue of Python Magazine

It was back in the early 1990s, if memory serves me, that the Coca-Cola company (it may have been in one of their annual reports) decided to change their perspective. They declared that the soft drink market, of which they held more than 50%, was simply too small to remain their focus. Rather than wondering what marketing campaign might allow them to chip away a few more percentage points from their competitors, they drew attention to the fact that Coca-Cola products represented only 2–3% of all fluids consumed, worldwide, and started thinking about how to increase that number. They decided, in other words, that their biggest competitor would no longer be Pepsi, but water.

Now, think about the Python programming language, and try the same shift in perspective. Yes, we are all happy that readers of Linux Journal have selected Python as their Favorite Programming Language in this year's “Readers' Choice Awards” and we are pleased when we move up a few more percentage points in something like the TIOBE Programming Community Index. But comparing our market share with that of other programming languages can, too often, fool us into thinking that we are playing a zero-sum game in which our own community can expand only by prying other programmers' fingers off of the languages they currently know and love.

Try something else. Imagine all of the world's people, and ask: what percentage of them use Python? And what might it look like to increase that percentage? You will suddenly think of the grade school down the street, where you could volunteer one afternoon a week teaching Python programming at their computer club. You will realize that the local community college provides a simple IT curriculum, but gives its students no information about getting connected to a local programmers' user group. You will begin to wonder what difference simple Python skills might make in the hands of teenagers, college students, and young families raising children.

You will, in other words, start thinking about how to change the world.

Cover of April 2009 Python Magazine

And the reason that I'm excited about the April issue of Python Magazine is that our cover stories focus on Python programmers who, during the recent high-profile United States elections, made this kind of step outward into realizing what Python could do in their communities.

  • Mitch Trachtenberg has written our feature article, about how his ballot scanning software helped election volunteers and officials in Humboldt County, California, discover that nearly 200 ballots had not been counted by their Premier Election Solutions voting systems. Thanks to his efforts, California has now de-certified that particular software from being used in elections in the state.
  • We also have an interview with Neal McBurnett, whose ElectionAudits software automated the statistical calculations in Boulder, Colorado, that allowed their election audits to focus on close races where more ballots have to be checked before a winner can be certified with confidence.
  • To celebrate the fact that April 22nd was World Plone Day, we feature an Introduction to Plone with instructions for quickly downloading, installing, and customizing this popular Content Management System.
  • Our more technical articles describe how to create your own domain-specific language with a parser that gets invoked when you run import on a file with an extension of your own choosing, and how to create a protocol simulation framework that lets you experiment with how packets will traverse a network. And Mark Mruss offers another Welcome to Python article that this time covers Python dictionaries, whose usefulness and elegant design were certainly a big part of my own adoption of the language.
  • Finally, Steve Holden shares his perspective as head of the Python Software Foundation, this time reflecting on the wonderful success of PyCon 2009 last month.

I hope you'll take a look at this month's issue, and I hope that you learn as much from reading it as I myself learned from the process of editing. Enjoy!

Posted in Computing, Python  |  No Comments »

pyron: Making Python package development DRY to the point of no return

I finally snapped last week.

After years of writing verbose and repetitive setup.py files for my Python packages, I am unable to write another. Instead, I have started writing Pyron, a tool that gathers the same information by inspecting a Python package itself. This means not only that I get to stop repeating myself, but also that my projects will become much more uniform, because package metadata will be represented through common conventions instead of explicit (and repetitive) configuration. Though Pyron is still very primitive, it has already allowed me to reduce simple packages to only a README.txt plus their actual Python source code.

(more...)
Posted in Computing, Python  |  10 Comments »

Applying PDF watermarks upside down

Now that the excitement of PyCon 2009 is over, it is time for me to finish this brief series of blog posts on watermarking PDF files. In the first post I outlined how GraphicsMagick and Adobe Reader proved essential to the project for their ability to produce correct PDF files and then help me verify their correctness. The second post showed how an image can be applied as a watermark using the pdftk PDF Toolkit utility. The resulting watermark, after some margins had been added using a Python script, looked rather attractive:

Figure: Watermarked page. The light blue design is a PDF file that pdftk resizes and centers on the target document.

My last challenge was that, on certain pages, the watermark we were using had to be turned upside down. “Simple,” I thought, “I'll use pdftk to turn the watermark over before applying it!” I just had to process the watermark image with the letter S (“south”), which tells pdftk to rotate the image by 180°, and then use the result as the watermark:

$ pdftk arecibo2.pdf cat 1S output arecibo3.pdf
$ pdftk in.pdf background arecibo3.pdf output wmark3.pdf
Figure: Upside-down watermark. Whoops! After turning the watermark upside down, pdftk lost the ability to properly center it.
(more...)
Posted in Computing, Document processing  |  2 Comments »

My first issue of Python Magazine

My first issue of Python Magazine is out!

After two months of being tutored in the arts of magazine publishing by retiring editor Doug Hellmann, March has been my first month in the Editor-in-Chief chair of Python Magazine. It is exciting to have my first issue come out while I am here at PyCon 2009 in Chicago. I am talking with other programmers, meeting new friends, and, of course, eyeing everyone I meet with the question of whether they might be suitable author or technical editor material.

I am especially excited about our cover story this month! “Commanding Robots with LEGO Mindstorms” shows how simple it is for a Python program to manipulate both a binary on-the-wire protocol and binary calls into a Windows DLL, all without ever having to leave the Python standard library! For me, this is the real magic of Python: not only that it introduced an incomparably clean syntax and tidy language feature set, but also that developers of both the standard library and of third-party Python modules are committed to creating remarkably simple interfaces for what in other languages can be very difficult problems. The article is a great guide to using best practices and powerful tools when linking Python to other protocols and libraries.

Other articles include “Getting Started with Message Queues” which talks about how to arrange your application around a central queue so that you can distribute expensive work across dozens of machines; “Statically Analyzing Python Code” by the author of PySmell about how Python's built-in code parsing tools can be used to start investigating powerful ideas like type inferencing; and “Using Python for Pedigree Analysis” that is yet another success story from the world of science, about how Python — which began to be adopted very early in its history by working scientists — continues to move into new areas wherever science needs a clean and powerful language to replace the tangle of low-level code and temporary scripts that traditionally characterize systems written by those whose first expertise is not software design.

Throw in our several regular columns and my own monthly editorial, and you have a complete issue. (I think it was while writing the editorial that it really sank in that I am now an Editor-in-Chief!) I hope you will consider subscribing, and I especially hope you will subscribe to the print edition — for only an extra dollar a month, you will receive an actual, physical artifact that you can leave in the breakroom at work, share with a friend who wants to explore Python, or leave at a client's site to expose their own employees to the world of Python and its community. And, say hello to me here at PyCon!

Posted in Uncategorized  |  2 Comments »

Adding margins to PDF watermarks

This is the second article in my series on adding “watermark” images, which sit behind any text and graphics that were already on the page, to PDF files. Last week I outlined the first two lessons that I learned while developing this watermark process: first, always use Adobe Acrobat to verify that you are creating valid PDFs in your toolchain, and second, the version of GraphicsMagick that currently comes with Debian unstable produces better PDF files than the version of ImageMagick that Debian ships.

Then I digressed with a blog entry on a slightly different topic, nested list comprehensions in Python, because I happened to write one while creating the image we will use as our sample watermark. It shows the famous Arecibo space message, and is a tiny image of only 23×73 pixels that looks like this when enlarged:

Arecibo message

The basic watermarking process itself is very simple thanks to a wonderful tool that I discovered called pdftk (short for “PDF toolkit”) which, as usual, Debian has already packaged for me. It can rotate documents, extract pages, concatenate several files together, and help fill out PDF forms from data in a file. Of particular interest here is its ability to either “stamp” an image on top of each page of a document, or to place one in the background as a watermark.

The watermark image itself has to be a PDF file — pdftk does not deal in any other file formats — which is why I needed GraphicsMagick to convert the Arecibo image into a PDF in the first place. Putting the two steps together, one has a primitive but workable process for using a PNG image as a watermark:

$ gm convert arecibo.png arecibo.pdf
$ pdftk in.pdf background arecibo.pdf output wmark1.pdf
Figure: Hefty watermark. A first attempt at watermarking results in a huge watermark that reaches both to the top and bottom edges of the page.

As you can see, pdftk automatically adjusts the size of the watermark image to reach precisely to the edges of the page being marked — which is a huge favor given the difficulty I would have had in resizing the watermark myself to match the page size of the input file. But, in the above case, the result seems less than perfectly attractive; watermarks usually sit tidily near the center of a page, rather than running all the way against its edges.
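The two shell commands above can also be driven from a script. What follows is only a minimal sketch: the function names and the watermark_pdf parameter are made up for illustration, and it assumes that both gm and pdftk are installed and on the PATH. The command lines themselves are the ones shown above.

```python
import subprocess


def watermark_commands(image, pdf_in, pdf_out, watermark_pdf):
    """Return the two command lines that turn an image into a PDF watermark.

    All four arguments are file names chosen by the caller; watermark_pdf
    is the intermediate PDF that pdftk will use as the background.
    """
    return [
        # Step 1: GraphicsMagick converts the image into a PDF.
        ["gm", "convert", image, watermark_pdf],
        # Step 2: pdftk places that PDF behind the pages of the input.
        ["pdftk", pdf_in, "background", watermark_pdf, "output", pdf_out],
    ]


def apply_watermark(image, pdf_in, pdf_out, watermark_pdf):
    """Run both steps in order, stopping at the first one that fails."""
    for command in watermark_commands(image, pdf_in, pdf_out, watermark_pdf):
        # check_call raises CalledProcessError on a non-zero exit status.
        subprocess.check_call(command)
```

Calling apply_watermark('arecibo.png', 'in.pdf', 'wmark1.pdf', 'arecibo.pdf') would then run the same two steps as the shell session. Keeping the command lists in their own function makes them easy to inspect without actually running either tool.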

Clearly, we want to add some margins to the watermark. And though margins are easy to add to some image formats — they would be simple to add to the arecibo.png file that we are using in this example — in actual practice I need to support watermarks that might be in vector formats like SVG or EPS. While I could go through each possible input format and contrive some way of adjusting its margins, it would obviously be much more convenient to convert everything to PDF first, and then add margins directly to the PDFs.

I used Debian's apt-cache search command to look for additional tools that might help me (which is how I found pdftk in the first place!) and found an old command called pdfcrop that was part of the texlive series of packages; it supports a --margins option with which whitespace can be added around a PDF file. But I found that it often would refuse to process a perfectly good PDF file with a horribly uninformative error message like:

Error: Cannot move `tmp-pdfcrop-10631.pdf' to `out.pdf'!

I tried to investigate the error message, but discovered that pdfcrop is actually a Perl script that writes LaTeX macros which are then run against the target PDF file. And it was last updated in 2004. I have, alas, elected not to make it part of my toolchain.

Then I discovered that Python itself has a quite serviceable PDF package named pyPdf, with the bonus that it is written in pure Python and therefore requires no external libraries! Thanks to its ability to adjust the “bounding box” that defines the edges of an image in PDF coordinates, adding margins was as simple as loading the image, doing some addition and subtraction, and then saving the result. To add modest 10-point margins to the Arecibo message, for example, we can create this wmargins.py script:

# wmargins.py: add 10-point margins around the Arecibo watermark
from pyPdf import PdfFileWriter, PdfFileReader

pdf = PdfFileReader(open('arecibo.pdf', 'rb'))
p = pdf.getPage(0)
# Push each of the page's boxes outward by 10 points on every side.
for box in (p.mediaBox, p.cropBox, p.bleedBox, p.trimBox, p.artBox):
    box.lowerLeft = (box.getLowerLeft_x() - 10,
                     box.getLowerLeft_y() - 10)
    box.upperRight = (box.getUpperRight_x() + 10,
                      box.getUpperRight_y() + 10)
# Write the adjusted page out as a new one-page PDF.
output = PdfFileWriter()
output.addPage(p)
output.write(open('arecibo2.pdf', 'wb'))

You can test this yourself by installing pyPdf in a convenient temporary directory with virtualenv, running the above script, then calling pdftk on the result:

$ virtualenv vpython
$ vpython/bin/easy_install pyPdf
$ vpython/bin/python wmargins.py
$ pdftk in.pdf background arecibo2.pdf output wmark2.pdf
Figure: Watermark with margins. Margins prevent the watermark from reaching the page edges, which allows the blocks of text to assume the role of defining the visual shape of the page.
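The addition and subtraction that wmargins.py performs on each box can be tried out without pyPdf at all. This is a sketch only: the add_margins and size helpers are hypothetical, and the 23×73 point starting box is an assumption chosen to match the pixel dimensions of the Arecibo image, not a value read from the real arecibo.pdf.

```python
def add_margins(box, margin):
    """Return box grown outward by margin points on every side.

    box is (lower_left_x, lower_left_y, upper_right_x, upper_right_y).
    """
    llx, lly, urx, ury = box
    return (llx - margin, lly - margin, urx + margin, ury + margin)


def size(box):
    """Return the (width, height) of a box."""
    llx, lly, urx, ury = box
    return (urx - llx, ury - lly)


# Assumed box for illustration: a 23 x 73 point image at the origin.
original = (0, 0, 23, 73)
padded = add_margins(original, 10)
print(padded)        # (-10, -10, 33, 83)
print(size(padded))  # (43, 93)
```

Because pdftk scales the whole box to fit the page, enlarging the box while leaving the artwork alone is what makes the artwork end up smaller on the page: in this assumed case it fills 23 of the box's 43 points of width.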

All pretty simple, right? Well, it turns out that there was one final complication — and that, before I was finished, I actually wound up spending more than an hour reading the PDF specification in order to understand what, exactly, was going wrong! But that will be the topic for my last blog post in this series. Stay tuned.

Posted in Computing, Document processing, Python  |  No Comments »

I finally understand nested comprehensions

This entire blog post can be summarized in the words of Guido himself that I have just discovered down at the bottom of PEP 202 (“List Comprehensions”):

The form [... for x... for y...] nests, with the last index varying fastest, just like nested for loops.

Have you ever seen a Python list comprehension like that, with two or more for loops inside? I have just written my first one! It was only recently that I discovered they were even possible, when I encountered several in a draft of the upcoming Natural Language Processing with Python book. (Which should be great — watch for O'Reilly to publish it!) They almost never turn up in other code that I encounter, and perhaps for good reason: they were deeply confusing the first time I saw them!

The code I have just written is shown below. It uses the Python Imaging Library to produce an image I will use in the series of blog posts that I started yesterday on watermarking PDF files. The code requires a small arecibo.txt file, detailing the radio message that was sent from the Arecibo Observatory in November 1974 to any other civilizations that might be listening. As you can see, I have successfully used two for clauses in the list comprehension that generates the image's pixels:

"""Draw the Arecibo message (blue on transparent)."""
from PIL import Image
image = Image.new("RGBA", (23, 73))
image.putdata([
    (192,224,255,255) if char == '1' else (0,0,0,0)
    for line in open('arecibo.txt')
    for char in line.strip()
    ])
image.save('arecibo.png')

Each pixel is a four-value tuple, by the way, because an RGBA image not only has a red, green, and blue channel for each pixel, but also an “A channel” specifying its opacity or transparency. The colors in use here are a completely opaque light blue, and a completely transparent color (the four zeros). The result looks something like:

Arecibo message
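To see the pixel tuples without installing the Python Imaging Library or finding a copy of arecibo.txt, the same comprehension can be run against a few made-up rows. This is a sketch under those assumptions: the sample rows, the constant names, and the pixels_from_rows helper are all invented for illustration.

```python
OPAQUE_BLUE = (192, 224, 255, 255)  # alpha 255: completely opaque
TRANSPARENT = (0, 0, 0, 0)          # alpha 0: completely transparent


def pixels_from_rows(rows):
    """Flatten rows of '0' and '1' characters into a list of RGBA tuples."""
    return [
        OPAQUE_BLUE if char == '1' else TRANSPARENT
        for row in rows
        for char in row.strip()
    ]


# A made-up 3 x 2 bitmap standing in for the lines of arecibo.txt.
sample = ["101\n", "010\n"]
pixels = pixels_from_rows(sample)
print(len(pixels))   # 6
print(pixels[:3] == [OPAQUE_BLUE, TRANSPARENT, OPAQUE_BLUE])  # True
```

The pixels come out one full row after another, which is the order in which putdata fills an image: left to right along the top row, then on to the next row down.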

My mistake in reading the multiple for clauses was that, old C-language programmer that I am, I was expecting the comprehension structure to be concentric. That is, I thought that the last for must “enclose” the ones above it, creating a mess of lists inside of lists inside of lists. But it turns out that they are much simpler to read than that. Just read them like normal loops, with the “big loop” described first and the subsequent loops nested inside of it:

# The list comprehension said:
result = [ expression
           for line in open('arecibo.txt')
           for char in line.strip() ]

# It therefore meant:
result = []
for line in open('arecibo.txt'):
    for char in line.strip():
        result.append(expression)

So, to read the comprehension, just picture colons appended to each for clause and, finally, the expression moved down inside of the innermost for loop.
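The equivalence is easy to confirm on a small example. The sketch below uses two invented strings in place of the file, and also shows the “concentric” form for contrast, since that one really does build lists inside of lists.

```python
rows = ["ab", "cd"]  # made-up data standing in for the lines of a file

# Comprehension form: the first "for" is the outer loop.
flat = [char.upper() for row in rows for char in row]

# The same thing as nested loops, with the expression moved innermost.
result = []
for row in rows:
    for char in row:
        result.append(char.upper())

print(flat)            # ['A', 'B', 'C', 'D']
print(flat == result)  # True

# By contrast, a comprehension inside a comprehension is concentric:
# the inner one runs once per row and produces a list of lists.
nested = [[char.upper() for char in row] for row in rows]
print(nested)          # [['A', 'B'], ['C', 'D']]
```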

Now that I have made this conceptual leap, I can “picture” the normal for loops each time I see a complicated list comprehension, and they are trivial to read and write! It still, I admit, feels odd that the expression, which would be deep inside of normal for loops, goes in front of them in a comprehension instead. And I am not sure that double comprehensions should become part of my normal coding style. (How many other Python programmers understand them? Has everyone else been using them without problems?) But they are a neat trick to have up my sleeve when I need to iterate over an image quickly and want to pack everything into a single, easily-bloggable expression.

Posted in Computing, Python  |  6 Comments »

GraphicsMagick saved the day

I had never heard of GraphicsMagick until yesterday, when I discovered that the venerable, if clunky, ImageMagick suite was ruining one of my customer's print jobs by producing invalid PDF files. This is actually the third major failure that this particular project has encountered because of flaws in standard open-source document tools. In this and my next two blog posts, I will outline the bugs that I have encountered, in the hopes of saving some future reader the time that it took me to track them down.

But I will begin the series rather simply, with the first two lessons that I learned during the project:

1. Always verify PDF correctness with Adobe Acrobat.

The trusty Xpdf viewer, with which I have viewed PDF files for years, turns out to have a remarkable ability to decipher and display even somewhat damaged PDF files. That's a great feature — if someone else has produced the PDF, and you just need to read it, whether it's damaged or not. But this “feature” becomes a problem if you have just produced a PDF, and want to know about any errors in the file before your customer does!

In this situation, Adobe's Acrobat Reader should be your viewer of choice. Not only is it probably the software that your customers will be using anyway, but it is — and this seems intentional on Adobe's part — a very stringent interpreter. The error it displays for a corrupt PDF, I must admit, is among the least-informative I have seen this month:

There was a problem reading this document (14)

But the information that this does yield is invaluable: your customer will not be able to see or print this PDF until you find the bug in your toolchain and fix the result. It is, of course, objectionable and problematic to have to install a closed-source product on what might otherwise be a completely clean development system; but I have found it irreplaceable for its ability to show me problems before my customers see them.

Relying on a forgiving viewer cost me more time than you might imagine. The tool chain I built for my customer generates several intermediate PDF files, and it turns out that the error was happening fairly early in the process, so that one of the first files produced, and therefore each subsequent PDF, was invalid and could not be opened with Acrobat. But because I was viewing them all with Xpdf, I spent many minutes looking at the final step in the chain and wondering why that PDF tool was often dying, when the PDFs it was consuming looked so good in my viewer!
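One habit this suggests is checking every intermediate file in order, rather than only the last one. The sketch below is a deliberately crude smoke test and is no substitute for opening the file in Acrobat: it looks only for the %PDF- header at the start of the data and the %%EOF marker near the end, so a file can pass it and still be invalid. The function names and the 1024-byte window are assumptions made for illustration.

```python
def looks_like_pdf(data):
    """Crude check: PDF header at the start, %%EOF marker near the end.

    data is the raw bytes of the file. Passing this check does not mean
    that the file is valid, only that it is not obviously truncated or
    of the wrong type.
    """
    has_header = data.startswith(b"%PDF-")
    has_trailer = b"%%EOF" in data[-1024:]
    return has_header and has_trailer


def first_suspect(paths):
    """Return the first path whose contents fail the check, or None."""
    for path in paths:
        with open(path, "rb") as f:
            if not looks_like_pdf(f.read()):
                return path
    return None
```

Passing the intermediate files to first_suspect in the order the toolchain produces them points at the earliest stage that needs a closer look in a stricter viewer.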

2. Avoid ImageMagick 6.3.7 when producing PDFs; try GraphicsMagick instead!

More recent versions of ImageMagick might not have any problem producing PDF files from PNG images, but the ImageMagick that currently ships with Debian unstable, version 6.3.7, seems to have routine problems trying to turn some of my customer's PNG files into valid PDFs. To avoid having to compile, install, and maintain my own version of ImageMagick, I cast around for an alternative, and was startled when Google brought me to the GraphicsMagick project! Here was ImageMagick done right: instead of creating dozens of commands on your system, as though this were the 1970s, GraphicsMagick defines a single gm binary with multiple sub-commands:

$ gm convert watermark.png watermark.pdf

Check out their web site for more great features; but I'm simply happy that the PDFs it has produced so far have been clean, correct, and consistent.

A question for my readers: can a good, open-source PDF checker be found somewhere, that is at least as stringent as Adobe Acrobat? Leave a comment below if you have a suggestion; such a tool would have made this project considerably easier!

Posted in Computing, Document processing  |  4 Comments »

Installing Python packages for Emacs with virtualenv

The only rough edge I have found amidst the otherwise exceptional advice on Ryan McGuire’s Enigma Curry blog is that Ryan recommends installing Python packages with:

$ sudo easy_install package_url

This means that his Emacs configuration — which, very generously, he has started maintaining as a project on github so that other people can use it themselves, or branch their own versions — requires root access merely to install.

Like Ryan, I also keep my Emacs configuration under version control, so that improvements I check in from one account are easy to check out into all of my other accounts. Although my setup is probably too simple to be interesting as a public project, there is one aspect of it that I should share: unlike Ryan, I use the advanced technology of a virtualenv to hold the Python packages that Emacs needs. The virtual environment lives under my own account, and is easy to create, access, and rebuild, even in the absence of root privileges on a particular machine. Even better, packages that I install or upgrade inside of the virtual environment cannot interfere with Python programs running elsewhere on the system.

(more…)

Posted in Computing, Emacs, Python  |  4 Comments »