The terrible magic of setuptools

I am providing a bit of assistance to the wonderful Natural Language Toolkit Project, who have implemented a wide array of language processing algorithms in Python atop a common set of very sleek — and cleanly Pythonic — data structures for representing natural language. If you are at all interested, check out their recent book, Natural Language Processing with Python, which does a great job of showing how the NLTK works at the same time as it explains the computer science concepts of language processing.

The NLTK project wants to support installing their package as a Python egg, so they asked me to tidy up their setup.py file and prepare everything for distribution via the Python Package Index.

As usual, my desire for simple and reproducible behavior when distributing Python packages has run aground on the tangled magic for which setuptools is so justly famous.

The NLTK package's original setup.py, powered by the standard distutils, produced a tidy source tar.gz of around 850 KB. My fancy egg-capable version, on the other hand, wound up disgorging a source archive of no less than 24.7 MB in size. The NLTK people added a note to our open issue that politely asked me why their source distribution was now thirty times its original size without the package's MANIFEST.in file having been changed in the least.

I recognized the problem immediately, of course, as will many of you who have transitioned a project from the distutils to setuptools before. In order to make our lives so vastly grand and convenient, setuptools includes in a source distribution not only all of the files described in the package's manifest, but also every file in the project tree that is currently checked into Subversion! This introduces a thicket of potential problems: a project started without version control will suddenly behave differently when checked into Subversion; files disappear from the source distribution if you upgrade from Subversion to a version control system of which setuptools is ignorant; and, of course, any attempt to build your project from a clean svn export can result in a source archive with hundreds of files missing.

Since the NLTK has been distributed successfully for years using only distutils and a well-written project manifest, I simply wanted to turn off the magic. It would be possible, I suppose, to instruct packagers to run svn export before running the sdist command, but it offends good sense that you should have to check out a project in some special way just to get predictable behavior. The whole point of version control is that it's supposed to be metadata that keeps track of history while having no effect on how an application builds and runs.

Having been bitten by this setuptools feature several times in the past, I decided that, this time, I would actually learn how to disable the behavior instead of trying to work around it yet again. “Where,” I thought, “is the switch? Where is the option?” After an exhausting search of the setuptools documentation and then, finally, its source code, I am perplexed to report that the only recourse left to us seems to be violence! You must actually damage setuptools before running the setup() function if you want to prevent it from trying to muck about with Subversion. The cleanest attack on the problem — or should we say the “cleanest surgery?” — seems to be:

# Add this to setup.py before you call setup()

from setuptools.command import sdist
del sdist.finders[:]

With this change, setuptools peacefully ignores whether the project happens to have been checked out of version control or not, and with a bit of tweaking the same source distribution can be produced as would have been generated by the distutils.
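Since sdist.finders is an internal detail rather than a documented switch, it seems prudent to assume that it may not exist in every setuptools release (that is my assumption, not something the documentation states). A slightly more defensive sketch of the same surgery, using a hypothetical helper named clear_file_finders, might look like this:

```python
def clear_file_finders(sdist_module):
    """Empty sdist_module.finders in place if it exists.

    Returns True if a finders list was found and cleared, False otherwise.
    The function name is hypothetical; only the `finders` attribute comes
    from the snippet above.
    """
    finders = getattr(sdist_module, "finders", None)
    if finders is None:
        # Assumption: some setuptools versions may not expose this list.
        return False
    del finders[:]  # mutate the shared list so setuptools sees the change
    return True


# In setup.py, before calling setup():
#
#     from setuptools.command import sdist
#     clear_file_finders(sdist)
```

The return value lets setup.py print a warning when the list is missing, instead of dying with an AttributeError.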

Now, at this point, you probably think that this blog post is shaping up to be a complaint against how setuptools pays attention to Subversion while providing no way to turn this intrusive feature off. But, in fact, that is all really just a side issue: it was a minor adventure that I wanted to share on the way toward my main point. I am not even sure whether the remedy suggested above is a good idea; I have asked the question on Stack Overflow, so we can see if someone else figures out a better approach.

The actual complaint of this blog post is that setuptools displays a troublesome, undocumented, and downright obfuscatory hysteresis with respect to the list of files that gets included in a source package. The fact is, after I discovered the sdist.finders trick quoted above, I actually thought it didn't work because the next source archive I created was exactly the same size as before.

The problem? It turns out that setuptools will choose to keep including a given file in your source archive, even if you have removed it from your manifest, until you force setuptools to rebuild the package's file list by destroying its SOURCES.txt file:

$ python setup.py sdist
$ wc -c < dist/nltk-2.0b4.tar.gz
25648109
$ rm nltk.egg-info/SOURCES.txt
$ python setup.py sdist
$ wc -c < dist/nltk-2.0b4.tar.gz
868517

Only with this removal will files that are no longer listed in the manifest, or governed by version control, quietly and finally drop out of the source archives you are creating. That is my complaint: that setuptools refuses to remove files from your source distribution unless you know to go remove a magic file from under its nose.
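Archive size is only a rough signal, so it can help to look at the file list itself. Here is a minimal sketch, using only the standard tarfile module, that lists the regular files inside a source archive so they can be compared against the manifest by eye or with a set difference. The helper name sdist_members is hypothetical, and the sketch assumes the usual sdist layout, with everything under a single top-level directory such as nltk-2.0b4/.

```python
import tarfile


def sdist_members(archive_path):
    """Return the regular files in an sdist tarball, sorted,
    with the top-level directory component stripped."""
    names = []
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and so on
            # Entries look like "nltk-2.0b4/setup.py"; drop the first part.
            _, _, relative = member.name.partition("/")
            names.append(relative or member.name)
    return sorted(names)


# Example use after running "python setup.py sdist":
#
#     for name in sdist_members("dist/nltk-2.0b4.tar.gz"):
#         print(name)
```

Running this before and after removing SOURCES.txt should show which files were being carried along from the stale list.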

As I now go to close out that NLTK ticket, should I explain that packagers always need to manually remove SOURCES.txt before preparing an official source distribution for upload to the Python Package Index? Or should I be brave and add an os.remove() call to setup.py myself that always destroys the file before the setup() call gets underway? Is there any danger to that approach? Leave a comment and let me know!
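For the sake of discussion, here is a minimal sketch of what that second option might look like. The helper name remove_stale_sources is hypothetical, and the sketch assumes the egg-info directory sits next to setup.py under the name nltk.egg-info, as in the transcript above. Whether this is actually safe is exactly the question I am asking.

```python
import os


def remove_stale_sources(egg_info_dir):
    """Delete SOURCES.txt from egg_info_dir if it is present.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    sources = os.path.join(egg_info_dir, "SOURCES.txt")
    try:
        os.remove(sources)
    except FileNotFoundError:
        # A fresh checkout has no SOURCES.txt yet; that is fine.
        return False
    return True


# In setup.py this would run before setup() is called, for example:
#
#     remove_stale_sources("nltk.egg-info")
#     setup(...)
```

Catching the missing-file case means the same setup.py works both on a fresh checkout and on a tree where sdist has already been run.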

Posted: Wednesday, July 15th, 2009 at 12:59 am
Categories: Computing, Python


  • Rene Dudfield

    Do you need to use setuptools to be on pypi? No.

    Maybe you don’t need to create eggs at all. Just upload your tar.gz and .zip?

  • Jason R. Coombs

    I’ve seen some packagers explicitly call egg_info before calling sdist in their releases (http://elixir.ematia.de/svn/elixir/trunk/release.howto). I suspect that will regenerate the sources.txt file.

  • Mehmet Ali

    As an nltk user & developer I’d like to see nltk and its modules like wordnet in the python package index. Good luck

  • Orestis Markou

    I just got bit by this exact same problem, because setuptools 0.6c8 doesn’t understand SVN 1.5 entries format. Fortunately updating to 0.6c9 fixed the issue. However, it was still cryptic to debug, and probably totally unnecessary as well.

  • FantasyPants

    I know this is not constructive, but it has to be said:

    setuptools is the cancer of python

  • Paul Moore

    Yet another project ends up using setuptools for unclear reasons, and hits trouble as a result. (Note: “to allow them to distribute as eggs” isn’t a clear reason – why do they want to distribute as eggs?)

    I could cry :-(

    I just hope they distribute eggs in addition to the current formats. If it’s instead of them, then I will cry.

  • Brandon Craig Rhodes

    Rene and Paul

    The reason to move to eggs is that the NLTK project depends on YAML, which they had previously been redistributing as part of their own source code distribution. But with eggs and setuptools, because they support dependencies, they could stop including another product in their own source tree and instead simply depend on the version of YAML already available on PyPI.

  • Paul Boddie

    I hope NLTK is still going to support vanilla distutils. There are plenty of ways of dealing with dependencies and setuptools is only one of them (and not the most appropriate for many of us, either).

  • Francis Ridder

    Not to troll or start a possibly nasty discussion, but has anyone tried to write an alternative to setuptools? I simply do not know the full history.

  • Jackieboy

    No offense to the masterminds of the egg thing, but what problem is it trying to solve again?

  • Martijn Faassen

    Note in case it’s unclear: it’s possible to generate just tarballs using setuptools and only upload those to PyPI. In fact I’d recommend doing that instead of uploading eggs, unless you’re distributing binaries to platforms that don’t have a C-compiler installed (Windows and perhaps Mac OS X).

    These will still have a setuptools dependency of course.

    I can’t live without setuptools myself: it can do a lot of very very useful things. It’s also annoyingly magic and obscure in places. Perhaps most importantly it’s my impression the setuptools project isn’t as community-driven as it could be, despite ample interest in the community to help improve it.

    In the longer term I hope that Tarek Ziade and others will improve the Python packaging situation. One useful step is to separate out metadata from packaging tool, opening the space to write more packaging tools. The whole distutils pattern requiring a ‘setup.py’ at all leads to problems.

    To conclude: setuptools solves a lot of my packaging worries, but the situation isn’t perfect.

    See also this blog entry for some useful ideas:

  • Robert Redburn

    I just encountered the same frustrating problem checking a setuptools based project into svn. Until I found this post, I had no idea why my find_packages(exclude=[...]) stopped working. I assumed an interaction with svn and hunted for a disable everywhere — hours wasted for naught. I’ll try the suggested surgery, thanks.
