I am providing a bit of assistance to the wonderful Natural Language Toolkit Project, who have implemented a wide array of language processing algorithms in Python atop a common set of very sleek — and cleanly Pythonic — data structures for representing natural language. If you are at all interested, check out their recent book, Natural Language Processing with Python, which does a great job of showing how the NLTK works at the same time as it explains the computer science concepts of language processing.
The NLTK project wants to support installing their package as a Python egg, so they asked me to tidy up their setup.py file and prepare everything for distribution via the Python Package Index.
As usual, my desire for simple and reproducible behavior when distributing Python packages has run aground on the tangled magic for which setuptools is so justly famous.
The NLTK package's original setup.py, powered by the standard distutils, produced a tidy source tar.gz of around 850 KB. My fancy egg-capable version, on the other hand, wound up disgorging a source archive of no less than 24.7 MB in size. The NLTK people added a note to our open issue that politely asked me why their source distribution was now thirty times its original size without the package's MANIFEST.in file having been changed in the least.
I recognized the problem immediately, of course, as will many of you who have transitioned a project from the distutils to setuptools before. In order to make our lives so vastly grand and convenient, setuptools includes in a source distribution not only all of the files described in the package's manifest, but also every file in the area that is currently checked into Subversion! This introduces a thicket of potential problems: a project started without version control will suddenly behave differently when checked into Subversion; files disappear from the source distribution if you upgrade from Subversion to a version control system of which setuptools is ignorant; and, of course, any attempt to build your project from a clean svn export can result in a source archive with hundreds of files missing.
Since the NLTK has been distributed successfully for years using only distutils and a well-written project manifest, I simply wanted to turn off the magic. It would be possible, I suppose, to instruct packagers to run svn export before running the sdist command, but it offends good sense that you should have to check out a project in some special way just to get predictable behavior. The whole point of version control is that it is supposed to be metadata that keeps track of history while having no effect on how an application builds and runs.
Having been bitten by this setuptools feature several times in the past, I decided that, this time, I would actually learn how to disable the behavior instead of trying to work around it yet again. “Where,” I thought, “is the switch? Where is the option?” After an exhausting search of the setuptools documentation and then, finally, its source code, I am perplexed to report that the only recourse left to us seems to be violence! You must actually damage setuptools before running the setup() function if you want to prevent it from trying to muck about with Subversion. The cleanest attack on the problem (or should we say the “cleanest surgery”?) seems to be:
    #!python
    # Add this to setup.py before you call setup()
    from setuptools.command import sdist
    del sdist.finders[:]
With this change, setuptools peacefully ignores whether the project happens to have been checked out of version control or not, and with a bit of tweaking the same source distribution can be produced as would have been generated by the distutils.
Now, at this point, you probably think that this blog post is shaping up to be a complaint against how setuptools pays attention to Subversion while providing no way to turn this intrusive feature off. But, in fact, that is all really just a side issue: it was a minor adventure that I wanted to share on the way toward my main point. I am not even sure whether the remedy suggested above is a good idea; I have asked the question on Stack Overflow, so we can see if someone else figures out a better approach.
The actual complaint of this blog post is that setuptools displays a troublesome, undocumented, and downright obfuscatory hysteresis with respect to the list of files that gets included in a source package. The fact is, after I discovered the sdist.finders trick quoted above, I actually thought it didn't work because the next source archive I created was exactly the same size as before.
The problem? It turns out that setuptools will choose to keep including a given file in your source archive, even if you have removed it from your manifest, until you force setuptools to rebuild the package's file list by destroying its SOURCES.txt file:
    $ python setup.py sdist
    $ wc -c dist/nltk-2.0b4.tar.gz
    25648109
    $ rm nltk.egg-info/SOURCES.txt
    $ python setup.py sdist
    $ wc -c dist/nltk-2.0b4.tar.gz
    868517
Only with this removal will files that are no longer listed in the manifest, or governed by version control, quietly and finally drop out of the source archives you are creating. That is my complaint: that setuptools refuses to remove files from your source distribution unless you know to go remove a magic file from under its nose.
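Byte counts are a blunt instrument for this kind of check. If you would rather see exactly which files made it into an archive, listing the tarball's members is more direct. Here is a minimal sketch using the standard tarfile module; the helper name list_sdist_files is my own invention for illustration and is not part of setuptools or distutils.

```python
import tarfile


def list_sdist_files(archive_path):
    """Return the sorted names of the regular files inside a .tar.gz archive."""
    with tarfile.open(archive_path, 'r:gz') as tar:
        # Skip directory entries so that only real files are reported.
        return sorted(m.name for m in tar.getmembers() if m.isfile())
```

Running it against the archives built before and after removing SOURCES.txt, and comparing the two lists, should show which files finally dropped out.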
As I now go to close out that NLTK ticket, should I explain that packagers always need to manually remove SOURCES.txt before preparing an official source distribution for upload to the Python Package Index? Or should I be brave and add an os.remove() call to setup.py myself that always destroys the file before the setup() call gets underway? Is there any danger to that approach? Leave a comment and let me know!
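For the sake of discussion, the os.remove() variant might look something like the sketch below. The helper name remove_stale_sources is hypothetical, and the nltk.egg-info directory name is simply the one from the shell session above. Whether deleting the file on every run is actually safe is exactly the open question.

```python
import os


def remove_stale_sources(egg_info_dir):
    """Delete SOURCES.txt under egg_info_dir, if present.

    Returns True if a file was removed and False if there was nothing to do.
    """
    path = os.path.join(egg_info_dir, 'SOURCES.txt')
    if os.path.exists(path):
        os.remove(path)
        return True
    return False


# Assumption: the egg-info directory sits next to setup.py under this name.
remove_stale_sources('nltk.egg-info')
# ... then call setup(...) as usual
```

Checking for the file first means a fresh checkout, where no SOURCES.txt exists yet, does not raise an error.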