14 May 2010
One of the great recent advances in the Python Standard Library is the addition of the multiprocessing module, maintained by Jesse Noller, who has also blogged and written about several other concurrency approaches for Python: Kamaelia, Circuits, and Stackless Python.
I have wanted to try the multiprocessing module out for some time, and now have a consulting project that will really benefit from multiple processes: they will let our application run third-party plugins without having to worry that any bugs or indiscretions which they commit might damage or hang our main server, which can remain safe in another process.
First, one can only stand in awe at the achievement — and the amount of work — that the multiprocessing module represents. I cannot imagine the time that it would have taken our team to figure out all of the differences between Linux and Windows when it comes to processes, shared memory, and concurrency mechanisms. In fact, the approach we are taking might not even have been feasible under those circumstances. By figuring out how to get locks, queues, and shared data structures all working cleanly on such different architectures, the multiprocessing authors save Python programmers out on the street like me from reinventing a dozen wheels when we need to support multi-platform concurrency.
There is one rather startling difference which the multiprocessing module does not hide: the fact that while every Windows process must spin up independently of the parent process that created it, Linux supports the fork(2) system call that creates a child process already in possession of exactly the same resources as its parent: every data structure, open file, and database connection that existed in the parent process is still sitting there, open and ready to use, in the child. Consider this small program:
from multiprocessing import Process

f = None

def child():
    print(f)  # show whatever the global f holds in this process

if __name__ == '__main__':
    f = open('mp.py', 'r')
    p = Process(target=child)
    p.start()
    p.join()
On Linux, the open file f keeps its value in the child process; the child has inherited an open connection from its parent:
$ python mp.py
<open file 'mp.py', mode 'r' at 0xb7734ac8>
Under Windows, however, where the multiprocessing module has to spawn a fresh copy of the Python interpreter and give it special instructions to run just the function child(), the module is a clean slate without an open file inside, and the child prints only the initial value None.
Now, my complaint is not exactly that the multiprocessing documentation is misleading on this point; under its section on Programming guidelines, it makes it quite clear that:
On Unix a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process.
I have no quarrel with this advice; if I am careful to pass everything the child needs in its list of arguments, then I can be sure that my code will work under both Linux and Windows.
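As a minimal sketch of that advice, the child below receives everything it needs through its argument list and opens the file itself instead of reading a global set by the parent. The temporary file, the results queue, and the 30-second timeout are assumptions made for illustration; they are not part of the original mp.py.

```python
import os
import tempfile
from multiprocessing import Process, Queue


def child(path, results):
    # Everything the child needs arrives as an argument: it opens the
    # file itself rather than relying on state inherited from the parent.
    with open(path, 'r') as f:
        results.put(f.readline().rstrip())


if __name__ == '__main__':
    # Hypothetical input file, created only for this demonstration.
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
        tmp.write('hello from the parent\n')

    results = Queue()
    p = Process(target=child, args=(tmp.name, results))
    p.start()
    print(results.get(timeout=30))  # read the result before joining
    p.join()
    os.remove(tmp.name)
```

Passing the path rather than the open file object is deliberate: an open file cannot be pickled and sent to a freshly spawned interpreter, while a path string can.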
But I do wish that the multiprocessing module provided more support for testing this condition more rigorously under Linux. In particular, I wish that there were some way of turning the simple forking logic off — of saying, “Yes, I know that Linux will let you create a child process very simply using fork(2), but for my sanity would you please create the child process from scratch like you do under Windows so that I can test whether my code accidentally depends on residual state from the parent process that I did not see that I was using?” I looked at the multiprocessing "forking.py" module to see whether I could turn on the Windows-style process spawning even from inside of Linux, but the mechanism is chosen by a bare module-level check of "sys.platform" and if I overwrite that variable with 'win32' the code then dies when it tries to import "msvcrt" which is available only under Windows.
There is, thus, even in principle, no way that I can test my multiprocessing application under Linux which will give me any assurance that my child processes are not accidentally taking advantage of data structures and open connections left lying around by the parent process; only by actually moving over to Windows itself can I see how my child code really behaves on its own. I have created a feature request in the Python bug tracker to see whether this situation can be improved.
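Readers on Python 3.4 or later, releases that postdate this post, can select the start method explicitly: multiprocessing.get_context('spawn') returns a context whose Process starts a fresh interpreter even on Linux. The sketch below assumes such an interpreter. The helper name child_sees_parent_file, the use of os.devnull as a stand-in file, and the timeout are illustrative assumptions.

```python
import multiprocessing
import os

f = None  # module-level global, rebound only in the parent's main block


def child(queue):
    # Report whether this process can see a file opened by the parent.
    queue.put(f is not None)


def child_sees_parent_file(method):
    """Run child() under the given start method and return its report."""
    ctx = multiprocessing.get_context(method)
    queue = ctx.Queue()
    p = ctx.Process(target=child, args=(queue,))
    p.start()
    result = queue.get(timeout=30)  # read the result before joining
    p.join()
    return result


if __name__ == '__main__':
    f = open(os.devnull, 'r')
    # 'spawn' starts a fresh interpreter, as Windows always does, so the
    # child should not see the file the parent just opened.
    print(child_sees_parent_file('spawn'))
    f.close()
```

Run as a script, this should report False under 'spawn', because the child re-imports the module and never executes the main block. Passing 'fork' instead, where the platform supports it, should report True and so exposes the accidental dependence on inherited state.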
But even with this one inconvenience — which is troubling me much less, now that I at least understand why my application was behaving so differently under Windows — the multiprocessing module is still a huge leap forwards for Python programmers who need to run code in heavyweight processes with all of the isolation and safety that they provide. Thanks again to Jesse and the multiprocessing team!