One of the great recent advances in the Python Standard Library
is the addition of the
multiprocessing module,
maintained by Jesse Noller,
who has also blogged and written
about several other concurrency approaches for Python:
Kamaelia,
Circuits,
and
Stackless Python.
I have wanted to try the multiprocessing module out for some time,
and now have a consulting project that will really benefit from multiple processes:
they will let our application run third-party plugins
without having to worry that any bugs or indiscretions which they commit
might damage or hang our main server,
which can remain safe in another process.
First, one can only stand in awe at the achievement —
and the amount of work —
that the multiprocessing module represents.
I cannot imagine the time that it would have taken our team
to figure out all of the differences between Linux and Windows
when it comes to processes, shared memory, and concurrency mechanisms.
In fact, the approach we are taking might not even have been feasible
under those circumstances.
By figuring out how to get locks, queues, and shared data structures
all working cleanly on such different architectures,
the multiprocessing authors
save Python programmers out on the street like me
from reinventing a dozen wheels
when we need to support multi-platform concurrency.
Well, almost.
There is one rather startling difference
which the multiprocessing module does not hide:
the fact that while every Windows process must spin up
independently of the parent process that created it,
Linux supports the fork(2) system call
that creates a child process already in possession
of exactly the same resources as its parent:
every data structure, open file, and database connection
that existed in the parent process
is still sitting there, open and ready to use, in the child.
Consider this small program:
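from multiprocessing import Process

f = None  # module-level global; only the parent ever rebinds it

def child():
    # Runs in the child process and prints whatever f is there.
    print(f)

if __name__ == '__main__':
    f = open('mp.py', 'r')
    p = Process(target=child)
    p.start()
    p.join()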
On Linux, the open file f keeps its value in the child process;
the child has inherited an open connection from its parent:
$ python mp.py
<open file 'mp.py', mode 'r' at 0xb7734ac8>
Under Windows, however, where the multiprocessing module
has to spawn a fresh copy of the Python interpreter
to which it gives special instructions
to just run the function child(),
the freshly imported module is a clean slate in which f is still None:
C:\Users\brandon\dev>python mp.py
None
Now, my complaint is not exactly
that the multiprocessing documentation is misleading on this point;
under its section on
Programming guidelines,
it makes it quite clear that:
On Unix a child process can make use of a shared resource created in a
parent process using a global resource. However, it is better to pass
the object as an argument to the constructor for the child process.
I have no quarrel with this advice;
if I am careful to pass everything the child needs
in its list of arguments,
then I can be sure that my code will work under both Linux and Windows.
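As a minimal sketch of that advice, the child below is handed only a file name and opens the file itself, so nothing depends on state left over from the parent. The helper first_line_via_child(), the results queue, and the throwaway temporary file are hypothetical names chosen for illustration; they are not part of the original mp.py.

```python
import os
import tempfile
from multiprocessing import Process, Queue


def child(path, results):
    # Open the file here, in the child, instead of relying on a handle
    # inherited from the parent; only the path crosses the process boundary.
    with open(path, 'r') as f:
        results.put(f.readline())


def first_line_via_child(path):
    # Hypothetical helper: run child() in a separate process and
    # return the line it reads.
    results = Queue()
    p = Process(target=child, args=(path, results))
    p.start()
    line = results.get(timeout=30)  # fetch the result before joining the child
    p.join()
    return line


if __name__ == '__main__':
    # Throwaway file so the sketch does not depend on any particular path.
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
        tmp.write('hello from the parent\n')
    try:
        print(first_line_via_child(tmp.name))
    finally:
        os.unlink(tmp.name)
```

Because a path and a queue are the only things handed to the child, the same code should behave identically whether the child is forked or started from scratch.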
But I do wish that the multiprocessing module
made it possible to test this condition
more rigorously under Linux.
In particular, I wish that there were some way of turning
the simple forking logic off —
of saying, “Yes, I know that Linux will let you create a child process
very simply using fork(2), but for my sanity would you please
create the child process from scratch like you do under Windows so
that I can test whether my code accidentally depends on residual
state from the parent process that I did not see that I was using?”
I looked at the multiprocessing "forking.py" module
to see whether I could turn on the Windows-style process spawning
even from inside of Linux,
but the mechanism is chosen
by a bare module-level check of "sys.platform",
and if I overwrite that variable with 'win32',
the code dies when it tries to import "msvcrt",
which is available only under Windows.
There is, thus, even in principle, no way
that I can test my multiprocessing application under Linux
which will give me any assurance that my child processes
are not accidentally taking advantage of data structures
and open connections left lying around by the parent process;
only by actually moving over to Windows itself
can I see how my child code really behaves on its own.
I have created a feature request
in the Python bug tracker to see whether this situation can be improved.
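For readers on a newer interpreter: Python 3.4 and later, which postdate this post, let the start method be chosen explicitly through multiprocessing.get_context() and multiprocessing.set_start_method(). The following is only a sketch under that assumption. The helper global_seen_by_child() and the use of os.devnull as the file to open are illustrative choices, not part of the original program.

```python
import multiprocessing
import os

f = None  # module-level global, rebound only inside the __main__ guard


def child(results):
    # Report what the global f looks like from wherever this runs.
    results.put(repr(f))


def global_seen_by_child(method):
    # Hypothetical helper: start child() with the named start method
    # ('fork', 'spawn', or 'forkserver') and return what it saw.
    ctx = multiprocessing.get_context(method)
    results = ctx.Queue()
    p = ctx.Process(target=child, args=(results,))
    p.start()
    seen = results.get(timeout=30)
    p.join()
    return seen


if __name__ == '__main__':
    f = open(os.devnull, 'r')
    # Try every start method this platform offers.
    for method in multiprocessing.get_all_start_methods():
        print(method, global_seen_by_child(method))
    f.close()
```

Run as a script on Linux, one would expect 'fork' to report the open file and 'spawn' to report None, which is the Windows-style behavior wished for above.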
But even with this one inconvenience —
which is troubling me much less, now that I at least understand
why my application was behaving so differently under Windows —
the multiprocessing module is still a huge leap forwards
for Python programmers who need to run code
in heavyweight processes with all of the isolation and safety
that they provide.
Thanks again to Jesse and the multiprocessing team!