Python, Linkers, and Virtual Memory



──────────────────
PyCon 2012
Brandon Rhodes
──────────────────
Imagine a computer
with 1,000 bytes of
addressable RAM
Memory     Byte
address   stored

  999       128
  998       255
  997       254
  996       128
    ·         ·
    ·         ·
    ·         ·
    3         2
    2         0
    1       104
    0        77
We split memory into
pages by grouping addresses
that share a common numeric prefix
If we defined our
page size = 100 bytes
then our 1,000 bytes of
main memory = 10 pages
and page 3 would include all
addresses in the 3xx range
   
  400
┌─399─┐
│ 398 │
│ 397 │
│  ·  │
│  ·  │  Page 3
│  ·  │
│ 302 │
│ 301 │
└─300─┘
  299

What is a page table?

   ┌─────┐
   │ CPU │
   └─────┘
      ↕
┌────────────┐
│ Page Table │
└────────────┘
      ↕
   ┌─────┐
   │ RAM │
   └─────┘
The page table rewrites the leading digits
of every processor memory access before
the RAM chip sees the request

Page Table

Virtual Physical
┌──────────────┐
 9--    59-- 
 8--    63-- 
 7--       - 
 6--    61-- 
 5--    60-- 
 4--       - 
 3--       - 
 2--       - 
 1--    36-- 
 0--    35-- 
└──────────────┘
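
This rewriting rule is simple enough
to sketch in a few lines of Python —
a toy model of the table above
(real MMUs use power-of-two page sizes
and do all of this in hardware):

PAGE_SIZE = 100

# virtual page -> physical page; None means "not mapped"
page_table = {0: 35, 1: 36, 2: None, 3: None, 4: None,
              5: 60, 6: 61, 7: None, 8: 63, 9: 59}

def translate(address):
    page, offset = divmod(address, PAGE_SIZE)
    frame = page_table[page]
    if frame is None:
        raise RuntimeError('page fault at address %d' % address)
    return frame * PAGE_SIZE + offset

print(translate(821))   # 6321
print(translate(190))   # 3690
# translate(700) would raise: page fault at address 700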

“Read byte 821” ⇒ “Read byte 6321”

  ┌──────────────┐
   9--    59-- 
  8--    63--  
   7--       - 
   6--    61-- 
   5--    60-- 
   4--       - 
   3--       - 
   2--       - 
   1--    36-- 
   0--    35-- 
  └──────────────┘

“Read byte 190” ⇒ “Read byte 3690”

  ┌──────────────┐
   9--    59-- 
   8--    63-- 
   7--       - 
   6--    61-- 
   5--    60-- 
   4--       - 
   3--       - 
   2--       - 
  1--    36--  
   0--    35-- 
  └──────────────┘

“Write byte 522” ⇒ “Write byte 6022”

  ┌──────────────┐
   9--    59-- 
   8--    63-- 
   7--       - 
   6--    61-- 
  5--    60--  
   4--       - 
   3--       - 
   2--       - 
   1--    36-- 
   0--    35-- 
  └──────────────┘

“Read byte 700” ⇒ Page fault

  ┌──────────────┐
   9--    59-- 
   8--    63-- 
  7--       -  
   6--    61-- 
   5--    60-- 
   4--       - 
   3--       - 
   2--       - 
   1--    36-- 
   0--    35-- 
  └──────────────┘

Page Fault

Your program is paused
and the OS regains control

Page Fault

We will see that the OS
has several ways to respond

Q:

What is stored in the
bytes of memory pages?

A:

Everything your program needs

  1. Executable code
  2. Stack — variables
  3. Heap — data structures
These resources tend to load
gradually as your program runs

Example: Running your editor

You say: “Run my editor!”

So the OS loads it into memory
along with any libraries it requires

Example: Your Editor

┌──────────────┐
 9--       - 
 8--       - 
 7--       - 
 6--       - 
 5--       - 
 4--       - 
 3--       - 
 2--       - 
 1--    36--  libcurses
 0--    35--  editor binary (program)
└──────────────┘
The editor's main() function
starts calling other functions
So the OS automatically starts
allocating new stack pages

Example: Your Editor

┌──────────────┐
 9--    59--  stack (bottom = oldest)
 8--    63--  stack (top = newest)
 7--       - 
 6--       - 
 5--       - 
 4--       - 
 3--       - 
 2--       - 
 1--    36--  libcurses
 0--    35--  editor binary (program)
└──────────────┘
Then the editor needs space
for data structures like
lists and dictionaries
So it says “I need more memory!”
and the OS says “Okay, have some!”

Example: Your Editor

┌──────────────┐
 9--    59--  stack (bottom = oldest)
 8--    63--  stack (top = newest)
 7--       - 
 6--    61--  heap (file being edited)
 5--    60--  heap (settings)
 4--       - 
 3--       - 
 2--       - 
 1--    36--  libcurses
 0--    35--  editor binary (program)
└──────────────┘

Page Table

So what are the benefits
of this level of indirection?

Security

Processes can't see each other's pages

Security

Processes can't see OS pages

Security

The OS wipes reallocated pages with 0s

Stability

Processes can't overwrite each other

Stability

Processes can't crash the OS

Now, let's add another dimension
of complexity: the idea of protection

Protection

Real page tables include
read and write bits
In general you can
write your own stack and heap
but only
read executables
and libraries
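
You can glimpse these protection bits
from Python itself. A sketch, assuming
a Unix system where mmap exposes
a prot argument:

import mmap

ro = mmap.mmap(-1, 100, prot=mmap.PROT_READ)  # read-only pages

print(ro[0:2])   # reading is fine: two zero bytes
ro[0:2] = b'hi'  # writing raises TypeError on CPython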

Protection

┌────────────────┐
 9--    59-- w  stack
 8--    63-- w  stack
 7--         - 
 6--    61-- w  heap
 5--    60-- w  heap
 4--         - 
 3--         - 
 2--         - 
 1--    36-- r  libcurses
 0--    35-- r  binary
└────────────────┘
If you illegally
read or write—
—then instead of responding
to your page fault with
more free memory—
—the OS will stop you
with a Segmentation Fault

“Read from 331”

  ┌────────────────┐
   9--    59-- w 
   8--    63-- w 
   7--         - 
   6--    61-- w 
   5--    60-- w 
   4--         - 
  3--         -    SEGMENTATION FAULT
   2--         - 
   1--    36-- r 
   0--    35-- r 
  └────────────────┘

“Write to 101”

  ┌────────────────┐
   9--    59-- w 
   8--    63-- w 
   7--         - 
   6--    61-- w 
   5--    60-- w 
   4--         - 
   3--         - 
   2--         - 
  1--    36-- r   SEGMENTATION FAULT
   0--    35-- r 
  └────────────────┘
Read-only pages give the
OS a new superpower:

Sharing!

Sharing

Read-only binaries and libraries
can be safely and securely shared
among many processes!
These two processes use different
pages for heap, stack, and binary
┌────────────────┐         ┌────────────────┐
 9--    59-- w  stack    9--    48-- w  stack
 8--    63-- w  stack    8--         - 
 7--         -           7--    51-- w  heap
 6--    61-- w  heap     6--    50-- w  heap
 5--    60-- w  heap     5--    46-- w  heap
 4--         -           4--         - 
 3--         -           3--    39-- r  libjson
 2--    39-- r  libjson  2--    38-- r  libX11
 1--    36-- r  libc     1--    36-- r  libc
 0--    35-- r  python   0--    47-- r  firefox
└────────────────┘         └────────────────┘
But they safely share a single
copy of libc and libjson
┌────────────────┐         ┌────────────────┐
                                         
                                         
                                         
                                         
                                         
                                         
                          3--    39-- r  libjson
 2--    39-- r  libjson                 
 1--    36-- r  libc     1--    36-- r  libc
                                         
└────────────────┘         └────────────────┘
Physical RAM page 36 can be re-used
for every process needing libc
And Physical RAM page 39 can be
re-used for every process needing libjson
This means that the total memory
consumed by process A and process B
is rarely the sum

(A's memory use) + (B's memory use)

Q:

So how can you tell how much memory
an additional worker thread or process
will consume on one of your servers?

A:

Stop looking at the quantity
“How much RAM does my Python program use?”
Start looking at the delta

Δ memory

Ask, “How much memory does each
additional process / worker / thread cost
(and when will that run me out of RAM)?”
How should you measure that?
By actual load and resource tests
Because as we will see,
memory usage is complicated
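
As a starting point, here is one hedged
sketch of such a test on Linux: fork a few
idle stand-in workers and watch MemFree
in /proc/meminfo. (The numbers will be noisy,
and a real test would drive the workers
with production-shaped load.)

import os, time

def memfree_kb():
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemFree:'):
                return int(line.split()[1])

before = memfree_kb()
workers = []
for _ in range(4):
    pid = os.fork()
    if pid == 0:          # child: an idle stand-in for a real worker
        time.sleep(60)
        os._exit(0)
    workers.append(pid)

time.sleep(1)             # let the children settle
delta = (before - memfree_kb()) / 4.0
print('%.0f kB per additional worker' % delta)
for pid in workers:
    os.kill(pid, 15)      # SIGTERM the workers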

Problem: Python Code

Q: Where in virtual memory
does Python code live?
Python starts running with
most RAM pages sharable
┌────────────────┐
 9--    59-- w  stack
 8--         - 
 7--         - 
 6--         - 
 5--         - 
 4--         - 
 3--         - 
 2--         - 
 1--    36-- r  libc
 0--    35-- r  python
└────────────────┘
But Python runs read() on foo.py creating
a writable, un-sharable page on the heap
┌────────────────┐
 9--    59-- w  stack
 8--         - 
 7--         - 
 6--         - 
 5--    60-- w  foo.py   
 4--         - 
 3--         - 
 2--         - 
 1--    36-- r  libc
 0--    35-- r  python
└────────────────┘
Then Python compiles foo.py to a code
object on another read-write page
┌────────────────┐
 9--    59-- w  stack
 8--         - 
 7--         - 
 6--    61-- w  codeobj  
 5--    60-- w  foo.py
 4--         - 
 3--         - 
 2--         - 
 1--    36-- r  libc
 0--    35-- r  python
└────────────────┘
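
The same three steps are visible from
Python itself. A sketch, assuming a foo.py
sits in the current directory:

source = open('foo.py').read()            # the writable foo.py page
code = compile(source, 'foo.py', 'exec')  # the writable codeobj page
exec(code)                                # and now foo.py starts up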
Only then does foo.py start up and
create legitimately unique program data
┌────────────────┐
 9--    59-- w  stack
 8--         - 
 7--    63-- w  data     
 6--    61-- w  codeobj
 5--    60-- w  foo.py
 4--         - 
 3--         - 
 2--         - 
 1--    36-- r  libc
 0--    35-- r  python
└────────────────┘
The heap winds up as a mix of actual
unique data, together with shared code
┌────────────────┐
 9--    59-- w  stack
 8--         - 
 7--    63-- w  data     
 6--    61-- w  codeobj
 5--    60-- w  foo.py
 4--         - 
 3--         - 
 2--         - 
 1--    36-- r  libc
 0--    35-- r  python
└────────────────┘
Now it just so happens that every
Python X.Y process loading foo.py will
create the exact same code object—
—but the OS does not know about this
because each Python process builds
its code objects separately
To the OS, Python code just looks like heap
┌────────────────┐         ┌────────────────┐
 9--    59-- w          9--    48-- w  stack
 8--         -           8--         - 
 7--    63-- w          7--    71-- w  data
 6--    61-- w          6--    69-- w  codeobj
 5--    60-- w          5--    62-- w  foo.py
 4--         -           4--         - 
 3--         -           3--         - 
 2--         -           2--         - 
 1--    36-- r           1--    36-- r  libc
 0--    35-- r           0--    35-- r  python
└────────────────┘         └────────────────┘

Q: Why not share code objects?

Q: Why not make them read-only?

A: Reference Counts

Standard C Python does not
support read-only code objects
because it constantly fiddles
with their reference counts

What does this code do?

f(5)
f(5)
  1. Looks up f
  2. Gets back a code object
  3. Increments its reference count
  4. Calls it with (5,)
  5. Decrements its reference count

Why increment the count?

To prevent other threads
from de-allocating and overwriting
f() while this thread is busy running it
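
You can watch the count move from Python
itself (CPython only; sys.getrefcount
reports one extra reference for its
own argument):

import sys

def f(x):
    return x

print(sys.getrefcount(f))   # e.g. 2: the name "f" plus the argument
g = f                       # bind one more name to the function
print(sys.getrefcount(f))   # one higher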
Lesson: the OS offers a cool optimization
when processes need to share code, but
reference-counted code objects cannot
take advantage of this

Now, another basic concept—

Dynamic Linking

Take another look
at our Python process
┌────────────────┐
 9--    59-- w  stack
 8--    63-- w  stack
 7--         - 
 6--    61-- w  heap
 5--    60-- w  heap
 4--         - 
 3--         - 
 2--    39-- r  libjson
 1--    36-- r  libc.so
 0--    35-- r  python
└────────────────┘

Dynamic Linking

The python binary and libc.so library
are separate files sitting on disk
┌────────────────┐
                
                
                
                
                
                
                
                
 1--    36-- r  libc.so
 0--    35-- r  python
└────────────────┘

Dynamic Linking

How does the Python interpreter find
the routines it needs in libc.so?
┌────────────────┐
                
                
                
                
                
                
                
                
 1--    36-- r  libc.so
 0--    35-- r  python
└────────────────┘

The Old Days

prog.c ──→ prog

But then prog.c got really big

         compile        link

cmdn.c  →  cmdn.o  ┐
opts.c  →  opts.o  ├──→  prog
slen.c  →  slen.o  ┘
So you only needed to re-compile
the source files you just edited

Q:

How does the linker
connect the module that calls foo()
with the module that defines it?
         compile        link

cmdn.c  →  cmdn.o  ┐
opts.c  →  opts.o  ├──→  prog
slen.c  →  slen.o  ┘

A:

Every .o file has a table
of names that it defines and names
that it needs someone else to define
$ nm p_move.o
         U _nc_panelhook
         U is_linetouched
00000000 T move_panel
         U mvwin
         U wtouchln
The linker plays matchmaker
between names needed
and names provided
Linking returns an error
if a name needed cannot be found
$ ld -o prog foo.o bar.o baz.o

foo.o: In function `main':
foo.c:(.text+0x7): undefined reference to `haute'
collect2: ld returned 1 exit status
People started sharing useful .o
object files with their friends
So they used the ar archive tool
to bundle together related .o
files into single .a files
You can still find .a files
on OS X and Linux today
$ ar t /usr/lib/libpanel.a
panel.o
p_above.o
p_below.o
p_bottom.o
p_delete.o
p_hide.o
p_hidden.o
p_move.o
p_new.o
p_replace.o
p_show.o
p_top.o
p_update.o
p_user.o
p_win.o
So the linker was given the power
to search an .a archive and link
your program with the extra .o
files that it needs
The .a file full of .o files
came to be called a “library”

Problem:

If everyone lets the linker
copy printf.o into their program,
then memory has to hold many separate
copies of each popular library

Solution:

The invention of .so
“shared object” files that
can be added to the page table
of every program that needs them
┌────────────────┐
       .        
       .        
       .        
 1--    36-- r  libc.so
 0--    35-- r  python
└────────────────┘
The last-minute linking
that takes place at runtime
to connect python to libc.so
is called Dynamic Linking
┌────────────────┐
       .        
       .        
       .        
 1--    36-- r  libc.so
 0--    35-- r  python
└────────────────┘

Oddly enough

The .so shared library is all
you need to run a program
but
you still need that old-fashioned
.a library to compile the program
That's why you need to install libxml2-dev
to compile the Python lxml module, even if
/usr/lib/libxml2.so.2.7.8
already exists on your system

Anyway

You can see the libraries that a
program like python needs with ldd
$ ldd /usr/bin/python2.7
      linux-gate.so.1
      libpthread.so.0
      libdl.so.2
      libutil.so.1
      libssl.so.1.0.0
      libcrypto.so.1.0.0
      libz.so.1
      libm.so.6
      libc.so.6
      /lib/ld-linux.so.2
You can see the specific names
it needs with nm (“names”)
$ nm -D /usr/bin/python2.7
           .
           .
           .
        U unlink
        U unsetenv
        U utime
        U utimes
        U wait
        U wait3
        U wait4
        U waitpid
        U wcscoll
        U write
Not only is it usual for the
Python binary to be dynamically linked
but Python extension modules
sometimes link against
shared libraries too
the lxml.etree Python module
is famous for this
$ ldd /usr/lib/pyshared/python2.7/lxml/etree.so
       linux-gate.so.1
       libxslt.so.1
       libexslt.so.0
       libxml2.so.2
       libpthread.so.0
       libc.so.6
       libm.so.6
       libgcrypt.so.11
       libdl.so.2
       libz.so.1
       /lib/ld-linux.so.2
       libgpg-error.so.0

Shared libraries cause problems

When a program chooses not to carry a
static copy of every library that it needs,
there is the danger that a library it
needs will be upgraded or removed
As an example, let us build:
  1. a small shared library
  2. a Python module that uses it

Shared library: libtiny.so

/* libtiny.c */
helper() { return 42; }

gcc libtiny.c -shared -o libtiny.so

Python module: tinymodule.so

/* tinymodule.c */

#include <python2.7/Python.h>

static PyMethodDef TinyMethods[] = {
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC inittiny(void)
{
    helper();  /* the call into libtiny.so */
    (void) Py_InitModule("tiny", TinyMethods);
}
gcc tinymodule.c -L. -ltiny -shared \
-o tinymodule.so
Having created these two *.so files we
need to add the current directory to
the OS shared library search path
$ export LD_LIBRARY_PATH=.
So the module tinymodule.so
needs the library libtiny.so
$ ldd tinymodule.so
       linux-gate.so.1
       libtiny.so
       libc.so.6
       /lib/ld-linux.so.2

Shared library present, compatible

$ ls
libtiny.c   tinymodule.c
libtiny.so  tinymodule.so

$ python -c 'import tiny'

$ echo $?
0
What if Python
can find tinymodule.so
but the OS cannot find
its dependency libtiny.so?

Shared library missing

$ rm libtiny.so
$ python -c 'import tiny'
ImportError: libtiny.so:
    cannot open shared object file:
    No such file or directory

Incompatible shared library

/* libtiny.c */
different_helper() { return 42; }

gcc libtiny.c -shared -o libtiny.so

Incompatible shared library

$ python -c 'import tiny'
ImportError: ./tinymodule.so:
    undefined symbol: helper
$ nm tinymodule.so
...
00000354 T _init
         U helper
0000047c T inittiny

$ nm libtiny.so
...
000002f4 T _init
0000041c T different_helper
“Cannot open shared object”
and
“undefined symbol”
are possible because the OS tries to
link a binary (tinymodule.so) to
its dependencies (libtiny.so)
at runtime
And now, for another puzzle piece:

Demand Paging

Demand Paging

The OS does not load pages
from disk until your program
reads from or writes to them
Imagine a program that
has just started running

Files:

python — 3 pages
libc.so — 2 pages
At first, only one page of
the python binary is loaded
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         - 
 6--    61-- w   heap
 5--         - 
 4--         -  (libc.so)
 3--         -  (libc.so)
 2--         -  (python)
 1--         -  (python)
 0--    35-- r   python  LOAD FROM DISK
└────────────────┘
Then Python tries to run the instruction
at 212 so that page gets loaded too
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         - 
 6--    61-- w   heap
 5--         - 
 4--         -  (libc.so)
 3--         -  (libc.so)
 2--    37-- r   python  LOAD FROM DISK
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
Then Python calls a function at libc
offset 178, thus address 300 + 178 = 478
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         - 
 6--    61-- w   heap
 5--         - 
 4--    41-- r   libc.so  LOAD FROM DISK
 3--         -  (libc.so)
 2--    37-- r   python
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
If Python keeps calling the same routines,
then only those 3 pages ever get loaded
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         - 
 6--    61-- w   heap
 5--         - 
 4--    41-- r   libc.so
 3--         -  (libc.so)
 2--    37-- r   python
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
So pages are loaded
from disk lazily
Heap allocation works
the same way
Q: How many RAM pages are allocated if you
ask the OS for three new memory pages?
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         - 
 6--         - 
 5--         - 
 4--    61-- w   heap
 3--         - 
 2--    41-- r   libc.so
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
A:
None!
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         -  (heap)
 6--         -  (heap)
 5--         -  (heap)
 4--    61-- w   heap
 3--         - 
 2--    41-- r   libc.so
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
The OS allocates no pages but simply makes
an adjustment to its record-keeping
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         -  (heap)  
 6--         -  (heap)  
 5--         -  (heap)  
 4--    61-- w   heap
 3--         - 
 2--    41-- r   libc.so
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
The OS will now respond to page faults
at 500–799 by allocating a new page,
not raising a Segmentation Fault
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         -  (heap)  
 6--         -  (heap)  
 5--         -  (heap)  
 4--    61-- w   heap
 3--         - 
 2--    41-- r   libc.so
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
But no pages are actually allocated until
a read or write shows they are needed
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         -  (heap)  
 6--         -  (heap)  
 5--         -  (heap)  
 4--    61-- w   heap
 3--         - 
 2--    41-- r   libc.so
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
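
You can watch this laziness from Python.
A sketch, best viewed with top running
in another terminal:

import mmap

big = mmap.mmap(-1, 300 * 1024 * 1024)  # ~300 MB promised: VIRT jumps
big[0:1] = b'x'                         # one write faults one page in: RES barely moves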

That is the difference

between VIRT and RES
when you run top

top

                    

  PID USER    VIRT  RES  SHR S %CPU %MEM COMMAND
 1217 root    195m  94m  53m S    2  2.6 Xorg
 7304 brandon 183m  45m  19m S    2  1.3 chromium
 2027 brandon 531m 162m  37m S    1  4.5 chromium
 2319 brandon 179m  58m  23m S    1  1.6 chromium
18841 brandon 2820 1188  864 R    1  0.0 top
 9776 brandon 175m  49m  12m S    0  1.4 chromium

                    
VIRT
Virtual Image
All memory pages that promise not
to return a Segmentation Fault
RES
Resident Size
Only memory pages that have
actually been allocated
VIRT — 8 pages
RES — 4 pages
┌────────────────┐         VIRT  RES
 9--    59-- w   stack       ✓    ✓
 8--         - 
 7--         -  (heap)       ✓
 6--         -  (heap)       ✓
 5--         -  (heap)       ✓
 4--    61-- w   heap        ✓    ✓
 3--         - 
 2--    41-- r   libc.so     ✓    ✓
 1--         -  (python)     ✓
 0--    35-- r   python      ✓    ✓
└────────────────┘
Only RES can actually
run you out of memory
Look at RES when you wonder
where all of your RAM is going
                     

  PID USER    VIRT  RES  SHR S %CPU %MEM COMMAND
 1217 root    195m  94m  53m S    2  2.6 Xorg
 7304 brandon 183m  45m  19m S    2  1.3 chromium
 2027 brandon 531m 162m  37m S    1  4.5 chromium
 2319 brandon 179m  58m  23m S    1  1.6 chromium
18841 brandon 2820 1188  864 R    1  0.0 top
 9776 brandon 175m  49m  12m S    0  1.4 chromium

                     

/proc/<pid>/smaps

Shows whether physical RAM pages
have been allocated to a segment
Here are some segments of a new python
process sitting quietly at its prompt
08048000-08238000 r-xp .../bin/python
Size:               1984 kB
Rss:                 896 kB

b73b9000-b752f000 r-xp .../libc-2.13.so
Size:               1496 kB
Rss:                 528 kB

bfc48000-bfc69000 rw-p [stack]
Size:                136 kB
Rss:                 104 kB
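
If you want one number per process, you
can total the Rss of every segment.
A small Linux-only sketch:

import os

def total_rss_kb(pid):
    # Sum the "Rss:" line of every segment in /proc/<pid>/smaps
    total = 0
    with open('/proc/%d/smaps' % pid) as f:
        for line in f:
            if line.startswith('Rss:'):
                total += int(line.split()[1])
    return total

print('%d kB resident' % total_rss_kb(os.getpid()))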

Demand Paging: Example

Calling a Python function that
was not called as Python started up
08048000-08238000 r-xp .../bin/python
Size:               1984 kB
Rss:                 896 kB

>>> frozenset()

08048000-08238000 r-xp .../bin/python
Size:               1984 kB
Rss:                 916 kB

916 - 896 = 20 new kilobytes

So invoking new sections of a binary
or library pulls more pages into RAM
But since binary and library
code can always be reloaded from disk,
the OS can also discard those pages
$ python -c 'range(500000000)'
This allocates lots of memory
As it runs, your OS will
dump information overboard to
reclaim every RAM page it can
What did this memory hog do
to the other Python process that
was sitting quietly at its prompt?
before the hog ran:

08048000-08238000 r-xp .../bin/python
Size:               1984 kB
Rss:                 916 kB

after:

08048000-08238000 r-xp .../bin/python
Size:               1984 kB
Rss:                 464 kB

Look at the change in Rss!

916 - 464 = 452k were thrown overboard!

but python will never know, right?

(well, there is one difference)

The process will slow down as
de-allocated pages are re-loaded
like leaving the building to roam
the earth for one year and three months.
— Gustavo Duarte
That's why your web browser takes
so long to start the first time

And now, a Unix superpower!

Forking

On primitive operating systems the
only way to create a new process
is to start a whole new program
The Python interpreter requires
several steps to get going —
A

B

C

D
— then to start a second worker process
you have to send it through the same startup
Process 1  A
           ↓
           B
           ↓
           C
           ↓
           D ── os.spawn() ──→  A   Process 2
                                ↓
           .                    B
           .                    ↓
           .                    C
                                ↓
           .                    D
But Linux and OS X support fork()!
“If you come to a fork
in the road, take it!”
fork() creates a child process that continues
on from the parent's current state
Process 1  A
           ↓
           B
           ↓
           C
           ↓
           D ── os.fork() ──→   Process 2
           ↓                    ↓
           E                    E
           ↓                    ↓
           F                    F
The OS could implement fork() naively
by copying every writeable page in the parent
so that the child had its own copy
  fork() parent              fork() child
┌────────────────┐         ┌─────────────────┐
 9--    59-- w  stack    9--   new copy 
 8--    63-- w  stack    8--   new copy 
 7--         -           7--          - 
 6--    61-- w  heap     6--   new copy 
 5--    60-- w  heap     5--   new copy 
 4--         -           4--          - 
 3--         -           3--          - 
 2--    39-- r  libjson  2--    (same)  
 1--    36-- r  libc     1--    (same)  
 0--    35-- r  python   0--    (same)  
└────────────────┘         └─────────────────┘
But immediately copying every page could
take a long time for a large process,
and during the copy both parent and
child would be hung waiting!
Instead, as you might guess,
the OS implements fork()
lazily,
copying pages on-demand
when the parent or child
performs a write
So the OS starts fork() by copying the page
table, marking all pages read-only in both
processes, then letting them keep running
  fork() parent              fork() child
┌────────────────┐         ┌────────────────┐
 9--    59-- r  stack    9--    59-- r 
 8--    63-- r  stack    8--    63-- r 
 7--         -           7--         - 
 6--    61-- r  heap     6--    61-- r 
 5--    60-- r  heap     5--    60-- r 
 4--         -           4--         - 
 3--         -           3--         - 
 2--    39-- r  libjson  2--    (same) 
 1--    36-- r  libc     1--    (same) 
 0--    35-- r  python   0--    (same) 
└────────────────┘         └────────────────┘
As writes cause page faults the OS makes copies.
Here page 63 is copied to 77 and marked w.
    fork() parent                 fork() child
  ┌────────────────┐           ┌────────────────┐
   9--    59-- r    stack    9--    59-- r 
  8--    63-- w   stack   8--    77-- w  
   7--         -             7--         - 
   6--    61-- r    heap     6--    61-- r 
   5--    60-- r    heap     5--    60-- r 
   4--         -             4--         - 
   3--         -             3--         - 
   2--    39-- r    libjson  2--    (same) 
   1--    36-- r    libc     1--    (same) 
   0--    35-- r    python   0--    (same) 
  └────────────────┘           └────────────────┘
Q: How well does this usually work?
A: It works great!
Only the differences that develop
between the parent and child memory
images incur additional storage
Q: But how does this work for Python?
A: Terribly!
Thanks to reference counts,
merely glancing at data with C Python
forces the OS to create a separate copy
# Create a list

import os, time

biglist = ['foo']
for n in range(1, 8460000):
    biglist.append(n)   # ~100 MB

os.fork()   # the fork() goes here

# Iterate across the list

t0 = time.time()
all(n for n in biglist)
t1 = time.time()

C Python

Before C Python iterates

$ cat /proc/22364/smaps | awk '/heap/,/Private_D/'
08d60000-0ef90000 rw-p 00000000 00:00 0     [heap]
Size:             100544 kB
Rss:              100544 kB
Pss:               50294 kB
Shared_Clean:          0 kB
Shared_Dirty:     100500 kB
Private_Clean:         0 kB
Private_Dirty:        44 kB

After C Python iterates

$ cat /proc/22364/smaps | awk '/heap/,/Private_D/'
08d60000-0ef90000 rw-p 00000000 00:00 0     [heap]
Size:             100544 kB
Rss:              100544 kB
Pss:              100274 kB
Shared_Clean:          0 kB
Shared_Dirty:        540 kB
Private_Clean:         0 kB
Private_Dirty:    100004 kB

C Python

At finish: 0.5% of heap shared

PyPy 1.8

Before PyPy iterates

$ cat /proc/22385/smaps | awk '/heap/,/Private_D/'
0a5c9000-11fba000 rw-p 00000000 00:00 0     [heap]
Size:             124868 kB
Rss:              124540 kB
Pss:               62274 kB
Shared_Clean:          0 kB
Shared_Dirty:     124532 kB
Private_Clean:         0 kB
Private_Dirty:         8 kB

After PyPy iterates

$ cat /proc/22385/smaps | awk '/heap/,/Private_D/'
0a5c9000-11fba000 rw-p 00000000 00:00 0     [heap]
Size:             124868 kB
Rss:              124868 kB
Pss:               63160 kB
Shared_Clean:          0 kB
Shared_Dirty:     123416 kB
Private_Clean:         0 kB
Private_Dirty:      1452 kB

PyPy 1.8

At finish: 96.1% of heap shared

Lesson:

Forked worker processes in Unix
share memory when the parent process
builds read-only data structures
but not in C Python

Explicitly Sharing Memory

How can two Python threads or processes
share writeable memory
to collaborate?
  1. Threads
  2. Memory maps

1. Threads

Creating a thread is like fork()
except that the heap remains shared
between the two threads of control
Python supports threads
on both Unix and Windows
import threading

def myfunc():
    print('hello from the child thread')

t = threading.Thread(target=myfunc)
t.start()
Each thread gets its own stack so the two
threads can call different functions, but
all other data structures are shared
    main thread               child thread
┌────────────────┐         ┌────────────────┐
 9--    59-- r   stack   9--    59-- r 
 8--    63-- w   stack   8--    77-- w 
└────────────┬───┴─────────┴──┬─────────────┘
              7--         - 
              6--    61-- w  heap
              5--    60-- w  heap
              4--         - 
              3--         - 
              2--    39-- r  libjson
              1--    36-- r  libc
              0--    35-- r  python
             └────────────────┘
Since threads
share every data structure
they have to be very careful
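
“Careful” usually means a lock around
anything both threads mutate.
A minimal sketch:

import threading

counter = 0
lock = threading.Lock()

def work():
    global counter
    for _ in range(100000):
        with lock:          # only one thread mutates at a time
            counter += 1

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)              # always 200000 with the lock held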

2. Memory Maps

Using mmap() a parent process
can create shared memory that will be
inherited by all the child workers it forks
import mmap, os

map = mmap.mmap(-1, 100)
os.fork()
...
  fork() parent                 fork() child
┌────────────────┐           ┌────────────────┐
 9--    59-- r    stack    9--    59-- r 
 8--    63-- w    stack    8--    77-- w 
└─────────────┬──┴───────────┴─┬──────────────┘
               7--    88-- -  mmap segment
┌─────────────┴──┬───────────┬─┴──────────────┐
 6--    61-- r    heap     6--    61-- r 
 5--    60-- r    heap     5--    60-- r 
 4--         -             4--         - 
 3--         -             3--         - 
 2--    39-- r    libjson  2--    (same) 
 1--    36-- r    libc     1--    (same) 
 0--    35-- r    python   0--    (same) 
└────────────────┘           └────────────────┘
This supports very fast RAM-based
communication between processes
without requiring
every data structure on the heap
to carry locks or other protection
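
A minimal sketch of the idea, assuming
a Unix system: the child writes into the
inherited map, and the parent reads the
bytes back out.

import mmap, os

shared = mmap.mmap(-1, 100)   # anonymous map, inherited across fork()
if os.fork() == 0:            # child
    shared[0:5] = b'hello'
    os._exit(0)
os.wait()                     # wait for the child to finish writing
print(shared[0:5])            # b'hello' -- written by the child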

Another mmap() capability

Remember how the OS loads pages
from binary programs like python
and shared libraries like libc.so
on-demand, not all at once?
Well, with mmap() you can do
that yourself, with normal files!
mmap(myfile.fileno(), 0) lets you
replace seek(), read(), and write() calls
and simply access it like an array!
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--    91-- -   mmap[1]  myfile
 6--    90-- -   mmap[0]  myfile
 5--         - 
 4--    61-- w   heap
 3--         - 
 2--    41-- r   libc.so
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
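
For example, a sketch assuming an existing
file example.txt at least five bytes long:

import mmap

with open('example.txt', 'r+b') as myfile:
    data = mmap.mmap(myfile.fileno(), 0)  # map the whole file
    print(data[0:5])     # read bytes without calling read()
    data[0:1] = b'X'     # writes land back in the file
    data.close()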

So, there you have it

Your processes all access memory
through the mediation of a page table
Page tables power all
kinds of fun RAM optimizations:
  1. Demand loading
  2. Shared libraries
  3. Memory maps
  4. fork()
  5. Threads

And

PyPy lacks reference counts
which lets the OS do its magic
and conserve resources

Thank you!