Python, Linkers, and Virtual Memory



──────────────────
PyCon 2012
Brandon Rhodes
──────────────────
Imagine a computer
with 1,000 bytes of
addressable RAM
Memory     Byte
address   stored

  999       128
  998       255
  997       254
  996       128
    ·         ·
    ·         ·
    ·         ·
    3         2
    2         0
    1       104
    0        77
We split memory into
pages by grouping addresses
that share a common numeric prefix
If we defined our
page size = 100 bytes
then our 1,000 bytes of
main memory = 10 pages
and page 3 would include all
addresses in the 3xx range
   
  400
┌─399─┐
│ 398 │
│ 397 │
│  ·  │
│  ·  │  Page 3
│  ·  │
│ 302 │
│ 301 │
└─300─┘
  299

What is a page table?

   ┌─────┐
   │ CPU │
   └─────┘
      ↕
┌────────────┐
│ Page Table │
└────────────┘
      ↕
   ┌─────┐
   │ RAM │
   └─────┘
The page table rewrites the leading digits
of every processor memory access before
the RAM chip sees the request

Page Table

Virtual Physical
┌──────────────┐
 9--    59-- 
 8--    63-- 
 7--       - 
 6--    61-- 
 5--    60-- 
 4--       - 
 3--       - 
 2--       - 
 1--    36-- 
 0--    35-- 
└──────────────┘
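
This rewriting rule is simple enough
to sketch in a few lines of Python —
a toy model of the table above
(real MMUs use power-of-two page sizes
and do all of this in hardware):

PAGE_SIZE = 100

# virtual page -> physical page; None means "not mapped"
page_table = {0: 35, 1: 36, 2: None, 3: None, 4: None,
              5: 60, 6: 61, 7: None, 8: 63, 9: 59}

def translate(address):
    page, offset = divmod(address, PAGE_SIZE)
    frame = page_table[page]
    if frame is None:
        raise RuntimeError('page fault at address %d' % address)
    return frame * PAGE_SIZE + offset

print(translate(821))   # 6321
print(translate(190))   # 3690
# translate(700) would raise: page fault at address 700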

“Read byte 821” ⇒ “Read byte 6321”

  ┌──────────────┐
   9--    59-- 
  8--    63--  
   7--       - 
   6--    61-- 
   5--    60-- 
   4--       - 
   3--       - 
   2--       - 
   1--    36-- 
   0--    35-- 
  └──────────────┘

“Read byte 190” ⇒ “Read byte 3690”

  ┌──────────────┐
   9--    59-- 
   8--    63-- 
   7--       - 
   6--    61-- 
   5--    60-- 
   4--       - 
   3--       - 
   2--       - 
  1--    36--  
   0--    35-- 
  └──────────────┘

“Write byte 522” ⇒ “Write byte 6022”

  ┌──────────────┐
   9--    59-- 
   8--    63-- 
   7--       - 
   6--    61-- 
  5--    60--  
   4--       - 
   3--       - 
   2--       - 
   1--    36-- 
   0--    35-- 
  └──────────────┘

“Read byte 700” ⇒ Page fault

  ┌──────────────┐
   9--    59-- 
   8--    63-- 
  7--       -  
   6--    61-- 
   5--    60-- 
   4--       - 
   3--       - 
   2--       - 
   1--    36-- 
   0--    35-- 
  └──────────────┘

Page Fault

Your program is paused
and the OS regains control

Page Fault

We will see that the OS
has several ways to respond

Q:

What is stored in the
bytes of memory pages?

A:

Everything your program needs

  1. Executable code
  2. Stack — variables
  3. Heap — data structures
These resources tend to load
gradually as your program runs

Example: Running your editor

You say: “Run my editor!”

So the OS loads it into memory
along with any libraries it requires

Example: Your Editor

┌──────────────┐
 9--       - 
 8--       - 
 7--       - 
 6--       - 
 5--       - 
 4--       - 
 3--       - 
 2--       - 
 1--    36--  libcurses
 0--    35--  editor binary (program)
└──────────────┘
The editor's main() function
starts calling other functions
So the OS automatically starts
allocating new stack pages

Example: Your Editor

┌──────────────┐
 9--    59--  stack (bottom = oldest)
 8--    63--  stack (top = newest)
 7--       - 
 6--       - 
 5--       - 
 4--       - 
 3--       - 
 2--       - 
 1--    36--  libcurses
 0--    35--  editor binary (program)
└──────────────┘
Then the editor needs space
for data structures like
lists and dictionaries
So it says “I need more memory!”
and the OS says “Okay, have some!”

Example: Your Editor

┌──────────────┐
 9--    59--  stack (bottom = oldest)
 8--    63--  stack (top = newest)
 7--       - 
 6--    61--  heap (file being edited)
 5--    60--  heap (settings)
 4--       - 
 3--       - 
 2--       - 
 1--    36--  libcurses
 0--    35--  editor binary (program)
└──────────────┘

Page Table

So what are the benefits
of this level of indirection?

Security

Processes can't see each other's pages

Security

Processes can't see OS pages

Security

The OS wipes reallocated pages with 0s

Stability

Processes can't overwrite each other

Stability

Processes can't crash the OS

Now, let's add another dimension
of complexity: the idea of protection

Protection

Real page tables include
read and write bits
In general you can
write your own stack and heap
but only
read executables
and libraries
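
You can glimpse these protection bits
from Python itself. A sketch, assuming
a Unix system where mmap exposes
a prot argument:

import mmap

ro = mmap.mmap(-1, 100, prot=mmap.PROT_READ)  # read-only pages

print(ro[0:2])   # reading is fine: two zero bytes
ro[0:2] = b'hi'  # writing raises TypeError on CPython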

Protection

┌────────────────┐
 9--    59-- w  stack
 8--    63-- w  stack
 7--         - 
 6--    61-- w  heap
 5--    60-- w  heap
 4--         - 
 3--         - 
 2--         - 
 1--    36-- r  libcurses
 0--    35-- r  binary
└────────────────┘
If you illegally
read or write—
—then instead of responding
to your page fault with
more free memory—
—the OS will stop you
with a Segmentation Fault

“Read from 331”

  ┌────────────────┐
   9--    59-- w 
   8--    63-- w 
   7--         - 
   6--    61-- w 
   5--    60-- w 
   4--         - 
  3--         -    SEGMENTATION FAULT
   2--         - 
   1--    36-- r 
   0--    35-- r 
  └────────────────┘

“Write to 101”

  ┌────────────────┐
   9--    59-- w 
   8--    63-- w 
   7--         - 
   6--    61-- w 
   5--    60-- w 
   4--         - 
   3--         - 
   2--         - 
  1--    36-- r   SEGMENTATION FAULT
   0--    35-- r 
  └────────────────┘
Read-only pages give the
OS a new superpower:

Sharing!

Sharing

Read-only binaries and libraries
can be safely and securely shared
among many processes!
These two processes use different
pages for heap, stack, and binary
┌────────────────┐         ┌────────────────┐
 9--    59-- w  stack    9--    48-- w  stack
 8--    63-- w  stack    8--         - 
 7--         -           7--    51-- w  heap
 6--    61-- w  heap     6--    50-- w  heap
 5--    60-- w  heap     5--    46-- w  heap
 4--         -           4--         - 
 3--         -           3--    39-- r  libjson
 2--    39-- r  libjson  2--    38-- r  libX11
 1--    36-- r  libc     1--    36-- r  libc
 0--    35-- r  python   0--    47-- r  firefox
└────────────────┘         └────────────────┘
But they safely share a single
copy of libc and libjson
┌────────────────┐         ┌────────────────┐
                                         
                                         
                                         
                                         
                                         
                                         
                          3--    39-- r  libjson
 2--    39-- r  libjson                 
 1--    36-- r  libc     1--    36-- r  libc
                                         
└────────────────┘         └────────────────┘
Physical RAM page 36 can be re-used
for every process needing libc
And Physical RAM page 39 can be
re-used for every process needing libjson
This means that the total memory
consumed by process A and process B
is rarely the sum

(A's memory use) + (B's memory use)

Q:

So how can you tell how much memory
an additional worker thread or process
will consume on one of your servers?

A:

Stop looking at the quantity
“How much RAM does my Python program use?”
Start looking at the delta

Δ memory

Ask, “How much memory does each
additional process / worker / thread cost
(and when will that run me out of RAM)?”
How should you measure that?
By actual load and resource tests
Because as we will see,
memory usage is complicated
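
As a starting point, here is one hedged
sketch of such a test on Linux: fork a few
idle stand-in workers and watch MemFree
in /proc/meminfo. (The numbers will be noisy,
and a real test would drive the workers
with production-shaped load.)

import os, time

def memfree_kb():
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemFree:'):
                return int(line.split()[1])

before = memfree_kb()
workers = []
for _ in range(4):
    pid = os.fork()
    if pid == 0:          # child: an idle stand-in for a real worker
        time.sleep(60)
        os._exit(0)
    workers.append(pid)

time.sleep(1)             # let the children settle
delta = (before - memfree_kb()) / 4.0
print('%.0f kB per additional worker' % delta)
for pid in workers:
    os.kill(pid, 15)      # SIGTERM the workers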

Problem: Python Code

Q: Where in virtual memory
does Python code live?
Python starts running with
most RAM pages sharable
┌────────────────┐
 9--    59-- w  stack
 8--         - 
 7--         - 
 6--         - 
 5--         - 
 4--         - 
 3--         - 
 2--         - 
 1--    36-- r  libc
 0--    35-- r  python
└────────────────┘
But Python runs read() on foo.py creating
a writable, un-sharable page on the heap
┌────────────────┐
 9--    59-- w  stack
 8--         - 
 7--         - 
 6--         - 
 5--    60-- w  foo.py   
 4--         - 
 3--         - 
 2--         - 
 1--    36-- r  libc
 0--    35-- r  python
└────────────────┘
Then Python compiles foo.py to a code
object on another read-write page
┌────────────────┐
 9--    59-- w  stack
 8--         - 
 7--         - 
 6--    61-- w  codeobj  
 5--    60-- w  foo.py
 4--         - 
 3--         - 
 2--         - 
 1--    36-- r  libc
 0--    35-- r  python
└────────────────┘
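
The same three steps are visible from
Python itself. A sketch, assuming a foo.py
sits in the current directory:

source = open('foo.py').read()            # the writable foo.py page
code = compile(source, 'foo.py', 'exec')  # the writable codeobj page
exec(code)                                # and now foo.py starts up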
Only then does foo.py start up and
create legitimately unique program data
┌────────────────┐
 9--    59-- w  stack
 8--         - 
 7--    63-- w  data     
 6--    61-- w  codeobj
 5--    60-- w  foo.py
 4--         - 
 3--         - 
 2--         - 
 1--    36-- r  libc
 0--    35-- r  python
└────────────────┘
The heap winds up as a mix of actual
unique data, together with shared code
┌────────────────┐
 9--    59-- w  stack
 8--         - 
 7--    63-- w  data     
 6--    61-- w  codeobj
 5--    60-- w  foo.py
 4--         - 
 3--         - 
 2--         - 
 1--    36-- r  libc
 0--    35-- r  python
└────────────────┘
Now it just so happens that every
Python X.Y process loading foo.py will
create the exact same code object—
—but the OS does not know about this
because each Python process builds
its code objects separately
To the OS, Python code just looks like heap
┌────────────────┐         ┌────────────────┐
 9--    59-- w          9--    48-- w  stack
 8--         -           8--         - 
 7--    63-- w          7--    71-- w  data
 6--    61-- w          6--    69-- w  codeobj
 5--    60-- w          5--    62-- w  foo.py
 4--         -           4--         - 
 3--         -           3--         - 
 2--         -           2--         - 
 1--    36-- r           1--    36-- r  libc
 0--    35-- r           0--    35-- r  python
└────────────────┘         └────────────────┘

Q: Why not share code objects?

Q: Why not make them read-only?

A: Reference Counts

Standard C Python does not
support read-only code objects
because it constantly fiddles
with their reference counts

What does this code do?

f(5)
f(5)
  1. Looks up f
  2. Gets back a code object
  3. Increments its reference count
  4. Calls it with (5,)
  5. Decrements its reference count

Why increment the count?

To prevent other threads
from de-allocating and overwriting
f() while this thread is busy running it
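
You can watch the count move from Python
itself (CPython only; sys.getrefcount
reports one extra reference for its
own argument):

import sys

def f(x):
    return x

print(sys.getrefcount(f))   # e.g. 2: the name "f" plus the argument
g = f                       # bind one more name to the function
print(sys.getrefcount(f))   # one higher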
Lesson: the OS offers a cool optimization
when processes need to share code, but
reference-counted code objects cannot
take advantage of this

Now, another basic concept—

Dynamic Linking

Take another look
at our Python process
┌────────────────┐
 9--    59-- w  stack
 8--    63-- w  stack
 7--         - 
 6--    61-- w  heap
 5--    60-- w  heap
 4--         - 
 3--         - 
 2--    39-- r  libjson
 1--    36-- r  libc.so
 0--    35-- r  python
└────────────────┘

Dynamic Linking

The python binary and libc.so library
are separate files sitting on disk
┌────────────────┐
                
                
                
                
                
                
                
                
 1--    36-- r  libc.so
 0--    35-- r  python
└────────────────┘

Dynamic Linking

How does the Python interpreter find
the routines it needs in libc.so?
┌────────────────┐
                
                
                
                
                
                
                
                
 1--    36-- r  libc.so
 0--    35-- r  python
└────────────────┘

The Old Days

prog.c ──→ prog

But then prog.c got really big

         compile        link

cmdn.c  →  cmdn.o  ┐
opts.c  →  opts.o  ├──→  prog
slen.c  →  slen.o  ┘
So you only needed to re-compile
the source files you just edited

Q:

How does the linker
connect the module that calls foo()
with the module that defines it?
         compile        link

cmdn.c  →  cmdn.o  ┐
opts.c  →  opts.o  ├──→  prog
slen.c  →  slen.o  ┘

A:

Every .o file has a table
of names that it defines and names
that it needs someone else to define
$ nm p_move.o
         U _nc_panelhook
         U is_linetouched
00000000 T move_panel
         U mvwin
         U wtouchln
The linker plays matchmaker
between names needed
and names provided
Linking returns an error
if a name needed cannot be found
$ ld -o prog foo.o bar.o baz.o

foo.o: In function `main':
foo.c:(.text+0x7): undefined reference to `haute'
collect2: ld returned 1 exit status
People started sharing useful .o
object files with their friends
So they used the ar archive tool
to bundle together related .o
files into single .a files
You can still find .a files
on OS X and Linux today
$ ar t /usr/lib/libpanel.a
panel.o
p_above.o
p_below.o
p_bottom.o
p_delete.o
p_hide.o
p_hidden.o
p_move.o
p_new.o
p_replace.o
p_show.o
p_top.o
p_update.o
p_user.o
p_win.o
So the linker was given the power
to search an .a archive and link
your program with the extra .o
files that it needs
The .a file full of .o files
came to be called a “library”

Problem:

If everyone lets the linker
copy printf.o into their program,
then memory has to hold many separate
copies of each popular library

Solution:

The invention of .so
“shared object” files that
can be added to the page table
of every program that needs them
┌────────────────┐
       .        
       .        
       .        
 1--    36-- r  libc.so
 0--    35-- r  python
└────────────────┘
The last-minute linking
that takes place at runtime
to connect python to libc.so
is called Dynamic Linking
┌────────────────┐
       .        
       .        
       .        
 1--    36-- r  libc.so
 0--    35-- r  python
└────────────────┘

Oddly enough

The .so shared library is all
you need to run a program
but
you still need that old-fashioned
.a library to compile the program
That's why you need to install libxml2-dev
to compile the Python lxml module, even if
/usr/lib/libxml2.so.2.7.8
already exists on your system

Anyway

You can see the libraries that a
program like python needs with ldd
$ ldd /usr/bin/python2.7
      linux-gate.so.1
      libpthread.so.0
      libdl.so.2
      libutil.so.1
      libssl.so.1.0.0
      libcrypto.so.1.0.0
      libz.so.1
      libm.so.6
      libc.so.6
      /lib/ld-linux.so.2
You can see the specific names
it needs with nm (“names”)
$ nm -D /usr/bin/python2.7
           .
           .
           .
        U unlink
        U unsetenv
        U utime
        U utimes
        U wait
        U wait3
        U wait4
        U waitpid
        U wcscoll
        U write
Not only is it usual for the
Python binary to be dynamically linked
but Python extension modules
sometimes link against
shared libraries too
the lxml.etree Python module
is famous for this
$ ldd /usr/lib/pyshared/python2.7/lxml/etree.so
       linux-gate.so.1
       libxslt.so.1
       libexslt.so.0
       libxml2.so.2
       libpthread.so.0
       libc.so.6
       libm.so.6
       libgcrypt.so.11
       libdl.so.2
       libz.so.1
       /lib/ld-linux.so.2
       libgpg-error.so.0

Shared libraries cause problems

When a program chooses not to carry a
static copy of every library that it needs,
there is the danger that a library it
needs will be upgraded or removed
As an example, let us build:
  1. a small shared library
  2. a Python module that uses it

Shared library: libtiny.so

/* libtiny.c */
helper() { return 42; }

gcc libtiny.c -shared -o libtiny.so

Python module: tinymodule.so

/* tinymodule.c */

#include <python2.7/Python.h>

static PyMethodDef TinyMethods[] = {
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC inittiny(void)
{
    helper();  /* the call into libtiny.so */
    (void) Py_InitModule("tiny", TinyMethods);
}
gcc tinymodule.c -L. -ltiny -shared \
-o tinymodule.so
Having created these two *.so files we
need to add the current directory to
the OS shared library search path
$ export LD_LIBRARY_PATH=.
So the module tinymodule.so
needs the library libtiny.so
$ ldd tinymodule.so
       linux-gate.so.1
       libtiny.so
       libc.so.6
       /lib/ld-linux.so.2

Shared library present, compatible

$ ls
libtiny.c   tinymodule.c
libtiny.so  tinymodule.so

$ python -c 'import tiny'

$ echo $?
0
What if Python
can find tinymodule.so
but the OS cannot find
its dependency libtiny.so?

Shared library missing

$ rm libtiny.so
$ python -c 'import tiny'
ImportError: libtiny.so:
    cannot open shared object file:
    No such file or directory

Incompatible shared library

/* libtiny.c */
different_helper() { return 42; }

gcc libtiny.c -shared -o libtiny.so

Incompatible shared library

$ python -c 'import tiny'
ImportError: ./tinymodule.so:
    undefined symbol: helper
$ nm tinymodule.so
...
00000354 T _init
         U helper
0000047c T inittiny

$ nm libtiny.so
...
000002f4 T _init
0000041c T different_helper
“Cannot open shared object”
and
“undefined symbol”
are possible because the OS tries to
link a binary (tinymodule.so) to
its dependencies (libtiny.so)
at runtime
And now, for another puzzle piece:

Demand Paging

Demand Paging

The OS does not load pages
from disk until your program
reads from or writes to them
Imagine a program that
has just started running

Files:

python — 3 pages
libc.so — 2 pages
At first, only one page of
the python binary is loaded
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         - 
 6--    61-- w   heap
 5--         - 
 4--         -  (libc.so)
 3--         -  (libc.so)
 2--         -  (python)
 1--         -  (python)
 0--    35-- r   python  LOAD FROM DISK
└────────────────┘
Then Python tries to run the instruction
at 212 so that page gets loaded too
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         - 
 6--    61-- w   heap
 5--         - 
 4--         -  (libc.so)
 3--         -  (libc.so)
 2--    37-- r   python  LOAD FROM DISK
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
Then Python calls a function at libc
offset 178, thus address 300 + 178 = 478
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         - 
 6--    61-- w   heap
 5--         - 
 4--    41-- r   libc.so  LOAD FROM DISK
 3--         -  (libc.so)
 2--    37-- r   python
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
If Python keeps calling the same routines,
then only those 3 pages ever get loaded
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         - 
 6--    61-- w   heap
 5--         - 
 4--    41-- r   libc.so
 3--         -  (libc.so)
 2--    37-- r   python
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
So pages are loaded
from disk lazily
Heap allocation works
the same way
Q: How many RAM pages are allocated if you
ask the OS for three new memory pages?
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         - 
 6--         - 
 5--         - 
 4--    61-- w   heap
 3--         - 
 2--    41-- r   libc.so
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
A:
None!
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         -  (heap)
 6--         -  (heap)
 5--         -  (heap)
 4--    61-- w   heap
 3--         - 
 2--    41-- r   libc.so
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
The OS allocates no pages but simply makes
an adjustment to its record-keeping
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         -  (heap)  
 6--         -  (heap)  
 5--         -  (heap)  
 4--    61-- w   heap
 3--         - 
 2--    41-- r   libc.so
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
The OS will now respond to page faults
at 500–799 by allocating a new page,
not raising a Segmentation Fault
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         -  (heap)  
 6--         -  (heap)  
 5--         -  (heap)  
 4--    61-- w   heap
 3--         - 
 2--    41-- r   libc.so
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
But no pages are actually allocated until
a read or write shows they are needed
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--         -  (heap)  
 6--         -  (heap)  
 5--         -  (heap)  
 4--    61-- w   heap
 3--         - 
 2--    41-- r   libc.so
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
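
You can watch this laziness from Python.
A sketch, best viewed with top running
in another terminal:

import mmap

big = mmap.mmap(-1, 300 * 1024 * 1024)  # ~300 MB promised: VIRT jumps
big[0:1] = b'x'                         # one write faults one page in: RES barely moves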

That is the difference

between VIRT and RES
when you run top

top

                    

  PID USER    VIRT  RES  SHR S %CPU %MEM COMMAND
 1217 root    195m  94m  53m S    2  2.6 Xorg
 7304 brandon 183m  45m  19m S    2  1.3 chromium
 2027 brandon 531m 162m  37m S    1  4.5 chromium
 2319 brandon 179m  58m  23m S    1  1.6 chromium
18841 brandon 2820 1188  864 R    1  0.0 top
 9776 brandon 175m  49m  12m S    0  1.4 chromium

                    
VIRT
Virtual Image
All memory pages that promise not
to return a Segmentation Fault
RES
Resident Size
Only memory pages that have
actually been allocated
VIRT — 8 pages
RES — 4 pages
┌────────────────┐         VIRT  RES
 9--    59-- w   stack       ✓    ✓
 8--         - 
 7--         -  (heap)       ✓
 6--         -  (heap)       ✓
 5--         -  (heap)       ✓
 4--    61-- w   heap        ✓    ✓
 3--         - 
 2--    41-- r   libc.so     ✓    ✓
 1--         -  (python)     ✓
 0--    35-- r   python      ✓    ✓
└────────────────┘
Only RES can actually
run you out of memory
Look at RES when you wonder
where all of your RAM is going
                     

  PID USER    VIRT  RES  SHR S %CPU %MEM COMMAND
 1217 root    195m  94m  53m S    2  2.6 Xorg
 7304 brandon 183m  45m  19m S    2  1.3 chromium
 2027 brandon 531m 162m  37m S    1  4.5 chromium
 2319 brandon 179m  58m  23m S    1  1.6 chromium
18841 brandon 2820 1188  864 R    1  0.0 top
 9776 brandon 175m  49m  12m S    0  1.4 chromium

                     

/proc/<pid>/smaps

Shows whether physical RAM pages
have been allocated to a segment
Here are some segments of a new python
process sitting quietly at its prompt
08048000-08238000 r-xp .../bin/python
Size:               1984 kB
Rss:                 896 kB

b73b9000-b752f000 r-xp .../libc-2.13.so
Size:               1496 kB
Rss:                 528 kB

bfc48000-bfc69000 rw-p [stack]
Size:                136 kB
Rss:                 104 kB
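
If you want one number per process, you
can total the Rss of every segment.
A small Linux-only sketch:

import os

def total_rss_kb(pid):
    # Sum the "Rss:" line of every segment in /proc/<pid>/smaps
    total = 0
    with open('/proc/%d/smaps' % pid) as f:
        for line in f:
            if line.startswith('Rss:'):
                total += int(line.split()[1])
    return total

print('%d kB resident' % total_rss_kb(os.getpid()))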

Demand Paging: Example

Calling a Python function that
was not called as Python started up
08048000-08238000 r-xp .../bin/python
Size:               1984 kB
Rss:                 896 kB

>>> frozenset()

08048000-08238000 r-xp .../bin/python
Size:               1984 kB
Rss:                 916 kB

916 - 896 = 20 new kilobytes

So invoking new sections of a binary
or library pulls more pages into RAM
But since binary and library
code can always be reloaded from disk,
the OS can also discard those pages
$ python -c 'range(500000000)'
This allocates lots of memory
As it runs, your OS will
dump information overboard to
reclaim every RAM page it can
What did this memory hog do
to the other Python process that
was sitting quietly at its prompt?
before the hog ran:

08048000-08238000 r-xp .../bin/python
Size:               1984 kB
Rss:                 916 kB

after:

08048000-08238000 r-xp .../bin/python
Size:               1984 kB
Rss:                 464 kB

Look at the change in Rss!

916 - 464 = 452k were thrown overboard!

but python will never know, right?

(well, there is one difference)

The process will slow down as
de-allocated pages are re-loaded
like leaving the building to roam
the earth for one year and three months.
— Gustavo Duarte
That's why your web browser takes
so long to start the first time

And now, a Unix superpower!

Forking

On primitive operating systems the
only way to create a new process
is to start a whole new program
The Python interpreter requires
several steps to get going —
A

B

C

D
— then to start a second worker process
you have to send it through the same startup
Process 1  A
           ↓
           B
           ↓
           C
           ↓
           D ── os.spawn() ──→  A   Process 2
                                ↓
           .                    B
           .                    ↓
           .                    C
                                ↓
           .                    D
But Linux and OS X support fork()!
“If you come to a fork
in the road, take it!”
fork() creates a child process that continues
on from the parent's current state
Process 1  A
           ↓
           B
           ↓
           C
           ↓
           D ── os.fork() ──→   Process 2
           ↓                    ↓
           E                    E
           ↓                    ↓
           F                    F
The OS could implement fork() naively
by copying every writeable page in the parent
so that the child had its own copy
  fork() parent              fork() child
┌────────────────┐         ┌─────────────────┐
 9--    59-- w  stack    9--   new copy 
 8--    63-- w  stack    8--   new copy 
 7--         -           7--          - 
 6--    61-- w  heap     6--   new copy 
 5--    60-- w  heap     5--   new copy 
 4--         -           4--          - 
 3--         -           3--          - 
 2--    39-- r  libjson  2--    (same)  
 1--    36-- r  libc     1--    (same)  
 0--    35-- r  python   0--    (same)  
└────────────────┘         └─────────────────┘
But immediately copying every page could
take a long time for a large process,
and during the copy both parent and
child would be hung waiting!
Instead, as you might guess,
the OS implements fork()
lazily,
copying pages on-demand
when the parent or child
performs a write
So the OS starts fork() by copying the page
table, marking all pages read-only in both
processes, then letting them keep running
  fork() parent              fork() child
┌────────────────┐         ┌────────────────┐
 9--    59-- r  stack    9--    59-- r 
 8--    63-- r  stack    8--    63-- r 
 7--         -           7--         - 
 6--    61-- r  heap     6--    61-- r 
 5--    60-- r  heap     5--    60-- r 
 4--         -           4--         - 
 3--         -           3--         - 
 2--    39-- r  libjson  2--    (same) 
 1--    36-- r  libc     1--    (same) 
 0--    35-- r  python   0--    (same) 
└────────────────┘         └────────────────┘
As writes cause page faults the OS makes copies.
Here page 63 is copied to 77 and marked w.
    fork() parent                 fork() child
  ┌────────────────┐           ┌────────────────┐
   9--    59-- r    stack    9--    59-- r 
  8--    63-- w   stack   8--    77-- w  
   7--         -             7--         - 
   6--    61-- r    heap     6--    61-- r 
   5--    60-- r    heap     5--    60-- r 
   4--         -             4--         - 
   3--         -             3--         - 
   2--    39-- r    libjson  2--    (same) 
   1--    36-- r    libc     1--    (same) 
   0--    35-- r    python   0--    (same) 
  └────────────────┘           └────────────────┘
Q: How well does this usually work?
A: It works great!
Only the differences that develop
between the parent and child memory
images incur additional storage
Q: But how does this work for Python?
A: Terribly!
Thanks to reference counts,
merely glancing at data with C Python
forces the OS to create a separate copy
# Create a list

import os, time

biglist = ['foo']
for n in range(1, 8460000):
    biglist.append(n)   # ~100 MB

os.fork()   # the fork() goes here

# Iterate across the list

t0 = time.time()
all(n for n in biglist)
t1 = time.time()

C Python

Before C Python iterates

$ cat /proc/22364/smaps | awk '/heap/,/Private_D/'
08d60000-0ef90000 rw-p 00000000 00:00 0     [heap]
Size:             100544 kB
Rss:              100544 kB
Pss:               50294 kB
Shared_Clean:          0 kB
Shared_Dirty:     100500 kB
Private_Clean:         0 kB
Private_Dirty:        44 kB

After C Python iterates

$ cat /proc/22364/smaps | awk '/heap/,/Private_D/'
08d60000-0ef90000 rw-p 00000000 00:00 0     [heap]
Size:             100544 kB
Rss:              100544 kB
Pss:              100274 kB
Shared_Clean:          0 kB
Shared_Dirty:        540 kB
Private_Clean:         0 kB
Private_Dirty:    100004 kB

C Python

At finish: 0.5% of heap shared

PyPy 1.8

Before PyPy iterates

$ cat /proc/22385/smaps | awk '/heap/,/Private_D/'
0a5c9000-11fba000 rw-p 00000000 00:00 0     [heap]
Size:             124868 kB
Rss:              124540 kB
Pss:               62274 kB
Shared_Clean:          0 kB
Shared_Dirty:     124532 kB
Private_Clean:         0 kB
Private_Dirty:         8 kB

After PyPy iterates

$ cat /proc/22385/smaps | awk '/heap/,/Private_D/'
0a5c9000-11fba000 rw-p 00000000 00:00 0     [heap]
Size:             124868 kB
Rss:              124868 kB
Pss:               63160 kB
Shared_Clean:          0 kB
Shared_Dirty:     123416 kB
Private_Clean:         0 kB
Private_Dirty:      1452 kB

PyPy 1.8

At finish: 96.1% of heap shared

Lesson:

Forked worker processes in Unix
share memory when the parent process
builds read-only data structures
but not in C Python

Explicitly Sharing Memory

How can two Python threads or processes
share writeable memory
to collaborate?
  1. Threads
  2. Memory maps

1. Threads

Creating a thread is like fork()
except that the heap remains shared
between the two threads of control
Python supports threads
on both Unix and Windows
import threading

def myfunc():
    print('hello from the child thread')

t = threading.Thread(target=myfunc)
t.start()
Each thread gets its own stack so the two
threads can call different functions, but
all other data structures are shared
    main thread               child thread
┌────────────────┐         ┌────────────────┐
 9--    59-- r   stack   9--    59-- r 
 8--    63-- w   stack   8--    77-- w 
└────────────┬───┴─────────┴──┬─────────────┘
              7--         - 
              6--    61-- w  heap
              5--    60-- w  heap
              4--         - 
              3--         - 
              2--    39-- r  libjson
              1--    36-- r  libc
              0--    35-- r  python
             └────────────────┘
Since threads
share every data structure
they have to be very careful
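
“Careful” usually means a lock around
anything both threads mutate.
A minimal sketch:

import threading

counter = 0
lock = threading.Lock()

def work():
    global counter
    for _ in range(100000):
        with lock:          # only one thread mutates at a time
            counter += 1

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)              # always 200000 with the lock held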

2. Memory Maps

Using mmap() a parent process
can create shared memory that will be
inherited by all the child workers it forks
import mmap, os

map = mmap.mmap(-1, 100)
os.fork()
...
  fork() parent                 fork() child
┌────────────────┐           ┌────────────────┐
 9--    59-- r    stack    9--    59-- r 
 8--    63-- w    stack    8--    77-- w 
└─────────────┬──┴───────────┴─┬──────────────┘
               7--    88-- -  mmap segment
┌─────────────┴──┬───────────┬─┴──────────────┐
 6--    61-- r    heap     6--    61-- r 
 5--    60-- r    heap     5--    60-- r 
 4--         -             4--         - 
 3--         -             3--         - 
 2--    39-- r    libjson  2--    (same) 
 1--    36-- r    libc     1--    (same) 
 0--    35-- r    python   0--    (same) 
└────────────────┘           └────────────────┘
This supports very fast RAM-based
communication between processes
without requiring
every data structure on the heap
to carry locks or other protection
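
A minimal sketch of the idea, assuming
a Unix system: the child writes into the
inherited map, and the parent reads the
bytes back out.

import mmap, os

shared = mmap.mmap(-1, 100)   # anonymous map, inherited across fork()
if os.fork() == 0:            # child
    shared[0:5] = b'hello'
    os._exit(0)
os.wait()                     # wait for the child to finish writing
print(shared[0:5])            # b'hello' -- written by the child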

Another mmap() capability

Remember how the OS loads pages
from binary programs like python
and shared libraries like libc.so
on-demand, not all at once?
Well, with mmap() you can do
that yourself, with normal files!
mmap(myfile.fileno(), 0) lets you
replace seek(), read(), and write() calls
and simply access it like an array!
┌────────────────┐
 9--    59-- w   stack
 8--         - 
 7--    91-- -   mmap[1]  myfile
 6--    90-- -   mmap[0]  myfile
 5--         - 
 4--    61-- w   heap
 3--         - 
 2--    41-- r   libc.so
 1--         -  (python)
 0--    35-- r   python
└────────────────┘
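
For example, a sketch assuming an existing
file example.txt at least five bytes long:

import mmap

with open('example.txt', 'r+b') as myfile:
    data = mmap.mmap(myfile.fileno(), 0)  # map the whole file
    print(data[0:5])     # read bytes without calling read()
    data[0:1] = b'X'     # writes land back in the file
    data.close()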

So, there you have it

Your processes all access memory
through the mediation of a page table
Page tables power all
kinds of fun RAM optimizations:
  1. Demand loading
  2. Shared libraries
  3. Memory maps
  4. fork()
  5. Threads

And

PyPy lacks reference counts
which lets the OS do its magic
and conserve resources

Thank you!