All Your Ducks In A Row
Data Structures in the
Standard Library and Beyond
@brandon_rhodes
PyCon 2014
Montréal

This talk

How data structures work
What they can do efficiently
Which data structure is Python’s
most dangerous

list

Computer memory
is a byte array
Bytes are named
by sequential integers
called addresses
 0 1 2 3 4 5 6 7 8 ...
┌─┬─┬─┬─┬─┬─┬─┬─┬─
          ...
└─┴─┴─┴─┴─┴─┴─┴─┴─
To save space, I will
write them in rows of 8
┌─┬─┬─┬─┬─┬─┬─┬─┐
               
├─┼─┼─┼─┼─┼─┼─┼─┤
               
├─┼─┼─┼─┼─┼─┼─┼─┤
               
├─┼─┼─┼─┼─┼─┼─┼─┤
      ...      
RAM chips are massively parallel
devices that provide random access

This talk ignores

Why do I ignore those issues?

The Standard Library does

Addresses

You can add and subtract
from one address to visit
other addresses near it
Address arithmetic supports:
Records
Arrays

Record

Sequence of fields
in an agreed order
┌─┬─┬─┬─┬─┬─┬─┬─┐
reference count
├─┼─┼─┼─┼─┼─┼─┼─┤
address of type  ...
├─┼─┼─┼─┼─┼─┼─┼─┤
       ...
┌─┬─┬─┬─┬─┬─┬─┬─┐
reference count
├─┼─┼─┼─┼─┼─┼─┼─┤
address of type  <type 'int'>
├─┼─┼─┼─┼─┼─┼─┼─┤
      123      
└─┴─┴─┴─┴─┴─┴─┴─┘
┌─┬─┬─┬─┬─┬─┬─┬─┐
reference count
├─┼─┼─┼─┼─┼─┼─┼─┤
address of type  <type 'float'>
├─┼─┼─┼─┼─┼─┼─┼─┤
2.997924580e+08
└─┴─┴─┴─┴─┴─┴─┴─┘
┌─┬─┬─┬─┬─┬─┬─┬─┐
reference count
├─┼─┼─┼─┼─┼─┼─┼─┤
address of type  <type 'str'>
├─┼─┼─┼─┼─┼─┼─┼─┤
   length = 6  
├─┼─┼─┼─┼─┼─┼─┼─┤
     hash      
├─┼─┼─┼─┼─┼─┼─┼─┤
     flags     
├─┼─┼─┼─┼─┼─┼─┼─┤
  address = 0  
├─┼─┼─┼─┼─┼─┼─┴─┘
H e l l o !│
└─┴─┴─┴─┴─┴─┘
You retrieve a field
from a record by adding:
record’s start address
+
field’s offset
       ┌─┬─┬─┬─┬─┬─┬─┬─┐
a + 0  reference count
       ├─┼─┼─┼─┼─┼─┼─┼─┤
a + 8  address of type  <type 'int'>
       ├─┼─┼─┼─┼─┼─┼─┼─┤
a + 16       123      
       └─┴─┴─┴─┴─┴─┴─┴─┘

struct

>>> from struct import pack, unpack
>>> data = pack('IIf', 2, 9, 20.0)
>>> len(data)
12
>>> data.hex()
'02000000090000000000a041'
>>> unpack('IIf', data)
(2, 9, 20.0)
A struct can be useful
for binary conversations
with C libraries or I/O

Array

a + 0 1 2 3 4 5
   ┌─┬─┬─┬─┬─┬─┐
   H e l l o !│
   └─┴─┴─┴─┴─┴─┘
Character i
lives at (a + i)
           ┌─┬─┬─┬─┬─┬─┬─┬─┐
a+0  a+4   h      |e      |
           ├─┼─┼─┼─┼─┼─┼─┼─┤
a+8  a+12  l      |l      
           ├─┼─┼─┼─┼─┼─┼─┼─┤
a+16 a+20  o      |!      
           └─┴─┴─┴─┴─┴─┴─┴─┘
Character i
lives at (a + 4i)
Given array items b bytes long,
item i lives at (a + bi)
Record — addition, heterogeneous
Array — multiplication, homogeneous
Both give you immediate
access to the data you want

array

>>> from array import array
>>> a = array('d', [1.0, 2.0, 3.0, 4.0, 5.0])
>>> a[3]
4.0
>>> a[3] = 44.0
>>> a
array('d', [1.0, 2.0, 3.0, 44.0, 5.0])
>>> a.itemsize
8
>>> len(a)
5
>>> len(a) * a.itemsize
40
An array is useful
for binary conversations
with C libraries or I/O
It also packs data
very tightly in RAM
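The (a + bi) address math is visible from Python itself: an array’s `buffer_info()` reports the start address of its underlying C buffer, and `itemsize` is the b in the formula. A small sketch:

```python
from array import array

a = array('d', [1.0, 2.0, 3.0])
address, count = a.buffer_info()   # start address and item count

# Item i lives at (address + a.itemsize * i),
# exactly the "a + bi" formula above.
print(a.itemsize)                  # 8 on typical platforms (C double)
print(address + a.itemsize * 2)    # the address where a[2] lives
```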

Specialized arrays

str
unicode
StringIO
memoryview
bytearray
bytes

But

Accessing its items
from Python requires
repeated object building
┌─┬─┬─┬─┬─┬─┬─┬─┐
reference count
├─┼─┼─┼─┼─┼─┼─┼─┤
address of type  <type 'float'>
├─┼─┼─┼─┼─┼─┼─┼─┤
   1.234 e+00  
└─┴─┴─┴─┴─┴─┴─┴─┘

↑ ↓

┌─┬─┬─┬─┬─┬─┬─┬─┐
 1.234  5.678 
├─┼─┼─┼─┼─┼─┼─┼─┤
For example,
to sum() an array of 100 floats,
Python builds >100 float objects

NumPy

Python array — for C or I/O
NumPy array — for math!
+ - * / // % divmod() ** pow()
<< >> & | ^ < <= > >= == !=
NumPy arrays support
math operations directly
without building Python
number objects!
>>> import numpy as np
>>> a = np.arange(0, 5)
>>> print(a)
[0 1 2 3 4]

>>> a.sum()
10

>>> print(a * a)
[ 0  1  4  9 16]

>>> print(a + 10)
[10 11 12 13 14]
Use NumPy or Pandas to do
math on millions of numbers
without needing millions
of Python objects
(see also Blaze)
Note that PyPy auto-detects
a list with only int or float items
and substitutes a fast array
Python itself is a
“dynamic language” where
“everything is an object”
So Python’s main data structures
are going to be general,
capable of holding any objects

tuple

>>> tup = (1, 2.0, 'Hello!')
Wait!
Those three objects
are going to be of
different sizes!
>>> tup = (1, 2.0, 'Hello!')
Array address math a + bi
depends upon every element
being b bytes long
“All problems in computer
science can be solved
by another level
of indirection”
— David Wheeler
┌─┬─┬─┬─┬─┬─┬─┬─┐
reference count
├─┼─┼─┼─┼─┼─┼─┼─┤
address of type  <type 'tuple'>
├─┼─┼─┼─┼─┼─┼─┼─┤
       3       
├─┼─┼─┼─┼─┼─┼─┼─┤
    address      <int 1>
├─┼─┼─┼─┼─┼─┼─┼─┤
    address      <float 2.0>
├─┼─┼─┼─┼─┼─┼─┼─┤
    address      <str 'Hello!'>
└─┴─┴─┴─┴─┴─┴─┴─┘

indirection

“Python: the language
that gives you data structures,
without any of the actual data”

indirection

Moving an item in a general
purpose Python data structure means
merely copying an address

indirection

There is only one copy of an object,
regardless of how many times it appears
in a tuple, list, or dict
len(tup)
tup[i]
Both length and get-item require
only a single fetch of 8 bytes
>>> from collections import namedtuple
>>> City = namedtuple('City', 'name province')
>>> m = City('Montréal', 'Québec')

>>> m.name
'Montréal'
>>> m.province
'Québec'

tuple

Note that a tuple
knows its length
This supports bounds checking!
>>> t = (1, 2, 3)
>>> len(t)
3
>>> t[9]
Traceback (most recent call last):
  ...
IndexError: tuple index out of range

list

Lists are interesting
because they can grow

Problem:

To grow, the list
might need to move
if it’s hemmed in
─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─
#######│     list      │#######
─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─

But

Python objects can never
change address!
      names
    references
        
─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─
#######│     list      │#######
─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─
“All problems in computer
science can be solved
by another level
of indirection”
— David Wheeler
┌─┬─┬─┬─┬─┬─┬─┬─┐
reference count
├─┼─┼─┼─┼─┼─┼─┼─┤
address of type  <type 'list'>
├─┼─┼─┼─┼─┼─┼─┼─┤
       3       
├─┼─┼─┼─┼─┼─┼─┼─┤   ┌─┬─┬─┬─┬─┬─┬─┬─┐
 array address       address     
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
       4              address     
└─┴─┴─┴─┴─┴─┴─┴─┘   ├─┼─┼─┼─┼─┼─┼─┼─┤
                        address     
                    ├─┼─┼─┼─┼─┼─┼─┼─┤
                           -       
                    └─┴─┴─┴─┴─┴─┴─┴─┘

Q:

Why does the list have
an extra fourth slot?
┌─┬─┬─┬─┬─┬─┬─┬─┐
reference count
├─┼─┼─┼─┼─┼─┼─┼─┤
address of type  <type 'list'>
├─┼─┼─┼─┼─┼─┼─┼─┤
       3       
├─┼─┼─┼─┼─┼─┼─┼─┤   ┌─┬─┬─┬─┬─┬─┬─┬─┐
 array address       address     
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
       4              address     
└─┴─┴─┴─┴─┴─┴─┴─┘   ├─┼─┼─┼─┼─┼─┼─┼─┤
                        address     
                    ├─┼─┼─┼─┼─┼─┼─┼─┤
                           -       
                    └─┴─┴─┴─┴─┴─┴─┴─┘

A:

If a Python list reserved
no extra room then every append
would require reallocation!
Reallocation is expensive:
each item address must be
copied to its new home
So a reallocation on every
append() would quickly add up
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> sum(range(10))
45
>>> sum(range(1000))
499500
>>> sum(range(1000000))
499999500000
This is my standard
thousand-and-million
thought experiment!
“What is the cost of doing this
operation a thousand times?
And a million times?”
If the answer grows zeros at twice
the rate of the question, I lose
           thousand   million
O(n)       3 zeros    6 zeros
O(n lg n)  4 zeros    7 zeros
O(n²)      6 zeros    12 zeros

Million seconds

12 days

Billion seconds

31 years

Trillion seconds

31,688 years
So append() on a full list
asks for extra item slots
Sizes: 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
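The over-allocation is visible from `sys.getsizeof()`: the reported size stays flat across several appends, then jumps at each reallocation. (The exact byte counts are a CPython implementation detail and vary by version.)

```python
import sys

lst = []
sizes = []
for _ in range(20):
    lst.append(None)
    sizes.append(sys.getsizeof(lst))

# Sizes plateau between reallocations, then jump:
print(sizes)
```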
Designing a data structure
this way is called amortization
because it spreads a cost over time,
just like a mortgage does for a loan
Appends        Reallocations  Address copies
1,000          27             8,603
1,000,000      85             9,692,995
1,000,000,000  143            9,058,551,346
Amortized costs tend to be bumpy
You add the 991st item,
the list grows to 1,120 slots,
990 item addresses get copied
Painful
But, in return, the next 130 items
append at no additional cost

amortization

The Python list illustrates a trade-off:
to save time we must waste space
with all of those unused list slots
┌───────────────────────┬───────────┐
   991 item addresses   129 slots 
└───────────────────────┴───────────┘

Speed vs space

Python lists, on average, use
94% of their slots — consuming
6% more space than they use
In return, they are fast!

list

Python’s
most dangerous
data structure!
# Single-item:

s[i]
s[i] = v
del s[i]
s.append(v)
s.insert(i, v)
s.pop(i)
v in s
# Several-item:

t = s[7:-7]
s == t
s.count('Montréal')
s.index('Montréal')
s.remove('Montréal')
for item in s: ...

problem

# Single-item operations that touch:
# 1 item              # n items

s[i]
s[i] = v
del s[-1]             del s[0]
s.append(v)
s.insert(len(s), v)   s.insert(0, v)
s.pop()               s.pop(0)
                      v in s
# Reverse order

while mylist:
    item = mylist.pop(0)
    print(item)
Million items?
Half trillion operations!
for item in reversed(mylist):
    print(item)

list

Python’s
most dangerous
data structure!

slicing

copy vs view

Making a slice of size n
costs n address copies
t = s[400:500]
┌─────────────────────────┐
      1,000 item list    
└─────────────────────────┘
 8 * 100 = 800 bytes copied
             
     ┌───────────────┐
      100 item list 
     └───────────────┘
Expensive slicing annoys both
scientists with data and data scientists
So NumPy and Pandas slices
are simple views that copy no data
t = s[400:500]
┌────────────────────────┐
      1,000 floats      
└────────────────────────┘
             
        ┌─────────┐
          slice  
        +400 <500
        └─────────┘
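The view behavior is easy to demonstrate: a NumPy slice borrows the original buffer, so writes through the slice are visible in the parent array.

```python
import numpy as np

a = np.arange(1000.0)
t = a[400:500]        # a view onto a's buffer: no floats copied

t[0] = -1.0           # writing through the view...
print(a[400])         # ...changes the original array: -1.0
print(t.base is a)    # True: t owns no data of its own
```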

dict

A list does not let you name slots
but imposes an integer index
      ┌─┬─┬─┬─┬─┬─┬─┬─┐
s[0]      address      item
      ├─┼─┼─┼─┼─┼─┼─┼─┤
s[1]      address      item
      ├─┼─┼─┼─┼─┼─┼─┼─┤
s[2]      address      item
      ├─┼─┼─┼─┼─┼─┼─┼─┤
s[3]      address      item
      ├─┼─┼─┼─┼─┼─┼─┼─┤
              
But a dict lets
you choose keys
>>> d = {}
>>> d[1] = 'un'
>>> d[2.0] = 'deux'
>>> d['three'] = 'trois'
Behind each dict is an array
in which keys are stored at indexes
according to their hash value
This means that each slot
must store both key and value
┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
      hash       key address   value address 
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
      hash       key address   value address 
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
      hash       key address   value address 
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
      ...            ...            ...      
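A toy sketch of the placement rule — CPython’s real probing algorithm is more elaborate, but the core idea is hash-modulo-table-size (`slot_for` is a hypothetical helper, not a dict API):

```python
def slot_for(key, table_size=8):
    # A key's hash, reduced modulo the table size,
    # picks its home slot in the backing array.
    return hash(key) % table_size

# The same key always lands in the same slot,
# so a lookup can jump straight there:
print(slot_for('Montréal') == slot_for('Montréal'))  # True
```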

see

“The Mighty Dictionary”
PyCon 2010
A dict grows by doubling
or quadrupling its size,
so resize cost is amortized
A dict resizes at two-thirds full
to avoid too many collisions
Only ever ⅓–⅔ full

Speed vs space

Given a key, a dict
hashes it and can usually
jump right to its slot
All dictionary operations
are fast and therefore safe
for beginning programmers
# Single item   # Several item

d[key]          d == d2
d[key] = v      d.update(d2)
key in d        list(d)
d.get(key)      for k in d.keys(): ...
d.pop(key)      for k in d.values(): ...
del d[key]      for k, v in d.items(): ...
Common
dict-powered
tasks
>>> from collections import defaultdict
>>>
>>> animals = defaultdict(list)
>>> animals['Mammal'].append('dog')
>>> animals['Reptile'].append('lizard')
>>> animals['Mammal'].append('cat')
>>> dict(animals)
{'Mammal': ['dog', 'cat'],
 'Reptile': ['lizard']}
>>> # Most common 2 letters in a phrase
>>>
>>> from collections import Counter
>>> c = Counter('château frontenac, quebec')
>>> c.most_common(2)
[('e', 4), ('c', 3)]
Remember: you can build
compound dict keys with a tuple
{('google.com', 80): 7125773,
 ('google.com', 443): 1520951,
 ('heroku.com', 22): 245080,
 ('linkedin.com', 443): 5724124,
 ('rackspace.com', 443): 5002620}
Use the frozenset if the
data in your keys is unordered
project_languages = {
    frozenset(['python']): 328,
    frozenset(['python', 'js']): 217,
    frozenset(['ruby']): 274,
    frozenset(['ruby', 'js']): 261,
    frozenset(['js']): 79,
    }
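Because a frozenset hashes the same regardless of element order, either spelling of the key finds the same entry (the count below is the illustrative figure from the table above):

```python
project_languages = {
    frozenset(['python', 'js']): 217,
}

# Element order is irrelevant to the hash,
# so lookup works either way:
print(project_languages[frozenset(['js', 'python'])])  # 217
```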
Because the dict assigns
keys to array indexes by hash,
it iterates in arbitrary order
list — maintains order
dict — maintains association
Two data structures, two superpowers
list — maintains order
dict — maintains association
But what if we want both?

OrderedDict

>>> from collections import OrderedDict
>>>
>>> d = dict.fromkeys('Montréal')
>>> list(d.keys())
['a', 'r', 't', 'é', 'M', 'l', 'o', 'n']
>>>
>>> d = OrderedDict.fromkeys('Montréal')
>>> list(d.keys())
['M', 'o', 'n', 't', 'r', 'é', 'a', 'l']

OrderedDict

Email headers
HTTP headers
JSON fields

set

set

A dict with only
keys but no values
# Fast 1-member ops

member in t
t.add(member)
t.remove(member)

# Fast n-member ops

t | u
t & u
t - u

classes

By default, each attribute
is a key in obj.__dict__
But classes that pre-specify
__slots__ are implemented
as a struct (record) instead
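A minimal sketch of the difference:

```python
class Point:
    __slots__ = ('x', 'y')   # fixed record: no per-instance __dict__

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
print(p.x)                     # 1
print(hasattr(p, '__dict__'))  # False: attributes live in a struct,
                               # and new attributes cannot be added
```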

What’s left?

Fetching a range

Use bisect if your data is already
sorted in a list or array
>>> a = [1.0, 2.2, 3.0, 3.4, 3.5, 4.0, 4.5]
>>> import bisect
>>> left = bisect.bisect_left(a, 3.0)
>>> right = bisect.bisect_right(a, 4.0)
>>> a[left:right]
[3.0, 3.4, 3.5, 4.0]

Double-ended list

Use the deque

The list supports
efficient modification
at its right end
append()
pop()
These list operations
provide Python with its
native stack
append()
pop()
“We all have ‘stacks’ in our minds
when we are solving problems”
— Donald Knuth
append()
pop()
But the deque provides quick
operations at both ends
# Efficient operations:

>>> from collections import deque
>>> d = deque()
>>> d.append(5)
>>> d.appendleft(4)
>>> d.pop()
5
>>> d.popleft()
4
item → deque → item
item ← deque ← item
# Great for threading
# Great for multiprocessing
from queue import Queue
item → Queue → item

Popularity contest

The heapq lets you
fetch the top t of n items
in O(n + t lg n) time!
>>> import heapq
>>> a = [6, 2, 4, 3, 1, 5, 7]
>>> heapq.heapify(a)
>>> a
[1, 2, 4, 3, 6, 5, 7]
>>> print('     {}\n   {}\n{}'
...       .format(a[0:1], a[1:3], a[3:7]))
     [1]
   [2, 4]
[3, 6, 5, 7]
Push and pop in any order —
the smallest is always returned!
>>> heapq.heappop(a)
1
>>> heapq.heappop(a)
2
>>> heapq.heappush(a, 0)
>>> heapq.heappop(a)
0
Also nlargest() and nsmallest()
which avoid sorting entire sequence
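For example, with the same seven items as above:

```python
import heapq

a = [6, 2, 4, 3, 1, 5, 7]
print(heapq.nlargest(2, a))   # [7, 6]
print(heapq.nsmallest(2, a))  # [1, 2]
```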
PriorityQueue
A heapq with locks,
so threads can cooperate
on a prioritized work pile
sched.scheduler
A heapq that tells you when
the next task is due to run
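A minimal sketch: two events entered out of order still run in due-time order, because the scheduler’s heap keeps the soonest event on top.

```python
import sched
import time

events = []
s = sched.scheduler(time.time, time.sleep)

# enter(delay, priority, action, argument) queues a call
s.enter(0.02, 1, events.append, ('later',))
s.enter(0.01, 1, events.append, ('sooner',))
s.run()   # blocks until the queue is empty

print(events)   # ['sooner', 'later']
```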

So

Q:

Where are the old-fashioned
data structures based upon
chains of addresses?

A:

They make little sense
when our data structures
no longer contain data
but only addresses
# Linked list in the old days

┌─┬─┬─┬─┬─┬─┬─┬─┐   ┌─┬─┬─┬─┬─┬─┬─┬─┐
  next address     next address  
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
      data              data     
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
      ...               ...      
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
      ...               ...      
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
      ...               ...      
└─┴─┴─┴─┴─┴─┴─┴─┘   └─┴─┴─┴─┴─┴─┴─┴─┘
# Linked list when "everything is an object"

┌─┬─┬─┬─┬─┬─┬─┬─┐   ┌─┬─┬─┬─┬─┬─┬─┬─┐
  next address     next address  
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
object address    object address 
└─┴─┴─┴─┴─┴─┴─┴─┘   └─┴─┴─┴─┴─┴─┴─┴─┘
                           
    <object>            <object>
Both deque and OrderedDict
have doubly-linked lists inside
But most other linked
lists in Python are hidden
in the memory allocation layer
CPython → C malloc()
IronPython → CLR
Jython → JVM
Since your Python
objects are already inside of
malloc()-style linked lists,
adding a further level of
linking is usually redundant

Trees

Trees

Used extensively for data
that is actually a tree!
email.message.MIMEPart
xml.etree.ElementTree
logging.handlers

Q:

But what about the
traditional uses of trees?

A:

We have outsourced those
tasks to our persistence layers!
PostgreSQL index? B-tree!
MongoDB index? B-tree!
CouchDB index? B-tree!
SQLite index? B-tree!
The data systems of yore
were long-running and might
keep a data structure around for weeks
Our programs today
instead tend to be episodic and
concerned with individual events
This turns us towards
a functional programming style
that involves quick write-once
throw-away data structures
If simple data-centric code is the future,
per Uncle Bob Martin, Gary Bernhardt —

see

“The Clean Architecture in Python”
PyCon Ireland 2013
— then simple, elegant list and dict
are exactly the data structures
that will be taking us there

Thank you very much!

@brandon_rhodes