All Your Ducks In A Row

Data Structures in the

Standard Library and Beyond

@brandon_rhodes
PyCon 2014
Montréal

This talk

How data structures work

This talk

How data structures work

What they can do efficiently

This talk

How data structures work

What they can do efficiently

Which data structure is Python’s

most dangerous

list

Computer memory

is a byte array

Bytes are named
by sequential integers
called addresses

 0 1 2 3 4 5 6 7 8 ...
┌─┬─┬─┬─┬─┬─┬─┬─┬─
│ │ │ │ │ │ │ │ │  ...
└─┴─┴─┴─┴─┴─┴─┴─┴─

To save space, I will

write them in rows of 8

┌─┬─┬─┬─┬─┬─┬─┬─┐
│               │
├─┼─┼─┼─┼─┼─┼─┼─┤
│               │
├─┼─┼─┼─┼─┼─┼─┼─┤
│               │
├─┼─┼─┼─┼─┼─┼─┼─┤
│      ...      │

RAM chips are massively parallel

devices that provide random access

This talk ignores

Data structure compression
Memory hierarchy (but see Gustavo Duarte’s What Your Computer Does While You Wait and Dan Luu’s How Misaligning Data Can Increase Performance 12×)

Why do I ignore those issues?

The Standard Library does

Addresses

You can add and subtract
from one address to visit
other addresses near it

Address arithmetic supports:

Records

Arrays

Record

Sequence of fields

in an agreed order

┌─┬─┬─┬─┬─┬─┬─┬─┐
│reference count│
├─┼─┼─┼─┼─┼─┼─┼─┤
│address of type│ → ...
├─┼─┼─┼─┼─┼─┼─┼─┤
       ...

┌─┬─┬─┬─┬─┬─┬─┬─┐
│reference count│
├─┼─┼─┼─┼─┼─┼─┼─┤
│address of type│ → <type 'int'>
├─┼─┼─┼─┼─┼─┼─┼─┤
│      123      │
└─┴─┴─┴─┴─┴─┴─┴─┘

┌─┬─┬─┬─┬─┬─┬─┬─┐
│reference count│
├─┼─┼─┼─┼─┼─┼─┼─┤
│address of type│ → <type 'float'>
├─┼─┼─┼─┼─┼─┼─┼─┤
│2.997924580e+08│
└─┴─┴─┴─┴─┴─┴─┴─┘

┌─┬─┬─┬─┬─┬─┬─┬─┐
│reference count│
├─┼─┼─┼─┼─┼─┼─┼─┤
│address of type│ → <type 'str'>
├─┼─┼─┼─┼─┼─┼─┼─┤
│   length = 6  │
├─┼─┼─┼─┼─┼─┼─┼─┤
│     hash      │
├─┼─┼─┼─┼─┼─┼─┼─┤
│     flags     │
├─┼─┼─┼─┼─┼─┼─┼─┤
│  address = 0  │
├─┼─┼─┼─┼─┼─┼─┴─┘
│H e l l o !│
└─┴─┴─┴─┴─┴─┘

You retrieve a field

from a record by adding:

record’s start address
+
field’s offset

       ┌─┬─┬─┬─┬─┬─┬─┬─┐
a + 0  │reference count│
       ├─┼─┼─┼─┼─┼─┼─┼─┤
a + 8  │address of type│ → <type 'int'>
       ├─┼─┼─┼─┼─┼─┼─┼─┤
a + 16 │      123      │
       └─┴─┴─┴─┴─┴─┴─┴─┘

struct

>>> from struct import pack, unpack
>>> data = pack('IIf', 2, 9, 20.0)
>>> len(data)
12
>>> data.encode('hex')
'02000000090000000000a041'
>>> unpack('IIf', data)
(2, 9, 20.0)

A struct can be useful
for binary conversations
with C libraries or I/O

Array

a + 0 1 2 3 4 5
   ┌─┬─┬─┬─┬─┬─┐
   │H e l l o !│
   └─┴─┴─┴─┴─┴─┘

Character i

lives at (a + i)

           ┌─┬─┬─┬─┬─┬─┬─┬─┐
a+0  a+4   │h      |e      |
           ├─┼─┼─┼─┼─┼─┼─┼─┤
a+8  a+12  │l      |l      │
           ├─┼─┼─┼─┼─┼─┼─┼─┤
a+16 a+20  │o      |!      │
           └─┴─┴─┴─┴─┴─┴─┴─┘

Character i

lives at (a + 4i)

Given array items b bytes long,

item i lives at (a + bi)

Record — addition, heterogeneous

Array — multiplication, homogenous

Both give you immediate

access to the data you want

array

>>> from array import array
>>> a = array('d', [1.0, 2.0, 3.0, 4.0, 5.0])
>>> a[3]
4.0
>>> a[3] = 44.0
>>> a
array('d', [1.0, 2.0, 3.0, 44.0, 5.0])
>>> a.itemsize
8
>>> len(a)
5
>>> len(a) * a.itemsize
40

An array is useful
for binary conversations
with C libraries or I/O

It also packs data

very tightly in RAM

Specialized arrays

str
unicode
StringIO
memoryview
bytearray
bytes

But

Accessing its items
from Python requires
repeated object building

┌─┬─┬─┬─┬─┬─┬─┬─┐
│reference count│
├─┼─┼─┼─┼─┼─┼─┼─┤
│address of type│ → <type 'float'>
├─┼─┼─┼─┼─┼─┼─┼─┤
│   1.234 e+00  │
└─┴─┴─┴─┴─┴─┴─┴─┘

↑ ↓

┌─┬─┬─┬─┬─┬─┬─┬─┐
│ 1.234 │ 5.678 │
├─┼─┼─┼─┼─┼─┼─┼─┤

For example,
to sum() an array of 100 floats,
Python builds >100 float objects

NumPy

Python array — for C or I/O

NumPy array — for math!

+ - * / // % divmod() ** pow()

<< >> & | ^ < <= > >= == |=

NumPy arrays support
math operations directly
without building Python
number objects!

>>> import numpy as np
>>> a = np.arange(0, 5)
>>> print(a)
[0 1 2 3 4]

>>> a.sum()
10

>>> print(a * a)
[ 0  1  4  9 16]

>>> print(a + 10)
[10 11 12 13 14]

Use NumPy or Pandas to do

math on millions of numbers

without needing millions

of Python objects

(see also Blaze)

Note that PyPy can auto-detects
a list with only int or float items
and can substitute a fast array

Python itself is

“Dynamic language”

“Everything is an object”

So Python’s main data
structures are going to be general —
capable of holding any objects

tuple

>>> tup = (1, 2.0, 'Hello!')

>>> tup = (1, 2.0, 'Hello!')

Wait!

Those three objects
are going to be of
different sizes!

>>> tup = (1, 2.0, 'Hello!')

Array address math a + bi
depends upon every element
being b bytes long

“All problems in computer
science can be solved
by another level
of indirection”

— David Wheeler

┌─┬─┬─┬─┬─┬─┬─┬─┐
│reference count│
├─┼─┼─┼─┼─┼─┼─┼─┤
│address of type│ → <type tuple>
├─┼─┼─┼─┼─┼─┼─┼─┤
│       3       │
├─┼─┼─┼─┼─┼─┼─┼─┤
│    address    │ → <int 1>
├─┼─┼─┼─┼─┼─┼─┼─┤
│    address    │ → <float 2.0>
├─┼─┼─┼─┼─┼─┼─┼─┤
│    address    │ → <str 'Hello!'>
└─┴─┴─┴─┴─┴─┴─┴─┘

indirection

“Python: the language
that gives you data structures,
without any of the actual data”

indirection

Moving an item in a general
purpose Python data structure means
merely copying an address

indirection

There is only one copy of an object
regardless of how many times it appears
in a tuple list or dict

len(tup)
tup[i]

Both length and get-item require

only a single fetch of 8 bytes

>>> from collections import namedtuple
>>> City = namedtuple('City', 'name province')
>>> m = City('Montréal', 'Québec')

>>> m.name
'Montréal'
>>> m.province
'Québec'

tuple

Read-only
sequence
of objects
0-based

Note that a tuple

knows its length

This supports bounds checking!

>>> t = (1, 2, 3)
>>> len(t)
3
>>> t[9]
Traceback (most recent call last):
  ...
IndexError: tuple index out of range

list

Lists are interesting

because they can grow

Problem:

To grow, the list
might need to move
if it’s hemmed in

─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─
#######│     list      │#######
─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─

But

Python objects can never

change address!

      names
    references
        ↓
─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─
#######│     list      │#######
─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─

“All problems in computer
science can be solved
by another level
of indirection”

— David Wheeler

┌─┬─┬─┬─┬─┬─┬─┬─┐
│reference count│
├─┼─┼─┼─┼─┼─┼─┼─┤
│address of type│ → <type list>
├─┼─┼─┼─┼─┼─┼─┼─┤
│       3       │
├─┼─┼─┼─┼─┼─┼─┼─┤   ┌─┬─┬─┬─┬─┬─┬─┬─┐
│ array address │ → │    address    │ →
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
│       4       │   │    address    │ →
└─┴─┴─┴─┴─┴─┴─┴─┘   ├─┼─┼─┼─┼─┼─┼─┼─┤
                    │    address    │ →
                    ├─┼─┼─┼─┼─┼─┼─┼─┤
                    │       -       │
                    └─┴─┴─┴─┴─┴─┴─┴─┘

Q:

Why does the list have

an extra fourth slot?

┌─┬─┬─┬─┬─┬─┬─┬─┐
│reference count│
├─┼─┼─┼─┼─┼─┼─┼─┤
│address of type│ → <type list>
├─┼─┼─┼─┼─┼─┼─┼─┤
│       3       │
├─┼─┼─┼─┼─┼─┼─┼─┤   ┌─┬─┬─┬─┬─┬─┬─┬─┐
│ array address │ → │    address    │ →
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
│       4       │   │    address    │ →
└─┴─┴─┴─┴─┴─┴─┴─┘   ├─┼─┼─┼─┼─┼─┼─┼─┤
                    │    address    │ →
                    ├─┼─┼─┼─┼─┼─┼─┼─┤
                    │       -       │
                    └─┴─┴─┴─┴─┴─┴─┴─┘

A:

If a Python list reserved
no extra room then every append
would require reallocation!

Reallocation is expensive:
each item address to be
copied to its new home

So a reallocation on every

append() would quickly add up

>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> sum(range(10))
45
>>> sum(range(1000))
499500
>>> sum(range(1000000))
499999500000

This is my standard
thousand-and-million
thought experiment!

“What is the cost of doing this
operation a thousand times?
And a million times?”

If the answer grows zeros at twice

the rate of the question, I lose

>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> sum(range(10))
45
>>> sum(range(1000))
499500
>>> sum(range(1000000))
499999500000

thousand → million

O(n) — 3 zeros
O(n lg n) — 4 zeros
O(n²) — 6 zeros

Million seconds

12 days

Billion seconds

31 years

Trillion seconds

31,688 years

So append() on a full list

asks for extra item slots

Sizes: 4, 8, 16, 25, 35, 46, 58, 72, 88, ...

Designing a data structure

this way is called amortization

because it spreads a cost over time,

just like a mortgage does for a loan

Appends	Reallocations	Address copies
1,000	27	8,603
1,000,000	85	9,692,995
1,000,000,000	143	9,058,551,346

Amortized costs tend to be bumpy

You add the 991st item,
the list grows to 1,120 slots,
990 item addresses get copied

Painful

But, in return, the next 130 items

append at no additional cost

amortization

The Python list illustrates a trade-off:
to save time we must waste space
with all of those unused list slots

┌───────────────────────┬───────────┐
│   991 item addresses  │ 129 slots │
└───────────────────────┴───────────┘

Speed vs space

Python lists, on average, use
94% of their slots — consuming
6% more space than they use

In return, they are fast!

list

Python’s
most dangerous
data structure!

# Single-item:

s[i]
s[i] = v
del s[i]
s.append(v)
s.insert(i, v)
s.pop(i)
v in s

# Several-item:

t = s[7:-7]
s == t
s.count('Montréal')
s.index('Montréal')
s.remove('Montréal')
for item in s: ...

problem

# Single-item operations that touch:
# 1 item              # n items

s[i]
s[i] = v
del s[len(s)]         del s[0]
s.append(v)
s.insert(len(s), v)   s.insert(0, v)
s.pop()               s.pop(0)
                      v in s

# Reverse order

while mylist:
    item = list.pop(0)
    print item

Million items?

Half trillion operations!

for item in reversed(mylist):
    print item

list

Python’s
most dangerous
data structure!

slicing

copy vs view

Making a slice of size n

costs n address copies

t = s[400:500]

┌─────────────────────────┐
│      1,000 item list    │
└─────────────────────────┘
 8 * 100 = 800 bytes copied
             ↓
     ┌───────────────┐
     │ 100 item list │
     └───────────────┘

Expensive slicing annoys both

scientists with data and data scientists

So NumPy and Pandas slices

are simple views that copy no data

t = s[400:500]

┌────────────────────────┐
│      1,000 floats      │
└────────────────────────┘
             ↑
        ┌─────────┐
        │  slice  │
        │+400 <500│
        └─────────┘

dict

A list does not let you name slots

but imposes an integer index

      ┌─┬─┬─┬─┬─┬─┬─┬─┐
s[0]  │    address    │ → item
      ├─┼─┼─┼─┼─┼─┼─┼─┤
s[1]  │    address    │ → item
      ├─┼─┼─┼─┼─┼─┼─┼─┤
s[2]  │    address    │ → item
      ├─┼─┼─┼─┼─┼─┼─┼─┤
s[3]  │    address    │ → item
      ├─┼─┼─┼─┼─┼─┼─┼─┤
              ⋮

But a dict lets

you choose keys

>>> d = {}
>>> d[1] = 'un'
>>> d[2.0] = 'deux'
>>> d['three'] = 'trois'

Behind each dict is an array
in which keys are stored at indexes
according to their hash value

The means that each slot

must store both key and value

┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
│      hash     │  key address  │ value address │
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
│      hash     │  key address  │ value address │
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
│      hash     │  key address  │ value address │
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
│      ...      │      ...      │      ...      │

see

“The Mighty Dictionary”

PyCon 2010

A dict grows by doubling

or quadrupling its size,

so resize cost is amortized

A dict resizes at ⅔rds full

to avoid too many collisions

Only ever ⅓–⅔ full

Speed vs space

Given a key, a dict
hashes it and can usually
jump right to its slot

All dictionary operations
are fast and therefore safe
for beginning programmers

# Single item   # Several item

d[key]          d == d2
d[key] = v      d.update(d2)
key in d        list(d)
d.get(key)      for k in d.keys(): ...
d.pop(key)      for k in d.values(): ...
del d[key]      for k, v in d.items(): ...

Common
dict-powered
tasks

Joining
Counting
Grouping

>>> from collections import defaultdict
>>>
>>> animals = defaultdict(list)
>>> animals['Mammal'].append('dog')
>>> animals['Reptile'].append('lizard')
>>> animals['Mammal'].append('cat')

{'Mammal': ['dog', 'cat'],
 'Reptile': ['lizard']}

>>> # Most common 2 letters in a phrase
>>>
>>> from collections import Counter
>>> c = Counter('château frontenac, quebec')
>>> c.most_common(2)
[('e', 4), ('c', 3)]

Remember: you can build

compound dict keys with a tuple

{('google.com', 80): 7125773,
 ('google.com', 443): 1520951,
 ('heroku.com', 22): 245080,
 ('linkedin.com', 443): 5724124,
 ('rackspace.com', 443): 5002620}

Use the frozenset if the

data in your keys is unordered

project_languages = {
    frozenset(['python']): 328,
    frozenset(['python', 'js']): 217,
    frozenset(['ruby']): 274,
    frozenset(['ruby', 'js']): 261,
    frozenset(['js']): 79,
    }

Because the dict assigns
key to array indexes by hash,
it iterates in arbitrary order

list — maintains order

dict — maintains association

Two data structures, two superpowers

list — maintains order

dict — maintains association

But what if we want both?

OrderedDict

>>> from collections import OrderedDict
>>>
>>> d = dict.fromkeys('Montréal')
>>> list(d.keys())
['a', 'r', 't', 'é', 'M', 'l', 'o', 'n']
>>>
>>> d = OrderedDict.fromkeys('Montréal')
>>> list(d.keys())
['M', 'o', 'n', 't', 'r', 'é', 'a', 'l']

OrderedDict

Email headers
HTTP headers
JSON fields

set

A dict with only

keys but no values

# Fast 1-member ops

member in t
t.add(member)
t.remove(member)

# Fast n-member ops

t | u
t & u
t - u

classes

By default, each attribute

is a key in obj.__dict__

But classes that pre-specify
__slots__ are implemented
as a struct (record) instead

What’s left?

Fetching a range
Double-ended list
Live popularity contest

Fetching a range

Use bisect if your data is already

sorted in a list or array

>>> a = [1.0, 2.2, 3.0, 3.4, 3.5, 4.0, 4.5]
>>> import bisect
>>> left = bisect.bisect_left(a, 3.0)
>>> right = bisect.bisect_right(a, 4.0)
>>> a[left:right]
[3.0, 3.4, 3.5, 4.0]

Double-ended list

Use the deque

The list supports
efficient modification
at its right end

append()

pop()

These list operations
provide Python with its
native stack

append()

pop()

“We all have ‘stacks’ in our minds
when we are solving problems”
— Donald Knuth

append()

pop()

But the deque provides quick

operations at both ends

# Efficient operations:

>>> from collections import deque
>>> d = deque()
>>> d.append(5)
>>> d.appendleft(4)
>>> d.pop()
5
>>> d.popleft()
4

item → deque → item

item ← deque ← item

# Great for threading
# Great for multiprocessing
from queue import Queue

item → Queue → item

Popularity contest

The heapq lets you
fetch top t of n items
in O(t + n) time!

>>> import heapq
>>> a = [6, 2, 4, 3, 1, 5, 7]
>>> heapq.heapify(a)
>>> a
[1, 2, 4, 3, 6, 5, 7]
>>> print('     {}\n   {}\n{}'
...       .format(a[0:1], a[1:3], a[3:7]))
     [1]
   [2, 4]
[3, 6, 5, 7]

Push and pop in any order —

the smallest is always returned!

>>> heapq.heappop(a)
1
>>> heapq.heappop(a)
2
>>> heapq.heappush(a, 0)
>>> heapq.heappop(a)
0

Also nlargest() and nsmallest()

which avoid sorting entire sequence

PriorityQueue

A heapq with locks,
so threads can cooperate
on a prioritized work pile

sched.scheduler

A heapq that tells you when

the next task is due to run

So

Q:

Where are the old-fashioned
data structures based upon
chains of addresses?

Linked list
Doubly-linked list
Trees of all kinds

A:

They make little sense
when our data structures
no longer contain data
but only addresses

# Linked list in the old days

┌─┬─┬─┬─┬─┬─┬─┬─┐   ┌─┬─┬─┬─┬─┬─┬─┬─┐
│  next address │ → │  next address │ →
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
│      data     │   │      data     │
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
│      ...      │   │      ...      │
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
│      ...      │   │      ...      │
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
│      ...      │   │      ...      │
└─┴─┴─┴─┴─┴─┴─┴─┘   └─┴─┴─┴─┴─┴─┴─┴─┘

# Linked list when "everything is an object"

┌─┬─┬─┬─┬─┬─┬─┬─┐   ┌─┬─┬─┬─┬─┬─┬─┬─┐
│  next address │ → │  next address │ →
├─┼─┼─┼─┼─┼─┼─┼─┤   ├─┼─┼─┼─┼─┼─┼─┼─┤
│object address │   │object address │
└─┴─┴─┴─┴─┴─┴─┴─┘   └─┴─┴─┴─┴─┴─┴─┴─┘
        ↓                   ↓
    <object>            <object>

Both deque and OrderedDict

have doubly-linked lists inside

But most other linked
lists in Python are hidden
in the memory allocation layer

CPython → C malloc()
IronPython → CLR
Jython → JVM

Since your Python
objects are already inside of
malloc()-style linked lists,

adding a further level of

linking is usually redundant

Trees

Used extensively for data

that is actually a tree!

email.message.MIMEPart
xml.etree.ElementTree
logging.handlers

Q:

But what about the

traditional uses of trees?

Keep data ordered as it changes
Return all values within a range

A:

We have outsourced those

tasks to our persistence layers!

PostgreSQL index? B-tree!
MongoDB index? B-tree!
CouchDB index? B-tree!
SQLite index? B-tree!

The data systems of yore
were long-running and might
keep a data structure around for weeks

Our programs today
instead tend to be episodic and
concerned with individual events

view
update
report

This turns us towards
a functional programming style
that involves quick write-once
throw-away data structures

List comprehensions
Dict comprehensions
Set comprehensions

If simple data-centric code is the future,

per Uncle Bob Martin, Gary Bernhardt —

see

“The Clean Architecture in Python”

PyCon Ireland 2013

— then simple, elegant list and dict
are exactly the data structures
that will be taking us there

Thank you very much!

@brandon_rhodes