Flexing SQLAlchemy's Relational Power

─────────────────────────────────────
PyCon — Brandon Rhodes — 9 March 2012
─────────────────────────────────────

De-normalized data is tempting

View can read a single chunk
Easy to shard and scale reads

But writes become costly

MongoDB supports de-normalization

{name: "Inception",
 year: 2010,
 actors: [
  {
   name: "Leonardo DiCaprio",
   born: 1974
  },
  {
   name: "Joseph Gordon-Levitt",
   born: 1981
  }
 ]}

But costs space, makes writes expensive

{name: "Inception",
 year: 2010,
 actors: [
  {
   name: "Leonardo DiCaprio",
   born: 1974
  },
  {
   name: "Joseph Gordon-Levitt",
   born: 1981
  }
 ]}

So we often do

some normalization

More normalized MongoDB data

{name: "Inception",
 year: 2010,
 actors: [59382, 48311]}

{actor_id: 59382,
 name: "Leonardo DiCaprio",
 born: 1974}

{actor_id: 48311,
 name: "Joseph Gordon-Levitt",
 born: 1981}

Relational: Design

Relational databases take this

to its logical extreme

Relational: Design

Tables have fixed set of columns
No composite data types
Thus, no inline lists

                     Role
              ┌────────┬────────┐
              │movie_id│actor_id│
              │    .   │    .   │
              │  391853│  11425 │
              │  391853│   2217 │
              │    .   │    .   │
              └────────┴────────┘
       Movie                    Actor
┌──────┬──────┬────┐┌─────┬──────────────────┐
│ id   │ name │year││  id │       name       │
│   .  │  .   │    ││   . │         .        │
│391853│Sleuth│1972││ 2217│Caine, Michael (I)│
│   .  │  .   │    ││11425│Olivier, Laurence │
└──────┴──────┴────┘│   . │         .        │
                    └─────┴──────────────────┘

Relational: Consequences

A view often needs several data stores.
But you can now always scale writes

Q:

So how do we pull together
the related information that a
relational database atomizes?

A:

The Relational Algebra

SQL

SELECT movie.year, actor.name
  FROM movie
  JOIN role ON (movie.id = role.movie_id)
  JOIN actor ON (actor.id = role.actor_id)
  WHERE movie.name = 'Inception';

Storage is slow

Q:

How can we gather several

pieces of data efficiently?

A:

Indexes save us

(but query plans become complex!)

Neat blog post:

“What Your Computer Does While You Wait”

— Gustavo Duarte

  ┌─────┐
  │ CPU │
  └─────┘
     ↕
*information*

  ┌─────┐
  │ CPU │
  └─────┘
     ↕
┌──────────┐
│ L1 Cache │  3 seconds: paper on desk
└──────────┘

  ┌─────┐
  │ CPU │
  └─────┘
     ↕
┌──────────┐
│ L1 Cache │   3 seconds: paper on desk
└──────────┘
     ↕
┌──────────┐
│ L2 Cache │  14 seconds: book from shelf
└──────────┘

  ┌─────┐
  │ CPU │
  └─────┘
     ↕
┌──────────┐
│ L1 Cache │   3 seconds: paper on desk
└──────────┘
     ↕
┌──────────┐
│ L2 Cache │  14 seconds: book from shelf
└──────────┘
     ↕
  ┌─────┐
  │ RAM │  4 minutes: walk down the hall
  └─────┘

     ↕
┌──────────┐
│ L1 Cache │   3 seconds: paper on desk
└──────────┘
     ↕
┌──────────┐
│ L2 Cache │  14 seconds: book from shelf
└──────────┘
     ↕
  ┌─────┐
  │ RAM │  4 minutes: walk down the hall
  └─────┘
     ↕
 ┌───────┐             Disk seek:
 │ Disks │  “like leaving the building to roam
 └───────┘   the earth for 1 year 3 months”

So disk access must be minimized

Keeping records sorted on disk

is not feasible

Too costly to insert new records
You can only sort by one attribute

So we use indexes

INDEX          TABLE          INDEX

               movie
        ┌────────┬────────┐
        │ title  │  year  │
    ┌   │        │        │   ┐
   ┌┤   │        │        │   ├┐
  ┌┤╞   │        │        │   ╡├┐
  │└┤   │        │        │   ├┘│
 ─┤ ╞   │        │        │   ╡ ├─
  │┌┤   │        │        │   ├┐│
  └┤╞   │        │        │   ╡├┘
   └┤   │        │        │   ├┘
    └   │        │        │   ┘
        └────────┴────────┘

Q:

“If you have to move the disk head
around to several indexes and tables
doesn't that kill performance?”

A:

one query

    fetch               fetch
    index               table
    block               block
      ↓                   ↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ Disk
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ blocks

         →  disk head  →

A:

n queries:

    many                many
    index               table
    blocks              blocks
   ↓↓↓ ↓↓ ↓         ↓↓ ↓↓↓↓ ↓↓ ↓↓↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ Disk
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ blocks

         →  disk head  →
          single sweep!

Most relational databases always

append new data to the end of a table

      older data           new data →
                            ↓↓↓↓↓↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ DB
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ table

Even when you update an already existing row

the database appends the new copy to the end

      older data           new data →
                            ↓↓↓↓↓↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ DB
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ table

Speed: disk head can stay in place

Transactions: several writes “go live” at once

      older data           new data →
                            ↓↓↓↓↓↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ DB
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ table

Q:

How do we write our code to play

to a relational DB's strengths?

Optimization

Asking the DB a single question
Doing EXPLAIN and adding indexes

Asking the database a single question

At first, queries might look obvious

SELECT * FROM movie WHERE title = 'Inception';

Obviously, use title index

Q:

But which index do we use here?

SELECT * FROM movie

  FROM movie
  JOIN role ON (movie.id = role.movie_id)
  JOIN actor ON (role.actor_id = actor.id)

  WHERE movie.title = 'Inception'
    AND actor.born = 1974;

Do we start with the title

or with the year?

A:

Query optimization

WHERE movie.title = 'Inception'
  AND actor.born = 1974;

You have domain knowledge!

You know that few movies have this title

but that many actors were born in 1974

Guess what?

Postgres knows too!

A good relational database
keeps statistics on the distribution
of values in every column of every table

So the database can choose
from among the fastest ways
to answer your query

If—

—if you ask the whole question

as a single query

This can be optimized

SELECT * FROM movie
  FROM movie JOIN role … JOIN actor …
  WHERE movie.title = 'Inception'
    AND actor.born = 1974;

This cannot be optimized

SELECT id FROM actor
  WHERE actor.born = 1974;

SELECT id FROM movie
  WHERE movie.title = 'Inception';

SELECT * FROM role
  WHERE actor_id IN (4182, 4343, …)
    AND movie_id IN (79463);

“But wait,”

I hear you cry,

“Why would someone write so

many tiny queries?”

A:

Because they are using an ORM

and never learn about joins

They learn the simplest call possible

to pull objects from the database—

—and try to survive forever with nothing
more complex than single-mode calls to
query().filter_by() followed
by attribute accesses

So they write code like this—

actor = db.query(Actor).filter_by(born=1974)
for role in actor.roles:
    movie = role.movie
    if movie.title = 'Inception':
        print actor.name

The “O” Error

is thinking that you can take the “O”

part of “ORM” and ignore the rest

The “O” Error

is assuming that
“Object-Relational Mapper”
really means
“Relational Hider/Concealer/Absolver”

“Mapper”

maps things

“Mapper”

Maps relational concepts
into a non-relational language
so that relations become visible and
manipulable within its native syntax

When using an ORM
you still need to
understand relations

At least a little bit

So what's the point?

Why use an ORM if you still

have to understand relations?

So what's the point?

What should you be expecting

your ORM to accomplish for you?

Some advantages of an ORM

Access fields like attributes
Access relations like lists
Build query from many filters
Provide powerful unit of work

cursor.execute(
    "SELECT title FROM movie WHERE title = :t",
    {t=title_field})
for row in cursor.fetchall():
    print 'Title:', row[0]

↑
The ORM's job is to turn this
into this
↓

query = db.query(Movie).filter_by(title=title)
for movie in query:
    print 'Title:', movie.title

Game Plan

Gain some basic SELECT and JOIN skills
Learn how to express JOIN in Python
Learn to EXPLAIN problem queries
Learn to fix them with indexes

Use logging

Connecting through SQLAlchemy with echo=True

can let you investigate during development

You can also log queries at runtime

Use EXPLAIN

sqlite> EXPLAIN QUERY PLAN SELECT * FROM movie
  JOIN role ON (movie.id = movie_id)
  JOIN actor ON (actor.id = actor_id)
  WHERE movie.year = 1974
    AND actor.name = 'Connery, Sean';

0|0|2|SEARCH TABLE actor
   USING INDEX idx_actor_name (name=?) (~2 rows)
0|1|1|SEARCH TABLE role
   USING INDEX i1 (actor_id=?) (~4 rows)
0|2|0|SEARCH TABLE movie
   USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)

Learn to identify common patterns

A common problem: cavalier access of an

attribute that kicks off a new query

movies = db.query(Movie).filter_by(year=1981)
for movie in movies:
    if len(movie.roles) == 1:
        print movie.title

That code produces:

SELECT * FROM movie WHERE year = 1981;
SELECT * FROM role WHERE movie_id = 82;
SELECT * FROM role WHERE movie_id = 123;
SELECT * FROM role WHERE movie_id = 127;
SELECT * FROM role WHERE movie_id = 131;
SELECT * FROM role WHERE movie_id = 282;
...

The programmer should really do an eager load

movies = db.query(Movie).filter_by(year=1981) \
    .options(subqueryload(Movie.roles))
...

movies = db.query(Movie).filter_by(year=1981) \
    .options(joinedload(Movie.roles))
...

The eager load lets SQLAlchemy know

everything you need ahead of time

SELECT *
  FROM movie JOIN role ON (movie_id = movie.id)
  WHERE movie.year = 1981;

Another pattern:
Programmers should aggregate all
conditions into a single query

db.query(Movie).join(Role).join(Actor)\
    .filter_by(Movie.year==1974)\
    .filter_by(Actor.name=='Connery, Sean')

Outer Join

A left outer join returns a row
for each model on the left side,
even those with zero matching
objects on the right

db.query(Movie)\
    .filter_by(Movie.year=1975)\
    .outerjoin(Role)\
    .filter_by(Role.name='Batman')

Caveat

An ORM makes using a relation

look native and pretty in Python

But

an ORM introduces complexities

and surprises of its own

Two example surprises:

Django ORM lacks a unit-of-work pattern
SQLAlchemy has a unit-of-work pattern

I am excited about NoSQL

Why?

Because I see programmers
sitting down and reading database
documentation and understanding it

Remember: the first breakout-hit database

among Web programmers was a NoSQL solution!

When it first came out:

No transactions
No joins
No sub-queries
That's NoSQL!

Was a simple key-value store
supporting compound static
types for the values

If a document database
or a key-value store
fits your brain

behaves like you expect

then use it!

Don't

use a relational database
because you think doing so is somehow
“right” or “noble”

you hope that your ORM will

let you hide from how relations work

Use a data store that—

—fits your application
—that fits your data
—that you understand

And

Relational databases are awesome
SQLAlchemy is awesome
Python is awesome

Thank you!

Local variables: mode:text fill-column:59 End: