Flexing SQLAlchemy's Relational Power







─────────────────────────────────────
PyCon — Brandon Rhodes — 9 March 2012
─────────────────────────────────────
De-normalized data is tempting
  1. View can read a single chunk
  2. Easy to shard and scale reads
But writes become costly

MongoDB supports de-normalization

{name: "Inception",
 year: 2010,
 actors: [
  {
   name: "Leonardo DiCaprio",
   born: 1974
  },
  {
   name: "Joseph Gordon-Levitt",
   born: 1981
  }
 ]}

But costs space, makes writes expensive

{name: "Inception",
 year: 2010,
 actors: [
  {
   name: "Leonardo DiCaprio",
   born: 1974
  },
  {
   name: "Joseph Gordon-Levitt",
   born: 1981
  }
 ]}
So we often do
some normalization

More normalized MongoDB data

{name: "Inception",
 year: 2010,
 actors: [59382, 48311]}

{actor_id: 59382,
 name: "Leonardo DiCaprio",
 born: 1974}

{actor_id: 48311,
 name: "Joseph Gordon-Levitt",
 born: 1981}

Relational: Design

Relational databases take this
to its logical extreme

Relational: Design

  1. Tables have fixed set of columns
  2. No composite data types
  3. Thus, no inline lists
                     Role
              ┌──────────┬──────────┐
              │ movie_id │ actor_id │
              │     .    │     .    │
              │  391853  │   11425  │
              │  391853  │    2217  │
              │     .    │     .    │
              └──────────┴──────────┘

         Movie                        Actor
┌────────┬────────┬──────┐  ┌───────┬────────────────────┐
│   id   │  name  │ year │  │   id  │        name        │
│    .   │    .   │   .  │  │    .  │          .         │
│ 391853 │ Sleuth │ 1972 │  │  2217 │ Caine, Michael (I) │
│    .   │    .   │   .  │  │ 11425 │ Olivier, Laurence  │
└────────┴────────┴──────┘  │    .  │          .         │
                            └───────┴────────────────────┘

Relational: Consequences

  1. A view often needs rows from several tables
  2. But writes stay cheap and easy to scale

Q:

So how do we pull together
the related information that a
relational database atomizes?

A:

The Relational Algebra

SQL

SELECT movie.year, actor.name
  FROM movie
  JOIN role ON (movie.id = role.movie_id)
  JOIN actor ON (actor.id = role.actor_id)
  WHERE movie.name = 'Inception';
Storage is slow

Q:

How can we gather several
pieces of data efficiently?

A:

Indexes save us

(but query plans become complex!)

Neat blog post:

“What Your Computer Does While You Wait”

— Gustavo Duarte

  ┌─────┐
  │ CPU │
  └─────┘
     │
*information*

  ┌─────┐
  │ CPU │
  └─────┘
     │
┌──────────┐
│ L1 Cache │   3 seconds: paper on desk
└──────────┘

  ┌─────┐
  │ CPU │
  └─────┘
     │
┌──────────┐
│ L1 Cache │   3 seconds: paper on desk
└──────────┘
     │
┌──────────┐
│ L2 Cache │  14 seconds: book from shelf
└──────────┘

  ┌─────┐
  │ CPU │
  └─────┘
     │
┌──────────┐
│ L1 Cache │   3 seconds: paper on desk
└──────────┘
     │
┌──────────┐
│ L2 Cache │  14 seconds: book from shelf
└──────────┘
     │
  ┌─────┐
  │ RAM │   4 minutes: walk down the hall
  └─────┘

  ┌─────┐
  │ CPU │
  └─────┘
     │
┌──────────┐
│ L1 Cache │   3 seconds: paper on desk
└──────────┘
     │
┌──────────┐
│ L2 Cache │  14 seconds: book from shelf
└──────────┘
     │
  ┌─────┐
  │ RAM │   4 minutes: walk down the hall
  └─────┘
     │
 ┌───────┐             Disk seek:
 │ Disks │  like leaving the building to roam
 └───────┘    the earth for 1 year 3 months

So disk access must be minimized

Keeping records sorted on disk
is not feasible
  1. Too costly to insert new records
  2. You can only sort by one attribute

So we use indexes

INDEX          TABLE          INDEX

               movie
        ┌────────┬────────┐
        │  title │  year  │
        │        │        │
    ┌───┤        │        ├───┐
  ┌─┤   │        │        │   ├─┐
  │ └───┤        │        ├───┘ │
 ─┤     │        │        │     ├─
  │ ┌───┤        │        ├───┐ │
  └─┤   │        │        │   ├─┘
    └───┤        │        ├───┘
        │        │        │
        └────────┴────────┘

Q:

“If you have to move the disk head
around to several indexes and tables,
doesn't that kill performance?”

A:

one query

    fetch               fetch
    index               table
    block               block
      ↓                   ↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ Disk
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ blocks

           disk head  

A:

n queries:

    many                many
    index               table
    blocks              blocks
   ↓↓↓ ↓↓          ↓↓ ↓↓↓↓ ↓↓ ↓↓↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ Disk
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ blocks

           disk head  
          single sweep!
Most relational databases always
append new data to the end of a table
      older data           new data 
                            ↓↓↓↓↓↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ DB
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ table
Even when you update an already existing row
the database appends the new copy to the end
      older data           new data 
                            ↓↓↓↓↓↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ DB
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ table
Speed: disk head can stay in place
Transactions: several writes “go live” at once
      older data           new data 
                            ↓↓↓↓↓↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ DB
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ table

Q:

How do we write our code to play
to a relational DB's strengths?

Optimization

  1. Asking the DB a single question
  2. Doing EXPLAIN and adding indexes

Asking the database a single question

At first, queries might look obvious

SELECT * FROM movie WHERE title = 'Inception';

Obviously, use title index

Q:

But which index do we use here?

SELECT *

  FROM movie
  JOIN role ON (movie.id = role.movie_id)
  JOIN actor ON (role.actor_id = actor.id)

  WHERE movie.title = 'Inception'
    AND actor.born = 1974;
Do we start with the title
or with the birth year?

A:

Query optimization

WHERE movie.title = 'Inception'
  AND actor.born = 1974;

You have domain knowledge!

You know that few movies have this title
but that many actors were born in 1974

Guess what?

Postgres knows too!

A good relational database
keeps statistics on the distribution
of values in every column of every table
So the database can choose
from among the fastest ways
to answer your query

If—

if you ask the whole question
as a single query

This can be optimized

SELECT * FROM movie
  JOIN role ON (movie.id = role.movie_id)
  JOIN actor ON (role.actor_id = actor.id)
  WHERE movie.title = 'Inception'
    AND actor.born = 1974;

This cannot be optimized

SELECT id FROM actor
  WHERE actor.born = 1974;

SELECT id FROM movie
  WHERE movie.title = 'Inception';

SELECT * FROM role
  WHERE actor_id IN (4182, 4343, ...)
    AND movie_id IN (79463);
“But wait,”
I hear you cry,
“Why would someone write so
many tiny queries?”

A:

Because they are using an ORM
and never learn about joins
They learn the simplest call possible
to pull objects from the database—
—and try to survive forever with nothing
more complex than single-model calls to
query().filter_by() followed
by attribute accesses

So they write code like this—

actors = db.query(Actor).filter_by(born=1974)
for actor in actors:
    for role in actor.roles:
        movie = role.movie
        if movie.title == 'Inception':
            print actor.name

The “O” Error

is thinking that you can take the “O”
part of “ORM” and ignore the rest

The “O” Error

is assuming that
“Object-Relational Mapper”
really means
“Relational Hider/Concealer/Absolver”

“Mapper”

maps things

“Mapper”

Maps relational concepts
into a non-relational language
so that relations become visible and
manipulable within its native syntax
When using an ORM
you still need to
understand relations
At least a little bit

So what's the point?

So what's the point?

Why use an ORM if you still
have to understand relations?

So what's the point?

What should you be expecting
your ORM to accomplish for you?

Some advantages of an ORM

  1. Access fields like attributes
  2. Access relations like lists
  3. Build query from many filters
  4. Provide powerful unit of work

cursor.execute(
    "SELECT title FROM movie WHERE title = :t",
    {'t': title_field})
for row in cursor.fetchall():
    print 'Title:', row[0]
The ORM's job is to turn this
into this
query = db.query(Movie).filter_by(title=title)
for movie in query:
    print 'Title:', movie.title
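
For reference, the Python snippets in the rest of this talk
assume mapped classes roughly like this minimal sketch; the
exact class and column names are assumptions pieced together
from the other slides, not the talk's real demo code.

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class Movie(Base):
    __tablename__ = 'movie'
    id = Column(Integer, primary_key=True)
    title = Column(String)
    year = Column(Integer)
    roles = relationship('Role', backref='movie')

class Actor(Base):
    __tablename__ = 'actor'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    born = Column(Integer)
    roles = relationship('Role', backref='actor')

class Role(Base):
    __tablename__ = 'role'
    id = Column(Integer, primary_key=True)
    movie_id = Column(Integer, ForeignKey('movie.id'))
    actor_id = Column(Integer, ForeignKey('actor.id'))
    name = Column(String)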

Game Plan

  1. Gain some basic SELECT and JOIN skills
  2. Learn how to express JOIN in Python
  3. Learn to EXPLAIN problem queries
  4. Learn to fix them with indexes

Use logging

Connecting through SQLAlchemy with echo=True
can let you investigate during development
You can also log queries at runtime
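
A minimal sketch of both approaches; the database URL here is
only a stand-in:

import logging
from sqlalchemy import create_engine

# echo=True makes the Engine log every SQL statement it emits
engine = create_engine('sqlite:///movies.db', echo=True)

# or raise SQLAlchemy's standard-library logger to INFO at runtime
logging.basicConfig()
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)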

Use EXPLAIN

sqlite> EXPLAIN QUERY PLAN SELECT * FROM movie
  JOIN role ON (movie.id = movie_id)
  JOIN actor ON (actor.id = actor_id)
  WHERE movie.year = 1974
    AND actor.name = 'Connery, Sean';

0|0|2|SEARCH TABLE actor
   USING INDEX idx_actor_name (name=?) (~2 rows)
0|1|1|SEARCH TABLE role
   USING INDEX i1 (actor_id=?) (~4 rows)
0|2|0|SEARCH TABLE movie
   USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
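
A sketch of doing the same investigation from Python, assuming
the db session, engine, and models from the earlier sketches:
str() on a Query shows the SQL it will emit, and an Index can
be declared if the plan reveals a full table scan.

# str() on a Query shows the SELECT that SQLAlchemy will emit,
# ready to paste after EXPLAIN QUERY PLAN in the sqlite3 shell
query = db.query(Actor).filter_by(name='Connery, Sean')
print str(query)

# and this is how an index like idx_actor_name above can be declared
from sqlalchemy import Index
idx = Index('idx_actor_name', Actor.__table__.c.name)
idx.create(engine)    # or let metadata.create_all() build it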

Learn to identify common patterns

A common problem: cavalier access of an
attribute that kicks off a new query
movies = db.query(Movie).filter_by(year=1981)
for movie in movies:
    if len(movie.roles) == 1:
        print movie.title

That code produces:

SELECT * FROM movie WHERE year = 1981;
SELECT * FROM role WHERE movie_id = 82;
SELECT * FROM role WHERE movie_id = 123;
SELECT * FROM role WHERE movie_id = 127;
SELECT * FROM role WHERE movie_id = 131;
SELECT * FROM role WHERE movie_id = 282;
...

The programmer should really do an eager load

from sqlalchemy.orm import subqueryload, joinedload

movies = db.query(Movie).filter_by(year=1981) \
    .options(subqueryload(Movie.roles))
...
movies = db.query(Movie).filter_by(year=1981) \
    .options(joinedload(Movie.roles))
...
The eager load lets SQLAlchemy know
everything you need ahead of time
SELECT *
  FROM movie JOIN role ON (movie_id = movie.id)
  WHERE movie.year = 1981;
Another pattern:
Programmers should aggregate all
conditions into a single query
db.query(Movie).join(Role).join(Actor)\
    .filter(Movie.year == 1974)\
    .filter(Actor.name == 'Connery, Sean')

Outer Join

A left outer join returns a row
for each model on the left side,
even those with zero matching
objects on the right
db.query(Movie)\
    .filter(Movie.year == 1975)\
    .outerjoin(Role)\
    .filter(Role.name == 'Batman')
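
A sketch (not a slide from the talk) of the classic payoff of an
outer join: keeping the left-side rows that matched nothing at all.

# movies from 1975 that have no roles at all: the outer join fills
# the role columns with NULL, which the final test then selects
db.query(Movie)\
    .outerjoin(Role)\
    .filter(Movie.year == 1975)\
    .filter(Role.movie_id == None)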

Caveat

An ORM makes using a relation
look native and pretty in Python

But

an ORM introduces complexities
and surprises of its own

Two example surprises:

  1. Django ORM lacks a unit-of-work pattern
  2. SQLAlchemy has a unit-of-work pattern
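
A sketch of what the unit-of-work pattern buys you; the session
db and the particular rows involved here are assumed:

movie = db.query(Movie).filter_by(title='Sleuth').one()
movie.year = 1972                              # no SQL yet: pending UPDATE
db.add(Movie(title='Inception', year=2010))    # no SQL yet: pending INSERT
db.commit()                                    # one flush writes both at once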

I am excited about NoSQL

Why?

Because I see programmers
sitting down and reading database
documentation and understanding it
Remember: the first breakout-hit database
among Web programmers was a NoSQL solution!
[image: MySQL logo]

When it first came out:

[image: MySQL logo]
Was a simple key-value store
supporting compound static
types for the values
If a document database
or a key-value store
fits your brain
and behaves like you expect
then use it!
Don't
use a relational database
because you think doing so is somehow
“right” or “noble”
if
you hope that your ORM will
let you hide from how relations work

Use a data store that—

—fits your application
—fits your data
—you understand

And

Relational databases are awesome
SQLAlchemy is awesome
Python is awesome

Thank you!
