{name: "Inception",
year: 2010,
actors: [
{
name: "Leonardo DiCaprio",
born: 1974
},
{
name: "Joseph Gordon-Levitt",
born: 1981
}
]}
But costs space, makes writes expensive
{name: "Inception",
year: 2010,
actors: [
{
name: "Leonardo DiCaprio",
born: 1974
},
{
name: "Joseph Gordon-Levitt",
born: 1981
}
]}
{name: "Inception",
year: 2010,
actors: [59382, 48311]}
{actor_id: 59382,
name: "Leonardo DiCaprio",
born: 1974}
{actor_id: 48311,
name: "Joseph Gordon-Levitt",
born: 1981}
Role
┌────────┬────────┐
│movie_id│actor_id│
│ . │ . │
│ 391853│ 11425 │
│ 391853│ 2217 │
│ . │ . │
└────────┴────────┘
Movie Actor
┌──────┬──────┬────┐┌─────┬──────────────────┐
│ id │ name │year││ id │ name │
│ . │ . │ ││ . │ . │
│391853│Sleuth│1972││ 2217│Caine, Michael (I)│
│ . │ . │ ││11425│Olivier, Laurence │
└──────┴──────┴────┘│ . │ . │
└─────┴──────────────────┘
The Relational Algebra
SELECT movie.year, actor.name
FROM movie
JOIN role ON (movie.id = role.movie_id)
JOIN actor ON (actor.id = role.actor_id)
WHERE movie.name = 'Inception';
Indexes save us
(but query plans become complex!)
Neat blog post:
“What Your Computer Does While You Wait”
— Gustavo Duarte
┌─────┐
│ CPU │
└─────┘
↕
*information*
┌─────┐
│ CPU │
└─────┘
↕
┌──────────┐
│ L1 Cache │ 3 seconds: paper on desk
└──────────┘
┌─────┐
│ CPU │
└─────┘
↕
┌──────────┐
│ L1 Cache │ 3 seconds: paper on desk
└──────────┘
↕
┌──────────┐
│ L2 Cache │ 14 seconds: book from shelf
└──────────┘
┌─────┐
│ CPU │
└─────┘
↕
┌──────────┐
│ L1 Cache │ 3 seconds: paper on desk
└──────────┘
↕
┌──────────┐
│ L2 Cache │ 14 seconds: book from shelf
└──────────┘
↕
┌─────┐
│ RAM │ 4 minutes: walk down the hall
└─────┘
↕
┌──────────┐
│ L1 Cache │ 3 seconds: paper on desk
└──────────┘
↕
┌──────────┐
│ L2 Cache │ 14 seconds: book from shelf
└──────────┘
↕
┌─────┐
│ RAM │ 4 minutes: walk down the hall
└─────┘
↕
┌───────┐ Disk seek:
│ Disks │ “like leaving the building to roam
└───────┘ the earth for 1 year 3 months”
So disk access must be minimized
So we use indexes
INDEX TABLE INDEX
movie
┌────────┬────────┐
│ title │ year │
┌ │ │ │ ┐
┌┤ │ │ │ ├┐
┌┤╞ │ │ │ ╡├┐
│└┤ │ │ │ ├┘│
─┤ ╞ │ │ │ ╡ ├─
│┌┤ │ │ │ ├┐│
└┤╞ │ │ │ ╡├┘
└┤ │ │ │ ├┘
└ │ │ │ ┘
└────────┴────────┘
one query
fetch fetch
index table
block block
↓ ↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ Disk
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ blocks
→ disk head →
n queries:
many many
index table
blocks blocks
↓↓↓ ↓↓ ↓ ↓↓ ↓↓↓↓ ↓↓ ↓↓↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ Disk
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ blocks
→ disk head →
single sweep!
older data new data →
↓↓↓↓↓↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ DB
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ table
older data new data →
↓↓↓↓↓↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ DB
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ table
older data new data →
↓↓↓↓↓↓
┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐ DB
└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘ table
At first, queries might look obvious
SELECT * FROM movie WHERE title = 'Inception';
Obviously, use title index
But which index do we use here?
SELECT * FROM movie
FROM movie
JOIN role ON (movie.id = role.movie_id)
JOIN actor ON (role.actor_id = actor.id)
WHERE movie.title = 'Inception'
AND actor.born = 1974;
Query optimization
WHERE movie.title = 'Inception'
AND actor.born = 1974;
You have domain knowledge!
Postgres knows too!
If—
SELECT * FROM movie
FROM movie JOIN role … JOIN actor …
WHERE movie.title = 'Inception'
AND actor.born = 1974;
SELECT id FROM actor
WHERE actor.born = 1974;
SELECT id FROM movie
WHERE movie.title = 'Inception';
SELECT * FROM role
WHERE actor_id IN (4182, 4343, …)
AND movie_id IN (79463);
So they write code like this—
actor = db.query(Actor).filter_by(born=1974)
for role in actor.roles:
movie = role.movie
if movie.title = 'Inception':
print actor.name
maps things
cursor.execute(
"SELECT title FROM movie WHERE title = :t",
{t=title_field})
for row in cursor.fetchall():
print 'Title:', row[0]
query = db.query(Movie).filter_by(title=title)
for movie in query:
print 'Title:', movie.title
sqlite> EXPLAIN QUERY PLAN SELECT * FROM movie
JOIN role ON (movie.id = movie_id)
JOIN actor ON (actor.id = actor_id)
WHERE movie.year = 1974
AND actor.name = 'Connery, Sean';
0|0|2|SEARCH TABLE actor
USING INDEX idx_actor_name (name=?) (~2 rows)
0|1|1|SEARCH TABLE role
USING INDEX i1 (actor_id=?) (~4 rows)
0|2|0|SEARCH TABLE movie
USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
movies = db.query(Movie).filter_by(year=1981)
for movie in movies:
if len(movie.roles) == 1:
print movie.title
That code produces:
SELECT * FROM movie WHERE year = 1981;
SELECT * FROM role WHERE movie_id = 82;
SELECT * FROM role WHERE movie_id = 123;
SELECT * FROM role WHERE movie_id = 127;
SELECT * FROM role WHERE movie_id = 131;
SELECT * FROM role WHERE movie_id = 282;
...
The programmer should really do an eager load
movies = db.query(Movie).filter_by(year=1981) \
.options(subqueryload(Movie.roles))
...
movies = db.query(Movie).filter_by(year=1981) \
.options(joinedload(Movie.roles))
...
SELECT *
FROM movie JOIN role ON (movie_id = movie.id)
WHERE movie.year = 1981;
db.query(Movie).join(Role).join(Actor)\
.filter_by(Movie.year==1974)\
.filter_by(Actor.name=='Connery, Sean')
db.query(Movie)\
.filter_by(Movie.year=1975)\
.outerjoin(Role)\
.filter_by(Role.name='Batman')
But
Why?
When it first came out:
Use a data store that—
Local variables: mode:text fill-column:59 End: