Django: a Data Shovel
With a Future
@brandon_rhodes
DjangoCon 2014
[Chart: complexity over time, climbing from "start" to "working code"]
Once my code
is finally working
it is probably better
described as barely working

guideline

Once new code works,
I am probably about
halfway done
[Chart: complexity over time, from "start" up to "working code", then on to "good code" lower on the curve]
Web framework
history follows a
similar pattern
[Chart: the same curve labeled with frameworks: "start" rises to Zope at the peak; CherryPy and Pyramid sit below it; Django and "µ + ORM" below those; µ frameworks lowest]
What is a
web framework?

A denormalization engine

A web framework
denormalizes data
to documents

Q:

How much data
should you deliver
to the client?
minimum   ↔   maximum

imagine

If hard drives
kept growing
30 years
$$ \frac{8\,\mathrm{TB}}{20\,\mathrm{MB}} = \frac{?}{8\,\mathrm{TB}} $$
$$ \frac{8\,\mathrm{TB}}{20\,\mathrm{MB}} = \frac{3.2\,\mathrm{EB}}{8\,\mathrm{TB}} $$
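The proportion above can be checked with a few lines of arithmetic. This is only a sketch and assumes decimal (SI) units, which the slides do not state.

```python
# Decimal (SI) storage units, in bytes. Assumption: the slides mean SI units.
MB = 10**6
TB = 10**12
EB = 10**18

growth = (8 * TB) // (20 * MB)   # growth factor from a 20 MB to an 8 TB drive
projected = 8 * TB * growth      # the same factor applied to an 8 TB drive

print(growth)                    # 400000
print(projected / EB)            # 3.2
```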
Why not have
Stack Overflow on
your hard drive?
Why not have
the Library of Congress
on your hard drive?
$$ {\rm 0.003\,EB} $$
Why not have
every movie ever made,
in High Definition?
$$ {\rm 0.008\,EB} $$
Subversion     git / hg
minimum   ↔   maximum
Django is flexible and
not opinionated about
where my views land
minimum   ↔   maximum

A story

newfs-logo.png
jazkarta.gif

Problem:

dichotomous keys

choose-adventure.jpg
gymnosperms.jpg
User-directed search
gobotany-logo.png
User-directed search
“choose a filter”


User-directed search
red flowers
smooth leaves
Vermont
how-many-leaves.jpg
55-matching-species.jpg

First iteration

1 : 1
User action - API call
red flowers
smooth leaves
Vermont
/q ? flower=red & leaf=smooth & state=vt

Q:

These searches
get expensive.
Will caching search
results help?

A:

No
? flower=red & leaf=smooth & state=vt
? flower=red & state=vt & leaf=smooth
? leaf=smooth & flower=red & state=vt
? leaf=smooth & state=vt & flower=red
? state=vt & flower=red & leaf=smooth
? state=vt & leaf=smooth & flower=red
canonicalization
Same meaning → same text
“Order filters alphabetically”

‘f’‘l’‘s’

? flower=red & leaf=smooth & state=vt
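One way to sketch "order filters alphabetically" in Python is with the standard library's query-string helpers. The function name `canonical_query` is hypothetical, not taken from the Go Botany code.

```python
from urllib.parse import parse_qsl, urlencode

def canonical_query(query: str) -> str:
    """Return the query string with its parameters sorted by name."""
    pairs = parse_qsl(query, keep_blank_values=True)
    return urlencode(sorted(pairs))

# The six orderings from the earlier slide.
variants = [
    "flower=red&leaf=smooth&state=vt",
    "flower=red&state=vt&leaf=smooth",
    "leaf=smooth&flower=red&state=vt",
    "leaf=smooth&state=vt&flower=red",
    "state=vt&flower=red&leaf=smooth",
    "state=vt&leaf=smooth&flower=red",
]
canonical = {canonical_query(v) for v in variants}
print(canonical)  # {'flower=red&leaf=smooth&state=vt'}
```

All six spellings collapse to one cache key.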

where?

GET /q?state=vt&flower=red HTTP/1.1
Host: gobotany.newfs.org

               

HTTP/1.1 301 Moved Permanently
Location: /q?flower=red&state=vt
Transform is purely textual!
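Because the transform is purely textual, the redirect decision needs no database and no knowledge of what the filters mean. A minimal sketch, with hypothetical names `canonical_target` and `redirect_for`, that works on the raw request target:

```python
def canonical_target(target: str) -> str:
    """Sort the '&'-separated parameters of a request target as plain text."""
    path, sep, query = target.partition("?")
    if not sep:
        return target                      # no query string, nothing to sort
    return path + "?" + "&".join(sorted(query.split("&")))

def redirect_for(target: str):
    """Return (status_line, headers) for a 301, or None if already canonical."""
    canonical = canonical_target(target)
    if canonical == target:
        return None
    return "301 Moved Permanently", [("Location", canonical)]

print(redirect_for("/q?state=vt&flower=red"))
# ('301 Moved Permanently', [('Location', '/q?flower=red&state=vt')])
print(redirect_for("/q?flower=red&state=vt"))   # None
```

In a Django project the same check could plausibly live in middleware that returns `HttpResponsePermanentRedirect`; that placement is an assumption, not something the slides specify.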
With Cache-Control,
301s can be pushed to:
Varnish
CDN
In general, caching
non-200 results can
be a big deal

The Onion Uses Django, And Why It Matters To Us

“And the biggest
performance
boost of all:
caching 404s and sending
Cache-Control headers
to the CDN on 404.”
HTTP/1.1 404 Not Found
Cache-Control: max-age=3600
Outgoing bandwidth
↓ 66%
Load average
↓ 50%
Had we wanted,
we could have cached
normalizations
HTTP/1.1 301 Moved Permanently
Location: /q?flower=red&state=vt
Cache-Control: max-age=3600
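A rough sketch of attaching that header to non-200 responses, written as plain WSGI middleware so it runs without any framework. The wrapper name `cache_non_200` and the `CACHEABLE` set are assumptions for illustration, not how The Onion or Go Botany did it.

```python
CACHEABLE = ("301", "404")   # assumed set of status codes worth caching

def cache_non_200(app, max_age=3600):
    """Wrap a WSGI app so selected non-200 responses carry Cache-Control."""
    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            has_header = any(k.lower() == "cache-control" for k, _ in headers)
            if status.startswith(CACHEABLE) and not has_header:
                headers = headers + [("Cache-Control", f"max-age={max_age}")]
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return wrapped

def not_found_app(environ, start_response):
    """Toy app that always answers 404, used only to exercise the wrapper."""
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

With the header in place, Varnish or a CDN can answer repeat requests for the same missing or redirected URL without touching the application.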
? flower=red & leaf=smooth & state=vt
If we cache these
URLs, how often will
they be revisited?
Imagine 100 filters in
use by students who apply
5 filters per search
$$\binom{100}{5} = 75{,}287{,}520$$
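The count can be reproduced with the standard library (Python 3.8 or later). The variable names are illustrative only.

```python
from math import comb

filters_available = 100
filters_per_search = 5

# Number of distinct 5-filter searches, ignoring order (URLs are canonical).
distinct_searches = comb(filters_available, filters_per_search)
print(f"{distinct_searches:,}")  # 75,287,520
```

Even with canonical URLs, that many distinct searches means any single cached result is unlikely to be requested twice.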

step back

? flower=red & leaf=smooth & state=vt
Django n-way join
against plant ↔ feature
binary relation

3 phases

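SELECT f1.plant_id
  FROM plant_feature f1
  JOIN plant_feature f2 ON f2.plant_id = f1.plant_id
  JOIN plant_feature f3 ON f3.plant_id = f1.plant_id
 WHERE f1.feature = 'form=tree'   -- "feature" is an assumed column name
   AND f2.feature = 'leaf=smooth'
   AND f3.feature = 'state=vt'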
        plant_feature
-----------------------------
Acer rubrum       form=tree
Acer rubrum       leaves=fall
Acer rubrum       twig=gray
Acer rubrum       twig=red
Acer saccharinum  form=tree
Acer saccharinum  leaves=fall
              

plant list

FROM plant_feature f1
 ... WHERE f1 = 'form=tree'
Acer rubrum       form=tree
Acer saccharinum  form=tree
              

intersection

What plants are in lists
f1 and f2 and f3?
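The steps above (table, plant list per feature, intersection) can be sketched in plain Python using the sample rows from the plant_feature slide. The function names are hypothetical.

```python
# Sample rows copied from the plant_feature table above.
PLANT_FEATURE = [
    ("Acer rubrum", "form=tree"),
    ("Acer rubrum", "leaves=fall"),
    ("Acer rubrum", "twig=gray"),
    ("Acer rubrum", "twig=red"),
    ("Acer saccharinum", "form=tree"),
    ("Acer saccharinum", "leaves=fall"),
]

def plant_list(feature):
    """Every plant that has the given feature."""
    return {plant for plant, f in PLANT_FEATURE if f == feature}

def search(*features):
    """Intersect the per-feature plant lists."""
    lists = [plant_list(f) for f in features]
    return set.intersection(*lists) if lists else set()

print(sorted(search("form=tree", "leaves=fall")))
# ['Acer rubrum', 'Acer saccharinum']
print(sorted(search("form=tree", "leaves=fall", "twig=red")))
# ['Acer rubrum']
```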

pivot

design1.svg
design2.svg

/api/feature/leaf_edge

{'smooth': {'text': 'Smooth leaves',
            'plants': [13, 15, 16, ]},
 'serrated': {'text':  'Jagged leaves',
              'plants': [3, 14, 17, ]},
 'lobed': {'text': 'Lobed leaves',
           'plants': [6, 7, 8, 10, ]}}
_.intersection(a1, a2, a3)
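A sketch of the pivot: the server builds one document per feature, and the intersection moves to the client. The rows, the `state_vt` list, and the helper names are hypothetical; the talk's client used Underscore's `_.intersection` in the browser, so the Python `intersection` here is only a stand-in.

```python
import json
from collections import defaultdict

# Hypothetical rows: (plant id, feature value, display text).
LEAF_EDGE_ROWS = [
    (13, "smooth", "Smooth leaves"),
    (15, "smooth", "Smooth leaves"),
    (3, "serrated", "Jagged leaves"),
    (6, "lobed", "Lobed leaves"),
]

def feature_document(rows):
    """Pivot rows into {value: {"text": ..., "plants": [...]}}."""
    doc = defaultdict(lambda: {"text": "", "plants": []})
    for plant_id, value, text in rows:
        doc[value]["text"] = text
        doc[value]["plants"].append(plant_id)
    return dict(doc)

def intersection(*plant_lists):
    """Stand-in for the client-side _.intersection call."""
    first, *rest = plant_lists
    return [p for p in first if all(p in other for other in rest)]

leaf = feature_document(LEAF_EDGE_ROWS)
payload = json.dumps(leaf)       # body that /api/feature/leaf_edge could serve
state_vt = [3, 13, 16]           # hypothetical plant ids for another feature
print(intersection(leaf["smooth"]["plants"], state_vt))  # [13]
```

Each feature document depends only on its own URL, so it can be cached once and reused by every search that touches that feature.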

Gone!

? flower=red & leaf=smooth & state=vt
/api/feature/flower
/api/feature/leaf
/api/feature/state
55-matching-species.jpg
n features?
Only n URLs!
gobotany-logo.png
Giving the client more
data meant less work
for our servers!

Data

Be aware of Python’s
growing data toolset
NumPy arrays
Pandas dataframes
person[person.age >= 21]

Blaze

person[person.age >= 21]

SELECT * FROM person p
  WHERE p.age >= 21;
person[person.age >= 21]

db.people.find({age: {$gte: 21}})
Python is becoming
an important tool for
large data sets
Watch for how these
data tools might fit
into your projects

Django

Flexible
denormalization
No dependencies
URLs are simply text

codedata

Views can be
simple procedures
Privileges
relational
databases
Finding the simplest
possible thing is hard

Django:

First-to-market
with simplicity
Thank you!
@brandon_rhodes