Know Thy Database
──────────────────
Brandon Rhodes
PyCon PL
Mąchocice, Poland
2011 September 24
──────────────────
The Argument.
Easier to read data than code
Your data should be visible
Your data should be easy to read
Your data should be easy to change
NoSQL document databases can help
The Test
You have 5 seconds to study
the following Python code
that creates a dict
>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname . split ( '.' )
>>> for i in range ( len ( parts )):
... subparts = parts [ i :]
... subname = '.' . join ( subparts )
... map1 [ subname ] = subparts
So — do you know how the
resulting dict looks?
I think the code was clear
But just in case,
here is different code
that creates the same dict
(You will have 5 seconds)
>>> map2 = {}
>>> hostname = 'www.google.com'
>>> while '.' in hostname :
... map2 [ hostname ] = hostname . split ( '.' )
... hostname = hostname . split ( '.' , 1 )[ 1 ]
... else :
... map2 [ hostname ] = [ hostname ]
Do you know how the
resulting dict looks?
Finally, here is a third
example of how to build the dict
(You will have 5 seconds)
>>> map3 = {
... 'www.google.com' : [ 'www' , 'google' , 'com' ],
... 'google.com' : [ 'google' , 'com' ],
... 'com' : [ 'com' ],
... }
All right!
You had equal time
with all three slides
Those three slides
describe the exact same data
>>> map1 == map2 == map3
True
What is this data structure?
A dict whose keys are the hostname
www.google.com and the domains above it,
and whose values are the '.' separated
components of each domain name
>>> map3 = {
... 'www.google.com' : [ 'www' , 'google' , 'com' ],
... 'google.com' : [ 'google' , 'com' ],
... 'com' : [ 'com' ],
... }
>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname . split ( '.' )
>>> for i in range ( len ( parts )):
... subparts = parts [ i :]
... subname = '.' . join ( subparts )
... map1 [ subname ] = subparts
Compare the code ↑ with the data ↓
>>> map3 = {
... 'www.google.com' : [ 'www' , 'google' , 'com' ],
... 'google.com' : [ 'google' , 'com' ],
... 'com' : [ 'com' ],
... }
>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname . split ( '.' )
>>> for i in range ( len ( parts )):
... subparts = parts [ i :]
... subname = '.' . join ( subparts )
... map1 [ subname ] = subparts
The code and the data are isomorphic
>>> map3 = {
... 'www.google.com' : [ 'www' , 'google' , 'com' ],
... 'google.com' : [ 'google' , 'com' ],
... 'com' : [ 'com' ],
... }
>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname . split ( '.' )
>>> for i in range ( len ( parts )):
... subparts = parts [ i :]
... subname = '.' . join ( subparts )
... map1 [ subname ] = subparts
But the data can be read more quickly!
>>> map3 = {
... 'www.google.com' : [ 'www' , 'google' , 'com' ],
... 'google.com' : [ 'google' , 'com' ],
... 'com' : [ 'com' ],
... }
The Mythical Man-Month by Fred Brooks
Famous book about software project
management, famous for several quotes
The Mythical Man-Month by Fred Brooks
“The bearing of a child takes nine months,
no matter how many women are assigned.”
Fred Books also talks
about code and data
The Mythical Man-Month by Fred Brooks
“Show me your flowchart
and conceal your tables,
and I shall continue to be mystified.
Show me your tables ,
and I won't usually need your flowchart;
it'll be obvious.”
What is a flowchart?
What is a table?
A picture of how your data
is laid out in memory or disk
The Mythical Man-Month by Fred Brooks
“Show me your flowchart
and conceal your tables,
and I shall continue to be mystified.
Show me your tables ,
and I won't usually need your flowchart;
it'll be obvious.”
I will, in fact, claim that the difference
between a bad programmer and a good one
is whether they consider their code or
their data structures more important.
Bad programmers worry about the code.
Good programmers worry about data structures
and their relationships.
— Linus Torvalds
Interesting Fact:
Fred Brooks was comparing
a picture (“your flowchart”)
with another picture (“your tables”)
Back then, programmers worked
with three things at once!
What does this mean?
Code was so ugly in
Assembler, Fortran, PL/I, and COBOL
that you needed separate pictures
of both your algorithm
and your data!
Fred Brooks talks about
a flowchart and tables
but
Our example simply
used Python and Python !
>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname . split ( '.' )
>>> for i in range ( len ( parts )):
... subparts = parts [ i :]
... subname = '.' . join ( subparts )
... map1 [ subname ] = subparts
>>> map3 = {
... 'www.google.com' : [ 'www' , 'google' , 'com' ],
... 'google.com' : [ 'google' , 'com' ],
... 'com' : [ 'com' ],
... }
Python is so beautiful
that you can use it to draw
pretty pictures of both
algorithms and data
Beautiful ideas are important
Feynman
../2011-09-pyatl/Feynmann_Diagram_Gluon_Radiation.svg
Feynman diagram
The Nobel Prize in Physics 1965
Sin-Itiro Tomonaga
Julian Schwinger
Richard P. Feynman
was awarded jointly
“for their fundamental work
in quantum electrodynamics (QED),
with deep-ploughing consequences for the
physics of elementary particles.”
All three scientists
had the same idea
the same math
But physics students still use
Feynman diagrams
Why?
Because all three scientists
were working on the same ideas,
but Feynman made those
ideas easy to see
Visualization is a
key component of thinking
See the recent book
Creating Scientific Concepts
by Nancy Nersessian
Nersessian:
To produce abstract differential equations
Maxwell had to use visual physical analogies
Important Python feature
You do not have to leave
Python to see your data
Python offers
syntactic support
for fundamental containers
tuple , list , set , dict
Right in your code,
you can draw picture of the
data structure you want Python to build
>>> map3 = {
... 'www.google.com' : [ 'www' , 'google' , 'com' ],
... 'google.com' : [ 'google' , 'com' ],
... 'com' : [ 'com' ],
... }
Python programmers share a common picture
in their heads of how the data types look
we all share a common visual language
for our articles, diagrams, and debugging
“Declarative”
Means showing what you want
instead of writing a procedure
that produces what you want
It is the difference between:
>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname . split ( '.' )
>>> for i in range ( len ( parts )):
... subparts = parts [ i :]
... subname = '.' . join ( subparts )
... map1 [ subname ] = subparts
↑ procedural / declarative ↓
>>> map3 = {
... 'www.google.com' : [ 'www' , 'google' , 'com' ],
... 'google.com' : [ 'google' , 'com' ],
... 'com' : [ 'com' ],
... }
“Declarative”
Do you want to be popular?
Write an API
that lets programmers
draw a picture with code
SQLAlchemy
Once upon a time,
you had to build tables and classes
then connect them up with code
Even back then, table definitions
were pretty declarative
users_table = Table ( 'users' , metadata ,
Column ( 'id' , Integer , primary_key = True ),
Column ( 'name' , String ),
Column ( 'password' , String ),
Column ( 'group_id' , Integer ),
)
But using the ORM to connect tables
to classes was not declarative
users_table = Table ( 'users' , metadata ,
Column ( 'id' , Integer , primary_key = True ),
Column ( 'name' , String ),
Column ( 'password' , String ),
Column ( 'group_id' , Integer ),
)
class User ( object ):
pass
mapper ( User , users_table , properties = {
'group' : relation ( Group ),
})
Jonathan LaCour became famous
for Exilir , a declarative layer
written on top of SQLAlchemy
So Mike Bayer added a declarative
layer to SQLAlchemy itself!
from sqlalchemy.ext.declarative \
import declarative_base
Base = declarative_base ()
class User ( Base ):
__tablename__ = 'users'
id = Column ( Integer , primary_key = True )
name = Column ( String )
password = Column ( String )
group_id = Column ( Integer )
group = relation ( Group )
It seems that
everyone wants a picture
not a procedure
Declarative syntax
People are excited about it
SQLAlchemy has a declarative syntax
Django ORM has a declarative syntax
But Python has been supporting declarative since 1991!
This is really a declaration :
mydict = { 'Fibonacci' : [ 1 , 1 , 2 , 3 , 5 ]}
vs languages that force you to write:
mylist = list ()
mylist . append ( 1 )
mylist . append ( 1 )
mylist . append ( 2 )
mylist . append ( 3 )
mylist . append ( 5 )
mydict = dict ()
mydict [ 'Fibonacci' ] = mylist
Example comments
from Stack Overflow:
# I did this the other way
# I don't know why I need this, but it
# stops the people being upside-down
x = - x
# If this comment is removed
# the program will blow up
# Always returns true.
#
def isAvailable ():
return False
Comments are controversial
Extreme Programming (XP)
“If a method needs comments it needs to be rewritten”
“Good code is self-documenting”
“Comments lie, code doesn’t”
This Extreme Programming advice
is good if it makes you think about
how your code might be clearer
# the zero means "not found"
day_of_week = 0
...
if not day_of_week :
...
↑ less clear / more clear ↓
NOT_FOUND = 0
day_of_week = NOT_FOUND
...
if day_of_week == NOT_FOUND :
...
But contra the XP people,
I think comments are sometimes very valuable!
I have two favorite uses for comments
Comments record important knowledge
that I learned while using a
library someone else wrote
thirdparty . init ( encoding = 'utf-8' )
db = thirdparty . open ( dbpath )
# Drat. "thirdparty" ignores the
# global encoding on old version-3
# databases; so we set it again:
db . set_encoding ( 'utf-8' )
But even more important:
comments can draw pictures of data
Very simple code, easy to read,
but without pictures:
>>> def simplify_name ( name ):
... w = name . split ()
... while w and '.' in w [ 0 ]:
... del w [ 0 ]
... while w and '.' in w [ - 1 ]:
... del w [ - 1 ]
... return ' ' . join ( w )
The operations are simple and clear
but the purpose of the code is not!
>>> def simplify_name ( name ):
... w = name . split () # 'Dr.' 'Ed' 'Smith' 'Jr.'
... while w and '.' in w [ 0 ]: # 'Dr.'
... del w [ 0 ]
... while w and '.' in w [ - 1 ]: # 'Jr.'
... del w [ - 1 ]
... return ' ' . join ( w ) # 'Ed Smith'
>>> simplify_name ( 'Mr. Gregory M. Jones Sr.' )
'Gregory M. Jones'
You can also use a docstring:
>>> def simplify_name ( name ):
... """Remove titles and suffixes from a name.
...
... >>> simplify_name('Dr. Ed Smith Jr.')
... 'Ed Smith'
...
... """
... w = name . split ()
... while w and '.' in w [ 0 ]: # 'Dr.'
... del w [ 0 ]
... while w and '.' in w [ - 1 ]: # 'Jr.'
... del w [ - 1 ]
... return ' ' . join ( w )
return ' ' . join ( w ) # 'Ed Smith'
Note that these comments use the
canonical repr() syntax for strings
Because Python gives us a standard way
to represent data, our comments will mean
the same thing to all Python programmers
So Python's powerful data syntax
leads both to clear code and to
clear communication about code
So let us return to Fred Brooks
The Mythical Man-Month by Fred Brooks
“Show me your flowchart
and conceal your tables,
and I shall continue to be mystified.
Show me your tables ,
and I won't usually need your flowchart;
it'll be obvious.”
What about your database?
Can you honestly say this?
Show me your code,
and I shall be mystified.
Show me your database,
and your code will be obvious!
Is your database clear and simple?
If Fred Brooks is right,
data
should be the simplest
part of your software
Do you understand your database?
Do you dare to look inside?
Do you consult its schema?
Can you load data directly?
My claim:
SQL scares people away
from their own data
The SQL Inversion
Python code — readable but complex
ORM — declarative and simple
Database — complex mystery!
SQL
Powerful language
Excellent support for relational thinking
Declarative (!) description of query
I am a SQL expert,
I have used it for 10 years,
I can do things with SQL that
I cannot do in other languages!
I ♥ SQL
But SQL:
Different paradigm than Python
Uses a completely different syntax
Makes you learn
a whole new language
if you want direct access
to your own data
No wonder people avoid SQL!
But hiding behind an ORM deprives
you of your database's true power
Cannot do bulk operations
Cannot see raw table rows
Cannot use other tools
What about ZODB?
I owe an apology
to Martijn Faassen
I owe an apology
to Martijn Faassen
(And since I will be
in Holland on Monday,
I had better give my
apology now !)
Grok
Declarative (!) web framework built
on top of Zope Component Framework
(“Bluebream”)
I was so active in Grok,
Martijn invited me to the week-long
with other developers at his house!
Martijn Faassen
Good programmer
Wonderful host
Very generous
But I never used Grok again
Sorry, Martijn
Why?
Easy excuses:
Not popular in United States
Not a good fit for my projects
Projects had legacy databases
But what is the real reason?
Too Much Magic
At the sprint I watched
Christian Theune, a very good programmer,
fix a problem in a client's ZODB
It was terrible
His code looked like line noise
Why?
Because at the foundation of the ZODB
is the complexity of Python objects
(__dict__ , __getattr__ , …)
the ZODB's own C-language extensions
that make objects even more complex
What is beneath your objects?
C code like:
if ( self -> cache ) {
/* Create a node in the ring
for this unghostified object . */
self -> cache -> non_ghost_count ++ ;
self -> cache -> total_estimated_size +=
_estimated_size_in_bytes ( self -> estimated_size );
ring_add ( & self -> cache -> ring_home , & self -> ring );
Py_INCREF ( self );
}
ZODB
Stores pickles of your Python objects
References save both (classname, id)
So classname is duplicated everywhere
Cannot load your object without the class
So your class can never change names
But worst of all
Because the ZODB stores objects
you can never get outside of the Python
object system to see something simpler
The Mythical Man-Month by Fred Brooks
“Show me your flowchart
and conceal your tables,
and I shall continue to be mystified.
Show me your tables ,
and I won't usually need your flowchart;
it'll be obvious.”
Sorry, Fred Brooks
ZODB has no “tables”!
ZODB
No native data representation
that can be examined or manipulated
outside your Python application
NoSQL
Because of the vast popularity of the
very first NoSQL database ever!
When it first came out:
No transactions
No joins
No sub-queries
That's NoSQL!
The majority of programmers
have always wanted something simple
that helps them remember indexed values
Example
MongoDB — a document database
MongoDB
Exciting because it makes your database
as simple as your variables in Python
{
name : "Rosa californica" ,
attrs : {
habitat : [ "chaparral" , "woodlands" ],
flower_color : [ "pink" , "white" , "magenta" ],
}
}
MongoDB documents
Use public JSON standard
No separate schema to maintain
Look like Python dict and list
Language independent
{
name : "Rosa californica" ,
attrs : {
habitat : [ "chaparral" , "woodlands" ],
flower_color : [ "pink" , "white" , "magenta" ],
}
}
{ 'id' : 34 , 'name' : 'Department of Engineering' ,
'employee_ids' : [ 24 , 51 , 381 , 382 , 561 ]}
↑ MongoDB / SQL ↓
"department" "department_employee"
id name dept_id empl_id
┌──────────┐ ┌─────────┐
│ : │ │ : │
│ 34 'Dept…' │ │ 34 24 │
│ : │ │ 34 51 │
└──────────┘ │ 34 381 │
│ 34 382 │
│ 34 561 │
│ : │
└─────────┘
Relations
Schemas, mass migration
Break data into small pieces
Need SQL or ORM to reassemble
"department" "department_employee"
id name dept_id empl_id
┌──────────┐ ┌─────────┐
│ : │ │ : │
│ 34 'Dept…' │ │ 34 24 │
│ : │ │ 34 51 │
└──────────┘ │ 34 381 │
│ 34 382 │
│ 34 561 │
│ : │
└─────────┘
(Also, putting this data in one place
instead of in two tables plus an index
can vastly improve your I/O performance!)
"department" "department_employee"
id name dept_id empl_id
┌──────────┐ ┌─────────┐
│ : │ │ : │
│ 34 'Dept…' │ │ 34 24 │
│ : │ │ 34 51 │
└──────────┘ │ 34 381 │
│ 34 382 │
│ 34 561 │
│ : │
└─────────┘
In conclusion,
♥ Your Database
Learn SQL
or
Try a Document Database
If Fred Brooks is right,
data should be the easy part
and drive understanding of your code
In conclusion,
♥ Your Database
Learn SQL
or
Try a Document Database
If Fred Brooks is right,
data should be the easy part
and drive understanding of your code
In conclusion,
♥ Your Database
Learn SQL
or
Try a Document Database
If Fred Brooks is right,
data should be the easy part
and drive understanding of your code
Any questions?