Know Thy Database



──────────────────
Brandon Rhodes
PyCon PL
Mąchocice, Poland
2011 September 24
──────────────────

The Argument.

The Test

You have 5 seconds to study
the following Python code
that creates a dict
>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname.split('.')
>>> for i in range(len(parts)):
...     subparts = parts[i:]
...     subname = '.'.join(subparts)
...     map1[subname] = subparts
So — do you know how the
resulting dict looks?

I think the code was clear

But just in case,
here is different code
that creates the same dict

(You will have 5 seconds)

>>> map2 = {}
>>> hostname = 'www.google.com'
>>> while '.' in hostname:
...     map2[hostname] = hostname.split('.')
...     hostname = hostname.split('.', 1)[1]
... else:
...     map2[hostname] = [ hostname ]
Do you know how the
resulting dict looks?
Finally, here is a third
example of how to build the dict

(You will have 5 seconds)

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }

All right!

You had equal time
with all three slides
Those three slides
describe the exact same data
>>> map1 == map2 == map3
True
What is this data structure?
A dict whose keys are the hostname
www.google.com and the domains above it,
and whose values are the '.' separated
components of each domain name
>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }
>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname.split('.')
>>> for i in range(len(parts)):
...     subparts = parts[i:]
...     subname = '.'.join(subparts)
...     map1[subname] = subparts

Compare the code ↑ with the data ↓

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }
>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname.split('.')
>>> for i in range(len(parts)):
...     subparts = parts[i:]
...     subname = '.'.join(subparts)
...     map1[subname] = subparts

The code and the data are isomorphic

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }
>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname.split('.')
>>> for i in range(len(parts)):
...     subparts = parts[i:]
...     subname = '.'.join(subparts)
...     map1[subname] = subparts

But the data can be read more quickly!

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }
../2011-09-pyatl/mythical-cover.jpg

The Mythical Man-Month by Fred Brooks

Famous book about software project
management, famous for several quotes

The Mythical Man-Month by Fred Brooks

“The bearing of a child takes nine months,
no matter how many women are assigned.”
Fred Books also talks
about code and data

The Mythical Man-Month by Fred Brooks

“Show me your flowchart
and conceal your tables,
and I shall continue to be mystified.
Show me your tables,
and I won't usually need your flowchart;
it'll be obvious.”
(1975)

What is a flowchart?

tech_support_cheat_sheet.png

What is a table?

A picture of how your data
is laid out in memory or disk

The Mythical Man-Month by Fred Brooks

“Show me your flowchart
and conceal your tables,
and I shall continue to be mystified.
Show me your tables,
and I won't usually need your flowchart;
it'll be obvious.”
(1975)
I will, in fact, claim that the difference
between a bad programmer and a good one
is whether they consider their code or
their data structures more important.
Bad programmers worry about the code.
Good programmers worry about data structures
and their relationships.

— Linus Torvalds

Interesting Fact:

Fred Brooks was comparing
a picture (“your flowchart”)
with another picture (“your tables”)
Back then, programmers worked
with three things at once!

What does this mean?

Code was so ugly in
Assembler, Fortran, PL/I, and COBOL
that you needed separate pictures
of both your algorithm
and your data!
Fred Brooks talks about
a flowchart and tables

but

Our example simply
used Python and Python!
>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname.split('.')
>>> for i in range(len(parts)):
...     subparts = parts[i:]
...     subname = '.'.join(subparts)
...     map1[subname] = subparts
>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }
Python is so beautiful
that you can use it to draw
pretty pictures of both
algorithms and data

Beautiful ideas are important

../2011-09-pyatl/feynman.jpg

Feynman

../2011-09-pyatl/Feynmann_Diagram_Gluon_Radiation.svg

Feynman diagram

The Nobel Prize in Physics 1965

Sin-Itiro Tomonaga
Julian Schwinger
Richard P. Feynman
was awarded jointly
“for their fundamental work
in quantum electrodynamics (QED),
with deep-ploughing consequences for the
physics of elementary particles.”
All three scientists
had the same idea
the same math
But physics students still use
Feynman diagrams

Why?

Because all three scientists
were working on the same ideas,
but Feynman made those
ideas easy to see
Visualization is a
key component of thinking
See the recent book
Creating Scientific Concepts
by Nancy Nersessian
../2011-09-pyatl/maxwell-light.jpg

Nersessian:

To produce abstract differential equations
Maxwell had to use visual physical analogies

Important Python feature

You do not have to leave
Python to see your data
Python offers
syntactic support
for fundamental containers

tuple, list, set, dict

Right in your code,
you can draw picture of the
data structure you want Python to build
>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }
Python programmers share a common picture
in their heads of how the data types look
and therefore
we all share a common visual language
for our articles, diagrams, and debugging

“Declarative”

Means showing what you want
instead of writing a procedure
that produces what you want

It is the difference between:

>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname.split('.')
>>> for i in range(len(parts)):
...     subparts = parts[i:]
...     subname = '.'.join(subparts)
...     map1[subname] = subparts

↑ procedural / declarative ↓

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }

“Declarative”

Do you want to be popular?
Write an API
that lets programmers
draw a picture with code

SQLAlchemy

Once upon a time,
you had to build tables and classes
then connect them up with code
Even back then, table definitions
were pretty declarative
users_table = Table('users', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String),
    Column('password', String),
    Column('group_id', Integer),
    )
But using the ORM to connect tables
to classes was not declarative
users_table = Table('users', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String),
    Column('password', String),
    Column('group_id', Integer),
    )

class User(object):
    pass

mapper(User, users_table, properties={
    'group': relation(Group),
    })
Jonathan LaCour became famous
for Exilir, a declarative layer
written on top of SQLAlchemy
So Mike Bayer added a declarative
layer to SQLAlchemy itself!
from sqlalchemy.ext.declarative \
    import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    password = Column(String)
    group_id = Column(Integer)

    group = relation(Group)
It seems that
everyone wants a picture
not a procedure

Declarative syntax

This is really a declaration:

mydict = {'Fibonacci': [1, 1, 2, 3, 5]}

vs languages that force you to write:

mylist = list()
mylist.append(1)
mylist.append(1)
mylist.append(2)
mylist.append(3)
mylist.append(5)
mydict = dict()
mydict['Fibonacci'] = mylist

Let's Talk About Comments

Example comments
from Stack Overflow:
# drunk, fix later
return 1   # returns 1
# I did this the other way
# I don't know why I need this, but it
# stops the people being upside-down

x = -x
# Magic, do not touch
# If this comment is removed
# the program will blow up
# Comment this later

And, my favorite:

# Always returns true.
#
def isAvailable():
   return False

Comments are controversial

Extreme Programming (XP)

This Extreme Programming advice
is good if it makes you think about
how your code might be clearer
# the zero means "not found"
day_of_week = 0
...
if not day_of_week:
    ...

↑ less clear / more clear ↓

NOT_FOUND = 0
day_of_week = NOT_FOUND
...
if day_of_week == NOT_FOUND:
    ...
But contra the XP people,
I think comments are sometimes very valuable!
I have two favorite uses for comments
Comments record important knowledge
that I learned while using a
library someone else wrote
thirdparty.init(encoding='utf-8')
db = thirdparty.open(dbpath)

# Drat. "thirdparty" ignores the
# global encoding on old version-3
# databases; so we set it again:

db.set_encoding('utf-8')
But even more important:
comments can draw pictures of data
Very simple code, easy to read,
but without pictures:
>>> def simplify_name(name):
...     w = name.split()
...     while w and '.' in w[0]:
...         del w[0]
...     while w and '.' in w[-1]:
...         del w[-1]
...     return ' '.join(w)
The operations are simple and clear
but the purpose of the code is not!
With comment-pictures:
>>> def simplify_name(name):
...     w = name.split() # 'Dr.' 'Ed' 'Smith' 'Jr.'
...     while w and  '.' in w[0]:  # 'Dr.'
...         del w[0]
...     while w and '.' in w[-1]:  # 'Jr.'
...         del w[-1]
...     return ' '.join(w)         # 'Ed Smith'
>>> simplify_name('Mr. Gregory M. Jones Sr.')
'Gregory M. Jones'
You can also use a docstring:
>>> def simplify_name(name):
...     """Remove titles and suffixes from a name.
...
...     >>> simplify_name('Dr. Ed Smith Jr.')
...     'Ed Smith'
...
...     """
...     w = name.split()
...     while w and  '.' in w[0]:  # 'Dr.'
...         del w[0]
...     while w and '.' in w[-1]:  # 'Jr.'
...         del w[-1]
...     return ' '.join(w)
return ' '.join(w)  # 'Ed Smith'
Note that these comments use the
canonical repr() syntax for strings
Because Python gives us a standard way
to represent data, our comments will mean
the same thing to all Python programmers
So Python's powerful data syntax
leads both to clear code and to
clear communication about code

So let us return to Fred Brooks

The Mythical Man-Month by Fred Brooks

“Show me your flowchart
and conceal your tables,
and I shall continue to be mystified.
Show me your tables,
and I won't usually need your flowchart;
it'll be obvious.”
(1975)

What about your database?

Can you honestly say this?

Show me your code,
and I shall be mystified.
Show me your database,
and your code will be obvious!

Is your database clear and simple?

If Fred Brooks is right,

data

should be the simplest
part of your software

Do you understand your database?

My claim:

SQL scares people away
from their own data

The SQL Inversion

Python code — readable but complex
ORM — declarative and simple
Database — complex mystery!

SQL

I am a SQL expert,
I have used it for 10 years,
I can do things with SQL that
I cannot do in other languages!

I ♥ SQL

But SQL:

Different paradigm than Python

Uses a completely different syntax

Makes you learn
a whole new language
if you want direct access
to your own data

No wonder people avoid SQL!

But hiding behind an ORM deprives
you of your database's true power

What about ZODB?

I owe an apology
to Martijn Faassen
faassen.jpg
I owe an apology
to Martijn Faassen
(And since I will be
in Holland on Monday,
I had better give my
apology now!)

Grok

Declarative (!) web framework built
on top of Zope Component Framework
(“Bluebream”)
I was so active in Grok,
Martijn invited me to the week-long
2009 Cave Sprint
with other developers at his house!

Martijn Faassen

But I never used Grok again

Sorry, Martijn

Why?

Easy excuses:

But what is the real reason?

Too Much Magic

At the sprint I watched
Christian Theune, a very good programmer,
fix a problem in a client's ZODB

It was terrible

His code looked like line noise

Why?

Because at the foundation of the ZODB
is the complexity of Python objects
(__dict__, __getattr__, …)
+
the ZODB's own C-language extensions
that make objects even more complex

What is beneath your objects?

C code like:

if (self->cache) {
    /* Create a node in the ring
       for this unghostified object. */
    self->cache->non_ghost_count++;
    self->cache->total_estimated_size +=
      _estimated_size_in_bytes(self->estimated_size);
    ring_add(&self->cache->ring_home, &self->ring);
    Py_INCREF(self);
}

ZODB

But worst of all

Because the ZODB stores objects
you can never get outside of the Python
object system to see something simpler

The Mythical Man-Month by Fred Brooks

“Show me your flowchart
and conceal your tables,
and I shall continue to be mystified.
Show me your tables,
and I won't usually need your flowchart;
it'll be obvious.”
(1975)

Sorry, Fred Brooks

ZODB has no “tables”!

ZODB

No native data representation
that can be examined or manipulated
outside your Python application

NoSQL

People want NoSQL!
How do we know?
Because of the vast popularity of the
very first NoSQL database ever!
mysql-logo.jpg
mysql-logo.jpg

When it first came out:

mysql-logo.jpg
The majority of programmers
have always wanted something simple
that helps them remember indexed values

Why is NoSQL exciting?

Example

MongoDB — a document database

MongoDB

Exciting because it makes your database
as simple as your variables in Python
{
  name: "Rosa californica",
  attrs: {
    habitat: ["chaparral", "woodlands"],
    flower_color: ["pink", "white", "magenta"],
  }
}

MongoDB documents

{
  name: "Rosa californica",
  attrs: {
    habitat: ["chaparral", "woodlands"],
    flower_color: ["pink", "white", "magenta"],
  }
}
{'id': 34, 'name': 'Department of Engineering',
 'employee_ids': [24, 51, 381, 382, 561]}

↑ MongoDB / SQL ↓

"department" "department_employee"
  id  name      dept_id empl_id
┌──────────┐      ┌─────────┐
     :               :   
34 'Dept…'       34  24  
     :           34  51  
└──────────┘       34  381 
                   34  382 
                   34  561 
                       :   
                  └─────────┘

Relations

"department" "department_employee"
  id  name      dept_id empl_id
┌──────────┐      ┌─────────┐
     :               :   
34 'Dept…'       34  24  
     :           34  51  
└──────────┘       34  381 
                   34  382 
                   34  561 
                       :   
                  └─────────┘
(Also, putting this data in one place
instead of in two tables plus an index
can vastly improve your I/O performance!)
"department" "department_employee"
  id  name      dept_id empl_id
┌──────────┐      ┌─────────┐
     :               :   
34 'Dept…'       34  24  
     :           34  51  
└──────────┘       34  381 
                   34  382 
                   34  561 
                       :   
                  └─────────┘

In conclusion,

♥ Your Database

Learn SQL
or
Try a Document Database
If Fred Brooks is right,
data should be the easy part
and drive understanding of your code

In conclusion,

♥ Your Database

Learn SQL
or
Try a Document Database
If Fred Brooks is right,
data should be the easy part
and drive understanding of your code
Thank you!

In conclusion,

♥ Your Database

Learn SQL
or
Try a Document Database
If Fred Brooks is right,
data should be the easy part
and drive understanding of your code

Any questions?