Know Thy Database

──────────────────
Brandon Rhodes
PyCon PL
Mąchocice, Poland
2011 September 24
──────────────────

The Argument.

Easier to read data than code
Your data should be visible
Your data should be easy to read
Your data should be easy to change
NoSQL document databases can help

The Test

You have 5 seconds to study
the following Python code
that creates a dict

>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname.split('.')
>>> for i in range(len(parts)):
...     subparts = parts[i:]
...     subname = '.'.join(subparts)
...     map1[subname] = subparts

So — do you know how the

resulting dict looks?

I think the code was clear

But just in case,
here is different code
that creates the same dict

(You will have 5 seconds)

>>> map2 = {}
>>> hostname = 'www.google.com'
>>> while '.' in hostname:
...     map2[hostname] = hostname.split('.')
...     hostname = hostname.split('.', 1)[1]
... else:
...     map2[hostname] = [ hostname ]

Do you know how the

resulting dict looks?

Finally, here is a third

example of how to build the dict

(You will have 5 seconds)

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }

All right!

You had equal time

with all three slides

Those three slides

describe the exact same data

>>> map1 == map2 == map3
True

What is this data structure?

A dict whose keys are the hostname
www.google.com and the domains above it,
and whose values are the '.' separated
components of each domain name

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }

>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname.split('.')
>>> for i in range(len(parts)):
...     subparts = parts[i:]
...     subname = '.'.join(subparts)
...     map1[subname] = subparts

Compare the code ↑ with the data ↓

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }

>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname.split('.')
>>> for i in range(len(parts)):
...     subparts = parts[i:]
...     subname = '.'.join(subparts)
...     map1[subname] = subparts

The code and the data are isomorphic

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }

>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname.split('.')
>>> for i in range(len(parts)):
...     subparts = parts[i:]
...     subname = '.'.join(subparts)
...     map1[subname] = subparts

But the data can be read more quickly!

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }

The Mythical Man-Month by Fred Brooks

Famous book about software project

management, famous for several quotes

The Mythical Man-Month by Fred Brooks

“The bearing of a child takes nine months,

no matter how many women are assigned.”

Fred Books also talks

about code and data

The Mythical Man-Month by Fred Brooks

“Show me your flowchart
and conceal your tables,
and I shall continue to be mystified.

Show me your tables,
and I won't usually need your flowchart;
it'll be obvious.”

(1975)

What is a flowchart?

What is a table?

A picture of how your data

is laid out in memory or disk

The Mythical Man-Month by Fred Brooks

“Show me your flowchart
and conceal your tables,
and I shall continue to be mystified.

Show me your tables,
and I won't usually need your flowchart;
it'll be obvious.”

(1975)

I will, in fact, claim that the difference
between a bad programmer and a good one
is whether they consider their code or
their data structures more important.

Bad programmers worry about the code.
Good programmers worry about data structures
and their relationships.

— Linus Torvalds

Interesting Fact:

Fred Brooks was comparing
a picture (“your flowchart”)
with another picture (“your tables”)

Back then, programmers worked

with three things at once!

Flowcharts
Tables
Code

What does this mean?

Code was so ugly in
Assembler, Fortran, PL/I, and COBOL
that you needed separate pictures
of both your algorithm
and your data!

Fred Brooks talks about

a flowchart and tables

but

Our example simply

used Python and Python!

>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname.split('.')
>>> for i in range(len(parts)):
...     subparts = parts[i:]
...     subname = '.'.join(subparts)
...     map1[subname] = subparts

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }

Python is so beautiful
that you can use it to draw
pretty pictures of both
algorithms and data

Beautiful ideas are important

Feynman

Feynman diagram

The Nobel Prize in Physics 1965

Sin-Itiro Tomonaga
Julian Schwinger
Richard P. Feynman

was awarded jointly
“for their fundamental work
in quantum electrodynamics (QED),

with deep-ploughing consequences for the

physics of elementary particles.”

All three scientists
had the same idea
the same math

But physics students still use

Feynman diagrams

Why?

Because all three scientists
were working on the same ideas,
but Feynman made those
ideas easy to see

Visualization is a

key component of thinking

See the recent book
Creating Scientific Concepts
by Nancy Nersessian

Nersessian:

To produce abstract differential equations

Maxwell had to use visual physical analogies

Important Python feature

You do not have to leave

Python to see your data

Python offers
syntactic support
for fundamental containers

tuple, list, set, dict

Right in your code,
you can draw picture of the
data structure you want Python to build

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }

Python programmers share a common picture

in their heads of how the data types look

and therefore

we all share a common visual language

for our articles, diagrams, and debugging

“Declarative”

Means showing what you want
instead of writing a procedure
that produces what you want

It is the difference between:

>>> map1 = {}
>>> hostname = 'www.google.com'
>>> parts = hostname.split('.')
>>> for i in range(len(parts)):
...     subparts = parts[i:]
...     subname = '.'.join(subparts)
...     map1[subname] = subparts

↑ procedural / declarative ↓

>>> map3 = {
...     'www.google.com': ['www', 'google', 'com'],
...     'google.com': ['google', 'com'],
...     'com': ['com'],
...     }

“Declarative”

Do you want to be popular?

Write an API
that lets programmers
draw a picture with code

SQLAlchemy

Once upon a time,
you had to build tables and classes
then connect them up with code

Even back then, table definitions

were pretty declarative

users_table = Table('users', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String),
    Column('password', String),
    Column('group_id', Integer),
    )

But using the ORM to connect tables

to classes was not declarative

users_table = Table('users', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String),
    Column('password', String),
    Column('group_id', Integer),
    )

class User(object):
    pass

mapper(User, users_table, properties={
    'group': relation(Group),
    })

Jonathan LaCour became famous
for Exilir, a declarative layer
written on top of SQLAlchemy

So Mike Bayer added a declarative

layer to SQLAlchemy itself!

from sqlalchemy.ext.declarative \
    import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    password = Column(String)
    group_id = Column(Integer)

    group = relation(Group)

It seems that
everyone wants a picture
not a procedure

Declarative syntax

People are excited about it
SQLAlchemy has a declarative syntax
Django ORM has a declarative syntax
But Python has been supporting declarative since 1991!

This is really a declaration:

mydict = {'Fibonacci': [1, 1, 2, 3, 5]}

vs languages that force you to write:

mylist = list()
mylist.append(1)
mylist.append(1)
mylist.append(2)
mylist.append(3)
mylist.append(5)
mydict = dict()
mydict['Fibonacci'] = mylist

Let's Talk About Comments

Example comments

from Stack Overflow:

# drunk, fix later

return 1   # returns 1

# I did this the other way

# I don't know why I need this, but it
# stops the people being upside-down

x = -x

# Magic, do not touch

# If this comment is removed
# the program will blow up

# Comment this later

And, my favorite:

# Always returns true.
#
def isAvailable():
   return False

Comments are controversial

Extreme Programming (XP)

“If a method needs comments it needs to be rewritten”
“Good code is self-documenting”
“Comments lie, code doesn’t”

This Extreme Programming advice
is good if it makes you think about
how your code might be clearer

# the zero means "not found"
day_of_week = 0
...
if not day_of_week:
    ...

↑ less clear / more clear ↓

NOT_FOUND = 0
day_of_week = NOT_FOUND
...
if day_of_week == NOT_FOUND:
    ...

But contra the XP people,

I think comments are sometimes very valuable!

I have two favorite uses for comments

Comments record important knowledge
that I learned while using a
library someone else wrote

thirdparty.init(encoding='utf-8')
db = thirdparty.open(dbpath)

# Drat. "thirdparty" ignores the
# global encoding on old version-3
# databases; so we set it again:

db.set_encoding('utf-8')

But even more important:

comments can draw pictures of data

Very simple code, easy to read,

but without pictures:

>>> def simplify_name(name):
...     w = name.split()
...     while w and '.' in w[0]:
...         del w[0]
...     while w and '.' in w[-1]:
...         del w[-1]
...     return ' '.join(w)

The operations are simple and clear

but the purpose of the code is not!

With comment-pictures:

>>> def simplify_name(name):
...     w = name.split() # 'Dr.' 'Ed' 'Smith' 'Jr.'
...     while w and  '.' in w[0]:  # 'Dr.'
...         del w[0]
...     while w and '.' in w[-1]:  # 'Jr.'
...         del w[-1]
...     return ' '.join(w)         # 'Ed Smith'

>>> simplify_name('Mr. Gregory M. Jones Sr.')
'Gregory M. Jones'

You can also use a docstring:

>>> def simplify_name(name):
...     """Remove titles and suffixes from a name.
...
...     >>> simplify_name('Dr. Ed Smith Jr.')
...     'Ed Smith'
...
...     """
...     w = name.split()
...     while w and  '.' in w[0]:  # 'Dr.'
...         del w[0]
...     while w and '.' in w[-1]:  # 'Jr.'
...         del w[-1]
...     return ' '.join(w)

return ' '.join(w)  # 'Ed Smith'

Note that these comments use the

canonical repr() syntax for strings

Because Python gives us a standard way
to represent data, our comments will mean
the same thing to all Python programmers

So Python's powerful data syntax
leads both to clear code and to
clear communication about code

So let us return to Fred Brooks

The Mythical Man-Month by Fred Brooks

“Show me your flowchart
and conceal your tables,
and I shall continue to be mystified.

Show me your tables,
and I won't usually need your flowchart;
it'll be obvious.”

(1975)

What about your database?

Can you honestly say this?

Show me your code,

and I shall be mystified.

Show me your database,

and your code will be obvious!

Is your database clear and simple?

If Fred Brooks is right,

data

should be the simplest

part of your software

Do you understand your database?

Do you dare to look inside?
Do you consult its schema?
Can you load data directly?

My claim:

SQL scares people away

from their own data

The SQL Inversion

Python code — readable but complex
ORM — declarative and simple
Database — complex mystery!

SQL

Powerful language
Excellent support for relational thinking
Declarative (!) description of query

I am a SQL expert,
I have used it for 10 years,
I can do things with SQL that
I cannot do in other languages!

I ♥ SQL

But SQL:

Different paradigm than Python

Uses a completely different syntax

Makes you learn
a whole new language
if you want direct access
to your own data

No wonder people avoid SQL!

But hiding behind an ORM deprives

you of your database's true power

Cannot do bulk operations
Cannot see raw table rows
Cannot use other tools

What about ZODB?

I owe an apology

to Martijn Faassen

I owe an apology

to Martijn Faassen

(And since I will be
in Holland on Monday,
I had better give my
apology now!)

Grok

Declarative (!) web framework built
on top of Zope Component Framework
(“Bluebream”)

I was so active in Grok,

Martijn invited me to the week-long

2009 Cave Sprint

with other developers at his house!

Martijn Faassen

Good programmer
Wonderful host
Very generous

But I never used Grok again

Sorry, Martijn

Why?

Easy excuses:

Not popular in United States
Not a good fit for my projects
Projects had legacy databases

But what is the real reason?

Too Much Magic

At the sprint I watched
Christian Theune, a very good programmer,
fix a problem in a client's ZODB

It was terrible

His code looked like line noise

Why?

Because at the foundation of the ZODB
is the complexity of Python objects
(__dict__, __getattr__, …)

the ZODB's own C-language extensions

that make objects even more complex

What is beneath your objects?

C code like:

if (self->cache) {
    /* Create a node in the ring
       for this unghostified object. */
    self->cache->non_ghost_count++;
    self->cache->total_estimated_size +=
      _estimated_size_in_bytes(self->estimated_size);
    ring_add(&self->cache->ring_home, &self->ring);
    Py_INCREF(self);
}

ZODB

Stores pickles of your Python objects
References save both (classname, id)
So classname is duplicated everywhere
Cannot load your object without the class
So your class can never change names

But worst of all

Because the ZODB stores objects
you can never get outside of the Python
object system to see something simpler

The Mythical Man-Month by Fred Brooks

“Show me your flowchart
and conceal your tables,
and I shall continue to be mystified.

Show me your tables,
and I won't usually need your flowchart;
it'll be obvious.”

(1975)

Sorry, Fred Brooks

ZODB has no “tables”!

ZODB

No native data representation
that can be examined or manipulated
outside your Python application

NoSQL

People want NoSQL!

How do we know?

Because of the vast popularity of the

very first NoSQL database ever!

When it first came out:

No transactions
No joins
No sub-queries
That's NoSQL!

The majority of programmers
have always wanted something simple
that helps them remember indexed values

Why is NoSQL exciting?

Example

MongoDB — a document database

MongoDB

Exciting because it makes your database

as simple as your variables in Python

{
  name: "Rosa californica",
  attrs: {
    habitat: ["chaparral", "woodlands"],
    flower_color: ["pink", "white", "magenta"],
  }
}

MongoDB documents

Use public JSON standard
No separate schema to maintain
Look like Python dict and list
Language independent

{
  name: "Rosa californica",
  attrs: {
    habitat: ["chaparral", "woodlands"],
    flower_color: ["pink", "white", "magenta"],
  }
}

{'id': 34, 'name': 'Department of Engineering',
 'employee_ids': [24, 51, 381, 382, 561]}

↑ MongoDB / SQL ↓

"department" "department_employee"
  id  name      dept_id empl_id
┌──────────┐      ┌─────────┐
│     :    │      │     :   │
│34 'Dept…'│      │ 34  24  │
│     :    │      │ 34  51  │
└──────────┘      │ 34  381 │
                  │ 34  382 │
                  │ 34  561 │
                  │     :   │
                  └─────────┘

Relations

Schemas, mass migration
Break data into small pieces
Need SQL or ORM to reassemble

"department" "department_employee"
  id  name      dept_id empl_id
┌──────────┐      ┌─────────┐
│     :    │      │     :   │
│34 'Dept…'│      │ 34  24  │
│     :    │      │ 34  51  │
└──────────┘      │ 34  381 │
                  │ 34  382 │
                  │ 34  561 │
                  │     :   │
                  └─────────┘

(Also, putting this data in one place
instead of in two tables plus an index
can vastly improve your I/O performance!)

"department" "department_employee"
  id  name      dept_id empl_id
┌──────────┐      ┌─────────┐
│     :    │      │     :   │
│34 'Dept…'│      │ 34  24  │
│     :    │      │ 34  51  │
└──────────┘      │ 34  381 │
                  │ 34  382 │
                  │ 34  561 │
                  │     :   │
                  └─────────┘

In conclusion,

♥ Your Database

Learn SQL
or
Try a Document Database

If Fred Brooks is right,
data should be the easy part
and drive understanding of your code

In conclusion,

♥ Your Database

Learn SQL
or
Try a Document Database

If Fred Brooks is right,
data should be the easy part
and drive understanding of your code

Thank you!

In conclusion,

♥ Your Database

Learn SQL
or
Try a Document Database

If Fred Brooks is right,
data should be the easy part
and drive understanding of your code

Any questions?