httplib and its discontents

──────────────────
PyOhio
2011 July 31st
Brandon Craig Rhodes
──────────────────

This is a talk
about what can go wrong
during programming

Not a talk about how

to use httplib or urllib2

For that, use the new wrapper

library by Kenneth Reitz:

http://pypi.python.org/pypi/requests

Public Service Announcement

Never install Python packages

system-wide, except for virtualenv itself

$ virtualenv playground
$ cd playground
$ ls
bin/  include/  lib/
$ . bin/activate
(playground)$ pip install requests
(playground)$ python
>>> import requests
>>> r = requests.get('http://google.com')

“OO”

OO = Object-Oriented Programming

What is an “Object”?

History

There were variables:

firstname = 'Guido'
lastname = 'van Rossum'
state = 'CA'
age = 55
genre = 'Dutch'

firstnames = ['George', 'John', 'Thomas']
lastnames = ['Washington', 'Adams', 'Jefferson']
states = ['VA', 'MA', 'VA']
terms = [2, 1, 2]

But how do you pass

a whole person as a parameter?

ignorant_function(firstnames[i], lastnames[i],
                  states[i], terms[i])

aware_function(firstnames, lastnames,
               states, terms, i)

Records / Structures

presidents = [
    Person(firstname='George',
           lastname='Washington',
           state='VA',
           terms=2),
    ]

p = presidents[0]
print p.firstname, p.lastname

designate_holiday_for(p)
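
The slides never show how Person is declared. A minimal sketch, assuming it is a collections.namedtuple (the full_name() helper is also made up for illustration, and the code is written for Python 3):

```python
from collections import namedtuple

# Hypothetical definition: the talk does not say how Person was built.
Person = namedtuple('Person', ['firstname', 'lastname', 'state', 'terms'])

presidents = [
    Person(firstname='George', lastname='Washington', state='VA', terms=2),
    Person(firstname='John', lastname='Adams', state='MA', terms=1),
    ]

def full_name(p):
    """Receive a whole person as a single parameter."""
    return '%s %s' % (p.firstname, p.lastname)

names = [full_name(p) for p in presidents]
```

One argument now carries every field, so no parallel lists or index `i` are needed.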

Quick Digression

presidents = [
    Person(firstname='George',
           lastname='Washington',
           state='VA',
           terms=2),
    ]              ↑
       #    Why did I insert
       # this unnecessary comma?

A:

It creates perfect symmetry in a long list

some_primes = [
    6311,
    6317,
    6323,
    6329,
    6337,
    6343,
    ]

You can cut, paste,
and rearrange code with
the most wild abandon

Terminal commas are always allowed

♥ Python

So, records are great

p.firstname='George'
p.lastname='Washington'
p.state='VA'
p.terms=2

They fixed our problem with data.

But what about our code?

We wrote many procedures

for processing records

def create_person(): …
def delete_person(p): …
def send_paycheck_to(p): …
def designate_retiree(p): …
def record_contribution(p): …
def produce_obituary(p): …

These procedures became very complex

as more varieties of each record appeared

def send_paycheck_to(p):
    "Produce a paycheck for person `p`."
    if p.role == 'student':
        …
    elif p.role == 'employee':
        if p.status == 'biweekly':
            …
        elif p.status == 'salary':
            if p.rank == 'emeritus':
                …
            else:
                …

These great forests of
if statements were chopped down
in the great OO revolution

In OO, we do for code

what records did for data

“method”

a function attached to data

myperson.process_payroll()

Now our programming language

implements the if forest for us

class Employee(object):
    def process_payroll(…): …
    def move_to_retirement(…): …

class BiweeklyStaff(Employee):
    def process_payroll(…): …

class EmeritusProf(Employee):
    def move_to_retirement(…): …

OO languages give the method

the object on which it was invoked

myperson.process_payroll()

↓

Employee.process_payroll(myperson)

Python, of course, makes this

explicit rather than implicit

def process_payroll(self):
    …
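
A small sketch of both ideas at once: the subclass replaces a branch of the if forest, and the explicit self is just the object passed as the first argument. The names and pay rules are invented for illustration.

```python
class Employee(object):
    def __init__(self, name):
        self.name = name

    def process_payroll(self):
        return '%s: monthly salary' % self.name

class BiweeklyStaff(Employee):
    # Overrides the inherited method; no "if p.status == ..." test needed.
    def process_payroll(self):
        return '%s: biweekly paycheck' % self.name

staff = [Employee('Ada'), BiweeklyStaff('Grace')]

# The language picks the right method for each object.
results = [person.process_payroll() for person in staff]

# These two spellings make the same call.
implicit = staff[0].process_payroll()
explicit = Employee.process_payroll(staff[0])
```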

OO

Not just convenient, but

powerful in important ways

Packaging data and code together

makes it much easier to abstract

But could OO have a dark side?

“The Case of the HTTP Parser”

Stack Overflow question:

"Does Python have a module for parsing

HTTP requests and responses?"

HTTP

“Hypertext Transfer Protocol”

(hypertext means text with links)

Your web browser uses HTTP

“fetch http://google.com for me!”

Your browser:

Sends an HTTP Request to google.com
Receives an HTTP Response from google.com

An HTTP request is just text

GET / HTTP/1.1
Accept: text/html,application/xhtml+xml,…
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Cache-Control: max-age=0
Connection: keep-alive
Cookie:  …
Host: www.google.com
User-Agent: Mozilla/5.0 (X11; Linux i686)
    AppleWebKit/535.1 (KHTML, like Gecko)
    Chrome/13.0.782.41 Safari/535.1

Q:

"Does Python have a module for parsing

HTTP requests and responses?"

A:

∵ SimpleHTTPServer

∴ Python MUST have a parser!

Python's built-in web server

$ python -m SimpleHTTPServer

serves static files from

the current directory

Two questions

Where in the Standard Library

is this parser?
How can I run it on a string?

Server classes take

care of sockets for you

(For all of the details, see
Foundations of Python Network Programming
by Brandon Rhodes and John Goerzen)

Server classes

SocketServer.BaseServer
↑
SocketServer.TCPServer
↑
BaseHTTPServer.HTTPServer

These open a socket,
set its IP address, run listen(),
then run accept() over and over
again accepting connections

But once the connection is
established, the Server classes
have no idea what to do next

So they pass the open
connection to a Handler
to do the actual talking
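
Roughly, the division of labor looks like this. It is a sketch in Python 3, not the SocketServer source: make_listener(), serve(), and the max_connections parameter are hypothetical names, the last one added only so that a demo loop can stop.

```python
import socket

def make_listener(host='127.0.0.1', port=0):
    """Open a TCP socket, bind its address, and start listening."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(5)
    return sock

def serve(listener, handler, max_connections=None):
    """Call accept() over and over, handing each connection to `handler`."""
    served = 0
    while max_connections is None or served < max_connections:
        connection, client_address = listener.accept()
        try:
            # The server knows nothing about the protocol;
            # the handler does the actual talking.
            handler(connection, client_address)
        finally:
            connection.close()
        served += 1
    return served
```

Here the handler is a plain callable. The Standard Library passes the connection to a Handler class instead.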

Handler classes

SocketServer.BaseRequestHandler
↑
SocketServer.StreamRequestHandler
↑
BaseHTTPServer.BaseHTTPRequestHandler

The HTTP handler is an extensible class

The Standard Library invites you
to create your own subclass of the handler
and add methods like do_GET() or do_POST()

GET / HTTP/1.1
Accept: text/html,text/plain
Host: www.example.com

If this request came in, the

server would call your do_GET()

POST / HTTP/1.1
Accept: text/html,text/plain
Content-Length: 12
Host: www.example.com

query=Python

For this the server would call do_POST()
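
A sketch of such a subclass, written for Python 3, where BaseHTTPServer is named http.server. The class name, the response bodies, and port 8000 are all made up for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical subclass that answers GET and echoes POST bodies."""

    def do_GET(self):
        body = b'Hello from do_GET\n'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # Read exactly as many bytes as the client promised to send.
        length = int(self.headers.get('Content-Length', 0))
        posted = self.rfile.read(length)       # e.g. b'query=Python'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(posted)))
        self.end_headers()
        self.wfile.write(posted)

def run(port=8000):
    """Serve until interrupted; the port number is an arbitrary choice."""
    HTTPServer(('', port), HelloHandler).serve_forever()
```

Notice how much of the base class each method touches: self.headers, self.rfile, self.wfile, and three inherited methods.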

Note that here OO is not being

used merely to provide a container

Instead, the module author thinks

that OO is a party and you are invited

You are invited to write new methods

on the author's class

Hint: this was probably a bad idea!

Classes — as we will soon see — can

be quite complex pieces of machinery.

Why invite another programmer in?

It breaks the clean barrier

that lets us do abstraction

Modern alternatives

You could supply callback functions
You could supply your own object
You could consume requests by iterating

                # inheritance

            BaseHTTPRequestHandler
                      ↑
HTTPServer → MyHTTPRequestHandler

         # API

Mightn't it have been better
to also have an API between
HTTP parser and handler?

HTTPServer → HTTPParser → MyHandler

So how does BaseHTTPRequestHandler

actually parse the HTTP protocol?

BaseRequestHandler.__init__(self, request, client_address…)

Stores params on self
Calls self.setup()
Calls self.handle()
Calls self.finish()
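
The same sequence as a simplified stand-in, not the real SocketServer code. SketchRequestHandler and RecordingHandler are hypothetical names, and `*extra` stands in for the parameters the slide elides.

```python
class SketchRequestHandler(object):
    """Simplified stand-in for SocketServer.BaseRequestHandler."""

    def __init__(self, request, client_address, *extra):
        # Store the parameters on self...
        self.request = request
        self.client_address = client_address
        self.extra = extra
        # ...then run the three hooks, always in this order.
        self.setup()
        try:
            self.handle()
        finally:
            self.finish()

    def setup(self):
        pass

    def handle(self):
        pass

    def finish(self):
        pass

class RecordingHandler(SketchRequestHandler):
    """Hypothetical subclass that records the order of the hook calls."""

    def setup(self):
        self.calls = ['setup']

    def handle(self):
        self.calls.append('handle')

    def finish(self):
        self.calls.append('finish')

handler = RecordingHandler('fake-socket', ('127.0.0.1', 0))
```

All of the work happens inside the constructor, so merely instantiating a handler runs the whole conversation.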

StreamRequestHandler.setup(self)

Wraps socket in file-like objects
Creates self.rfile stream
Creates self.wfile stream

BaseHTTPRequestHandler.handle_one_request(self)

26 lines of code
Calls rfile.readline()
Checks that it is really a line of text
Saves it as self.raw_requestline
Calls self.parse_request()
Calls self.do_GET() or self.do_POST()

BaseHTTPRequestHandler.parse_request(self)

60 lines
Finally — the parsing goodness!
Parses self.raw_requestline
Reads and parses headers from self.rfile

Great! How do I use it?

Q:

What does parse_request() need?

A:

Its needs are not documented anywhere.
Why?
Because in Object-Oriented programming,
your conscience is taught that you can
use any attributes or methods on self
for free and without apologizing.

A:

Freebies:

self.default_request_version = "HTTP/0.9"
self.protocol_version = "HTTP/1.0"
self.MessageClass = mimetools.Message

Needs:

self.raw_requestline # string
self.rfile # used by self.MessageClass
self.send_error(code, message) # HTTP response!

That's quite an invocation protocol!

You have to set several object attributes.

This is called tight coupling

Useful phrase

“You've written some pretty

tightly coupled code there!”

(use the same tone of voice

as when you step in something)

Q:

So what does the Standard Library do

when its authors want to test parse_request()?

A:

“Well, you start by opening a socket…”

test/test_httpservers.py

self.server = HTTPServer(('', 0), self.request_handler)
self.test_object.PORT = self.server.socket.getsockname()[1]

Lesson

Python's Standard Library

will not parse an HTTP request …

… unless you submit it through

a real live socket

(… or, unless you are willing to write code
that depends on the internal details
of the BaseHTTPRequestHandler)

I was willing.

from BaseHTTPServer import BaseHTTPRequestHandler
from StringIO import StringIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = StringIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message):
        self.error_code = code
        self.error_message = message

# Simply instantiate with the request text

request = HTTPRequest(request_text)

print request.error_code      # None
print request.command         # "GET"
print request.path            # "/who/ken/trust.html"
print request.request_version # "HTTP/1.1"
print len(request.headers)    # 3
print request.headers['host'] # "cm.bell-labs.com"

# Parsing can result in an error code and message

request = HTTPRequest(bad_request_text)

print request.error_code     # 400
print request.error_message  # "Bad request syntax"
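
The same trick can be sketched for Python 3, where the module is named http.server and the request text must be bytes. This port is an assumption, not part of the original answer, and it leans on the same undocumented internals of parse_request(), so it could break in a future release.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        # Deliberately skip the base __init__(), which expects a socket.
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        # Record the error instead of writing an HTTP response.
        self.error_code = code
        self.error_message = message

request = HTTPRequest(b'GET /who/ken/trust.html HTTP/1.1\r\n'
                      b'Host: cm.bell-labs.com\r\n'
                      b'\r\n')

# A request line with a single word is rejected by the parser.
bad = HTTPRequest(b'GET\r\n\r\n')
```

After this runs, `request.command`, `request.path`, and `request.headers` hold the parsed values, and `bad.error_code` holds the status that the parser tried to send.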

Lesson

What have we learned?

OO is an incredible tool

for doing abstraction

BaseHTTPServer hides sockets really well!

But OO deadens the conscience

It lets programmers
write heavily coupled code because
“These are internal details of my class!
I can use anything on self that I want!”

Two remedies

Testing
Procedures

What if?

What if the HTTP parser had been written

as a good old-fashioned procedure?

The argument list and return statement
would provide clear documentation of what
data entered and exited the procedure

The code would have been prevented
from accumulating more and more
fiddly dependencies upon self

def parse_http_request(
      infile,
      request,
      version='HTTP/1.0',
      default_version='HTTP/0.9'):
    …
    if failure:
        raise HTTPError(400, "Bad request")
    …
    request.command = …
    request.path = …
    …
    return
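
To make the "what if" concrete, here is a toy version in Python 3. It is a sketch under stated assumptions: it returns its result instead of filling in a `request` argument as the slide does, HTTPError and ParsedRequest are hypothetical names, and it ignores continuation lines, repeated headers, and request bodies.

```python
from collections import namedtuple
from io import StringIO

class HTTPError(Exception):
    """Hypothetical exception carrying an HTTP status code and message."""
    def __init__(self, code, message):
        Exception.__init__(self, '%d %s' % (code, message))
        self.code = code
        self.message = message

ParsedRequest = namedtuple('ParsedRequest', 'command path version headers')

def parse_http_request(infile, default_version='HTTP/0.9'):
    """Read one request from the text stream `infile` and return it."""
    words = infile.readline().split()
    if len(words) == 3:
        command, path, version = words
    elif len(words) == 2:
        command, path = words
        version = default_version
    else:
        raise HTTPError(400, 'Bad request syntax')
    headers = {}
    for line in infile:
        line = line.rstrip('\r\n')
        if not line:                      # a blank line ends the headers
            break
        name, sep, value = line.partition(':')
        if not sep:
            raise HTTPError(400, 'Bad header line')
        headers[name.strip().lower()] = value.strip()
    return ParsedRequest(command, path, version, headers)

request = parse_http_request(StringIO(
    'GET / HTTP/1.1\r\n'
    'Accept: text/html,text/plain\r\n'
    'Host: www.example.com\r\n'
    '\r\n'))
```

Everything the parser needs is in its argument list, and everything it produces is in its return value or its exception. Testing it takes a string, not a socket.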

CherryPy

Robert Brewer is pretty awesome

From _cperror.py:

class HTTPError(CherryPyException):
    ⋮
    def get_error_page(self, …):
        return get_error_page(…)

⋮

def get_error_page(status, …):
    ⋮

He defines the procedure separately,
then writes the method that invokes
the procedure with data from self
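
The pattern, reduced to a sketch. This is not CherryPy's code: render_error_page(), SketchHTTPError, and the HTML they produce are invented for illustration.

```python
def render_error_page(status, message):
    """Plain procedure: everything it needs arrives as an argument."""
    return '<h1>%d</h1><p>%s</p>' % (status, message)

class SketchHTTPError(Exception):
    def __init__(self, status, message):
        Exception.__init__(self, message)
        self.status = status
        self.message = message

    def get_error_page(self):
        # Thin method: pull data off self, then delegate to the procedure.
        return render_error_page(self.status, self.message)

page_from_method = SketchHTTPError(404, 'Not Found').get_error_page()
page_from_procedure = render_error_page(404, 'Not Found')
```

The procedure can be called and tested without ever constructing the exception object.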

Two remedies

Testing
Procedures

I recommend dependency injection

It can help our procedures just as

it is already helping our classes

Dependency injection

Martin Fowler

http://martinfowler.com/articles/injection.html
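
One way to read this advice for a procedure, as a sketch: pass the collaborator in as an argument instead of reaching for it on self. The names parse_request_line(), report_error, and collect_error are hypothetical.

```python
def parse_request_line(line, report_error):
    """Split a request line; report problems through the injected callable."""
    words = line.split()
    if len(words) != 3:
        report_error(400, 'Bad request syntax')
        return None
    return tuple(words)

# In a test, inject something that simply collects the errors.
errors = []

def collect_error(code, message):
    errors.append((code, message))

good = parse_request_line('GET / HTTP/1.1', collect_error)
bad = parse_request_line('GET', collect_error)
```

A server could inject a callable that writes a real HTTP response, and the parser would not need to change.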

Thank you!

Any questions?