===================
PyCon Talk Proposal
===================

:Title: The Mighty Dictionary
:Duration: 30 min
:Level: Beginner
:Categories: core

Summary
=======

Both newcomers and experienced developers alike love Python's built-in data types — especially dictionaries!  But how do dictionaries work? What do they do better than other container types, and where, on the other hand, are their weaknesses?  Using simple, vivid diagrams that show the secrets of how the dictionary is implemented, and a series of progressively interesting examples of its behavior, we will train the Python developer's mind to picture what the dictionary is doing in just enough detail to make good decisions, as your data sets get larger, about when to use dictionaries and when other data structures might be more appropriate.

Description
===========

With some judicious use of ``ctypes``, one can write a Python routine
that dissects a Python dictionary or set and displays its internals.
Using such a tool — which I will also release on PyPI, to accompany the
presentation slides — my presentation will show how dictionaries behave
as you add items, overwrite them later, remove them, and iterate across
the whole dictionary.

By contemplating these normally hidden mechanics, and by showing some
judicious results from the ``timeit`` module, both newcomers and
experienced developers can gain new insight into the trade-offs that
dictionaries provide between space and computational complexity,
compared to the other alternatives in Python.  They will also understand
why Python provides a ``hash()`` function; why user-defined classes are
given the freedom to define their own hash function as well; and, what
happens if they choose not to.

The talk will actually discuss sets for most of its length, since they
are simpler to diagram and understand, then show, at the end, how a
dictionary is just a set with a second column, that holds a reference to
an object stored at that key value.

The talk will go something like this — each of the following 5 items,
I'm imagining, will take up about five minutes (and probably five to ten
slides) of my presentation, adding together to 25 minutes (leaving
5 minutes left over for questions):

1. Computer memory is like a Python list
----------------------------------------

Computer memory is indexed by integers, like Python lists (though the
indexes tend to be much bigger!).  So a Python list is simple: it's an
array of numbers, each indicating where in memory a Python object is
stored, and Python can jump directly to list item *n*, but has to
iterate across the whole list to find whether a particular item is in
the list.

An ordered list would let you find items more quickly, by jumping in
halfway through, and then restricting your search to one half of the
remaining list, just like looking for a name in a telephone book.  But,
the cost would still grow as the list grew longer.  And, lists would be
expensive to keep ordered!

So, let's think about another plan.  In a normal list, items wind up at
all sorts of indexes.  What if we created a list, and magically knew
ahead of time exactly where each Python object belonged?  Then we could
jump right to a given item, immediately, every time!

2. The idea of a hash table
---------------------------

We would need a function, called a *hash function*, that when given a
certain value — like the number 42, or the string ``"Ni"`` — always
returns the same index.  Python provides this with a built-in called
``hash()`` for which each built-in type provides an implementation.

As an example, we will examine an empty ``set()`` — “look, it starts
with space for eight items, even when it's empty!” — then we run
``hash()`` on three simple Python values; then we insert them into the
set, and see them land right at the indexes where ``hash()`` told them
to.

We now see the trade-off a hash table makes: in return for holding open
several empty slots, and thus spending *memory*, it can find an item (or
discover that it's absent) after only incurring the *static* cost of
computing a hash.  With a few ``timeit`` tests, we determine how costly
it is to compute a hash compared to two simple list operations: jumping
directly to list item *n*, versus iterating across a small or large list
to find an item.  Very small lists are very fast, but quickly become
more expensive than computing the hash value to look in a set or
dictionary.

3. When indexes collide
-----------------------

The ``hash()`` function has to return a limited range of values for an
unlimited range of inputs, so many objects *collide*.  I will create a
collision in the set shown in the slides, and show how the hash table
shunts aside the second item and puts it in a second spot that it can
find it quickly again when we ask.  Removing the collision can still
leave the other object stranded where it was put, so the cost of a
collision can linger.

When I add a fifth item to the set, it suddenly becomes 32 items long!
This is to prevent collisions from piling up too deep; both the size of
the hash table, *and* some of its behavior, are thus driven by the need
to handle collisions.  I will note that, when a set or dictionary is
re-sized, all of its contents are re-inserted, so that any junk left
over gets periodically erased as long as the dictionary is occasionally
growing or shrinking.

I will show how the cost of re-allocating the whole hash table is
reasonable if spread across many hundreds of set inserts, and also
quickly show an animation of the dictionary growing, then shrinking as
items are removed.  By showing some animations of how a dictionary
"looks" as it grows and gets used, using some real-world data from
observing a dictionary in one of my own applications, I will give a feel
for how they behave in the wild.

4. Providing your own hash function
-----------------------------------

The critical idea of giving one of your own classes its own
``__hash__()`` method is whether each member of your class represents a
*value* that could later — or even simultaneously — be represented by a
different instance of your class.  I will show how we can easily create
two floats with different ``id()`` but the same value (such that they
satisfy Python equality), and show how they both go into the same
dictionary slot because they have the same hash value.

With a few examples and simple illustrations, I will show how simply
calling ``hash()`` on the instance variables that give your class
instance its own unique value, and combining their values together, you
can create a decent hash for your own class.

What if your class has no hash routine?  I will show how objects are
then tested for uniqueness, rather than value, and how deleting and
re-creating the "same" object gives it a different hash-table slot.

5. The dictionary, and its alternatives
---------------------------------------

First, I finally show the dictionary in all of its glory: like a set, it
keeps items in a hash table; but it adds a second column that for each
key provides a "value".

I will then show an animation of iteration across a dictionary: why the
objects come out in random order, and why it's dangerous to modify the
dictionary during iteration.

Finally, I will briefly discuss alternatives.  If you want items back
out in order, rather than having random access, use a ``heapq``.  If you
only add and remove objects from the ends of a series, use a ``deque``.
If you need both key-value referencing *and* ordering, then (for today)
you might just use ``sorted()`` on your keys each time, or (in the
future) use an ``OrderedDict``.

But, for most uses, the List and Dictionary are king, and the audience
will now hopefully understand why they're each perfect for their common
uses.  “Any questions?”