=================== PyCon Talk Proposal =================== :Title: The Mighty Dictionary :Duration: 30 min :Level: Beginner :Categories: core Summary ======= Both newcomers and experienced developers alike love Python's built-in data types — especially dictionaries! But how do dictionaries work? What do they do better than other container types, and where, on the other hand, are their weaknesses? Using simple, vivid diagrams that show the secrets of how the dictionary is implemented, and a series of progressively interesting examples of its behavior, we will train the Python developer's mind to picture what the dictionary is doing in just enough detail to make good decisions, as your data sets get larger, about when to use dictionaries and when other data structures might be more appropriate. Description =========== With some judicious use of ``ctypes``, one can write a Python routine that dissects a Python dictionary or set and displays its internals. Using such a tool — which I will also release on PyPI, to accompany the presentation slides — my presentation will show how dictionaries behave as you add items, overwrite them later, remove them, and iterate across the whole dictionary. By contemplating these normally hidden mechanics, and by showing some judicious results from the ``timeit`` module, both newcomers and experienced developers can gain new insight into the trade-offs that dictionaries provide between space and computational complexity, compared to the other alternatives in Python. They will also understand why Python provides a ``hash()`` function; why user-defined classes are given the freedom to define their own hash function as well; and, what happens if they choose not to. The talk will actually discuss sets for most of its length, since they are simpler to diagram and understand, then show, at the end, how a dictionary is just a set with a second column, that holds a reference to an object stored at that key value. The talk will go something like this — each of the following 5 items, I'm imagining, will take up about five minutes (and probably five to ten slides) of my presentation, adding together to 25 minutes (leaving 5 minutes left over for questions): 1. Computer memory is like a Python list ---------------------------------------- Computer memory is indexed by integers, like Python lists (though the indexes tend to be much bigger!). So a Python list is simple: it's an array of numbers, each indicating where in memory a Python object is stored, and Python can jump directly to list item *n*, but has to iterate across the whole list to find whether a particular item is in the list. An ordered list would let you find items more quickly, by jumping in halfway through, and then restricting your search to one half of the remaining list, just like looking for a name in a telephone book. But, the cost would still grow as the list grew longer. And, lists would be expensive to keep ordered! So, let's think about another plan. In a normal list, items wind up at all sorts of indexes. What if we created a list, and magically knew ahead of time exactly where each Python object belonged? Then we could jump right to a given item, immediately, every time! 2. The idea of a hash table --------------------------- We would need a function, called a *hash function*, that when given a certain value — like the number 42, or the string ``"Ni"`` — always returns the same index. Python provides this with a built-in called ``hash()`` for which each built-in type provides an implementation. As an example, we will examine an empty ``set()`` — “look, it starts with space for eight items, even when it's empty!” — then we run ``hash()`` on three simple Python values; then we insert them into the set, and see them land right at the indexes where ``hash()`` told them to. We now see the trade-off a hash table makes: in return for holding open several empty slots, and thus spending *memory*, it can find an item (or discover that it's absent) after only incurring the *static* cost of computing a hash. With a few ``timeit`` tests, we determine how costly it is to compute a hash compared to two simple list operations: jumping directly to list item *n*, versus iterating across a small or large list to find an item. Very small lists are very fast, but quickly become more expensive than computing the hash value to look in a set or dictionary. 3. When indexes collide ----------------------- The ``hash()`` function has to return a limited range of values for an unlimited range of inputs, so many objects *collide*. I will create a collision in the set shown in the slides, and show how the hash table shunts aside the second item and puts it in a second spot that it can find it quickly again when we ask. Removing the collision can still leave the other object stranded where it was put, so the cost of a collision can linger. When I add a fifth item to the set, it suddenly becomes 32 items long! This is to prevent collisions from piling up too deep; both the size of the hash table, *and* some of its behavior, are thus driven by the need to handle collisions. I will note that, when a set or dictionary is re-sized, all of its contents are re-inserted, so that any junk left over gets periodically erased as long as the dictionary is occasionally growing or shrinking. I will show how the cost of re-allocating the whole hash table is reasonable if spread across many hundreds of set inserts, and also quickly show an animation of the dictionary growing, then shrinking as items are removed. By showing some animations of how a dictionary "looks" as it grows and gets used, using some real-world data from observing a dictionary in one of my own applications, I will give a feel for how they behave in the wild. 4. Providing your own hash function ----------------------------------- The critical idea of giving one of your own classes its own ``__hash__()`` method is whether each member of your class represents a *value* that could later — or even simultaneously — be represented by a different instance of your class. I will show how we can easily create two floats with different ``id()`` but the same value (such that they satisfy Python equality), and show how they both go into the same dictionary slot because they have the same hash value. With a few examples and simple illustrations, I will show how simply calling ``hash()`` on the instance variables that give your class instance its own unique value, and combining their values together, you can create a decent hash for your own class. What if your class has no hash routine? I will show how objects are then tested for uniqueness, rather than value, and how deleting and re-creating the "same" object gives it a different hash-table slot. 5. The dictionary, and its alternatives --------------------------------------- First, I finally show the dictionary in all of its glory: like a set, it keeps items in a hash table; but it adds a second column that for each key provides a "value". I will then show an animation of iteration across a dictionary: why the objects come out in random order, and why it's dangerous to modify the dictionary during iteration. Finally, I will briefly discuss alternatives. If you want items back out in order, rather than having random access, use a ``heapq``. If you only add and remove objects from the ends of a series, use a ``deque``. If you need both key-value referencing *and* ordering, then (for today) you might just use ``sorted()`` on your keys each time, or (in the future) use an ``OrderedDict``. But, for most uses, the List and Dictionary are king, and the audience will now hopefully understand why they're each perfect for their common uses. “Any questions?”