by Brandon Rhodes • Home

Learning Pandas through payroll taxes and paystubs

Date: 14 February 2014
Tags:python

I will admit it: I only thought to pull out Pandas when my Python script was nearly complete, because running print on a Pandas data frame would save me the trouble of formatting 12 rows of data by hand.

But as I added the import statement, it suddenly struck me that Pandas aggregate operations might be able to replace some of the steps inside my big for loop. Half an hour later, my loop had disappeared completely and my entire script was reduced to a short sequence of Pandas method calls!

This post is a brief tour of the final script, written up as an IPython notebook and organized around five basic lessons that I learned about Pandas by applying it to this problem.

1. Pandas lets you bring your own objects

The most surprising feature of tools like the NumPy vector library and the Pandas data series might be that they let you bring your own objects to the party.

Since they were invented to efficiently pack millions of floating-point values into a single Python object, you might think that these libraries would only work with numeric types that they themselves can express as vectors. But in fact they have a clean fallback behavior when faced with alternative numeric types: they essentially turn into Python lists, and keep a separate reference to each object that you have supplied. When asked to perform their famous aggregate operations — this is the crucial step — they simply turn to the objects and call the __add__() or __mul__() method of every single item to build the result.

When doing taxes and pay stubs we always want to use Python Decimal objects to guarantee correctness and careful rounding to cents, which we can implement with a simple c() function:

from decimal import Decimal

one_cent = Decimal('0.01')

def c(value):
    "Convert `value` to Decimal cents."
    return Decimal(value).quantize(one_cent)

You might think that my next step will be building a twelve-element list with which to initialize the monthly wages in our payroll table. But data frames let us do something simpler: we can supply a single value for a column, and Pandas will automatically broadcast it across the whole index that we have defined. Here we set up the twelve calendar months of the year and imagine the simple case of someone who makes the same wage every month:

import pandas as pd
from calendar import month_abbr

df = pd.DataFrame(index=month_abbr[1:])
df['wage'] = c('11000.00')
print df
         wage
Jan  11000.00
Feb  11000.00
Mar  11000.00
Apr  11000.00
May  11000.00
Jun  11000.00
Jul  11000.00
Aug  11000.00
Sep  11000.00
Oct  11000.00
Nov  11000.00
Dec  11000.00

An example wage of $11,000 per month might sound grandiose, but I have chosen it so that this example will reach the Social Security wage limit by the end of the year — to test whether we model the limit correctly in our data frame.

2. Mapping is suddenly worth it

The second surprise of working with Pandas is the convenience of its map() method, which lets us build a new series by passing each element of an existing series through a transform function. Here, for example, we compute the federal withholding that will be due each month for our employee by using the TaxSchedule object of my Luca accounting library:

from luca.taxes import TaxSchedule
federal_monthly = TaxSchedule([
    (325, '10'), (1023, '15'), (3163, '25'), (6050, '28'),
    (9050, '33'), (15906, '35'), (17925, '39.6'),
    ])

df['fedwh'] = df['wage'].map(federal_monthly.compute_tax_on)
print df
         wage    fedwh
Jan  11000.00  2596.05
Feb  11000.00  2596.05
Mar  11000.00  2596.05
Apr  11000.00  2596.05
May  11000.00  2596.05
Jun  11000.00  2596.05
Jul  11000.00  2596.05
Aug  11000.00  2596.05
Sep  11000.00  2596.05
Oct  11000.00  2596.05
Nov  11000.00  2596.05
Dec  11000.00  2596.05

It surprised me that mapping was so pleasant, because the current Python habit is to avoid map() — which, after all, Guido nearly removed from Python 3’s built-ins — in favor of list comprehensions. Not only are comprehensions far more general because they can pivot from calling a function to computing any other kind of expression for each input item, but they actually look like they are calling a function because its name gets followed by parentheses just as in any other function call:

# List comprehension versus calling map()

line1 = [federal_monthly.compute_tax_on(w) for w in df['wage']]
line2 = df['wage'].map(federal_monthly.compute_tax_on)

But in the particular case of Pandas a comprehension has disadvantages.

First, it adds syntactic noise. Line 1 is not simply longer than line 2, but visually busier as well. Iteration that ought to be quietly implied inside of a vector becomes an external and noisy for loop instead. The eye has to travel to check that the w loop item is indeed the input to the function. And an extra pair of square brackets surrounds the whole affair.

The second disadvantage is small for this particular data set, but still conceptually regrettable: the list comprehension spins up an extra throw-away list as a temporary holding pen for the values that are really destined to live inside the Pandas data frame.

Finally, you will soon see that chaining Pandas method calls one after another is a pretty common way to take an input series, perform a useful sequence of operations upon it, and be left with a useful result. Method calls to map() dovetail perfectly inside of this pattern, while list comprehensions make you break out of it.

So my Pandas code not only features my only calls to a map() routine these days, but tends to do it pretty often — and is simpler and cleaner as a result!

3. Pandas operations can go down, not just across

My first exposure to Pandas, and to NumPy for that matter, involved two kinds of operations. First, there were binary operators like a + b that we normally think of as involving two numbers, but that when applied to whole vectors might do thousands or millions of pairwise operations instead. Second, there were aggregate operators, that take an entire vector and produce its minimum, maximum, sum, or even full statistics about the contents.

But the key to computing payroll taxes is to perform a third kind of operation: a cumulative sum of the wages paid to date so that the result can be compared with the Social Security wage limit, the point at which one’s Social Security obligation for the year is complete:

# For 2014
ss_limit = c('117000.00')
ss_rate = Decimal('0.062')
mc_rate = Decimal('0.0145')

It turns out that Pandas has a cumulative operator that fits this situation perfectly! The cumsum() method takes a series as its input and produces a series of cumulative totals instead:

print df['wage'].cumsum()
Jan     11000.00
Feb     22000.00
Mar     33000.00
Apr     44000.00
May     55000.00
Jun     66000.00
Jul     77000.00
Aug     88000.00
Sep     99000.00
Oct    110000.00
Nov    121000.00
Dec    132000.00
dtype: object

After computing this cumulative sum, we can apply the Social Security limit using the convenient clip_upper() method — which took me several minutes to find, because I somehow thought that minimum would have wound up somewhere in its name — to assure that we incur no additional tax obligation once the limit has been reached.

Thanks to how cleanly Pandas methods chain together, this can simply be added as another step beyond the cumsum() call:

cum_ss_wage = df['wage'].cumsum().clip_upper(ss_limit)
print cum_ss_wage
Jan     11000.00
Feb     22000.00
Mar     33000.00
Apr     44000.00
May     55000.00
Jun     66000.00
Jul     77000.00
Aug     88000.00
Sep     99000.00
Oct    110000.00
Nov    117000.00
Dec    117000.00
dtype: object

Finally, we need to step backwards. Having created a cumulative sum involving the wages earned to date, we need to split out each month’s earnings back into a separate stand-alone value.

Pandas supports this with a diff() method that is the opposite of cumsum() except that the first row of the result will be NaN (the numerics code for “not a number”) because pairwise subtraction between twelve rows of data produces only eleven values. No problem! Not-a-number comes up so often in data operations that Pandas provides a fillna() method that lets us restore that first value from our original series of monthly wages.

ss_wage = cum_ss_wage.diff().fillna(df['wage'])
print ss_wage
Jan    11000.00
Feb    11000.00
Mar    11000.00
Apr    11000.00
May    11000.00
Jun    11000.00
Jul    11000.00
Aug    11000.00
Sep    11000.00
Oct    11000.00
Nov     7000.00
Dec        0.00
dtype: object

We now know exactly how many dollars are susceptible to Social Security tax each month, and are ready to finish out our data frame.

4. Throwaway series need not live in your data frame

In an early version of this computation, I was faithfully attaching each of the intermediate series that we computed above as new columns to my growing data frame. Between the fedwh column and the ss_tax column in the table below, imagine seeing all of my intermediate results — extra columns with names like cumwage, cumw_clipped, clipped_diff, and ss_wage. To make a reasonable display of the table, I had to cut more than half of its columns out!

It was therefore something of an epiphany to realize that intermediate data series need not be stored in the data frame at all. Look at ss_wage from the previous section as a perfect example: even though I have kept it outside my data frame as a stand-alone series, it follows all the same rules as data frame columns, participates in aggregate operations, and even gets to use the data frame’s own index.

Stand-alone series keep my data frame clean. I save as columns only the outputs that need to be displayed, like the Social Security and Medicare taxes that get printed on a monthly pay stub:

df['ss_tax'] = (ss_wage * ss_rate).map(c)
df['mc_tax'] = (df['wage'] * mc_rate).map(c)
print df
         wage    fedwh  ss_tax  mc_tax
Jan  11000.00  2596.05  682.00  159.50
Feb  11000.00  2596.05  682.00  159.50
Mar  11000.00  2596.05  682.00  159.50
Apr  11000.00  2596.05  682.00  159.50
May  11000.00  2596.05  682.00  159.50
Jun  11000.00  2596.05  682.00  159.50
Jul  11000.00  2596.05  682.00  159.50
Aug  11000.00  2596.05  682.00  159.50
Sep  11000.00  2596.05  682.00  159.50
Oct  11000.00  2596.05  682.00  159.50
Nov  11000.00  2596.05  434.00  159.50
Dec  11000.00  2596.05    0.00  159.50

At this point our data frame is almost complete! We need only compute one last withholding schedule and perform the big subtraction that figures out how much money actually gets sent home with the employee each month, and we will be done:

# Ohio monthly withholding

ohio_exemption = c('650.00') / Decimal('12')
ohio_monthly = TaxSchedule(
    [(0, '0.5'), (5000, '1'), (10000, '2'), (15000, '2.5'),
     (20000, 3), (40000, '3.5'), (80000, '4'), (100000, '5')],
    ohio_exemption,
    )

df['statewh'] = df['wage'].map(ohio_monthly.compute_tax_on)

df['paycheck'] = (df['wage'] - df['fedwh'] - df['ss_tax']
                  - df['mc_tax'] - df['statewh'])

print df
         wage    fedwh  ss_tax  mc_tax statewh paycheck
Jan  11000.00  2596.05  682.00  159.50   95.00  7467.45
Feb  11000.00  2596.05  682.00  159.50   95.00  7467.45
Mar  11000.00  2596.05  682.00  159.50   95.00  7467.45
Apr  11000.00  2596.05  682.00  159.50   95.00  7467.45
May  11000.00  2596.05  682.00  159.50   95.00  7467.45
Jun  11000.00  2596.05  682.00  159.50   95.00  7467.45
Jul  11000.00  2596.05  682.00  159.50   95.00  7467.45
Aug  11000.00  2596.05  682.00  159.50   95.00  7467.45
Sep  11000.00  2596.05  682.00  159.50   95.00  7467.45
Oct  11000.00  2596.05  682.00  159.50   95.00  7467.45
Nov  11000.00  2596.05  434.00  159.50   95.00  7715.45
Dec  11000.00  2596.05    0.00  159.50   95.00  8149.45

Note again that all of this math has been performed by real first-class Decimal objects doing the same rounding that is implemented by the Internal Revenue Service. This is a grid of true decimal objects — organized and displayed by Pandas, but to which Pandas is always deferring the actual math.

5. Reporting becomes easy

Finally, my Pandas data frame not only did a great job of formatting itself for terminal output — exempting me from the fussy job of measuring column widths, justifying fields, and adding whitespace — but it could effortlessly slice and aggrate the table as well. Here, for instance, is an individual pay stub for September:

print 'September payroll stub:'
print df.ix['Sep']
September payroll stub:
wage        11000.00
fedwh        2596.05
ss_tax        682.00
mc_tax        159.50
statewh        95.00
paycheck     7467.45
Name: Sep, dtype: object

And generating a W-2 for delivery to the employee at year’s end is as simple as applying Python’s sum routine to each column — the data frame does have a sum() method, but it seems to insist on operating in floating point — and letting Pandas display the result:

print 'End-of-year W-2'
print df.apply(sum)
End-of-year W-2
wage        132000.00
fedwh        31152.60
ss_tax        7254.00
mc_tax        1914.00
statewh       1140.00
paycheck     90539.40
dtype: object

When I first saw Pandas touted, I planned to use it the next time I had a big problem — like a million-line Apache log file. But it turns out to do just as well with tiny problems. Unlike heavyweight solutions that are not worth trotting out and setting up unless one has a truly big problem to deploy them against, Pandas scaled right down to my twelve-line paystub table and made my code simpler and easier to maintain.

Even on a tiny data set it is more than worth it to import pandas, set up your index, and get to work!

©2014