|Date:||14 February 2014|
I will admit it: I only thought to pull out Pandas when my Python script was nearly complete, because running
But as I added the
import statement, it suddenly struck me that Pandas aggregate operations might be able to replace some of the steps inside my big
for loop. Half an hour later, my loop had disappeared completely and my entire script was reduced to a short sequence of Pandas method calls!
This post is a brief tour of the final script, written up as an IPython notebook and organized around five basic lessons that I learned about Pandas by applying it to this problem.
The most surprising feature of tools like the NumPy vector library and the Pandas data series might be that they let you bring your own objects to the party.
Since they were invented to efficiently pack millions of floating-point values into a single Python object, you might think that these libraries would only work with numeric types that they themselves can express as vectors. But in fact they have a clean fallback behavior when faced with alternative numeric types: they essentially turn into Python lists, and keep a separate reference to each object that you have supplied. When asked to perform their famous aggregate operations — this is the crucial step — they simply turn to the objects and call the
__mul__() method of every single item to build the result.
When doing taxes and pay stubs we always want to use Python
Decimal objects to guarantee correctness and careful rounding to cents, which we can implement with a simple
from decimal import Decimal one_cent = Decimal('0.01') def c(value): "Convert `value` to Decimal cents." return Decimal(value).quantize(one_cent)
You might think that my next step will be building a twelve-element list with which to initialize the monthly wages in our payroll table. But data frames let us do something simpler: we can supply a single value for a column, and Pandas will automatically broadcast it across the whole index that we have defined. Here we set up the twelve calendar months of the year and imagine the simple case of someone who makes the same wage every month:
import pandas as pd from calendar import month_abbr df = pd.DataFrame(index=month_abbr[1:]) df['wage'] = c('11000.00') print df
wage Jan 11000.00 Feb 11000.00 Mar 11000.00 Apr 11000.00 May 11000.00 Jun 11000.00 Jul 11000.00 Aug 11000.00 Sep 11000.00 Oct 11000.00 Nov 11000.00 Dec 11000.00
An example wage of $11,000 per month might sound grandiose, but I have chosen it so that this example will reach the Social Security wage limit by the end of the year — to test whether we model the limit correctly in our data frame.
The second surprise of working with Pandas is the convenience of its
map() method, which lets us build a new series by passing each element of an existing series through a transform function. Here, for example, we compute the federal withholding that will be due each month for our employee by using the
TaxSchedule object of my Luca accounting library:
from luca.taxes import TaxSchedule federal_monthly = TaxSchedule([ (325, '10'), (1023, '15'), (3163, '25'), (6050, '28'), (9050, '33'), (15906, '35'), (17925, '39.6'), ]) df['fedwh'] = df['wage'].map(federal_monthly.compute_tax_on) print df
wage fedwh Jan 11000.00 2596.05 Feb 11000.00 2596.05 Mar 11000.00 2596.05 Apr 11000.00 2596.05 May 11000.00 2596.05 Jun 11000.00 2596.05 Jul 11000.00 2596.05 Aug 11000.00 2596.05 Sep 11000.00 2596.05 Oct 11000.00 2596.05 Nov 11000.00 2596.05 Dec 11000.00 2596.05
It surprised me that mapping was so pleasant, because the current Python habit is to avoid
map() — which, after all, Guido nearly removed from Python 3’s built-ins — in favor of list comprehensions. Not only are comprehensions far more general because they can pivot from calling a function to computing any other kind of expression for each input item, but they actually look like they are calling a function because its name gets followed by parentheses just as in any other function call:
# List comprehension versus calling map() line1 = [federal_monthly.compute_tax_on(w) for w in df['wage']] line2 = df['wage'].map(federal_monthly.compute_tax_on)
But in the particular case of Pandas a comprehension has disadvantages.
First, it adds syntactic noise. Line 1 is not simply longer than line 2, but visually busier as well. Iteration that ought to be quietly implied inside of a vector becomes an external and noisy
for loop instead. The eye has to travel to check that the
w loop item is indeed the input to the function. And an extra pair of square brackets surrounds the whole affair.
The second disadvantage is small for this particular data set, but still conceptually regrettable: the list comprehension spins up an extra throw-away
list as a temporary holding pen for the values that are really destined to live inside the Pandas data frame.
Finally, you will soon see that chaining Pandas method calls one after another is a pretty common way to take an input series, perform a useful sequence of operations upon it, and be left with a useful result. Method calls to
map() dovetail perfectly inside of this pattern, while list comprehensions make you break out of it.
So my Pandas code not only features my only calls to a
map() routine these days, but tends to do it pretty often — and is simpler and cleaner as a result!
My first exposure to Pandas, and to NumPy for that matter, involved two kinds of operations. First, there were binary operators like
a + b that we normally think of as involving two numbers, but that when applied to whole vectors might do thousands or millions of pairwise operations instead. Second, there were aggregate operators, that take an entire vector and produce its minimum, maximum, sum, or even full statistics about the contents.
But the key to computing payroll taxes is to perform a third kind of operation: a cumulative sum of the wages paid to date so that the result can be compared with the Social Security wage limit, the point at which one’s Social Security obligation for the year is complete:
# For 2014 ss_limit = c('117000.00') ss_rate = Decimal('0.062') mc_rate = Decimal('0.0145')
It turns out that Pandas has a cumulative operator that fits this situation perfectly! The
cumsum() method takes a series as its input and produces a series of cumulative totals instead:
Jan 11000.00 Feb 22000.00 Mar 33000.00 Apr 44000.00 May 55000.00 Jun 66000.00 Jul 77000.00 Aug 88000.00 Sep 99000.00 Oct 110000.00 Nov 121000.00 Dec 132000.00 dtype: object
After computing this cumulative sum, we can apply the Social Security limit using the convenient
clip_upper() method — which took me several minutes to find, because I somehow thought that minimum would have wound up somewhere in its name — to assure that we incur no additional tax obligation once the limit has been reached.
Thanks to how cleanly Pandas methods chain together, this can simply be added as another step beyond the
cum_ss_wage = df['wage'].cumsum().clip_upper(ss_limit) print cum_ss_wage
Jan 11000.00 Feb 22000.00 Mar 33000.00 Apr 44000.00 May 55000.00 Jun 66000.00 Jul 77000.00 Aug 88000.00 Sep 99000.00 Oct 110000.00 Nov 117000.00 Dec 117000.00 dtype: object
Finally, we need to step backwards. Having created a cumulative sum involving the wages earned to date, we need to split out each month’s earnings back into a separate stand-alone value.
Pandas supports this with a
diff() method that is the opposite of
cumsum() except that the first row of the result will be
NaN (the numerics code for “not a number”) because pairwise subtraction between twelve rows of data produces only eleven values. No problem! Not-a-number comes up so often in data operations that Pandas provides a
fillna() method that lets us restore that first value from our original series of monthly wages.
ss_wage = cum_ss_wage.diff().fillna(df['wage']) print ss_wage
Jan 11000.00 Feb 11000.00 Mar 11000.00 Apr 11000.00 May 11000.00 Jun 11000.00 Jul 11000.00 Aug 11000.00 Sep 11000.00 Oct 11000.00 Nov 7000.00 Dec 0.00 dtype: object
We now know exactly how many dollars are susceptible to Social Security tax each month, and are ready to finish out our data frame.
In an early version of this computation, I was faithfully attaching each of the intermediate series that we computed above as new columns to my growing data frame. Between the
fedwh column and the
ss_tax column in the table below, imagine seeing all of my intermediate results — extra columns with names like
ss_wage. To make a reasonable display of the table, I had to cut more than half of its columns out!
It was therefore something of an epiphany to realize that intermediate data series need not be stored in the data frame at all. Look at
ss_wage from the previous section as a perfect example: even though I have kept it outside my data frame as a stand-alone series, it follows all the same rules as data frame columns, participates in aggregate operations, and even gets to use the data frame’s own index.
Stand-alone series keep my data frame clean. I save as columns only the outputs that need to be displayed, like the Social Security and Medicare taxes that get printed on a monthly pay stub:
df['ss_tax'] = (ss_wage * ss_rate).map(c) df['mc_tax'] = (df['wage'] * mc_rate).map(c) print df
wage fedwh ss_tax mc_tax Jan 11000.00 2596.05 682.00 159.50 Feb 11000.00 2596.05 682.00 159.50 Mar 11000.00 2596.05 682.00 159.50 Apr 11000.00 2596.05 682.00 159.50 May 11000.00 2596.05 682.00 159.50 Jun 11000.00 2596.05 682.00 159.50 Jul 11000.00 2596.05 682.00 159.50 Aug 11000.00 2596.05 682.00 159.50 Sep 11000.00 2596.05 682.00 159.50 Oct 11000.00 2596.05 682.00 159.50 Nov 11000.00 2596.05 434.00 159.50 Dec 11000.00 2596.05 0.00 159.50
At this point our data frame is almost complete! We need only compute one last withholding schedule and perform the big subtraction that figures out how much money actually gets sent home with the employee each month, and we will be done:
# Ohio monthly withholding ohio_exemption = c('650.00') / Decimal('12') ohio_monthly = TaxSchedule( [(0, '0.5'), (5000, '1'), (10000, '2'), (15000, '2.5'), (20000, 3), (40000, '3.5'), (80000, '4'), (100000, '5')], ohio_exemption, ) df['statewh'] = df['wage'].map(ohio_monthly.compute_tax_on) df['paycheck'] = (df['wage'] - df['fedwh'] - df['ss_tax'] - df['mc_tax'] - df['statewh']) print df
wage fedwh ss_tax mc_tax statewh paycheck Jan 11000.00 2596.05 682.00 159.50 95.00 7467.45 Feb 11000.00 2596.05 682.00 159.50 95.00 7467.45 Mar 11000.00 2596.05 682.00 159.50 95.00 7467.45 Apr 11000.00 2596.05 682.00 159.50 95.00 7467.45 May 11000.00 2596.05 682.00 159.50 95.00 7467.45 Jun 11000.00 2596.05 682.00 159.50 95.00 7467.45 Jul 11000.00 2596.05 682.00 159.50 95.00 7467.45 Aug 11000.00 2596.05 682.00 159.50 95.00 7467.45 Sep 11000.00 2596.05 682.00 159.50 95.00 7467.45 Oct 11000.00 2596.05 682.00 159.50 95.00 7467.45 Nov 11000.00 2596.05 434.00 159.50 95.00 7715.45 Dec 11000.00 2596.05 0.00 159.50 95.00 8149.45
Note again that all of this math has been performed by real first-class
Decimal objects doing the same rounding that is implemented by the Internal Revenue Service. This is a grid of true decimal objects — organized and displayed by Pandas, but to which Pandas is always deferring the actual math.
Finally, my Pandas data frame not only did a great job of formatting itself for terminal output — exempting me from the fussy job of measuring column widths, justifying fields, and adding whitespace — but it could effortlessly slice and aggrate the table as well. Here, for instance, is an individual pay stub for September:
print 'September payroll stub:' print df.ix['Sep']
September payroll stub: wage 11000.00 fedwh 2596.05 ss_tax 682.00 mc_tax 159.50 statewh 95.00 paycheck 7467.45 Name: Sep, dtype: object
And generating a W-2 for delivery to the employee at year’s end is as simple as applying Python’s
sum routine to each column — the data frame does have a
sum() method, but it seems to insist on operating in floating point — and letting Pandas display the result:
print 'End-of-year W-2' print df.apply(sum)
End-of-year W-2 wage 132000.00 fedwh 31152.60 ss_tax 7254.00 mc_tax 1914.00 statewh 1140.00 paycheck 90539.40 dtype: object
When I first saw Pandas touted, I planned to use it the next time I had a big problem — like a million-line Apache log file. But it turns out to do just as well with tiny problems. Unlike heavyweight solutions that are not worth trotting out and setting up unless one has a truly big problem to deploy them against, Pandas scaled right down to my twelve-line paystub table and made my code simpler and easier to maintain.
Even on a tiny data set it is more than worth it to import
pandas, set up your index, and get to work!