Date: 14 February 2014
Tags: python
I will admit it:
I only thought to pull out Pandas
when my Python script was nearly complete,
because running print
on a Pandas data frame
would save me the trouble of formatting 12 rows of data by hand.
But as I added the import
statement,
it suddenly struck me
that Pandas aggregate operations
might be able to replace some of the steps inside my big for
loop.
Half an hour later,
my loop had disappeared completely
and my entire script was reduced
to a short sequence of Pandas method calls!
This post is a brief tour of the final script, written up as an IPython notebook and organized around five basic lessons that I learned about Pandas by applying it to this problem.
The most surprising feature of tools like the NumPy vector library and the Pandas data series might be that they let you bring your own objects to the party.
Since they were invented to efficiently pack
millions of floating-point values into a single Python object,
you might think that these libraries would only work
with numeric types that they themselves can express as vectors.
But in fact they have a clean fallback behavior
when faced with alternative numeric types:
they essentially turn into Python lists,
and keep a separate reference to each object that you have supplied.
When asked to perform their famous aggregate operations —
this is the crucial step —
they simply turn to the objects
and call the __add__()
or __mul__()
method of every single item
to build the result.
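Here is a minimal sketch of that fallback behavior — a quick experiment of my own, not part of the payroll script — using the Decimal type that we are about to lean on:

import pandas as pd
from decimal import Decimal

# Decimals cannot be packed into a float64 vector, so Pandas falls
# back to dtype=object and keeps a reference to each one.
s = pd.Series([Decimal('1.10'), Decimal('2.20'), Decimal('3.30')])
print s.dtype        # object
print s.sum()        # 6.60 — built up by calling Decimal.__add__()
print type(s.sum())  # <class 'decimal.Decimal'>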
When computing taxes and pay stubs
we always want to use Python Decimal objects
to guarantee correctness and careful rounding to cents —
a habit we can make convenient with a simple c()
function:
from decimal import Decimal

one_cent = Decimal('0.01')

def c(value):
    "Convert `value` to Decimal cents."
    return Decimal(value).quantize(one_cent)
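A quick sanity check of its behavior (these outputs assume Decimal’s default ROUND_HALF_EVEN rounding mode):

print c(100)       # 100.00
print c('19.999')  # 20.00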
You might think that my next step will be building a twelve-element list with which to initialize the monthly wages in our payroll table. But data frames let us do something simpler: we can supply a single value for a column, and Pandas will automatically broadcast it across the whole index that we have defined. Here we set up the twelve calendar months of the year and imagine the simple case of someone who makes the same wage every month:
import pandas as pd
from calendar import month_abbr
df = pd.DataFrame(index=month_abbr[1:])
df['wage'] = c('11000.00')
print df
         wage
Jan  11000.00
Feb  11000.00
Mar  11000.00
Apr  11000.00
May  11000.00
Jun  11000.00
Jul  11000.00
Aug  11000.00
Sep  11000.00
Oct  11000.00
Nov  11000.00
Dec  11000.00
An example wage of $11,000 per month might sound grandiose, but I have chosen it so that this example will reach the Social Security wage limit by the end of the year — to test whether we model the limit correctly in our data frame.
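The arithmetic behind that choice, as a quick sketch — the $117,000 limit is the 2014 figure, which we will define as ss_limit below:

print c('11000.00') * 12                   # 132000.00 earned over the year
print c('117000.00') - c('11000.00') * 10  # 7000.00 left taxable in November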
The second surprise of working with Pandas
is the convenience of its map()
method,
which lets us build a new series
by passing each element of an existing series
through a transform function.
Here, for example,
we compute the federal withholding that will be due
each month for our employee
by using the TaxSchedule
object of my
Luca accounting library:
from luca.taxes import TaxSchedule
federal_monthly = TaxSchedule([
    (325, '10'), (1023, '15'), (3163, '25'), (6050, '28'),
    (9050, '33'), (15906, '35'), (17925, '39.6'),
])
df['fedwh'] = df['wage'].map(federal_monthly.compute_tax_on)
print df
         wage    fedwh
Jan  11000.00  2596.05
Feb  11000.00  2596.05
Mar  11000.00  2596.05
Apr  11000.00  2596.05
May  11000.00  2596.05
Jun  11000.00  2596.05
Jul  11000.00  2596.05
Aug  11000.00  2596.05
Sep  11000.00  2596.05
Oct  11000.00  2596.05
Nov  11000.00  2596.05
Dec  11000.00  2596.05
It surprised me that mapping was so pleasant,
because the current Python habit is to avoid map()
—
which, after all, Guido nearly removed from Python 3’s built-ins —
in favor of list comprehensions.
Not only are comprehensions far more general
because they can pivot from calling a function
to computing any other kind of expression for each input item,
but they actually look like they are calling a function
because its name gets followed by parentheses
just as in any other function call:
# List comprehension versus calling map()
line1 = [federal_monthly.compute_tax_on(w) for w in df['wage']]
line2 = df['wage'].map(federal_monthly.compute_tax_on)
But in the particular case of Pandas a comprehension has disadvantages.
First, it adds syntactic noise.
Line 1 is not simply longer than line 2,
but visually busier as well.
Iteration that ought to be quietly implied inside of a vector
becomes an external and noisy for
loop instead.
The eye has to travel to check that the w
loop item
is indeed the input to the function.
And an extra pair of square brackets surrounds the whole affair.
The second disadvantage is small for this particular data set,
but still conceptually regrettable:
the list comprehension spins up an extra throw-away list
as a temporary holding pen
for the values that are really destined to live
inside the Pandas data frame.
Finally, you will soon see
that chaining Pandas method calls one after another
is a pretty common way to take an input series,
perform a useful sequence of operations upon it,
and be left with a useful result.
Method calls to map()
dovetail perfectly inside of this pattern,
while list comprehensions make you break out of it.
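To make the contrast concrete, here is a sketch of the same pipeline written both ways — the cumsum() call is previewed from a later section:

# map() keeps everything inside one chain of method calls
chained = df['wage'].map(federal_monthly.compute_tax_on).cumsum()

# the comprehension forces us out of the chain and back in again
broken = pd.Series([federal_monthly.compute_tax_on(w) for w in df['wage']],
                   index=df.index).cumsum()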
So my Pandas code
not only features the only calls to map()
that I make these days,
but makes them pretty often —
and is simpler and cleaner as a result!
My first exposure to Pandas, and to NumPy for that matter,
involved two kinds of operations.
First, there were binary operators like a + b
that we normally think of as involving two numbers,
but that when applied to whole vectors
might do thousands or millions of pairwise operations instead.
Second, there were aggregate operators
that take an entire vector and produce its minimum, maximum, sum,
or even full statistics about its contents.
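For instance — a tiny sketch using the wage column we built above:

print (df['wage'] + df['wage'])['Jan']  # 22000.00: a pairwise binary operation
print df['wage'].max()                  # 11000.00: an aggregate operation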
But the key to computing payroll taxes is to perform a third kind of operation: a cumulative sum of the wages paid to date so that the result can be compared with the Social Security wage limit, the point at which one’s Social Security obligation for the year is complete:
# For 2014
ss_limit = c('117000.00')
ss_rate = Decimal('0.062')
mc_rate = Decimal('0.0145')
It turns out that Pandas has a cumulative operator
that fits this situation perfectly!
The cumsum()
method takes a series as its input
and produces a series of cumulative totals instead:
print df['wage'].cumsum()
Jan     11000.00
Feb     22000.00
Mar     33000.00
Apr     44000.00
May     55000.00
Jun     66000.00
Jul     77000.00
Aug     88000.00
Sep     99000.00
Oct    110000.00
Nov    121000.00
Dec    132000.00
dtype: object
After computing this cumulative sum,
we can apply the Social Security limit
using the convenient clip_upper()
method —
which took me several minutes to find,
because I somehow thought that minimum
would have wound up somewhere in its name —
to ensure that we incur no additional tax obligation
once the limit has been reached.
Thanks to how cleanly Pandas methods chain together,
this can simply be added as another step
beyond the cumsum()
call:
cum_ss_wage = df['wage'].cumsum().clip_upper(ss_limit)
print cum_ss_wage
Jan     11000.00
Feb     22000.00
Mar     33000.00
Apr     44000.00
May     55000.00
Jun     66000.00
Jul     77000.00
Aug     88000.00
Sep     99000.00
Oct    110000.00
Nov    117000.00
Dec    117000.00
dtype: object
Finally, we need to step backwards. Having created a cumulative sum involving the wages earned to date, we need to split out each month’s earnings back into a separate stand-alone value.
Pandas supports this with a diff()
method
that is the opposite of cumsum()
except that the first row of the result will be NaN
(the numerics code for “not a number”)
because pairwise subtraction between twelve rows of data
produces only eleven values.
No problem!
Not-a-number comes up so often in data operations
that Pandas provides a fillna()
method
that lets us restore that first value
from our original series of monthly wages.
ss_wage = cum_ss_wage.diff().fillna(df['wage'])
print ss_wage
Jan    11000.00
Feb    11000.00
Mar    11000.00
Apr    11000.00
May    11000.00
Jun    11000.00
Jul    11000.00
Aug    11000.00
Sep    11000.00
Oct    11000.00
Nov     7000.00
Dec        0.00
dtype: object
We now know exactly how many dollars are subject to Social Security tax each month, and are ready to finish out our data frame.
In an early version of this computation,
I was faithfully attaching each of the intermediate series
that we computed above as new columns to my growing data frame.
Between the fedwh
column and the ss_tax
column in the table below,
imagine seeing all of my intermediate results —
extra columns with names like cumwage,
cumw_clipped, clipped_diff, and ss_wage.
To make a reasonable display of the table,
I had to cut more than half of its columns out!
It was therefore something of an epiphany
to realize that intermediate data series
need not be stored in the data frame at all.
Look at ss_wage
from the previous section as a perfect example:
even though I have kept it outside my data frame as a stand-alone series,
it follows all the same rules as data frame columns,
participates in aggregate operations,
and even gets to use the data frame’s own index.
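A quick demonstration, sketched with the series we just built:

print ss_wage.index.equals(df.index)  # True — same index as the frame
print (df['wage'] - ss_wage)['Nov']   # 4000.00 of November wages above the limit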
Stand-alone series keep my data frame clean. I save as columns only the outputs that need to be displayed, like the Social Security and Medicare taxes that get printed on a monthly pay stub — note the trailing map(c) calls below, which round each product back to whole cents:
df['ss_tax'] = (ss_wage * ss_rate).map(c)
df['mc_tax'] = (df['wage'] * mc_rate).map(c)
print df
         wage    fedwh  ss_tax  mc_tax
Jan  11000.00  2596.05  682.00  159.50
Feb  11000.00  2596.05  682.00  159.50
Mar  11000.00  2596.05  682.00  159.50
Apr  11000.00  2596.05  682.00  159.50
May  11000.00  2596.05  682.00  159.50
Jun  11000.00  2596.05  682.00  159.50
Jul  11000.00  2596.05  682.00  159.50
Aug  11000.00  2596.05  682.00  159.50
Sep  11000.00  2596.05  682.00  159.50
Oct  11000.00  2596.05  682.00  159.50
Nov  11000.00  2596.05  434.00  159.50
Dec  11000.00  2596.05    0.00  159.50
At this point our data frame is almost complete! We need only compute one last withholding schedule and perform the big subtraction that figures out how much money actually gets sent home with the employee each month, and we will be done:
# Ohio monthly withholding
ohio_exemption = c('650.00') / Decimal('12')
ohio_monthly = TaxSchedule(
    [(0, '0.5'), (5000, '1'), (10000, '2'), (15000, '2.5'),
     (20000, '3'), (40000, '3.5'), (80000, '4'), (100000, '5')],
    ohio_exemption,
)
df['statewh'] = df['wage'].map(ohio_monthly.compute_tax_on)
df['paycheck'] = (df['wage'] - df['fedwh'] - df['ss_tax']
                  - df['mc_tax'] - df['statewh'])
print df
         wage    fedwh  ss_tax  mc_tax  statewh  paycheck
Jan  11000.00  2596.05  682.00  159.50    95.00   7467.45
Feb  11000.00  2596.05  682.00  159.50    95.00   7467.45
Mar  11000.00  2596.05  682.00  159.50    95.00   7467.45
Apr  11000.00  2596.05  682.00  159.50    95.00   7467.45
May  11000.00  2596.05  682.00  159.50    95.00   7467.45
Jun  11000.00  2596.05  682.00  159.50    95.00   7467.45
Jul  11000.00  2596.05  682.00  159.50    95.00   7467.45
Aug  11000.00  2596.05  682.00  159.50    95.00   7467.45
Sep  11000.00  2596.05  682.00  159.50    95.00   7467.45
Oct  11000.00  2596.05  682.00  159.50    95.00   7467.45
Nov  11000.00  2596.05  434.00  159.50    95.00   7715.45
Dec  11000.00  2596.05    0.00  159.50    95.00   8149.45
Note again that all of this math
has been performed by real first-class Decimal
objects
doing the same rounding that is implemented by the Internal Revenue Service.
This is a grid of true decimal objects —
organized and displayed by Pandas,
but to which Pandas is always deferring the actual math.
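One line suffices to confirm it — a quick check of my own:

print type(df['paycheck']['Sep'])  # <class 'decimal.Decimal'>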
Finally, my Pandas data frame not only did a great job of formatting itself for terminal output — exempting me from the fussy job of measuring column widths, justifying fields, and adding whitespace — but it could effortlessly slice and aggregate the table as well. Here, for instance, is an individual pay stub for September:
print 'September payroll stub:'
print df.ix['Sep']
September payroll stub:
wage        11000.00
fedwh        2596.05
ss_tax        682.00
mc_tax        159.50
statewh        95.00
paycheck     7467.45
Name: Sep, dtype: object
And generating a W-2 for delivery to the employee
at year’s end is as simple as applying Python’s sum
routine to each column
— the data frame does have a sum()
method,
but it seems to insist on operating in floating point —
and letting Pandas display the result:
print 'End-of-year W-2'
print df.apply(sum)
End-of-year W-2
wage        132000.00
fedwh        31152.60
ss_tax        7254.00
mc_tax        1914.00
statewh       1140.00
paycheck     90539.40
dtype: object
When I first saw Pandas touted, I planned to use it the next time I had a big problem — like a million-line Apache log file. But it turns out to do just as well with tiny problems. Unlike heavyweight solutions that are not worth trotting out and setting up unless one has a truly big problem to deploy them against, Pandas scaled right down to my twelve-line paystub table and made my code simpler and easier to maintain.
Even on a tiny data set it is more than worth it to import pandas
,
set up your index, and get to work!