The most successful joke among statisticians (a low bar, I know) is that we spend 80 percent of our time cleaning data, and the other 20 percent complaining about cleaning data. The joke resonates with analysts because it rings true. We do spend a nontrivial amount of time on the busy work of data munging, when we’d much rather be analyzing the data.

When doing data analysis, you’ll often have assumptions that should be true but might not be.

For example, you might assume that every person has one (and only one) email address for your service. Getting these assumptions wrong can bias your analysis, so it’s good to check them.
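As a minimal sketch of what checking such an assumption by hand looks like, the snippet below uses plain pandas on a small made-up table. The column names `user_id` and `email` and the rows themselves are assumptions for illustration, not part of the trains data used later.

```python
import pandas as pd

# Hypothetical data: `user_id` and `email` are illustrative column names.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 3],
    "email": ["a@example.com", "b@example.com",
              "c@example.com", "c2@example.com"],
})

# Count distinct email addresses per user. The assumption holds only if
# every count is exactly 1 (missing emails count as 0 and are flagged too).
emails_per_user = users.groupby("user_id")["email"].nunique()
violations = emails_per_user[emails_per_user != 1]

print(violations)
```

Any user listed in `violations` breaks the one-email assumption and is worth a closer look before the analysis continues.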

However, the process of writing and checking those assumptions is tedious.


Enter Engarde

To streamline the cycle of reading data, checking assumptions, applying transformations, and checking assumptions again, I’ve written a simple Python library called Engarde, which builds on top of pandas.

For this post, we’ll work with a data set of customer preferences on trains, which I’ve saved to data/trains.csv.

We can start by making some very basic assertions – that the dataset is the correct shape and that a few columns are the correct data types. Assertions are made as decorators to functions that return a DataFrame.

In [1]: import pandas as pd

In [2]: import engarde.decorators as ed

In [3]: pd.set_option('display.max_rows', 10)

In [4]: dtypes = dict(
   ...:     price1=int,
   ...:     price2=int,
   ...:     time1=int,
   ...:     time2=int,
   ...:     change1=int,
   ...:     change2=int,
   ...: )

In [5]: @ed.is_shape((None, 11))
   ...: @ed.has_dtypes(items=dtypes)
   ...: def unload():
   ...:     trains = pd.read_csv("data/trains.csv", index_col=0)
   ...:     return trains

In [6]: unload()
Out[6]:
     id  choiceid   choice  price1  time1  change1  comfort1  price2  time2  change2  comfort2
1     1         1  choice1    2400    150        0         1    4000    150        0         1
2     1         2  choice1    2400    150        0         1    3200    130        0         1
3     1         3  choice1    2400    115        0         1    4000    115        0         0
4     1         4  choice2    4000    130        0         1    3200    150        0         0
5     1         5  choice2    2400    150        0         1    3200    150        0         0
..   ..       ...      ...     ...    ...      ...       ...     ...    ...      ...       ...
347  30         7  choice1    2100    135        1         1    2800    135        1         0
348  30         8  choice1    2100    125        1         1    3500    125        1         0
349  30         9  choice1    2100    150        0         0    2800    125        0         1
350  30        10  choice1    2800    125        0         1    2800    135        1         0
351  30        11  choice2    3500    125        1         0    2800    135        1         0

[351 rows x 11 columns]

One very important part of the design of Engarde is that your code, the code actually doing the work, shouldn’t have to change. I don’t want a bunch of asserts cluttering up the logic of what’s happening. This is a perfect case for decorators.

The order of execution here is as follows. First, unload returns the DataFrame trains. Next, ed.has_dtypes asserts that trains has the correct dtypes as specified with dtypes. Once that assert passes, has_dtypes passes trains along to the next check, ed.is_shape, and so on, until the original caller gets back trains.
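To make that ordering concrete, here is a rough sketch of how stacked checking decorators behave. The names `check_shape`, `check_dtypes`, `load`, and `calls` are hypothetical stand-ins written for this illustration; they are not Engarde’s implementation, only the same general pattern.

```python
import functools
import pandas as pd

calls = []  # records the order in which the function and checks run

def check_shape(shape):
    """Hypothetical shape check; None in `shape` means 'any size'."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            df = func(*args, **kwargs)
            calls.append("check_shape")
            for actual, expected in zip(df.shape, shape):
                assert expected is None or actual == expected, (
                    f"shape {df.shape} does not match {shape}")
            return df
        return wrapper
    return decorator

def check_dtypes(items):
    """Hypothetical dtype check for the columns named in `items`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            df = func(*args, **kwargs)
            calls.append("check_dtypes")
            for column, dtype in items.items():
                assert df[column].dtype == dtype, (
                    f"{column} is {df[column].dtype}, expected {dtype}")
            return df
        return wrapper
    return decorator

@check_shape((None, 2))
@check_dtypes({"price1": "int64"})
def load():
    calls.append("load")
    return pd.DataFrame({"price1": [2400, 4000], "time1": [150, 130]})

df = load()
print(calls)
```

The decorator closest to the function wraps it first, so its check runs first once the function returns; the outermost decorator runs last and hands the DataFrame back to the caller.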

Each row of this dataset contains a passenger’s preference over two routes. Each route has an associated cost, travel time, comfort level, and number of changes.

Like any good economist, we’ll assume people are rational. Their first choice is surely going to be better than their second choice in at least one way (faster, more comfortable, etc.).

This is fundamental to our analysis later on, so we’ll explicitly state it in our code, and check it in our data analysis.

We write our custom assumption as a function rational that takes a DataFrame and returns one or more boolean (true / false) values.

In [7]: def rational(df):
   ...:     """
   ...:     Check that at least one criterion is better.
   ...:     """
   ...:     r = ((df.price1 < df.price2) | (df.time1 < df.time2) |
   ...:          (df.change1 < df.change2) | (df.comfort1 > df.comfort2))
   ...:     return r
   ...:

In [8]: @ed.is_shape((None, 11))
   ...: @ed.has_dtypes(items=dtypes)
   ...: @ed.verify_all(rational)
   ...: def unload():
   ...:     trains = pd.read_csv("data/trains.csv", index_col=0)
   ...:     return trains
   ...:

In [9]: df = unload()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-9-b108f050ce4e> in <module>()
----> 1 df = unload()

...

AssertionError: ('rational not true for all',

id  choiceid   choice  price1  time1  change1  comfort1  price2  time2  \
13    2         3  choice2    2450    121        0         0    2450     93
18    2         8  choice2    2975    108        0         0    2450    108
27    3         6  choice2    1920    106        0         0    1440     96
28    3         7  choice1    1920    106        0         0    1920     96
33    4         1  choice2     545    105        1         1     545     85
..   ..       ...      ...     ...    ...      ...       ...     ...    ...
306  27         9  choice1    3920    140        1         1    3920    125
319  28         8  choice2    2450    133        1         1    2450    108
325  28        14  choice2    2450    123        0         1    2450    108

[42 rows x 11 columns])

So, our check failed. Apparently people aren’t rational.

Engarde has printed the name of the failed assertion and the rows that violated our assumption. We’ll “fix” this problem by ignoring those people, although in reality we’d dig into why those people are different.
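The failure above carries both a message and the offending rows. A hedged sketch of a decorator with that behavior follows; `verify_all_sketch` and `load` are hypothetical names, the two rows are toy values shaped like the trains data, and this is not Engarde’s source code.

```python
import functools
import pandas as pd

def verify_all_sketch(check):
    """Hypothetical decorator: assert that check(df) is True for every row."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            df = func(*args, **kwargs)
            result = check(df)
            if not result.all():
                # Attach the failing rows so the caller can inspect them.
                raise AssertionError(
                    f"{check.__name__} not true for all", df[~result])
            return df
        return wrapper
    return decorator

def rational(df):
    """Check that at least one criterion is better for the first route."""
    return ((df.price1 < df.price2) | (df.time1 < df.time2) |
            (df.change1 < df.change2) | (df.comfort1 > df.comfort2))

@verify_all_sketch(rational)
def load():
    # Toy rows: the first satisfies `rational`, the second does not.
    return pd.DataFrame({
        "price1": [2400, 2450], "price2": [4000, 2450],
        "time1": [150, 121], "time2": [150, 93],
        "change1": [0, 0], "change2": [0, 0],
        "comfort1": [1, 0], "comfort2": [1, 0],
    })

try:
    load()
except AssertionError as err:
    message, bad_rows = err.args
    print(message)
    print(bad_rows)
```

Catching the error and unpacking `err.args` is one way to start digging into the violating rows instead of simply dropping them.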

In [16]: @ed.verify_all(rational)
   ....: def drop_nonrational_people(df):
   ....:     r = df.query("price1 < price2 | time1 < time2 |"
   ....:                  "change1 < change2 | comfort1 > comfort2")
   ....:     return r
   ....:

In [17]: @ed.is_shape((None, 11))
   ....: @ed.has_dtypes(items=dtypes)
   ....: def unload():
   ....:     trains = pd.read_csv("data/trains.csv", index_col=0)
   ....:     return trains

In [18]: df = unload().pipe(drop_nonrational_people)

In [19]: df.head()
Out[19]:
   id  choiceid   choice  price1  time1  change1  comfort1  price2  time2  change2  comfort2
1   1         1  choice1    2400    150        0         1    4000    150        0         1
2   1         2  choice1    2400    150        0         1    3200    130        0         1
3   1         3  choice1    2400    115        0         1    4000    115        0         0
4   1         4  choice2    4000    130        0         1    3200    150        0         0
5   1         5  choice2    2400    150        0         1    3200    150        0         0

All of our assertions have now “passed,” so we’re happy, and our data analysis can proceed.
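One design note: drop_nonrational_people restates the condition from rational as a query string, so the two can drift apart if one is edited. A possible alternative, sketched here on toy data with a hypothetical `trains` frame, reuses rational as a boolean mask and keeps the dropped rows around for later inspection.

```python
import pandas as pd

def rational(df):
    """Check that at least one criterion is better for the first route."""
    return ((df.price1 < df.price2) | (df.time1 < df.time2) |
            (df.change1 < df.change2) | (df.comfort1 > df.comfort2))

def drop_nonrational_people(df):
    # Keep only rows where the check holds, reusing `rational` as the mask.
    return df[rational(df)]

# Toy rows shaped like the trains data; the middle row violates `rational`.
trains = pd.DataFrame({
    "id": [1, 2, 2],
    "price1": [2400, 2450, 2450], "price2": [4000, 2450, 3000],
    "time1": [150, 121, 100], "time2": [150, 93, 100],
    "change1": [0, 0, 0], "change2": [0, 0, 0],
    "comfort1": [1, 0, 0], "comfort2": [1, 0, 0],
})

kept = trains.pipe(drop_nonrational_people)
dropped = trains[~rational(trains)]  # rows to investigate, not just discard

print(kept)
print(dropped)
```

With a single definition of the rule, the filter and the assertion cannot disagree about which rows count as rational.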

By using (and reusing) simple libraries like Engarde, we were able to do the data analysis efficiently while maintaining correctness.

– Tom Augspurger, Lead Data Analysis Scientist at MITTERA
