dataHans: July 2015

Sunday, 12 July 2015

Avoid groupby for large datasets in Pandas?

Calculating readmission rates to hospitals events sounds easy when you have a complete dataset of all hospital events in a year: Just determine what proportion of events that were followed by a new hospital event within 30 days.

In reality, of course, it is often complicated. First, there are many conceptual and practical problems: Should outpatient visits count as the same as an inpatient visit? Do you only count admissions with a diagnosis related to the previous treatment? How to avoid counting transfers between hospitals as readmissions. And the important but pedantic: Do you include patients at 30 days?

Even after you manage to solve these problems, there is a problem of speed. With 17 million events at the hospital (each with a in and out date and a unique person id) it simply took way to much time to group the events for person (using groupby) and apply a user defined function to determine whether the event was associated with another event within 30 days.

The solution?

A combination of elimination and dummy creation solved the problems without the need for groupby. First, sort observations by id and out-date. Next, create a new id and sa new out-date column by shifting the original id and out-date column up one row. Calculate the difference between the original out date and the shifted out date - values below 30 represent observations that should be counted as having a readmissions.

But there is one problem: Because we shifted the values, the last observation for each person was compared to the first observation for another person. This can be solved by eliminating or setting to "nan" the observations for which the person id and the shifted person id are different. Now we have a solution by simply counting the number of observations with a value of less than 30 in a single column.

This solution avoided groupby, it was fast even with 17 million observations. There may be a general lesson here: Groupby in pandas is time expensive in large datasets and you may achieve the same results using sort, shift and simple calculations.