Saturday, 31 May 2014

Subtracting one column from another in Pandas created memory problems ... and a solution

I had two datasets, each with about 17 million observations of different variables. One was an event file (hospital admissions: when, what and so on). The other was a person-level file describing the characteristics of the individual who was admitted (gender, birth year and so on). All I wanted to do was add some of the individual characteristics to the event dataframe so that I could calculate the relative frequency of different events in different age categories. Sounds easy. And it is, if you ignore memory problems. Using dataframes in pandas (Python), I first tried:
dfe["male"] = dfp["male"]
dfe["birth_year"] = dfp["birth_year"]
dfe["age"] = dfe["out_year"] - dfe["birth_year"]
But the last line caused memory errors.

I am not sure why. It may be an issue internal to Pandas, or something to do with the garbage collector in Python.
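One way to start narrowing this down is to look at the column dtypes and the memory each column takes before doing the subtraction. The sketch below uses small made-up frames, not the real data, and the mismatched index is an assumption chosen only to show one known pandas behaviour: assigning a column from another dataframe aligns on index labels, and any row without a match becomes NaN, which turns an integer column into float64. Whether that played any part in the original memory error is not something I can confirm.

```python
import pandas as pd

# Toy frames; the index values are deliberately not identical (assumption for illustration).
dfe = pd.DataFrame({"out_year": [2010, 2012, 2013]}, index=[0, 1, 2])
dfp = pd.DataFrame({"birth_year": [1950, 1985, 2001]}, index=[0, 1, 5])

# Column assignment from another frame aligns on index labels, not on position.
dfe["birth_year"] = dfp["birth_year"]

# Row 2 has no match in dfp, so it becomes NaN and the column is stored as float64.
print(dfe.dtypes)

# Bytes used per column (deep=True also counts Python objects in object columns).
print(dfe.memory_usage(deep=True))
```

If the dtypes or the per-column sizes look different from what you expected, that is a hint about where the memory is going.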

I guess one might try generators and iterators, but I found something simple that also worked: just pull the two columns out of the dataframe as arrays (.values returns NumPy arrays rather than Python lists), subtract one from the other and add the result back to the dataframe as a new column. In short, doing the calculation outside of the dataframe seemed to solve the problem. Like this:

test1 = dfe["out_year"].values
test2 = dfe["birth_year"].values
test3 = test1 - test2
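For completeness, here is the whole workaround as a small self-contained sketch, including the step of putting the result back into the dataframe. The data is made up (the real file had about 17 million rows), and the variable names out_year, birth_year and age are just more descriptive stand-ins for test1, test2 and test3.

```python
import pandas as pd

# Toy stand-in for the event dataframe.
dfe = pd.DataFrame({
    "out_year": [2010, 2012, 2013],
    "birth_year": [1950, 1985, 2001],
})

# Pull the columns out as plain NumPy arrays, bypassing the pandas index.
out_year = dfe["out_year"].values
birth_year = dfe["birth_year"].values

# Elementwise subtraction on the arrays.
age = out_year - birth_year

# Assigning an array of matching length creates the new column by position.
dfe["age"] = age

# Drop the temporary names so the intermediate arrays can be freed.
del out_year, birth_year, age
```

Because the arrays carry no index, the result is matched to the rows purely by position, so this only makes sense when both columns come from the same dataframe in the same row order, as they do here.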