Sunday 23 November 2014

Pandas and indexing by integer or label: A gotcha!

Just a warning: Be aware of the difference between indexing by integers and by labels in Pandas. Perhaps even use the more recent methods that explicitly distinguish between the two.

Example: Say you have a dataframe (df) with three obervations that are indexed by integers like this:

1  45
4  74
7  93

df.ix[1] is by default label based so it would return 45, but it is easy to forget this and think that .ix refers to location when the index is a list of consecutive integer numbers (as it quite often is).

If .ix[integer] does not find its label, then it tried to use position. So df[2] would return 93 (positional indexing starts at 0, also slightly confusing initially).

Anyway, not knowing the difference it is easy to make mistakes when, for instance, trying to assign a value to a specific cell in the dataframe. It may end up somewhere you did not intended, or you may get an error.

For this reason it is probably better to use the explicit notation

df.iloc(integer) for positional indexing

and

df.loc(label) for label based indexing

This is all in the documentation, so it is nothing new, just something that might confuse people who have not read it before.

Minor comment: The difference between terms .iloc and .loc may not convey the intuitive difference. Both are loc. The same goes for the added "i" to indicate row selection (as opposed to columns). Perfectly OK, but in general the i (as well as axis = 0 vs 1 notation) might have been more inutitively phrased. For instance by using simply "axis = "row/columns" and loc_row. Multidimensional objects might create problems here, but it could be solved by adding row1, row2, or allowing both "row" or the integer style). Anyway, just a minor quibble!







Saturday 31 May 2014

Subtracting one column from another in Pandas created memory probems ... and a solution

I had two datasets with about 17 million observations for different variables in each. One was an event file (admissions to hospitals, when, what and so on). The other file was a person level file describing the characteristics of the individual who was admitted (gender, birth year and so on). All I wanted to do was to add some of the individual characteristics to the dataframe of the event so that I could calculate the relative frequency of different events in different age categories. Sounds easy. And it is, if you ignore memory problems. Using the dataframe in pandas, python, I first tried:
dfe["male"]=dfp["male"]
dfe["birth_year"] = dfp["birth_year"]
dfe["age"]  = dfe["out_year"] - dfe["birth_year"]
But the last line caused memory errors.

It may be an issue internal to Pandas or the garbage collector in Python.

I guess one might try generators and iterators, but I found something simple that also worked: Just convert the two columns to lists, delete one list from another and append the result to the dataframe. In short, doing the process outside of the dataframe seemed to solve the problem. Like this:

test1 = dfe["out_year"].values
test2 = dfe["birth_year"].values
test3 = test1 - test2