Sunday 23 November 2014

Pandas and indexing by integer or label: A gotcha!

Just a warning: Be aware of the difference between indexing by integers and by labels in Pandas. Perhaps even use the more recent methods that explicitly distinguish between the two.

Example: Say you have a dataframe (df) with three obervations that are indexed by integers like this:

1  45
4  74
7  93

df.ix[1] is by default label based so it would return 45, but it is easy to forget this and think that .ix refers to location when the index is a list of consecutive integer numbers (as it quite often is).

If .ix[integer] does not find its label, then it tried to use position. So df[2] would return 93 (positional indexing starts at 0, also slightly confusing initially).

Anyway, not knowing the difference it is easy to make mistakes when, for instance, trying to assign a value to a specific cell in the dataframe. It may end up somewhere you did not intended, or you may get an error.

For this reason it is probably better to use the explicit notation

df.iloc(integer) for positional indexing

and

df.loc(label) for label based indexing

This is all in the documentation, so it is nothing new, just something that might confuse people who have not read it before.

Minor comment: The difference between terms .iloc and .loc may not convey the intuitive difference. Both are loc. The same goes for the added "i" to indicate row selection (as opposed to columns). Perfectly OK, but in general the i (as well as axis = 0 vs 1 notation) might have been more inutitively phrased. For instance by using simply "axis = "row/columns" and loc_row. Multidimensional objects might create problems here, but it could be solved by adding row1, row2, or allowing both "row" or the integer style). Anyway, just a minor quibble!