Saturday 8 June 2013

Reducing memory problems with large files in Pandas: Eliminate NaNs and use dtypes

I have had some problems working with a large dataset in pandas (17 million observations). I am repeating myself a bit (see the previous post), but the problem is not fully solved. Some solutions so far have been:

1. Eliminate or replace missing data before using pd.read_csv. Make the missing values -1 or 0 or something else that Pandas does not interpret as "nan" (not a number, i.e. missing). Why? Because if a column contains a NaN, Pandas automatically upcasts it to a larger datatype (typically float64), which takes more memory.
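
To see why this matters, here is a minimal sketch (with a made-up two-column CSV, not the real data) showing that a column with a missing value is read as float64, while the same kind of column with a -1 sentinel stays int64:

import io
import pandas as pd

# Column "a" has a missing value, column "b" uses -1 as a sentinel instead
csv = io.StringIO("a;b\n1;1\n;-1\n3;3")
df = pd.read_csv(csv, sep=";")
print(df.dtypes)  # a is upcast to float64 because of the NaN, b stays int64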

2. Even this is often not enough, because pd.read_csv often assigns datatypes that take more memory than necessary, for instance float64 when float16 is enough, and so on. To reduce this problem, you can explicitly declare the datatypes (and import only the columns you need):

import numpy as np
import pandas as pd

# Declare compact dtypes (int8/int16/int32) instead of the default 64-bit types
dtypes = {"id": np.int32, "birth_year": np.int16, "male": np.int8, "dead_year": np.int16, "dead_month": np.int8}
cols = list(dtypes.keys())
dfp = pd.read_csv(path + "eolnprreducedv2personfile.csv", usecols=cols, dtype=dtypes, sep=";", index_col="id")
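
To check that the declared dtypes actually reduce memory, you can inspect the result. A minimal sketch, assuming the dfp dataframe created above:

# Report per-column memory use in bytes; deep=True also measures object columns
print(dfp.dtypes)
print(dfp.memory_usage(deep=True))
print(dfp.memory_usage(deep=True).sum() / 1e6, "MB in total")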

3. Finally, it is a good idea to save and later import the file using the pickle (or HDF5) format, since pd.read_csv often runs out of memory even when the file itself is not very large. So having several large dataframes in memory and adding another using pd.read_csv will lead to memory problems, but importing the same data as a whole dataframe using pickle will be ok.
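
A minimal sketch of the pickle round trip, assuming the dfp dataframe and path from above (the .pkl filename is just an example):

import pandas as pd

# Save the dataframe once; pandas keeps the dtypes and the index
dfp.to_pickle(path + "personfile.pkl")

# In a later session, load the whole dataframe directly instead of re-running pd.read_csv
dfp = pd.read_pickle(path + "personfile.pkl")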
