Monday, 31 July 2017

What is the fastest file format for pandas dataframes? A surprise!

I was surprised to discover that for long (but not very wide) messy and text heavy data, the good old .csv file format was read very fast by the csv reader in Pandas. A dataframe with more than 6 million rows and about 50 columns  took less than four minutes.

This may not sound very fast, and initially I thought that some of the other formats might be faster. So I tried hdf, parquet, and feather. Some of these are very fast (feather), but the issue was not only speed, but also flexibility. I wanted to be able to read  several files and only select some columns and rows and avoid reading it all and then filtering.

In addition, speed comparison depends heavily on the nature of the data. In this case the data was messy: there was no encoding into categoricals, there were some columns with text that varied in length and there were lots of missing values and unparsed dates. All this influences the speed of reading from various storage formats. A more numeric focused dataframe would be much faster for some of the other formats. But for text csv seemed to do well. Perhaps, sadly? Is there is a lack of good string support in many data storage formats?

In addition, there is the question of flexibility. Some formats, like hdf, are well established and it is easy to read selections (columns/rows) as well as adding new variables. But in my case the data was quite static and I was not going to add many new columns or rows. So this was not important. In any case hdf turned out not to be faster than pandas csv for my data. But flexible reading was important. And unlike, for instance, feather, pandas csv offered very flexible reading. Parquet was flexible in terms of colummns, but less good at rows (sometimes it was possible, but it assumed the existence of a well balanced and relatively low-cardinality grouping variable).

So, all in all, after a roundtrip I ended up with the good old csv and Pandas csv reader. for other types of data it may be better to use some of the other formats. It all depends on the type of data you have and what you want to do with it. I may switch to a binary format after cleaning up the files (using categorical data types, parsing dates and so on). But all in all I was very impressed by the csv reader in Pandas after a comparison with other solutions.