Sunday 3 April 2016

Idea: Read random rows instead of first n rows to infer datatypes

Reading large .csv files often causes problems because the reader has to infer the datatype of each column. Typically this is done by sniffing the first n rows, since sniffing the whole file would take too long. If the inferred datatype turns out to be wrong, the whole process either breaks down (often after considerable time) or becomes very slow.
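
To make the mechanism concrete, here is a minimal sketch using plain pandas and a hypothetical file name: sniff only the head of the file, then reuse the inferred dtypes for the full read. This is exactly the approach that fails when later rows don't match the head.

    import pandas as pd

    # Sniff only the first 1,000 rows (cheap) and record the inferred dtypes.
    preview = pd.read_csv("large_file.csv", nrows=1000)
    dtypes = preview.dtypes.to_dict()

    # Reuse them for the full read; this fails or misbehaves if a later row
    # does not fit the dtype guessed from the head.
    df = pd.read_csv("large_file.csv", dtype=dtypes)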

Here is an example: I wanted to use a dask dataframe to analyze a large .csv file. After some time, I got the following error:
    Dask dataframe inspected the first 1,000 rows of your csv file to guess the
    data types of your columns. These first 1,000 rows led us to an incorrect
    guess. For example a column may have had integers in the first 1000
    rows followed by a float or missing value in the 1,001-st row.

    You will need to specify some dtype information explicitly using the
    ``dtype=`` keyword argument for the right column names and dtypes.

        df = dd.read_csv(..., dtype={'my-column': float})

    Pandas has given us the following error when trying to parse the file:

        "could not convert string to float: '2010-12-13 00:00:00'"
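
The fix the error message suggests is to specify the dtype explicitly so dask stops guessing. A minimal sketch, assuming the problematic column is called 'my-column' (matching the placeholder in the error) and simply treating it as a plain string:

    import dask.dataframe as dd

    # Tell dask not to guess: read the problematic column as plain strings
    # ('object') and convert it later if needed.
    df = dd.read_csv("large_file.csv", dtype={"my-column": "object"})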

This was useful, but it led me to think that a sniffing process using random rows, or at least rows from both the beginning and the end of the file, might be better than using only the first rows to infer datatypes. Why? Typically the .csv file is already sorted, so non-standard or missing values may be concentrated towards the end of the file, or at least the first rows tend to contain only "nice" values.
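
To make the idea a bit more concrete, here is a rough sketch (my own, not anything dask actually does) of inferring dtypes from a uniform random sample of lines, using reservoir sampling so only the sampled lines are held in memory. It assumes one record per line, i.e. no quoted newlines inside fields, and the file and function names are made up for the example.

    import io
    import random
    import pandas as pd

    def infer_dtypes_from_random_rows(path, n_samples=1000, seed=0):
        """Infer column dtypes from a uniform random sample of data lines."""
        rng = random.Random(seed)
        reservoir = []
        with open(path) as f:
            header = f.readline()
            # Reservoir sampling: stream through the file once and keep a
            # uniform random sample of n_samples data lines.
            for i, line in enumerate(f):
                if len(reservoir) < n_samples:
                    reservoir.append(line)
                else:
                    j = rng.randint(0, i)
                    if j < n_samples:
                        reservoir[j] = line
        # Let pandas infer dtypes from the sampled lines only.
        sample = pd.read_csv(io.StringIO(header + "".join(reservoir)))
        return sample.dtypes.to_dict()

    # The sampled dtypes could then be handed to the real read, e.g.
    # dtypes = infer_dtypes_from_random_rows("large_file.csv")
    # df = dd.read_csv("large_file.csv", dtype=dtypes)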

By the way: dask is great (once I got it working!)
