Sunday 24 April 2016

Combining many, many text columns into one column in Pandas



Sometimes it is useful to combine the content of many columns into one. For instance, data from hospital events often contain one column for each of the diagnostic categories the patient has received. Combining all of these into a single column might be useful for several reasons. First, it may save a lot of memory: instead of twenty columns, many of which are empty for a lot of patients since the majority only have a few diagnoses, we end up with just one. Second, it is sometimes easier to search a single column when we want to select patients who have received a particular diagnosis.


How can this be done using Pandas? There are lots of possible options, but most of them are bad when we are dealing with a large dataset. Here is one:

icdcolumns = ['icd1', 'icd2', 'icd3', 'icd4', 'icd5', 'icd6']
df['icd_all'] = df[icdcolumns].apply(lambda x: x.str.cat(sep=', '), axis=1)

It looks promising (and it works), but there is a major problem: it is very memory intensive, and using apply along axis 1 tends to be slow. Even on a very good computer, this took about six hours on a dataset with more than fifty million observations and twenty columns to be combined.
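One direction that might avoid the row-wise apply is to concatenate the columns one at a time, so the Python loop runs over twenty columns rather than fifty million rows. Here is a minimal sketch, assuming the same df and icd columns as above; the separator handling is crude and only cleans up trailing separators:

# Column-wise concatenation: the loop is over the (few) columns,
# while the string operations are vectorized over all rows.
icdcolumns = ['icd1', 'icd2', 'icd3', 'icd4', 'icd5', 'icd6']
combined = df[icdcolumns[0]].fillna('')
for col in icdcolumns[1:]:
    combined = combined + ', ' + df[col].fillna('')

# Strip separators left behind by empty trailing columns
# (empty columns in the middle would still leave ', ,' inside the string)
df['icd_all'] = combined.str.strip(', ')

I have not benchmarked this on data of that size, but it avoids calling a Python function once per row.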

Even worse, after the computation was finished, I tried to save the result using:

df.to_pickle('stringicd')

This made the computer crash because of memory problems (lesson: be careful with serialization when memory is short). Six hours of CPU time wasted.
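One lower-memory way to get the result onto disk (a sketch, assuming plain text output is acceptable; the file name is made up) is to stream it out in chunks rather than serializing the whole DataFrame in one go:

# Write only the combined column, a million rows at a time
df[['icd_all']].to_csv('icd_all.csv', chunksize=1000000)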

Instead of simply running the whole thing again, I kept thinking that there must be a better way. Any suggestions? I have an idea, but that is for another blogpost.


Thursday 14 April 2016

Learning How to Build a Web Application for Data Scientists

Learning How to Build a Web Application by Robert Chang is a great overview of how to develop web apps for data visualizations. Recommended!





Installing pymc3 on Windows machines

PyMC3 is a Python package for estimating statistical models. The package has an API which makes it very easy to create the model you want (because it stays close to the way you would write it in standard mathematical notation), and it also includes fast algorithms, such as NUTS, that estimate the parameters in the models. All in all it is a great package, BUT there is one significant problem: it seems very difficult to install pymc3 with the dependencies that are needed to make it run reasonably fast. This is partly because pymc3 requires Theano, a package that speeds up computations, and Theano in turn speeds up computations even more if it can use the GPU instead of the CPU, but making this happen often requires some extra installations. After fiddling around for some hours, I finally ended up with the following recipe for installing something that works reasonably fast (but it does not use the GPU). A small example model to check the installation is included after the steps below.

1. Install Anaconda 2.7 (64 bit) from Continuum: https://www.continuum.io/downloads

2. If you do not have Git installed, install it.
Easy setup here: https://desktop.github.com/
More detailed instructions here: https://help.github.com/articles/set-up-git/


3. Then open the anaconda terminal from the start menu (make sure it is the Python 2 version if you have several versions) and run the following:

conda install mingw libpython

pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git

pip install git+https://github.com/pymc-devs/pymc3


If you get an error about not finding "git", you need to add Git to the PATH. Instructions here: http://stackoverflow.com/questions/26620312/installing-git-in-path-with-github-client-for-windows
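Once the installation is finished, a small model is a quick way to check that everything works, and it also shows what "close to standard mathematical notation" means in practice. This is only an illustrative sketch with made-up data:

import numpy as np
import pymc3 as pm

# Made-up data: 100 draws from a standard normal distribution
data = np.random.randn(100)

with pm.Model() as model:
    # Priors, written much as you would on paper
    mu = pm.Normal('mu', mu=0, sd=10)
    sigma = pm.HalfNormal('sigma', sd=10)
    # Likelihood
    obs = pm.Normal('obs', mu=mu, sd=sigma, observed=data)
    # NUTS is the default sampler for continuous parameters
    trace = pm.sample(1000)

print(pm.summary(trace))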



Sunday 3 April 2016

Idea: Read random rows instead of first n rows to infer datatypes

Reading large .csv files often creates problems because the program has to infer the datatype of the columns. Typically this is done by sniffing the first n rows, since it would take too much time to sniff everything. And if the inferred datatype is wrong, the whole process breaks down (often after considerable time) or becomes very slow. 

Here is an example: I wanted to use a dask dataframe to analyze a large .csv file. After some time, I got the following error:
Dask dataframe inspected the first 1,000 rows of your csv file to guess the
data types of your columns. These first 1,000 rows led us to an incorrect
guess. For example a column may have had integers in the first 1000
rows followed by a float or missing value in the 1,001-st row.

You will need to specify some dtype information explicitly using the
``dtype=`` keyword argument for the right column names and dtypes.

df = dd.read_csv(..., dtype={'my-column': float})

Pandas has given us the following error when trying to parse the file:
"could not convert string to float: '2010-12-13 00:00:00'"

This was useful, but it led me to think that a sniffing process using random rows, or at least rows from both the beginning and the end of the file, might be better than using only the first rows to infer datatypes. Why? Typically the .csv file is already sorted, and non-standard or missing values may be located towards the end of the file, or at least the first rows may be filled with columns that only have "nice" values.
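As a rough illustration of the idea, pandas can already be asked to read a random subset of rows and infer dtypes from that sample. A sketch, where the file name and the 1% sampling rate are just assumptions (and note that the whole file is still scanned once, though only the sample is kept in memory):

import random
import pandas as pd

random.seed(0)

# Keep the header (row 0) and roughly 1% of the data rows, chosen at random
sample = pd.read_csv('large_file.csv',
                     skiprows=lambda i: i > 0 and random.random() > 0.01)

# These dtypes could then be passed to dd.read_csv(..., dtype=...)
print(sample.dtypes)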

By the way: dask is great (once I got it working!)

Saturday 2 April 2016