Wednesday 23 August 2017

The fast way to join string columns in Pandas: 600 times faster!

Say you have two or more string columns in pandas and want to join the content of those columns into one single column.

List of two column names:
eventcol = [bio, procedure_codes]

A common recommendation on Stackoverflow is to use apply and join.With about 850 000 rows and two columns:

%timeit df[eventcol].apply(lambda x: ''.join(x), axis=1)
13.6 s ± 343 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is quite slow and it would be significantly slower if you have many rows (and columns).

Here are some alternatives:

* Use Pandas string methods and cat:
%timeit df[eventcol[0]].str.cat(cols, sep=',')
264 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

* Using sum (which concatenate strings)
%timeit df[eventcol].sum(axis=1)
509 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

* Just add the two columns (if you know they are strings)
%timeit df.bio + df.procedure_codes
21.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

* Convert to string and add:
%timeit df.bio.astype(str) + df.procedure_codes.astype(str)
223 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So simply adding the columns seems to be the fast solution (It is more than 600 times faster than using apply). The reason is that apply works row by row, while the other methods works on whole columns (vectorized).

Tuesday 1 August 2017

With this trick, HDF becomes a very fast dataformat ...

A common task for people anlysing health register data is the following: You want to select all persons with hip fractures from a health register, and then select and analyse all the treatments administered to these patients.

To do this you first have to get the unique id for all the relevant patients, then get the the registered events for all the people associated with the ids.

Both steps can be very time consuming since we are often talking about millions of observations and we may have to search several text strings for each event to get the ids of those with a specific disease. The first problem -- how to efficiently search for individuals with a specific diagnose - is a separate topic. Here I will only deal with the last issue: How to get all the events for individuals given that we know their ids.

The problem is not that it is difficult in itself to write a query to find events associated with different ids, but that the file is so large that it becomes important to think about how to store and select the events from files of tens of million of events.

An initial solution may use csv files and Pandas. This works surprisingly well, but it still took several minutes to extract all events for patients with a disease. To speed things up, I tried using a different file format (hdf5) instead of csv, but at the first attempt there was no speedup.

However, a closer inspection reveals that it is possible to get significantly faster. The following three steps made a big difference:

1. Make the id column an "index column" when saving the data.
2. Use "store.select_column()" to get the id variable and create a boolean array of the observations you want, (The key is here to use store.select" and NOT "store.columnname"  to select a column)
3. Use a boolean array in the "where statement" when reading the data

More details:

Assume you have: hip_ids (a list of ids of patients with hip fractures). In the pandas HDFstore version of hdf it is possible to select both columns and rows when loading the data. This means that we could get the events we want using the following:

df = pd.read_hdf('file.h5', key='np12',  where='id == hip_ids')

Unfortunately this was slow. One reason is that the computer takes a lot of time evaluating the "where" condition. It turns out, however, that a boolean array (True, True, False, True, False etc) specifying the correct rows to include is much faster.

FAST:

df = pd.read_hdf('file.h5', key='np12',  where=boolean_array_for rows_to_include)

But to use this last solution, we need to create a boolean array for which events to include (those involving patients with a given id).

Again, the first attempt was slow:

events_for_hip_patients = store.np12.id.isin(hip_ids)

But I then discovered "store_select" which is the fast column selector:

FAST

ids_all_events = store.select_column('np12', 'id')
events_for_hip_patients = ids_all_events.isin(hip_ids)

All this, of course, requires that you pre-specify that "id" is a data column. When storing the file, you specify this by using the data_columns argument:

df.to_hdf(file, key ,format='table',append=True, data_columns = ['id'])

Using this solution the process of getting get all events for peope with a given id took seconds, not minutes.

In short: Make id an indexable datacolumn; Identify the desired rows by creating a boolean array using the "select_column" format; ,Use the array when reading the file (in the "where" argument).