Tuesday 1 August 2017

With this trick, HDF becomes a very fast data format ...

A common task for people analysing health register data is the following: you want to select all persons with hip fractures from a health register, and then select and analyse all the treatments administered to these patients.

To do this you first have to get the unique ids of all the relevant patients, and then get the registered events for all the people associated with those ids.

Both steps can be very time consuming since we are often talking about millions of observations, and we may have to search several text strings for each event to get the ids of those with a specific disease. The first problem -- how to efficiently search for individuals with a specific diagnosis -- is a separate topic. Here I will only deal with the second issue: how to get all the events for individuals given that we know their ids.

The problem is not that it is difficult in itself to write a query that finds the events associated with different ids, but that the files are so large -- tens of millions of events -- that it becomes important to think about how to store and select the data.

My initial solution used csv files and Pandas, reading the file in chunks and filtering on id (roughly as in the sketch below). This worked surprisingly well, but it still took several minutes to extract all the events for patients with a given disease. To speed things up, I tried a different file format (hdf5) instead of csv, but at the first attempt there was no speedup.
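For reference, the csv version looked roughly like this (a sketch; the file name 'events.csv' and the chunk size are placeholders of mine, and hip_ids is the list of relevant patient ids):

import pandas as pd

hip_ids = [1, 2, 3]  # placeholder: ids of patients with hip fractures

# scan the csv in chunks and keep only events for the relevant patients
chunks = []
for chunk in pd.read_csv('events.csv', chunksize=10**6):
    chunks.append(chunk[chunk['id'].isin(hip_ids)])

events_for_hip_patients = pd.concat(chunks, ignore_index=True)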

However, closer inspection revealed that the hdf5 approach can be made significantly faster. The following three steps made a big difference:

1. Make the id column an indexable "data column" when saving the data.
2. Use store.select_column() to get the id variable and create a boolean array of the observations you want. (The key here is to use store.select_column() and NOT store.columnname to select the column.)
3. Use the boolean array in the "where" argument when reading the data.

More details:

Assume you have hip_ids, a list of the ids of patients with hip fractures. With the pandas HDFStore version of hdf it is possible to select both columns and rows when loading the data. This means that we could get the events we want using the following:

df = pd.read_hdf('file.h5', key='np12', where='id == hip_ids')

Unfortunately, this was slow. One reason is that evaluating the "where" condition takes a lot of time. It turns out, however, that a boolean array (True, True, False, True, False, etc.) specifying the rows to include is much faster.

FAST:

df = pd.read_hdf('file.h5', key='np12', where=boolean_array_for_rows_to_include)

But to use this solution, we first need to create a boolean array saying which events to include (those involving patients with a given id).

Again, the first attempt was slow:

events_for_hip_patients = store.np12.id.isin(hip_ids)  # loads the entire table into memory

But I then discovered store.select_column(), which reads a single column directly from disk without loading the whole table:

FAST:

ids_all_events = store.select_column('np12', 'id')
events_for_hip_patients = ids_all_events.isin(hip_ids)
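
Here store is simply an open pandas HDFStore. For completeness, a minimal sketch with the store opened explicitly (using the file and key from the examples above):

import pandas as pd

# hip_ids: as above, the list of relevant patient ids
with pd.HDFStore('file.h5', mode='r') as store:
    # reads only the id column from disk, not the whole table
    ids_all_events = store.select_column('np12', 'id')
    events_for_hip_patients = ids_all_events.isin(hip_ids)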

All this, of course, requires that you pre-specify that "id" is a data column. When storing the file, you specify this by using the data_columns argument:

df.to_hdf(file, key, format='table', append=True, data_columns=['id'])
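
With tens of millions of rows, the raw file is usually too big to convert in one go, so in practice the table is built by appending chunks. A sketch (the csv file name and chunk size are placeholders of mine):

import pandas as pd

# build the hdf5 table incrementally; data_columns=['id'] makes id indexable
for chunk in pd.read_csv('events.csv', chunksize=10**6):
    chunk.to_hdf('file.h5', key='np12', format='table',
                 append=True, data_columns=['id'])
# note: string columns may need the min_itemsize argument so that later
# chunks with longer strings still fit in the table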

Using this solution, the process of getting all the events for people with a given set of ids took seconds, not minutes.

In short: make id an indexable data column; identify the desired rows by creating a boolean array using select_column(); and use that array in the "where" argument when reading the file.
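
Putting the fast read together in one piece (same names as above):

import pandas as pd

hip_ids = [1, 2, 3]  # placeholder: ids of patients with hip fractures

with pd.HDFStore('file.h5', mode='r') as store:
    rows = store.select_column('np12', 'id').isin(hip_ids)
    df = store.select('np12', where=rows)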









