Wednesday 23 August 2017

The fast way to join string columns in Pandas: 600 times faster!

Say you have two or more string columns in pandas and want to join the content of those columns into one single column.

List of two column names:
eventcol = [bio, procedure_codes]

A common recommendation on Stackoverflow is to use apply and join.With about 850 000 rows and two columns:

%timeit df[eventcol].apply(lambda x: ''.join(x), axis=1)
13.6 s ± 343 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is quite slow and it would be significantly slower if you have many rows (and columns).

Here are some alternatives:

* Use Pandas string methods and cat:
%timeit df[eventcol[0]].str.cat(cols, sep=',')
264 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

* Using sum (which concatenate strings)
%timeit df[eventcol].sum(axis=1)
509 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

* Just add the two columns (if you know they are strings)
%timeit df.bio + df.procedure_codes
21.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

* Convert to string and add:
%timeit df.bio.astype(str) + df.procedure_codes.astype(str)
223 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So simply adding the columns seems to be the fast solution (It is more than 600 times faster than using apply). The reason is that apply works row by row, while the other methods works on whole columns (vectorized).

Tuesday 1 August 2017

With this trick, HDF becomes a very fast dataformat ...

A common task for people anlysing health register data is the following: You want to select all persons with hip fractures from a health register, and then select and analyse all the treatments administered to these patients.

To do this you first have to get the unique id for all the relevant patients, then get the the registered events for all the people associated with the ids.

Both steps can be very time consuming since we are often talking about millions of observations and we may have to search several text strings for each event to get the ids of those with a specific disease. The first problem -- how to efficiently search for individuals with a specific diagnose - is a separate topic. Here I will only deal with the last issue: How to get all the events for individuals given that we know their ids.

The problem is not that it is difficult in itself to write a query to find events associated with different ids, but that the file is so large that it becomes important to think about how to store and select the events from files of tens of million of events.

An initial solution may use csv files and Pandas. This works surprisingly well, but it still took several minutes to extract all events for patients with a disease. To speed things up, I tried using a different file format (hdf5) instead of csv, but at the first attempt there was no speedup.

However, a closer inspection reveals that it is possible to get significantly faster. The following three steps made a big difference:

1. Make the id column an "index column" when saving the data.
2. Use "store.select_column()" to get the id variable and create a boolean array of the observations you want, (The key is here to use store.select" and NOT "store.columnname"  to select a column)
3. Use a boolean array in the "where statement" when reading the data

More details:

Assume you have: hip_ids (a list of ids of patients with hip fractures). In the pandas HDFstore version of hdf it is possible to select both columns and rows when loading the data. This means that we could get the events we want using the following:

df = pd.read_hdf('file.h5', key='np12',  where='id == hip_ids')

Unfortunately this was slow. One reason is that the computer takes a lot of time evaluating the "where" condition. It turns out, however, that a boolean array (True, True, False, True, False etc) specifying the correct rows to include is much faster.

FAST:

df = pd.read_hdf('file.h5', key='np12',  where=boolean_array_for rows_to_include)

But to use this last solution, we need to create a boolean array for which events to include (those involving patients with a given id).

Again, the first attempt was slow:

events_for_hip_patients = store.np12.id.isin(hip_ids)

But I then discovered "store_select" which is the fast column selector:

FAST

ids_all_events = store.select_column('np12', 'id')
events_for_hip_patients = ids_all_events.isin(hip_ids)

All this, of course, requires that you pre-specify that "id" is a data column. When storing the file, you specify this by using the data_columns argument:

df.to_hdf(file, key ,format='table',append=True, data_columns = ['id'])

Using this solution the process of getting get all events for peope with a given id took seconds, not minutes.

In short: Make id an indexable datacolumn; Identify the desired rows by creating a boolean array using the "select_column" format; ,Use the array when reading the file (in the "where" argument).










Monday 31 July 2017

What is the fastest file format for pandas dataframes? A surprise!

I was surprised to discover that for long (but not very wide) messy and text heavy data, the good old .csv file format was read very fast by the csv reader in Pandas. A dataframe with more than 6 million rows and about 50 columns  took less than four minutes.

This may not sound very fast, and initially I thought that some of the other formats might be faster. So I tried hdf, parquet, and feather. Some of these are very fast (feather), but the issue was not only speed, but also flexibility. I wanted to be able to read  several files and only select some columns and rows and avoid reading it all and then filtering.

In addition, speed comparison depends heavily on the nature of the data. In this case the data was messy: there was no encoding into categoricals, there were some columns with text that varied in length and there were lots of missing values and unparsed dates. All this influences the speed of reading from various storage formats. A more numeric focused dataframe would be much faster for some of the other formats. But for text csv seemed to do well. Perhaps, sadly? Is there is a lack of good string support in many data storage formats?

In addition, there is the question of flexibility. Some formats, like hdf, are well established and it is easy to read selections (columns/rows) as well as adding new variables. But in my case the data was quite static and I was not going to add many new columns or rows. So this was not important. In any case hdf turned out not to be faster than pandas csv for my data. But flexible reading was important. And unlike, for instance, feather, pandas csv offered very flexible reading. Parquet was flexible in terms of colummns, but less good at rows (sometimes it was possible, but it assumed the existence of a well balanced and relatively low-cardinality grouping variable).

So, all in all, after a roundtrip I ended up with the good old csv and Pandas csv reader. for other types of data it may be better to use some of the other formats. It all depends on the type of data you have and what you want to do with it. I may switch to a binary format after cleaning up the files (using categorical data types, parsing dates and so on). But all in all I was very impressed by the csv reader in Pandas after a comparison with other solutions.


Wednesday 17 May 2017

Strange? Appending to a list directly gives None, but not when appending indirectly

a = ['no', 'nei']
a.append('hei')
print(a)

Works fine and returns  ['no', 'nei', 'hei]

But:
print(['no', 'nei'].append('hei'))

Returns None

This is probably logical, but I did not expect it!

Wednesday 29 March 2017

Existence of repeated patterns as an indicator of potential computer language improvements

I recently came across the following code:
Class Example(Model): def __init__(self, height, width, citizen_density, cop_density,
citizen_vision, cop_vision, legitimacy,
max_jail_term, active_threshold=.1, arrest_prob_constant=2.3,
movement=True, max_iters=1000):
super().__init__()
self.height = height
self.width = width
self.citizen_density = citizen_density
self.cop_density = cop_density
self.citizen_vision = citizen_vision
self.cop_vision = cop_vision
self.legitimacy = legitimacy
self.max_jail_term = max_jail_term
self.active_threshold = active_threshold
self.arrest_prob_constant = arrest_prob_constant
self.movement = movement
self.running = True
self.max_iters = max_iters
self.iteration = 0
self.schedule = RandomActivation(self)
self.grid = Grid(height, width, torus=True) There seems to be a lot of repetition here. And the problem is not the code itself. The repetitive pattern is required by the language. Seems to me it could be simplified. If not automatically, then perhaps by making it possible to say: "Take all argument of the function and make them properties of the object" Or perhaps, more flexible: "Take all except these ..." One line instead of many. Is it possible and is it a good idea?