Wednesday 23 August 2017

The fast way to join string columns in Pandas: 600 times faster!

Say you have two or more string columns in pandas and want to join the content of those columns into one single column.

List of two column names:
eventcol = [bio, procedure_codes]

A common recommendation on Stackoverflow is to use apply and join.With about 850 000 rows and two columns:

%timeit df[eventcol].apply(lambda x: ''.join(x), axis=1)
13.6 s ± 343 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is quite slow and it would be significantly slower if you have many rows (and columns).

Here are some alternatives:

* Use Pandas string methods and cat:
%timeit df[eventcol[0]].str.cat(cols, sep=',')
264 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

* Using sum (which concatenate strings)
%timeit df[eventcol].sum(axis=1)
509 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

* Just add the two columns (if you know they are strings)
%timeit df.bio + df.procedure_codes
21.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

* Convert to string and add:
%timeit df.bio.astype(str) + df.procedure_codes.astype(str)
223 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So simply adding the columns seems to be the fast solution (It is more than 600 times faster than using apply). The reason is that apply works row by row, while the other methods works on whole columns (vectorized).

No comments:

Post a Comment