Sunday, 24 April 2016

Combining many, many text rows in one row in Pandas



Sometimes it is useful to combine the content of many columns into one. For instance data from hospital events often contain one row for for each of the diagnostic categories the patient has received. Combining this in a single row with all migh be useful for several reasons. First of all, it may save a lot of memory. Instead of twenty rows, many of which are empty for a lot of patients since the majority only have a few diagnosis. Second, sometimes it is easier to search in one column when we want to select patients who have received a particular diagnosis.


How can this be done using Pandas. There are lots of possible options, but most are bad when we are dealing with a large dataset. Here is one:

icdcolumns = ['icd1', 'icd2', 'icd3', 'icd4', 'icd4', 'icd6']df['icd_all'] = df[icdcolumns].apply(lambda x: x.str.cat(sep = ', '), axis=1)

It looks promising (and it works), but there is a major problem: It is very menory intensive and using apply along axis 1 tend to be slow. Even with a very good computer, this took about six hours on a dataset with more than fifty million observations and twenty columns to be combined.

Even worse, After the computer was finished, I tried to save the result using:

df.to_pickle('stringicd')

This led the computer to crash because of memory problems (Lessons: Be careful with serialization when memory is short). Six hours of CPU time wasted.

Instead of simply running the whole thing again, I kept thinking that there must be a better way. Any suggestions? I have an idea, but that is for another blogpost.


3 comments:

  1. This comment has been removed by a blog administrator.

    ReplyDelete
  2. This comment has been removed by a blog administrator.

    ReplyDelete
  3. This comment has been removed by a blog administrator.

    ReplyDelete