Saturday, 1 June 2019

Where to start? About code complexity

While creating an app that would calculate the expected value of several options, I wrote several functions that cleaned and prepared the input so the relevant numbers could be calculated. It turns out that this simple task became a small lesson in a possible cause of code complexity, or at least confusion.

Background: The original input was a text file. The text file was cleaned (comments deleted and so on) before another function broke it all down into rows in a list, and there was a function that translated the whole text file into a dictionary structure. The values in the dictionary were further processed into an expression that could be evaluated by the computer, and this was evaluated to a number. Then there were functions dealing with how to analyse the numbers, quantify the uncertainty, draw graphs, compare different alternatives and much more.

One way of approaching this was to view the functions as a long chain where many of the functions would take the output of the previous function as input.

process_text -> create dictionary -> fix dictionary -> get_numbers -> analyse

The create_dict function might require a preprocessed string to work, or it might take the text file itself and process it internally.

Unfortunately I switched back and forth on this. Sometimes the functions I created would work with the original text file as input - and just process the file if it was not processed already. Sometimes I created functions that required a specific input that had already been transformed using other functions.

This quickly became a little complex and difficult to remember.

Why did I do it both ways? I think there were at least two reasons. First, to increase the speed of the code: sometimes the functions were used repeatedly and it seemed wasteful to require multiple transformations. Second, it just seemed intuitive that internal helper functions could take various inputs while the user-facing API would be more consistent (always taking the model as a string as input).

But I was wrong. Or at least I think I was wrong. I should have used the same input in most places (the model as a string). It would be a lot easier to remember. The alternative required me (and, God forbid, others!) to maintain a mental record of how processed the input to each function had to be before the output was OK. It did not scale well, in terms of the cognitive load it placed on the programmer.

Of course, it might have been better to rewrite the whole thing and avoid some of these problems. But it might be a general problem that cannot be avoided:

the functions could also be viewed as more stand-alone units that take the original text as input and give the result back.
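
As a minimal sketch of the more consistent design, something like the following, where every function accepts the raw model text and does its own preprocessing internally (the function names and the toy parsing are hypothetical illustrations, not the actual code from the app):

def process_text(text):
    """Strip comments and blank lines from the raw model text."""
    lines = (line.split('#')[0].strip() for line in text.splitlines())
    return [line for line in lines if line]

def create_dict(text):
    """Build a {name: expression} dictionary from the raw model text."""
    rows = process_text(text)  # preprocessing happens here, not at the call site
    pairs = (row.split('=', 1) for row in rows)
    return {name.strip(): expr.strip() for name, expr in pairs}

def get_numbers(text):
    """Evaluate each expression in the model to a number."""
    return {name: eval(expr) for name, expr in create_dict(text).items()}

# The caller never has to remember how processed the input must be:
model_text = """
a = 1 + 2    # a comment
b = 3 * 4
"""
print(get_numbers(model_text))   # {'a': 3, 'b': 12}

The cost is that create_dict and get_numbers repeat some work, but the benefit is that every function has the same contract: raw text in, result out.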

Sunday, 12 May 2019

Python extend and replace: A common mistake?

If you want to replace some characters in a string in Python, you can write

new_string = old_string.replace('OLD', 'NEW')

Note that the string itself is not modified in-place, i.e. the following would not modify the content of old_string:
old_string.replace('OLD', 'NEW')

With lists, however, it is different. Extending a list with another list works in place, i.e. old_list will be modified if you write:

old_list.extend(another_list)

In fact, you will create a bug if you try:

new_list = old_list.extend(another_list)

In this case the content of new_list, somewhat surprisingly for many, becomes None.


There are good reasons for all of this: objects that are mutable can be changed in place, while immutable objects cannot. Strings are immutable, lists are mutable. But for beginners who do not know this distinction, the difference may be very confusing and lead to bugs.

Not being an expert, one might think that methods either always or never work in place on the object by default. Instead, you have to remember this on a case-by-case basis. Once you know more of the logic of "when and why", it becomes easier.
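
A quick illustration of the difference (a minimal, self-contained example):

# Strings are immutable: replace() returns a new string and leaves the old one untouched.
old_string = 'OLD text'
new_string = old_string.replace('OLD', 'NEW')
print(old_string)   # OLD text  (unchanged)
print(new_string)   # NEW text

# Lists are mutable: extend() modifies old_list in place and returns None.
old_list = [1, 2]
another_list = [3, 4]
new_list = old_list.extend(another_list)
print(old_list)     # [1, 2, 3, 4]  (changed in place)
print(new_list)     # None          (the common surprise)

# If you want a new, combined list without touching old_list, use + instead:
combined = [1, 2] + another_list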



Saturday, 6 October 2018

Hint: conda clean ... and free 2GB

I often use conda to install packages. I was happy to learn that:
    conda clean -a
cleaned out unnecessary files that had accumulated over time as I installed and upgraded packages.

An easy way to free 2 GB in my case.

Wednesday, 19 September 2018

Surprise: Pandas and 584 years

Timestamp Limitations

Since pandas represents timestamps in nanosecond resolution, the time span that can be represented using a 64-bit integer is limited to approximately 584 years:
In [68]: pd.Timestamp.min
Out[68]: Timestamp('1677-09-21 00:12:43.145225')

In [69]: pd.Timestamp.max
Out[69]: Timestamp('2262-04-11 23:47:16.854775807')

Wednesday, 23 August 2017

The fast way to join string columns in Pandas: 600 times faster!

Say you have two or more string columns in pandas and want to join the content of those columns into one single column.

List of two column names:
eventcol = ['bio', 'procedure_codes']
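
If you want to try the snippets below yourself, here is a minimal, hypothetical setup with a small synthetic DataFrame (the timings reported below were measured on the real data with about 850 000 rows, not on this toy frame):

import pandas as pd

# Toy stand-in for the real data: two string columns.
df = pd.DataFrame({
    'bio': ['K21', 'S72', 'I10'] * 1000,
    'procedure_codes': ['NFB20', 'NGB10', 'JAB30'] * 1000,
})
eventcol = ['bio', 'procedure_codes']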

A common recommendation on Stack Overflow is to use apply and join. With about 850 000 rows and two columns:

%timeit df[eventcol].apply(lambda x: ''.join(x), axis=1)
13.6 s ± 343 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is quite slow, and it would be even slower with more rows (and columns).

Here are some alternatives:

* Use Pandas string methods and cat:
%timeit df[eventcol[0]].str.cat(df[eventcol[1]], sep=',')
264 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

* Using sum (which concatenates strings)
%timeit df[eventcol].sum(axis=1)
509 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

* Just add the two columns (if you know they are strings)
%timeit df.bio + df.procedure_codes
21.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

* Convert to string and add:
%timeit df.bio.astype(str) + df.procedure_codes.astype(str)
223 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So simply adding the columns seems to be the fastest solution (it is more than 600 times faster than using apply). The reason is that apply works row by row, while the other methods work on whole columns (vectorized).

Tuesday, 1 August 2017

With this trick, HDF becomes a very fast data format ...

A common task for people analysing health register data is the following: you want to select all persons with hip fractures from a health register, and then select and analyse all the treatments administered to these patients.

To do this you first have to get the unique id for all the relevant patients, and then get the registered events for all the people associated with those ids.

Both steps can be very time consuming since we are often talking about millions of observations, and we may have to search several text strings for each event to get the ids of those with a specific disease. The first problem, how to efficiently search for individuals with a specific diagnosis, is a separate topic. Here I will only deal with the last issue: how to get all the events for individuals given that we know their ids.

The problem is not that it is difficult in itself to write a query to find events associated with different ids, but that the file is so large that it becomes important to think about how to store and select the events from files with tens of millions of events.

An initial solution used csv files and Pandas. This worked surprisingly well, but it still took several minutes to extract all the events for patients with a given disease. To speed things up, I tried using a different file format (HDF5) instead of csv, but at the first attempt there was no speedup.

However, a closer inspection revealed that it is possible to get significantly faster. The following three steps made a big difference:

1. Make the id column an "index column" when saving the data.
2. Use "store.select_column()" to get the id variable and create a boolean array of the observations you want. (The key here is to use "store.select_column()" and NOT "store.columnname" to select the column.)
3. Use the boolean array in the "where" argument when reading the data.

More details:

Assume you have hip_ids (a list of ids of patients with hip fractures). In the pandas HDFStore version of HDF it is possible to select both columns and rows when loading the data. This means that we could get the events we want using the following:

df = pd.read_hdf('file.h5', key='np12', where='id == hip_ids')

Unfortunately this was slow. One reason is that the computer takes a lot of time evaluating the "where" condition. It turns out, however, that a boolean array (True, True, False, True, False etc) specifying the correct rows to include is much faster.

FAST:

df = pd.read_hdf('file.h5', key='np12', where=boolean_array_for_rows_to_include)

But to use this last solution, we need to create a boolean array for which events to include (those involving patients with a given id).

Again, the first attempt was slow:

events_for_hip_patients = store.np12.id.isin(hip_ids)

But I then discovered "select_column", which is the fast column selector:

FAST

ids_all_events = store.select_column('np12', 'id')
events_for_hip_patients = ids_all_events.isin(hip_ids)

All this, of course, requires that you pre-specify that "id" is a data column. When storing the file, you specify this by using the data_columns argument:

df.to_hdf(file, key, format='table', append=True, data_columns=['id'])

Using this solution, the process of getting all events for people with a given id took seconds, not minutes.

In short: make id an indexable data column; identify the desired rows by creating a boolean array using "select_column"; use the array when reading the file (in the "where" argument).
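
Putting it all together, a minimal sketch of the whole workflow (the file name, the 'np12' key and the toy data are placeholders; the real tables have tens of millions of rows):

import pandas as pd

# Toy stand-ins for the real register data and the ids of hip fracture patients.
df = pd.DataFrame({'id': ['p1', 'p2', 'p3', 'p1'],
                   'event': ['a', 'b', 'c', 'd']})
hip_ids = ['p1', 'p3']

# 1. Store the events with 'id' as an indexable data column.
df.to_hdf('file.h5', key='np12', format='table', append=True, data_columns=['id'])

# 2. Read only the id column and build a boolean array marking the rows we want.
with pd.HDFStore('file.h5') as store:
    ids_all_events = store.select_column('np12', 'id')
rows_to_include = ids_all_events.isin(hip_ids)

# 3. Use the boolean array in the where argument to read only those rows.
events_for_hip_patients = pd.read_hdf('file.h5', key='np12', where=rows_to_include)
print(events_for_hip_patients)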