Tuesday 19 July 2016

Notes on Markov models in Python - with emphasis on health economic evaluation

Markov models and health economic evaluation in Python

Intro
Markov models are commonly used in health economic evaluation. In this post I'll explore some tools and options for doing this kind of analysis in Python.

NetworkX
A good tool for constructing a model formally is the Python package NetworkX. It contains more than you need, but it makes it quite easy to construct a model with nodes and edges (a directed graph). It is also easy to add properties to nodes and edges (probabilities, utilities and costs associated with states and events). NetworkX also allows you to export the model to other packages (like pandas), and the model can be exported to a standard language for describing networks (dot).
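
To make this concrete, here is a minimal sketch of a small three-state model (healthy, sick, dead) built with NetworkX. The numbers are made up for illustration, and to_pandas_adjacency assumes a recent NetworkX version (older versions called it to_pandas_dataframe):

import networkx as nx

# Three states, with per-cycle utilities and costs as node attributes
model = nx.DiGraph()
model.add_node('healthy', utility=1.0, cost=0)
model.add_node('sick', utility=0.6, cost=5000)
model.add_node('dead', utility=0.0, cost=0)

# Transition probabilities as edge attributes
# (the probabilities out of each state must sum to one)
model.add_edge('healthy', 'healthy', p=0.90)
model.add_edge('healthy', 'sick', p=0.08)
model.add_edge('healthy', 'dead', p=0.02)
model.add_edge('sick', 'sick', p=0.75)
model.add_edge('sick', 'dead', p=0.25)
model.add_edge('dead', 'dead', p=1.00)

# Export the transition matrix to a pandas DataFrame
P = nx.to_pandas_adjacency(model, weight='p')
print(P)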

NetworkX is a tool for creating a model, but in order to visualize it, you may want to install some other tools and packages.

Visualizing the model
This turned out to be a little more tricky than I thought. In short, you need to install something called Graphviz AND a Python wrapper for it, for instance the graphviz package.
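
For instance, once Graphviz and the pydot wrapper are installed, the model above can be exported to the dot format directly from NetworkX (a sketch, reusing the model object from the previous example):

from networkx.drawing.nx_pydot import write_dot

write_dot(model, 'markov_model.dot')
# then render from the command line: dot -Tpng markov_model.dot -o model.png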

To visualize the Markov model in a Jupyter notebook, nxpd works fine.
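
A minimal sketch, assuming Graphviz and nxpd are installed and reusing the model from above; the probabilities are copied into 'label' attributes so that they show up on the edges:

from nxpd import draw

# Copy each transition probability into the graphviz 'label' attribute
for u, v, data in model.edges(data=True):
    data['label'] = str(data['p'])

draw(model, show='ipynb')  # renders the graph inline in the notebook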

You may also want to consider xdot for more interactive visualizations (but it requires some quite heavy dependencies).

"Solving" the model
There are lots of tools for Markov models in Python, but they are usually not designed to handle the typical workflow in health economics.

In health economic models, the Markov model is used to analyse how an intervention affects the outcome. This is done by assigning utilities and costs to each state, and transition probabilities between the states. An intervention could, for instance, reduce the probability of becoming sick. We then run a simulation with a given number of individuals, observe how many end up in each state in each time period (after each cycle of the model), calculate the costs and utilities, and continue for another cycle over the life span of the individuals. The results are summarized, and we can compare the total costs and benefits of models with different costs and transition probabilities.
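
A minimal sketch of such a cohort simulation with numpy, using the three-state model from above (the numbers are still only illustrative):

import numpy as np

# Transition matrix (rows: from-state, columns: to-state; rows sum to one)
P = np.array([[0.90, 0.08, 0.02],   # healthy
              [0.00, 0.75, 0.25],   # sick
              [0.00, 0.00, 1.00]])  # dead
utility = np.array([1.0, 0.6, 0.0])  # QALYs per cycle in each state
cost = np.array([0.0, 5000.0, 0.0])  # cost per cycle in each state

population = np.array([1000.0, 0.0, 0.0])  # everyone starts healthy
total_cost = total_qaly = 0.0

for cycle in range(50):          # e.g. 50 yearly cycles
    total_cost += population @ cost
    total_qaly += population @ utility
    population = population @ P  # advance the cohort one cycle

print('cost: {:.0f}, QALYs: {:.0f}'.format(total_cost, total_qaly))

Comparing an intervention to the status quo then amounts to rerunning the loop with a different transition matrix (and possibly different costs) and comparing the totals.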

Here are some tools that might be used to do the simulation described above:

http://quant-econ.net/py/finite_markov.html

https://github.com/gvanderheide/discreteMarkovChain

http://pomegranate.readthedocs.io/en/latest/markovchain.html

https://github.com/riccardoscalco/Pykov

http://pymc-devs.github.io/pymc3/index.html


The QuantEcon package seems to be the one that best fits our purpose. Alternatively, Pykov might also be used.

Here is a notebook that describes Markov modelling using the QuantEcon package:

http://nbviewer.jupyter.org/github/QuantEcon/QuantEcon.notebooks/blob/master/markov_chain_ex01_py.ipynb
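
As a small taste, here is a sketch of the same transition matrix in quantecon, simulating individual state paths instead of a whole cohort (assuming P from the numpy example above):

import quantecon as qe
from collections import Counter

mc = qe.MarkovChain(P, state_values=('healthy', 'sick', 'dead'))
paths = mc.simulate(ts_length=50, init='healthy', num_reps=1000)
print(Counter(paths[:, -1]))  # distribution of states after 50 cycles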



Saturday 2 July 2016

Just do it: PEP 8, Style guide for Python, no spaces around equal signs in default arguments

When defining functions with default arguments in Python, I used to do this:
def foo(var1 = 'default1', var2 = 42):
    ...
After installing an automatic tool that checks the style, I was told to have no spaces surrounding the equals sign in this case, i.e. the correct style is:
def foo(var1='default1', var2=42):
    ...
I tend to like the spaces, since I think they make the code more readable, and I toyed with the idea of being a style rebel. But then I came to my senses. We all have different hang-ups and quirks. Of course I could do it the way I want, but there are very good arguments in favour of having one consistent style for most users. It increases readability in general and makes debugging easier.

Besides, there is probably some good reason why the style guide recommends no spaces in this case - and there is no point in being stubborn only to discover, after writing lots of code, that these people (who have a lot more experience) were right.

So, despite my first instinct, I think the correct answer is: Just do it! Don't be stubborn; follow the guidelines.





API: adding new functions vs. including more arguments into existing functions

When expanding the features of a library, there is often a choice between adding new functions and including more arguments in existing functions. Consider the API for downloading data into a pandas dataframe in stats_to_pandas (0.0.7):
  • read_box(), downloads a table as specified by a widget box
  • read_premade(), downloads a table based on a premade table_id (a subset of tables)
  • read_all(), downloads all the variables and values from the table indicated by table_id
  • read_with_json(), downloads the table given by table_id based on a user-supplied json dictionary that selects variables and values
  • read_url(), downloads a json-stat table from a URL
Instead of this, it would be possible to have one read function with several optional arguments:
read(box=None, table_id=None, premade=False, language='en', query=None, full_url=None)
Although the last is only one function, it is not inherently simpler, since it requires the user to know a longer list of arguments.
One advantage of the all-in-one approach is that the high-level API stays relatively short. It also reduces the amount of code, since many of these functions share similar code.
A disadvantage is that autocomplete often works better with functions than with arguments inside functions.
Another disadvantage is that the all-in-one function easily becomes very complex and non-intuitive. For instance, in the case above:
  • If "box" is specified as an argument, none of the other arguments are needed (and will be ignored).
  • If tabe_id is specified, the box argument is ignored, but the user has to supply more arguments to determine what variables and values to download from the table: is it a premade table (set premade to True) and no further information is necessary, if it is not a premade table, but you want to download all variables and values, set query = 'all'. If you want to use a json-stat dictionary to specify which variables and values to download, use query = my_json_dict.lastly, if you have the full url to a json-stat table, you need only specify the full_url argument (and all other argument will be ignored).
All of this sounds complicated, and it may well be better to maintain separate functions instead of trying to include them all in one. Ideally, the list of arguments should not have a structure like "argument B is only needed if A is given, in which case argument C will be ignored." Instead, it seems more logically consistent and intuitive that function arguments should never be mutually destructive: the arguments should specify different features that all apply to the object. In the example above the arguments are mutually destructive: when one is specified, one of the other features is sometimes no longer relevant. Specifying the source as a box makes the table_id redundant. The user may then be confused: what is returned if both box and table_id are specified?
The problem could partly be solved by a better argument structure. There could, for instance, be a source argument which tells the function whether the source of the dataframe is a widget box or a table_id (or something else). Originally this was implicit, in the sense that as soon as one was specified, the code considered it to be the source - but that makes the result unpredictable if both are specified.
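
A hypothetical sketch of this idea (not the actual stats_to_pandas API, and assuming the five read_* functions above are in scope): an explicit source argument makes the dispatch predictable, even if a user passes several arguments at once.

def read(source, **kwargs):
    # Map each explicit source to one of the existing functions
    readers = {'box': read_box,
               'premade': read_premade,
               'all': read_all,
               'json': read_with_json,
               'url': read_url}
    if source not in readers:
        raise ValueError('unknown source: {}'.format(source))
    return readers[source](**kwargs)
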
So, perhaps in this case the many-functions, few-arguments approach is better?
Perhaps. But this can also confuse the user, since it may be unclear which function to use. Some users do not know what a json dictionary is and will be confused by the read_with_json() function.
The user is king, and this means that before choosing, one should consider how often users will need, and be confused by, the different functions. If, for instance, most users will almost always need only a single read function with the same set of arguments, one might have a single read function with these defaults, and just add some arguments for more advanced users who occasionally need more advanced options - and these users are often able to figure out the arguments. This is the case above: few users will need the read_premade function, but it is useful for those who do. Adding it as a separate function may confuse less experienced users, since they do not quite know which read function to use, but adding it as an argument to a general read function eliminates this problem. One may argue that this just pushes the problem to a different level, since the user will now be confused by the arguments in the function, but with sensible defaults, less experienced users need not think about the more complex options and are free to ignore them.
Although it may be friendly to less experienced users, I am still bothered by the logical inconsistency created by the all-in-one function, even with sensible defaults.