Saturday 15 October 2016

What do we want from an API: A list of six criteria

An API should be:

  • Intuitive 
  • Consistent 
  • Short (in the number of functions, the length of each function name, and the number and length of arguments and parameters)
  • Logical (almost the same as consistency, but not quite)
  • Unique (enough to avoid name confusion and collisions with other packages)
  • Structurally similar to other packages


What else?

Contain a limited set of arguments?

Why is there no "flatten list" function in Python?

The problem
I have a dictionary and each element in the dictionary is a list of codes. I want the set of all unique codes.

Example
codes['1992'] = ['K50', 'K51']
codes['1993'] = ['S72', 'K51']

Approach 
codes.values() gives all the codes, but it is a list of lists, so it needs to be flattened before I can get the set. It is not difficult, but it always takes me a little time.
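
Just to remind myself, here is a minimal sketch of the flattening step, reusing the example data above:

codes = {'1992': ['K50', 'K51'],
         '1993': ['S72', 'K51']}

# Nested comprehension: loop over the lists of codes, then over each code
unique_codes = {code for code_list in codes.values() for code in code_list}
print(unique_codes)  # {'K50', 'K51', 'S72'} (a set, so order may vary)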

Pain point
Flattening a list of this type is not a major problem, but it is slightly annoying that there is no inbuilt flatten command. Why not? Two possible reasons:
1. It is easy to do (a one-liner) so a special command is not needed.
2. There are corner cases where a flatten command would not work or produce unexpected or non-unique results.

Argument 1 is true, but not sufficient: We often create convenience functions for things that could be done using other functions, just to make life a little easier. Also, although short and quite easy, it always takes me a few seconds to work out how to deal with nested list comprehensions (see here for an intuitive example of nested list comprehensions).

Argument 2 is also true, but I wonder if it is avoidable by limiting the flatten function to easy cases only or by including optional parameters that the user can specify. At least other languages have managed this.

I know there are some easy solutions, but I often wish we had a flatten function to make life even easier every time I have to do something like this.
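
As far as I know, the closest thing in the standard library is itertools.chain, which handles the easy one-level case:

from itertools import chain

unique_codes = set(chain.from_iterable(codes.values()))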



More here: http://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python




Tuesday 19 July 2016

Notes on Markov models in Python - with emphasis on Health Economic Evaluation

Markov models and health economic evaluation in python

Intro
Markov models are commonly used to do health economic evaluation. In this post I'll explore some tools and options for doing this kind of analysis in Python.

NetworkX
A good tool for constructing a model formally is the Python package NetworkX. It contains more than you need, but it is quite easy to construct a model with nodes and edges (a directed graph). It is also easy to add properties to nodes and edges (probabilities, utilities and costs associated with states and events). It also allows you to export the model to other packages (like pandas), and the model can be exported to a standard language for networks (dot).
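
Just to illustrate (a minimal sketch; the state names, probabilities, costs and utilities are made up), a small model could be built like this:

import networkx as nx

G = nx.DiGraph()

# States (nodes) with costs and utilities attached as attributes
G.add_node('healthy', cost=0, utility=1.0)
G.add_node('sick', cost=2000, utility=0.6)
G.add_node('dead', cost=0, utility=0.0)

# Transitions (edges) with probabilities attached
G.add_edge('healthy', 'healthy', p=0.90)
G.add_edge('healthy', 'sick', p=0.08)
G.add_edge('healthy', 'dead', p=0.02)
G.add_edge('sick', 'sick', p=0.85)
G.add_edge('sick', 'dead', p=0.15)
G.add_edge('dead', 'dead', p=1.00)

# The attributes are easy to get back out again
print(G.nodes(data=True))
print(G.edges(data=True))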

NetworkX is a tool for creating a model, but in order to visualize it, you may want to install some other tools and packages.

Visualizing the model
This turned out to be a little trickier than I thought. In short, you need to install Graphviz AND a Python wrapper, for instance the graphviz Python package.

To visualize the Markov model in a Jupyter notebook, nxpd works fine.

You may also want to consider xdot for more interactive visualizations (but it requires some quite heavy dependencies).

"Solving" the model
There are lots of tools for Markov models in Python, but they are usually not designed to handle the typical workflow in health economics.

In health economic models, the Markov model is used to analyse how an intervention affects the outcome. This is done by assigning utilities and costs to each state and transition probabilities between the states. An intervention could, for instance, reduce the probability of becoming sick. We then run a simulation with a given number of individuals, observe how many end up in each state in each time period (after each cycle of the model), calculate costs/utilities, and repeat for the life span of all individuals. The results are summarized and we can compare the total costs and benefits in models with different costs and transition probabilities.
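
Here is a rough sketch of what such a cohort simulation might look like using plain numpy (the states, probabilities, costs and utilities are made-up numbers, not from any real model):

import numpy as np

states = ['healthy', 'sick', 'dead']
P = np.array([[0.90, 0.08, 0.02],   # transitions from healthy
              [0.00, 0.85, 0.15],   # transitions from sick
              [0.00, 0.00, 1.00]])  # dead is an absorbing state
cost = np.array([100.0, 2000.0, 0.0])    # cost per cycle in each state
utility = np.array([1.0, 0.6, 0.0])      # utility per cycle in each state

population = np.array([1000.0, 0.0, 0.0])  # everyone starts healthy
total_cost = 0.0
total_utility = 0.0

for cycle in range(50):                  # run the model for 50 cycles
    population = population.dot(P)       # one Markov step
    total_cost += population.dot(cost)
    total_utility += population.dot(utility)

print(total_cost, total_utility)

An intervention can then be modelled by changing P (for instance lowering the probability of moving from healthy to sick), rerunning the loop and comparing the totals.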

Here are some tools that might be used to do the simulation described above:

http://quant-econ.net/py/finite_markov.html

https://github.com/gvanderheide/discreteMarkovChain

http://pomegranate.readthedocs.io/en/latest/markovchain.html

https://github.com/riccardoscalco/Pykov

http://pymc-devs.github.io/pymc3/index.html


The QuantEcon package seems to be the one that best fits our purpose. Alternatively, Pykov might also be used.

Here is a notebook that describes Markov modelling using the QuantEcon package:

http://nbviewer.jupyter.org/github/QuantEcon/QuantEcon.notebooks/blob/master/markov_chain_ex01_py.ipynb



Saturday 2 July 2016

Just do it: PEP 8, Style guide for Python, no spaces around equal signs in default arguments

When defining functions with default arguments in Python, I used to do this:
def foo(var1 = 'default1', var2 = 42):
    ...
After installing an automatic style checker, I was told that the correct style is to have no spaces surrounding the equal sign in this case, i.e.:
def foo(var1='default1', var2=42):
    ...
I tend to like spaces since I think they make the code more readable, and I toyed with the idea of being a style rebel. But then I came to my senses. We all have different hang-ups and quirks. Of course I could do it the way I want, but there are very good arguments in favour of having one consistent style for most users. It increases readability in general and makes debugging easier.

Besides, there is probably some good reason why the style guide recommends no spaces in this case - and there is no point in being stubborn only to discover that these people (who have a lot more experience) were right after writing lots of code.

So, despite the first instinct, I think the correct answer is: Just do it! Don't be stubborn, follow the guidelines.





API: adding new functions vs. including more arguments into existing functions

When expanding features, there is often a choice between adding new functions and adding more arguments to existing functions. Consider the API for downloading data into a pandas dataframe in stats_to_pandas (0.0.7):
  • read_box(): downloads a table as specified by a widget box
  • read_premade(): downloads a table based on a premade table_id (a subset of tables)
  • read_all(): downloads all the variables and values from the table indicated by table_id
  • read_with_json(): downloads the table given by table_id, based on a user-supplied json dictionary that selects variables and values
  • read_url(): downloads a json-stat table from a URL
Instead of this, it would be possible to have one read function with several optional arguments:
read(box=None, table_id=None, premade=False, language='en', query=None, full_url=None)
Although the last is only one function, it is not inherently simpler since it requires the user to know a longer list of arguments.
One advantage of the all-in-one approach is that the high level api stays relatively short. It also reduces the amount of code since many of these functions share similar code.
A disadvantage is that autocomplete often works better with functions than with arguments inside functions.
Another disadvantage is that the all-in-one function easily becomes very complex and non-intuitive. For instance, in the case above:
  • If "box" is specified as an argument, none of the other arguments are needed (and will be ignored).
  • If table_id is specified, the box argument is ignored, but the user has to supply more arguments to determine which variables and values to download from the table: if it is a premade table, set premade to True and no further information is necessary; if it is not a premade table but you want to download all variables and values, set query = 'all'; if you want to use a json-stat dictionary to specify which variables and values to download, use query = my_json_dict. Lastly, if you have the full url to a json-stat table, you need only specify the full_url argument (and all other arguments will be ignored).
All of this sounds complicated, and it may well be better to maintain separate functions instead of trying to include them all in one. Ideally, the list of arguments should not have a structure like "argument B is only needed if A is given, in which case argument C will be ignored." Instead, it seems more logically consistent and intuitive that function arguments should never be mutually destructive. The arguments should specify different features that all apply to the object. In the example above the arguments are mutually destructive: if one is specified, one of the other features is sometimes no longer relevant. Specifying the source as a box makes the table_id redundant. The user may then be confused: what is returned if both box and table_id are specified?
The problem could partly be solved by a better argument structure. There could, for instance, be a "source" argument which tells the function whether the source of the dataframe is a widget box or a table_id (or something else). Originally this was implicit, in the sense that as soon as one was specified, the code considered it to be a source - but this makes the result unpredictable if both are specified.
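
For what it is worth, here is a sketch of how an explicit source argument might look (hypothetical code, not the actual stats_to_pandas API; the signatures of the existing read_* functions are assumed and the premade case is left out):

def read(source, box=None, table_id=None, query=None, full_url=None):
    """Dispatch on an explicit source instead of guessing from the arguments."""
    if source == 'box':
        if box is None:
            raise ValueError("source='box' requires the box argument")
        return read_box(box)
    if source == 'table':
        if table_id is None:
            raise ValueError("source='table' requires table_id")
        if query is None:
            return read_all(table_id=table_id)
        return read_with_json(table_id=table_id, query=query)
    if source == 'url':
        if full_url is None:
            raise ValueError("source='url' requires full_url")
        return read_url(full_url)
    raise ValueError("Unknown source: {}".format(source))
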
So, perhaps in this case the many-functions, few-arguments approach is better?
Perhaps. But this also confuses the user since it may be unclear which function to use. Some users do not know what a json dictionary is and will be confused by the read_with_json() function.
The user is king, and this means that before choosing one might consider how often users will need, and be confused by, the different functions. If, for instance, most users will almost always need only a single read function with the same set of arguments, one might have a single read function with those defaults, and just add some arguments for more advanced users who occasionally need more advanced options. These users are usually able to figure out the arguments in the function. This is the case above: few users will need the read_premade function, but it is useful to have for those who do. Adding it as a separate function may confuse less experienced users, since they do not quite know which read function to use, but adding it as an argument to a general read function eliminates this problem. One may argue that this just pushes the problem to a different level, since the user will now be confused by the arguments in the function, but with sensible defaults, less experienced users may not care or think too much about the other, more complex options and are free to ignore them.
Although it may be friendly to less experienced users, I am bothered by the logical inconsistency created by the all-in-one function using common defaults.

Friday 10 June 2016

More about naming things: Consistency is king or allow intuitive exceptions to general rules?

To separate or not to separate?
Consider the following example where I use the pandas library to create a list of the unique ids for people with a specific disease.

ibd_patients = df.ids.unique().to_list()

This creates an error since there is no "to_list()" method. Instead it is called "tolist()."

My bad, but I kept making the same mistake several times. Why? I admit that I may be a bit slow, but there might be an additional cause: the pandas library often uses underscores to separate the terms in method names (to_csv(), to_datetime(), to_numeric() and so on). Because "tolist()" is an exception to this general rule, it becomes an easy mistake to make.

The lesson? Consistency is king! If you start splitting words using underscore, do so everywhere!

But, I hear myself cry, sometimes it is quite intuitive and easy to join the words. Can't we just have some exceptions?

Perhaps, but when I grant myself the right to exceptions in my own programming, I often regret it later. The key problem is that it is not obvious when an exception is intuitive. So the next day when I continue to write my code, I make silly mistakes, referring to a variable or a method with terms separated with an underscore that my old self believed was an intuitive exception.

So, no exceptions for me. They create a mess, since my future self has a different intuition than my current self, and they tend to disagree on what an intuitive exception is.

Systematic exceptions?
Or? Maybe, maybe it is possible to create a system with some systematic exceptions? Here are some that I have tried:


  • no underscore when there are only two terms (so getlist() is OK), but use underscore when there are three terms, like "get_from_list()"
  • no underscore when using short terms like "get", "to" and so on
  • no underscore between text and numbers. In this case the following name for a variable or a dataframe would be OK:


patients2015and2016 = ... (a list or a dataframe)
Instead of the more keystroke-demanding version:
patients_2015_and_2016

But in the end I keep failing and it seems like the only safe rule, at least for me, is to be consistent, however painful and non-intuitive it may seem in some individual cases.

Acceptable exceptions to the rule that consistency is king?
But wait, there are some exceptions that are common. For instance: lstrip, nunique, groupby and so on. As for lstrip and the like, they tend to be accepted, perhaps because the first term is an abbreviation. Or just because it is so common.

The problems never end!
And as if this was not enough, I keep messing up with several other consistency problems:

- Should the name of an object with many elements always be in plural?

bad_year = [2001, 2002]

or

bad_years = [2001, 2002]

My feeling: Yes, but sometimes it feels unintuitive. In that case I remind myself:  Consistency is king!

The same problem occurs in naming methods. For instance, in pandas it is easy to forget whether it is:

df.value_counts()
or
df.value_count()

since the plural naming scheme is not consistently implemented in all methods.

I have some mixed feelings about whether the following variable naming in a loop is good or bad:

for var in vars
for year in years

While I use it, it is very easy to confuse objects with such similar names when reading the code (they are distinguished only by a plural ending). On the other hand, it is sometimes very logical. My solution is to add a term to the plural:

for var in surgery_vars
for year in bad_years


- Should the type of object be indicated in the name or left implicit?
bad_year_list = [2001, 2002]
or just
bad_years = [2001, 2002]

My answer: The cost in terms of verbosity is not worth the benefit I occasionally get from knowing quickly (by reading the variable name) how I should slice, index or get items out of the object. How I get information out depends on whether the object is a list or a dict, but it just becomes too much if I have to do this for all objects, and since consistency is king, I try to avoid it even if it would be useful sometimes.



Sunday 29 May 2016

Naming is hard and important: A true story!

After finishing the first version of a tool for downloading data into a pandas dataframe in Python (https://gist.github.com/anonymous/e1463d45e4c4e8673bfcfbaf585cdd8c), I took a moment to reflect. The coding part was fun, but there is one key lesson, and an issue that I would like to get some input on before rewriting the whole thing:

Naming is hard and important

Initially I used the following syntax to download a table (specified in a gui box) into a dataframe:

df = get_df(box)

This is OK, but later I decided that it might be more intuitive to recycle the standard pandas API style:

df = read_box(box)

Isolated changes like these may sound like small details, but logically consistent and intuitive names for methods, functions and variables turned out to be very important as the code expanded. For instance, at some point it was evident that it would be useful to download the data not only based on the selection in a gui box, but also by using a specified json dictionary (for users wanting to avoid the notebook and the gui).

The "read_" syntax made it easier to extend the code logically to cover this, since I could just use:

df = read_json(table_id = '14714', query = json_query)

And later, to read all variables and values:

df = read_all()


Although "read_" is better, I am not entirely happy since the different cases above are not exactly analogous. Here are some alternatives:

df = read_based_on_box(box)
df = read_based_on_json(json)


df = read_json(json = 'all')

But this is rather verbose and has lots of underscores.

The same goes for parameters in the functions. Ideally one would like something short, intuitive, consistent, and not too common (to avoid name collisions). For instance, to make the gui box where the user can select the variables/values, I first used:

build_query(id = 10714)

I like the brevity of this, but "id" is, first of all, the name of a built-in function in Python. It is also not very explicit, since it does not convey what kind of id is being referenced. Finally, I suppose many users already use id as a variable. This is not a fatal problem in Python, since variables are local, but it makes things more confusing. So, reluctantly (since I am lazy and like to use as few keystrokes as possible), I had to change the id parameter.

To what? My first replacement was "table":

build_query(table = 10714)

But this ended up being a mess. I should have known this from the beginning, but I wanted something short without lots of underscores. "table" is a term that is far too common: it is used in lots of other places in my code as well as in other people's code, and it is not explicit and intuitive because it does not convey the key fact that we are talking about the unique id number of a table.

So I eventually had to admit defeat and add an underscored parameter, "table_id":

build_query(table_id = 10714)

This was not, however, the end of my worries. A minor improvement, I think, was to rename "build_query" to "select". One reason for this was that non-experts may not know what "build_query" means, but more importantly the function does not really build a query. It simply creates a gui box for the user to select the variables to be included when downloading the table: which years to include, whether to include per capita or total values, and so on. As an added bonus it was shorter and there was no underscore. So, now we have:

select(table_id = 10714)

The problem now was that the unique table_id sometimes includes leading zeros:

select(table_id = 07187)

Leading zeros do not go well with integers.

Right now I have a solution that may be frowned upon: I allow table_id to be both a string and an integer. If the user specifies an integer with fewer than five digits, the function "magically" makes it a string and adds leading zeros.

In other words, the following are currently equivalent:

select(table_id = '07187')

select(table_id = 7187)
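
The "magic" itself is trivial, something like this (a sketch, not the actual code in the package):

def normalize_table_id(table_id):
    """Accept both int and str and return a five digit, zero-padded string."""
    if isinstance(table_id, int):
        return str(table_id).zfill(5)   # 7187 -> '07187'
    return table_id

print(normalize_table_id(7187))      # '07187'
print(normalize_table_id('07187'))   # '07187'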

Oh, the horror! Or perhaps it is a useful, flexible approach?

Is the flexibility worth the wrath of the PEP Gods?

OK, I admit: It is most likely horribly inconsistent and should be avoided. The Gods are right.

But I am not sure what to do. Is it best to define "table_id" as a string (always), or an integer (always), or is it OK to allow both?

If forced to choose: although I tend to like integers (again, to reduce clutter and keystrokes), in the end a string might be the most consistent specification.

Argh! A string table_id is slightly more ugly and cumbersome, but this is where things are headed unless you stop me ...

As if this was not enough, I initially allowed two ways of selecting a table:

1. Based on the stable and unique five digit table_id specified by Statistics Norway:
select (table_id = '07187')

2. Based on the row number in a dataframe with different tables
select(row = 4, source = search_result_dataframe)

In fact I was stupid enough to make #2 the standard approach at first, since I thought it would be easier and more intuitive for beginners to identify a table by its row number (since it is the first column in the dataframe shown on the screen). It was also shorter, and the reader did not have to go through the mentally arduous task of learning about the longer (but unique and stable) table_ids that belong to each table.

But it was a mistake: approach #1 using "table_id" is more reliable, intuitive and shorter (with row, the source dataframe also has to be specified).

Right now the "row" selection is still allowed, but I think I will eliminate this and only allow "table_id." In which case I will also update the result of the search by making "table_id" the first column (ie. the index column) in the dataframe that is returned after a search. Unfortunately this means that the index will not be concecutively numbered in natural patter 0, 1, 2 etc, but by the five digit id (14514, 05646 etc). Still, it is for the better, I think.

In short, naming things is important, but difficult. I wonder if there are some more general principles that can be used. For instance: always (or never) use a verb, like "get_df" or "read_df"? Avoid or use abbreviations (use "language", not "lang")? Avoid triple underscores? And so on. The grammar of graphics has some rules. Is there a grammar of naming? OK, I know there are rules such as "avoid camelCase" and so on, but I was thinking more about the terms themselves and the structure. Perhaps it is one of those things that are simply not suited for general principles. Art more than science?

Saturday 28 May 2016

Import data from Statistics Norway to a Pandas dataframe in Python

Statistics Norway has made more than 5000 tables available through a new API, and here is a tool that will make it easier to download the data into a Pandas dataframe:
https://gist.github.com/anonymous/e1463d45e4c4e8673bfcfbaf585cdd8c


Basically the tool allows you to:

1. Search for tables: search('cows')
2. Select variables and values for a table in a widget gui: box = select(table_id = '14714')
3. Download the selected table/values into a Pandas dataframe: df = read_box(box)

It also does some other potentially useful things, e.g. get the json query associated with the selection: get_json(box), or get the json for downloading all variables and values: full_json(table_id = '14714').

The coding part was fun, but the process also made me reflect a little. One key lesson, and an issue that I would like to get some input on before rewriting the whole thing is:










Wednesday 4 May 2016

RAIRD

Note to future self (2017): RAIRD



It has the potential to become a great resource: data on education, income, welfare payments and so on in Norway. Analysis is done "on the server" using Python (no data download, which is reasonable in this case).




Sunday 24 April 2016

Combining many, many text columns into one column in Pandas



Sometimes it is useful to combine the content of many columns into one. For instance, data from hospital events often contain one column for each of the diagnostic categories the patient has received. Combining these into a single column might be useful for several reasons. First of all, it may save a lot of memory: instead of twenty columns, many of which are empty for a lot of patients since the majority only have a few diagnoses, we get one. Second, it is sometimes easier to search in one column when we want to select patients who have received a particular diagnosis.


How can this be done using Pandas? There are lots of possible options, but most are bad when we are dealing with a large dataset. Here is one:

icdcolumns = ['icd1', 'icd2', 'icd3', 'icd4', 'icd5', 'icd6']
df['icd_all'] = df[icdcolumns].apply(lambda x: x.str.cat(sep = ', '), axis=1)

It looks promising (and it works), but there is a major problem: it is very memory intensive, and using apply along axis 1 tends to be slow. Even with a very good computer, this took about six hours on a dataset with more than fifty million observations and twenty columns to be combined.

Even worse: after the computation was finished, I tried to save the result using:

df.to_pickle('stringicd')

This made the computer crash because of memory problems (lesson: be careful with serialization when memory is short). Six hours of CPU time wasted.

Instead of simply running the whole thing again, I kept thinking that there must be a better way. Any suggestions? I have an idea, but that is for another blogpost.
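
One possible direction (just a sketch, not necessarily the idea alluded to above, and I have not timed it on data of this size): avoid apply along axis 1 altogether and let stack/groupby do the work, since stack drops the missing values and the join is then done once per row group:

icdcolumns = ['icd1', 'icd2', 'icd3', 'icd4', 'icd5', 'icd6']
df['icd_all'] = (df[icdcolumns]
                 .stack()             # long format; missing values are dropped
                 .groupby(level=0)    # group by the original row index
                 .agg(', '.join))     # join the codes for each row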


Thursday 14 April 2016

Learning How to Build a Web Application for Data Scientists

Learning How to Build a Web Application by Robert Chang is a great overview of how to develop web apps for data visualizations. Recommended!





Installing pymc3 on Windows machines

PyMC3 is a Python package for estimating statistical models. The package has an API which makes it very easy to create the model you want (because it stays close to the way you would write it in standard mathematical notation), and it also includes fast algorithms that estimate the parameters in the models (such as NUTS). All in all it is a great package, BUT there is one significant problem: it seems very difficult to install pymc3 with the dependencies that are needed to make it run reasonably fast. This is partly because pymc3 requires Theano, a package that speeds up computations, and Theano in turn speeds up computations even more if it can use the GPU instead of the CPU, but making this happen often requires some extra installations. After fiddling around for some hours, I finally ended up with the following recipe for installing something that worked reasonably fast (but it does not use the GPU):

1. Install Anaconda 2.7 (64 bit) from Continuum: https://www.continuum.io/downloads

2. If you do not have git installed, install it.
Easy setup here: https://desktop.github.com/
More detailed instructions here: https://help.github.com/articles/set-up-git/


3. Then open the anaconda terminal from the start menu (make sure it is the Python 2 version if you have several versions) and run the following:

conda install mingw libpython

pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git

pip install git+https://github.com/pymc-devs/pymc3


If you get an error about not finding "git", you need to add git to the path: Instructions here: http://stackoverflow.com/questions/26620312/installing-git-in-path-with-github-client-for-windows
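
To check that the installation actually works, a tiny model like the following should run without problems (just a sketch; the data are simulated):

import numpy as np
import pymc3 as pm

data = np.random.randn(100) + 2.0    # simulated data with true mean 2

with pm.Model():
    mu = pm.Normal('mu', mu=0, sd=10)                 # prior for the mean
    pm.Normal('obs', mu=mu, sd=1, observed=data)      # likelihood
    trace = pm.sample(1000)                           # NUTS by default

print(trace['mu'].mean())            # should be close to 2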



Sunday 3 April 2016

Idea: Read random rows instead of first n rows to infer datatypes

Reading large .csv files often creates problems because the program has to infer the datatype of the columns. Typically this is done by sniffing the first n rows, since it would take too much time to sniff everything. And if the inferred datatype is wrong, the whole process breaks down (often after considerable time) or becomes very slow. 

Here is an example: I wanted to use a dask dataframe to analyze a large .csv file. After some time, I got the following error:
Dask dataframe inspected the first 1,000 rows of your csv file to guess the data types of your columns. These first 1,000 rows led us to an incorrect guess. For example a column may have had integers in the first 1,000 rows followed by a float or missing value in the 1,001-st row.

You will need to specify some dtype information explicitly using the ``dtype=`` keyword argument for the right column names and dtypes:

df = dd.read_csv(..., dtype={'my-column': float})

Pandas has given us the following error when trying to parse the file: "could not convert string to float: '2010-12-13 00:00:00'"

This was useful, but it led me to think that a sniffing process using random rows, or at least rows from both the beginning and the end of the file, might be better than using the first rows to infer datatypes. Why? Typically the .csv file is already sorted, and non-standard or missing values may be located towards the end of the file; or, at least, the first rows may be filled with columns that happen to have only "nice" values.
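
Here is a rough sketch of the idea using pandas (the file name and sampling fraction are just examples, and it assumes a pandas version where skiprows accepts a callable):

import random
import pandas as pd

# Read roughly 1% of the rows, chosen at random, but always keep the header
sample = pd.read_csv('large_file.csv',
                     skiprows=lambda i: i > 0 and random.random() > 0.01)

dtypes = sample.dtypes.to_dict()

# The sampled dtypes can then be passed on when reading the full file,
# for example: dd.read_csv('large_file.csv', dtype=dtypes)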

By the way: dask is great (once I got it working!)

Saturday 2 April 2016