Sunday 29 May 2016

Naming is hard and important: A true story!

After finishing the first version of a tool for downloading data into a pandas dataframe in Python (https://gist.github.com/anonymous/e1463d45e4c4e8673bfcfbaf585cdd8c), I took a moment to reflect. The coding part was fun, but there is one key lesson, and an issue I would like to get some input on before rewriting the whole thing:

Naming is hard and important

Initially I used the following syntax to download a table (specified in a gui box) into a dataframe:

df = get_df(box)

This is OK, but later I decided that it might be more intuitive to recycle the standard pandas api-style:

df = read_box(box)

Isolated changes like these may sound like small details, but logically consistent and intuitive names for methods, functions and variables turned out to be very important as the code expanded. For instance, at some point it became evident that it would be useful to download the data not only based on the selection in a gui box, but also by using a specified json dictionary (for users wanting to avoid the notebook and the gui).

The "read_" syntax made it easier to extend the code logically to cover this, since I could just use:

df = read_json(table_id = '14714', query = json_query)

And later, to read all variables and values:

df = read_all()
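As a side note, the consistent "read_" prefix also makes the layering of the code itself clearer: everything can funnel through one function. Here is a rough sketch of that idea; the endpoint, the helper logic and the json-stat parsing are simplified assumptions of mine, not the actual code in the gist:

import requests
import pandas as pd

# Assumed url pattern for the Statistics Norway table API (not taken from the gist).
BASE_URL = 'http://data.ssb.no/api/v0/en/table/{table_id}'

def read_json(table_id, query):
    """Read a table into a dataframe based on an explicit json query."""
    response = requests.post(BASE_URL.format(table_id=table_id), json=query)
    response.raise_for_status()
    raw = response.json()
    # The real tool parses the json-stat result properly; json_normalize is
    # only a stand-in here to keep the sketch short.
    return pd.json_normalize(raw)

def read_box(box):
    """Read a table based on the selection made in a gui box."""
    # The box is assumed to carry its own table_id and json query.
    return read_json(table_id=box.table_id, query=box.query)

def read_all(table_id):
    """Read all variables and values for a table (here taking the table_id explicitly)."""
    # An empty query list is assumed to mean 'everything' in this sketch.
    all_query = {'query': [], 'response': {'format': 'json-stat'}}
    return read_json(table_id=table_id, query=all_query)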


Although "read_" is better, I am not entirely happy since the different cases above are not exactly analogous. Here are some alternatives:

df = read_based_on_box(box)
df = read_based_on_json(json)


df = read_json(json = 'all')

But this is rather verbose and has lots of underscores.

The same goes for parameters in the functions. Ideally one would like something short, intuitive, consistent, and not too common (to avoid name collisions). For instance, to make the gui box where the user can select the variables/values, I first used:

build_query(id = 10714)

I like the brevity of this, but "id" is, first of all, the name of a built-in function in Python, so shadowing it is bad form. It is also not very explicit, since it does not convey what kind of id is being referenced. Finally, I suspect many users already use id as a variable name. This is not a fatal problem in Python, since variables are local, but it makes things more confusing. So, reluctantly (since I am lazy and like to use as few keystrokes as possible), I had to change the id parameter.
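A quick illustration of the kind of confusion this shadowing causes:

id = 10714            # legal, since id is a built-in function, not a reserved word ...
print(id)             # 10714

try:
    id([1, 2, 3])     # ... but now the built-in id() is shadowed and unreachable
except TypeError as error:
    print(error)      # 'int' object is not callable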

To what? My first replacement was "table":

build_query(table = 10714)

But this ended up being a mess. I should have known this from the beginning, but I wanted something short without lots of underscores. "table" is a term that is far too common: it is used in lots of other places in my code as well as in other people's code, and it is not explicit and intuitive because it does not convey the key fact that we are talking about the unique id number of a table.

So I eventually had to admit defeat and add a parameter with an underscore, "table_id":

build_query(table_id = 10714)

This was not, however, the end of my worries. A minor improvement, I think, was to rename "build_query" to "select." One reason for this was that non-experts may not know what "build_query" means, but more importantly the function does not really build a query. It simply creates a gui box where the user can select the variables to be included when downloading the table: which years to include, whether to include per capita or total values, and so on. As an added bonus it was shorter and there was no underscore. So, now we have:

select(table_id = 10714)
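For the curious, here is a heavily simplified sketch of what select could do with ipywidgets. The hard-coded variables and values are placeholders of mine; the real tool builds them from the table's metadata:

import ipywidgets as widgets

def select(table_id):
    """Build a gui box for picking variables/values of a table (simplified sketch)."""
    # Placeholder metadata; the real tool fetches this from the API for table_id.
    metadata = {
        'year': ['2013', '2014', '2015'],
        'unit': ['per capita', 'total'],
    }
    selectors = [
        widgets.SelectMultiple(options=values, description=name)
        for name, values in metadata.items()
    ]
    # The real tool also keeps track of table_id so that read_box() can use it later.
    return widgets.VBox(selectors)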

The problem now was that the unique table id sometimes includes leading zeros:

select(table_id = 07187)

Leading zeros do not go well with integers; an integer literal with a leading zero is not even valid syntax in Python 3.

Right now I have a solution that may be frowned upon: I allow table_id to be both a string and an integer. If the user specifies an integer with fewer than five digits, the function "magically" makes it a string and adds leading zeros.

In other words, the following are currently equivalent:

select(table_id = '07187')

select(table_id = 7187)
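The "magic" itself is tiny; something along these lines (the helper name is mine, not from the gist):

def _normalize_table_id(table_id):
    """Return the five digit string id, zero-padding integers: 7187 -> '07187'."""
    return str(table_id).zfill(5)

assert _normalize_table_id(7187) == '07187'
assert _normalize_table_id('07187') == '07187'
assert _normalize_table_id(14714) == '14714'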

Oh, the horror! Or perhaps it is a useful, flexible approach?

Is the flexibility worth the wrath of the PEP Gods?

OK, I admit: It is most likely horribly inconsistent and should be avoided. The Gods are right.

But I am not sure what to do. Is it best to define "table_id" as a string (always), or an integer (always), or is it OK to allow both?

If forced to choose: although I tend to like integers (again, to reduce clutter and keystrokes), in the end a string might be the most consistent specification.

Argh! A string table_id is slightly more ugly and cumbersome, but this is where things are headed unless you stop me ...

As if this was not enough, I initially allowed two ways of selecting a table:

1. Based on the stable and unique five digit table_id specified by Statistics Norway:
select(table_id = '07187')

2. Based on the row number in a dataframe with different tables
select(row = 4, source = search_result_dataframe)

In fact I was stupid enough to make #2 the standard approach at first, since I thought it would be easier and more intuitive for beginners to identify a table by the row number (it is the first column in the dataframe shown on the screen). It was also shorter, and the reader did not have to go through the mentally arduous task of learning about the longer (but unique and stable) table_ids that belong to each table.

But it was a mistake: approach #1 using "table_id" is more reliable, intuitive and shorter (with row, the source dataframe also has to be specified).

Right now the "row" selection is still allowed, but I think I will eliminate it and only allow "table_id." In that case I will also update the result of the search by making "table_id" the first column (i.e. the index column) in the dataframe that is returned after a search. Unfortunately this means that the index will not be consecutively numbered in the natural pattern 0, 1, 2 etc., but by the five digit id (14514, 05646 etc.). Still, it is for the better, I think.
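A sketch of what that would look like, with made-up titles; the key detail is keeping table_id as a string column so the leading zeros survive:

import pandas as pd

# Hypothetical search result; table_id is stored as strings, not integers.
search_result = pd.DataFrame({
    'table_id': ['05646', '07187', '14514'],
    'title': ['made-up title A', 'made-up title B', 'made-up title C'],
})

search_result = search_result.set_index('table_id')
print(search_result.loc['07187', 'title'])   # look up a table directly by its id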

In short, naming things is important, but difficult. I wonder if there are some more general principles that can be used. For instance: always (or never) use a verb, like "get_df" or "read_df"? Avoid or use abbreviations (use "language", not "lang")? Avoid triple underscores? And so on. The grammar of graphics has some rules. Is there a grammar of naming? OK, I know there are rules such as "avoid camelCase" and so on, but I was thinking more about the terms themselves and the structure. Perhaps it is one of those things that are simply not suited to general principles. Art more than science?

Saturday 28 May 2016

Import data from Statistics Norway to a Pandas dataframe in Python

Statistics Norway has made more than 5000 tables available with a new API, and here is a tool that will make it easier to download the data into a Pandas dataframe:
https://gist.github.com/anonymous/e1463d45e4c4e8673bfcfbaf585cdd8c


Basically the tool allows you to:

1. Search for tables: search('cows')
2. Select variables and values for a table in a widget gui: box = select(table_id = '14714')
3. Download the selected table/values into a Pandas dataframe: df = read_box(box)

It also does some other potentially useful things, e.g. getting the json query associated with the selection: get_json(box), or getting the json query for all variables and values: full_json(table_id = '14714').
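Put together, a typical session looks roughly like this. It assumes the code from the gist has already been run in the notebook; the search term and table id are just the examples above:

# 1. Search for tables
tables = search('cows')

# 2. Pick variables and values for one table in a widget gui
box = select(table_id = '14714')

# 3. Download the selection into a Pandas dataframe
df = read_box(box)

# Extras: the json query behind the selection, and the query for everything
query = get_json(box)
query_all = full_json(table_id = '14714')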

The coding part was fun, but the process also made me reflect a little. There is one key lesson, and an issue I would like to get some input on before rewriting the whole thing: naming is hard and important.

Wednesday 4 May 2016

RAIRD

Note to future self (2017): RAIRD



It has the potential to become a great resource: data on education, income, welfare payments and so on in Norway. Analysis is done "on the server" using Python (no data download, which is reasonable in this case).


