dataHans: API: adding new functions vs. including more arguments into existing functions

When expanding the features of a library, there is often a choice between adding new functions and adding more arguments to existing functions. Consider the API for downloading data into a pandas DataFrame in stats_to_pandas (0.0.7):

read_box(): downloads a table as specified by a widget box
read_premade(): downloads a table based on a premade table_id (a subset of tables)
read_all(): downloads all the variables and values from the table indicated by table_id
read_with_json(): downloads a table given by "table_id" based on a user-supplied json dictionary that selects variables and values
read_url(): downloads a json-stat table from a URL

Instead of this, it would be possible to have one read function with several optional arguments:

read(box=None, table_id=None, premade=False, language='en', query=None, full_url=None)

Although this is only one function, it is not inherently simpler, since it requires the user to know a longer list of arguments.

One advantage of the all-in-one approach is that the high-level API stays relatively short. It also reduces the amount of code, since many of these functions share similar code.

A disadvantage is that autocomplete often works better with functions than with arguments inside functions.

Another disadvantage is that the all-in-one function easily becomes very complex and non-intuitive. For instance, in the case above:

If "box" is specified as an argument, none of the other arguments are needed (and they will be ignored).
If table_id is specified, the box argument is ignored, but the user has to supply more arguments to determine which variables and values to download from the table:
If it is a premade table, set premade=True and no further information is necessary.
If it is not a premade table but you want to download all variables and values, set query='all'.
If you want to use a json-stat dictionary to specify which variables and values to download, use query=my_json_dict.
Lastly, if you have the full URL to a json-stat table, you need only specify the full_url argument (and all other arguments will be ignored).
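The rules above can be made concrete with a small sketch. This is not the stats_to_pandas implementation: describe_read is a hypothetical stand-in that only reports which branch an all-in-one read() would take instead of downloading anything. The precedence order (full_url, then box, then table_id) is an assumption, since the rules as stated do not say which argument wins when several are given.

```python
def describe_read(box=None, table_id=None, premade=False,
                  language='en', query=None, full_url=None):
    """Report which download path an all-in-one read() would take.

    Hypothetical stand-in: returns a label instead of a dataframe.
    The precedence order used here is an assumption.
    """
    if full_url is not None:
        return 'url'          # every other argument is silently ignored
    if box is not None:
        return 'box'          # table_id, premade and query are ignored
    if table_id is None:
        raise ValueError('specify box, table_id or full_url')
    if premade:
        return 'premade'      # query is ignored
    if query == 'all':
        return 'all'
    if isinstance(query, dict):
        return 'json query'
    raise ValueError("with table_id, set premade=True, query='all' or a query dict")


# Both a box and a table_id are given: the table_id is dropped without warning.
print(describe_read(box=object(), table_id='some_table_id'))  # box
```

Even in this reduced form, most of the body is spent deciding which arguments to ignore, and the caller gets no signal when an argument has been dropped.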

All of this sounds complicated, and it may well be better to maintain separate functions instead of trying to include them all in one. Ideally the list of arguments should not have a structure where "argument B is only needed if A is given, in which case argument C will be ignored." Instead it seems more logically consistent and intuitive that function arguments should never be mutually destructive. The arguments should specify different features that all apply to the object. In the example above the arguments are mutually destructive: if one is specified, sometimes another is no longer relevant. Specifying the source as a box makes the table_id redundant. The user may then be confused: what is returned if both box and table_id are specified?

The problem could partly be solved by a better argument structure. There could, for instance, be a "source" argument which tells the function whether the source of the dataframe is a widget box or a table_id (or something else). Originally this was implicit, in the sense that as soon as one was specified, the code considered it to be the source, but this makes the result unpredictable if both are specified.
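A minimal sketch of the explicit "source" idea, under assumptions: resolve_source is a hypothetical helper (not part of stats_to_pandas), and the accepted values 'box' and 'table_id' are chosen here only for illustration. The point is that the caller names the source, so supplying both box and table_id has a predictable result.

```python
def resolve_source(source, box=None, table_id=None):
    """Pick the input named by `source` (hypothetical helper, not library code)."""
    candidates = {'box': box, 'table_id': table_id}
    if source not in candidates:
        raise ValueError(f"source must be 'box' or 'table_id', got {source!r}")
    chosen = candidates[source]
    if chosen is None:
        # The named source was not supplied: fail loudly instead of guessing.
        raise ValueError(f"source={source!r} requires the {source} argument")
    return source, chosen


# Both inputs are given, but `source` decides which one is used.
print(resolve_source('table_id', box='my_box', table_id='my_table'))
# ('table_id', 'my_table')
```

The unused argument is still ignored, but now by the caller's explicit choice rather than by an undocumented precedence rule.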

So, perhaps in this case the many-functions, few-arguments approach is better?

Perhaps. But this also confuses the user since it may be unclear which function to use. Some users do not know what a json dictionary is and will be confused by the read_with_json() function.

The user is king, and this means that before choosing, one might consider how often users will need the different functions and how often they will be confused by them. If, for instance, most users will almost always need only a single read function with the same set of arguments, one might consider having a single read function with these defaults, and just add some arguments to this function for more advanced users who occasionally need more advanced options. These users are often able to figure out the arguments of the function. This is the case above. Few users will need the read_premade function, but it is useful to have for those who need it. Adding it as a separate function may confuse less experienced users, since they do not quite know which read function to use, but adding it as an argument to a general read function eliminates this problem. One may argue that it just pushes the problem to a different level, since the user will now be confused by the arguments of the function, but with sensible defaults, less experienced users need not think much about the more complex options and are free to ignore them.
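As a rough illustration of the defaults argument, here is a sketch in which plan_read is a hypothetical function that only describes the request a general read function would make. Making query='all' the default is an assumption made for this example; the signature discussed earlier uses query=None.

```python
def plan_read(table_id, language='en', premade=False, query='all'):
    """Describe what a read() with sensible defaults would download.

    Hypothetical sketch: returns a plan instead of a dataframe.
    """
    if premade:
        mode = 'premade'          # advanced option, off by default
    elif query == 'all':
        mode = 'all'              # default: every variable and value
    else:
        mode = 'json query'       # advanced option: user-supplied selection
    return {'table_id': table_id, 'language': language, 'mode': mode}


# A less experienced user supplies only the table and can ignore the rest.
print(plan_read('some_table_id'))
# An advanced user opts in to the premade path with one extra argument.
print(plan_read('some_table_id', premade=True))
```

The basic call needs one argument, while the premade and query options stay available without adding a second function name to choose from.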

Although it may be friendly to less experienced users, I am bothered by the logical inconsistency created by the all-in-one function using common defaults.

Saturday, 2 July 2016
