Wednesday 21 August 2013

Shoud "inplace" be standard to reduce clutter, keystrokes and errors

Every time I do something, like replacing missing values in a pandas DataFrame, I have to either write the name of the object twice or explicitly specify "inplace". Now, I know there are good reasons for this, but I wonder whether it might be practical to do it the other way around: make "inplace" the default when using methods on objects, and require an explicit argument if you want to avoid changing the object.

Why? To save keystrokes (important in itself), but it also reduces the probability of silly errors like mistyping the name of the object in one place.

Example: I wanted to drop a row because it had been added to the original dataset by mistake. I wrote:

cost_in_age = cost_in_age.drop(cost_in_age.index[105])

A shorter way, which does not work now, but is equally (or more) intuitive, would be:

cost_in_age.drop(cost_in_age.index[105])

Sometimes it is possible to do the latter, but then one has to specify that it is to be done "inplace". My point is that when we call a method on an object, we should not have to specify this.
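For methods that already support it, such as fillna, the in-place version avoids writing the object name twice. A small sketch with a made-up DataFrame:

import numpy as np
import pandas as pd

# a made-up DataFrame with a missing value
cost_in_age = pd.DataFrame({"cost": [100.0, np.nan, 250.0]})

# the "write the name twice" version
cost_in_age = cost_in_age.fillna(0)

# the in-place version: no repetition of the object name
cost_in_age.fillna(0, inplace=True)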

OK, I know there are probably good reasons why it is like this. Methods and functions return things, and unless we explicitly assign the output to an object, it is simply returned and discarded.

But still, it does not look technically impossible to do it the other way. Inplace as the default? The benefits would be a lot of saved typing and fewer errors. The cost?





Wednesday 10 July 2013

Scraping: Don't forget regex!

I was using Python to scrape some data on running pace and split times. Using lxml.html worked fine for importing web pages with tables to collect data and links (see Wes McKinney's book Python for Data Analysis, p. 166 onwards, for a brief introduction). But there was one problem. An important link to the pages with historical split times for each individual was hidden within a table, and it used JavaScript (a "window.open" pop-up). No matter how hard I tried, I could not make lxml catch the URL of this link. Then, after a few hours, the solution occurred to me: why not simply use regular expressions!

The key problem is that when you parse a document using lxml, it creates a structured document. This makes it easy to scrape some things (like tables), but sometimes harder to find others, like a single piece of information hidden within an element nested inside another element. In such cases, searching the raw text of the document with ordinary regular expressions may be easier - and faster - than using lxml or other tools.

In fact, this worked so fast and well that it may serve as a useful reminder to others: before investing a lot of time in lxml, BeautifulSoup or Scrapy, consider whether a regex might be sufficient!

Here is the Python code for scraping standard links (starting with http):

import re
from urllib2 import urlopen

url = "..."  # the address of the web page you want to scrape
html = urlopen(url).read()

# grab everything inside href="..." attributes
links = re.findall(r'href=[\'"]?([^\'" >]+)', html)
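The same trick can be used for the pop-up links that caused the trouble in the first place. The pattern below is only a sketch, since the exact form depends on how the window.open call appears in the page's HTML:

# assumes markup along the lines of: onclick="window.open('splits.php?id=123', ...)"
popup_links = re.findall(r"window\.open\('([^']+)'", html)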

Thursday 20 June 2013

Standard errors in MEPS - A quick comparison of dirty, BRR and proper methods

MEPS (the Medical Expenditure Panel Survey) has a complicated survey design which involves repeated interviews, oversampling, and random recruitment of new households from randomly selected sampling units. In short, when estimating the standard errors of, say, mean expenditures, you cannot assume that the simple sample mean and its standard error can be generalized to the whole country. There are several methods to deal with this, and the AHRQ provides both the variables and the code needed. Unfortunately it takes some time to use all of this, so I thought it could be useful to give a very brief summary of my experience.

First: Does it really matter? Theoretically one might be accused of "doing it wrong" if one simply uses the sample mean without weighting and adjusting, but if the error is small the extra time might not be worth the trouble.

Short and boring answer: It depends (on the size of the subgroups you examine). In my experience the mean is not too bad even for small sub-groups (say, 70 individuals in a sample of more than 200 000), but the standard error can be way off. For instance, when examining total health spending among those without insurance who also died (between 1996 and 2010), the sample mean and the weighted mean were both in the neighborhood of 9 000 to 11 000 USD.

The standard error, however, was way off. The quick and dirty method of using the sample mean, pretending this was a random sample, gave a standard error of around 60 USD. This looked too promising, so I went ahead and checked it using BRR (balanced repeated replication). Why BRR? I had used Python and pandas to merge and wrangle the data, and there was no standard module in Python/pandas for estimating the standard error in a complex survey design. Given that I had to write it myself, BRR seemed like a good option: the AHRQ provides a file with the half-samples (h036brr), it does not require the whole dataset to be in memory (only the sub-sample), and the method is not too difficult to implement. Basically you calculate 128 weighted means, one for each half-sample, take the average squared deviation of these from the overall weighted sample mean, and take the square root of that to get the standard error.
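A minimal sketch of that calculation in pandas/numpy, assuming the half-sample indicators from h036brr have already been merged onto the sub-sample as columns named brr1 to brr128, with spending in a column called totexp and the survey weight in a column called perwt (these column names are made up for the illustration):

import numpy as np

def brr_standard_error(df, value_col="totexp", weight_col="perwt", n_reps=128):
    # overall weighted mean
    overall = np.average(df[value_col], weights=df[weight_col])
    # squared deviation of each half-sample's weighted mean from the overall mean
    deviations = []
    for r in range(1, n_reps + 1):
        half = df[df["brr%d" % r] == 1]
        rep_mean = np.average(half[value_col], weights=half[weight_col])
        deviations.append((rep_mean - overall) ** 2)
    # average the squared deviations and take the square root
    return np.sqrt(np.mean(deviations))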

Having done this, I was a bit shocked to see that the result was a standard error of about 5 000 USD. Obviously the dirty method gave a standard error that was way too small. However, I also started to doubt the BRR result; it looked too big. I could not find anything wrong with the algorithm, but the method itself is a bit conservative: it converges on the right result, but it may do so more slowly than other methods.

So I had to bite the bitter apple and go outside Python to try a third approach: importing the data into Stata and declaring to Stata that this was a complex sample with the relevant weights and strata. Stata then takes care of all the issues and calculates the standard error (I think it uses Taylor series linearization). The result: a standard error of about 2 000 USD.

In sum: yes, for small sub-groups the choice of method can make a big difference for the estimated standard errors, but the mean itself seems a lot more robust.

Saturday 8 June 2013

Reducing memory problems with large files in Pandas: Eliminate NaNs and use dtypes

I have had some problems working with a large dataset in pandas (17 million observations). I am repeating myself a bit (see previous post), but the problem is not solved. Some solutions so far have been:

1. Eliminate or replace missing data before using pd.read_csv. Make the missing values -1 or 0 or something else that pandas does not interpret as "nan" (not a number, missing). Why? Because if a column contains a nan, pandas automatically gives the variable a datatype that takes more memory.

2. Even this is often not enough, because pd.read_csv often assigns datatypes that take more memory than necessary, for instance float64 when float16 is enough. To reduce this problem, you can explicitly declare the datatypes (and import only the variables needed):

dtypes = {"id" : np.int32 , "birth_year" : np.int16 , "male" : np.int8 , "dead_year" : np.int16 , "dead_month" : np.int8}
vars = dtypes.keys()
dfp = pd.read_csv(path + "eolnprreducedv2personfile.csv", usecols = vars, dtype = dtypes, sep = ";", index_col="id")

3. Finally, it is a good idea to save the file and later import it using the pickle (or HDF5) format, since pd.read_csv often runs out of memory even when the file itself is not very large. So having several large files in memory and adding another using pd.read_csv will lead to memory problems, but importing the same data as a whole dataframe using pickle will be OK.
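For example, something like this (a sketch; to_pickle and read_pickle are available in recent pandas versions, and the file name is made up):

import pandas as pd

# save the merged dataframe once...
dfp.to_pickle("personfile.pkl")

# ...and in later sessions load it directly, skipping pd.read_csv
dfp = pd.read_pickle("personfile.pkl")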

Friday 10 May 2013

The Datasets Package — statsmodels 0.4.0 documentation

Good!

The Datasets Package — statsmodels 0.4.0 documentation:


Large files in Pandas

I have had some problems loading and analyzing large files in Pandas. Python complained about running out of memory, even when I did not expect it to. After some trial and error I found some partial solutions which might be helpful for others.

Sometimes the problem is in the process of reading the file: pd.read_csv seems to use a lot of memory. Anyway, here is what I did:

1. The obvious: Use the "usecols" option to load only the variables needed.

2. I tried using the "chunksize" option and gluing the chunks together, but for me this did not solve the problem (see the sketch after this list).

3. I tried reading one column at a time, using "squeeze", and gluing the columns together. This worked better, but eventually I ran into memory problems here too.

4. The final - and best - solution so far: specify the data types. How? Create a dictionary with the variable names and the dtypes you want, and pass it to the dtype option of pd.read_csv. Why does it work? Well, it seems like pandas assigns all variables float64 by default, which takes a lot of memory. Since my variables usually fit in float16 or less, telling pandas this dramatically reduced the memory requirement.
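Here is a small sketch of the chunked approach from point 2, combined with explicit dtypes from point 4; the file name, column names and chunk size are made up:

import numpy as np
import pandas as pd

dtypes = {"id": np.int32, "cost": np.float32}

# read the file in pieces and glue them together at the end
chunks = pd.read_csv("bigfile.csv", usecols=list(dtypes.keys()), dtype=dtypes, chunksize=100000)
df = pd.concat(chunks)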

The last solution, however, does not completely solve the problem. First of all, the most memory-efficient approach (specifying many of the variables as short integers and booleans) does not work when you have missing values. Second, there will eventually be memory problems if you have enough variables/observations.

Still, a combination of the methods above seems to dramatically reduce the memory problem when loading the data.

Hopefully, some day it will be possible to use pandas on datasets without having to load the whole dataset into memory. SAS does this, and making it possible in pandas would be very useful.


Added:
5. Making sure that there are no missing values (by deleting or changing them) is also helpful, because pandas converts integers to float format - which is much more memory intensive - if there are missing observations.

Monday 18 March 2013

Web development tutorials, from beginner to advanced | Nettuts+

This was quite good:

Web development tutorials, from beginner to advanced | Nettuts+:

For instance:

Python

And more specifically (a good beginner's intro):

Object Oriented Programming


Polar predictions and hierarchical models

I was recently asked to predict how many states will have implemented the Medicaid expansion by 2016 and by 2020. The question led to a reflection about the nature of predictions in which the extremes are more likely than the middle. The probability that many states will have signed up by 2016 is high, given the level of federal subsidies offered to the states (100% for the first three years, then 90%). On the other hand, there may be political costs (seemingly accepting the ACA may be politically costly) and worries about the credibility of the promised federal funding. In addition, the presidential election in 2016 may drastically alter the system. So I find myself believing that in 2020 there is a high probability that either very many or very few (none!) of the states will be enrolled in the expansion. Exactly how would I derive the probabilities?

The answer is that a hierarchy of distributions and beliefs must be aggregated. We have beliefs about a Democratic vs. Republican victory in 2016. We also have beliefs about the extent to which the Republicans would change different aspects of the ACA. Then there is the risk of a financial crisis and a renegotiation of the terms. Taken together, this may lead to a polar prediction distribution - with fat end points and little in between.
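A toy simulation makes the point; all the numbers below are made-up illustrations, not my actual beliefs:

import numpy as np

np.random.seed(0)
counts = []
for _ in range(100000):
    # hypothetical: 40% chance the expansion is rolled back after the 2016 election
    if np.random.rand() < 0.4:
        counts.append(np.random.binomial(50, 0.05))  # almost no states enrolled
    else:
        counts.append(np.random.binomial(50, 0.85))  # most states enrolled

# the histogram of counts is bimodal: lots of mass near 0 and near 42, little in between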

So what? Well, it may be obvious, but still, one often thinks about probability distributions as single-peaked: if 36 is the most likely outcome, then 35 is quite likely and 0 is very unlikely. The example reminds me that this intuition can be wrong. It is perfectly possible that the extremes have high probabilities.

Weight measurement, uncertainty and statistical significance

During a demonstration of the electronic health record system in a VA hospital, we were shown a graph of a patient's weight at different visits. A participant then asked whether it would be possible to make the program also indicate whether a change in weight was statistically significant. Initially I thought it was a bad idea, for several reasons. First of all, the weight of a patient is not associated with the same type of uncertainty we have when we draw a random sample from the population in an opinion survey or similar. The measured weight of the person is the weight of the person. OK, there might be measurement error, but that is a different kind of uncertainty, and the error associated with the scale should not be large in a hospital.

Discussing this with a friend, however, nuanced my view a little. The weight of a person might differ depending on whether the person has just eaten and so on. This means that if the interesting parameter is "the average weight within a short time period" and not the "weight right now", then the current measurement will be drawn from a distribution. Both could be relevant: weight right now may be more relevant for a dosage to be used right away, while weight "in general" would be more relevant for assessing weight loss or the proper dosage of drugs over the short term.

However, the uncertainty from weight variation will still not be the same as the sample uncertainty we get when we draw individuals from a large population. Instead, it seems that we should use the knowledge we have about how much the weight might plausibly vary during a day to model the uncertainty. Measurements right after a meal might, perhaps, increase the weight by 1 kg. After exercise and no drinking, the weight might be 1 kg below average. As a first approximation, one might assume that the measured weight of the person is drawn from a normal distribution with a standard deviation of 0.5 kg, so that 95% of measurements would be within +/- 1 kg of the average.
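Under that first approximation, a quick simulation gives a feel for how often pure noise would produce an apparent change; a rough sketch with made-up numbers:

import numpy as np

np.random.seed(1)
sd = 0.5            # assumed measurement-to-measurement standard deviation, in kg
true_weight = 80.0  # assumed true average weight

# two measured weights on different visits, with no real change in between
visit1 = true_weight + np.random.normal(0, sd, 100000)
visit2 = true_weight + np.random.normal(0, sd, 100000)
observed_change = visit2 - visit1

# how often does noise alone produce an apparent change of 1 kg or more?
print((np.abs(observed_change) >= 1.0).mean())  # roughly 0.16 under these assumptions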

Still, this solution seems far from perfect. The weight of a person will have sudden spikes (meals, exercise, bathroom, drinks), and measurements are not equally likely to be taken at every point during the day. Now, there is a difference between the individual and the aggregate, but I still worry a little about the spikes and timing. I am not sure how much it matters, so it may just be a theoretical worry, but before I am willing to ignore it, it would be good to try it out.

How? A Bayesian model using R and JAGS might be used to model the distribution of measured weights relative to the actual "average weight", under different assumptions about the distribution. I do not know how to model spikes like the ones we would observe with weight, but it could be an interesting exercise.

Of course, one might just avoid the whole problem by arguing that there is little point in turning a graph that is easily visualized and continuous into something - statistical significance - that is discrete and numerical. I agree. In this sense, the example simply reveals how standard classical intuitions about statistics can lead to the wrong demands.

Reference
http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=885300

Tuesday 19 February 2013

Two changes at the same time in Python

I needed to remove the endings of the variable names for a few thousand variables in fifteen different pandas DataFrames. The variable names often had the year included, which made it difficult to merge, since the year made the names different across years. No big problem:
i = 0
for db in dflist:
    db.columns = [varName.replace(yearEndList[i], "") for varName in db.columns]
    i = i + 1
I also wanted to change the names to lower case. Again, no big problem:
db.columns = [varName.lower() for varName in db.columns]
But I kept wondering whether it could be done more elegantly: both replace and lower at the same time, not sequentially, in an elegant and fast way. A for loop could do this; for instance, something like the following (not valid Python, just the style I had in mind) seems natural:
for varName in db.columns:
    replace(yearEndList[i], "")
    lower()
It is implicit that we are doing things with the variable in the loop.

Perhaps there is no similarly easy and readable way to do multiple changes with a list comprehension. Yes, it can be done, but not very elegantly - or can it?
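For what it is worth, chaining the string methods inside the comprehension does both changes in one pass (and enumerate drops the manual counter), although whether it counts as elegant is a matter of taste:

for i, db in enumerate(dflist):
    db.columns = [varName.replace(yearEndList[i], "").lower() for varName in db.columns]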



Wednesday 23 January 2013

Argh! BOM, UTF-8, and a solution

Potentially useful rant: If you ever have a problem importing and analyzing what you believe is a standard .csv file (for example using Python and pandas' read_csv), you may want to know that sometimes the .csv file starts with a hidden byte sequence (a BOM, byte order mark, which signals the encoding used, such as UTF-8). After wasting too much time discovering and dealing with this, I found a quick solution: open the .csv file in Notepad++, go to Encoding and select "Encode in UTF-8 without BOM". Save the file again and the problem is gone.
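An alternative is to deal with the BOM directly when reading the file, using Python's "utf-8-sig" codec, which strips a leading BOM if one is present (the file name here is just a placeholder):

import pandas as pd

df = pd.read_csv("data.csv", encoding="utf-8-sig")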