MEPS (Medical Expenditure Panel Survey) has a complicated survey design which involves repeated interviews, oversampling, and random recruitment of new households from randomly selected sampling units. In short, when estimating the standard errors of, say, mean expenditures, you cannot assume that the simple sample mean and its standard error generalize to the whole country. There are several methods to deal with this, and AHRQ provides both the variables and the code needed. Unfortunately it takes some time to use all this, so I thought it could be useful to give a very brief summary of my experience.
First: does it really matter? In theory one might be accused of "doing it wrong" if one simply uses the sample mean without weighting and adjusting, but if the error is small the extra time might not be worth the trouble.
Short and boring answer: it depends (on the size of the subgroups you examine). In my experience the mean is not too bad even for small sub-groups (say 70 individuals in a sample of more than 200 000), but the standard error can be way off. For instance, when examining total health spending among those without insurance who also died (between 1996 and 2010), the sample mean and the weighted mean were both in the neighborhood of 9 000 to 11 000 USD.
The standard error, however, was way off. The quick and dirty method of pretending this was a simple random sample gave a standard error of around 60 USD. This looked too promising, so I went ahead and checked it using BRR (balanced repeated replication). Why BRR? I had used Pandas and Python to merge and wrangle the data, and there was no standard module for finding the standard error of a complex sample in Python/Pandas. Given that I had to write it myself, BRR seemed like a good option: AHRQ provides a file defining the half-samples (h036brr), it does not require the whole dataset to be in memory (only the sub-sample), and the method is not too difficult to implement (basically calculate 128 weighted means, one for each half-sample, take the average squared deviation between the overall weighted sample mean and the 128 half-sample means, and take the square root of that to get the standard error).
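The core of that calculation can be sketched in a few lines of NumPy. This is a minimal sketch, not my production code: the function and variable names are my own, and the boolean half-sample masks would in practice be built from the indicators in the h036brr file.

```python
import numpy as np

def brr_standard_error(values, weights, half_sample_masks):
    """BRR estimate of the standard error of a weighted mean.

    values            : 1-D array of observations (e.g. expenditures)
    weights           : matching 1-D array of survey weights
    half_sample_masks : one boolean array per replicate (128 for MEPS),
                        marking the observations in that half-sample
    """
    # the overall weighted sample mean
    overall = np.average(values, weights=weights)
    # one weighted mean per half-sample
    replicate_means = np.array([np.average(values[m], weights=weights[m])
                                for m in half_sample_masks])
    # mean squared deviation from the overall mean, square-rooted
    return np.sqrt(np.mean((replicate_means - overall) ** 2))
```

With the real data you would call it with the expenditure column, the person-level weights, and the 128 masks derived from the replicate file.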
Having done this, I was a bit shocked to see that the result was a standard error of about 5 000 USD. Obviously the dirty method gave a standard error that was way too small. However, I also started to doubt the BRR result. It looked too big. I could not find anything wrong with the algorithm, but the method itself is a bit conservative: it converges on the right result, but it may do so more slowly than other methods.
So I had to bite the bullet and go outside Python to try a third approach: importing the data into Stata and declaring it a complex sample with the appropriate weights and strata. Stata then takes care of all the issues and calculates the standard error (I believe using Taylor series linearization). The result: a standard error of about 2 000 USD.
In sum: yes, for small sub-groups the choice of method can make a big difference when estimating standard errors, but the mean seems a lot more robust.
Thursday, 20 June 2013
Saturday, 8 June 2013
Reducing memory problems with large files in Pandas: Eliminate nan's and use dtypes
I have had some problems working with a large dataset in Pandas (17 million observations). I am repeating myself a bit (see previous post), but the problem is not solved. Some solutions so far have been:
1. Eliminate or replace missing data before using pd.read_csv. Make the missing values -1 or 0 or something else that Pandas does not interpret as "nan" (not a number, missing). Why? Because if the column contains a nan, Pandas automatically upcasts the column to a datatype that takes more memory (typically float64).
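A quick illustration of the upcasting (a small made-up example, not from my dataset):

```python
import numpy as np
import pandas as pd

# An integer column that contains a missing value is silently upcast to float64
with_nan = pd.Series([1, 2, np.nan])
print(with_nan.dtype)   # float64, 8 bytes per value

# Replacing the missing value with a sentinel such as -1 beforehand
# lets the column stay in a compact integer type
clean = pd.Series([1, 2, -1], dtype=np.int8)
print(clean.dtype)      # int8, 1 byte per value
```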
2. Even this is often not enough, because pd.read_csv often assigns datatypes that take more memory than necessary, for instance using float64 when float16 is enough and so on. To reduce this problem, you can explicitly declare the datatypes (and import only the variables needed):
import numpy as np
import pandas as pd

# map each column to the smallest datatype that can hold it
dtypes = {"id": np.int32,
          "birth_year": np.int16,
          "male": np.int8,
          "dead_year": np.int16,
          "dead_month": np.int8}
vars = list(dtypes.keys())
dfp = pd.read_csv(path + "eolnprreducedv2personfile.csv",
                  usecols=vars, dtype=dtypes, sep=";", index_col="id")
3. Finally, it is a good idea to save the file and later import it using the pickle (or HDF5) format, since pd.read_csv often runs out of memory even when the file itself is not very large. Having several large files in memory and adding another using pd.read_csv will lead to memory problems, but importing the same data as a whole DataFrame using pickle will be fine.
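The save-once-reload-later pattern looks like this (a sketch with a made-up DataFrame and file name):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "total_spend": [10.0, 20.0, 30.0]})

# Pay the CSV-parsing cost once, then save the parsed DataFrame ...
df.to_pickle("personfile.pkl")

# ... and later load it back in one step, without the parsing overhead
df2 = pd.read_pickle("personfile.pkl")
```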