
Across the social and biological sciences, statisticians use a technique that leverages randomness to deal with the unknown.
Data is almost always incomplete. Patients drop out of clinical trials and survey respondents skip questions; schools fail to report scores, and governments ignore elements of their economies. When data goes missing, standard statistical tools, like taking averages, are no longer useful.
“We cannot calculate with missing data, just as we can’t divide by zero,” said Stef van Buuren, the professor of statistical analysis of incomplete data at Utrecht University.
Suppose you are testing a new drug to reduce blood pressure. You measure the blood pressure of your study participants every week, but a few get impatient: Their blood pressure hasn’t improved much, so they stop showing up.
You could leave those patients out, keeping only the data of those who completed the study, a method known as complete case analysis. That may seem intuitive, even obvious. It’s also cheating. If you leave out the people who didn’t complete the study, you’re excluding the cases where your drug did the worst, making the treatment look better than it actually is. You’ve biased your results.
Avoiding this bias, and doing it well, is surprisingly hard. For a long time, researchers relied on ad hoc tricks, each with their own major shortcomings. But in the 1970s, a statistician named Donald Rubin proposed a general technique, albeit one that strained the computing power of the day. His idea was essentially to make a bunch of guesses about what the missing data could be, and then to use those guesses. This method met with resistance at first, but over the past few decades, it has become the most common way to deal with missing data in everything from population studies to drug trials. Recent advances in machine learning might make it even more widespread.
Rubin’s approach, called multiple imputation, takes the full distribution of plausible values for each missing entry into account. To use it, first make several copies of your data set. For a given missing value in one copy, randomly assign a guess from your distribution. By design, you’re more likely to pick one of the better guesses, but you’ll also have a small chance of picking one of the less plausible guesses. This process reflects the uncertainty in each guess. Repeat these steps for the missing value in each of the other copies of the data set.
Once you’ve filled in all the missing data, you can analyze each completed data set. You’ll end up with several different predictions for the effectiveness of your drug. Then you can use a recipe known as Rubin’s rules to pool your results and get an average prediction. By following these steps, you can also compute a better estimate of your final prediction’s uncertainty. For drug regulators like the U.S. Food and Drug Administration, being accurate about that uncertainty is crucial: It influences whether or not a drug will get approved.
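To make the recipe concrete, here is a minimal sketch of those steps in Python, using only NumPy. Everything in it is invented for illustration: the tiny `bp_change` array stands in for the blood-pressure example, and the imputation model (draws from a normal distribution fitted to the observed values) is deliberately simplistic. Real applications build richer predictive distributions that condition on covariates, as the mice package for R (written by van Buuren) or the MICE implementation in statsmodels do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcome: change in blood pressure, with dropouts recorded as NaN.
bp_change = np.array([-12.0, -3.5, -8.1, np.nan, -1.2,
                      np.nan, -6.4, -0.8, np.nan, -9.9])

observed = bp_change[~np.isnan(bp_change)]
missing_idx = np.where(np.isnan(bp_change))[0]

m = 20                        # number of imputed copies of the data set
estimates, variances = [], []

for _ in range(m):
    copy = bp_change.copy()
    # Randomly assign each missing value a draw from a distribution of
    # plausible values (here, a normal fitted to the observed values).
    copy[missing_idx] = rng.normal(observed.mean(),
                                   observed.std(ddof=1),
                                   size=missing_idx.size)
    # Analyze the completed copy: estimate the mean change and its variance.
    estimates.append(copy.mean())
    variances.append(copy.var(ddof=1) / copy.size)

estimates = np.array(estimates)
variances = np.array(variances)

# Rubin's rules: pool the m estimates, accounting for both the
# within-imputation and the between-imputation variance.
pooled = estimates.mean()
within = variances.mean()
between = estimates.var(ddof=1)
total_var = within + (1 + 1 / m) * between

print(f"pooled mean change: {pooled:.2f}  (std. error {total_var ** 0.5:.2f})")
```

The between-imputation term in the pooled variance is what keeps the final standard error honest about how much was guessed rather than measured.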
It's confronting to read about a technique I'd never heard of. Granted, I don't need to use a lot of statistics in my work, but it's a reminder that in many fields where statistics is important (I'm looking at you, human science), the researchers doing the work are probably not aware of the state-of-the-art knowledge on how to properly treat uncertainty in their data.
reply
I do think some assumptions need to be made about the missing data for the procedure to be valid, though. Like, if the data is missing at random, then I think the procedure would work great. If missingness is non-random but only depends on observed variables, then the procedure could also work.
But if missingness is non-random and also depends on unobserved correlates, especially if it depends on unobserved correlates of the outcome variable, then I think the procedure is likely to yield biased results.
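A toy simulation (the numbers and logistic missingness models below are made up purely for illustration) shows the difference: when missingness depends only on an observed covariate, a stochastic-regression imputation built from that covariate recovers the true mean, but when missingness depends on the outcome itself, both complete-case analysis and the imputation stay biased.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100_000
x = rng.normal(0, 1, n)                  # observed covariate (e.g. baseline BP)
y = -5 + 3 * x + rng.normal(0, 2, n)     # outcome, some values to be hidden

def impute_from_x(y, x, miss):
    """Fill missing y's with regression-on-x predictions plus noise."""
    slope, intercept = np.polyfit(x[~miss], y[~miss], 1)
    resid_sd = np.std(y[~miss] - (intercept + slope * x[~miss]), ddof=2)
    y_imp = y.copy()
    y_imp[miss] = intercept + slope * x[miss] + rng.normal(0, resid_sd, miss.sum())
    return y_imp

# MAR: probability of being missing depends only on the observed covariate x.
mar = rng.random(n) < 1 / (1 + np.exp(-x))
# MNAR: probability of being missing depends on the outcome y itself.
mnar = rng.random(n) < 1 / (1 + np.exp(-(y + 5) / 2))

print(f"true mean of y: {y.mean():+.3f}")
for name, miss in [("MAR", mar), ("MNAR", mnar)]:
    print(f"{name}: complete-case mean {y[~miss].mean():+.3f}, "
          f"imputed-from-x mean {impute_from_x(y, x, miss).mean():+.3f}")
```

Multiple imputation fixes the standard errors, not this: no number of imputed copies can recover information that the missingness mechanism itself hides.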
reply
In the first two scenarios, where it seems OK, are you introducing measurement error?
If so, you're going to have attenuation bias.
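For what it's worth, a quick sketch of that worry (toy numbers, and an imputation that deliberately ignores the outcome): filling part of a regressor with draws from its marginal distribution behaves like classical measurement error and shrinks the estimated slope toward zero.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 50_000
x = rng.normal(0, 1, n)
y = 2.0 * x + rng.normal(0, 1, n)       # true slope = 2

# Pretend 40% of x is missing and fill it with draws from x's marginal
# distribution, ignoring y entirely.
miss = rng.random(n) < 0.4
x_imp = x.copy()
x_imp[miss] = rng.normal(0, 1, miss.sum())

c = np.cov(x_imp, y)
slope = c[0, 1] / c[0, 0]
print(f"estimated slope: {slope:.2f}   (true slope: 2.00)")
```

The usual advice in the multiple-imputation literature is to include the outcome in the imputation model for exactly this reason.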
reply
Hmm, good point. I’ll admit I haven’t thought carefully about imputation. Why wouldn’t any procedure that imputes data without the outcome variable lead to attenuation bias, and why wouldn’t any procedure that uses the outcome lead to endogeneity?
I’m assuming there’s a good answer if I read the literature. But it’s possible I’d be disappointed as well.
reply
I hadn't thought about imputation leading to attenuation bias until just now, but it seems like it would (if I understand why measurement error has that effect).
I'm also sure this has been discussed at length in the literature. It surprises me a little that none of my advisors or econometrics professors mentioned it, though.
reply
We tend to use predicted values for missing variables. One of my advisors would recommend doing it a bunch of different ways and hoping they all tell the same story in the end.
reply
I thought we could "trust the science"... :)
reply
Ah yes, the magical "robustness check".
Possibly the most commonly asked-for exercise by referees, but the least scientifically grounded one (at least, from the point of view of statistical rigor).
For those not in the know, it basically means try your results under a variety of different assumptions and if the main result still holds, then the result is "robust".
I have to say, there's a certain logic to it, but it's weird that econometricians obsess over how to formally calculate the asymptotic variance of an estimator, and then on the other hand ask for these totally hand-wavy exercises like robustness checks.
reply
In my advisor's defense, she doesn't care that much about econometric nit-picking. Her preference is very much to find highly defensible natural experiments and then do a simple regression analysis.
I actually hadn't even connected this practice to robustness checks in my mind, because it comes up towards the beginning of the process. It always just seemed like her attitude was "Why don't you see if it's a problem before you spend a bunch of time worrying about it?"
reply
Oh yeah I’m not really criticizing robustness checks.
If anything, the people who trust too much in the formal statistics deserve more criticism (imo). And I’m mainly referring to social science here as well.
reply
I remember my office-mate going down a crazy rabbit hole trying to figure out exactly what the right standard error calculations should be for his job market paper. He probably spent a month obsessing over it and never did come to a definitive conclusion.
Probably no surprise, but each of the half dozen different approaches told pretty much the same story.
reply
Guessing it's part of their job. 😁
reply