Limitations of Common Solutions to Missing Data
06th October 2008
A previous article discussed some of the causes of missing data and some of the consequences of analyzing only complete cases. This newsletter will discuss some other common ways of dealing with missing data, with a discussion of their advantages and disadvantages.
Available case analysis (pairwise deletion) calculates each step of the analysis separately using the cases that have data available for that step. Therefore, a case with data missing on one variable will be used only in steps that do not involve that variable. The advantage is that the sample size for each individual analysis is generally higher than with complete case analysis, but the results are unbiased only if the data are MCAR. It can also lead to mathematical problems in computing estimates of some parameters, and is not recommended.
Most other methods involve imputation-replacing the missing values with an estimate, then analyzing the full data set as if the imputed values were actual observed values. There are many ways to choose an estimate. The following are common methods:
* Mean: the mean of the observed values for that variable
* Substitution: the value from a new individual who was not selected to be in the sample
* Hot deck: a randomly chosen value from an individual who has similar values on other variables
* Cold deck: a systematically chosen value from an individual who has similar values on other variables
* Regression: the predicted value obtained by regressing the missing variable on other variables
* Stochastic regression: the predicted value from a regression plus a random residual value.
* Interpolation and extrapolation: an estimated value from other observations from the same individual.
Imputation is popular because it is conceptually simple and because the resulting sample has the same number of observations as the full data set. It can be very tempting when complete-case analysis eliminates a large proportion of the data set. But it has limitations. Some imputation methods result in biased parameter estimates, such as means and correlations, unless the data are MCAR. The bias is often worse than with complete-case analysis, especially for mean imputation. The extent of the bias depends on many factors, including the missing data mechanism, the proportion of the data that is missing, and the information available in the data set.
Moreover, all of these imputation methods underestimate standard errors. Since the imputed observations are themselves estimates, their values have corresponding random error. Despite this, imputed values are treated as actual observations in analyses. The extra source of error is ignored, resulting in too-small standard errors and too-small p-values. Furthermore, although imputation is conceptually simple, it is usually difficult to do well in practice. Therefore, these imputation methods are not satisfactory in most circumstances.
Two alternate methods maintain the full sample size and can result in unbiased estimates of parameters and standard errors for ignorable missing data: multiple imputation and maximum likelihood estimation. These techniques are now available in common statistical software. Subsequent newsletters will describe these methods and discuss their availability in software packages.
Copyright © 2008, Karen Grace-Martin
Karen Grace-Martin, founder of The Analysis Factor, has helped social science researchers practice statistics for 9 years, as a statistical consultant at Cornell University and in her own business. She knows the kinds of resources and support that researchers need to practice statistics confidently, accurately, and efficiently, no matter what their statistical background. To answer your questions, receive advice, and view a list of resources to help you learn and apply appropriate statistics to your data, visit www.analysisfactor.com.
Karen Grace-Martin, founder of The Analysis Factor, has helped social science researchers practice statistics for 9 years, as a statistical consultant at Cornell University and in her own business. She knows the kinds of resources and support that researchers need to practice statistics confidently, accurately, and efficiently, no matter what their statistical background. To answer your questions, receive advice, and view a list of resources to help you learn and apply appropriate statistics to your data, visit www.analysisfactor.com.