Statistical analysis in the presence of missing data

Andrea Burton, Doug Altman and Taane Clark

Missing data complicate the handling and analysis of epidemiological data, particularly when developing prognostic models. Standard practice is to exclude from the analysis those individuals whose data are incomplete. This approach makes implicit assumptions, loses efficiency, and may bias the results to an unknown degree unless the excluded patients are a random sub-sample of the entire dataset.

A large amount of data were missing when we were developing a prognostic model for ovarian cancer. As a solution, we implemented a Bayesian imputation method, enabling all cases to be used in the analysis. This work led to tutorial papers on missing data aimed at epidemiologists and data analysts.

A three-year research project evolved from this work to investigate the properties of various strategies for handling missing covariate data when developing prognostic models. The first element of the project reviewed published cancer studies and found that missing covariate data were common, but that the reporting of this potentially serious problem was seriously deficient. We have proposed guidelines for reporting prognostic studies with missing covariate data, which should help to improve reporting in future articles. Simulation studies and re-sampling from large complete datasets provide the main framework for the investigation, since only knowledge of the complete data allows the true effect of the missing data, and the performance of the different approaches for handling them, to be ascertained. The overall objective of this research is to provide guidelines for constructing prognostic models in the presence of missing covariate data.
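
The logic of such a simulation can be illustrated with a minimal sketch. It is not the project's own code or design: the linear model, the two missingness mechanisms, and all names (ols_slope, simulate, keep_mcar, keep_dep) are assumptions chosen for illustration, and a simple linear regression stands in for a prognostic model. Complete data are generated with a known covariate effect, covariate values are then deleted either completely at random or with a probability that depends on the outcome, and the complete-case estimate under each mechanism is compared with the estimate from the full data.

```python
import math
import random


def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, true_slope=2.0, seed=1):
    """Compare complete-case slopes under two hypothetical mechanisms."""
    rng = random.Random(seed)

    # Complete data: the true covariate effect is known by construction.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1.0 + true_slope * x + rng.gauss(0.0, 1.0) for x in xs]

    # Mechanism A (assumed): x is missing completely at random for about 40%.
    keep_mcar = [rng.random() > 0.4 for _ in range(n)]
    # Mechanism B (assumed): the chance that x is missing rises with y.
    keep_dep = [rng.random() > 1.0 / (1.0 + math.exp(-y)) for y in ys]

    def complete_case(keep):
        # Complete-case analysis: drop every individual with missing x.
        kept = [(x, y) for x, y, k in zip(xs, ys, keep) if k]
        return ols_slope([x for x, _ in kept], [y for _, y in kept])

    return {
        "full": ols_slope(xs, ys),
        "mcar": complete_case(keep_mcar),
        "outcome_dependent": complete_case(keep_dep),
    }


if __name__ == "__main__":
    for name, slope in simulate().items():
        print(f"{name:>18}: {slope:.3f}")
```

In this sketch the complete-case estimate stays close to the true slope when the excluded individuals are a random sub-sample, but is pulled towards zero when missingness depends on the outcome. That comparison is possible only because the complete data are available.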

Publications: 40, 49, 87, 151