Safe Handling Instructions for Missing Data

In machine learning tasks, it is common to handle missing data by simply removing observations with missing values, or by replacing each missing value with the mean of its feature. To show why this is problematic, we apply listwise deletion and mean imputation to artificially created datasets with missing values, and we compare the resulting models against ones fit with full information. Unless quite strong independence assumptions are met, we observe large biases in the resulting coefficients and an increase in the model's prediction error. We conclude by repeating the experiment on a real dataset and showing the appropriate diagnostic and correction steps for handling missing values.
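
The synthetic-data experiment described above might look roughly like the following sketch. It is not the speaker's code: the data-generating process (two correlated features with true coefficients of 1), the two missingness mechanisms, and the helper names `ols`, `listwise_deletion`, and `mean_imputation` are all assumptions chosen for illustration, and only NumPy is used.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Assumed setup: two correlated features and a linear outcome.
# The true coefficients are (1, 1) and the true intercept is 0.
cov = [[1.0, 0.8], [0.8, 1.0]]
X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)


def ols(X, y):
    """Return [intercept, b1, b2] from an ordinary least-squares fit."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def listwise_deletion(X, y, missing):
    """Drop every row whose first feature is flagged as missing, then fit."""
    keep = ~missing
    return ols(X[keep], y[keep])


def mean_imputation(X, y, missing):
    """Replace missing first-feature values with the observed mean, then fit."""
    X_imp = X.copy()
    X_imp[missing, 0] = X[~missing, 0].mean()
    return ols(X_imp, y)


# Two hypothetical missingness mechanisms for the first feature:
# 1) completely at random: each value is missing with probability 0.5;
# 2) dependent on the outcome: the value is missing whenever y > 0.
mechanisms = {
    "at random": rng.random(n) < 0.5,
    "depends on y": y > 0,
}

full = ols(X, y)  # reference fit with full information
results = {
    name: {
        "deletion": listwise_deletion(X, y, missing),
        "mean": mean_imputation(X, y, missing),
    }
    for name, missing in mechanisms.items()
}

print("full information:", np.round(full, 2))
for name, fits in results.items():
    for method, coef in fits.items():
        print(f"{name:>12} / {method:<8}:", np.round(coef, 2))
```

Under these assumptions, listwise deletion should stay close to the full-information coefficients only when values are missing completely at random, while mean imputation distorts the coefficients of the correlated features even in that case. When missingness depends on the outcome, both approaches drift away from the true values.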

Speaker: Dillon Niederhut, Enthought