There will be missing or incorrect data in nearly every non-trivial data set a data scientist ever encounters. It is as certain as death and taxes. This is especially true of big data, and it applies whether the data is generated by humans in a social context or by computer systems and sensors. Some predictive models can deal with missing data inherently (neural networks come to mind), while others require that the missing values be dealt with separately. The RandomForestClassifier model in scikit-learn is not able to handle missing values, so we'll need to assign values before training the model. The following is a partial list of ways missing values can be dealt with:
- Throw out any data with missing values - I don't particularly like this approach, but if you've got a lot of data that isn't missing any values it is certainly the quickest and easiest way to handle it.
- Assign a value that indicates a missing value - This is particularly appropriate for categorical variables (more on this in the next post). I really like using this approach when possible because the fact that the value is missing can be useful information in and of itself. Perhaps a value tends to be missing for some underlying reason, and that reason is itself correlated with the outcome or with another variable. Unfortunately, there isn't really a great way to do this with continuous variables. One interesting trick I learned is that you can do this with binary variables (again, discussed more in the next post) by encoding false as -1, true as 1, and missing as 0.
- Assign the average value - This is a very common approach because it is simple, and for variables that aren't extremely important it may well be good enough. You can also use other variables to split the data into subsets and assign the average within each group. For categorical variables, the most common value (the mode) can be used instead of the mean.
- Use a regression or another simple model to predict the values of missing variables - This is the approach I used for the Age variable in the Titanic set, because age seemed to be one of the more important variables and I thought this would provide better estimates than using mean values. The general approach is to take whatever other features are available (and populated) and build a model using the examples that do have values for the variable in question, then predict the value for the rest. I used the following code to populate the missing Age variable using a RandomForestClassifier model, but a simple Linear Regression probably would have been fine:
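
As a rough illustration of the model-based approach in the last bullet, here is a minimal sketch. It is not the exact code from the original analysis. It assumes pandas and scikit-learn, and the helper name `fill_missing_age`, the choice of feature columns, and the toy rows are all hypothetical. Because Age is a continuous value, the sketch uses `RandomForestRegressor` rather than a classifier.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def fill_missing_age(df, feature_cols, target_col="Age", random_state=0):
    """Return a copy of df with missing target_col values predicted
    from feature_cols (hypothetical helper, for illustration only)."""
    df = df.copy()
    known = df[df[target_col].notna()]    # rows we can train on
    unknown = df[df[target_col].isna()]   # rows that need a value
    if known.empty or unknown.empty:
        return df  # nothing to learn from, or nothing to fill

    # Fit on the rows where the target is present...
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    model.fit(known[feature_cols], known[target_col])

    # ...then predict the target for the rows where it is missing.
    df.loc[unknown.index, target_col] = model.predict(unknown[feature_cols])
    return df


# Made-up rows with Titanic-style column names (not real Kaggle data).
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1],
    "SibSp":  [0, 1, 0, 3, 1, 0, 0, 1],
    "Parch":  [0, 0, 0, 1, 0, 0, 2, 0],
    "Fare":   [71.3, 7.9, 13.0, 21.1, 53.1, 10.5, 8.1, 80.0],
    "Age":    [38.0, 22.0, np.nan, 4.0, 35.0, np.nan, 27.0, 54.0],
})

filled = fill_missing_age(toy, feature_cols=["Pclass", "SibSp", "Parch", "Fare"])
print(filled["Age"])
```

The feature columns passed in need to be numeric and fully populated themselves, so any gaps in those columns would have to be handled first (for example with one of the simpler strategies above). The helper also assumes the DataFrame index is unique, since it writes predictions back by index label.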
Kaggle Titanic Tutorial in Scikit-learn
Part II - Missing Values