In the last post we covered some ways to derive variables from string fields using intuition and insight. This time we'll cover derived variables that are a lot easier to generate.
Interaction variables capture effects of the relationship between variables. They are constructed by performing mathematical operations on sets of features. The simple approach that we use in this example is to apply the basic arithmetic operators (add, subtract, multiply, divide) to each pair of numerical features. We could also get much more involved and include more than two features in each calculation, and/or use other operators (sqrt, ln, trig functions, etc.).
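To make the pairwise idea concrete, here is a minimal sketch using pandas. It is not the exact code from this series: the helper name `add_interactions`, the column-naming scheme, and the tiny sample values are all assumptions made for illustration, and it only combines distinct pairs of columns, so the number of features it produces will not necessarily match the count quoted in this post.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def add_interactions(df, columns):
    """Build pairwise interaction features for the given numeric columns.

    Hypothetical helper for illustration only.
    """
    new_cols = {}
    for a, b in combinations(columns, 2):
        # Addition and multiplication are symmetric: one feature per pair.
        new_cols[f"{a}+{b}"] = df[a] + df[b]
        new_cols[f"{a}*{b}"] = df[a] * df[b]
        # Subtraction and division are not symmetric: keep both orders.
        new_cols[f"{a}-{b}"] = df[a] - df[b]
        new_cols[f"{b}-{a}"] = df[b] - df[a]
        new_cols[f"{a}/{b}"] = df[a] / df[b]
        new_cols[f"{b}/{a}"] = df[b] / df[a]
    interactions = pd.DataFrame(new_cols, index=df.index)
    # Dividing by zero produces +/-inf; turn those into NaN so they can be
    # imputed or dropped in a later step.
    return interactions.replace([np.inf, -np.inf], np.nan)


# Made-up sample rows, not real Titanic records.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "SibSp": [1, 1, 0],
})
interactions = add_interactions(sample, ["Age", "Fare", "SibSp"])
print(interactions.shape)
```

Each pair of columns contributes six new features in this sketch, which shows how quickly the feature count grows as more numeric columns are added.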
This process of automated feature generation can quickly produce a LOT of new variables. In our case, we use 9 features to generate 176 new interaction features. In a larger data set with dozens or hundreds of numeric features, this process can generate an overwhelming number of new interactions. Some types of models are really good at handling a very large number of features (I've heard of thousands to millions), which would be necessary in such a case.
It's very likely that some of the new interaction variables are going to be highly correlated with one of their original variables, or with other interactions, which can be a problem especially for linear models. Highly correlated variables can cause an issue called "multicollinearity". There is a lot of information out there about how to identify, deal with, and safely ignore multicollinearity in a data set so I'll avoid an explanation here, but I've included some great links at the bottom of this post if you're interested.
In our solution for the Titanic challenge, I don't believe that multicollinearity is a problem, specifically because a Random Forest is not a linear model. Removing highly correlated features is a good idea anyway though, if for no other reason than to improve performance. We identify and remove highly correlated features using Spearman's rank correlation coefficient, but you could certainly experiment with other methods such as the Pearson product-moment correlation coefficient.
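One way the correlation filter could look in pandas is sketched below. The helper name `drop_correlated` and the 0.98 cutoff are assumptions chosen for illustration, not values taken from this post; the right threshold depends on your data.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.98):
    """Drop one column from each highly correlated pair.

    Hypothetical helper: `threshold` is an arbitrary illustrative cutoff
    on the absolute Spearman rank correlation.
    """
    corr = df.corr(method="spearman").abs()
    # Keep only the strict upper triangle so each pair is checked once
    # and a column is never compared with itself.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it is highly correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Made-up data: "b" is a monotonic function of "a", so their rank
# correlation is perfect even though the relationship is not linear.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1, 4, 9, 16, 25],
    "c": [2, 1, 4, 3, 5],
})
reduced, dropped = drop_correlated(demo)
print(dropped)
```

Because Spearman works on ranks, it catches any monotonic relationship, not just linear ones. Swapping in `method="pearson"` is a one-word change if you want to compare the two. Note that this greedy approach keeps whichever column of a correlated pair appears first, so column order affects the result.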
If you're interested in learning more about multicollinearity, these are some excellent posts worth checking out:
- When Can You Safely Ignore Multicollinearity?
- What Are the Effects of Multicollinearity and When Can I Ignore Them?
- Enough Is Enough! Handling Multicollinearity in Regression Analysis
In the next post, we'll take a look at dimensionality reduction using principal component analysis (PCA).
Kaggle Titanic Tutorial in Scikit-learn
Part V - Feature Engineering: Interaction Variables and Correlation