Kaggle Titanic Competition Part VI - Dimensionality Reduction

In the last post, we looked at how to use an automated process to generate a large number of non-correlated variables. Now we’re going to look at a very common way to reduce the number of features that we use in modelling. You may be wondering why we’d remove variables we just took the time to create. The answer is pretty simple: sometimes it helps. If you think about a predictive model in terms of finding a “signal” or “pattern” in the data, it makes sense to remove noise that hides the signal. Features that are very weak, in other words features that carry little useful information about the target, should be removed so the model can make better use of the strong ones.

Aside from the signal/noise concept, whether this helps also depends on the type of model. Some models do really well with a large number of variables and can effectively ignore variables that only provide noise. Any model that uses L1 regularization (the penalty used by “Lasso”) tends to shrink the coefficients of weak variables all the way to zero, so dimensionality reduction often won’t add much. Models that can’t handle weak variables can end up with high variance (overfitting) if we don’t make an effort to remove them.
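To see what “ignoring weak variables” looks like in practice, here is a minimal sketch on made-up data (not the Titanic features). It uses scikit-learn’s Lasso regression, a linear model with an L1 penalty. The data, the variable names and the alpha value are all assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n_rows = 500

# Made-up data: the first 2 features drive the target, the other 8 are pure noise
X_demo = rng.normal(size=(n_rows, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.5 * rng.normal(size=n_rows)

# alpha controls the strength of the L1 penalty (value picked arbitrarily here)
lasso = Lasso(alpha=0.1)
lasso.fit(X_demo, y_demo)

# Coefficients of the noise features should be shrunk to (or very near) zero
print(np.round(lasso.coef_, 3))
print("non-zero coefficients:", np.count_nonzero(lasso.coef_))
```

The two informative features should keep large coefficients while the noise features are pushed to zero or close to it, which is why a separate dimensionality reduction step tends to matter less for this kind of model.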

One of the most common methods of dimensionality reduction is called Principal Component Analysis (PCA). It isn’t too difficult to understand at a high level, but it gets fairly confusing in the details if you’re not up on your linear algebra/matrix math. Basically, it analyzes how the variables vary together (their covariance). When several variables are strongly correlated, they carry overlapping information, and that shared information can be captured in fewer dimensions. After performing some matrix transformations (eigenvalues and eigenvectors, oh my!) on the variables, we’re left with a series of new, uncorrelated variables called principal components. Each one is a linear combination of the original variables, and they are ordered by how much of the data’s variance they capture, so the first few usually represent the original data very well and the last ones can be dropped with little loss of information.

Did that make sense? Ehhh, probably not. If you’re curious for more rigorous explanations of what’s going on I’ve included some links at the end. The point is: if you have N variables and you think some of them are probably redundant or not useful, you can use PCA to automatically convert them into a new set of fewer than N variables that keeps most of the information (variance) in the original data while dropping the redundant part.
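If it helps to see the steps spelled out, here is a rough sketch of PCA done by hand with NumPy. It runs on a tiny made-up data set (three features, where the third is almost a copy of the first), and every name in it is hypothetical. It is meant to illustrate the idea, not to replace a library implementation.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy data (made up for illustration): 200 rows, 3 features,
# where the third feature is almost a copy of the first
a = rng.normal(size=200)
b = rng.normal(size=200)
X_toy = np.column_stack([a, b, a + 0.05 * rng.normal(size=200)])

# 1. Center each feature on zero
X_centered = X_toy - X_toy.mean(axis=0)

# 2. Covariance matrix of the features (3 x 3)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh is meant for symmetric matrices and returns
#    eigenvalues in ascending order, so reorder them to descending
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Fraction of the total variance captured by each component
explained_ratio = eigenvalues / eigenvalues.sum()

# 5. Keep the smallest number of components that reaches 99% of the variance,
#    then project the centered data onto those components
n_keep = int(np.searchsorted(np.cumsum(explained_ratio), 0.99) + 1)
X_reduced = X_centered.dot(eigenvectors[:, :n_keep])

print("variance explained per component:", np.round(explained_ratio, 4))
print("shape after reduction:", X_reduced.shape)
```

Because the third feature is nearly redundant, two components should be enough to pass the 99% mark on this toy data, so three columns shrink to two with almost nothing lost.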

As usual, it’s pretty easy to do this in scikit-learn:

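import pandas as pd
from sklearn.decomposition import PCA

# Minimum fraction of the variance we want the retained components to explain
variance_pct = .99

# Create PCA object; a float between 0 and 1 tells scikit-learn to keep just
# enough components to explain at least that fraction of the variance
pca = PCA(n_components=variance_pct)

# Transform the initial features. X is the feature matrix built in the earlier
# posts; PCA is unsupervised, so the labels y are not needed here
X_transformed = pca.fit_transform(X)

# Create a data frame from the PCA'd data
pcaDataFrame = pd.DataFrame(X_transformed)

print("{} components describe at least {:.0%} of the variance".format(pcaDataFrame.shape[1], variance_pct))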
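After fitting, it can be useful to check how many components were actually kept and how the variance is spread across them. The sketch below does that on a synthetic stand-in for the feature matrix (five underlying signals mixed into twenty correlated columns). The data and names are assumptions for illustration, and the scaling step is my own addition, since PCA is sensitive to the scale of the features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)

# Synthetic stand-in for the feature matrix: 5 underlying signals
# mixed into 20 correlated columns, plus a little noise
signals = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 20))
X_demo = signals.dot(mixing) + 0.01 * rng.normal(size=(300, 20))

# Standardize so that no single column dominates because of its units
X_scaled = StandardScaler().fit_transform(X_demo)

pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X_scaled)

# n_components_ is the number of components kept;
# explained_variance_ratio_ holds the share of variance for each one
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, "components kept out of", X_scaled.shape[1], "features")
print("cumulative variance explained:", np.round(cumulative, 4))
```

Since only five independent signals went into this data, PCA should need no more than about five components to cover 99% of the variance, even though there are twenty columns.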

In the Titanic competition I’ve tried building models with and without PCA. I initially tried this out while I was experimenting with different model types, as it is typically useful for linear models (again, unless you’re using Lasso). Because I ended up using a random forest, it turns out PCA isn’t particularly helpful. Random forests work very well without any feature transformations at all, and even correlated features don’t really impair the model too much. So, there’s no need for this in my model, but it’s nice to know about anyway!
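If you want to check this on your own data, one way is to cross-validate the same model with and without a PCA step. Here is a rough sketch, using synthetic data from make_classification as a stand-in for the Titanic features. The data set, the parameter values and the variable names are all assumptions, and the scores it prints say nothing about the actual competition.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data with informative, redundant and noise features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=30, n_informative=5, n_redundant=10, random_state=0
)

# Model 1: random forest on the raw features
plain_forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Model 2: the same forest, but fed PCA components instead of raw features.
# Putting PCA inside the pipeline means it is re-fit on each training fold.
pca_forest = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.99),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

scores_plain = cross_val_score(plain_forest, X_demo, y_demo, cv=5)
scores_pca = cross_val_score(pca_forest, X_demo, y_demo, cv=5)

print("without PCA: mean accuracy {:.3f}".format(scores_plain.mean()))
print("with PCA:    mean accuracy {:.3f}".format(scores_pca.mean()))
```

Keeping the scaler and PCA inside the pipeline matters: fitting them on the full data set before cross-validating would leak information from the validation folds into the transformation.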

For more information on PCA, these are some great resources:

There are a lot of other algorithms for dimensionality reduction in scikit-learn. Many of them live in the sklearn.decomposition module (others, such as the manifold learning methods, live in sklearn.manifold), and which one works best depends on the data and the model you’re using.

In the next post, we’ll finally get started with the actual Random Forest modeling, and look at another method for dimensionality reduction!