In the last two posts, we covered reading in the data set and handling missing values. Now we can start transforming the variable values into formatted features that our model can use. Different implementations of the Random Forest algorithm accept different types of data. Scikit-learn requires everything to be numeric, so we'll have to do some work to transform the raw data.
Data can generally be considered one of two types: quantitative and qualitative. Quantitative variables are those whose values can be meaningfully sorted in a manner that indicates an underlying order. In the Titanic data set, Age is a perfect example of a quantitative variable. Qualitative variables describe some aspect of an object or phenomenon in a way that can't be directly related to other values in a useful mathematical way. This includes things like names or categories. For example, the Embarked value is the name of a departure port.
Different types of transformations can be applied to different types of variables. Qualitative transformations include:
- Dummy Variables
Also known as binary or indicator variables, dummy variables represent each distinct value of a categorical variable as its own 0/1 column. They can be used most effectively when a qualitative variable has a small number of distinct values that occur somewhat frequently. In the case of the Embarked variable in the Titanic dataset, there are three distinct values: 'S', 'C', and 'Q'. We can transform 'Embarked' into dummies so that the information can be used by the scikit-learn RandomForestClassifier.
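As a minimal sketch of the dummy-variable step, here is one way to do it with pandas' get_dummies(). The small DataFrame is a made-up stand-in for the Titanic data, and the `Embarked_` column prefix is just an illustrative choice:

```python
import pandas as pd

# Made-up stand-in for the Titanic data; the real DataFrame is assumed to be
# loaded already, with missing Embarked values filled in.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C"]})

# One 0/1 column per distinct port: Embarked_C, Embarked_Q, Embarked_S.
# astype(int) gives 0/1 integers regardless of the pandas version's default dtype.
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked").astype(int)

# Attach the dummy columns to the original frame.
df = pd.concat([df, dummies], axis=1)
print(df)
```

Each row ends up with exactly one of the three new columns set to 1, which is the numeric form scikit-learn needs.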
- Factorizing
Pandas has a method called factorize() that creates a numerical categorical variable from any other variable, assigning a unique ID to each distinct value encountered. This is especially useful for transforming an alphanumeric categorical variable into a numerical one. Creating a factor variable is similar to creating dummy variables in that it generates a numerical category, but it does so within a single variable. The letter of the Cabin, for example, can be turned into a categorical variable this way.
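A rough sketch of factorizing the Cabin letter might look like the following. The sample values, the "U0" placeholder for unknown cabins, and the `CabinLetter` / `CabinLetter_id` column names are all assumptions made for illustration:

```python
import pandas as pd

# Made-up stand-in for the Titanic data; "U0" is a hypothetical placeholder
# for cabins that were missing and filled in an earlier step.
df = pd.DataFrame({"Cabin": ["C85", "U0", "E46", "C123", "U0"]})

# Pull out the leading letter(s) of each cabin value.
df["CabinLetter"] = df["Cabin"].str.extract(r"([A-Za-z]+)", expand=False)

# factorize() returns integer codes plus the distinct values, with codes
# assigned in order of first appearance.
codes, uniques = pd.factorize(df["CabinLetter"])
df["CabinLetter_id"] = codes
print(df)
```

Note that the IDs carry no meaningful order; they only tell distinct values apart.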
Quantitative transformations include:
- Scaling
Scaling addresses a problem with some models: variables with wildly different scales are weighted in proportion to the magnitude of their values. For example, Age values will likely max out around 100, while household income values may max out in the millions. Because such models are sensitive to the magnitude of each variable, rescaling the values can help even out the influence of each one. Scaling can also be performed so that all values are compressed into a specific range (typically -1 to 1, or 0 to 1). This isn't necessary for RandomForest models, but it is very helpful in other models you may want to try out with this dataset.
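To make the two flavors of scaling concrete, here is a small sketch in plain pandas. The Age values and the `Age_scaled` / `Age_minmax` column names are invented for illustration; scikit-learn's StandardScaler and MinMaxScaler in sklearn.preprocessing are the usual ready-made tools for the same two operations:

```python
import pandas as pd

# Made-up Age values standing in for the Titanic column (missing values are
# assumed to have been filled already).
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 54.0]})

# Standardize: subtract the mean and divide by the standard deviation, giving
# zero mean and unit variance. ddof=0 uses the population standard deviation.
df["Age_scaled"] = (df["Age"] - df["Age"].mean()) / df["Age"].std(ddof=0)

# Min-max scaling: compress all values into the range 0 to 1.
age_min, age_max = df["Age"].min(), df["Age"].max()
df["Age_minmax"] = (df["Age"] - age_min) / (age_max - age_min)
print(df)
```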
- Binning
Binning groups a range of values into intervals, for example quantiles. This allows you to create an ordered, categorical variable out of a range of values. It can be useful in algorithms that make effective use of categorical information (probably not so great for linear regression).
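One way to sketch quantile binning is with pandas' qcut(). The Fare values below are invented, and the choice of four bins and the `Fare_bin` / `Fare_bin_id` column names are assumptions for illustration only:

```python
import pandas as pd

# Made-up Fare values standing in for the Titanic column.
df = pd.DataFrame({"Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08]})

# Split the values into 4 quantile-based bins (quartiles). Each row gets the
# interval it falls into, stored as an ordered categorical.
df["Fare_bin"] = pd.qcut(df["Fare"], 4)

# Integer code of each bin, ordered from the lowest interval to the highest.
df["Fare_bin_id"] = df["Fare_bin"].cat.codes
print(df)
```

The interval column could also be turned into dummy variables if the ordering turns out not to matter.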
Kaggle Titanic Tutorial in Scikit-learn
Part III - Feature Engineering: Variable Transformations