In case you haven't heard of Kaggle, it's a data science competition site where companies and organizations provide data sets relevant to a problem they're facing, and anyone can attempt to build predictive models for them. The teams with the most accurate predictions are rewarded with fame and fortune (or relatively minor sums of money compared to the value they provide, but that's another post). Some competitions are for learning, some just for fun, and some for the $$$.

One of the current competitions involves data on the passengers of the Titanic. The goal is to predict which passengers survived the sinking. This is one of the learning competitions, but it is particularly challenging due to the small sample size from which to learn, as well as the random, chaotic nature of a group of panicking people.

As this is a learning challenge, there is a reference solution provided, which is a great starting point. The provided code builds a random forest model in Python using the scikit-learn library. There are a lot of different tools and techniques in scikit-learn that we can employ in pursuit of improving on the reference model, and we'll cover many of them over the next couple of weeks. Today's post covers the basics of reading in the data and preparing it for feature engineering using the Pandas library, which works hand in hand with scikit-learn.

CODE: https://gist.github.com/anonymous/6383e0e8b5701c45ba80785e04113646.js

A few things to point out about this script:

  • We combine the data from the two files into one for a simple reason: when we perform feature engineering, it's often useful to know the full range of possible values, as well as the distributions of all known values. This requires that we keep track of which rows belong to the training set and which to the test set throughout our processing, but that turns out to not be too difficult (see the sketch after this list).
  • We do a fair amount of maintenance on the dataframe after combining. Pandas is extremely flexible when it comes to combining data sets, but it needs a little extra TLC to make sure none of the original information is lost along the way unless we explicitly tell it to drop something.
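If the gist above isn't rendering for you, the combine-and-track step might look roughly like the following. This is a minimal sketch, not the gist's exact code: it assumes the standard train.csv and test.csv files that Kaggle provides for this competition and uses a placeholder Survived value of -1 to mark test rows.

```python
import pandas as pd

# Kaggle provides these two files for the Titanic competition.
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

# The test set has no 'Survived' column; add a placeholder so the two
# frames have identical columns and we can split them apart later.
df_test['Survived'] = -1

# Combine into one frame for feature engineering. ignore_index rebuilds
# the row index so we don't carry duplicate labels from the two files.
df = pd.concat([df_train, df_test], ignore_index=True)

# ... feature engineering on the combined frame goes here ...

# Split back into training and test sets using the placeholder label.
train = df[df['Survived'] != -1]
test = df[df['Survived'] == -1].drop('Survived', axis=1)
```

The placeholder trick is just one way to keep track of which rows came from which file; keeping the row counts around or adding an explicit source column works just as well.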

Kaggle Titanic Tutorial in Scikit-learn

Part I - Intro

Part II - Missing Values

Part III - Feature Engineering: Variable Transformations

Part IV - Feature Engineering: Derived Variables

Part V - Feature Engineering: Interaction Variables and Correlation

Part VI - Feature Engineering: Dimensionality Reduction w/ PCA

Part VII - Modeling: Random Forests and Feature Importance

Part VIII - Modeling: Hyperparameter Optimization

Part IX - Validation: Learning Curves

Part X - Validation: ROC Curves

Part XI - Summary