This series was probably too long! I can’t even remember the beginning, but once I started I figured I may as well be thorough. Hopefully it will provide some assistance to people who are getting started with scikit-learn and could use a little guidance on the basics. All the code is up on GitHub with instructions for running it locally; if anyone tries it out and has any issues running it on their machine, please let me know! I’ll update the README with whatever steps are missing.

Thoughts:

  • It can be tricky figuring out useful ways to transform string features, but with a little exploration and creativity you can find things. I haven’t seen any other Titanic writeup that used the “Ticket” variable, but in my model the numeric portion of the ticket number turned out to be #3 or #4 on the feature importance list (see the extraction sketch after this list). Most pieces of data have useful information in them if you can think of a way to extract it!
  • It’s really helpful to fully grok the mechanism behind the model you’re trying to employ in your pipeline, or you’ll waste time doing things that definitely won’t work. That’s a good learning experience, but not a particularly efficient one. For example, certain model parameters are mutually exclusive (in scikit-learn, for instance, you can’t compute a random forest’s out-of-bag score with bootstrapping turned off), so if you try to test the effect of both at the same time you’re going to chase your tail.
  • Sometimes even when your validation scores are high, you may still be overfitting. During this process I found that even with very high training and validation scores, my submission scores were MUCH lower. In fact, I still haven’t cracked that issue. I’ve talked to other Kagglers in the same boat: no amount of validation seems to help find a model that generalizes to the unlabeled data. Reducing variance seems to be even more important when the training set is small like this one (<1000 examples); the learning-curve sketch below is one way to see the gap. The best submission score I had was 0.79.
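
Since the ticket trick is the most concrete takeaway above, here’s a minimal sketch of how the numeric portion can be pulled out with pandas. This is an illustration rather than the exact code from the series: it assumes the raw Kaggle train.csv, and the TicketNumber column name is just something I picked for the demo.

```python
import pandas as pd

# Tickets look like "A/5 21171", "PC 17599", "STON/O2. 3101282", or just "113803".
df = pd.read_csv('train.csv')

# Grab the trailing run of digits; all-letter tickets like "LINE" come back as NaN.
df['TicketNumber'] = df['Ticket'].str.extract(r'(\d+)$', expand=False).astype(float)

# Use a sentinel for the handful of digit-free tickets so tree models can still split.
df['TicketNumber'] = df['TicketNumber'].fillna(-1)

print(df[['Ticket', 'TicketNumber']].head())
```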
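And on the variance point: a learning curve (as covered in Part IX) makes the train/validation gap visible. Below is a rough sketch using scikit-learn’s learning_curve with only a few raw numeric columns for brevity; the series’ real pipeline feeds in far more features, and the estimator settings here are illustrative, not my tuned values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

df = pd.read_csv('train.csv')
X = df[['Pclass', 'SibSp', 'Parch', 'Fare']].fillna(0)  # tiny feature set, demo only
y = df['Survived']

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
sizes, train_scores, valid_scores = learning_curve(
    RandomForestClassifier(n_estimators=200, random_state=42),
    X, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 8), scoring='accuracy',
)

# A training curve that stays far above the validation curve at full size is the
# variance signature: the forest memorizes ~900 rows but doesn't generalize.
for n, tr, va in zip(sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f'{n:4d} examples  train={tr:.3f}  valid={va:.3f}')
```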

This has been a ton of fun, and I’m looking forward to working on other Kaggle projects in the near future. The National Data Science Bowl competition was just posted and is about predicting ocean health from images of plankton. Since that task is image recognition, it’s probably time to dive into Deep Learning!


Kaggle Titanic Tutorial in Scikit-learn

Part I – Intro
Part II – Missing Values
Part III – Feature Engineering: Variable Transformations
Part IV – Feature Engineering: Derived Variables
Part V – Feature Engineering: Interaction Variables and Correlation
Part VI – Feature Engineering: Dimensionality Reduction w/ PCA
Part VII – Modeling: Random Forests and Feature Importance
Part VIII – Modeling: Hyperparameter Optimization
Part IX – Bias, Variance, and Learning Curves
Part X – Validation: ROC Curves
Part XI – Summary