This series was probably too long! I can't even remember the beginning, but once I started I figured I may as well be thorough. Hopefully it will provide some assistance to people getting started with scikit-learn who could use a little guidance on the basics. All of the code is up on GitHub with instructions for running it locally. If anyone tries it out and has issues running it on their machine, please let me know, and I'll update the README with whatever steps are missing.

Thoughts:

  • It can be tricky figuring out useful ways to transform string features, but with a little exploration and creativity you can find things. I haven't seen any other Titanic writeup that used the "Ticket" variable, but in my model the numeric portion of the ticket number turned out to be #3 or #4 on the important features list. Most pieces of data have useful information in them if you can think of a way to extract it! (There's a quick sketch of this kind of extraction after this list.)
  • It's really helpful to fully grok the mechanism behind the model you're trying to employ in your pipeline, or you'll waste time doing things that definitely won't work. That's a good learning experience, but not a particularly efficient one. For example, certain model parameters are mutually exclusive (in scikit-learn's random forests, oob_score=True only works with bootstrap=True), so if you're trying to test the effect of both at the same time you're going to chase your tail.
  • Sometimes even when your validation scores are high, you may still be overfitting. During this process I found that even when I had very high training and validation scores, my submission scores were MUCH lower. In fact, I still haven't cracked that issue. I've talked to other Kagglers in the same boat: no amount of validation is helping us find a model that generalizes to the unlabeled data. Reducing variance seems to be even more important when the training set is as small as this one (<1000 examples). The best submission score I had was 0.79. (The second sketch below shows one way to gauge that variance.)
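On the first point, here's a minimal sketch of the kind of extraction I mean, using pandas on a few hypothetical ticket values (illustrative only, not necessarily the exact transform from the series):

```python
import pandas as pd

# A few hypothetical values mirroring the Titanic "Ticket" column.
df = pd.DataFrame(
    {"Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "LINE"]}
)

# Grab the trailing run of digits from each ticket. Tickets with no
# numeric portion (e.g. "LINE") come back as NaN and get filled with 0.
df["TicketNumber"] = (
    df["Ticket"]
    .str.extract(r"(\d+)$", expand=False)
    .astype(float)
    .fillna(0)
)

print(df)
```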

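And on the last point, one rough way to see how unstable a validation score is on a training set this small is to repeat cross-validation under different shuffles and look at the spread. This is just a sketch with stand-in data and scikit-learn's current model_selection API; swap in your own feature matrix X and labels y:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data roughly the size of the Titanic training set;
# substitute your real feature matrix X and labels y.
X, y = make_classification(n_samples=891, n_features=20, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Repeat 10-fold CV with different shuffles to see how much the
# score moves around from split to split.
scores = []
for seed in range(10):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores.append(cross_val_score(model, X, y, cv=cv).mean())

print("CV accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
```

A big spread here is a hint that the validation estimate itself is noisy, which fits the gap I kept seeing between validation and submission scores.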
This has been a ton of fun, and I'm looking forward to working on other Kaggle projects in the near future. The National Data Science Bowl competition was just posted, and it's about predicting ocean health from images of plankton. Since the task is image recognition, that probably means it's time to dive into Deep Learning!

Kaggle Titanic Tutorial in Scikit-learn

Part I - Intro

Part II - Missing Values

Part III - Feature Engineering: Variable Transformations

Part IV - Feature Engineering: Derived Variables

Part V - Feature Engineering: Interaction Variables and Correlation

Part VI - Feature Engineering: Dimensionality Reduction w/ PCA

Part VII - Modeling: Random Forests and Feature Importance

Part VIII - Modeling: Hyperparameter Optimization

Part IX - Validation: Learning Curves

Part X - Validation: ROC Curves

Part XI - Summary