feature engineering

3 Nov, 2014

Kaggle Titanic Competition Part II – Missing Values

November 3rd, 2014 | 4 Comments

Nearly every non-trivial data set a data scientist encounters will contain missing or incorrect data. It is as certain as death and taxes. This is especially true of big data, and it applies equally to data generated by humans in a social context and to data generated by computer systems or sensors. Some predictive models can inherently deal with missing data (neural networks come to mind), while others require that missing values be dealt with separately. The RandomForestClassifier model in scikit-learn is not able to handle missing values, so we'll need to use some different approaches to assign values before training the [...]

30 Oct, 2014

Kaggle Titanic Competition Part I – Intro

October 30th, 2014 | 1 Comment

In case you haven't heard of Kaggle, it's a data science competition site where companies and organizations provide data sets relevant to a problem they're facing, and anyone can attempt to build predictive models for those data sets. The teams whose models predict best are rewarded with fame and fortune (or relatively minor sums of money compared to the value they provide, but that's another post). Some competitions are for learning, some just for fun, and some for the $$$. One of the current competitions involves data on the passengers of the Titanic. The goal is to attempt to determine [...]
