pandas

7 Nov, 2014

Kaggle Titanic Competition Part IV – Derived Variables


In the previous post, we began taking a look at how to convert the raw data into features that can be used by the Random Forest model. Any variable that is generated from one or more existing variables is called a "derived" variable. We've discussed basic transformations that result in useful derived variables, and in this post we'll look at some more interesting derived variables that aren't simple transformations. An important aspect of feature engineering is using insight and creativity to find new features to feed the model. You'll read this over and over again, and it really can't [...]
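As a rough illustration of what a derived variable looks like in pandas, here is a minimal sketch. The excerpt does not show which features the post actually builds, so `FamilySize` and `Title` are assumptions chosen for illustration; the column names (`Name`, `SibSp`, `Parch`) follow the Kaggle Titanic CSV, and the two rows are sample values only.

```python
import pandas as pd

# Two illustrative rows using Titanic-style column names (assumed schema)
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "SibSp": [1, 0],   # siblings/spouses aboard
    "Parch": [0, 0],   # parents/children aboard
})

# Derived from two existing columns: relatives aboard plus the passenger
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Derived from a text column: the honorific between the comma and the period
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(df[["FamilySize", "Title"]])
```

Neither column is a simple rescaling of an existing one: each combines or parses raw fields into something the model could not see directly.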

5 Nov, 2014

Kaggle Titanic Competition Part III – Variable Transformations


In the last two posts, we covered reading in the data set and handling missing values. Now we can start transforming the variable values into formatted features that our model can use. Different implementations of the Random Forest algorithm accept different types of data. Scikit-learn requires everything to be numeric, so we'll have to do some work to transform the raw data. Data can generally be classified as one of two types: quantitative and qualitative. Quantitative variables are those whose values can be meaningfully sorted in a manner that indicates an underlying order. In [...]
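To make the "everything must be numeric" requirement concrete, here is a minimal sketch of two common ways to turn qualitative columns into numbers with pandas. The specific transformations the post uses are not visible in this excerpt, so the mapping and the indicator-column approach below are assumptions, as are the sample values; the column names `Sex` and `Embarked` follow the Kaggle Titanic CSV.

```python
import pandas as pd

# Small illustrative frame with two qualitative columns (assumed schema)
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Two-valued variable: map each label to 0 or 1
df["Sex_num"] = df["Sex"].map({"male": 0, "female": 1})

# Unordered variable with several values: one 0/1 indicator column per category
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)

# Numeric-only feature table that a scikit-learn estimator could accept
features = pd.concat([df[["Sex_num"]], dummies], axis=1)
print(features)
```

Indicator columns avoid implying an order among the ports, which a single integer code (0, 1, 2) would do.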

3 Nov, 2014

Kaggle Titanic Competition Part II – Missing Values


There will be missing or incorrect data in nearly every non-trivial data set a data scientist ever encounters. It is as certain as death and taxes. This is especially true with big data, and it applies both to data generated by humans in a social context and to data generated by computer systems or sensors. Some predictive models are inherently able to deal with missing data (neural networks come to mind), while others require that the missing values be dealt with separately. The RandomForestClassifier model in scikit-learn is not able to handle missing values, so we'll need to use some different approaches to assign values before training the [...]
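As a minimal sketch of assigning values before training, the snippet below fills a numeric column with its median and a categorical column with its most common value. The approaches the post goes on to use are not shown in this excerpt, so median and mode filling are assumptions picked as the simplest options; the column names follow the Kaggle Titanic CSV and the values are made up.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric column: replace missing entries with the median of the known values
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: replace missing entries with the most frequent value
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df)
```

Once no missing values remain in the feature columns, the frame can be encoded numerically and handed to the classifier.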

30 Oct, 2014

Kaggle Titanic Competition Part I – Intro


In case you haven't heard of Kaggle, it's a data science competition site where companies/organizations provide data sets relevant to a problem they're facing and anyone can attempt to build predictive models for the data set. The teams that best predict the data are rewarded with fame and fortune (or relatively minor sums of money compared to the value they provide, but that's another post). Some competitions are for learning, some just for fun, and some for the $$$. One of the current competitions involves data on the passengers of the Titanic. The goal is to attempt to determine [...]
