feature engineering

10 May, 2016

Text Pre-processing Basics with Pandas

May 10th, 2016 | 4 Comments

In this post, we'll take a look at the data provided in Kaggle's Home Depot Product Search Relevance challenge to demonstrate some techniques that may be helpful in getting started with feature generation for text data. Dealing with text data is considerably different from dealing with numerical data, but there are a few basic approaches that are an excellent place to start. As always, before we start creating features we'll need to clean and massage the data! In the Home Depot challenge, we have a few files which provide attributes and descriptions of each of the products on their website. The [...]

1 Dec, 2014

Kaggle Titanic Competition Part VII – Random Forests and Feature Importance

December 1st, 2014 | 0 Comments

In the last post we took a look at how to reduce noisy variables from our data set using PCA, and today we'll actually start modelling! Random Forests are one of the easiest models to run, and highly effective as well, which is a great combination. If you're just starting out with a new problem, this is a great way to quickly build a reference model. There aren't a whole lot of parameters to tune, which makes it very user friendly. The primary parameters include how many decision trees to include in the forest, how much data to include in [...]

26 Nov, 2014

Kaggle Titanic Competition Part VI – Dimensionality Reduction

November 26th, 2014 | 0 Comments

In the last post, we looked at how to use an automated process to generate a large number of non-correlated variables. Now we're going to look at a very common way to reduce the number of features that we use in modelling. You may be wondering why we'd remove variables we just took the time to create. The answer is pretty simple - sometimes it helps. If you think about a predictive model in terms of finding a "signal" or "pattern" in the data, it makes sense that you want to remove noise in the data that hides the [...]

10 Nov, 2014

Kaggle Titanic Competition Part V – Interaction Variables

November 10th, 2014 | 0 Comments

In the last post we covered some ways to derive variables from string fields using intuition and insight. This time we'll cover derived variables that are a lot easier to generate. Interaction variables capture effects of the relationship between variables. They are constructed by performing mathematical operations on sets of features. The simple approach that we use in this example is to apply basic operators (add, subtract, multiply, divide) to each pair of numerical features. We could also get much more involved and include more than 2 features in each calculation, and/or use other operators (sqrt, ln, trig functions, [...]
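
As a rough sketch of the pairwise approach this excerpt describes, the snippet below applies the four basic operators to every pair of numeric features. The helper name `pairwise_interactions`, the sample values, and the choice to store `None` when dividing by zero are all assumptions made for illustration; this is not the code from the post itself.

```python
from itertools import combinations
import operator

# The four basic operators mentioned in the excerpt.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def pairwise_interactions(rows, features):
    """Hypothetical helper: add one interaction column per (feature pair, operator)."""
    result = []
    for row in rows:
        new_row = dict(row)  # keep the original features
        for a, b in combinations(features, 2):
            for symbol, func in OPERATORS.items():
                name = f"{a}{symbol}{b}"
                if symbol == "/" and row[b] == 0:
                    # Assumption: mark undefined ratios as missing.
                    new_row[name] = None
                else:
                    new_row[name] = func(row[a], row[b])
        result.append(new_row)
    return result


# Illustrative values only, not taken from the Titanic data set.
sample = [
    {"Age": 22.0, "Fare": 7.25, "SibSp": 1},
    {"Age": 38.0, "Fare": 71.0, "SibSp": 0},
]
features = ["Age", "Fare", "SibSp"]
expanded = pairwise_interactions(sample, features)
```

With three features this produces three pairs and four operators per pair, so each row gains twelve new columns. Subtraction and division are not symmetric, so a fuller version might also generate the reversed pairs.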

7 Nov, 2014

Kaggle Titanic Competition Part IV – Derived Variables

November 7th, 2014 | 0 Comments

In the previous post, we began taking a look at how to convert the raw data into features that can be used by the Random Forest model. Any variable that is generated from one or more existing variables is called a "derived" variable. We've discussed basic transformations that result in useful derived variables, and in this post we'll look at some more interesting derived variables that aren't simple transformations. An important aspect of feature engineering is using insight and creativity to find new features to feed the model. You'll read this over and over again, and it really can't [...]

5 Nov, 2014

Kaggle Titanic Competition Part III – Variable Transformations

November 5th, 2014 | 4 Comments

In the last two posts, we've covered reading in the data set and handling missing values. Now we can start working on transforming the variable values into formatted features that our model can use. Different implementations of the Random Forest algorithm can accept different types of data. Scikit-learn requires everything to be numeric so we'll have to do some work to transform the raw data. All possible data can be generally considered as one of two types: Quantitative and Qualitative. Quantitative variables are those whose values can be meaningfully sorted in a manner that indicates an underlying order. In [...]
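
To make the "everything must be numeric" requirement concrete, here is a minimal sketch of two common ways to turn a qualitative variable into numbers: integer codes and one-hot indicator columns. The helper names `encode_categories` and `one_hot` and the sample labels are assumptions for illustration, and the post itself may use a different method.

```python
def encode_categories(values):
    """Hypothetical helper: map each distinct label to an integer code,
    numbered in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


def one_hot(values):
    """Hypothetical helper: one 0/1 indicator column per distinct label,
    with columns in sorted label order."""
    categories = sorted(set(values))
    matrix = [[1 if value == c else 0 for c in categories] for value in values]
    return matrix, categories


# Illustrative labels only.
embarked = ["S", "C", "S", "Q"]

encoded, codes = encode_categories(embarked)
matrix, categories = one_hot(embarked)
```

Integer codes are compact but suggest an ordering that a qualitative variable does not actually have, so indicator columns are often the safer choice when the labels have no natural order.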
