In the previous post, we began taking a look at how to convert the raw data into features that can be used by the Random Forest model. Any variable that is generated from one or more existing variables is called a "derived" variable. We've discussed basic transformations that result in useful derived variables, and in this post we'll look at some more interesting derived variables that aren't simple transformations.
An important aspect of feature engineering is using insight and creativity to find new features to feed the model. You'll read this over and over again, and it really can't be emphasized enough - feature engineering is a hugely important part of the data science pipeline and is where you should spend the most time and effort. The basic transformations and interaction variables that we can automate (more on that later) don't take too much time, so that leaves us with efforts to creatively find new variables from the raw data.
Very basic examples of a useful derived variable might be pulling the country code and/or area code out of telephone numbers, or extracting country/state/city from GPS coordinates. Any time a qualitative variable represents an object in the world that we know something about, there is an opportunity to derive variables from it. Also, if a data set represents a time series or other historical behavioral information, that can provide a great opportunity for uncovering derived variables.
The Titanic data set is very simple and doesn't really have a LOT to work with, but there are some text fields which provide us a few opportunities.
The Name variable is useless on its own, but provides us the most to work with. A few obvious opportunities are:
- Names - perhaps having more (or fewer) names indicates something about your status that would affect your ability to get on a lifeboat?
- Title - How you are addressed can definitely indicate status (and gender), which had some influence on getting on a lifeboat.
- FamilyID - A great example of using creativity to tie together several variables: Trevor Stephens created a really interesting derived variable by identifying family members from last name and total family size. It's in R and I decided not to duplicate it here, but it's definitely worth a look.
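As a rough illustration of the first two ideas, here is a minimal sketch in plain Python. It assumes the names follow the "Surname, Title. Given names" layout seen in the Kaggle data (for example "Braund, Mr. Owen Harris"). The function names `count_names` and `extract_title` and the regular expression are my own choices for illustration, not necessarily what was used for this post.

```python
import re

# Assumption: the title sits between the comma after the surname and the
# first period that follows it, e.g. "Braund, Mr. Owen Harris" -> "Mr".
TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")


def count_names(name):
    """Count whitespace-separated tokens in the raw Name string.

    This is a crude proxy: the surname and title are counted as tokens too.
    """
    return len(name.split())


def extract_title(name):
    """Return the title found in a Name string, or '' if none is found."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else ""
```

With the data loaded into a pandas DataFrame (here assumed to be called `df`), these could be applied with something like `df["Name"].apply(extract_title)`. Rare titles would probably need to be grouped together before being fed to the model.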
There's not a lot to do with the Cabin variable, but a little research into the deck plans (or a little common sense) indicates that the letter in the Cabin variable is the deck, and the number is the room number. The room numbers increased towards the back of the boat, so perhaps that provides some useful measure of location. Additionally, different decks provide some information on location as well as socioeconomic status, which is again valuable in determining who gets on the lifeboats.
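Splitting the cabin value into a deck letter and a room number could look something like the sketch below. It assumes values shaped like "C85", uses only the first cabin when several are listed, and labels missing cabins with a placeholder deck "U"; the placeholder and the function name `parse_cabin` are assumptions made for illustration.

```python
import re

# One deck letter optionally followed by a room number, e.g. "C85" or "T".
CABIN_PATTERN = re.compile(r"([A-Za-z])(\d*)")


def parse_cabin(cabin):
    """Split a Cabin value into (deck, room_number).

    Missing values (None, NaN, empty strings) become ("U", None), where "U"
    is a placeholder for "unknown". If several cabins are listed, only the
    first one is used.
    """
    if not isinstance(cabin, str) or not cabin.strip():
        return ("U", None)
    match = CABIN_PATTERN.search(cabin)
    if match is None:
        return ("U", None)
    deck, room = match.groups()
    return (deck.upper(), int(room) if room else None)
```

Keeping the deck as a category and the room number as a separate numeric value lets the model use them independently. Rows with no room number still need some imputed or sentinel value before training.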
The Ticket variable is clearly ripe for extracting information, but it's not immediately clear what the values mean. Some quick googling didn't turn up any information on decoding the values, so we'll have to make some guesses. After sorting all the values and examining them, a few things give us some clues:
- About a quarter of the tickets have an alphanumeric prefix while the rest consist only of a number
- There are 45 distinct prefixes initially. If we remove '.' and '/' characters (which appear to be superfluous) and make a few other adjustments that number drops to 29.
- The number part of the value seems to have some loose correlations - numbers starting with 1 are usually first class tickets, 2 usually second, and 3 third. I say usually because it holds for a majority of examples but not all. There are also ticket numbers starting with 4-9, and those are rare and almost exclusively third class.
- I can't seem to notice any pattern to whether the ticket number is a 4, 5, or 6-digit number, but that may provide some amount of information as well.
- Several people can share a ticket number. This could be used to create another feature very similar to the FamilyID, except this would also cover situations like nannies or close friends, who would probably act like a family unit but aren't being captured in the FamilyID.
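The observations above can be turned into features along the lines of the following sketch. It assumes the number is the last whitespace-separated token of the ticket and that everything before it is the prefix. It only strips the '.' and '/' characters and does not attempt the "few other adjustments" mentioned above, since those aren't spelled out. All function names and the "NONE" placeholder are hypothetical.

```python
import re
from collections import Counter


def split_ticket(ticket):
    """Split a Ticket value into (prefix, number), both as strings."""
    parts = ticket.strip().split()
    if parts and parts[-1].isdigit():
        # The last token is the numeric part; anything before it is prefix.
        number = parts[-1]
        raw_prefix = "".join(parts[:-1])
    else:
        # No numeric part at all: treat the whole value as prefix.
        number = ""
        raw_prefix = "".join(parts)
    # Drop the '.' and '/' characters, which appear to be superfluous.
    prefix = re.sub(r"[./]", "", raw_prefix)
    return (prefix, number)


def ticket_features(ticket):
    """Build simple features from a single Ticket value."""
    prefix, number = split_ticket(ticket)
    return {
        "prefix": prefix or "NONE",  # placeholder for number-only tickets
        "first_digit": int(number[0]) if number else None,
        "num_digits": len(number),
    }


def shared_ticket_counts(tickets):
    """For each ticket, count how many passengers hold the same value."""
    counts = Counter(tickets)
    return [counts[t] for t in tickets]
```

The shared-ticket count is computed on the raw ticket string, so it is best calculated over the combined training and test sets if the goal is to find travelling groups.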
Here's the code:
In the next post, we'll take a look at automatically generating interaction variables and then testing them to remove redundant values.
Kaggle Titanic Tutorial in Scikit-learn
Part IV - Feature Engineering: Derived Variables