About Dave

So far Dave has created 19 blog entries.
15 Dec, 2016

Using Category Encoders library in Scikit-learn

December 15th, 2016 | 1 Comment

I recently found a relatively new library on GitHub for handling categorical features, named categorical_encoding, and decided to give it a spin. As a reminder, categorical features are variables in your data that have a finite (ideally small) set of possible values, for example months of the year or hair color. You can't feed [...]
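As a rough illustration of what encoding a categorical feature can look like, here is a minimal one-hot encoding sketch in plain Python. It does not use the categorical_encoding library or its API; the function name `one_hot_encode` and the sample hair colors are assumptions made up for this example.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with one slot per category."""
    # Sort the distinct values so the column order is deterministic.
    categories = sorted(set(values))
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded


# Hypothetical sample data: one categorical feature with three possible values.
hair_colors = ["brown", "blond", "red", "brown"]
categories, encoded = one_hot_encode(hair_colors)

for color, row in zip(hair_colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, in the column that matches the original value, so a model receives numbers without any implied ordering between categories.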

18 Nov, 2016

TF-IDF Basics with Pandas and Scikit-Learn

November 18th, 2016 | 7 Comments

In a previous post we took a look at some basic approaches for preparing text data to be used in predictive models. In this post, we'll use pandas and scikit-learn to turn the product "documents" we prepared into a TF-IDF weight matrix that can be used as the basis of a feature set for [...]
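To make the idea of a TF-IDF weight matrix concrete, here is a small sketch in plain Python rather than the pandas and scikit-learn calls the post itself uses. It assumes the textbook weighting, term frequency times log(N / document frequency); libraries often apply smoothing and normalization on top of this, so their numbers will differ. The function name `tfidf_matrix` and the sample product strings are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] is the weight of term j in document i."""
    # Naive whitespace tokenization; real text usually needs more cleaning.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid dividing by zero for empty documents
        rows.append([
            (counts[term] / length) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, rows


# Hypothetical product "documents".
docs = ["cordless drill kit", "drill bit set", "garden hose"]
vocab, weights = tfidf_matrix(docs)

for doc, row in zip(docs, weights):
    print(doc, [round(w, 3) for w in row])
```

Terms that appear in several documents, such as "drill" here, get a lower weight than terms unique to one document, which is the property that makes the matrix useful as a feature set.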

24 Jun, 2016

A Shiny New Python Data Science Sandbox in 30 Minutes Or Less

June 24th, 2016 | 5 Comments

This post will give beginners a full walkthrough to go from nothing to a fully functional Linux/Python/pandas/scikit-learn environment with Jupyter as a front end. For exploratory work, I really like this stack. My native OS is Windows, but since we're using VMs I would imagine the setup for OS X is very similar and probably [...]

20 May, 2016

Investigating missing data with missingno

May 20th, 2016 | 0 Comments

I recently came across a new Python package for visualizing missing elements of a data set. This is super useful when you're taking your first look at a new data set and trying to get a feel for what you're working with. Having a sense of the completeness of the data can help inform decisions [...]

10 May, 2016

Text Pre-processing Basics with Pandas

May 10th, 2016 | 4 Comments

In this post, we'll take a look at the data provided in Kaggle's Home Depot Product Search Relevance challenge to demonstrate some techniques that may be helpful in getting started with feature generation for text data. Dealing with text data is considerably different from dealing with numerical data, so there are a few basic approaches that are [...]

7 Jul, 2015

Recommend-ify Zillow

July 7th, 2015 | 0 Comments

I love Zillow. It's such an amazing search interface for real estate. But that's it... it's just a search interface. And because it's just search, I have to sort through good properties and bad. Maybe that situation benefits their business model, which I won't pretend to know. However, with a little data science we could [...]

23 Apr, 2015

Tableau 9 – Binning by Aggregate with Level of Detail Expressions

April 23rd, 2015 | 1 Comment

We've recently worked on some visualizations in Tableau and overall it's been great. Tableau makes it absurdly easy to drag and drop your way to really slick, interactive visualizations. If you need to build visualizations and you've got the money for a license, it's well worth it. One task that was a bit of an issue [...]

16 Dec, 2014

Kaggle Titanic Competition Part XI – Summary

December 16th, 2014 | 6 Comments

This series was probably too long! I can't even remember the beginning, but once I started I figured I may as well be thorough. Hopefully it will provide some assistance to people who are getting started with scikit-learn and could use a little guidance on the basics. All the code is up on GitHub with instructions for [...]

16 Dec, 2014

Kaggle Titanic Competition Part X – ROC Curves and AUC

December 16th, 2014 | 0 Comments

In the last post, we looked at how to generate and interpret learning curves to validate how well our model is performing. Today we'll take a look at another popular diagnostic used to figure out how well our model is performing. The Receiver Operating Characteristic (ROC) curve is a chart that illustrates how the true [...]
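For readers who want a feel for the diagnostic before opening the post, here is a minimal sketch in plain Python, not the scikit-learn code from the series. It assumes the standard definition of an ROC curve, true positive rate plotted against false positive rate as the decision threshold is lowered, with AUC computed by the trapezoidal rule. The function names, labels, and scores are made up for illustration.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, from highest to lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical ground-truth labels (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
print(points)
print(round(area_under_curve(points), 4))
```

An AUC close to 1 means the model ranks most positive examples above most negative ones, while 0.5 corresponds to random ranking.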