munging

20 May, 2016

Investigating missing data with missingno

May 20th, 2016 | 0 Comments

I recently came across a new Python package for visualizing the missing elements of a data set. The aptly named "missingno" is especially useful when you're taking your first look at a new data set and trying to get a feel for what you're working with. Having a sense of how complete the data is can help inform decisions about how best to handle missing values. In this post, we'll take a quick look at the small and simple Shelter Animal Outcomes data set from one of the current Kaggle competitions.

Matrix visualization

The first visualization is the "matrix" display. [...]
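As a rough illustration of that first look at completeness, here is a minimal sketch. The `animals` frame, its column names, and its values are invented stand-ins, not the actual Kaggle data, and `plot_nullity_matrix` is a hypothetical helper name. It assumes pandas is installed, and missingno only if the plotting helper is called.

```python
import pandas as pd

# Invented toy data standing in for the shelter data set (assumption:
# the real columns and values differ).
animals = pd.DataFrame({
    "Name": ["Rex", None, "Milo", None],
    "AnimalType": ["Dog", "Cat", "Cat", "Dog"],
    "OutcomeSubtype": [None, "Partner", None, None],
})

# Share of missing values per column: a numeric summary of completeness.
missing_share = animals.isna().mean()
print(missing_share)


def plot_nullity_matrix(df: pd.DataFrame) -> None:
    """Draw missingno's matrix display for df (hypothetical helper)."""
    # Imported here so the rest of the sketch runs without missingno.
    import missingno as msno

    msno.matrix(df)
```

Calling `plot_nullity_matrix(animals)` in a notebook should render the matrix view, in which gaps in a column correspond to missing rows.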

10 May, 2016

Text Pre-processing Basics with Pandas

May 10th, 2016 | 4 Comments

In this post, we'll take a look at the data provided in Kaggle's Home Depot Product Search Relevance challenge to demonstrate some techniques that may be helpful in getting started with feature generation for text data. Dealing with text data is considerably different from dealing with numerical data, but a few basic approaches are an excellent place to start. As always, before we start creating features, we'll need to clean and massage the data! In the Home Depot challenge, we have a few files that provide attributes and descriptions of each of the products on the company's website. The [...]
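To give a feel for what "clean and massage" can mean for text columns, here is a minimal sketch using pandas string methods. The `products` frame, the `product_title` column, the sample titles, and the `clean_text` helper are all invented for illustration. They are not taken from the challenge files, and this is one plausible set of cleaning steps, not necessarily the ones the post goes on to use.

```python
import pandas as pd

# Invented sample titles (assumption: the real files use different
# column names and contents).
products = pd.DataFrame({
    "product_title": [
        "  Steel Angle Bracket, 12-Gauge ",
        "Deck Coating (1-gal.)",
    ],
})


def clean_text(series: pd.Series) -> pd.Series:
    """Lowercase, replace punctuation with spaces, and tidy whitespace."""
    return (
        series.str.lower()
        .str.replace(r"[^a-z0-9\s]", " ", regex=True)  # punctuation -> space
        .str.replace(r"\s+", " ", regex=True)          # collapse runs of whitespace
        .str.strip()                                   # trim leading/trailing spaces
    )


products["title_clean"] = clean_text(products["product_title"])

# A first, very simple text feature: number of words in the cleaned title.
products["title_word_count"] = products["title_clean"].str.split().str.len()
print(products[["title_clean", "title_word_count"]])
```

Replacing punctuation with a space instead of deleting it keeps hyphenated terms such as "12-Gauge" as two separate tokens. Whether that is desirable depends on the features you plan to build.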
