A step-by-step tutorial for setting up Universal Recommender, the most popular ML engine for PredictionIO, on Ubuntu from scratch
I recently found a relatively new library on GitHub for handling categorical features, named categorical_encoding, and decided to give it a spin...
In this post, we'll use pandas and scikit-learn to turn the product "documents" we prepared into a TF-IDF weight matrix that can be used as the basis of a feature set for modeling...
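To make the idea of a TF-IDF weight matrix concrete, here is a minimal sketch in plain Python rather than pandas and scikit-learn. It uses the textbook formula (term frequency times log(N / document frequency)), which is an assumption on my part: scikit-learn's TfidfVectorizer smooths the IDF term and L2-normalizes rows by default, so its numbers will differ. The function name and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, rows) where rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term at least once
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        length = len(toks) or 1  # avoid dividing by zero on an empty document
        rows.append([counts[tok] / length * idf[tok] for tok in vocab])
    return vocab, rows

# Hypothetical product "documents"
vocab, rows = tfidf_matrix(["steel hammer", "steel nail"])
print(vocab)
print(rows)
```

A term that appears in every document ("steel" here) gets a weight of zero, which is the whole point: it carries no information for telling the products apart.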
This post will give beginners a full walkthrough to go from nothing to a fully functional Linux/Python/pandas/scikit-learn environment with Jupyter as a front end. For exploratory work, I really like this stack. My native OS is Windows, but since we're using VMs I would imagine the setup for OS X is very similar and probably won't need any modification (other than the steps for configuring the VM). If you have a solid internet connection, we should be able to get this all done in under 30 minutes startiiiinnnnnng NOW... 1. Download an Ubuntu Desktop version of your choice. I like 14.04. ...
I recently came across a new python package for visualizing missing elements of a data set. This is super useful when you’re taking your first look at a new data set and trying to get a feel for what you’re working with...
In this post, we’ll take a look at the data provided in Kaggle’s Home Depot Product Search Relevance challenge to demonstrate some techniques that may be helpful in getting started with feature generation for text data
I love Zillow. It’s such an amazing search interface for real estate. With a little data science we could take the treasure trove of data they already have, add a few UI elements to capture some more, and provide personalized recommendations to house hunters.
If you've ever gotten stuck in Tableau 8 with binning by aggregate, here's a fix in Tableau 9...
ROC Curves help us quantify how well a binary classifier performs for both positive and negative examples
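As a rough illustration of what an ROC curve measures, the sketch below sweeps the decision threshold over a classifier's scores and records the false positive rate and true positive rate at each step. It is plain Python with hypothetical function names, not the code from the post, and it assumes the labels contain at least one positive and one negative example.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs from sweeping the threshold from high to low.

    labels: 1 for positive, 0 for negative. scores: higher means "more positive".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Only emit a point once every example sharing this score is counted
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up labels and scores for illustration
pts = roc_points([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
print(pts)
print(auc(pts))
```

The area under the curve equals the fraction of positive/negative pairs the classifier ranks in the right order, so 1.0 is a perfect ranking and 0.5 is no better than a coin flip.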
Explore the bias and variance of our model with Learning Curves
Hyperparameter optimization: algorithmically searching for the best set of parameters to use when training a model
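The simplest version of this search is an exhaustive grid: try every combination and keep the best. Here is a minimal sketch, assuming a hypothetical toy_score function that stands in for "train the model with these parameters and return its validation score". The parameter names are only illustrative.

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Evaluate every combination in param_grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for training a model and scoring it on held-out data.
# It peaks at max_depth=5, n_estimators=100.
def toy_score(max_depth, n_estimators):
    return -(max_depth - 5) ** 2 - (n_estimators - 100) ** 2 / 1000

grid = {"max_depth": [3, 5, 7], "n_estimators": [50, 100, 200]}
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

The cost grows multiplicatively with every parameter you add to the grid, which is why randomized search is often worth a look once the grid gets large.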
Random Forest is one of the easiest models to run, and highly effective as well. A great combination for sure. If you're just starting out with a new problem, this is a great algorithm to quickly build a reference model.
Dimensionality Reduction - mo' variables mo' problems
Interaction variables capture effects of the relationship between variables. They are constructed by performing mathematical operations on sets of features.
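For a feel of what "performing mathematical operations on sets of features" looks like, here is a small sketch that adds the pairwise product and sum of each pair of numeric features to a record. The helper name and the column names are hypothetical, and products and sums are just two of many operations you could use.

```python
from itertools import combinations

def add_interactions(row, features):
    """Return a copy of row with pairwise product and sum features added."""
    out = dict(row)
    for a, b in combinations(features, 2):
        out[f"{a}*{b}"] = row[a] * row[b]
        out[f"{a}+{b}"] = row[a] + row[b]
    return out

# Made-up record for illustration
record = {"age": 30.0, "fare": 7.5, "family_size": 2.0}
print(add_interactions(record, ["age", "fare", "family_size"]))
```

Three input features already produce six new columns this way, so the feature count climbs quickly as inputs are added.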
Derived Variables - use knowledge to turn raw data into valuable data
Transforming raw data into quantitative and qualitative variables
There will be missing/incorrect data in nearly every non-trivial data set a data scientist ever encounters. It is as certain as death and taxes.