Blog

Dec 0

Kaggle Titanic Competition Part XI – Summary

This series was probably too long! I can’t even remember the beginning, but once I started I figured I may as well be thorough. Hopefully it will be of some assistance to people who are getting started with scikit-learn and could use a little guidance on the basics. All the code is up on Github with instructions for […]

Dec 0

Kaggle Titanic Competition Part X – ROC Curves and AUC

In the last post, we looked at how to generate and interpret learning curves to validate how well our model is performing. Today we’ll take a look at another popular diagnostic for evaluating model performance. The Receiver Operating Characteristic (ROC) curve is a chart that illustrates how the true […]
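To make the idea concrete before diving into the full post, here is an illustrative sketch (not the post’s actual code): the AUC can be computed directly as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one.

```python
def auc_score(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    This is equivalent to the area under the ROC curve; ties count as
    half a win. O(n_pos * n_neg), so it's for illustration, not big data.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In practice you’d use `sklearn.metrics.roc_curve` and `roc_auc_score`, which the post covers.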

Dec 0

Kaggle Titanic Competition Part IX – Bias, Variance, and Learning Curves

In the previous post, we took a look at how we can search for the best set of hyperparameters to provide to our model. Our measure of “best” in this case is minimizing the cross-validated error. We can be reasonably confident that we’re doing about as well as we can with the features we’ve provided […]
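As a minimal sketch of the diagnostic this post covers (the toy data here is a hypothetical stand-in, not the actual Titanic features), scikit-learn’s `learning_curve` scores the model on increasing amounts of training data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Toy stand-in for the Titanic features (hypothetical data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Score the model on 10%, 32.5%, ..., 100% of the training data.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# A large, persistent gap between the two curves suggests high variance
# (overfitting); two low, converged curves suggest high bias (underfitting).
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```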

Dec 0

Kaggle Titanic Competition Part VIII – Hyperparameter Optimization

In the last post, we generated our first Random Forest model with mostly default parameters so that we could get an idea of how important the features are. From that we can further reduce the dimensionality of our data set by throwing out some arbitrary amount of the weakest features. We could continue experimenting with […]
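For a quick taste of the search described here, a minimal `GridSearchCV` sketch might look like this (the data and parameter grid are illustrative assumptions, not the post’s actual choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Exhaustively try every combination, scoring each by cross-validation.
param_grid = {"n_estimators": [10, 50], "max_features": ["sqrt", None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)  # the combination with the best CV score
```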

Dec 0

Kaggle Titanic Competition Part VII – Random Forests and Feature Importance

In the last post we took a look at how to reduce noisy variables in our data set using PCA, and today we’ll actually start modelling! Random Forests are one of the easiest models to run, and they’re highly effective as well. A great combination for sure. If you’re just starting out with a new problem, this […]
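The core move this post builds on, in sketch form (hypothetical data, not the post’s code): fit a forest and read off one importance score per feature.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 3 informative features out of 8.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# One impurity-based importance per feature; the scores sum to 1.
for i, imp in enumerate(model.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```

The weakest-scoring features are the candidates for removal.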

Nov 0

Kaggle Titanic Competition Part VI – Dimensionality Reduction

In the last post, we looked at how to use an automated process to generate a large number of non-correlated variables. Now we’re going to look at a very common way to reduce the number of features that we use in modelling. You may be wondering why we’d remove variables we just took the time […]
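A tiny sketch of the idea (the synthetic data here is an assumption for illustration): when features are highly correlated, PCA can compress them into far fewer components while keeping almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # hypothetical 10-feature data
# Make the last 5 columns near-duplicates of the first 5.
X[:, 5:] = X[:, :5] + 0.01 * rng.normal(size=(100, 5))

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (100, 5)
print(pca.explained_variance_ratio_.sum())   # near 1.0: little info lost
```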

Nov 0

Kaggle Titanic Competition Part V – Interaction Variables

In the last post we covered some ways to derive variables from string fields using intuition and insight. This time we’ll cover derived variables that are a lot easier to generate. Interaction variables capture effects of the relationship between variables. They are constructed by performing mathematical operations on sets of features. The simple approach that […]
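The “mathematical operations on sets of features” part can be sketched in a few lines (a simplified illustration, not the post’s actual helper): here we append every pairwise product as a new column.

```python
import numpy as np

def add_interactions(X):
    """Append the product of every pair of columns as new features."""
    n_samples, n_features = X.shape
    products = [X[:, i] * X[:, j]
                for i in range(n_features)
                for j in range(i + 1, n_features)]
    return np.column_stack([X] + products)

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
# 3 original columns + 3 pairwise products -> 6 columns
print(add_interactions(X))
```

Sums, differences, and ratios can be generated the same way; the feature count grows quadratically, which is one reason the next post turns to dimensionality reduction.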

Nov 0

Kaggle Titanic Competition Part IV – Derived Variables

In the previous post, we began taking a look at how to convert the raw data into features that can be used by the Random Forest model. Any variable that is generated from one or more existing variables is called a “derived” variable. We’ve discussed basic transformations that result in useful derived variables, and in […]
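A classic Titanic example of a derived variable, sketched here as an illustration (the function name and regex are my own, not necessarily the post’s): pulling the honorific out of the `Name` field, which follows a "Last, Title. First" format.

```python
import re

def extract_title(name):
    """Pull the honorific out of a 'Last, Title. First' style name."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "Unknown"

print(extract_title("Braund, Mr. Owen Harris"))  # Mr
print(extract_title("Heikkinen, Miss. Laina"))   # Miss
```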

Nov 0

Kaggle Titanic Competition Part III – Variable Transformations

In the last two posts, we’ve covered reading in the data set and handling missing values. Now we can start working on transforming the variable values into formatted features that our model can use. Different implementations of the Random Forest algorithm can accept different types of data. Scikit-learn requires everything to be numeric so we’ll […]
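Since scikit-learn needs numeric input, the simplest transformation is mapping each category to an integer code. A minimal hand-rolled sketch (scikit-learn’s `LabelEncoder` does the same job):

```python
def encode_column(values):
    """Map each distinct string value to an integer code."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

# Hypothetical 'Embarked'-style column.
codes, mapping = encode_column(["S", "C", "Q", "S", "C"])
print(codes)    # [2, 0, 1, 2, 0]
print(mapping)  # {'C': 0, 'Q': 1, 'S': 2}
```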

Nov 0

Kaggle Titanic Competition Part II – Missing Values

There will be missing/incorrect data in nearly every non-trivial data set a data scientist ever encounters. It is as certain as death and taxes. This is especially true with big data, and it applies to data generated by humans in a social context or by computer systems/sensors. Some predictive models are inherently able to deal with […]
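One of the simplest strategies the post discusses, sketched here in plain Python as an illustration: fill missing values with the median of the observed ones.

```python
def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = sorted(v for v in values if v is not None)
    mid = len(observed) // 2
    median = (observed[mid] if len(observed) % 2
              else (observed[mid - 1] + observed[mid]) / 2)
    return [median if v is None else v for v in values]

# Hypothetical 'Age'-style column with gaps.
print(impute_median([22.0, None, 38.0, 26.0, None]))  # median of 22, 26, 38 is 26
```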

Oct 0

Kaggle Titanic Competition Part I – Intro

In case you haven’t heard of Kaggle, it’s a data science competition site where companies/organizations provide data sets relevant to a problem they’re facing and anyone can attempt to build predictive models for the data set. The teams that best predict the data are rewarded with fame and fortune (or relatively minor sums of money […]

Oct 0

Automated Antenna Design with Machine Learning

Welcome to the Ultraviolet Analytics blog! This will be a sounding board for things we’re working on, things we find interesting and things we want to share. This will typically cover data science topics, but I won’t rule out an occasional cat video (what’s the point of a blog if you can’t post cat videos […]
