Machine learning is often touted as:
“A field of study that gives computers the ability to learn without being explicitly programmed.”
Despite this common claim, anyone who has worked in the field knows that designing effective machine learning systems is a tedious endeavor that typically requires considerable experience with machine learning algorithms, expert knowledge of the problem domain, and brute-force search. Thus, contrary to what machine learning enthusiasts would have us believe, machine learning still requires a considerable amount of explicit programming.
In this article, we’re going to go over three aspects of machine learning pipeline design that tend to be tedious but nonetheless important. After that, we’re going to step through a demo for a tool that intelligently automates the process of machine learning pipeline design, so we can spend our time working on the more interesting aspects of data science.
Let’s get started.
Model hyperparameter tuning is important
One of the most tedious parts of machine learning is model hyperparameter tuning.
Support vector machines require us to select the ideal kernel, the kernel’s parameters, and the penalty parameter C. Artificial neural networks require us to tune the number of hidden layers, number of hidden nodes, and many more hyperparameters. Even random forests require us to tune the number of trees in the ensemble at a minimum.
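As a rough illustration of where these hyperparameters live in practice, the sketch below sets them explicitly on scikit-learn estimators. The specific values are arbitrary placeholders chosen for illustration, not recommendations, and MLPClassifier is assumed here as a stand-in for an artificial neural network.

```python
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# Support vector machine: kernel, kernel parameter (gamma), and penalty C
svm = SVC(kernel='rbf', gamma='scale', C=1.0)

# Neural network: two hidden layers with 100 and 50 nodes (illustrative sizes)
mlp = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500)

# Random forest: number of trees in the ensemble
forest = RandomForestClassifier(n_estimators=100)

# get_params() lists every hyperparameter an estimator exposes
for model in (svm, mlp, forest):
    print(type(model).__name__, len(model.get_params()), 'hyperparameters')
```

Printing the parameter counts gives a sense of how large the search space is even before any values are chosen.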
All of these hyperparameters can have significant impacts on how well the model performs. For example, on the MNIST handwritten digit data set:
If we fit a random forest classifier with only 10 trees (scikit-learn’s default at the time of writing; since version 0.22 the default is 100):
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Load the tab-separated, gzip-compressed MNIST data set
    mnist_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/mnist.csv.gz', sep='\t', compression='gzip')

    # 10-fold cross-validation of a random forest with 10 trees
    cv_scores = cross_val_score(RandomForestClassifier(n_estimators=10, n_jobs=-1),
                                X=mnist_data.drop('class', axis=1).values,
                                y=mnist_data.loc[:, 'class'].values,
                                cv=10)

    print(cv_scores)
    [ 0.93461813 0.96287836 0.94688749 0.94072275 0.95114286 0.94570653 0.94884253 0.94311848 0.93825043 0.95668954]

    print(np.mean(cv_scores))
    0.946885709001

The random forest achieves an average of 94.7% cross-validation accuracy on MNIST. However, what if we tuned that hyperparameter a little bit and provided the random forest with 100 trees instead?
    # Same 10-fold cross-validation, now with 100 trees
    cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1),
                                X=mnist_data.drop('class', axis=1).values,
                                y=mnist_data.loc[:, 'class'].values,
                                cv=10)

    print(cv_scores)
    [ 0.96259814 0.97829812 0.9684466 0.96700471 0.966 0.96399486 0.97113461 0.96755752 0.96397942 0.97684391]

    print(np.mean(cv_scores))
    0.968585789367

With such a minor change, we improved the random forest’s average cross-validation accuracy from 94.7% to 96.9%. This small improvement in accuracy can translate into millions of additional digits classified correctly if we’re applying this model on the scale of, say, processing addresses for the U.S. Postal Service.
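To make the scale argument concrete, here is a small back-of-the-envelope calculation using the two mean accuracies reported above. The volume of 100 million digits is a purely hypothetical assumption chosen for illustration, not a real Postal Service figure.

```python
# Mean cross-validation accuracies reported above
acc_10_trees = 0.946885709001
acc_100_trees = 0.968585789367

# Hypothetical workload: this volume is an assumption, not a real figure
digits_processed = 100_000_000

gain = acc_100_trees - acc_10_trees
extra_correct = round(gain * digits_processed)

# Share of the 10-tree model's errors eliminated by the 100-tree model
error_reduction = gain / (1 - acc_10_trees)

print(f"Accuracy gain: {gain * 100:.2f} percentage points")
print(f"Additional digits classified correctly: {extra_correct:,}")
print(f"Relative reduction in errors: {error_reduction:.1%}")
```

Under that assumed volume, the gain works out to roughly 2.17 million additional digits classified correctly, and the 100-tree forest makes about 41% fewer errors than the 10-tree forest.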
Never use the defaults for your model. Hyperparameter tuning is vitally important for every machine learning project.
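One common way to take the manual work out of this kind of tuning is an exhaustive grid search. The following is a minimal sketch using scikit-learn’s GridSearchCV. It assumes the small digits data set bundled with scikit-learn as a stand-in for MNIST so that it runs offline, and the grid values are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small 8x8 digits data set bundled with scikit-learn (not the full MNIST)
X, y = load_digits(return_X_y=True)

# Candidate values to try; illustrative, not recommended settings
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_features': ['sqrt', 'log2'],
}

# Evaluate every combination in the grid with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=param_grid,
                      cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Grid search is still brute force: the number of fits grows multiplicatively with every hyperparameter added to the grid, which is part of why this step is so tedious to do well.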
Model selection is important
We all love to think that our favorite model will perform well on every machine learning problem, but different models are better suited for different tasks.
For example, if we’re working on a signal processing problem where we need to classify whether there’s a “hill” or “valley” in the time series:
And we apply a “tuned” random forest to the problem:
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Load the tab-separated, gzip-compressed Hill-Valley data set
    hill_valley_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/Hill_Valley_without_noise.csv.gz', sep='\t', compression='gzip')

    # 10-fold cross-validation of a random forest with 100 trees
    cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1),
                                X=hill_valley_data.drop('class', axis=1).values,
                                y=hill_valley_data.loc[:, 'class'].values,
                                cv=10)

    print(cv_scores)
    [ 0.64754098 0.64754098 0.57024793 0.61983471 0.62809917 0.61983471 0.70247934 0.59504132 0.49586777 0.65289256]

    print(np.mean(cv_scores))
    0.617937948787

Then we find that the random forest isn’t well suited to signal processing tasks like this one: it achieves a disappointing average of 61.8% cross-validation accuracy.
What if we tried a different model, for example a logistic regression?
    # Same 10-fold cross-validation with an untuned logistic regression
    cv_scores = cross_val_score(LogisticRegression(),
                                X=hill_valley_data.drop('class', axis=1).values,
                                y=hill_valley_data.loc[:, 'class'].values,
                                cv=10)

    print(cv_scores)
    [ 1. 1. 1. 0.99173554 1. 0.98347107 1. 0.99173554 1. 1.]

    print(np.mean(cv_scores))
    0.996694214876

We find that logistic regression is well suited to this signal processing task. In fact, it easily achieves near-100% cross-validation accuracy (99.7% on average) without any hyperparameter tuning at all.
Always try out many different machine learning models for every machine learning task that you work on. Trying out—and tuning—different machine learning models is another tedious yet vitally important step of machine learning pipeline design.
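A simple way to follow this advice is to loop over several candidate models and score each one with the same cross-validation setup. The sketch below assumes a synthetic data set generated with make_classification as a stand-in so that it runs offline; the choice of three models and their settings is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data so the example runs offline
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Candidate models to compare; the settings are illustrative
models = {
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'logistic regression': LogisticRegression(max_iter=1000),
    'support vector machine': SVC(),
}

# Mean 5-fold cross-validation accuracy for each candidate model
mean_scores = {name: cross_val_score(model, X, y, cv=5).mean()
               for name, model in models.items()}

# Report the candidates from best to worst
for name, score in sorted(mean_scores.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {score:.3f}')
```

Which model comes out on top depends entirely on the data, which is the point: the ranking on this synthetic set says nothing about the ranking on MNIST or the Hill-Valley data.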