Let’s now evaluate our model using area under the ROC curve (AUC) as the metric.
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
val evaluator = new BinaryClassificationEvaluator()
import org.apache.spark.ml.param.ParamMap
val evaluatorParamMap = ParamMap(evaluator.metricName -> "areaUnderROC")
val aucTraining = evaluator.evaluate(trainingPredictions, evaluatorParamMap)
aucTraining: Double = 0.9999758519725919
val aucTest = evaluator.evaluate(testPredictions, evaluatorParamMap)
aucTest: Double = 0.6984384037015618
Our model’s AUC score is close to 1.0 for the training dataset but only about 0.70 for the test dataset. As mentioned earlier, a score closer to 1.0 indicates a perfect model and a score closer to 0.50 indicates a worthless model. Our model performs very well on the training dataset, but not nearly as well on the test dataset. A model generally performs well on the dataset that it was trained with, so training-set metrics are overly optimistic. The true performance of a model is indicated by how well it does on an unseen test dataset. That is the reason we reserved a portion of the dataset for testing.
One way to improve a model’s performance is to tune its hyperparameters. Spark ML provides a CrossValidator class that can help with this task. It requires a parameter grid over which it conducts a grid search to find the best hyperparameters using k-fold cross validation.
Let’s build a parameter grid that we will use with an instance of the CrossValidator class.
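A sketch of that parameter grid follows, built with Spark ML’s ParamGridBuilder. The specific values, and the stage names hashingTF and lr (assumed to be the HashingTF and LogisticRegression stages of the pipeline), are illustrative placeholders; substitute the stages and value ranges from your own pipeline.

```scala
import org.apache.spark.ml.tuning.ParamGridBuilder

// Build a grid of hyperparameter values to search over:
// 2 values for numFeatures x 3 values for regParam x 2 values for maxIter
// = 12 combinations in total.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10000, 100000))
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.maxIter, Array(20, 30))
  .build()
```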
This code creates a parameter grid consisting of two values for the number of features, three values for the regularization parameter, and two values for the maximum number of iterations. It can be used to do a grid search over 12 (2 × 3 × 2) different combinations of hyperparameter values. You can specify more options, but training will take longer, since grid search is a brute-force method that tries every combination in the parameter grid. As mentioned earlier, using a CrossValidator to do a grid search can be expensive in terms of CPU time.
We now have all the parts required to tune the hyperparameters for the Transformers and Estimators in our machine learning pipeline.
import org.apache.spark.ml.tuning.CrossValidator
val crossValidator = new CrossValidator()
.setEstimator(pipeline)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(10)
.setEvaluator(evaluator)
val crossValidatorModel = crossValidator.fit(trainingData)
The fit method in the CrossValidator class returns an instance of the CrossValidatorModel class. Similar to other model classes, it can be used as a Transformer that predicts a label for a given feature Vector.
Let’s evaluate the performance of this model on the test dataset.
val newPredictions = crossValidatorModel.transform(testData)
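With predictions in hand, the AUC can be computed the same way as before, reusing the evaluator and the ParamMap defined earlier. This is a sketch; the resulting score depends on your data and the grid searched.

```scala
// Evaluate the cross-validated model's predictions on the test set
// using the same area-under-ROC metric as before.
val aucNewModel = evaluator.evaluate(newPredictions, evaluatorParamMap)
```

If hyperparameter tuning helped, this AUC should be higher than the 0.698 obtained with the untuned model.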