Optimizing Hyperparameters in Random Forest Classification
What hyperparameters are, how to choose hyperparameter values, and whether or not they’re worth your time
Reilly Meinert, Jun 5

In this post, I will be taking an in-depth look at hyperparameter tuning for Random Forest Classification models using several of scikit-learn’s packages for classification and model selection.
I will be analyzing the wine quality datasets from the UCI Machine Learning Repository.
For the purpose of this post, I have combined the individual datasets for red and white wine, and assigned both an extra column to distinguish the color of the wine, where 0 represents a red wine and 1 represents a white wine.
The purpose of this classification model is to determine whether a wine is red or white.
In order to optimize this model to create the most accurate predictions, I will be focusing solely on hyperparameter adjustment and selection.
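The combination step described above can be sketched with pandas. This is only an illustration under assumptions: the two small frames stand in for the UCI red and white wine files, the values are placeholders, and the variable names (red, white, wine) are hypothetical rather than taken from the original analysis.

```python
import pandas as pd

# Tiny stand-in frames; the real data would be loaded from the UCI wine quality files.
# The rows below are placeholders for illustration only.
red = pd.DataFrame({"fixed acidity": [7.4, 7.8], "alcohol": [9.4, 9.8]})
white = pd.DataFrame({"fixed acidity": [7.0, 6.3, 8.1], "alcohol": [8.8, 9.5, 10.1]})

# Add the label column described in the post: 0 = red, 1 = white.
red["color"] = 0
white["color"] = 1

# Stack the two tables and renumber the rows.
wine = pd.concat([red, white], ignore_index=True)

# Separate the features from the target the classifier will predict.
X = wine.drop(columns="color")
y = wine["color"]

print(wine["color"].value_counts().to_dict())
```

Keeping the color column as the target and dropping it from the features avoids leaking the answer into the model's inputs.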

What is a hyperparameter?

Most generally, a hyperparameter is a parameter of the model that is set prior to the start of the learning process.
Different models have different hyperparameters that can be set.
For a Random Forest Classifier, there are several different hyperparameters that can be adjusted.
In this post, I will be investigating the following four parameters:

n_estimators: The n_estimators parameter specifies the number of trees in the forest of the model. The default value for this parameter is 10 in scikit-learn releases before 0.22 (later releases default to 100), which means that 10 different decision trees will be constructed in the random forest.

max_depth: The max_depth parameter specifies the maximum depth of each tree. The default value for max_depth is None, which means that each tree will expand until every leaf is pure or contains fewer than min_samples_split samples. A pure leaf is one where all of the data on the leaf comes from the same class.

min_samples_split: The min_samples_split parameter specifies the minimum number of samples required to split an internal node. The default value for this parameter is 2, which means that an internal node must have at least two samples before it can be split to have a more specific classification.

min_samples_leaf: The min_samples_leaf parameter specifies the minimum number of samples required to be at a leaf node. The default value for this parameter is 1, which means that every leaf must have at least 1 sample that it classifies.

More documentation regarding the hyperparameters of a RandomForestClassifier() can be found here.
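One way to check the defaults in whatever scikit-learn version you have installed is to read them back from the estimator. The sketch below uses get_params(), which every scikit-learn estimator provides; the printed value of n_estimators will depend on your version.

```python
from sklearn.ensemble import RandomForestClassifier

# Build a forest without overriding any of the four hyperparameters.
forest = RandomForestClassifier(random_state=1)

# get_params() returns a dict of every constructor argument and its current value.
params = forest.get_params()

for name in ["n_estimators", "max_depth", "min_samples_split", "min_samples_leaf"]:
    print(name, "=", params[name])
```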

How do you adjust hyperparameters?

Hyperparameters can be adjusted manually when you call the function that creates the model.

    from sklearn.ensemble import RandomForestClassifier

    # min_samples_split must be an integer of at least 2 (or a float fraction of the samples)
    forest = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2)

How do you choose which hyperparameters to adjust?

Prior to beginning the adjustment of the hyperparameters, I performed an 80/20 train/test split on my data.
The different hyperparameter values would be evaluated on the training set; once the optimized values were chosen, a model would be constructed using the chosen parameters and the training set, and then tested on the testing set to see how accurately it classifies the types of wine.
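An 80/20 split like the one described above is commonly done with scikit-learn's train_test_split. The sketch below is an assumption about how such a split might look, not the original code: the data is synthetic (make_classification stands in for the wine features and color labels), and the random_state and stratify arguments are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine features (x) and color labels (y).
x, y = make_classification(n_samples=1000, n_features=12, random_state=1)

# Hold out 20% of the rows for testing; stratify keeps the class balance
# roughly the same in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1, stratify=y
)

print(x_train.shape, x_test.shape)
```

The testing set is then left untouched until the hyperparameter values have been chosen.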

    forest = RandomForestClassifier(random_state=1)
    modelF = forest.fit(x_train, y_train)
    y_predF = modelF.predict(x_test)

When trained on the training set with the default values for the hyperparameters, the model predicted the values of the testing set with an accuracy of 0.
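Accuracy here means the fraction of testing samples whose predicted color matches the true color. A minimal sketch with scikit-learn's accuracy_score follows; the two short label lists are toy values for illustration, not results from the wine data.

```python
from sklearn.metrics import accuracy_score

# Toy labels: 0 = red, 1 = white. These are illustrative, not real results.
y_test = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Four of the five predictions match, so the accuracy is 4 / 5.
acc = accuracy_score(y_test, y_pred)
print(acc)
```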

Validation Curves

There are a few different ways you can choose which hyperparameters to adjust for your model.
A good way to visually check for potentially optimized values of model hyperparameters is with a validation curve.
A validation curve can be plotted on a graph to show how well a model performs with different values of a single hyperparameter.
The following code was run to create the four validation curves seen here, with the values of param_name and param_range being adjusted accordingly for each of the four parameters that we are investigating.

    from sklearn.model_selection import validation_curve

    # Candidate values to test for n_estimators
    num_est = [100, 300, 500, 750, 800, 1200]

    train_scoreNum, test_scoreNum = validation_curve(
        RandomForestClassifier(),
        X=x_train, y=y_train,
        param_name='n_estimators',
        param_range=num_est, cv=3)

This validation curve was created with the values [100, 300, 500, 750, 800, 1200] as the different values to be tested for n_estimators.
In this image, we see that, when testing the values, the best value appears to be 750.
It is important to note that, even though there appears to be a large difference between the training and cross-validation score, the training set had an average accuracy of 100% for each of the three cross-validations, and the cross-validation set had between 99.5% and 99.6% accuracy for all the values of n_estimators, which shows that this model is very accurate regardless of the number of estimators used.
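The averages quoted above come from collapsing the per-fold scores that validation_curve returns. The sketch below shows that step on a small synthetic dataset with a short parameter range so it runs quickly; the data, the range, and the variable names are assumptions for illustration, not the post's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Small synthetic stand-in for the training data.
x_train, y_train = make_classification(n_samples=300, n_features=10, random_state=1)
num_est = [5, 10, 20]

# Both returned arrays have shape (number of values tested, number of folds).
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(random_state=1),
    X=x_train, y=y_train,
    param_name="n_estimators", param_range=num_est, cv=3,
)

# Average across the folds to get one training and one CV score per value.
train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)

for n, tr, cv in zip(num_est, train_mean, cv_mean):
    print(f"n_estimators={n}: train={tr:.3f}, cv={cv:.3f}")
```

Plotting train_mean and cv_mean against the parameter values gives the two lines of a validation curve.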
In this graph, we see that the highest accuracy value on the cross-validation is close to 99.3% when the max_depth is set to 15, which is the value that we will place in our model.
While it may seem better to choose a max_depth of 30, because that value has the highest accuracy for the training score, we elect not to in order to prevent our model from overfitting the training data.
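The choice described above, picking the value with the best cross-validation score rather than the best training score, can be written as a small helper. This is a hypothetical sketch: pick_best is not a scikit-learn function, and the fold scores below are made-up numbers shaped like validation_curve output, chosen only to show the two selections disagreeing.

```python
import numpy as np

def pick_best(param_range, scores):
    """Return the parameter value whose mean score across folds is highest."""
    mean_scores = np.asarray(scores).mean(axis=1)
    return param_range[int(np.argmax(mean_scores))]

# Made-up fold scores with shape (n_values, n_folds); not the post's results.
depths = [5, 15, 30]
train_scores = [[0.990, 0.991, 0.989], [0.998, 0.998, 0.999], [1.0, 1.0, 1.0]]
cv_scores = [[0.985, 0.987, 0.986], [0.993, 0.992, 0.994], [0.992, 0.992, 0.993]]

best_by_cv = pick_best(depths, cv_scores)        # selection used for the model
best_by_train = pick_best(depths, train_scores)  # tends to favor deeper trees

print(best_by_cv, best_by_train)
```

Selecting on the cross-validation score guards against rewarding a depth that merely memorizes the training folds.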
In this graph, we see that the accuracy actually goes down for both the training and cross-validation sets at higher values for min_samples_split, so we will choose 5 as our number for min_samples_split.
In this case, it makes sense that we would want a lower value for min_samples_split, as the default value for this parameter is 2.
As we choose higher values for the minimum number of samples required before splitting an internal node, we will have more general leaf nodes, which would have a negative effect on the overall accuracy of our model.
In this graph, we see that accuracy goes down for both the training and cross-validation sets for each additional increase in value of min_samples_leaf, so we will choose 1 for the value of our parameter, which again makes sense considering the default value for this parameter is 1.
It is important to note that, when constructing the validation curves, the other parameters were held at their default values.
For the purpose of this post, we will be using all of the optimized values together in a single model.
A new Random Forest Classifier was constructed, as follows:

    forestVC = RandomForestClassifier(random_state=1, n_estimators=750, max_depth=15, min_samples_split=5, min_samples_leaf=1)
    modelVC = forestVC.fit(x_train, y_train)
    y_predVC = modelVC.predict(x_test)

This model resulted in an accuracy of 0.993076923077, which was more accurate than our first model, but only by .

Exhaustive Grid Search

Another way to choose which hyperparameters to adjust is by conducting an exhaustive grid search or randomized search.
Randomized searches will not be discussed in this post, but further documentation regarding their implementation can be found here.
An exhaustive grid search takes in as many hyperparameters as you would like and tries every possible combination of their values, evaluating each combination with as many cross-validation folds as you ask it to perform.
An exhaustive grid search is a good way to determine the best hyperparameter values to use, but it can quickly become time consuming with every additional parameter value and cross-validation that you add.

    from sklearn.model_selection import GridSearchCV

    n_estimators = [100, 300, 500, 800, 1200]
    max_depth = [5, 8, 15, 25, 30]
    min_samples_split = [2, 5, 10, 15, 100]
    min_samples_leaf = [1, 2, 5, 10]

    # Every combination of these values will be tried
    hyperF = dict(n_estimators=n_estimators, max_depth=max_depth,
                  min_samples_split=min_samples_split,
                  min_samples_leaf=min_samples_leaf)

    # 3-fold cross-validation, run in parallel on all available cores
    gridF = GridSearchCV(forest, hyperF, cv=3, verbose=1, n_jobs=-1)
    bestF = gridF.fit(x_train, y_train)

The code shown here took over 25 minutes to run, but did choose hyperparameters that had 100% accuracy in predicting the training set.
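To see where the run time comes from, it helps to count the work the grid implies. The sketch below uses only the value lists and the cv = 3 setting from the grid search in this post; the variable names combinations, total_fits, and total_trees are mine.

```python
from itertools import product

n_estimators = [100, 300, 500, 800, 1200]
max_depth = [5, 8, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]
cv = 3

# Every combination of the four hyperparameters: 5 * 5 * 5 * 4.
combinations = list(product(n_estimators, max_depth, min_samples_split, min_samples_leaf))

# Each combination is fitted once per cross-validation fold.
total_fits = len(combinations) * cv

# Each fit grows n_estimators trees, so count the trees as well.
total_trees = sum(combo[0] for combo in combinations) * cv

print(len(combinations), total_fits, total_trees)  # 500 1500 870000
```

With its default refit behavior, GridSearchCV also trains the winning combination once more on the full training set, so the real count is one fit higher.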
The resulting “best” hyperparameters are as follows: max_depth = 15, min_samples_leaf = 1, min_samples_split = 2, n_estimators = 500.
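After fitting, the winning combination can be read from the search object itself. The sketch below uses a deliberately tiny grid on synthetic data so it finishes in seconds; the grid values and data are assumptions for illustration, while best_params_, best_score_, and best_estimator_ are standard GridSearchCV attributes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the training data.
x_train, y_train = make_classification(n_samples=300, n_features=10, random_state=1)

# A tiny illustrative grid: 2 * 2 = 4 combinations, 3 folds each.
small_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=1), small_grid, cv=3)
grid.fit(x_train, y_train)

# The combination with the best mean cross-validation score, and that score.
print(grid.best_params_)
print(round(grid.best_score_, 3))

# With the default refit=True, this model is already retrained on all of x_train.
best_model = grid.best_estimator_
```

Because best_estimator_ is already fitted, it can be used for prediction directly, as an alternative to constructing a new classifier from the reported values.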
Again, a new Random Forest Classifier was run using these values as hyperparameter inputs.

    forestOpt = RandomForestClassifier(random_state=1, max_depth=15, n_estimators=500, min_samples_split=2, min_samples_leaf=1)
    modelOpt = forestOpt.fit(x_train, y_train)
    y_pred = modelOpt.predict(x_test)

This model also resulted in an accuracy of 0.993076923077 when tested using the testing set.

Is adjusting hyperparameters worth it?

Carefully and methodically adjusting hyperparameters can be advantageous.
It can make your classification model more accurate, which will lead to more accurate predictions overall.
However, it may not always be worth your while.
Let’s take a look at the results of our different tests:

The biggest thing to note is the overall improvement in accuracy.
The hyperparameters chosen based on the results of the grid search and validation curve resulted in the same accuracy when the model was applied to our testing set: 0.993076923077.
This improved our original model’s accuracy on the testing set by .
Considering it took over 25 minutes to run the exhaustive grid search on our 4 desired hyperparameters, it may not have been worth the time in this case.
Additionally, two of the “optimized” hyperparameter values given to us by our grid search were the same as the default values for these parameters for scikit-learn’s Random Forest Classifier.
When looking at the confusion matrices for each of the two optimized models, we see that both resulted in the same number of incorrect predictions for both red and white wines, as shown here:

Conclusion

Hyperparameter tuning can be advantageous in creating a model that is better at classification.
In the case of a random forest, it may not be necessary, as random forests are already very good at classification.
Using exhaustive grid search to choose hyperparameter values can be very time consuming as well.
However, in cases where there are only a few potential values for your hyperparameters or when your initial classification model isn’t very accurate, it might be a good idea to at least investigate the effect of changing some of the hyperparameter values in your model.
Key Terms/Ideas: hyperparameters, validation curve, exhaustive grid search, cross-validation.