Hyper-Parameter Tuning and Model Selection, Like a Movie Star

Coding, analyzing, selecting, and tuning like you really know what you’re doing.

Caleb Neale, Jun 5

Photo by Markus Spiske on Unsplash

“Hyper-parameter tuning for random forest classifier optimization” is one of those phrases that would sound just as at ease in a movie scene where hackers are aggressively typing to “gain access to the mainframe” as it does in a Medium article on Towards Data Science.

The reality of it, however, is that phrases like that are the unfortunate consequence of combining mathematical and computational concepts in one field, and worse, one name.

Though the concepts in this article will benefit from a solid understanding of fundamental Python modelling using scikit-learn and how some of these models work, I’ll attempt to explain everything from the bottom up so readers of all levels can enjoy and learn these concepts; you too can sound (and code) like a Hollywood hacker.

In this article, I will attempt to address:

What is a hyper-parameter and how does it differ from a parameter?
When should hyper-parameters be used?
What do hyper-parameters actually do?
How can hyper-parameters be tuned?
What is grid search?
What is pipelining?
How are individual hyper-parameters defined?

Skip to the end for a summary of all these topics.

What is a hyper-parameter?

The term hyper-parameter came about due to the increasing prevalence of machine learning in programming and big data.

Many people who began their journey as data scientists or programmers will know the word parameter as a value passed into a function, which the function then operates on or uses to guide its behavior.

However, in machine learning and modelling, parameters are not input by the programmer but rather developed by the machine learning model.

This is due to the fundamental differences between machine learning and traditional programming; in traditional programming, rules and data are input by the programmer in order to output results, whereas in machine learning, data and results are input in order to output rules (usually called parameters in this context).

This Google I/O 2019 talk addresses this flip pretty succinctly in the first few minutes.

If the model itself generates parameters, it would be quite confusing to call what we (programmers, data scientists, whatever) input into the model parameters as well.

This is the birth of the term hyper-parameter.

Hyper-parameters are input into any machine learning model which generates its own parameters in order to influence the values of said generated parameters in the hope of making the model more accurate.
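To make the distinction concrete, here is a minimal sketch. It assumes the Iris dataset and a logistic regression (both are my choices for illustration): C and max_iter are hyper-parameters we set before training, while coef_ and intercept_ are the parameters the model learns during fit().

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hyper-parameters: chosen by us, before training starts.
# (max_iter is raised only to avoid convergence warnings.)
model = LogisticRegression(C=1.0, max_iter=1000)

# Parameters: learned by the model from the data during fit().
model.fit(X, y)

print("hyper-parameter C:", model.get_params()["C"])
# One row of coefficients per class, one column per feature.
print("learned coefficients shape:", model.coef_.shape)
print("learned intercepts shape:", model.intercept_.shape)
```

Changing C and refitting will generally change the learned coefficients, which is exactly the influence described above.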

A little bit later in the article, I’ll show specific examples as well as defining what individual hyper-parameters are.

How are these individual hyper-parameters defined and what are their effects?

Let’s take a quick peek at scikit-learn’s documentation on Logistic Regression to better understand what this question really means.

    LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

As we can see here, LogisticRegression() takes in 15 different values, which we now know to be called hyper-parameters.

However, every single one of those 15 values is defined with a default value, meaning that it is very possible, even common, to create a LogisticRegression() object without specifying any hyper-parameters.

This is the case for nearly all models in scikit-learn.
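One quick way to see those defaults for yourself is get_params(), which every scikit-learn estimator provides. The sketch below only assumes that scikit-learn is installed; the exact list of names and default values depends on the installed version, so it may not match the signature quoted above.

```python
from sklearn.linear_model import LogisticRegression

# No arguments: every hyper-parameter falls back to its default.
default_model = LogisticRegression()

# get_params() returns a dict of hyper-parameter names and current values.
defaults = default_model.get_params()
for name, value in sorted(defaults.items()):
    print(f"{name} = {value!r}")
```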

As such, I’ll only take the time to define and explain some of the more relevant and commonly modified hyper-parameters for four common modelling methodologies.

Logistic regression:

penalty: specifies how the coefficients of non-contributing variables are penalized.
Lasso (L1) performs feature selection, as it shrinks the coefficients of less important features to zero.
Ridge (L2) keeps all variables in the model, though some coefficients are shrunk. It is less computationally intensive than lasso.
Both penalty values restrict solver choices, as seen here.

C: the inverse of the regularization term (1/lambda). It tells the model how heavily large parameters are penalized; smaller values result in stronger penalization. Must be a positive float.
Common values: [0.001, 0.1 … 10, 100]

class_weight: allows you to place greater emphasis on a class. For example, if the distribution between class 1 and class 2 is heavily imbalanced, the model can treat the two classes appropriately.
Default is that all weights = 1. Class weights can be specified in a dictionary.
“balanced” will create class weights that are inversely proportional to class frequencies, giving more weight to individual occurrences of smaller classes.
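As a rough illustration of how penalty and C interact, the sketch below fits an L1-penalized logistic regression at a small and a large value of C. The two-class subset of Iris, the liblinear solver, and the two C values are my assumptions, chosen to keep the example small; they are not taken from the article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Keep two classes only (versicolor vs. virginica) so that the
# 'liblinear' solver, which supports the L1 penalty, applies directly.
mask = y != 0
X_bin, y_bin = X[mask], y[mask]

results = {}
for C in (0.01, 100.0):
    # Small C = strong regularization, large C = weak regularization.
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    clf.fit(X_bin, y_bin)
    results[C] = clf.coef_.ravel()
    n_zero = int(np.sum(results[C] == 0))
    print(f"C={C}: coefficients={np.round(results[C], 3)}, zeroed={n_zero}")
```

The strongly regularized fit should show smaller coefficients overall, with some driven exactly to zero, which is the feature-selection effect of lasso described above.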

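Linear Regression:

fit_intercept: specifies whether the intercept of the model should be calculated or set to zero.
If False, the intercept of the regression line will be 0.
If True, the model will calculate the intercept.

normalize: specifies whether to normalize the data for the model using the L2 norm. (This option was deprecated in scikit-learn 1.0 and removed in 1.2; in newer versions, scale the data in a preprocessing step instead.)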
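The effect of fit_intercept is easy to check on toy data. The sketch below assumes synthetic points that lie exactly on the line y = 2x + 5 (made up for illustration) and fits the model both ways.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data on the line y = 2x + 5 (an assumption for illustration).
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel() + 5.0

with_intercept = LinearRegression(fit_intercept=True).fit(x, y)
without_intercept = LinearRegression(fit_intercept=False).fit(x, y)

# With an intercept, the model can recover both the slope and the offset.
print("fit_intercept=True :", with_intercept.coef_[0], with_intercept.intercept_)
# Without one, the line is forced through the origin, so the slope must absorb the offset.
print("fit_intercept=False:", without_intercept.coef_[0], without_intercept.intercept_)
```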

SVM

C: the inverse of the regularization term (1/lambda). It tells the model how heavily large parameters are penalized; smaller values result in stronger penalization. Must be a positive float.
A higher C will cause the model to misclassify fewer training points, but is much more likely to cause overfitting.
Good range of values: [0.001, 0.01, 10, 100, 1000…]

class_weight: sets the parameter C of class i to class_weight[i] * C. This allows you to place greater emphasis on a class. For example, if the distribution between class 1 and class 2 is heavily imbalanced, the model can treat the two classes appropriately.
Default is that all weights = 1. Class weights can be specified in a dictionary.
“balanced” will create class weights that are inversely proportional to class frequencies, giving more weight to individual occurrences of smaller classes.
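To see what “balanced” works out to in practice, the sketch below uses scikit-learn's compute_class_weight helper on a made-up label vector (90 samples of one class, 10 of the other; the data is hypothetical). The same numbers come out of the formula n_samples / (n_classes * samples_in_class).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Weights scikit-learn would use for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same weights, computed by hand: n_samples / (n_classes * count_per_class).
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))
print(manual)
```

The rare class ends up with a weight of 5.0 and the common class with roughly 0.56, so each rare sample counts about nine times as much during training.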

K-Nearest Neighbors

n_neighbors: determines the number of neighbors used when calculating the nearest-neighbors algorithm.
Good range of values: [2, 4, 8, 16]

p: the power parameter of the Minkowski metric, a fairly mathematically complex topic. When evaluating models, simply trying both 1 and 2 here is usually sufficient.
Use value 1 to calculate Manhattan distance.
Use value 2 to calculate Euclidean distance (default).

Random Forest

n_estimators: sets the number of decision trees to be used in the forest.

Default is 100.
Good range of values: [100, 120, 300, 500, 800, 1200]

max_depth: sets the max depth of each tree. If not set, there is no cap, and the tree will keep expanding until all leaves are pure. Limiting the depth is good for pruning trees to prevent over-fitting on noisy data.
Good range of values: [5, 8, 15, 25, 30, None]

min_samples_split: the minimum number of samples needed before a split (differentiation) is made in an internal node.
Default is 2.
Good range of values: [2, 5, 10, 15, 100] (an integer value must be at least 2)

min_samples_leaf: the minimum number of samples needed to create a leaf (decision) node.
Default is 1. This means that a split point at any depth will only be allowed if it leaves at least 1 sample on each branch.
Good range of values: [1, 2, 5, 10]

max_features: sets the number of features to consider when looking for the best node split.
Default is “auto”, which means that the square root of the number of features is used for every split in the tree.
“None” means that all features are used for each split.
Each decision tree in the random forest will typically use a random subset of features for splitting.
Good range of values: [log2, sqrt, auto, None]

How can hyper-parameters be tuned and what do they actually do?

To answer both of these questions, let’s tackle an example using the classic UC Irvine Iris dataset.

First we’ll load the dataset and import some of the packages we’ll use:

    # import packages
    import numpy as np
    from sklearn import linear_model, datasets
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline

    # Loading dataset
    iris = datasets.load_iris()
    features = iris.data
    target = iris.target

Now let’s create a quick model using no additional hyper-parameters and get the score for later evaluation.

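    # Create a logistic regression with default hyper-parameters
    logistic = linear_model.LogisticRegression()
    logistic.fit(features, target)
    print(logistic.score(features, target))

Output:

    0.96

Now let’s try some methods of hyper-parameter tuning to see if we can improve the accuracy of our model.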
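One caveat worth keeping in mind: the baseline score above is measured on the same data the model was trained on. As a hedged alternative, the sketch below estimates baseline accuracy with 5-fold cross-validation using cross_val_score. The max_iter value is my addition, there only to avoid convergence warnings, and the scores you see will depend on your scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

features, target = load_iris(return_X_y=True)

# Defaults everywhere except max_iter, which is raised to avoid warnings.
baseline = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold is scored on data the model did not train on.
cv_scores = cross_val_score(baseline, features, target, cv=5)

print("per-fold accuracy:", cv_scores)
print("mean accuracy:", cv_scores.mean())
```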

What is grid search?

Grid search is a method by which we create sets of possible values for each hyper-parameter, then test them against each other in a “grid.” For example, if I’d like to test a logistic regression with the penalty values [L1, L2] and the values of C as [1, 2], the GridSearchCV() method would test L1 with C=1, then L1 with C=2, then L2 with both values of C, creating a 2×2 grid and a total of four combinations.
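The 2×2 grid from this example can be listed explicitly with scikit-learn's ParameterGrid, which builds the same combinations that GridSearchCV() iterates over. This is only a sketch to show the enumeration; nothing is fitted here.

```python
from sklearn.model_selection import ParameterGrid

# Two candidate penalties and two candidate values of C.
grid = ParameterGrid({"penalty": ["l1", "l2"], "C": [1, 2]})

combinations = list(grid)
for combo in combinations:
    print(combo)

print("total combinations:", len(combinations))  # 2 x 2 = 4
```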

Let’s go through an example with our current dataset.

The verbose parameter dictates whether the function will print information as it runs, and the cv parameter sets the number of cross-validation folds.

Full documentation for GridSearchCV() can be found here.

    # Logistic regression with a solver that supports both penalties
    # ('liblinear' was the default solver when this article was written; 'saga' also supports both)
    logistic = linear_model.LogisticRegression(solver='liblinear')

    # Create range of candidate penalty hyperparameter values
    penalty = ['l1', 'l2']

    # Create range of candidate regularization hyperparameter values C
    # Choose 10 values, logarithmically spaced between 10^0 (1) and 10^4 (10,000)
    C = np.logspace(0, 4, 10)

    # Create dictionary of hyperparameter candidates
    hyperparameters = dict(C=C, penalty=penalty)

    # Create grid search, and pass in all defined values
    # (verbose=1 prints progress updates as the calculations complete)
    gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=1)

    # Select the best model and create a fit
    best_model = gridsearch.fit(features, target)

Now that our model has been created based on a larger search space, we can hope to see an improvement.

Let’s check:

    print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
    print('Best C:', best_model.best_estimator_.get_params()['C'])
    print("The mean accuracy of the model is:", best_model.score(features, target))

Output:

    Best Penalty: l1
    Best C: 7.742636826811269
    The mean accuracy of the model is: 0.98

That’s an accuracy improvement of 0.02, using the same model and adding a small variation in hyper-parameters.

Try experimenting with different sets of hyper-parameters and adding them to the hyper-parameter dict and running GridSearchCV() again.

Notice how adding many hyper-parameters quickly increases the computation time.
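Besides score(), a fitted GridSearchCV object exposes a few attributes that are handy when experimenting: best_params_, best_score_ (the mean cross-validated score of the winning combination), and cv_results_. The sketch below is a pared-down search over C only, so that it runs with the default solver; the max_iter value is my assumption, added to reduce convergence warnings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

features, target = load_iris(return_X_y=True)

# Only C is searched, so the default penalty and solver are left unchanged.
param_grid = {"C": np.logspace(0, 4, 10)}
search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)
search.fit(features, target)

print("best hyper-parameters:", search.best_params_)
# Mean accuracy across the 5 validation folds for the best candidate.
print("cross-validated accuracy:", search.best_score_)
# Accuracy of the refitted best model on the data it was trained on.
print("training-data accuracy:", search.score(features, target))
print("candidates evaluated:", len(search.cv_results_["params"]))
```

Comparing best_score_ with the training-data accuracy gives a feel for how optimistic a score on the training set can be.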

What is pipelining?

What if we want to test more than one algorithm with more than one hyper-parameter in order to find the best model possible? Pipelining allows us to do this in a code-efficient manner.

Let’s go through an example with our Iris dataset to see if we can improve on our logistic regression model.

    # Create a pipeline
    pipe = Pipeline([("classifier", RandomForestClassifier())])

    # Create a list of dictionaries with candidate learning algorithms and their hyperparameters
    search_space = [
        # Note: on scikit-learn versions where the default solver is 'lbfgs',
        # add "classifier__solver": ['liblinear'] here, since 'lbfgs' does not support 'l1'
        {"classifier": [LogisticRegression()],
         "classifier__penalty": ['l2', 'l1'],
         "classifier__C": np.logspace(0, 4, 10)},
        # newton-cg and sag do not support the L1 penalty, so this grid uses L2 only
        {"classifier": [LogisticRegression()],
         "classifier__penalty": ['l2'],
         "classifier__C": np.logspace(0, 4, 10),
         "classifier__solver": ['newton-cg', 'saga', 'sag', 'liblinear']},
        {"classifier": [RandomForestClassifier()],
         "classifier__n_estimators": [10, 100, 1000],
         "classifier__max_depth": [5, 8, 15, 25, 30, None],
         "classifier__min_samples_leaf": [1, 2, 5, 10, 15, 100],
         "classifier__max_leaf_nodes": [2, 5, 10]},
    ]

    # Create a grid search of the pipeline, then fit the best model
    gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1)

    # Fit grid search
    best_model = gridsearch.fit(features, target)

Notice how long this function takes to run.

In another article, I’ll talk about how to reduce run times and pick effective hyper-parameters, as well as combining a RandomizedSearchCV() with a GridSearchCV.
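To get a feel for why this search is slow, the number of candidate models can be counted before fitting anything. The sketch below rebuilds the same search space as above and uses scikit-learn's ParameterGrid purely to count combinations; no model is trained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid

search_space = [
    # 1 estimator x 2 penalties x 10 values of C = 20 candidates
    {"classifier": [LogisticRegression()],
     "classifier__penalty": ["l2", "l1"],
     "classifier__C": np.logspace(0, 4, 10)},
    # 1 estimator x 1 penalty x 10 values of C x 4 solvers = 40 candidates
    {"classifier": [LogisticRegression()],
     "classifier__penalty": ["l2"],
     "classifier__C": np.logspace(0, 4, 10),
     "classifier__solver": ["newton-cg", "saga", "sag", "liblinear"]},
    # 1 estimator x 3 x 6 x 6 x 3 = 324 candidates
    {"classifier": [RandomForestClassifier()],
     "classifier__n_estimators": [10, 100, 1000],
     "classifier__max_depth": [5, 8, 15, 25, 30, None],
     "classifier__min_samples_leaf": [1, 2, 5, 10, 15, 100],
     "classifier__max_leaf_nodes": [2, 5, 10]},
]

cv = 5
n_candidates = len(ParameterGrid(search_space))
n_fits = n_candidates * cv

print("candidate models:", n_candidates)   # 20 + 40 + 324 = 384
print("fits with cv=5:", n_fits)           # 384 * 5 = 1920
```

Most of those fits are random forests, including many with 1,000 trees each, which is where the bulk of the run time goes. GridSearchCV() also refits the best candidate once more on the full data at the end.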

After running the method, let’s check the results.

    print(best_model.best_estimator_)
    print("The mean accuracy of the model is:", best_model.score(features, target))

Output:

    Pipeline(memory=None, steps=[('classifier', LogisticRegression(C=7.742636826811269, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l1', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False))], verbose=False)
    The mean accuracy of the model is: 0.98

According to our pipeline search, a LogisticRegression() with the specified hyper-parameters performs better than a RandomForestClassifier() with any of the given hyper-parameters.

Interesting!

Okay, so we’ve used a pipeline method to make this all happen, but what does it actually do, and why did we pass in a RandomForestClassifier()?

The pipeline method allows us to pass in preprocessing methods as well as an algorithm we’d like to use to create a model with the data.

In this simple example we skipped the preprocessing step, but we still input a model.

The algorithm we input is simply a placeholder used to instantiate the pipe object; it is replaced by the contents of the search_space variable, which we pass into GridSearchCV().

A simplified post focusing just on pipeline can be found here.
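For completeness, here is a hedged sketch of a pipeline that does include a preprocessing step. The StandardScaler step, the step names "scaler" and "classifier", and the max_iter value are my assumptions for illustration. Note how a hyper-parameter is addressed as step name, double underscore, parameter name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

features, target = load_iris(return_X_y=True)

# Step 1 scales the features; step 2 is the model being tuned.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# "<step name>__<hyper-parameter>" routes each value to the right step.
param_grid = {"classifier__C": np.logspace(0, 4, 10)}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(features, target)

print("best hyper-parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```

Because the scaler lives inside the pipeline, it is refit on the training portion of every fold, so the validation folds never leak into the preprocessing.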

The difference between the accuracies of our original, baseline model, and the model generated with our hyper-parameter tuning shows the effects of hyper-parameter tuning.

By guiding the creation of our machine learning models, we can improve their performance and create better and more reliable models.

Summary

What is a hyper-parameter and how does it differ from a parameter?
A hyper-parameter is used in a machine learning model to better guide the creation of the parameters which the model uses to generate predictions on data. Hyper-parameters are set by the programmer, whereas parameters are generated by the model.

When should hyper-parameters be used?
Always! Models usually have built-in default hyper-parameters which can serve most purposes. However, in many cases additional performance can be squeezed out of models using hyper-parameter tuning. Knowing the limitations and effects of different hyper-parameters can help limit negative effects like overfitting while increasing performance.

What do hyper-parameters actually do?
Simply, they change the ways in which the model approaches finding its parameters. Individual definitions can be found in the article above.

How can hyper-parameters be tuned?
Grid search, random search, and pipelining are common methods. Random search isn’t addressed in this article but you can read more here.

What is grid search?
Grid search is an exhaustive test of every combination of the hyper-parameter values passed into the GridSearchCV() function. Because it is exhaustive, it is quite computationally expensive on large search spaces.

What is pipelining?
Pipelining allows the searching of multiple algorithms with many hyper-parameters each. It is a very code-efficient way of testing many models in order to select the best possible one. Additionally, it can handle preprocessing methods as well, allowing for further control of the process.

Finally, below are some functions which can perform a few different types of hyper-parameter tuning just by passing in the arguments.

A Google Colab notebook containing all of the code used in this article can also be found here.

Using these functions, you can efficiently perform hyper-parameter tuning in just one line!

    # Hyperparameter tuning and model selection
    import numpy as np
    from sklearn import linear_model
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from sklearn.pipeline import Pipeline


    def perform_gridsearch_log(features, labels,
                               log_params={'penalty': ['l1', 'l2'], 'C': np.logspace(0, 4, 10)},
                               cv=5, verbose=1):
        """Grid search over penalty and C for a logistic regression."""
        global best_model
        # liblinear supports both the 'l1' and 'l2' penalties
        logistic = linear_model.LogisticRegression(solver='liblinear')
        hyperparameters = dict(C=log_params['C'], penalty=log_params['penalty'])
        gridsearch = GridSearchCV(logistic, hyperparameters, cv=cv, verbose=verbose)
        # Fit grid search
        best_model = gridsearch.fit(features, labels)
        print(best_model.best_estimator_)
        print("The mean accuracy of the model is:", best_model.score(features, labels))


    def rand_forest_rand_grid(features, labels,
                              n_estimators=[int(x) for x in np.linspace(start=200, stop=2000, num=10)],
                              # 1.0 uses all features (what 'auto' meant for regressors
                              # before it was removed in scikit-learn 1.3)
                              max_features=[1.0, 'sqrt'],
                              max_depth=[int(x) for x in np.linspace(10, 110, num=11)],
                              min_samples_split=[2, 5, 10],
                              min_samples_leaf=[1, 2, 4],
                              bootstrap=[True, False]):
        """Randomized search over random forest regressor hyper-parameters."""
        global best_model
        # Copy before adding None so the default list is not mutated between calls
        max_depth = list(max_depth) + [None]
        random_grid = {'n_estimators': n_estimators,
                       'max_features': max_features,
                       'max_depth': max_depth,
                       'min_samples_split': min_samples_split,
                       'min_samples_leaf': min_samples_leaf,
                       'bootstrap': bootstrap}
        rf = RandomForestRegressor()
        rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                                       n_iter=100, cv=3, verbose=1, random_state=42, n_jobs=-1)
        best_model = rf_random.fit(features, labels)
        print(best_model.best_estimator_)
        # score() returns R^2 for a regressor, not accuracy
        print("The R^2 score of the model is:", best_model.score(features, labels))


    def rand_forest_grid_search(features, labels,
                                n_estimators=[int(x) for x in np.linspace(start=200, stop=2000, num=10)],
                                max_features=[1.0, 'sqrt'],
                                max_depth=[int(x) for x in np.linspace(10, 110, num=11)],
                                min_samples_split=[2, 5, 10],
                                min_samples_leaf=[1, 2, 4],
                                bootstrap=[True, False]):
        """Exhaustive grid search over random forest regressor hyper-parameters."""
        global best_model
        param_grid = {'n_estimators': n_estimators,
                      'max_features': max_features,
                      'max_depth': max_depth,
                      'min_samples_split': min_samples_split,
                      'min_samples_leaf': min_samples_leaf,
                      'bootstrap': bootstrap}
        rf = RandomForestRegressor()
        grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=1)
        best_model = grid_search.fit(features, labels)
        print(best_model.best_estimator_)
        # score() returns R^2 for a regressor, not accuracy
        print("The R^2 score of the model is:", best_model.score(features, labels))


    def execute_pipeline(features, labels,
                         search_space=[
                             # Note: where the default solver is 'lbfgs', add
                             # "classifier__solver": ['liblinear'] here, since 'lbfgs' does not support 'l1'
                             {"classifier": [LogisticRegression()],
                              "classifier__penalty": ['l2', 'l1'],
                              "classifier__C": np.logspace(0, 4, 10)},
                             # newton-cg and sag do not support the L1 penalty, so this grid uses L2 only
                             {"classifier": [LogisticRegression()],
                              "classifier__penalty": ['l2'],
                              "classifier__C": np.logspace(0, 4, 10),
                              "classifier__solver": ['newton-cg', 'saga', 'sag', 'liblinear']},
                             {"classifier": [RandomForestClassifier()],
                              "classifier__n_estimators": [10, 100, 1000],
                              "classifier__max_depth": [5, 8, 15, 25, 30, None],
                              "classifier__min_samples_leaf": [1, 2, 5, 10, 15, 100],
                              "classifier__max_leaf_nodes": [2, 5, 10]}],
                         cv=5, verbose=0, n_jobs=-1):
        """Grid search over several algorithms and their hyper-parameters via a pipeline."""
        global best_model
        # The classifier given here is a placeholder; search_space supplies the real candidates
        pipe = Pipeline([("classifier", RandomForestClassifier())])
        gridsearch = GridSearchCV(pipe, search_space, cv=cv, verbose=verbose, n_jobs=n_jobs)
        # Fit grid search
        best_model = gridsearch.fit(features, labels)
        print(best_model.best_estimator_)
        print("The mean accuracy of the model is:", best_model.score(features, labels))

Thanks for reading!