Hyperparameter Tuning with Callbacks in Keras
Abhishek Rajbhoj · May 5

Why is this important?
Applied Machine Learning is an empirical process: you need to try out different settings of hyperparameters and deduce which settings work best for your application.

This process is popularly known as Hyperparameter Tuning.

These hyperparameters could be the learning rate (alpha), the number of iterations, the mini-batch size, etc.

Goal
Tuning is generally performed by observing the trend of the cost function over successive iterations.

A good machine learning model has a cost function that decreases continuously until it reaches a certain minimum.

This article showcases a simple approach to visualizing the minimization of the cost function with the help of a contour plot, for a Keras model.

For our example, we will consider a Univariate Linear Regression problem of predicting the sales of a particular product based on the amount of money spent on advertising.

Note: Though the problem chosen is fairly simple, this technique works for Deep Neural Networks as well.

Background
Cost Function and Gradient Descent
The cost function is a measure of how wrong the model is, in terms of its ability to estimate the relationship between the input and the corresponding output.

In simpler terms:

"How badly your model performs"

Gradient descent, on the other hand, is a technique to minimize the cost function by repeatedly updating the values of the parameters of the network.

The goal of gradient descent could be thought of as:

"Tweak parameters iteratively till you reach the local minimum"

The cost function for Linear Regression is usually the mean squared error, which is explained beautifully here.
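To make this idea concrete, here is a toy, hypothetical one-parameter example (not from the article's dataset) showing gradient descent minimizing J(w) = (w − 3)², whose minimum lies at w = 3:

```python
def gradient_descent(w0, lr=0.1, steps=100):
    """Minimize J(w) = (w - 3)^2 starting from w0."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)  # derivative dJ/dw at the current w
        w = w - lr * grad   # "tweak the parameter" against the gradient
    return w

w_final = gradient_descent(w0=0.0)
print(w_final)  # converges close to the minimum at w = 3
```

Each step moves the parameter a small amount (scaled by the learning rate) in the direction that decreases the cost; the same mechanism, applied to all weights at once, is what Keras optimizers do under the hood.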

Let's start…
Description
The Advertising.csv file contains the advertising budget allotted to various sources (TV, radio, newspaper) and their effect on the sales of a particular product.

As our focus is on univariate regression, we shall consider only the budget allotted to TV as our independent variable.

The code and data for this article can be found here.

After loading the csv file into a pandas dataframe and dropping the unnecessary columns…

```python
df = pd.read_csv('path/to/file/Advertising.csv')
df.drop(['Unnamed: 0', 'radio', 'newspaper'], axis=1, inplace=True)
X = df['TV']
Y = df['sales']
df.head()
```

… the final dataframe will look like this:

Later, we split the data into training and test sets:

```python
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
```

Now, note that Keras doesn't explicitly provide a Linear Regression model like scikit-learn does.

But we can emulate Linear Regression using a Dense layer with a single neuron.

```python
model = Sequential()
model.add(Dense(1, activation='linear', use_bias=True, input_dim=1))
model.compile(optimizer=optimizers.RMSprop(lr=0.01),
              loss='mean_squared_error', metrics=['mae'])
```

The designed model will look something like this:

Univariate Linear Regression Network

Training the model, we get a fairly acceptable prediction plot.

Plotting the Cost Function (J)
The cost function of Linear Regression is the mean squared error:

J(W, B) = (1/m) · Σᵢ (W·xᵢ + B − yᵢ)²

From the equation, it is clear that what we need in order to visualize cost minimization are the weights (and bias) of the layer, updated after each iteration.

If we could somehow access the weights of the layer, we would easily be able to visualize the cost minimization/gradient descent.

Keras provides a get_weights() function for the users to access the weights of the network layer.

```python
model.get_weights()
```

But this function returns the final weights (and bias) of the model after training.

We need a way to access the weights at the end of each iteration (or each batch).

To enable this, we will make use of a callback.

Defining a callback in Keras
Keras callbacks help you fix bugs more quickly and build better models.

A callback is a set of functions to be applied at given stages of the training procedure.

You can use callbacks to get a view on internal states and statistics of the model during training.

This is exactly what we need, as we can now call get_weights() after each mini-batch (i.e. after each iteration).

The weights are stored in a weight_history list to be accessed later.

A separate list is also maintained for the bias terms.

```python
weight_history = []
bias_history = []

class MyCallback(keras.callbacks.Callback):
    def on_batch_end(self, batch, logs):
        weight, bias = model.get_weights()
        B = bias[0]
        W = weight[0][0]
        weight_history.append(W)
        bias_history.append(B)

callback = MyCallback()
```

The created callback is passed along with the inputs and outputs when training the model.

```python
model.fit(X_train, Y_train, epochs=10, batch_size=10,
          verbose=True, callbacks=[callback])
```

Now, the stored weights can be used to plot the cost function (J) with respect to the weight (W) and bias (B).

The descent path on the contour plot is drawn solely on the basis of weight_history and bias_history.

The cost along the path does not need to be computed separately.
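A minimal sketch of how such a plot could be produced (assuming numpy/matplotlib and the X_train, Y_train, weight_history, and bias_history variables from the snippets above; the grid ranges and 20 contour levels are arbitrary choices, and the background cost surface is evaluated on a grid of candidate W and B values):

```python
import numpy as np
import matplotlib.pyplot as plt

def cost_grid(x, y, W, B):
    """Mean squared error J evaluated at every (W, B) pair of a meshgrid."""
    WW, BB = np.meshgrid(W, B)
    # Broadcast over the training examples, then average: shape (len(B), len(W)).
    J = np.mean((WW[..., None] * x + BB[..., None] - y) ** 2, axis=-1)
    return WW, BB, J

def plot_cost_contour(X_train, Y_train, weight_history, bias_history):
    x = np.asarray(X_train, dtype=float)
    y = np.asarray(Y_train, dtype=float)

    # Grid of candidate (W, B) values enclosing the recorded descent path.
    W = np.linspace(min(weight_history) - 0.05, max(weight_history) + 0.05, 100)
    B = np.linspace(min(bias_history) - 2.0, max(bias_history) + 2.0, 100)
    WW, BB, J = cost_grid(x, y, W, B)

    cs = plt.contour(WW, BB, J, levels=20)
    plt.clabel(cs, inline=True, fontsize=8)
    plt.plot(weight_history, bias_history, 'r.-')  # path gradient descent followed
    plt.xlabel('Weight (W)')
    plt.ylabel('Bias (B)')
    plt.title('Cost Function w.r.t. Weight and Bias')
    plt.show()
```

The red trajectory comes entirely from the lists recorded by the callback; the labelled contour lines come from evaluating J over the grid.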

Cost Function w.r.t. Weight and Bias

Interpreting a contour plot
The basic intuition of a contour plot is that the continuous lines (called contour lines) each represent a constant magnitude, and the magnitude increases as we move from the middle of the plot outwards.

The magnitude of each contour line is labelled; here it represents the possible values of the cost function (J).

You can roughly observe that the cost (the red line) starts close to 5000 and keeps decreasing until it settles at a particular point.

This corresponds to the loss values plotted earlier, which were also computed as the mean squared error.

Note: The discrepancy between the two plots is due to the fact that the mean squared error (above) is calculated on the validation split, while the contour plot is plotted using the entire training data.

What works too?
Plotting the loss function over iterations, as above, can also be used for Hyperparameter Tuning.

In fact, that is the technique most commonly used by data scientists.

Why use contour plots?
The advantage of contour plots is that they give better intuition about the path followed by the gradient descent algorithm with respect to the updates in the model/network parameters over the iterations.

Additionally…
As we already have access to the model parameters, it may be worthwhile to observe the trends they follow over time.

Weight updates over time

Bias updates over time

Thus, the paths followed by the weight and bias of our model on their way to the local minimum of the cost function can be observed.

Now that you have access to all the plots, you can efficiently check whether your model learns slowly or overshoots (learning rate), whether mini-batching yields observable benefits, the ideal number of iterations (or even epochs), etc.
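As a minimal sketch (assuming the weight_history and bias_history lists recorded by the callback above), the two parameter-trend plots can be produced with matplotlib:

```python
import matplotlib.pyplot as plt

def plot_parameter_trends(weight_history, bias_history):
    """Plot the recorded weight and bias values against the batch index."""
    iterations = range(len(weight_history))

    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
    ax1.plot(iterations, weight_history)
    ax1.set_ylabel('Weight (W)')
    ax1.set_title('Weight updates over time')

    ax2.plot(iterations, bias_history)
    ax2.set_ylabel('Bias (B)')
    ax2.set_xlabel('Iteration (batch)')
    ax2.set_title('Bias updates over time')

    fig.tight_layout()
    plt.show()
    return fig  # returned so the figure can be inspected or saved
```

Flat stretches in these curves suggest the learning rate is too small, while oscillation around the final value suggests it is too large.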