Exploring Learning Rates to improve model performance in Keras
Guncha Garg, Jun 5

The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated.
Choosing the learning rate is challenging because a very small value may result in a long training process that could get stuck, whereas a very large value may result in learning a sub-optimal set of weights too fast or an unstable training process.

Transfer Learning
We use Transfer Learning to apply a trained Machine Learning model to a different but related task.
This works well in Deep Learning, which uses Neural Networks consisting of layers.
Especially in computer vision, earlier layers in these networks tend to learn to recognise more general features.
For example, they detect things like edges, gradients, etc.
Reusing these general features is a proven way to get much better results in computer vision tasks.
Most pre-trained architectures (ResNet, VGG, Inception, etc.) are trained on ImageNet, and depending on how similar your data is to the ImageNet images, these weights will need to be altered to a greater or lesser extent.
In the fast.ai course, Jeremy Howard explores different learning rate strategies for transfer learning to improve model performance in terms of both speed and accuracy.

Differential Learning
The intuition behind differential learning comes from the fact that, while fine-tuning a pre-trained model, the layers closer to the input are more likely to have learned more general features.
Thus, we don’t want to change them much.
However, as we move deeper into the model, we want to modify the weights to a larger extent so as to adapt to the task and data at hand.
The phrase ‘Differential Learning Rates’ refers to using different learning rates on different parts of the network, with a lower learning rate in the initial layers and gradually higher learning rates in the later layers.
Figure: Sample CNN with Differential Learning Rate

Implementing Differential Learning Rate in Keras
In order to implement differential learning in Keras, we need to modify the optimizer source code.
Code: Adam Optimizer source code in Keras
We modify the above source code to incorporate the following:
- The __init__ function is modified to include:
  - Split layers: split_1 and split_2 are the names of the layers where the first and second splits are to be made, respectively.
  - Parameter lr is modified to accept a list of learning rates. A list of 3 learning rates is accepted (since the architecture is split into 3 different segments).
- While updating the learning rate of each layer, the initial code iterates through all the layers and assigns each one a learning rate. We alter this to apply different learning rates to different layers.
Code: Updated Optimizer Code with Differential Learning Rate
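The split logic described above can be sketched in plain Python, independent of the optimizer itself. This is only an illustration, not the modified Keras optimizer: the function name `assign_learning_rates`, the layer names and the learning rate values are hypothetical, and the sketch assumes the split layer itself belongs to the later segment, which the description does not specify.

```python
def assign_learning_rates(layer_names, split_1, split_2, lrs):
    """Map each layer name to one of three learning rates.

    Layers before `split_1` get lrs[0], layers from `split_1` up to (but not
    including) `split_2` get lrs[1], and all remaining layers get lrs[2].
    """
    if len(lrs) != 3:
        raise ValueError("expected exactly three learning rates")
    first = layer_names.index(split_1)
    second = layer_names.index(split_2)
    if first > second:
        raise ValueError("split_1 must come before split_2")

    rates = {}
    for position, name in enumerate(layer_names):
        if position < first:
            rates[name] = lrs[0]   # early layers: smallest changes
        elif position < second:
            rates[name] = lrs[1]   # middle layers
        else:
            rates[name] = lrs[2]   # late layers: largest changes
    return rates


# Hypothetical layer names for a small CNN (not taken from the article).
layers = ["conv1", "conv2", "conv3", "conv4", "dense1", "dense2"]
layer_lrs = assign_learning_rates(layers, "conv3", "dense1", [1e-4, 1e-3, 1e-2])
```

In a modified optimizer, a lookup like this would decide which of the three rates is used when each layer's weights are updated.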

Stochastic Gradient Descent with Restarts (SGDR)
With each batch of stochastic gradient descent (SGD), ideally the network should get closer and closer to the global minimum value for the loss.
Thus, it makes sense to reduce the learning rate as the training progresses, such that the algorithm does not overshoot and settles as close to the minimum as possible.
With cosine annealing, we can decrease the learning rate following a cosine function.
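As a rough sketch, one common form of cosine annealing moves the learning rate from a maximum to a minimum along half a cosine wave: lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * step / total_steps)). The function name and the values used here (200 steps, a maximum of 0.1, a minimum of 0) are assumptions chosen for illustration.

```python
import math


def cosine_annealing(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` when annealing from lr_max towards lr_min."""
    fraction = step / total_steps  # 0.0 at the start, approaching 1.0 at the end
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * fraction))


# One learning rate per iteration across 200 iterations.
schedule = [cosine_annealing(i, 200, lr_max=0.1) for i in range(200)]
```

The rate starts at `lr_max`, falls slowly at first, fastest around the midpoint, and then flattens out as it approaches `lr_min`.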
Figure: Decreasing learning rate across an epoch containing 200 iterations
SGDR is a recent variant of learning rate annealing that was introduced by Loshchilov & Hutter in their paper “SGDR: Stochastic Gradient Descent with Warm Restarts”.
In this technique, we increase the learning rate suddenly from time to time.
Below is an example of resetting the learning rate over three evenly spaced intervals with cosine annealing.
Figure: Increasing the learning rate to its max value after every 100 iterations
The rationale behind suddenly increasing the learning rate is that gradient descent is then less likely to get stuck in a local minimum and may “hop” out of it on its way towards a global minimum.
Each time the learning rate drops to its minimum point (every 100 iterations in the figure above), we call this a cycle.
The authors also suggest making each next cycle longer than the previous one by some constant factor.
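A minimal sketch of restarts with growing cycles, assuming cosine annealing within each cycle and an integer multiplier: the function name, the first cycle length of 100 iterations and the maximum rate of 0.1 are illustrative choices, not values prescribed by the paper. Setting `cycle_mult=1` would give evenly spaced restarts instead.

```python
import math


def sgdr_learning_rate(step, first_cycle_len, lr_max, lr_min=0.0, cycle_mult=2):
    """Learning rate at `step` under cosine annealing with restarts."""
    cycle_len = first_cycle_len
    t_cur = step
    # Skip over completed cycles; each one is `cycle_mult` times longer
    # than the one before it.
    while t_cur >= cycle_len:
        t_cur -= cycle_len
        cycle_len *= cycle_mult
    cosine = 0.5 * (1 + math.cos(math.pi * t_cur / cycle_len))
    return lr_min + (lr_max - lr_min) * cosine


# Cycles of 100, 200 and 400 iterations, so restarts fall at steps 0, 100 and 300.
lrs = [sgdr_learning_rate(s, first_cycle_len=100, lr_max=0.1) for s in range(700)]
```

At each restart the rate jumps back to `lr_max`, and every later cycle spends twice as many iterations annealing back down.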
Figure: Each cycle taking twice as many epochs to complete as the prior cycle

Implementing SGDR in Keras
Using Keras Callbacks, we can update the learning rate to follow a particular function.
We can refer to this repo, which has already implemented cyclical learning.
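One possible way to wire such a schedule into Keras is the built-in `LearningRateScheduler` callback, which calls a function once per epoch to get the learning rate. The sketch below is not the code from the repository mentioned above: `make_sgdr_schedule` and its parameter values are hypothetical, and it changes the rate per epoch rather than per iteration (per-batch updates would need a custom callback). The Keras lines are left as comments so the schedule itself can be run and inspected without Keras installed.

```python
import math


def make_sgdr_schedule(lr_max, lr_min=0.0, first_cycle_epochs=1, cycle_mult=2):
    """Build an epoch-level schedule function: (epoch, lr) -> new learning rate."""

    def schedule(epoch, lr=None):
        # `lr` (the current rate) is accepted for compatibility but not used.
        cycle_len, t_cur = first_cycle_epochs, epoch
        while t_cur >= cycle_len:
            t_cur -= cycle_len
            cycle_len *= cycle_mult
        cosine = 0.5 * (1 + math.cos(math.pi * t_cur / cycle_len))
        return lr_min + (lr_max - lr_min) * cosine

    return schedule


schedule = make_sgdr_schedule(lr_max=0.01, first_cycle_epochs=1, cycle_mult=2)
epoch_lrs = [schedule(e) for e in range(7)]  # cycles of 1, 2 and 4 epochs

# With Keras installed, the schedule could be attached to training like this
# (`model`, `x_train` and `y_train` are placeholders):
# from tensorflow import keras
# callback = keras.callbacks.LearningRateScheduler(schedule)
# model.fit(x_train, y_train, epochs=7, callbacks=[callback])
```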
Please check out the GitHub repository for the entire code of Differential Learning and SGDR.
It also contains a test file to use these techniques on a sample dataset.