Tip: if you’re looking for a more in-depth mathematical comparison of the optimizers, check out this fantastic blog post by Sebastian Ruder, which was a great help to me in writing this post.
Cyclical Learning Rate

The CLR paper makes two very interesting points:

- It gives us a way to schedule the learning rate efficiently during training, by varying it between a lower and an upper bound in a triangular fashion.
- It gives us a very decent estimate of which range of learning rates works well for your particular network.
There are a number of parameters to play around with here:

- step size: the number of epochs it takes the LR to go from the lower bound up to the upper bound.
- max_lr: the highest LR in the schedule.
- base_lr: the lowest LR in the schedule. In practice, the author of the paper suggests taking this a factor R smaller than max_lr. We used a factor of 6.
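To make the three parameters concrete, here is a minimal sketch of a triangular schedule in plain Python. It is an illustration rather than the code from the notebook: the function name `triangular_lr` is made up, the step size is counted in batches instead of epochs, and the value of `step_size` is hypothetical. The `max_lr` of 3e-3 is the value this post arrives at further down.

```python
import math


def triangular_lr(iteration, step_size, base_lr, max_lr):
    """Learning rate at a given iteration of a triangular cycle.

    step_size is the length of half a cycle (assumption: counted in
    batches here, not in epochs).
    """
    # Index of the current cycle, starting at 1.
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Distance from the peak of the cycle: 1 at the bounds, 0 at the peak.
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


max_lr = 3e-3
base_lr = max_lr / 6   # factor R = 6, as used in this post
step_size = 100        # hypothetical: 100 batches per half cycle

# Two full cycles: up, down, up, down.
schedule = [triangular_lr(i, step_size, base_lr, max_lr) for i in range(401)]
```

The schedule starts at base_lr, reaches max_lr after step_size iterations, and is back at base_lr after 2 × step_size iterations.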
Exactly why this works well is, of course, difficult to analyse.
The evolution of the LR might cause the network to go to a higher loss in the short-term, but this short-term disadvantage proves advantageous in the long term.
It gives the network the ability to jump to another local minimum, if the current one isn’t very stable.
Source: Snapshot Ensembles (https://arxiv.
00109)

One other advantage CLR has over the adaptive methods described above is that it is less computationally intensive.
In the paper, it is also mentioned that you can play around with a linearly or exponentially decreasing upper bound over time, but this is not implemented in this blog post.
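For the curious, a decreasing upper bound could look roughly like the sketch below. This is not part of the code for this post; the helper names (`decaying_triangular_lr`, `exponential_decay`, `linear_decay`) and the decay settings are assumptions chosen for illustration. The idea is simply to shrink the distance between base_lr and max_lr as training progresses.

```python
import math


def decaying_triangular_lr(iteration, step_size, base_lr, max_lr, decay):
    """Triangular schedule whose peak shrinks over time.

    decay is a function of the iteration returning a factor in [0, 1]
    that scales the amplitude (max_lr - base_lr).
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    amplitude = (max_lr - base_lr) * decay(iteration)
    return base_lr + amplitude * max(0.0, 1.0 - x)


def exponential_decay(gamma):
    # Amplitude shrinks by a constant factor gamma per iteration.
    return lambda iteration: gamma ** iteration


def linear_decay(total_iterations):
    # Amplitude shrinks linearly to zero at total_iterations.
    return lambda iteration: max(0.0, 1.0 - iteration / total_iterations)


base_lr, max_lr, step_size = 5e-4, 3e-3, 100  # illustrative values

# Learning rate at the peak of the first three cycles.
peaks = (100, 300, 500)
exp_peaks = [
    decaying_triangular_lr(i, step_size, base_lr, max_lr, exponential_decay(0.999))
    for i in peaks
]
lin_peaks = [
    decaying_triangular_lr(i, step_size, base_lr, max_lr, linear_decay(1000))
    for i in peaks
]
```

In both variants the lower bound stays fixed and only the peaks come down.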
So how does this work in code?

Step 1: find the upper LR

Using a vanilla CNN as an example, step 1 is to calculate the upper bound of the learning rate for your model. The way to do this is to:

- define an initial learning rate, the lower boundary of the range you want to test (let’s say 1e-7)
- define an upper boundary of the range (let’s say 0.1)
- define an exponential scheme to run through this range step by step

Formula used for the LR finder scheduling (N = number of images, BS = batch size, lr = learning rate)

Luckily, PyTorch has a LambdaLR object which lets us define the above in a lambda function.

Next, do a run (I used two epochs) through your network.
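The exponential scheme can be sketched in plain Python as follows. This is one common way to write such a sweep and not necessarily the exact formula used in the notebook; the function name `lr_finder_schedule` and the dataset size in the example are assumptions. The number of steps follows from N, BS and the number of epochs, and the LR is multiplied by a constant factor at every step.

```python
import math


def lr_finder_schedule(lr_start, lr_end, n_images, batch_size, epochs):
    """Exponentially spaced learning rates, one per batch."""
    steps = epochs * math.ceil(n_images / batch_size)
    if steps < 2:
        raise ValueError("need at least two steps to sweep a range")
    # Constant multiplicative factor between two consecutive steps.
    ratio = (lr_end / lr_start) ** (1 / (steps - 1))
    return [lr_start * ratio ** step for step in range(steps)]


# Hypothetical dataset of 2000 images, batch size 32, two epochs.
lrs = lr_finder_schedule(1e-7, 0.1, n_images=2000, batch_size=32, epochs=2)
```

LambdaLR multiplies the optimizer’s initial learning rate by whatever the lambda returns, so with the optimizer initialised at lr_start the lambda would return `ratio ** step`.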
At each step (that is, each batch), capture the LR, capture the loss and optimize the gradients.

Note: we don’t take the ‘raw’ loss at each step, but the smoothed loss, being: loss = α · loss + (1 − α) · previous_loss

After this, we can clearly see that the LR followed a nice exponential pattern. In the loss-LR plot for the basic network (see later), we can clearly see that too high an LR causes the network loss to diverge, and too low an LR doesn’t cause the network to learn very much at all…

In his fast.ai course, Jeremy Howard mentions that a good upper bound is not at the lowest point, but about a factor of 10 to the left.
Taking this into account, we can state that a good upper bound for the learning rate would be: 3e-3.
A good lower bound, according to the paper and other sources, is the upper bound divided by a factor of 6.
Step 2: CLR scheduler

Step 2 is to create a cyclical learning rate schedule, which varies the learning rate between the lower and the upper bound. This can be done in a number of fashions:

Various possibilities for the CLR shape (source: Jeremy Jordan’s blog)

We’re going for the plain ol’ triangular CLR schedule.
Programmatically, we just need to create a custom function.

Step 3: wrap it

In step 3, this function can then be wrapped inside a LambdaLR object in PyTorch.

Step 4: train

During an epoch, we need to update the LR using the ‘.step()’ method of the scheduler object.

Comparison 1: Vanilla CNN

First up is the classification task using a vanilla (non-pretrained) CNN.
I’ve used the following network architecture:

To prevent the model from overfitting on the (relatively small) dataset, we use the following techniques:

- Dropout in the Linear layers
- Batchnorm layers in the CNN blocks
- Data augmentation

Hint: you need to calculate the mean and std for the channel normalization in advance; look in the full notebook to see how to tackle this.
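As a rough idea of what calculating the mean and std in advance involves, here is a dependency-free sketch that streams over the pixels once and keeps running sums per channel. It is not the notebook’s code: the function name `channel_mean_std`, the nested-list image representation and the tiny dataset are assumptions for illustration. On real data you would do the same over tensors.

```python
import math


def channel_mean_std(images):
    """Per-channel mean and std over a whole dataset.

    images: iterable of images, each a list of (r, g, b) pixel tuples
    with values scaled to [0, 1].
    """
    count = 0
    sums = [0.0, 0.0, 0.0]
    squared_sums = [0.0, 0.0, 0.0]
    for image in images:
        for pixel in image:
            count += 1
            for channel, value in enumerate(pixel):
                sums[channel] += value
                squared_sums[channel] += value * value
    means = [s / count for s in sums]
    # Population variance E[x^2] - E[x]^2, clamped at 0 against rounding.
    stds = [
        math.sqrt(max(0.0, sq / count - m * m))
        for sq, m in zip(squared_sums, means)
    ]
    return means, stds


# Two tiny two-pixel "images", only to show the call.
tiny_dataset = [
    [(0.0, 0.5, 1.0), (1.0, 0.5, 0.0)],
    [(0.0, 0.5, 1.0), (1.0, 0.5, 0.0)],
]
means, stds = channel_mean_std(tiny_dataset)
```

The resulting per-channel values are what a normalization transform (such as torchvision’s Normalize) takes as its mean and std arguments.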
We trained the network for 150 epochs for each of the 6 optimizers.
To cancel out some variability, we did 3 runs for each optimizer.
The training and validation accuracy look as follows:

Training accuracy
Validation accuracy

Alrighty boys and girls, what can we see here:

- Adagrad: mediocre performance, as was to be expected
- Adadelta: not a real champ in training accuracy, but very decent performance in validation
- RMSProp: unless I’m doing something wrong here, I was a bit surprised by the poor performance
- Adam: consistently good
- Adamax: promising training accuracy evolution, but not perfectly reflected in the validation accuracy
- SGD with CLR: much faster convergence in training accuracy, fast convergence in validation accuracy terms, not too shabby…

In the end, SGD+CLR, Adam and Adadelta all seem to end up at about the same final validation accuracy of about 83%.
Comparison 2: Resnet34 Transfer Learning

If you’re saying “image classification on a small dataset”, you need to consider Transfer Learning.
So we did just that, using Resnet34, pretrained on ImageNet.
I believe the dataset was fairly close to ImageNet pictures, so I only unfroze the last of the 5 convolutional blocks, and replaced the last linear layer with a new one.

The network was trained, for each of the 6 optimizers, for 100 epochs (due to much faster convergence):

Training accuracy
Validation accuracy

Key notes here:

- In general: much less difference between the optimizers, especially when observing the validation accuracy
- RMSProp: still a bit of an underperformer
- SGD+CLR: again good performance in training accuracy, but this is not immediately reflected in the validation accuracy.
It seems that for Transfer Learning, the absolute reward for tuning your learning rate and carefully selecting your optimizer is smaller.
This is probably due to two main effects:

- the network weights are already largely optimized
- the optimizer typically only gets to optimize a smaller portion of the network weights, since most weights remain frozen

Conclusion

The main point of this blog post would be: don’t just take any old off-the-shelf optimizer. The learning rate is one of the most important hyperparameters, so it pays off to take a closer look at it.
And if you’re comparing, have a look at SGD with a CLR schedule.
Again: all code can be found here, feel free to check it out!

Sources and further reading

https://arxiv.