We have covered many core concepts so far.
Let’s continue and train the model on our dataset.
fit_one_cycle trains the model for the number of epochs provided, i.e. 4 here.
The number of epochs is the number of times the model looks at the entire set of images.
However, in every epoch, the same image looks slightly different because of our data augmentation.
Usually, the metric error will go down with each epoch.
It is a good idea to increase the number of epochs as long as the accuracy of the validation set keeps improving.
However, a large number of epochs can result in learning the specific image and not the general class, something we want to avoid.
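The rule of thumb above (keep adding epochs while the validation accuracy improves) can be sketched in plain Python. This is only an illustration: the accuracy values and the `patience` parameter are made up, and none of it comes from fastai or from the results of this tutorial.

```python
def best_epoch(val_accuracies, patience=2):
    """Return (epoch, accuracy) for the best validation accuracy seen,
    scanning epochs in order and stopping once the accuracy has not
    improved for `patience` consecutive epochs."""
    best_acc, best_idx, stale = float("-inf"), 0, 0
    for idx, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_idx, stale = acc, idx, 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation accuracy stopped improving
    return best_idx, best_acc

# Hypothetical validation accuracies, one value per epoch.
history = [0.88, 0.91, 0.93, 0.935, 0.931, 0.929, 0.94]
print(best_epoch(history))
```

With a small `patience` the scan gives up early, which trades a possibly better late epoch for less risk of memorizing specific images.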
The training that we just did here is what we referred to as feature extraction, so only the parameters of the head (last layers) of our model were updated.
We shall try fine-tuning all the layers next.
Congratulations! The model has been successfully trained to recognize dog and cat breeds.
The accuracy achieved above is ≈ 93.5%.
Can we do even better? We’ll see after fine-tuning.
Let’s save the current model parameters in case we want to reload them later.
Results interpretation

Let’s now see how to properly interpret the current model results.
ClassificationInterpretation provides a visualization of the misclassified images.
plot_top_losses shows the images with the highest losses along with their prediction label / actual label / loss / probability of the actual class.
A high loss implies high confidence about the wrong answer.
Plotting top losses is a great way to visualize and interpret classification results.
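To see why a high loss means high confidence in the wrong answer, here is a minimal sketch that ranks a few samples by loss. It assumes a cross-entropy loss, and the samples, probabilities, and the `top_losses` helper are invented for illustration; it is not how fastai implements plot_top_losses.

```python
import math

def top_losses(samples, k=3):
    """samples: list of (actual_label, {label: probability}) pairs.
    Returns up to k rows of (prediction, actual, loss, probability of
    the actual class), sorted by decreasing loss."""
    rows = []
    for actual, probs in samples:
        predicted = max(probs, key=probs.get)   # most probable label
        p_actual = probs[actual]
        loss = -math.log(p_actual)              # cross-entropy for one sample
        rows.append((predicted, actual, round(loss, 2), round(p_actual, 2)))
    return sorted(rows, key=lambda row: row[2], reverse=True)[:k]

samples = [
    ("beagle",  {"beagle": 0.90, "basset_hound": 0.10}),  # correct and confident
    ("Birman",  {"Birman": 0.40, "Siamese": 0.60}),       # wrong but unsure
    ("Ragdoll", {"Ragdoll": 0.02, "Maine_Coon": 0.98}),   # wrong and very confident
]
for row in top_losses(samples):
    print(row)
```

The confidently wrong Ragdoll sample ends up first because the probability assigned to its actual class is the smallest.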
Figure: Misclassified images with the highest losses

Figure: Classification confusion matrix

In a confusion matrix, the diagonal elements represent the number of images for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier.
most_confused simply grabs out the most confused combinations of predicted and actual categories; in other words, the ones that it got wrong most often.
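The idea can be sketched with a counter over (actual, predicted) pairs, which is the same information a confusion matrix holds. The labels below are invented, the function is a stand-in rather than fastai's implementation, and the (actual, predicted, count) tuple order is an assumption.

```python
from collections import Counter

def most_confused(actual, predicted, min_val=1):
    """Return (actual, predicted, count) for misclassified pairs,
    most frequent first, keeping only counts >= min_val."""
    cells = Counter(zip(actual, predicted))      # confusion-matrix cells
    wrong = [(a, p, n) for (a, p), n in cells.items()
             if a != p and n >= min_val]         # off-diagonal cells only
    return sorted(wrong, key=lambda t: t[2], reverse=True)

actual    = ["Siamese", "Siamese", "Birman", "beagle", "beagle", "Siamese"]
predicted = ["Birman", "Birman", "Birman", "beagle", "basset_hound", "Siamese"]
print(most_confused(actual, predicted))
```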
We can see that it often misclassified a Staffordshire bull terrier as an American pit bull terrier; they do actually look very similar :)

[('Siamese', 'Birman', 6),
 ('american_pit_bull_terrier', 'staffordshire_bull_terrier', 5),
 ('staffordshire_bull_terrier', 'american_pit_bull_terrier', 5),
 ('Maine_Coon', 'Ragdoll', 4),
 ('beagle', 'basset_hound', 4),
 ('chihuahua', 'miniature_pinscher', 3),
 ('staffordshire_bull_terrier', 'american_bulldog', 3),
 ('Birman', 'Ragdoll', 2),
 ('British_Shorthair', 'Russian_Blue', 2),
 ('Egyptian_Mau', 'Abyssinian', 2),
 ('Ragdoll', 'Birman', 2),
 ('american_bulldog', 'staffordshire_bull_terrier', 2),
 ('boxer', 'american_pit_bull_terrier', 2),
 ('chihuahua', 'shiba_inu', 2),
 ('miniature_pinscher', 'american_pit_bull_terrier', 2),
 ('yorkshire_terrier', 'havanese', 2)]

5. Freezing and Unfreezing

By default in fastai, using a pre-trained model freezes the earlier layers so that the network can only make changes to the parameters of the last layers, as we did above.
Freezing the first layers and training only the deeper layers can significantly reduce the computation.
We can always train all of the network’s layers by calling the unfreeze function, followed by fit or fit_one_cycle.
This is what we called fine-tuning, as we are tuning the parameters of the whole network.
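As a toy illustration of what freezing means for the update step, the sketch below models each layer group as a single weight with a trainable flag. The `LayerGroup` class and the `freeze`, `unfreeze`, and `sgd_step` helpers are hypothetical names written for this example; they only mimic the idea and are not fastai's API.

```python
class LayerGroup:
    """Hypothetical stand-in for a group of layers holding one weight."""
    def __init__(self, name, weight):
        self.name, self.weight, self.trainable = name, weight, True

def freeze(groups):
    """Make only the last group (the head) trainable."""
    for g in groups[:-1]:
        g.trainable = False
    groups[-1].trainable = True

def unfreeze(groups):
    """Make every group trainable."""
    for g in groups:
        g.trainable = True

def sgd_step(groups, grads, lr):
    """Update trainable groups only; frozen weights stay untouched."""
    for g, grad in zip(groups, grads):
        if g.trainable:
            g.weight -= lr * grad

model = [LayerGroup("early", 1.0), LayerGroup("middle", 1.0), LayerGroup("head", 1.0)]
freeze(model)
sgd_step(model, grads=[0.5, 0.5, 0.5], lr=0.1)
print([g.weight for g in model])   # only the head has moved
unfreeze(model)
sgd_step(model, grads=[0.5, 0.5, 0.5], lr=0.1)
print([g.weight for g in model])   # every group has moved
```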
Let's do it. The accuracy is now a little worse than before.
Why is that? It is because we are updating the parameters of all the layers at the same speed, which is not what we want, since the first layers do not need as much change as the last layers do.
The hyperparameter that controls the updating amount of the weights is called the learning rate, also referred to as step size.
It adjusts the weights with respect to the gradient of the loss, with the objective to reduce the loss.
For instance, in the most common gradient descent optimizer, the relationship between the weights and the learning rate is as follows:
new_weight = old_weight - lr * gradient
By the way, a gradient is simply a vector that is a multi-variable generalization of a derivative.
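The update rule can be tried on a toy problem. The sketch below assumes a made-up loss L(w) = sum of w_i squared, whose gradient is 2 * w_i per weight; the function names are illustrative only.

```python
def sgd_update(weights, gradients, lr):
    """Apply new_weight = old_weight - lr * gradient to every weight."""
    return [w - lr * g for w, g in zip(weights, gradients)]

def gradient(weights):
    """Gradient of the toy loss L(w) = sum(w_i ** 2): one entry per weight."""
    return [2 * w for w in weights]

weights = [1.0, -2.0]
for step in range(3):
    weights = sgd_update(weights, gradient(weights), lr=0.1)
    print(step, weights)   # each weight moves toward 0, the minimum of the toy loss
```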
Therefore, a better approach to fine-tune the model would be to use different learning rates for the lower and higher layers, often referred to as differential or discriminative learning rates.
By the way, I am using parameters and weights interchangeably in this tutorial.
More accurately, parameters are weights and biases, but let’s not worry about this subtlety here.
However, note that hyperparameters and parameters are different; hyperparameters cannot be estimated within training.
Fine-Tuning

In order to find the most adequate learning rate for fine-tuning the model, we use a learning rate finder, where the learning rate is gradually increased and the corresponding loss is recorded after each batch.
The fastai library has this implemented in lr_find.
For further reading on this, check out How Do You Find A Good Learning Rate by @GuggerSylvain.
Let’s load the model we had previously saved and run lr_find.
The recorder.plot method can be used to plot the losses versus the learning rates.
The plot stops when the loss starts to diverge.
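The mechanics of a learning rate finder can be sketched on a toy loss. This is not fastai's lr_find: the exponential growth of the rate, the stop rule (loss above 4 times the best loss seen), and the quadratic loss are all assumptions chosen to keep the example short.

```python
def lr_find(grad_fn, loss_fn, w0, start_lr=1e-4, end_lr=10.0, num_it=100):
    """Take one gradient step per 'batch' while growing the learning rate
    exponentially from start_lr to end_lr; record (lr, loss) pairs and
    stop as soon as the loss diverges."""
    w, best = w0, float("inf")
    lrs, losses = [], []
    for i in range(num_it):
        lr = start_lr * (end_lr / start_lr) ** (i / (num_it - 1))
        w = w - lr * grad_fn(w)          # one update with the current rate
        loss = loss_fn(w)
        lrs.append(lr)
        losses.append(loss)
        best = min(best, loss)
        if loss > 4 * best:              # assumed divergence criterion
            break
    return lrs, losses

# Toy loss L(w) = w ** 2 with gradient 2 * w.
lrs, losses = lr_find(lambda w: 2 * w, lambda w: w ** 2, w0=1.0)
print(f"recorded {len(lrs)} points, last learning rate tried: {lrs[-1]:.3g}")
```

Plotting `losses` against `lrs` on a log scale would give the kind of curve discussed here; in practice one picks a rate somewhat before the point where the loss starts to climb.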
From the resulting plot, we conclude that an appropriate learning rate would be around 1e-4 or lower, a bit before the loss starts to increase and go out of control.
We will assign 1e-4 to the last layers and a much smaller rate, 1e-6, to the earlier layers.
Again, this is because the earlier layers are already well trained to capture universal features and would not need as much updating.
In case you are wondering about the learning rate used in our previous experiments, since we did not explicitly declare it: it was 0.003, which is set by default in the library.
Before we train our model with these discriminative learning rates, let’s demystify the difference between the fit_one_cycle and fit methods, since both are plausible options to train the model.
This discussion can be very valuable in understanding the training process, but feel free to skip directly to results.
fit_one_cycle vs fit

Briefly, the difference is that fit_one_cycle implements Leslie Smith’s 1cycle policy, which, instead of using a fixed or a decreasing learning rate to update the network's parameters, oscillates between two reasonable lower and upper learning rate bounds.
Let’s dig a little deeper into how this can help our training.
➯ Learning Rate Hyperparameter in Training

A good learning rate hyperparameter is crucial when tuning our deep neural networks.
A high learning rate allows the network to learn faster, but too high a learning rate can prevent the model from converging.
On the other hand, a small learning rate will make training progress very slowly.
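A quick experiment makes the three regimes concrete. The sketch below runs plain gradient descent on a made-up loss L(w) = w squared with a small, a moderate, and a too-high learning rate; the specific rates and step count are arbitrary choices for illustration.

```python
def train(lr, steps=20, w=1.0):
    """Run gradient descent on the toy loss L(w) = w ** 2 and
    return the final loss."""
    for _ in range(steps):
        w = w - lr * 2 * w      # gradient of w ** 2 is 2 * w
    return w ** 2

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final loss = {train(lr):.3g}")
```

The smallest rate barely reduces the loss in 20 steps, the moderate one gets close to the minimum, and the largest one overshoots on every step so the loss grows instead of shrinking.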
Figure: Effect of various learning rates on convergence [Source]

In our case, we estimated the appropriate learning rate (lr) by looking at the recorded losses at different learning rates.
It is possible to use this learning rate as a fixed value in updating the network’s parameters; in other words, the same learning rate will be applied through all training iterations.
This is what learn.
A much better approach would be to change the learning rate as the training progresses.
There are two ways to do this: learning rate schedules (time-based decay, step decay, exponential decay, etc.) or adaptive learning rate methods (Adagrad, RMSprop, Adam, etc.).
For more about this, check out CS230 Stanford class notes on Parameter Updates.
Another good resource is An overview of gradient descent optimization algorithms by @Sebastian Ruder.
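For reference, the three learning rate schedules named above are commonly written as the formulas below. The decay constants (`k`, `drop`, `every`) are placeholder values picked for the example, not recommendations.

```python
import math

def time_based_decay(lr0, epoch, k=0.1):
    """lr = lr0 / (1 + k * epoch)"""
    return lr0 / (1 + k * epoch)

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply the rate by `drop` once every `every` epochs."""
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, k=0.1):
    """lr = lr0 * exp(-k * epoch)"""
    return lr0 * math.exp(-k * epoch)

for epoch in (0, 10, 20):
    print(epoch,
          round(time_based_decay(0.1, epoch), 4),
          round(step_decay(0.1, epoch), 4),
          round(exponential_decay(0.1, epoch), 4))
```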
➯ One Cycle Policy in a nutshell

The one cycle policy is a type of learning rate scheduler that allows the learning rate to oscillate between reasonable minimum and maximum bounds.
What are the values of these two bounds? The upper bound is what we got from our learning rate finder, while the minimum bound can be 10 times smaller.
The advantage of this approach is that it can overcome local minima and saddle points, which are points on flat surfaces with typically small gradients.
The 1cycle policy has proved to be faster and more accurate than other scheduling or adaptive learning approaches.
Fastai implements the 1cycle policy in fit_one_cycle, which internally calls the fit method along with a OneCycleScheduler callback.
Documentation of fastai 1cycle policy implementation can be found here.
Figure: One cycle length of 1cycle policy [Source]

A slight modification of the 1cycle policy in the fastai implementation is that it uses cosine annealing in the second phase, from lr_max to 0.
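A simplified version of such a schedule could look like the sketch below. It follows the description in this section (a lower bound 10 times smaller than lr_max, then cosine annealing from lr_max to 0), but the linear shape of the first phase, the 30% warm-up fraction (`pct_start`), and the function name are assumptions; this is not fastai's OneCycleScheduler.

```python
import math

def one_cycle_lr(step, total_steps, lr_max, pct_start=0.3, div=10.0):
    """Learning rate at `step` (0-based) of a single cycle."""
    warmup = int(total_steps * pct_start)
    if step < warmup:
        # Phase 1 (assumed linear): from lr_max / div up to lr_max.
        lr_min = lr_max / div
        return lr_min + (lr_max - lr_min) * step / warmup
    # Phase 2: cosine annealing from lr_max down to 0 at the last step.
    progress = (step - warmup) / max(1, total_steps - 1 - warmup)
    return lr_max * (1 + math.cos(math.pi * progress)) / 2

schedule = [one_cycle_lr(s, 100, lr_max=1e-3) for s in range(100)]
print(schedule[0], max(schedule), schedule[-1])
```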
➯ 1cycle Policy discovery

Leslie Smith first introduced a method he called Cyclical Learning Rates (CLR), where he showed that CLRs are not computationally expensive and that they eliminate the need to find the best learning rate value, since the optimal learning rate will fall somewhere between the minimum and maximum bounds.
He then followed that paper with another, A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay, where he highlighted various remarks and suggestions to enable faster training of networks to produce optimal results.
One of the propositions was to use CLR with just one cycle to achieve optimal and fast results, which he elaborated on in another paper on super-convergence.
The authors named the approach 1cycle policy.
The figure below is an illustration of how the super-convergence method reaches higher accuracies than a typical (piecewise constant) training regime in much fewer iterations for Cifar-10, both using a 56 layer residual network architecture.
Figure: Super-convergence accuracy test vs a typical training regime with the same architecture on Cifar-10 [Source]

If you choose to skip reading Leslie Smith’s papers, I would still recommend reading the post The 1cycle policy by @GuggerSylvain.
Moment of Truth

Now that we picked our discriminative learning rates for our layers, we can unfreeze the model and train accordingly.
The slice function assigns 1e-4 to the last layers and 1e-6 to the first layers; the layers in between get learning rates evenly spaced within this range (on a multiplicative scale, so each group's rate is a constant factor larger than the previous one).
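Here is a minimal sketch of how a (start, stop) range could be spread over layer groups. It assumes three layer groups and a constant multiplicative step between neighbouring groups; `spread_lrs` is a hypothetical helper written for this example, not a fastai function.

```python
def spread_lrs(start, stop, n_groups):
    """Learning rates from `start` (first group) to `stop` (last group),
    each one a constant factor larger than the previous."""
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]

# One rate per layer group, earliest layers first.
print(spread_lrs(1e-6, 1e-4, 3))
```

With these bounds and three groups, the middle group lands near 1e-5, halfway between the two bounds on a log scale.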
We see the accuracy has improved a bit but not much, so we wonder whether we needed to fine-tune the model at all.
Two key factors should always be considered prior to fine-tuning any model: the size of the dataset and its similarity to the dataset of the pre-trained model.
Check out Stanford’s CS231n notes on When and how to fine-tune?
In our case, our Pet dataset is similar to the images in ImageNet and it is relatively small, and that’s why we achieved a high classification accuracy from the start without fine-tuning the full network.
Nonetheless, we were still able to improve our results a bit and learned so much, so GREAT JOB :)
The figure below illustrates the three plausible ways to use and fine-tune a pre-trained model.
In this tutorial, we attempted the first and third strategies.
Strategy 2 is also common in cases where the dataset is small but distinct from the dataset of the pre-trained model, or when the dataset is large but similar to the dataset of the pre-trained model.

Figure: Fine-tuning strategies on a pre-trained model

Congratulations, we have successfully covered image classification using a state-of-the-art CNN with a solid foundation of the underlying structure and training process.
You are ready to build an image recognizer on your own dataset.
If you do not already have one, you can scrape images from Google Images and make up a dataset.
I made a very short tutorial just for that ⬇ check it out.
A State-of-the-Art Image Classifier on Your Dataset in Less Than 10 Minutes
Fast multi-class image classification with code ready, using fastai and PyTorch libraries
towardsdatascience.com

Acknowledgment: Thanks to Jeremy Howard and Rachel Thomas for their efforts creating all the fastai content.
I hope you found this short tutorial helpful.
Please give it a share and a few claps, so it can reach others as well.
Feel free to leave any comments, or connect with me on Twitter @SalimChemlal or Medium for more!

“A mind that is stretched by a new experience can never go back to its old dimensions.”
- Oliver Wendell Holmes Jr.