Day 2: Bag of Tricks for Image Classification with Convolutional Neural Networks
Francisco Ingham
Mar 11
Up-to-date tricks to boost your CNN’s
Trick or treat, some papers render your code obsolete.
TL;DR
Deep learning convolutional networks have had many improvements not directly related to architecture.
This paper examines a collection of tricks that clearly improve performance at almost no complexity cost.
Many of these tricks have been added to fastai.
Large batch size
The paper presents four techniques for training networks effectively with large batch sizes.
Linear scaling learning rate
Since larger batch sizes mean a lower variance (lower noise) in the SGD gradient, we can be more confident that the gradient is a promising direction.
Thus, it makes sense to increase the learning rate along with batch size.
Linearly increasing the learning rate with the batch size has been shown empirically to work for ResNet-50 training.
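As a minimal sketch of the linear scaling rule: the paper takes 0.1 as the learning rate for a batch size of 256 and scales it in proportion to the batch size. The function name and its default arguments are illustrative, not part of any library.

```python
def scaled_learning_rate(batch_size, base_lr=0.1, base_batch_size=256):
    """Scale the learning rate linearly with the batch size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return base_lr * batch_size / base_batch_size


# Doubling the batch size doubles the learning rate.
for b in (256, 512, 1024):
    print(b, scaled_learning_rate(b))
```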
Learning rate warmup
At the beginning of training the weights typically have random values and are far away from the final solution.
Using a learning rate that is too high may result in numerical instability.
The trick here is to use a low learning rate initially and increase it once the training is stable.
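One common way to do this, and the one the paper describes, is a linear ramp from near zero up to the target learning rate over the first few batches. The sketch below assumes 1-based step counting; the function and argument names are hypothetical.

```python
def warmup_learning_rate(step, target_lr, warmup_steps):
    """Linear warmup: step i (1-based) of m warmup steps uses i * target_lr / m."""
    if step < 1:
        raise ValueError("step is 1-based and must be >= 1")
    if step <= warmup_steps:
        return step * target_lr / warmup_steps
    # After the warmup phase the target learning rate is used as-is.
    return target_lr


# Ramp up over 4 steps, then stay at the target value.
schedule = [warmup_learning_rate(s, target_lr=0.4, warmup_steps=4) for s in range(1, 7)]
print(schedule)
```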
Zero γ
The residual blocks in ResNet have an output to which the input of the block is added:
x + block(x)
Sometimes the last layer in the block is batch normalization, which normalizes the value and then performs a scale transformation.
If the normalized value is x_hat, the output of the batch normalization layer is:
γ · x_hat + β
where γ and β are initialized at 1 and 0, respectively.
If we instead initialize γ as 0, the residual blocks start by just returning their input, effectively reducing the number of layers and making the network easier to train in the initial stage.
Also, the network will only move γ away from 0 if the transformation in the residual block is worth it (i.e., it improves performance), and this avoids unnecessary computation.
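To make the effect concrete, here is a toy sketch in plain Python. The "convolutions" are replaced by a made-up affine function, and `batch_norm` works on a flat list of activations, so this only illustrates the idea: with the batch normalization scale set to zero, the block returns its input unchanged.

```python
import math


def batch_norm(values, gamma, beta, eps=1e-5):
    """Normalize a list of activations, then apply the scale (gamma) and shift (beta)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def residual_block(x, gamma):
    """Toy residual block: x + BN(f(x)), where f stands in for the conv layers."""
    transformed = [3.0 * v - 1.0 for v in x]  # hypothetical stand-in for the convolutions
    branch = batch_norm(transformed, gamma, beta=0.0)
    return [xi + bi for xi, bi in zip(x, branch)]


x = [0.5, -1.0, 2.0, 0.25]
print(residual_block(x, gamma=0.0) == x)  # scale of 0: the block is the identity
print(residual_block(x, gamma=1.0) == x)  # scale of 1: the block changes its input
```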
No bias decay
It is recommended to apply weight decay (regularization) only to the weights of the convolutional and fully-connected layers, and not to the biases or the batch normalization parameters.
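A rough sketch of how the parameters could be split into two groups. It assumes that every one-dimensional parameter is a bias or a batch normalization scale or shift, which holds for typical CNNs but is a heuristic. The parameter names and shapes are made up, and the output only loosely mirrors the "list of groups with per-group options" layout that many optimizers accept.

```python
def split_decay_groups(named_shapes, weight_decay=1e-4):
    """Split parameters into a decayed group and a non-decayed group.

    named_shapes maps a parameter name to its shape. One-dimensional
    parameters (biases, batch norm scale and shift) get no weight decay.
    """
    decay, no_decay = [], []
    for name, shape in named_shapes.items():
        if len(shape) <= 1:
            no_decay.append(name)
        else:
            decay.append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


params = {  # hypothetical parameter names and shapes
    "conv1.weight": (64, 3, 7, 7),
    "bn1.weight": (64,),
    "bn1.bias": (64,),
    "fc.weight": (1000, 2048),
    "fc.bias": (1000,),
}
groups = split_decay_groups(params)
for group in groups:
    print(group["weight_decay"], group["params"])
```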
Low-Precision Training
New hardware offers serious improvements in speed when using FP16 rather than FP32 (on an Nvidia V100, training in FP16 is 2 to 3 times faster).
However FP16 may cause overflow and disrupt the training process.
The suggestion to overcome this is to store parameters and activations in FP16 and use FP16 to compute gradients.
All parameters have a copy in FP32 for parameter updates.
For a detailed explanation see.
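The reason for keeping an FP32 copy can be shown without a GPU. The sketch below simulates the two precisions by rounding Python floats through the `struct` module's half-precision (`e`) and single-precision (`f`) formats. The weight, gradient and learning rate values are made up for illustration. A small update applied directly to an FP16 weight is rounded away, while the FP32 master copy accumulates it.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def simulate(steps=100, lr=1e-4, grad=1.0):
    """Apply the same small update to an FP16-only weight and to an FP32 master copy."""
    fp16_only = to_fp16(1.0)  # weight stored and updated in FP16
    master = to_fp32(1.0)     # FP32 master copy of the same weight
    for _ in range(steps):
        g = to_fp16(grad)                          # gradient computed in FP16
        fp16_only = to_fp16(fp16_only - lr * g)    # update is rounded away
        master = to_fp32(master - lr * g)          # update is kept in FP32
    return fp16_only, master


fp16_weight, master_weight = simulate()
print(fp16_weight, master_weight)
```

In a real setup the FP16 weights used for the forward and backward passes would be refreshed from the master copy after each update.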
Model Tweaks
ResNet Architecture
These tweaks help increase validation accuracy in ResNet-50 without a significant computational cost (~3% longer to train).
ResNet-B
ResNet-B changes the stride of the first two convolutional layers in Path A
The first improvement consists of changing the stride in the convolutional layers.
The first layer in Path A is a 1×1 convolution with a stride of 2, which means that it discards 3/4 of the input’s pixels.
To avoid this, the stride of this layer can be changed from 2 to 1 and the stride of the next layer from 1 to 2, which compensates and conserves the output dimensions.
Since the next layer has a kernel size of 3×3, even with a stride of 2 the layer takes advantage of all the input information.
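The 3/4 figure can be checked by counting which input positions a convolution actually reads. The helper below is a hypothetical illustration; the 56×56 input size and the padding of 1 for the 3×3 layer are assumptions chosen for the example.

```python
def pixels_read(size, kernel, stride, padding=0):
    """Return the fraction of a size x size input that a convolution reads."""
    out = (size + 2 * padding - kernel) // stride + 1
    rows = set()
    for o in range(out):
        start = o * stride - padding
        # Keep only positions that fall inside the (unpadded) input.
        rows.update(r for r in range(start, start + kernel) if 0 <= r < size)
    # The same 1-D pattern applies to rows and columns, so square the ratio.
    return (len(rows) / size) ** 2


print(pixels_read(56, kernel=1, stride=2))             # 1x1 conv, stride 2
print(pixels_read(56, kernel=3, stride=2, padding=1))  # 3x3 conv, stride 2
```

The 1×1 stride-2 layer reads a quarter of the input, while the 3×3 stride-2 layer reads all of it.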
ResNet-C
ResNet-C involves a replacement of big kernel size convolutions
The computational cost of a convolution is quadratic in the kernel width or height.
A 7×7 convolution is 5.4 times more expensive than a 3×3 convolution.
This tweak consists of replacing the 7×7 convolutional layer in the input stem with three 3×3 layers (which also makes the model easier to train).
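The 5.4 figure is just the ratio of kernel areas. A small sketch, assuming equal channel counts for both layers and (for the receptive field helper) stride-1 convolutions; both function names are made up:

```python
def conv_cost(kernel, in_channels=1, out_channels=1):
    """Multiply-accumulates per output position for a kernel x kernel convolution."""
    return kernel * kernel * in_channels * out_channels


def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions with the same kernel."""
    return 1 + layers * (kernel - 1)


ratio = conv_cost(7) / conv_cost(3)  # 49 / 9
print(round(ratio, 1))
print(stacked_receptive_field(kernel=3, layers=3))
```

Under those assumptions, three stacked 3×3 layers see the same 7×7 window as the single large convolution.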
ResNet-D
ResNet-D replaces a 2 stride convolution with an AvgPool and a 1 stride convolution to avoid information loss
ResNet-D is a similar improvement to ResNet-B but with a different approach.
The authors replace the stride-2 convolution in Path B with an average pooling layer followed by a stride-1 convolution (this keeps the output dimensions intact).
The authors report that this tweak does not affect speed noticeably.
Training Refinements
The training refinements have a clear positive impact on performance, not only in ResNet but also in other CV architectures (1).
Cosine Learning Rate Decay
Cosine Decay is a smooth way to progressively decay learning rate
The formula that defines the cosine decay function
Typically, after the learning rate warm-up described earlier, we decrease the learning rate as the training progresses (the intuition being that as you get closer to the optimum, high learning rates might move you away from it).
A smooth function to describe this schedule is the cosine function which we can see above.
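As a sketch, the schedule used in the paper sets the learning rate at step t of T total steps to 0.5 · (1 + cos(π · t / T)) times the initial learning rate, ignoring the warmup phase. The function name and arguments below are illustrative.

```python
import math


def cosine_lr(step, total_steps, base_lr):
    """Cosine decay: 0.5 * (1 + cos(pi * t / T)) * base_lr, ignoring warmup."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps)) * base_lr


# The rate starts at base_lr, passes through half of it midway, and ends near zero.
for t in (0, 25, 50, 75, 100):
    print(t, cosine_lr(t, total_steps=100, base_lr=0.4))
```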
Label Smoothing
New target with label smoothing
Typically the last layer of a neural network is a fully-connected layer with output dimension equal to the number of categories and a softmax activation function.
If the loss is cross-entropy, for mathematical reasons, the network has an incentive to make the prediction for one category very large and the others very small and this leads to over-fitting.
Label smoothing consists of changing the target from [1, 0, 0, …] to [1 − ε, ε/(K − 1), ε/(K − 1), …], where ε is a small constant and K is the number of categories, to reduce the polarity in the target.
With label smoothing, the distribution of the gap between the largest output and the others centers at the theoretical value and has fewer extreme values.
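A minimal sketch of building the smoothed target for a single example. The smoothing constant of 0.1 is a common choice and an assumption here, and the function name is hypothetical.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Smoothed target: 1 - epsilon for the true class, epsilon / (K - 1) elsewhere."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if k == true_class else off_value for k in range(num_classes)]


target = smooth_labels(true_class=0, num_classes=5, epsilon=0.1)
print(target)
print(sum(target))  # the smoothed target is still a probability distribution
```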
Knowledge Distillation
In knowledge distillation, we use a teacher model to help train the current model, which is called the student model.
One example is using a ResNet-152 as the teacher model to help train ResNet-50.
Knowledge distillation entails adding a term to the loss function which accounts for the difference between the student model and the teacher model to ensure that the student model does not differ too much from the teacher model.
Loss with knowledge distillation.
T is the temperature hyperparameter, r is the teacher output, z is the student output and p is the target.
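With those symbols, the loss in the paper is the usual cross-entropy between p and softmax(z), plus T² times the cross-entropy between softmax(r/T) and softmax(z/T). Below is a plain-Python sketch for a single example; the helper names and the default temperature are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(q) for t, q in zip(target, predicted) if t > 0)


def distillation_loss(p, z, r, T=2.0):
    """Hard-label loss plus T^2 times the soft-label loss against the teacher."""
    hard = cross_entropy(p, softmax(z))
    soft = cross_entropy(softmax(r, T), softmax(z, T))
    return hard + T * T * soft


p = [1.0, 0.0, 0.0]        # one-hot target
z = [2.0, 0.0, 0.0]        # student logits
agree = distillation_loss(p, z, r=[2.0, 0.0, 0.0])     # teacher agrees with student
disagree = distillation_loss(p, z, r=[0.0, 2.0, 0.0])  # teacher prefers another class
print(agree, disagree)
```

In this example the loss is lower when the teacher's logits match the student's than when the teacher favors a different class.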
Mixup
The new example is created by interpolating two existing examples
Mixup means linearly interpolating two training examples and creating a new one.
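A small sketch of the interpolation on flat feature lists and one-hot labels. In the paper the mixing weight is drawn from a Beta(α, α) distribution; the default α of 0.2 and the optional `lam` argument (added here so the weight can be fixed) are assumptions of this example.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.2, lam=None):
    """Blend two examples: lam * (x_i, y_i) + (1 - lam) * (x_j, y_j)."""
    if lam is None:
        lam = random.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


# Fixed weight: 25% of the first example, 75% of the second.
x_new, y_new, _ = mixup([1.0, 1.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0], lam=0.25)
print(x_new, y_new)
```

Both the inputs and the labels are mixed with the same weight, so the new label is a soft target rather than a one-hot vector.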
Transfer Learning
Object Detection
Training refinements apply to object detection
The authors showed that the performance of Faster-RCNN on Pascal VOC improved when the refinements presented previously were added.
Semantic Segmentation
Only cosine decay applies to semantic segmentation
The authors trained a Fully Convolutional Network (FCN) on ADE20K and concluded that only the cosine learning rate decay improved the performance in this task (2).
Notes
(1) Knowledge distillation hampers performance in two of the three architectures.
According to the authors: “Our interpretation is that the teacher model is not from the same family of the student, therefore has different distribution in the prediction, and brings negative impact to the model.”
(2) Why did the other improvements not improve performance? While models trained with label smoothing, distillation and mixup favor soften labels, blurred pixel-level information may be blurred and degrade overall pixel-level accuracy.
References
Bag of Tricks for Image Classification with Convolutional Neural Networks; He et al., AWS, 2018.
Cover Image: www.
org.