The warmup strategy increases the learning rate linearly from 0 to the initial learning rate during the first N epochs (or m batches).

Keras comes with a LearningRateScheduler capable of updating the learning rate for each training epoch, but to achieve finer-grained updates for each batch you can implement a custom Keras callback. After training, warm_up_lr.learning_rates contains an array with the scheduled learning rate for each training batch, which you can visualize.

Zero γ in the last batch normalization layer of each ResNet block

Batch normalization scales a batch of inputs with γ and shifts it with β. Both γ and β are learnable parameters, whose elements are initialized to 1s and 0s respectively in Keras by default. In the zero-γ initialization heuristic, we initialize γ = 0 for the BN layer that sits at the end of each residual block. As a result, all residual blocks just return their inputs, which mimics a network with fewer layers and is easier to train in the initial stage. Given an identity ResNet block, when the last BN's γ is initialized as zero, the block only passes the shortcut inputs to downstream layers. The only change needed in a Keras implementation of this ResNet block is the line gamma_initializer='zeros' for that BatchNormalization layer.

No bias decay

The standard weight decay applies L2 regularization to all parameters, adding penalties on the layer weights to the loss function and driving their values towards 0. To avoid overfitting, it is recommended to apply the regularization to weights only; other parameters, including the biases and the γ and β in BN layers, are left unregularized. In Keras, it is effortless to apply L2 regularization to kernel weights; the bias_regularizer option is also available but not recommended.

Training Refinements

Cosine Learning Rate Decay

After the learning rate warmup stage described earlier, we typically steadily decrease its value from the initial learning
rate. Compared to some widely used strategies such as exponential decay and step decay, cosine decay decreases the learning rate slowly at the beginning, becomes almost linear in the middle, and slows down again at the end, which potentially improves the training progress.

Here is a complete example of a cosine learning rate scheduler with a warmup stage in Keras; the scheduler updates the learning rate at the granularity of every update step. You can opt to use the hold_base_rate_steps argument, which, as its name suggests, holds the base learning rate for a specific number of steps before carrying on with the cosine decay. The resulting learning rate schedule then has a plateau.

Label Smoothing

Compared to the original one-hot encoded targets, label smoothing changes the construction of the true probability to

q_i = 1 − ε if i = y, and q_i = ε / (K − 1) otherwise,

where ε is a small constant and K is the number of classes. Label smoothing encourages a finite output from the fully-connected layer, which makes the model generalize better and be less prone to overfitting. It is also an efficient and theoretically grounded solution for label noise. You can read more about the discussion here. Here is how you can apply label smoothing to one-hot labels before training a classifier.

Results

Before smoothing: [0..
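The per-batch warmup callback described earlier can be sketched as follows. This is a minimal version assuming TensorFlow's Keras; the class name WarmUpLearningRateScheduler and the recorded learning_rates list are illustrative choices, not library API.

```python
import numpy as np
from tensorflow import keras


class WarmUpLearningRateScheduler(keras.callbacks.Callback):
    """Linearly increase the learning rate from 0 to init_lr over warmup_batches."""

    def __init__(self, warmup_batches, init_lr):
        super().__init__()
        self.warmup_batches = warmup_batches
        self.init_lr = init_lr
        self.batch_count = 0
        self.learning_rates = []  # one scheduled learning rate per training batch

    def on_train_batch_begin(self, batch, logs=None):
        if self.batch_count <= self.warmup_batches:
            # Linear ramp: 0 -> init_lr over the warmup window
            lr = self.batch_count * self.init_lr / self.warmup_batches
            self.model.optimizer.learning_rate = lr
            self.learning_rates.append(lr)
        else:
            self.learning_rates.append(self.init_lr)

    def on_train_batch_end(self, batch, logs=None):
        self.batch_count += 1


# Tiny illustrative run: 8 samples, batch size 2 -> 4 warmup batches
x = np.random.rand(8, 3).astype("float32")
y = np.random.rand(8, 1).astype("float32")
inp = keras.Input(shape=(3,))
out = keras.layers.Dense(1)(inp)
model = keras.Model(inp, out)
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1), loss="mse")
warm_up_lr = WarmUpLearningRateScheduler(warmup_batches=4, init_lr=0.1)
model.fit(x, y, batch_size=2, epochs=1, callbacks=[warm_up_lr], verbose=0)
```

After fitting, warm_up_lr.learning_rates holds the per-batch schedule and can be plotted directly.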
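As a sketch of the zero-γ heuristic, here is a minimal identity ResNet block in Keras (TensorFlow backend assumed; the helper name identity_block is ours). The only change from a standard block is gamma_initializer='zeros' on the last BatchNormalization layer:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def identity_block(x, filters, kernel_size=3):
    """Identity ResNet block whose last BN starts with gamma = 0."""
    shortcut = x
    y = layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)(y)
    # The only change: gamma starts at 0, so this BN initially outputs zeros
    y = layers.BatchNormalization(gamma_initializer="zeros")(y)
    y = layers.add([shortcut, y])
    return layers.Activation("relu")(y)


# At initialization the residual branch contributes nothing, so the block
# acts as an identity map for (positive) inputs:
inputs = keras.Input(shape=(4, 4, 8))
model = keras.Model(inputs, identity_block(inputs, filters=8))
sample = np.random.rand(2, 4, 4, 8).astype("float32") + 0.1  # strictly positive
out = model.predict(sample, verbose=0)
```

Because the last BN's γ and β are both zero at initialization, the residual branch emits zeros and only the shortcut passes through, exactly as described above.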
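To illustrate no bias decay, here is a minimal Keras sketch: the L2 penalty goes on the kernel weights only, while the bias (and BN's γ and β) carry no regularizer. The layer sizes here are arbitrary.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

inputs = keras.Input(shape=(8,))
# L2 penalty on the kernel weights only; bias_regularizer is deliberately unset
outputs = layers.Dense(4, kernel_regularizer=regularizers.l2(1e-4))(inputs)
model = keras.Model(inputs, outputs)

# The regularization penalty is collected in model.losses and added to the
# training loss; it equals 1e-4 * sum(kernel ** 2) and involves no bias term.
penalty = float(model.losses[0])
kernel = model.layers[-1].kernel.numpy()
```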
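The cosine schedule with warmup and the hold_base_rate_steps plateau can be sketched as a pure function of the update step (a minimal version; a full scheduler would call this from a Keras callback on every batch):

```python
import numpy as np


def cosine_decay_with_warmup(global_step, learning_rate_base, total_steps,
                             warmup_steps=0, hold_base_rate_steps=0):
    """Per-step learning rate: linear warmup, optional hold, then cosine decay to 0."""
    if global_step < warmup_steps:
        # Linear warmup from 0 to the base learning rate
        return learning_rate_base * global_step / warmup_steps
    if global_step < warmup_steps + hold_base_rate_steps:
        # Plateau: hold the base rate before starting the decay
        return learning_rate_base
    progress = (global_step - warmup_steps - hold_base_rate_steps) / float(
        total_steps - warmup_steps - hold_base_rate_steps)
    return 0.5 * learning_rate_base * (1.0 + np.cos(np.pi * progress))
```

Evaluating it over range(total_steps) yields the full schedule array for plotting: slow decay at first, near-linear in the middle, and flattening out near the end.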
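Applying the label-smoothing construction described above to one-hot labels can be sketched in NumPy (the helper name smooth_labels is ours):

```python
import numpy as np


def smooth_labels(y_onehot, epsilon=0.1):
    """q_i = 1 - eps for the true class and eps / (K - 1) for every other class."""
    n_classes = y_onehot.shape[-1]
    return y_onehot * (1.0 - epsilon) + (1.0 - y_onehot) * epsilon / (n_classes - 1)


labels = np.array([[0.0, 0.0, 1.0, 0.0]])
smoothed = smooth_labels(labels, epsilon=0.1)
```

Each row still sums to 1, but the target for the true class drops from 1 to 1 − ε while the remaining ε of probability mass is spread evenly over the other classes.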