Simply put, after a few iterations the weights are not changing much, especially in layers two, three, and four.

Batch Normalization

Top Left → Gradient with respect to the weight at each layer
Top Right → Gradient that gets passed along to previous layers
Bottom Left → Weight at each layer
Bottom Right → Subtraction between current weight and calculated gradient

Right away we can see one dramatic difference from the network that did not have any normalization scheme: the number of non-zero gradients for each layer. This property of batch normalization, which increases the number of non-zero gradients, is the reason why training accelerates for deep neural networks. And when we view how the weights change over time, we can see that the histogram has more overall movement.

Layer Normalization

Top Left → Gradient with respect to the weight at each layer
Top Right → Gradient that gets passed along to previous layers
Bottom Left → Weight at each layer
Bottom Right → Subtraction between current weight and calculated gradient

We can observe a similar phenomenon when we use layer normalization between every layer. As the number of non-zero gradients increases, the weights of each layer are updated more frequently.

Instance Normalization

Top Left → Gradient with respect to the weight at each layer
Top Right → Gradient that gets passed along to previous layers
Bottom Left → Weight at each layer
Bottom Right → Subtraction between current weight and calculated gradient

Instance normalization standardizes every image or feature map individually, and because of this, I personally believe that the number of non-zero gradients is maximized compared to the other normalization schemes.

Box-Cox Transformation

Top Left → Gradient with respect to the weight at each layer
Top Right → Gradient that gets passed along to previous layers
Bottom Left → Weight at each layer
Bottom Right → Subtraction between current weight and calculated gradient

When compared to the network that did not have any normalization scheme, we can see that there are more
non-zero elements in the gradient with respect to each weight. However, when compared to any other normalization scheme, we can see that we still have a lot of zeros in our gradient.

Discussion

One very important thing to remember is that every one of these networks, with or without a normalization scheme, has exactly the same number of parameters, meaning their learning capacity is exactly the same. The reason is that I did not add any alpha or beta parameters to batch/layer/instance normalization, so all of the data passed through each layer has to be standardized. Knowing this, let us look at the accuracy plots.

Accuracy for Training Images

Orange → Batch Normalization
Red → Instance Normalization
Green → Layer Normalization
Purple → Box-Cox Transformation
Blue → No Normalization

When we use a normalization scheme such as batch/layer/instance normalization, we can achieve over 95% accuracy on training images by the 130th epoch. Meanwhile, the network with the Box-Cox transformation and the network without any normalization scheme struggle even to pass 60% accuracy.

Accuracy for Testing Images

Orange → Batch Normalization
Red → Instance Normalization
Green → Layer Normalization
Purple → Box-Cox Transformation
Blue → No Normalization

From the above plot we can conclude that, surprisingly, the network without any normalization scheme did the best. Taking into account that we have many more testing images than training images, this is a somewhat impressive result. The STL-10 dataset has 5000 training images and 8000 testing images, so 55 percent of 8000 images would mean 4400 images.

Additionally, we can see a pattern emerging: as the number of separate means and standard deviations we calculate decreases, testing accuracy increases. More precisely, given a 4D tensor with dimensions (20, 96, 96, 16), where the axes represent (batch size, width, height, channel), batch normalization calculates the mean with respect
to the channel dimension, so 16 values. Meanwhile, layer and instance normalization calculate the mean with respect to the batch size (20 values) and to both the batch size and channel dimensions (20 × 16 = 320 values), respectively. Knowing all of the information above, we can see that the inverse pattern is clearly present. As the number of means and standard deviations we calculate increases, the model tends to over-fit.

Why? Well, I am not an expert in machine learning, but my speculation is the gradients. More precisely, the non-zero gradients.
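
To make the axis discussion above concrete, here is a minimal NumPy sketch. It is not the code used for the experiments in this post. It standardizes a random tensor of shape (20, 96, 96, 16) over the axes each scheme is commonly defined to reduce over, with no alpha or beta parameter, and counts how many separate means each scheme ends up computing. The function name `standardize`, the `eps` value, and the random input are assumptions made purely for illustration.

```python
import numpy as np


def standardize(x, axes, eps=1e-8):
    """Standardize x to zero mean and unit variance over `axes`.

    There is no learnable scale (alpha) or shift (beta), matching the
    parameter-free setup described in the post.
    """
    mean = x.mean(axis=axes, keepdims=True)
    std = x.std(axis=axes, keepdims=True)
    return (x - mean) / (std + eps), mean


# Random stand-in for a feature map: (batch size, width, height, channel).
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(20, 96, 96, 16))

# Axes that each scheme averages over (assumed standard definitions).
schemes = {
    "batch": (0, 1, 2),   # one mean/std per channel
    "layer": (1, 2, 3),   # one mean/std per image
    "instance": (1, 2),   # one mean/std per image and channel
}

stat_counts = {}
residual_means = {}
for name, axes in schemes.items():
    normed, mean = standardize(x, axes)
    stat_counts[name] = mean.size
    # After standardization, the mean over the same axes should be ~0.
    residual_means[name] = float(np.abs(normed.mean(axis=axes)).max())
    print(f"{name:9s} mean shape={mean.shape} number of means={mean.size}")
```

Running this reports 16 means for batch normalization, 20 for layer normalization, and 320 for instance normalization, which is the ordering the pattern above refers to.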