First, neural networks are complicated functions, with lots of non-linear transformations thrown in our hypothesis function.
The resultant loss function doesn’t look a nice bowl, with only one minima we can converge to.
In fact, such nice santa-like loss functions are called convex functions (functions for which are always curving upwards) , and the loss functions for deep nets are hardly convex.
In fact, they may look like this.
In the above image, there exists a local minima where the gradient is zero.
However, we know that they are not the lowest loss we can achieve, which is the point corresponding to the global minima.
Now, if you initialze your weights at point A, then you’re gonna converge to the local minima, and there’s no way gradient descent will get you out of there, once you converge to the local minima.
Gradient descent is driven by the gradient, which will be zero at the base of any minima.
Local minimum are called so since the value of the loss function is minimum at that point in a local region.
Whereas, a global minima is called so since the value of the loss function is minimum there, globally across the entire domain the loss function.
Only to make things worse, the loss contours even may be more complicated, given the fact that 3-d contours like the one we are considering never actually happen in practice.
In practice, our neural network may have about, give or take, 1 billion weights, given us a roughly (1 billion + 1) dimensional function.
I don’t even know the number of zeros in that figure.
In fact, it’s even hard to visualize what such a high dimensional function.
However, given the sheer talent in the field of deep learning these days, people have come up with ways to visualize, the contours of loss functions in 3-D.
A recent paper pioneers a technique called Filter Normalization, explaining which is beyond the scope of this post.
However, it does give to us a view of the underlying complexities of loss functions we deal with.
For example, the following contour is a constructed 3-D representation for loss contour of a VGG-56 deep network’s loss function on the CIFAR-10 dataset.
A Complicated Loss Landscape Image Credits: https://www.
edu/~tomg/projects/landscapes/As you can see, the loss landscape is ridden with local minimum.
Challenges with Gradient Descent #2: Saddle PointsThe basic lesson we took away regarding the limitation of gradient descent was that once it arrived at a region with gradient zero, it was almost impossible for it to escape it regardless of the quality of the minima.
Another sort of problem we face is that of saddle points, which look like this.
You can also see a saddle point in the earlier pic where two “mountains” meet.
A saddle point gets it’s name from the saddle of a horse with which it resembles.
While it’s a minima in one direction (x), it’s a local maxima in another direction, and if the contour is flatter towards the x direction, GD would keep oscillating to and fro in the y — direction, and give us the illusion that we have converged to a minima.
Randomness to the rescue!So, how do we go about escaping local minima and saddle points, while trying to converge to global minima.
The answer is randomness.
Till now we were doing gradient descent with the loss function that had been created by summing loss over all possible examples of the training set.
If we get into local minima or saddle point, we are stuck.
A way to help GD escape these is to use what is called Stochastic Gradient Descent.
In stochastic gradient descent, instead of taking a step by computing the gradient of the loss function creating by summing all the loss functions, we take a step by computing the gradient of the loss of only one randomly sampled (without replacement) example.
In contrast to Stochastic Gradient Descent, where each example is stochastically chosen, our earlier approach processed all examples in one single batch, and therefore, is known as Batch Gradient Descent.
The update rule is modified accordingly.
Update Rule For Stochastic Gradient DescentThis means, at every step, we are taking the gradient of a loss function, which is different from our actual loss function (which is a summation of loss of every example).
The gradient of this “one-example-loss” at a particular may actually point in a direction slighly different to the gradient of “all-example-loss”.
This also means, that while the gradient of the “all-example-loss” may push us down a local minima, or get us stuck at a saddle point, the gradient of “one-example-loss” might point in a different direction, and might help us steer clear of these.
One could also consider a point that is a local minima for the “all-example-loss”.
If we’re doing Batch Gradient Descent, we will get stuck here since the gradient will always point to the local minima.
However, if we are using Stochastic Gradient Descent, this point may not lie around a local minima in the loss contour of the “one-example-loss”, allowing us to move away from it.
Even if we get stuck in a minima for the “one-example-loss”, the loss landscape for the “one-example-loss” for the next randomly sampled data point might be different, allowing us to keep moving.
When it does converge, it converges to a point that is a minima for almost all the “one-example-losses”.
It’s also been emperically shown the saddle points are extremely unstable, and a slight nudge may be enough to escape one.
So, does this mean in practice, should be always perform this one-example stochastic gradient descent?Batch SizeThe answer is no.
Though from a theoretical standpoint, stochastic gradient descent might give us the best results, it’s not a very viable option from a computational stand point.
When we perform gradient descent with a loss function that is created by summing all the individual losses, the gradient of the individual losses can be calculated in parallel, whereas it has to calculated sequentially step by step in case of stochastic gradient descent.
So, what we do is a balancing act.
Instead of using the entire dataset, or just a single example to construct our loss function, we use a fixed number of examples say, 16, 32 or 128 to form what is called a mini-batch.
The word is used in contrast with processing all the examples at once, which is generally called Batch Gradient Descent.
The size of the mini-batch is chosen as to ensure we get enough stochasticity to ward off local minima, while leveraging enough computation power from parallel processing.
Local Minima Revisited: They are not as bad as you thinkBefore you antagonize local minima, recent research has shown that local minima is not necessarily bad.
In the loss landscape of a neural network, there are just way too many minimum, and a “good” local minima might perform just as well as a global minima.
Why do I say “good”?.Because you could still get stuck in “bad” local minima which are created as a result of erratic training examples.
“Good” local minima, or often referred to in the literature as optimal local minima, can exist in considerable numbers given a neural network’s high dimensional loss function.
It might also be noted that a lot of neural networks perform classification.
If a local minima corresponds to it producing scores between 0.
8 for the correct labels, while the global minima has it producing scores between 0.
98 for the correct labels for same examples, the output class prediction is going to be same for both.
A desirable property of a minima should be it that it should be on the flatter side.
Why?.Because flat minimum are easy to converge to, given there’s less chance to overshoot the minima, and be bouncing between the ridges of the minima.
More importantly, we expect the loss surface of the test set to be slightly different from that of the training set, on which we do our training.
For a flat and wide minima, the loss won’t change much due to this shift, but this is not the case for narrow minima.
The point that we are trying to make is flatter minima generalise better and are thus desirable.
Learning Rate RevisitedRecently, there has been a surge in research on learning rate scheduling to account for sub-optimal minima in the loss landscape.
Even with a decaying learning rate, one can get stuck in a local minima.
Traditionally, either the training is done for a fixed number of iterations, or it can be stopped after, say, 10 iterations after the loss doesn’t improve.
This has been called early stopping in literature.
Having a fast learning rate also helps us scoot over local minimum earlier in training.
People have also combined early stopping with learning rate decay, where learning rate is decayed after every time the loss fails to improve after 10 iterations, eventually stopping after the rate is below some decided threshold.
In recent years, cyclic learning rates have become popular, in which the learning rate is slowly increased, and then decreased, and this is continued in a cyclic fashion.
Something called stochastic gradient descent with warm restarts basically anneals the learning rate to a lower bound, and then restores the learning rate to it’s original value.
We also have different schedules as to how the learning rates decline, from exponential decay to cosine decay.
A very recent paper introduces a technique called Stochastic Weight Averaging.
The authors develop an approach where they first converge to a minima, cache the weights and then restore the learning rate to a higher value.
This higher learning rate then propels the algorithm out of the minima to a random point in the loss surface.
Then the algorithm is made to converge again to another minima.
This is repeated for a few times.
Finally, they average the predictions made by all the set of cached weights to produce the final prediction.
A technique called Stochastic Weight AveragingConclusionSo, this was the introductory post on gradient descent, that has been the working horse for deep learning optimization since the seminal paper on backpropagation that showed you could train neural nets by computing gradients.
However, there’s still one missing block about Gradient descent that we haven’t talked about in this post, and that is addressing the problem of pathological curvature.
Extensions to vanilla Stochastic Gradient Descent, like Momentum, RMSProp and Adam are used to overcome that vital problem.