In the next section, we will see how backpropagation helps us deal with this problem.
Quick Review of Gradient Descent
The gradient of a function is the vector whose elements are its partial derivatives with respect to each parameter.
For example, if we were trying to minimize a cost function, C(B0, B1), with just two changeable parameters, B0 and B1, the gradient would be:
Gradient of C(B0, B1) = [ [dC/dB0], [dC/dB1] ]
So each element of the gradient tells us how the cost function would change if we applied a small change to that particular parameter, so we know what to tweak and by how much.
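To make the "small change" idea concrete, here is a minimal sketch that estimates each element of the gradient by nudging one parameter at a time. The cost function below is a made-up bowl-shaped example (it is not from any real model), and the names `cost` and `numerical_gradient` are hypothetical.

```python
def cost(b0, b1):
    # Hypothetical cost function, chosen only for illustration.
    # Its minimum sits at b0 = 3, b1 = -1.
    return (b0 - 3.0) ** 2 + 2.0 * (b1 + 1.0) ** 2


def numerical_gradient(f, b0, b1, eps=1e-6):
    # Central differences: nudge one parameter up and down by eps
    # while holding the other fixed, and measure how the cost responds.
    d_b0 = (f(b0 + eps, b1) - f(b0 - eps, b1)) / (2 * eps)
    d_b1 = (f(b0, b1 + eps) - f(b0, b1 - eps)) / (2 * eps)
    return [d_b0, d_b1]


# Gradient at the "current location" B0 = 0, B1 = 0.
grad = numerical_gradient(cost, 0.0, 0.0)
print(grad)
```

For this particular cost the exact partial derivatives at (0, 0) are -6 and 4, so the first element says "increasing B0 lowers the cost" and the second says "increasing B1 raises it".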
To summarize, we can march towards the minimum by following these steps:
Illustration of Gradient Descent
1. Compute the gradient of our “current location” (calculate the gradient using our current parameter values).
2. Modify each parameter by an amount proportional to its gradient element and in the opposite direction of its gradient element. For example, if the partial derivative of our cost function with respect to B0 is positive but tiny and the partial derivative with respect to B1 is negative and large, then we want to decrease B0 by a tiny amount and increase B1 by a large amount to lower our cost function.
3. Recompute the gradient using our new tweaked parameter values and repeat the previous steps until we arrive at the minimum.
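The steps above can be sketched as a short loop. This is only an illustration under stated assumptions: the data points, the mean squared error cost, the learning rate, and the step count are all made up for the example, and `gradient` and `gradient_descent` are hypothetical helper names.

```python
# Toy data generated from y = 1 + 2x, so the best fit is B0 = 1, B1 = 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def gradient(b0, b1):
    # Partial derivatives of the mean squared error of the line b0 + b1 * x.
    n = len(xs)
    d_b0 = sum(2 * (b0 + b1 * x - y) for x, y in zip(xs, ys)) / n
    d_b1 = sum(2 * (b0 + b1 * x - y) * x for x, y in zip(xs, ys)) / n
    return d_b0, d_b1


def gradient_descent(b0=0.0, b1=0.0, learning_rate=0.05, steps=2000):
    for _ in range(steps):
        # Step 1: compute the gradient at the current location.
        d_b0, d_b1 = gradient(b0, b1)
        # Step 2: move each parameter against its gradient element,
        # by an amount proportional to that element.
        b0 -= learning_rate * d_b0
        b1 -= learning_rate * d_b1
        # Step 3: the loop repeats with the tweaked parameters.
    return b0, b1


b0, b1 = gradient_descent()
print(b0, b1)
```

A fixed number of steps is used here for simplicity; a more careful version might stop once the gradient becomes very small.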
Backpropagation
I will defer to this great textbook (online and free!) for the detailed math (if you want to understand neural networks more deeply, definitely check it out).
Instead we will do our best to build an intuitive understanding of how and why backpropagation works.
Remember that forward propagation is the process of moving forward through the neural network (from inputs to the ultimate output or prediction).
Backpropagation is the reverse, except that instead of signal, we are moving error backwards through our model.
Some simple visualizations helped a lot when I was trying to understand the backpropagation process.
Below is my mental picture of a simple neural network as it forward propagates from input to output.
The process can be summarized by the following steps:
1. Inputs are fed into the blue layer of neurons and modified by the weights, bias, and sigmoid in each neuron to get the activations. For example: Activation_1 = Sigmoid( Bias_1 + W1*Input_1 )
2. Activation 1 and Activation 2, which come out of the blue layer, are fed into the magenta neuron, which uses them to produce the final output activation.
And the objective of forward propagation is to calculate the activations at each neuron for each successive hidden layer until we arrive at the output.
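Here is a minimal sketch of that forward pass for the small network described above: two blue neurons feeding one magenta neuron. Only the formula for Activation_1 comes from the text; the second blue neuron and the magenta neuron are assumed to follow the same pattern, and every parameter value (and every name other than Bias_1, W1, and Input_1) is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(input_1, input_2, p):
    # Blue layer: each neuron applies its weight, bias, and sigmoid.
    activation_1 = sigmoid(p["bias_1"] + p["w1"] * input_1)
    activation_2 = sigmoid(p["bias_2"] + p["w2"] * input_2)  # assumed to mirror neuron 1
    # Magenta neuron: combines both blue activations into the output.
    output = sigmoid(p["bias_3"] + p["w3"] * activation_1 + p["w4"] * activation_2)
    return activation_1, activation_2, output


# Made-up parameter values, purely for illustration.
params = {"bias_1": 0.1, "w1": 0.8, "bias_2": -0.2, "w2": 0.5,
          "bias_3": 0.3, "w3": 1.2, "w4": -0.7}

activation_1, activation_2, output = forward(0.5, -1.0, params)
print(activation_1, activation_2, output)
```

Because every neuron ends in a sigmoid, each activation (including the final output) lands strictly between 0 and 1.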
Forward propagation in a neural network
Now let’s just reverse it.
If you follow the red arrows (in the picture below), you will notice that we are now starting at the output of the magenta neuron.
That is our output activation, which we use to make our prediction, and the ultimate source of error in our model.
We then move this error backwards through our model via the same weights and connections that we use for forward propagating our signal (so instead of Activation 1, now we have Error1, the error attributable to the top blue neuron).
Remember we said that the goal of forward propagation is to calculate neuron activations layer by layer until we get to the output? We can now state the objective of backpropagation in a similar manner:
We want to calculate the error attributable to each neuron (I will just refer to this error quantity as the neuron’s error because saying “attributable” again and again is no fun), starting from the layer closest to the output all the way back to the starting layer of our model.
Backpropagation in a neural network
So why do we care about the error for each neuron? Remember that the two building blocks of a neural network are the connections that pass signals into a particular neuron (with a weight living in each connection) and the neuron itself (with a bias).
These weights and biases across the entire network are also the dials that we tweak to change the predictions made by the model.
This part is really important:
The magnitude of the error of a specific neuron (relative to the errors of all the other neurons) is directly proportional to the impact of that neuron’s output (a.k.a. activation) on our cost function.
So the error of each neuron is a proxy for the partial derivative of the cost function with respect to that neuron’s weighted input (its bias plus its weighted incoming signals, before the sigmoid is applied).
This makes intuitive sense — if a particular neuron has a much larger error than all the other ones, then tweaking the weights and bias of our offending neuron will have a greater impact on our model’s total error than fiddling with any of the other neurons.
And the partial derivatives with respect to each weight and bias are the individual elements that compose the gradient vector of our cost function.
So basically backpropagation allows us to calculate the error attributable to each neuron and that in turn allows us to calculate the partial derivatives and ultimately the gradient so that we can utilize gradient descent.
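The chain from neuron errors to the gradient can be sketched in code for the same small network (two blue neurons, one magenta neuron). This is a sketch under assumptions, not the definitive algorithm: it assumes a quadratic cost, 0.5 * (output - target)^2, sigmoid activations throughout, and made-up parameter values. Here each neuron's error is taken to be the partial derivative of the cost with respect to that neuron's weighted input, which is the definition used in Nielsen's textbook.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x1, x2, p):
    a1 = sigmoid(p["bias_1"] + p["w1"] * x1)
    a2 = sigmoid(p["bias_2"] + p["w2"] * x2)
    out = sigmoid(p["bias_3"] + p["w3"] * a1 + p["w4"] * a2)
    return a1, a2, out


def cost(x1, x2, target, p):
    # Assumed quadratic cost for a single training example.
    return 0.5 * (forward(x1, x2, p)[2] - target) ** 2


def backprop(x1, x2, target, p):
    a1, a2, out = forward(x1, x2, p)
    # Error of the magenta (output) neuron. The factor out * (1 - out)
    # is the derivative of the sigmoid.
    err_out = (out - target) * out * (1 - out)
    # Move the error backwards through the same weights used going forward.
    err_1 = p["w3"] * err_out * a1 * (1 - a1)
    err_2 = p["w4"] * err_out * a2 * (1 - a2)
    # Bias gradient = the neuron's error.
    # Weight gradient = the neuron's error times the signal on that connection.
    return {
        "bias_3": err_out, "w3": err_out * a1, "w4": err_out * a2,
        "bias_1": err_1, "w1": err_1 * x1,
        "bias_2": err_2, "w2": err_2 * x2,
    }


# Made-up parameters and a single made-up training example.
params = {"bias_1": 0.1, "w1": 0.8, "bias_2": -0.2, "w2": 0.5,
          "bias_3": 0.3, "w3": 1.2, "w4": -0.7}
grads = backprop(0.5, -1.0, 1.0, params)
print(grads)
```

The dictionary returned by `backprop` holds one partial derivative per weight and bias, which is exactly the gradient vector that gradient descent needs.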
Hurray!
An Analogy that Helps: The Blame Game
That’s a lot to digest so hopefully this analogy will help.
Almost everyone has had a terrible colleague at some point in his or her life — someone who would always play the blame game and throw coworkers or subordinates under the bus when things went wrong.
Well neurons, via backpropagation, are masters of the blame game.
When the error gets backpropagated to a particular neuron, that neuron will quickly and efficiently point the finger at the upstream colleague (or colleagues) most at fault for causing the error (i.e., layer 4 neurons would point the finger at layer 3 neurons, layer 3 neurons at layer 2 neurons, and so forth).
Neurons blame the most active upstream neurons
And how does each neuron know who to blame, as the neurons cannot directly observe the errors of other neurons? They just look at who sent them the most signal in terms of the highest and most frequent activations.
Just like in real life, the lazy ones that play it safe (low and infrequent activations) skate by blame free while the neurons that do the most work get blamed and have their weights and biases modified.
Cynical, yes, but also very effective for getting us to the set of weights and biases that minimizes our cost function.
To the left is a visual of how the neurons throw each other under the bus.
And that in a nutshell is the intuition behind the backpropagation process.
In my opinion, these are the three key takeaways for backpropagation:
1. It is the process of shifting the error backwards layer by layer and attributing the correct amount of error to each neuron in the neural network.
2. The error attributable to a particular neuron is a good approximation for how changing that neuron’s weights (from the connections leading into the neuron) and bias will affect the cost function.
3. When looking backwards, the more active neurons (the non-lazy ones) are the ones that get blamed and tweaked by the backpropagation process.
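The third takeaway has a simple arithmetic form: the gradient for a connection's weight is the downstream neuron's error multiplied by the activation that travelled along that connection. The sketch below uses made-up numbers and a hypothetical helper, `weight_gradients`, to show that for a single blamed neuron the most active upstream neuron receives the largest weight adjustment.

```python
def weight_gradients(neuron_error, upstream_activations):
    # Partial derivative of the cost with respect to each incoming weight:
    # the receiving neuron's error times the signal sent on that connection.
    return [neuron_error * a for a in upstream_activations]


# Hypothetical activations from three upstream neurons: one very active,
# one moderate, one "lazy".
upstream = [0.95, 0.50, 0.02]
grads = weight_gradients(0.2, upstream)

# Index of the upstream neuron whose connection gets tweaked the most.
most_blamed = max(range(len(grads)), key=lambda i: abs(grads[i]))
print(grads, most_blamed)
```

Note that this covers only the weight updates. The error handed further back to each upstream neuron is also scaled by the connection's weight and by the slope of that neuron's sigmoid, so "the most active neuron gets the most blame" is a useful simplification rather than the full rule.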
Tying it All Together
If you have read all the way here, then you have my gratitude and admiration (for your persistence).
We started with a question, “What makes deep learning special?” I will attempt to answer that now (mainly from the perspective of basic neural networks and not their more advanced cousins like CNNs, RNNs, etc.).
In my humble opinion, the following aspects make neural networks special:
Each neuron is its own miniature model with its own bias and set of incoming features and weights.
Each individual model/neuron feeds into numerous other individual neurons across all the hidden layers of the model.
So we end up with models plugged into other models in a way where the whole is greater than the sum of its parts.
This allows neural networks to fit all the nooks and crannies of our data including the nonlinear parts (but beware overfitting — and definitely consider regularization to protect your model from underperforming when confronted with new and out of sample data).
The versatility of the many-interconnected-models approach and the ability of the backpropagation process to efficiently set the weights and biases of each model let the neural network robustly “learn” from data in ways that many other algorithms cannot.
Author’s Note: Neural networks and deep learning are extremely complicated subjects.
I am still early in the process of learning about them.
This blog was written as much to develop my own understanding as it was to help you, the reader.
I look forward to all of your comments, suggestions, and feedback.
Cheers!
Sources:
Neural Networks and Deep Learning by Michael A. Nielsen
Wikipedia: Backpropagation