Backpropagation for people who are afraid of math

We will go over each expression.

When updating and optimizing the network’s weights, we are trying to find the derivative of the loss with regard to the weights (“how does a change to the weights affect the loss”). Using the chain rule, we divide this task into three:

1. The derivative of the loss with regard to the next layer. This is the loss “passed upstream” from the output layer, or in other words, “how does a change to the next layer affect the loss”.

2. The derivative of the next layer with regard to the current layer (which can be interpreted as “how does a change in the current layer affect the next layer”), which is simply the weights connecting to the next layer multiplied by the derivative of its activation function.

3. The derivative of the current layer with regard to its weights (“how does a change to the weights affect the current layer”), which is the previous layer’s O values multiplied by the current layer’s activation function derivative.
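To make the three factors concrete, here is a minimal NumPy sketch for a single instance. It assumes sigmoid activations and a squared loss; the layer sizes and all variable names (`o_prev`, `w_cur`, `w_next` and so on) are made up for illustration and do not come from the diagrams. The product of the three factors is compared against a finite-difference estimate of the same gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical sizes: previous layer 3 neurons, current layer 2, next (output) layer 1.
o_prev = rng.random(3)              # O values of the previous layer
w_cur = rng.normal(size=(3, 2))     # weights feeding the current layer
w_next = rng.normal(size=(2, 1))    # weights connecting to the next layer
label = np.array([1.0])

def forward(w):
    """Forward pass with w as the current layer's weights (no bias)."""
    o_cur = sigmoid(o_prev @ w)
    o_next = sigmoid(o_cur @ w_next)
    loss = 0.5 * np.sum((o_next - label) ** 2)
    return o_cur, o_next, loss

o_cur, o_next, loss = forward(w_cur)

# 1. Derivative of the loss with regard to the next layer (squared loss).
d_loss_d_next = o_next - label                                        # shape (1,)
# 2. Next layer with regard to the current layer: the connecting weights
#    times the next layer's activation derivative, folded into the product.
d_loss_d_cur = (d_loss_d_next * o_next * (1 - o_next)) @ w_next.T     # shape (2,)
# 3. Current layer with regard to its weights: the previous layer's O values
#    times the current layer's activation derivative.
grad_w_cur = np.outer(o_prev, d_loss_d_cur * o_cur * (1 - o_cur))     # shape (3, 2)

# Finite-difference estimate of the same gradient, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(w_cur)
for i in range(w_cur.shape[0]):
    for k in range(w_cur.shape[1]):
        bump = np.zeros_like(w_cur)
        bump[i, k] = eps
        numeric[i, k] = (forward(w_cur + bump)[2] - forward(w_cur - bump)[2]) / (2 * eps)

gradients_match = np.allclose(grad_w_cur, numeric, atol=1e-7)
```

Comparing against finite differences like this is slow, but it is a handy way to confirm that the chain-rule product really is the derivative of the loss with regard to the weights.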

To make things clearer, I’ve written the actual calculations, color coded to our network, for the last two layers of our network, L100 and L99.

Notice that the derivative term related to each calculation appears below it.

The two derivatives associated with the loss, appearing in red, are of utmost importance, as they are used in the calculations for the previous layers.

This can be seen clearly in the next diagram:

Notice how ∂Loss propagates down the layers.

Looking at this pattern, you should start seeing how this could be implemented in code.

Also notice that I’ve highlighted the last two layers, which form the aforementioned Perceptron.

Note that I didn’t show the multiplication of the entire expression by the learning rate (α) in this diagram, as it made the diagram too crowded and overshadowed the take-home message, which is the application of the chain rule.

You should definitely play with different α values to get the best performance.

In any case, α does appear in the next diagram.

Backpropagating over a batch of instances

An important point to notice is that each layer in the schematic representations we saw is in fact a vector, representing the computations done for a single instance.

Usually we would like to input a batch of instances into the network.

This will be clearer after going over the next diagram, which shows the calculation for a batch of n instances.

Notice that this is the same exact network (5 neurons for layer L95, 2 neurons for layer L96 and so on…), only that we are now looking at n instances and not just one.

The most challenging part for me, when implementing backpropagation, was to get the sizes of the different layers and weights and gradient matrices to play nice.

This illustration aims to set things in order.

On top you will see the schematic network.

The actual size of n makes no difference to the calculations (obviously it does make a difference in the larger scheme of things…). As you will notice, the matrix multiplications that produce the weight gradients while backpropagating always sum over n.

That is to say, the n dimension is “lost” during matrix multiplication.

And this is exactly what we want: to sum the loss gradient over all instances in our batch.
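A small NumPy sketch of this point, with made-up layer sizes (k = 5, j = 2) and random stand-ins for the values that would come out of a real forward and backward pass. Whatever n is, the weight gradient comes out with the shape of the weight matrix, and it equals the sum of the per-instance contributions.

```python
import numpy as np

rng = np.random.default_rng(1)
k, j = 5, 2   # hypothetical layer sizes

for n in (1, 4, 32):
    o_prev = rng.random((n, k))        # previous layer's O values, one row per instance
    delta = rng.normal(size=(n, j))    # gradient arriving at the current layer, one row per instance
    grad_w = o_prev.T @ delta          # (k, n) @ (n, j) -> (k, j): n is summed away
    # The same quantity written as an explicit sum of per-instance outer products.
    summed = sum(np.outer(o_prev[i], delta[i]) for i in range(n))
    assert grad_w.shape == (k, j)
    assert np.allclose(grad_w, summed)
```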

The rest of the diagram is divided into two sections: the forward pass and backpropagation.

The forward pass should be pretty obvious to most of you.

If it’s not, I would recommend reading about matrix multiplications before moving any further.

One thing I will point out is that each weight matrix takes a layer of size (n,k) and outputs a layer of size (n,j).

Such a weight matrix will be of size (k,j).
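The shape rule can be checked with a short forward-pass sketch. The architecture below (5, 2, 3 and 1 neurons) and the helper name `forward` are assumptions for illustration only, the activation is assumed to be the sigmoid, and the bias is left out just as in the diagram.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Return the O values of every layer, starting with the input itself."""
    outputs = [x]
    for w in weights:
        outputs.append(sigmoid(outputs[-1] @ w))   # (n, k) @ (k, j) -> (n, j)
    return outputs

rng = np.random.default_rng(2)
layer_sizes = [5, 2, 3, 1]   # hypothetical architecture
weights = [rng.normal(size=(k, j)) for k, j in zip(layer_sizes[:-1], layer_sizes[1:])]

x = rng.random((4, layer_sizes[0]))    # a batch of n = 4 instances
outputs = forward(x, weights)
shapes = [o.shape for o in outputs]    # [(4, 5), (4, 2), (4, 3), (4, 1)]
```

Every layer keeps n = 4 rows, while the number of columns follows the weight matrices.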

You will probably notice that this diagram is missing the bias unit.

That is because I wanted to keep it as clear as possible, and focus on how the different matrix sizes fit into the backpropagation process.

There is a short section on adding the bias unit below.

The backpropagation part is a “bit” trickier… :)

This section of the diagram is divided into three subsections:

1. Variables

Here I list the different elements of the calculation and, most importantly, their shapes.

A few notes on this part:

Z refers to the layer’s values before activation.

O refers to the layer’s values after activation.

σ refers to the activation function.

g’ refers to the derivative of the activation function.

Notice that this section groups the variables that make up the two loss-related derivatives at the top (highlighted in red), and those that make up the derivative of the current layer with regard to its weights at the bottom (highlighted in blue).

2. Calculation

This is where all the drama takes place.

There is actually nothing new here.

These are the same exact calculations seen on the previous diagrams, but with matrix sizes clearly written and depicted.

Also, the diagram clearly states when to use element-wise multiplication and when to use matrix multiplication, as well as when you need to transpose a matrix.

Matrix multiplication is denoted as @, which is NumPy’s operator for it (equivalent to numpy.dot for 2-D arrays).

You will see it in action in the code section below.

The diagram and the following code assume a squared loss function.

Its derivative with regard to the output is output − labels (taking the loss as ½(output − labels)², so that the factor of 2 cancels).
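As a rough illustration of where each kind of multiplication shows up, here is one backward step through an output layer in NumPy. It assumes a sigmoid activation and the squared loss described above; the sizes and the names `delta`, `grad_w_out` and `d_loss_prev` are invented for this sketch and are not taken from the diagram.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n, k, j = 4, 3, 2                  # hypothetical batch size and layer sizes
o_prev = rng.random((n, k))        # O values of the layer below the output layer
w_out = rng.normal(size=(k, j))    # weights into the output layer
labels = rng.random((n, j))

output = sigmoid(o_prev @ w_out)   # forward: (n, k) @ (k, j) -> (n, j)

d_loss = output - labels           # squared-loss derivative, shape (n, j)
g_prime = output * (1 - output)    # sigmoid derivative computed from the O values, (n, j)
delta = d_loss * g_prime           # element-wise: both operands are (n, j)

grad_w_out = o_prev.T @ delta      # matrix multiplication with a transpose: (k, n) @ (n, j) -> (k, j)
d_loss_prev = delta @ w_out.T      # gradient handed to the layer below: (n, j) @ (j, k) -> (n, k)
```

Element-wise multiplication pairs up arrays of identical shape, while the two matrix multiplications each need a transpose so that the inner dimensions line up.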

3. Weight update

All that’s left is to update our weights by adding ∆w to each weight matrix.

Notice that we only perform the updates once we have finished backpropagating.

Code

A few notes about this section:

For readability purposes, the code presented here is pseudocode.

As this post is meant to be used as a practical guide, I encourage you to go over the diagrams and try to write your own implementation before looking at the code example.

The diagrams have all the information you will need to successfully build it yourself.

Look at the diagrams and see how the gradient is passed from one layer to the next.

Make sure you understand what is multiplied by what, and what axes are summed over.

See how we obtain a ∆W matrix in the shape fitting our layer’s weights in each iteration.

There are plenty of solutions online.

This one I highly recommend as it is very simple to understand.

My own implementation was largely based on it.

The code assumes using the sigmoid activation function.

Due to the fact that the derivative of the sigmoid function (σ(z) *(1-σ(z))) requires only the O values (which are of course σ(z)), we don't need the neuron's values before activation (Z).

For implementation using different activation functions, you will need to save the Z values when doing a forward pass.
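For reference, here is a minimal loop-based sketch in NumPy that follows the notes above. It is not the original listing from this post: it assumes sigmoid activations, a squared loss and no bias, and the function names (`forward`, `backward`, `train_step`) and the 3-4-1 architecture are made up for illustration. It also assumes that ∆W is defined as −α times the gradient, so that the update really is an addition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Return the O values of every layer, starting with the input itself."""
    outputs = [x]
    for w in weights:
        outputs.append(sigmoid(outputs[-1] @ w))
    return outputs

def backward(outputs, weights, labels, alpha):
    """Return one delta-W per weight matrix, without modifying the weights."""
    delta_ws = [None] * len(weights)
    d_loss = outputs[-1] - labels                  # squared-loss derivative
    for i in reversed(range(len(weights))):
        o_cur, o_prev = outputs[i + 1], outputs[i]
        delta = d_loss * o_cur * (1 - o_cur)       # element-wise; sigmoid derivative from O
        delta_ws[i] = -alpha * (o_prev.T @ delta)  # same shape as weights[i]
        d_loss = delta @ weights[i].T              # uses the non-updated weights
    return delta_ws

def train_step(x, labels, weights, alpha=0.1):
    outputs = forward(x, weights)
    delta_ws = backward(outputs, weights, labels, alpha)
    # The weights are updated only after the whole backward pass has finished.
    return [w + dw for w, dw in zip(weights, delta_ws)]

rng = np.random.default_rng(4)
sizes = [3, 4, 1]   # hypothetical architecture
weights = [rng.normal(scale=0.5, size=(k, j)) for k, j in zip(sizes[:-1], sizes[1:])]
x = rng.random((8, 3))          # n = 8 made-up instances
labels = rng.random((8, 1))
new_weights = train_step(x, labels, weights)
```

Calling `train_step` repeatedly and printing the loss is a simple way to watch whether the network converges. For activation functions other than the sigmoid, `forward` would also need to return the Z values.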

Using a loop

Using recursion

Adding a bias unit

You may have noticed the previous diagrams were missing the bias units.

I chose to leave out the bias from these diagrams as I wanted to keep them as simple and intuitive as possible, but you should definitely consider adding it!

You can add a bias “manually” for each layer, and then calculate the derivative of the loss with respect to that bias: we already know how to calculate the derivative of the loss with respect to the layer, and the derivative of the layer with respect to its bias is just the activation function derivative of the current layer.

You could also add the bias to the weights matrix.

This basically just means appending a vector of bias neurons (a vector of 1’s) to each layer, and initializing the shapes of the weight matrices accordingly (just like you do in simple linear regression).

The one thing to keep in mind though, is that the bias units themselves should never be updated in the forward pass, as they are connected to the neurons of the next layer, but NOT to the neurons of the previous layer (See diagram).

One way you can approach this is to avoid updating these neurons, but this can get tricky, especially in the backward pass (Full disclosure, this is what I did, and I don’t recommend it…).

A simpler solution is to do the forward and backward pass normally, but re-initialize the bias neurons to 1 after each layer update.
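The “manual” option can be sketched as follows for a single layer. To keep the example short, the layer is treated as the output layer, with a sigmoid activation and a squared loss; the sizes and names (`b`, `grad_b`, `layer_loss`) are assumptions made for this sketch. The bias gradient is the loss derivative times the activation derivative, summed over the batch, and it is checked against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
n, k, j = 6, 3, 2                  # hypothetical batch size and layer sizes
o_prev = rng.random((n, k))
w = rng.normal(size=(k, j))
b = np.zeros(j)                    # one bias per neuron in the current layer
labels = rng.random((n, j))

def layer_loss(w, b):
    o_cur = sigmoid(o_prev @ w + b)    # b is broadcast across the n instances
    return o_cur, 0.5 * np.sum((o_cur - labels) ** 2)

o_cur, loss = layer_loss(w, b)
delta = (o_cur - labels) * o_cur * (1 - o_cur)   # loss derivative times activation derivative
grad_w = o_prev.T @ delta          # (k, j), as before
grad_b = delta.sum(axis=0)         # (j,): summed over the batch

# Finite-difference estimate of the bias gradient, one bias at a time.
eps = 1e-6
numeric_b = np.array([
    (layer_loss(w, b + eps * np.eye(j)[m])[1]
     - layer_loss(w, b - eps * np.eye(j)[m])[1]) / (2 * eps)
    for m in range(j)
])
bias_gradient_matches = np.allclose(grad_b, numeric_b, atol=1e-7)
```

Keeping the bias as a separate vector like this avoids having to reset any bias neurons to 1 after a weight update.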

Some useful tips

Do NOT update the weights while backpropagating! Remember that the next iteration (the previous layer) will need these (non-updated) weights to compute its gradient.

You can either save the ∆w’s and update the weights at the end of the backpropagation part (like I’m doing in the code examples), or constantly update the weights two layers ahead, which in my opinion is confusing and overly complicated.

If a layer is not activated by a nonlinear function (for example the output layer), the derivative of its activation is just 1.

The fact that your program doesn’t crash, doesn’t mean that it works.

Make sure your network converges, and that the loss decreases.

The fact that your network converges and your loss decreases, doesn’t mean it’s working optimally.

Compare your results to other implementations.

Play around with the learning rate and the structure of the network.

Try different initialization methods for the weights.

This can have a huge effect on performance.

Summary

Backpropagation can be a tough nut to crack, but if you wish to have a good understanding of how neural networks work, you should avoid jumping into higher level solutions such as TensorFlow or PyTorch before implementing a simple network yourself.

This is the basis for all deep learning and is crucial for successfully working with more complicated networks.

It’s also fun (when it works).

Good luck!
