Time to get into it.

We’ll pick back up where my introduction to CNNs left off.

We were using a CNN to tackle the MNIST handwritten digit classification problem:

[Figure: Sample images from the MNIST dataset]

Our (simple) CNN consisted of a Conv layer, a Max Pooling layer, and a Softmax layer.

Here’s that diagram of our CNN again:

[Figure: Our CNN takes a 28×28 grayscale MNIST image and outputs 10 probabilities, 1 for each digit.]

We’d written 3 classes, one for each layer: Conv3x3, MaxPool, and Softmax.

Each class implemented a forward() method that we used to build the forward pass of the CNN.

You can view the code or run the CNN in your browser.

It’s also available on Github.

Here’s what the output of our CNN looks like right now:

MNIST CNN initialized!
[Step 100] Past 100 steps: Average Loss 2.302 | Accuracy: 11%
[Step 200] Past 100 steps: Average Loss 2.302 | Accuracy: 8%
[Step 300] Past 100 steps: Average Loss 2.302 | Accuracy: 3%
[Step 400] Past 100 steps: Average Loss 2.302 | Accuracy: 12%

Obviously, we’d like to do better than 10% accuracy… let’s teach this CNN a lesson.

2. Training Overview

Training a neural network typically consists of two phases:

- A forward phase, where the input is passed completely through the network.
- A backward phase, where gradients are backpropagated (backprop) and weights are updated.

We’ll follow this pattern to train our CNN.

There are also two major implementation-specific ideas we’ll use:

- During the forward phase, each layer will cache any data (like inputs, intermediate values, etc.) it’ll need for the backward phase. This means that any backward phase must be preceded by a corresponding forward phase.
- During the backward phase, each layer will receive a gradient and also return a gradient. It will receive the gradient of loss with respect to its outputs (∂L / ∂out) and return the gradient of loss with respect to its inputs (∂L / ∂in).

These two ideas will help keep our training implementation clean and organized.

The best way to see why is probably by looking at code.

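Training our CNN will ultimately be a short sequence of calls: feed the image forward through the Conv, Max Pooling, and Softmax layers, calculate the initial gradient, and then backprop through the same layers in reverse order.

See how nice and clean that is? Now imagine building a network with 50 layers instead of 3. It’s even more valuable then to have good systems in place.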
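To make the forward-then-backward pattern concrete, here is a minimal sketch. It is not the CNN from this post: `Scale` is a hypothetical stand-in layer and `train_step` is a hypothetical helper, both invented only to show forward caching and the gradient being handed from layer to layer in reverse.

```python
import numpy as np


class Scale:
    """Hypothetical stand-in layer: multiplies its input by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def forward(self, x):
        # Cache the input during the forward phase, as a real layer would.
        self.last_input = x
        return x * self.factor

    def backprop(self, d_L_d_out):
        # out = factor * in, so dL/din = dL/dout * factor.
        return d_L_d_out * self.factor


def train_step(layers, x, initial_gradient):
    # Forward phase: pass the input completely through the network.
    out = x
    for layer in layers:
        out = layer.forward(out)

    # Backward phase: each layer receives dL/dout and returns dL/din,
    # which becomes dL/dout for the layer before it.
    gradient = initial_gradient
    for layer in reversed(layers):
        gradient = layer.backprop(gradient)
    return out, gradient


layers = [Scale(2.0), Scale(3.0)]
out, grad = train_step(layers, np.array([1.0, -1.0]), np.array([1.0, 1.0]))
print(out, grad)
```

Because every layer exposes the same two methods, adding more layers only means adding entries to the list.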

3. Backprop: Softmax

We’ll start at the end and work our way towards the beginning, since that’s how backprop works.

First, recall the cross-entropy loss:

L = -ln(p_c)

where p_c is the predicted probability for the correct class c (in other words, what digit our current image actually is).

Want a longer explanation? Read the Cross-Entropy Loss section of my introduction to CNNs.

The first thing we need to calculate is the input to the Softmax layer’s backward phase, ∂L / ∂out_s, where out_s is the output from the Softmax layer: a vector of 10 probabilities.

This is pretty easy, since only p_c shows up in the loss equation:

∂L / ∂out_s(i) = 0 if i ≠ c
∂L / ∂out_s(i) = -1 / p_i if i = c

Reminder: c is the correct class.
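As a quick illustration of that result, here is a small sketch that computes the loss and the initial gradient for a made-up probability vector. The function name `cross_entropy_and_gradient` and the example values are assumptions for illustration, not code from the original post.

```python
import numpy as np


def cross_entropy_and_gradient(probs, label):
    """Return the loss -ln(p_c) and dL/dout_s for a vector of probabilities."""
    loss = -np.log(probs[label])
    # The gradient is zero everywhere except at the correct class.
    gradient = np.zeros_like(probs)
    gradient[label] = -1.0 / probs[label]
    return loss, gradient


probs = np.array([0.1, 0.7, 0.2])  # made-up softmax output for 3 classes
loss, gradient = cross_entropy_and_gradient(probs, label=1)
print(loss, gradient)
```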

That’s the initial gradient you saw referenced above.

We’re almost ready to implement our first backward phase. We just need to first perform the forward phase caching we discussed earlier. We cache 3 things here that will be useful for implementing the backward phase:

- The input's shape before we flatten it.
- The input after we flatten it.
- The totals, which are the values passed in to the softmax activation.

With that out of the way, we can start deriving the gradients for the backprop phase.

We’ve already derived the input to the Softmax backward phase: ∂L / ∂out_s.

One fact we can use about ∂L / ∂out_s is that it’s only nonzero for c, the correct class.

That means that we can ignore everything but out_s(c)!

First, let’s calculate the gradient of out_s(c) with respect to the totals (the values passed in to the softmax activation).

Let t_i be the total for class i.

Then we can write out_s(c) as:

out_s(c) = e^(t_c) / Σ_i e^(t_i)

You should recognize the equation above from the Softmax section of my CNNs tutorial.

Now, consider some class k such that k is not c.

We can rewrite out_s(c) as:

out_s(c) = e^(t_c) · S^(-1), where S = Σ_i e^(t_i)

and use the Chain Rule to derive:

∂out_s(c) / ∂t_k = -e^(t_c) · S^(-2) · (∂S / ∂t_k) = -e^(t_c) · e^(t_k) / S²

Remember, that was assuming k doesn’t equal c. Now let’s do the derivation for c, this time using the Quotient Rule:

∂out_s(c) / ∂t_c = (S · e^(t_c) - e^(t_c) · ∂S / ∂t_c) / S² = e^(t_c) · (S - e^(t_c)) / S²

Phew. That was the hardest bit of calculus in this entire post, and it only gets easier from here!

Let’s start implementing this. Remember how ∂L / ∂out_s is only nonzero for the correct class, c? We start by looking for c by looking for a nonzero gradient in d_L_d_out. Once we find that, we calculate the gradient ∂out_s(i) / ∂t (d_out_d_totals) using the results we derived above.

Let’s keep going.
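Before moving on, the two derivative results for out_s(c) can be sanity-checked numerically. The sketch below is an illustration rather than the post's layer code: `softmax`, `d_out_c_d_totals`, and the example totals are assumed names and values. It compares the closed-form gradient against a finite-difference estimate.

```python
import numpy as np


def softmax(totals):
    exps = np.exp(totals)
    return exps / np.sum(exps)


def d_out_c_d_totals(totals, c):
    """Gradient of out_s(c) with respect to every total t_k."""
    t_exp = np.exp(totals)
    S = np.sum(t_exp)
    # Case k != c: -e^(t_c) * e^(t_k) / S^2
    grad = -t_exp[c] * t_exp / (S ** 2)
    # Case k == c: e^(t_c) * (S - e^(t_c)) / S^2
    grad[c] = t_exp[c] * (S - t_exp[c]) / (S ** 2)
    return grad


totals = np.array([0.5, -1.0, 2.0])  # made-up totals for 3 classes
c = 2
analytic = d_out_c_d_totals(totals, c)

# Finite-difference estimate: nudge each total and watch out_s(c) change.
eps = 1e-6
numeric = np.zeros_like(totals)
for k in range(len(totals)):
    bumped = totals.copy()
    bumped[k] += eps
    numeric[k] = (softmax(bumped)[c] - softmax(totals)[c]) / eps

print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places, which is a cheap way to catch a sign error in either case.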

We ultimately want the gradients of loss against weights, biases, and input:

- We’ll use the weights gradient, ∂L / ∂w, to update our layer’s weights.
- We’ll use the biases gradient, ∂L / ∂b, to update our layer’s biases.
- We’ll return the input gradient, ∂L / ∂input, from our backprop() method so the next layer can use it. This is the return gradient we talked about in the Training Overview section!

To calculate those 3 loss gradients, we first need to derive 3 more results: the gradients of totals against weights, biases, and input.

The relevant equation here is:

t = w · input + b

These gradients are easy!

∂t / ∂w = input
∂t / ∂b = 1
∂t / ∂input = w

Putting everything together:

∂L / ∂w = (∂L / ∂out) · (∂out / ∂t) · (∂t / ∂w)
∂L / ∂b = (∂L / ∂out) · (∂out / ∂t) · (∂t / ∂b)
∂L / ∂input = (∂L / ∂out) · (∂out / ∂t) · (∂t / ∂input)

Putting this into code is a little less straightforward. First, we pre-calculate d_L_d_t since we'll use it several times. Then, we calculate each gradient:

- d_L_d_w: We need 2d arrays to do matrix multiplication (@), but d_t_d_w and d_L_d_t are 1d arrays. np.newaxis lets us easily create a new axis of length one, so we end up multiplying matrices with dimensions (input_len, 1) and (1, nodes). Thus, the final result for d_L_d_w will have shape (input_len, nodes), which is the same as self.weights!
- d_L_d_b: This one is straightforward, since d_t_d_b is 1.
- d_L_d_inputs: We multiply matrices with dimensions (input_len, nodes) and (nodes, 1) to get a result with length input_len.

Try working through small examples of the calculations above, especially the matrix multiplications for d_L_d_w and d_L_d_inputs.

That's the best way to understand why this code correctly computes the gradients.
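Here is one such small example, written as a standalone sketch rather than as the layer's actual backprop() method. The sizes and random values are made up, and the variable names simply mirror the ones used in the text; the point is only to show the shapes that come out of each multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
input_len, nodes = 4, 3  # tiny made-up sizes

last_input = rng.standard_normal(input_len)        # flattened input, shape (input_len,)
weights = rng.standard_normal((input_len, nodes))  # same layout as self.weights
d_L_d_t = rng.standard_normal(nodes)               # dL/dt, shape (nodes,)

# With totals = last_input @ weights + biases:
d_t_d_w = last_input    # dt/dw
d_t_d_b = 1             # dt/db
d_t_d_inputs = weights  # dt/dinput

# (input_len, 1) @ (1, nodes) -> (input_len, nodes)
d_L_d_w = d_t_d_w[np.newaxis].T @ d_L_d_t[np.newaxis]
# Elementwise, since dt/db is 1 -> (nodes,)
d_L_d_b = d_L_d_t * d_t_d_b
# (input_len, nodes) @ (nodes,) -> (input_len,)
d_L_d_inputs = d_t_d_inputs @ d_L_d_t

print(d_L_d_w.shape, d_L_d_b.shape, d_L_d_inputs.shape)
```

The np.newaxis product is just an outer product, so np.outer(last_input, d_L_d_t) would give the same matrix.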

With all the gradients computed, all that’s left is to actually train the Softmax layer! We’ll update the weights and bias using Stochastic Gradient Descent (SGD) just like we did in my introduction to Neural Networks and then return d_L_d_inputs.

Notice that we added a learn_rate parameter that controls how fast we update our weights. Also, we have to reshape() before returning d_L_d_inputs because we flattened the input during our forward pass. Reshaping to last_input_shape ensures that this layer returns gradients for its input in the same format that the input was originally given to it.
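The update and reshape steps can be sketched in isolation like this. All the numbers, and the (2, 2) value for last_input_shape, are made up for illustration; in the real layer these would be attributes cached during the forward pass.

```python
import numpy as np

learn_rate = 0.1

# Made-up parameters and gradients for a layer with 4 inputs and 3 nodes.
weights = np.ones((4, 3))
biases = np.zeros(3)
d_L_d_w = np.full((4, 3), 0.5)
d_L_d_b = np.array([1.0, -1.0, 0.0])
d_L_d_inputs = np.arange(4.0)
last_input_shape = (2, 2)  # hypothetical input shape before flattening

# SGD: step each parameter against its gradient, scaled by learn_rate.
weights -= learn_rate * d_L_d_w
biases -= learn_rate * d_L_d_b

# Hand back the input gradient in the shape the input originally had.
d_L_d_input_reshaped = d_L_d_inputs.reshape(last_input_shape)
print(d_L_d_input_reshaped)
```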

Test Drive: Softmax Backprop

We’ve finished our first backprop implementation! Let’s quickly test it to see if it’s any good.

We’ll start implementing a train() method from my CNNs introduction. Running this gives results similar to:

MNIST CNN initialized!
[Step 100] Past 100 steps: Average Loss 2.239 | Accuracy: 18%
[Step 200] Past 100 steps: Average Loss 2.140 | Accuracy: 32%
[Step 300] Past 100 steps: Average Loss 1.998 | Accuracy: 48%
[Step 400] Past 100 steps: Average Loss 1.861 | Accuracy: 59%
[Step 500] Past 100 steps: Average Loss 1.789 | Accuracy: 56%
[Step 600] Past 100 steps: Average Loss 1.809 | Accuracy: 48%
[Step 700] Past 100 steps: Average Loss 1.718 | Accuracy: 63%
[Step 800] Past 100 steps: Average Loss 1.588 | Accuracy: 69%
[Step 900] Past 100 steps: Average Loss 1.509 | Accuracy: 71%
[Step 1000] Past 100 steps: Average Loss 1.481 | Accuracy: 70%

The loss is going down and the accuracy is going up: our CNN is already learning!

4. Backprop: Max Pooling

A Max Pooling layer can’t be trained because it doesn’t actually have any weights, but we still need to implement a method for it to calculate gradients.

We’ll start by adding forward phase caching again.

All we need to cache this time is the input.

During the forward pass, the Max Pooling layer takes an input volume and halves its width and height dimensions by picking the max values over 2×2 blocks.

The backward pass does the opposite: we’ll double the width and height of the loss gradient by assigning each gradient value to where the original max value was in its corresponding 2×2 block.

Here’s an example.

Consider this forward phase for a Max Pooling layer:

[Figure: An example forward phase that transforms a 4×4 input to a 2×2 output]

The backward phase of that same layer would look like this:

[Figure: An example backward phase that transforms a 2×2 gradient to a 4×4 gradient]

Each gradient value is assigned to where the original max value was, and every other value is zero.

Why does the backward phase for a Max Pooling layer work like this? Think about what ∂L / ∂inputs intuitively should be.

An input pixel that isn’t the max value in its 2×2 block would have zero marginal effect on the loss, because changing that value slightly wouldn’t change the output at all! In other words, ∂L / ∂inputs = 0 for non-max pixels.

On the other hand, an input pixel that is the max value would have its value passed through to the output, so ∂output / ∂input = 1, meaning ∂L / ∂input = ∂L / ∂output.

We can implement this pretty quickly using the helper method we wrote in my introduction to CNNs. I’ll include it again as a reminder.

For each pixel in each 2×2 image region in each filter, we copy the gradient from d_L_d_out to d_L_d_input if it was the max value during the forward pass.
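The routing rule described above can be sketched as a standalone function. This is a simplified illustration that loops over the 2×2 blocks directly instead of using the post's helper method; the name `maxpool2_backprop` and the example arrays are assumptions.

```python
import numpy as np


def maxpool2_backprop(last_input, d_L_d_out):
    """Route each output gradient to the max position of its 2x2 block.

    last_input has shape (h, w, num_filters); d_L_d_out has shape
    (h // 2, w // 2, num_filters).
    """
    h, w, num_filters = last_input.shape
    d_L_d_input = np.zeros(last_input.shape)
    for i in range(h // 2):
        for j in range(w // 2):
            region = last_input[i * 2:i * 2 + 2, j * 2:j * 2 + 2]
            amax = np.amax(region, axis=(0, 1))  # max of the block, per filter
            for i2 in range(2):
                for j2 in range(2):
                    for f in range(num_filters):
                        # Only the pixel that was the max receives the gradient.
                        if region[i2, j2, f] == amax[f]:
                            d_L_d_input[i * 2 + i2, j * 2 + j2, f] = d_L_d_out[i, j, f]
    return d_L_d_input


# A made-up 4x4 input with a single filter, and a made-up 2x2 gradient.
x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 7, 2],
              [6, 2, 3, 1]], dtype=float)[:, :, np.newaxis]
d_out = np.array([[10.0, 20.0],
                  [30.0, 40.0]])[:, :, np.newaxis]
d_in = maxpool2_backprop(x, d_out)
print(d_in[:, :, 0])
```

One design note: if two pixels in a block tie for the max, this sketch copies the gradient to both of them.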

That’s it! On to our final layer.

5. Backprop: Conv

We’re finally here: backpropagating through a Conv layer is the core of training a CNN.

The forward phase caching is simple.

Reminder about our implementation: for simplicity, we assume the input to our conv layer is a 2d array.

This only works for us because we use it as the first layer in our network.

If we were building a bigger network that needed to use Conv3x3 multiple times, we'd have to make the input be a 3d array.

We’re primarily interested in the loss gradient for the filters in our conv layer, since we need that to update our filter weights.

We already have ∂L / ∂out for the conv layer, so we just need ∂out / ∂filters.

To calculate that, we ask ourselves this: how would changing a filter’s weight affect the conv layer’s output?

The reality is that changing any filter weight would affect the entire output image for that filter, since every output pixel uses every filter weight during convolution.

To make this even easier to think about, let’s just think about one output pixel at a time: how would modifying a filter change the output of one specific output pixel?

Here’s a super simple example to help think about this question:

[Figure: A 3×3 image (left) convolved with a 3×3 filter (middle) to produce a 1×1 output (right)]

We have a 3×3 image convolved with a 3×3 filter of all zeros to produce a 1×1 output.

What if we increased the center filter weight by 1? The output would increase by the center image value, 80.

Similarly, increasing any of the other filter weights by 1 would increase the output by the value of the corresponding image pixel! This suggests that the derivative of a specific output pixel with respect to a specific filter weight is just the corresponding image pixel value.

Doing the math confirms this:

out(i, j) = convolve(image, filter) = Σ_{x=0..2} Σ_{y=0..2} image(i + x, j + y) · filter(x, y)

∂out(i, j) / ∂filter(x, y) = image(i + x, j + y)

We can put it all together to find the loss gradient for specific filter weights:

∂L / ∂filter(x, y) = Σ_i Σ_j (∂L / ∂out(i, j)) · (∂out(i, j) / ∂filter(x, y))

We’re ready to implement backprop for our conv layer!

We apply our derived equation by iterating over every image region / filter and incrementally building the loss gradients. Once we’ve covered everything, we update self.filters using SGD just as before.

Note the comment explaining what this method returns: the derivation for the loss gradient of the inputs is very similar to what we just did and is left as an exercise to the reader :).
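As a rough sketch of that accumulation, here is a standalone function that builds the filter gradients for a valid 3×3 convolution over a 2d image. It is an illustration under stated assumptions, not the post's Conv3x3 method: the name `conv3x3_filter_gradients`, the (num_filters, 3, 3) filter layout, and the example arrays are all chosen for this example, and the SGD update itself is left out.

```python
import numpy as np


def conv3x3_filter_gradients(image, d_L_d_out):
    """Accumulate dL/dfilters for a valid 3x3 convolution over a 2d image.

    image has shape (h, w); d_L_d_out has shape (h - 2, w - 2, num_filters).
    Returns an array of shape (num_filters, 3, 3).
    """
    h, w = image.shape
    num_filters = d_L_d_out.shape[2]
    d_L_d_filters = np.zeros((num_filters, 3, 3))
    for i in range(h - 2):
        for j in range(w - 2):
            region = image[i:i + 3, j:j + 3]
            for f in range(num_filters):
                # dout(i, j)/dfilter(x, y) is image(i + x, j + y), i.e. the region,
                # so each output pixel adds its gradient times that region.
                d_L_d_filters[f] += d_L_d_out[i, j, f] * region
    return d_L_d_filters


# Single output pixel: the filter gradient is just the image itself.
small = np.arange(9.0).reshape(3, 3)
single = conv3x3_filter_gradients(small, np.ones((1, 1, 1)))

# Four output pixels: each filter weight sums the four image pixels it touched.
image = np.arange(16.0).reshape(4, 4)
grads = conv3x3_filter_gradients(image, np.ones((2, 2, 1)))
print(grads[0])
```

The single-pixel case mirrors the 3×3 example above: with a 1×1 output, each filter weight's gradient is exactly the image pixel it multiplies.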

With that, we’re done! We’ve implemented a full backward pass through our CNN. Time to test it out…

6. Training a CNN

We’ll train our CNN for a few epochs, track its progress during training, and then test it on a separate test set.

Example output from running the full code:

MNIST CNN initialized!
--- Epoch 1 ---
[Step 100] Past 100 steps: Average Loss 2.254 | Accuracy: 18%
[Step 200] Past 100 steps: Average Loss 2.167 | Accuracy: 30%
[Step 300] Past 100 steps: Average Loss 1.676 | Accuracy: 52%
[Step 400] Past 100 steps: Average Loss 1.212 | Accuracy: 63%
[Step 500] Past 100 steps: Average Loss 0.949 | Accuracy: 72%
[Step 600] Past 100 steps: Average Loss 0.848 | Accuracy: 74%
[Step 700] Past 100 steps: Average Loss 0.954 | Accuracy: 68%
[Step 800] Past 100 steps: Average Loss 0.671 | Accuracy: 81%
[Step 900] Past 100 steps: Average Loss 0.923 | Accuracy: 67%
[Step 1000] Past 100 steps: Average Loss 0.571 | Accuracy: 83%
--- Epoch 2 ---
[Step 100] Past 100 steps: Average Loss 0.447 | Accuracy: 89%
[Step 200] Past 100 steps: Average Loss 0.401 | Accuracy: 86%
[Step 300] Past 100 steps: Average Loss 0.608 | Accuracy: 81%
[Step 400] Past 100 steps: Average Loss 0.511 | Accuracy: 83%
[Step 500] Past 100 steps: Average Loss 0.584 | Accuracy: 89%
[Step 600] Past 100 steps: Average Loss 0.782 | Accuracy: 72%
[Step 700] Past 100 steps: Average Loss 0.397 | Accuracy: 84%
[Step 800] Past 100 steps: Average Loss 0.560 | Accuracy: 80%
[Step 900] Past 100 steps: Average Loss 0.356 | Accuracy: 92%
[Step 1000] Past 100 steps: Average Loss 0.576 | Accuracy: 85%
--- Epoch 3 ---
[Step 100] Past 100 steps: Average Loss 0.367 | Accuracy: 89%
[Step 200] Past 100 steps: Average Loss 0.370 | Accuracy: 89%
[Step 300] Past 100 steps: Average Loss 0.464 | Accuracy: 84%
[Step 400] Past 100 steps: Average Loss 0.254 | Accuracy: 95%
[Step 500] Past 100 steps: Average Loss 0.366 | Accuracy: 89%
[Step 600] Past 100 steps: Average Loss 0.493 | Accuracy: 89%
[Step 700] Past 100 steps: Average Loss 0.390 | Accuracy: 91%
[Step 800] Past 100 steps: Average Loss 0.459 | Accuracy: 87%
[Step 900] Past 100 steps: Average Loss 0.316 | Accuracy: 92%
[Step 1000] Past 100 steps: Average Loss 0.460 | Accuracy: 87%

--- Testing the CNN ---
Test Loss: 0.5979384893783474
Test Accuracy: 0.78

Our code works! In only 3000 training steps, we went from a model with 2.3 loss and 10% accuracy to 0.6 loss and 78% accuracy.

Want to try or tinker with this code yourself?