Recall that bias signifies addition.

The bias value assigned to this layer is -2.

That means we subtract 2 from the value we just calculated: 4.7 - 2 = 2.7.

The value 2.7 is thus our final value that we feed to hidden node E.

In the figure above, the weights leading to hidden node F are all highlighted in gold.

The values of these weights from top to bottom are -1.5, -3, 7.1, and 5.2.

We perform the same calculations as before: (a) multiply the input values (red) by the corresponding weight value (gold); (b) sum these together; (c) add the bias term (green).
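The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name `node_input` is made up for this sketch, and the input values are placeholders, because the real inputs appear only in the figure. The weights for node F and the bias of -2 are the ones given in the text, so with placeholder inputs the result will not match the value in the figure.

```python
def node_input(inputs, weights, bias):
    """Value fed to one hidden node: weighted sum of the inputs plus the bias."""
    # (a) multiply each input by its corresponding weight
    products = [x * w for x, w in zip(inputs, weights)]
    # (b) sum the products
    weighted_sum = sum(products)
    # (c) add the bias term
    return weighted_sum + bias


# Weights leading to node F and the layer bias, as given in the text.
weights_f = [-1.5, -3.0, 7.1, 5.2]
bias = -2.0

# Placeholder inputs (an assumption; the real inputs are only in the figure).
example_inputs = [1.0, 0.5, -1.0, 2.0]

value_f = node_input(example_inputs, weights_f, bias)
print(value_f)
```

Swapping in the real input values from the figure should reproduce the number fed to node F.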

In the figure above, we see the weights leading to hidden node G highlighted in gold, as well as the calculation of the value fed to node G.

Nonlinearity

So far, we have only done multiplication and addition operations.

However, using only multiplication and addition limits the kinds of transformations we can do from the input to the output.

In doing so, we are implicitly assuming that the relationship between the input and the output is linear.

When modeling the real world, it’s nice to have more flexibility, because the true relationship between the input and the output might be nonlinear.

How can we allow the neural network to represent a nonlinear function? We add in “nonlinearities”: each neuron applies a nonlinear function to the value it receives.

One popular choice is the sigmoid function.

So, here, node E would apply the sigmoid function to 2.7, node F would apply the sigmoid function to -11.67, and node G would apply the sigmoid function to 1.51.

This is the sigmoid function: σ(x) = 1 / (1 + e^(-x)).

Key points:

(1) the sigmoid function is not linear, which helps the neural network learn more complicated relationships between the input and output

(2) the sigmoid function squashes its input values to be between 0 and 1.
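To make the squashing concrete, here is a small sketch that applies the sigmoid to the three values fed to nodes E, F, and G. It assumes the standard logistic definition of the sigmoid; the dictionary names are just for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Values fed to the hidden nodes, as given in the text.
hidden_inputs = {"E": 2.7, "F": -11.67, "G": 1.51}

hidden_outputs = {name: sigmoid(z) for name, z in hidden_inputs.items()}
for name, value in hidden_outputs.items():
    print(f"node {name}: {value:.6f}")
```

Large positive inputs land close to 1, large negative inputs land close to 0, and an input of 0 lands exactly at 0.5.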

Representing a neural network as a matrix

If you read about neural networks online, you will probably see matrix equations of the form “Wx+b” used to describe the computations of a neural network.

It turns out that all the addition and multiplication I just described can be summarized using matrix multiplication.

[Figure: recap of how to perform matrix multiplication]

[Figure: matrix multiplication applied to our example]

Here, I have taken all of the first-layer weights and arranged them in the gold matrix labeled W.

I have taken the inputs x and arranged them as the red vector.

The bias units are shown in green, and are all the same for a given layer (in this case they are all -2).

If you work through the matrix multiplication and bias addition, you see that we obtain exactly the same answer that we got before: 2.7 for node E, -11.67 for node F, and 1.51 for node G.
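As a rough sketch of the “Wx+b” form, the snippet below stacks one row of weights per hidden node and computes all three node values at once. Only node F's row and the bias of -2 come from the text; the other two rows and the input vector are made-up placeholders, and the helper names `matvec` and `layer_pre_activation` are hypothetical. Plain Python lists are used so that nothing beyond the standard library is needed.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def layer_pre_activation(W, x, b):
    """Compute Wx + b, one entry per node in the layer."""
    return [wx + bi for wx, bi in zip(matvec(W, x), b)]


# One row of weights per hidden node. The middle row holds node F's weights
# from the text; the other rows and the inputs are made-up placeholders.
W = [
    [0.5, 1.0, -0.5, 2.0],   # node E (placeholder)
    [-1.5, -3.0, 7.1, 5.2],  # node F (from the text)
    [1.0, -2.0, 0.0, 0.5],   # node G (placeholder)
]
x = [1.0, 0.5, -1.0, 2.0]    # placeholder inputs
b = [-2.0, -2.0, -2.0]       # same bias for every node in the layer

z = layer_pre_activation(W, x, b)
print(z)
```

Each entry of `z` is the same multiply, sum, and add-bias calculation done node by node earlier, just written once for the whole layer.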

Recall that each of these numbers is subsequently squashed between 0 and 1 by the sigmoid function.

Obtaining the output

The final calculation for the output is shown here.

Node E outputs 0.937, node F outputs 0.000009, and node G outputs 0.819.

Proceeding in the same way as before, we multiply each of these numbers by the corresponding weight (gold), sum them together, add the bias, and then apply the nonlinearity to get the final output of 0.388.

That is the “forward pass”: how neural networks compute an output prediction based on input data.
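Putting the pieces together, a whole forward pass for a network of this shape (inputs, one hidden layer, one output node) might be sketched as follows. The function names `dense` and `forward` are hypothetical, and every number in the example is a placeholder rather than a value from the figures, so the printed output will not equal 0.388.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, used here as the nonlinearity for every node."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weight_rows, biases):
    """One layer: weighted sum plus bias for each node, then the sigmoid."""
    return [
        sigmoid(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weight_rows, biases)
    ]


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: inputs -> hidden layer -> single output node."""
    hidden = dense(x, hidden_weights, hidden_biases)
    return dense(hidden, [output_weights], [output_bias])[0]


# All values below are placeholders chosen for illustration only.
x = [1.0, 0.5, -1.0, 2.0]
hidden_weights = [
    [0.5, 1.0, -0.5, 2.0],
    [-1.5, -3.0, 7.1, 5.2],
    [1.0, -2.0, 0.0, 0.5],
]
hidden_biases = [-2.0, -2.0, -2.0]
output_weights = [1.0, -1.0, 0.5]
output_bias = -0.5

prediction = forward(x, hidden_weights, hidden_biases, output_weights, output_bias)
print(prediction)
```

Because the last step is a sigmoid, the prediction always falls strictly between 0 and 1.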

What about the backward pass?

Here, I will refer you to the excellent post by Matt Mazur, “A Step by Step Backpropagation Example.” If you work through his backpropagation example, I guarantee you will come away with a great understanding of how the backward pass works, i.e., how the neural network tweaks each of the weights to get a more correct answer the next time around.

The process of modifying the neural network to get a more correct answer is also referred to as “gradient descent.” The “gradient” is just the derivative of the loss function with respect to the weights, and “descent” indicates that we are trying to go “down the derivative”, or make the loss function smaller.

If you think of the loss function as a hill, we are trying to “descend” the hill so that our loss is smaller and our answer is less wrong (more correct).
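The hill picture can be illustrated with a toy example. The sketch below is not a neural network: it assumes a made-up one-dimensional loss, a parabola with its lowest point at w = 3, and repeatedly steps against the derivative. The learning rate and number of steps are arbitrary choices for the illustration.

```python
def loss(w):
    """Toy loss: a parabola with its minimum at w = 3."""
    return (w - 3.0) ** 2


def loss_gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=50):
    """Repeatedly move w a small step in the downhill direction."""
    for _ in range(steps):
        # Step against the gradient, i.e. downhill on the loss.
        w = w - learning_rate * loss_gradient(w)
    return w


w_start = 0.0
w_final = gradient_descent(w_start)
print(loss(w_start), loss(w_final))
```

In a real network the same idea applies to every weight at once, with backpropagation supplying the derivatives.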

[Figure: skiing down the loss function]

The end! Stay tuned for future posts about multiclass vs. multilabel classification, different kinds of nonlinearities, and stochastic vs. batch vs. minibatch gradient descent!

Originally published at http://glassboxmedicine.com on January 17, 2019.