Everything you need to know about Neural Networks and Backpropagation: Machine Learning Made Easy and Fun

Neural Network explanation from the ground up, including understanding the math behind it

Gavril Ognjanovski, Jan 14

I find it hard to get step-by-step, detailed explanations of Neural Networks in one place.

Some part of the explanation was always missing in the courses or in the videos.

So I tried to gather all the information and explanations in one blog post (step by step).

I have separated this post into the 8 sections I find most relevant:

1. Model Representation
2. Model Representation Mathematics
3. Activation Functions
4. Bias Node
5. Cost Function
6. Forward Propagation Calculation
7. Backpropagation Algorithm
8. Code Implementation

So let’s start…

Model Representation

An Artificial Neural Network is a computing system inspired by the biological neural networks that constitute animal brains.

Such systems “learn” to perform tasks by considering examples, generally without being programmed with any task-specific rules.

Image 1: Neural Network Architecture

The Neural Network is constructed from 3 types of layers:

Input layer: the initial data for the neural network.

Hidden layers: the intermediate layers between the input and output layers, where all the computation is done.

Output layer: produces the result for the given inputs.

There are 3 yellow circles on the image above.

They represent the input layer and are usually denoted as the vector X.

There are 4 blue and 4 green circles that represent the hidden layers.

These circles represent the “activation” nodes and are usually denoted as a (it is the weights on the connections that are usually denoted as W or θ).

The red circle is the output layer, or the predicted value (or values in the case of multiple output classes/types).

Each node is connected to every node in the next layer, and each connection (black arrow) has a particular weight.

A weight can be seen as the impact that a node has on a node in the next layer.

So if we take a look at one node, it would look like this:

Image 2: Node from Neural Network

Let’s look at the top blue node (“Image 1”).

All the nodes from the previous layer (yellow) are connected to it.

Each of these connections carries a weight (impact).

When each node value from the yellow layer is multiplied by its weight and all the products are summed, the result is a value for the top blue node.

The blue node has a predefined “activation” function (the unit step function in “Image 2”) which defines whether this node will be “activated”, or how “active” it will be, based on the summed value.

The additional node with value 1 is called the “bias” node.
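
As a minimal sketch of what a single node does, the weighted sum and the unit step activation could be written as follows. The input values, the weights and the threshold of 0 are made-up numbers for illustration; they are not taken from the images.

```python
import numpy as np

def unit_step(z, threshold=0.0):
    """Return 1.0 if the weighted sum exceeds the threshold, else 0.0."""
    return 1.0 if z > threshold else 0.0

# Hypothetical values: the bias node (always 1) followed by three input nodes.
inputs = np.array([1.0, 0.5, -1.0, 2.0])
# Hypothetical weights, one per incoming connection (the first is the bias weight).
weights = np.array([-0.5, 0.8, 0.2, 0.4])

z = float(np.dot(weights, inputs))   # weighted sum of all incoming values
activation = unit_step(z)            # decide whether the node is "activated"
print(z, activation)
```

With these numbers the weighted sum is positive, so the node is activated.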

Model Representation Mathematics

In order to explain the mathematical equations I will use a simpler Neural Network model.

This model has 4 input nodes (3 + 1 “bias”), one hidden layer with 4 nodes (3 + 1 “bias”) and one output node.

Image 3: Simple Neural Network

We are going to mark the “bias” nodes as x₀ and a₀ respectively.

So the input nodes can be placed in one vector X and the nodes from the hidden layer in a vector A.

Image 4: X (input layer) and A (hidden layer) vectors

The weights (arrows) are usually denoted as θ or W.

In this case I will denote them as θ.

The weights between the input and hidden layer form a 3×4 matrix.

And the weights between the hidden layer and the output layer form a 1×4 matrix.

If a network has a units in layer j and b units in layer j+1, then θⱼ will be of dimension b×(a+1).

Image 5: Layer 1 Weights Matrix (θ)

Next, we want to compute the “activation” nodes for the hidden layer.

To do that we multiply the weights matrix θ¹ of the first layer by the input vector X (θ¹X) and then apply the activation function g.

What we get is:

Image 6: Compute activation nodes

And by multiplying the weights matrix θ² of the second layer by the hidden layer vector A (θ²A) we get the output of the hypothesis function:

Image 7: Compute output node value (hypothesis)

This example has only one hidden layer with 4 nodes.

If we generalize to a Neural Network with multiple hidden layers and multiple nodes in each layer, we get the following formula.

Image 8: Generalized Compute node value function

Here layer L has n nodes and layer L-1 has m nodes.
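
The computation described in this section can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `forward`, the random weights and the input values are hypothetical, and the sigmoid is used as the activation function g (it is introduced in the next section).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Propagate one input vector through the network.

    x      -- 1-D array of input values, without the bias node
    thetas -- list of weight matrices; thetas[j] has shape (b, a + 1)
              for a units in layer j and b units in layer j + 1
    """
    a = x
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # prepend the bias node (value 1)
        a = sigmoid(theta @ a)          # weighted sum, then activation g
    return a

rng = np.random.default_rng(0)
theta1 = rng.normal(size=(3, 4))   # input (3 + bias) -> hidden (3)
theta2 = rng.normal(size=(1, 4))   # hidden (3 + bias) -> output (1)
h = forward(np.array([0.2, -0.4, 0.9]), [theta1, theta2])
print(h.shape)
```

The shapes 3×4 and 1×4 are the ones given above for the simple model; adding more matrices to the list extends the same loop to more hidden layers.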

Activation Functions

In a Neural Network the activation function defines whether a given node should be “activated” or not, based on the weighted sum.

Let’s define this weighted sum value as z.

In this section I will explain why the “Step Function” and the “Linear Function” won’t work, and talk about the “Sigmoid Function”, one of the most popular activation functions.

There are also other functions, which I will leave aside for now.

Step Function

One of the first ideas would be to use the so-called “Step Function” (discrete output values), where we define a threshold value and:

if (z > threshold): “activate” the node (value 1)

if (z < threshold): don’t “activate” the node (value 0)

This looks nice, but it has a drawback: the node can only output the value 1 or 0.

If we want to map multiple output classes (nodes), we have a problem.

It is possible for multiple output classes/nodes to be activated (to have the value 1).

So we are not able to properly classify/decide.

Linear Function

Another possibility would be to define a “Linear Function” and get a range of output values.

However, using only linear functions in the Neural Network would make the output layer a linear function of the input, so we would not be able to map any non-linear data.

The proof follows from function composition: composing two linear functions gives a function that is also linear.
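
The composition argument can be checked numerically. The sketch below is an illustration with made-up random weights rather than the author's original equations: two stacked linear layers W2(W1x + b1) + b2 are rewritten as a single linear layer Wx + b with W = W2W1 and b = W2b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # first linear layer
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # second linear layer
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2   # output of the two stacked layers

# The same mapping written as ONE linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b
print(np.allclose(two_layers, one_layer))
```

However many linear layers are stacked, they can always be collapsed this way, which is why a non-linear activation function is needed.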

Sigmoid Function

It is one of the most widely used activation functions today.

Its equation, g(z) = 1 / (1 + e^(-z)), is given by the formula below.

Image 9: Sigmoid Equation. source: wikipedia

Image 10: Sigmoid Function. source: wikipedia

It has multiple properties which make it so popular:

It’s a non-linear function

Its range of values is (0,1)

Between (-2,2) on the x-axis the function is very steep, which causes it to tend to classify values as either 1 or 0

Because of these properties the nodes can take any value between 0 and 1.

In the end, in the case of multiple output classes, this results in a different probability of “activation” for each output class.

And we choose the one with the highest “activation” (probability) value.
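
A short sketch of this selection step, assuming three output nodes whose weighted sums z are made-up values:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weighted sums z for three output nodes (classes 0, 1 and 2).
z = np.array([2.0, -1.0, 0.5])
activations = sigmoid(z)

# Choose the class whose node has the highest "activation".
predicted = int(np.argmax(activations))
print(activations, predicted)
```

Unlike the step function, every node gets a distinct value between 0 and 1, so there is a single best class to pick.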

Bias Node

Using a “bias” node is usually critical for creating a successful learning model.

In short, a bias value allows us to shift the activation function to the left or right, which helps to get a better fit for the data (a better prediction function as output).
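
The shift can be made concrete with a small sketch. It assumes a node with a single input x, a weight and a bias (the helper name `midpoint` and the values are hypothetical): the sigmoid of weight·x + bias crosses 0.5 where weight·x + bias = 0, that is at x = -bias / weight.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def midpoint(weight, bias):
    """x value where sigmoid(weight * x + bias) equals 0.5."""
    return -bias / weight

x = np.linspace(-10, 10, 2001)
for bias in (-3.0, 0.0, 3.0):
    y = sigmoid(1.0 * x + bias)
    # Locate the sample closest to 0.5 and compare it with the closed form.
    crossing = x[np.argmin(np.abs(y - 0.5))]
    print(bias, crossing, midpoint(1.0, bias))
```

A positive bias moves the curve to the left and a negative bias moves it to the right, while the weight controls how steep the curve is.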

Below are 3 Sigmoid functions that I drew, where you can see how multiplying the variable x by some value, or adding/subtracting some value, influences the function.

Multiplying x: makes the function steeper

Adding to/subtracting from x: shifts the function left/right

Image 11: Sigmoid Functions. source: desmos.com

Cost Function

Let’s start by defining the general equation for the cost function.

This function represents the sum of the errors, that is, the differences between the predicted values and the real (labeled) values.

Image 12: General Cost function. source: coursera.org

Since this is a classification problem, y can only take the discrete values {0,1}.

Each example can belong to only one class.

For example, say we classify images of dogs (class 1), cats (class 2) and birds (class 3).

If the input image is a dog, the output will be the value 1 for the dog class and the value 0 for the other classes.

This means that we want our hypothesis to satisfy

Image 13: Hypothesis function range values

So that’s why we define our hypothesis as

Image 14: Hypothesis function

where g in this case is the Sigmoid function, since its values lie in the range (0,1).

Our goal is to optimize the cost function, so we need to find min J(θ).

But because the hypothesis contains the Sigmoid function, a cost built directly on the difference between h(x) and y is a “non-convex” function of θ (“Image 15”), which means that it can have multiple local minima.

So gradient descent is not guaranteed to converge to (find) the global minimum.

What we need is a “convex” function, so that the gradient descent algorithm is able to find the global minimum (minimize J(θ)).

In order to get that we use the log function.

Image 15: Convex vs Non-convex function. source: researchgate.com

So that’s why we use the following cost function for neural networks

Image 16: Neural Network cost function. source: coursera.org

In the case where the labeled value y is equal to 1 the cost is -log(h(x)), and -log(1-h(x)) otherwise.

The intuition is pretty simple if we look at the function graphs.

Let’s first look at the case where y=1.

Then -log(h(x)) looks like the graph below.

We are only interested in the (0,1) interval of the x-axis, since the hypothesis can only take values in that range (“Image 13”).

Image 17: Cost function -log(h(x)). source: desmos.com

What we can see from the graph is that if y=1 and h(x) approaches the value 1 (x-axis), the cost approaches 0 (h(x)-y would be 0), since it’s the right prediction.

Otherwise, if h(x) approaches 0 the cost goes to infinity (a very large cost).

In the other case, where y=0, the cost function is -log(1-h(x)).

Image 18: -log(1-h) cost function. source: desmos.com

From this graph we can see that if h(x) approaches the value 0 the cost approaches 0, since it’s also the right prediction in this case.
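
The behaviour read off the two graphs can also be checked with a few numbers. This is a small illustrative sketch; the helper name `cost_single` and the sample h values are hypothetical.

```python
import numpy as np

def cost_single(h, y):
    """Cost for one prediction h in (0, 1) and a label y in {0, 1}."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

# Print the cost for a confident, an undecided and a wrong-leaning prediction.
for h in (0.99, 0.5, 0.01):
    print(h, cost_single(h, 1), cost_single(h, 0))
```

For y=1 the cost is close to 0 when h is near 1 and grows quickly as h approaches 0; for y=0 the picture is mirrored.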

Since y (the labeled value) is always equal to 0 or 1, we can write the cost function as one equation.

Image 19: Cost function equation. source: coursera.org

If we fully write out our cost function with the summation we get:

Image 20: Cost function in case of one output node. source: coursera.org

This is for the case where there is only one node in the output layer of the Neural Network.

If we generalize this to multiple output nodes (multiclass classification) we get:

Image 21: Generalized Cost function. source: coursera.org

The right-hand parts of the equations represent the cost function “regularization”.

This regularization prevents the model from “overfitting” the data, by reducing the magnitude/values of θ.
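
Since the equation images are not reproduced here, the following is only a sketch of a regularized cost of this kind, under stated assumptions: the cross-entropy term is averaged over the m training examples, the penalty is λ/(2m) times the sum of squared weights, and the bias weights (first column of each θ) are left out of the penalty. The function name `nn_cost` and the sample numbers are hypothetical.

```python
import numpy as np

def nn_cost(H, Y, thetas, lam):
    """Regularized cross-entropy cost.

    H      -- predictions, shape (m, K), values strictly between 0 and 1
    Y      -- one-hot labels, shape (m, K)
    thetas -- list of weight matrices; column 0 holds the bias weights
    lam    -- regularization strength (lambda)
    """
    m = Y.shape[0]
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Assumed convention: bias weights (first column) are not penalized.
    reg_term = lam / (2 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return data_term + reg_term

H = np.array([[0.9, 0.1], [0.2, 0.8]])    # made-up predictions
Y = np.array([[1, 0], [0, 1]])            # matching one-hot labels
thetas = [np.array([[0.5, 1.0, -1.0]])]   # made-up weights
print(nn_cost(H, Y, thetas, lam=0.0), nn_cost(H, Y, thetas, lam=1.0))
```

A larger λ makes large weights more expensive, which is how the regularization term pushes the magnitude of θ down.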

Forward Propagation Calculation

Forward propagation is the process of getting the Neural Network output value for a given input.

This algorithm is used to calculate the cost value.

It is the same mathematical process as the one described in section 2, “Model Representation Mathematics”, where in the end we get our hypothesis value (“Image 7”).

After we get the h(x) value (hypothesis) we use the cost function equation (“Image 21”) to calculate the cost for the given set of inputs.

Image 22: Calculate Forward propagation

Here we can see how forward propagation works and how a Neural Network generates its predictions.

Backpropagation Algorithm

What we want to do is minimize the cost function J(θ) by finding the optimal set of values for θ (weights).

Backpropagation is the method we use to compute the partial derivatives of J(θ).

These partial derivative values are then used in the Gradient descent algorithm (“Image 23”) to calculate the θ values for the Neural Network that minimize the cost function J(θ).

Image 23: General form of gradient descent. source: coursera.org

The Backpropagation algorithm has 5 steps:

1. Set a(1) = X for the training examples
2. Perform forward propagation and compute a(l) for the other layers (l = 2…L)
3. Use y to compute the delta value for the last layer, δ(L) = h(x) − y
4. Compute the δ(l) values backwards for each layer (described in the “Math behind Backpropagation” section)
5. Calculate the derivative values Δ(l) = (a(l))^T ∘ δ(l+1) for each layer, which represent the derivative of the cost J(θ) with respect to θ(l) for layer l

Backpropagation is about determining how changing the weights impacts the overall cost of the neural network.

It propagates the “error” backwards through the neural network.

On the way back it finds how much each weight contributes to the overall “error”.

The weights that contribute more to the overall “error” will have larger derivative values, which means that they will change more (when computing Gradient descent).
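
To illustrate that last point, here is a minimal sketch of one Gradient descent update, θ := θ − α · ∂J/∂θ. The weights, the derivative values and the learning rate α are made-up numbers, and the function name is hypothetical.

```python
import numpy as np

def gradient_descent_step(theta, grad, alpha):
    """Return the updated weights: theta - alpha * dJ/dtheta."""
    return theta - alpha * grad

theta = np.array([0.5, -0.3, 0.8])   # hypothetical weights
grad = np.array([2.0, 0.1, -0.5])    # hypothetical partial derivatives of J
new_theta = gradient_descent_step(theta, grad, alpha=0.1)
print(new_theta - theta)             # the first weight moves the most
```

The weight with the largest derivative changes the most, and each weight moves in the direction that lowers the cost.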

Now that we have a sense of what the Backpropagation algorithm is doing, we can dive deeper into the concepts and the math behind it.

Why derivatives?

The derivative of a function (in our case J(θ)) with respect to each variable (in our case each weight θ) tells us the sensitivity of the function to that variable, or how changing the variable impacts the function value.

Let’s look at a simple example neural network.

Image 24: Simple Neural Network

There are two input nodes, x and y.

The output function calculates the product of x and y.

We can now compute the partial derivatives for both nodes.

Image 25: Derivatives with respect to y and x of the f(x,y) = xy function

The partial derivative with respect to x says that if the value of x increases by some value ϵ, the function (the product xy) increases by 7ϵ, and the partial derivative with respect to y says that if the value of y increases by some value ϵ, the function increases by 3ϵ (since ∂f/∂x = y and ∂f/∂y = x, this corresponds to x = 3 and y = 7).
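
These two values can be verified numerically. The sketch below assumes x = 3 and y = 7, which is what the 7ϵ and 3ϵ in the text imply for f(x,y) = xy; the helper `partial` is a hypothetical name for a simple central-difference estimate.

```python
def f(x, y):
    return x * y

def partial(func, args, index, eps=1e-6):
    """Central-difference estimate of d func / d args[index]."""
    up = list(args)
    up[index] += eps
    down = list(args)
    down[index] -= eps
    return (func(*up) - func(*down)) / (2 * eps)

x, y = 3.0, 7.0
df_dx = partial(f, (x, y), 0)   # analytically equals y
df_dy = partial(f, (x, y), 1)   # analytically equals x
print(df_dx, df_dy)
```

Nudging one input by a small ϵ and measuring how much the output moves is exactly the sensitivity that the derivative describes.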

As we defined, the Backpropagation algorithm calculates the derivative of the cost function with respect to each θ weight parameter.

By doing this we determine how sensitive the cost function J(θ) is to each of these θ weight parameters.

It also helps us determine how much we should change each θ weight parameter when computing the Gradient descent.

So at the end we get the model that best fits our data.

Math behind Backpropagation

We will be using the neural network model below as a starting point to derive the equations.

Image 26: Neural Network

In this model we have 3 output nodes (K) and 2 hidden layers.

As previously defined, the cost function for the neural network is:

Image 27: Generalized Cost function. source: coursera.org

What we need is to compute the partial derivative of J(θ) with respect to each θ parameter.

We are going to leave out the summation, since we are using a vectorized implementation (matrix multiplication).

We can also leave out the regularization (the right-hand part of the equation above) and compute it separately at the end.

Since it is an added term, its derivative can be computed independently.

NOTE: A vectorized implementation will be used, so we calculate for all training examples at once.

We start by defining the derivative rules that we will use.

Image 28: Derivative Rules

Now we define the basic equations for our neural network model, where l denotes a layer and L denotes the last layer.

Image 29: Initial Neural Network model equations

In our case L has the value 4, since we have 4 layers in our model.

So let’s start by computing the partial derivative with respect to the weights between the 3rd and 4th layer.

Image 30: Derivative of θ parameters between 3rd and 4th layer

Step (6): Sigmoid derivative

To explain step (6) we need to calculate the partial derivative of the sigmoid function.

Image 31: Derivative of Sigmoid function

In the case of the last layer L we get,

Image 32: Output layer equation

so,

Image 33: Output layer equation

Step (11): Get rid of the summation (Σ)

In the last step (11) it’s also important to note that we need to multiply δ by a transposed in order to get rid of the summation (1…m for the training examples).

δ: matrix with dimensions [number_of_training_examples, output_layer_size], so this also means that we get rid of the second summation (1…K for the number of output nodes).

a: matrix with dimensions [hidden_layer_size, number_of_training_examples]

Now we continue with the next derivative, for the θ parameters between the 2nd and 3rd layer.

For this derivation we can start from step (9) (“Image 30”).

Since θ(2) is inside the a(3) function, we need to apply the “Chain Rule” when calculating the derivative (step (6) of the derivative rules in “Image 28”).

Image 34: Derivative of θ parameters between 2nd and 3rd layer

Now we have the derivative for the θ parameters between the 2nd and 3rd layer.

What is left to do is to compute the derivative for the θ parameters between the input layer and the 2nd layer.

By doing this we will see that the same process (equations) is repeated, so we can derive general equations for δ and for the derivative.

Again we continue from step (3) (“Image 34”).

Image 35: Derivative of θ parameters between input and 2nd layer

From the equation above we can derive equations for the δ parameter and for the derivative with respect to the θ parameter.

Image 36: Recursive δ equation

Image 37: Derivative of J (cost) with respect to θ in layer l equation

At the end we get three matrices with the same dimensions as the θ weight matrices, holding the calculated derivative for each θ parameter.
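
The recursive δ computation and the resulting derivative matrices can be sketched in vectorized NumPy. This is an illustration under stated assumptions, not the original implementation from the post: sigmoid activations in every layer, the unregularized cross-entropy cost averaged over m examples, training examples stored as rows, and made-up layer sizes (3 inputs, two hidden layers of 4 nodes, 3 output nodes). The names `add_bias` and `backprop` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(A):
    """Prepend a column of ones (the bias node) to every example."""
    return np.hstack([np.ones((A.shape[0], 1)), A])

def backprop(X, Y, thetas):
    """Unregularized gradients of the cross-entropy cost.

    X -- inputs, shape (m, n); Y -- labels, shape (m, K)
    thetas -- list of weight matrices, thetas[l] of shape (next, current + 1)
    """
    m = X.shape[0]
    # Forward pass: keep every layer's activations (with bias column).
    activations = [add_bias(X)]
    for i, theta in enumerate(thetas):
        a = sigmoid(activations[-1] @ theta.T)
        activations.append(a if i == len(thetas) - 1 else add_bias(a))
    # Backward pass: delta for the last layer is h(x) - y.
    delta = activations[-1] - Y
    grads = [None] * len(thetas)
    for l in range(len(thetas) - 1, -1, -1):
        grads[l] = delta.T @ activations[l] / m
        if l > 0:
            a = activations[l]
            # Recursive delta: propagate through theta, times g'(z) = a(1 - a),
            # then drop the bias column, which has no incoming weights.
            delta = ((delta @ thetas[l]) * a * (1 - a))[:, 1:]
    return grads

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                  # 5 made-up training examples
Y = np.eye(3)[rng.integers(0, 3, size=5)]    # one-hot labels, 3 classes
thetas = [rng.normal(size=(4, 4)), rng.normal(size=(4, 5)), rng.normal(size=(3, 5))]
print([g.shape for g in backprop(X, Y, thetas)])
```

Each returned matrix has the same shape as its θ matrix. A useful sanity check for code like this is to compare a few entries against a numerical derivative of the cost.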

Add the regularization

As already mentioned, regularization is needed to prevent the model from overfitting the data.

We have already defined the regularization for our cost function, which is the right-hand part of the equation in “Image 21”.

Image 38: Regularization equation for Cost function

In order to add the regularization to the gradient (partial derivative) we need to compute the partial derivative of the regularization above.

Image 39: Regularization equation for gradient (partial derivative)

This means adding a term proportional to each θ value to the corresponding partial derivative with respect to θ, for every layer.
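
As a sketch of this step, assume the penalty in the cost is λ/(2m) times the sum of the squared weights and that the bias weights (first column) are not regularized, which is a common convention but an assumption here. Its derivative with respect to each weight is then (λ/m)·θ. The function name and the numbers are hypothetical.

```python
import numpy as np

def add_regularization(grad, theta, lam, m):
    """Add the derivative of (lam / (2m)) * sum(theta**2) to a gradient."""
    penalty = (lam / m) * theta   # new array, theta itself is not modified
    penalty[:, 0] = 0.0           # assumed convention: bias weights are not penalized
    return grad + penalty

theta = np.array([[0.5, 1.0, -2.0]])   # made-up weights, column 0 is the bias weight
grad = np.zeros_like(theta)            # stand-in for the backpropagation gradient
print(add_regularization(grad, theta, lam=1.0, m=4))
```

In a full implementation this would be applied to the derivative matrix of every layer after Backpropagation.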

Code Implementation

We can now implement all the equations in code, where we calculate the cost and the derivatives (using Backpropagation) so that we can later use them in the Gradient descent algorithm to optimize the θ parameters of our model.

Image 38: Code implementation of Neural Network Cost function and Backpropagation algorithm

Conclusion

Hopefully this was clear and easy to understand.

If you think that some part needs better explanation please feel free to add a comment or suggestion.

For any questions feel free to contact me.

Hope you enjoyed it!

Helpful links

Introduction to Derivatives (www.mathsisfun.com)