In order to do so, we need a labeled data from which the neural network can learn from.

We call this as a training data set.

Let’s see how the training process works.

Train & Test DataFor training the neural network we will use the MNIST dataset which contains tens of thousands of scanned images of handwritten digits along with their correct classifications.

It gets its name from NIST, which stands for United State’s National Institute of Standards and Technology.

It is modified subsets of two datasets collected by NIST.

MNIST data comes in two parts.

The first part contains 60,000 images to be used as training data.

These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students.

The images are greyscale and 28 by 28 pixels in size.

The second part of the MNIST data set is 10,000 images to be used as test data.

Again, these are 28 by 28 greyscale images.

We’ll use the test data to evaluate how well our neural network has learned to recognize digits.

To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data.

We’ll use the notation “x” to denote a training input.

Each training input (each scanned image) “x” will be 28 * 28 = 784 dimensional vectors.

And each entry in the vector represents the grey value for a single pixel in the image.

The desired output (true value) is “y” (the form would be y = y(x)) which is a 10 dimensional vector.

For an example, if a particular training image (x) is 4, then y(x) = (0, 0, 0, 0, 1, 0, 0, 0, 0, 0) transpose vector is the desired output from the network.

The reason we do a transpose, as a vector by default is defined as a column vector.

So, we flipped a row vector to a column vector.

Learning with Gradient DescentGiven the above setup, we need an algorithm which will find ‘weights’ and ‘biases’ so that the output from the network approximates y(x) for all training inputs x.

To quantify how well we are achieving this objective, we define a Cost function.

These set of functions are also called as Objective function or Loss function.

Cost function for our neural networkHere, “w” denotes the collection of all weights in the network, “b” all the biases, “n” is the total number of training inputs, “a” is the vector of outputs from the network when “x” is input, and the sum is over all training inputs, x.

The notation ||v|| just denotes the normal length function for a vector “v” also known as “norm” of a vector.

“C” is the quadratic cost function, it’s also known as MSE (Mean Squared Error).

Now let’s understand what values of Cost function signifies that our network is doing good.

As, for all values of x we sum the square of the difference, so it is always non-negative.

So, C(w, b) is non-negative.

Further the cost becomes small i.

e.

C(w, b) ≈ 0, precisely when y(x) i.

e.

the true output approximately equals to the network output “a” for all training input x.

This means our training algorithm is doing well, if it has found such weights (w) and biases (b) so that C(w, b) ≈ 0.

On the contrary our algorithm is not doing well, if C(w, b) is very large, which means that “a” is different from “y” for majority of the input “x”.

So, the aim of our training algorithm is to minimize the cost, C(w, b) as a function of weights (w) and biases (b).

We will do this using an algorithm known as Gradient Descent.

We use a smooth cost function like the quadratic cost where it is easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost.

(we also discussed this part in the previous article where we moved from step function to sigmoid function).

Although quadratic cost is not the best in terms of result, but it is surely the best in terms of explaining the fundamental concepts of Gradient Descent.

Later on we will experiment with other cost functions.

Now that we know what to do, let’s draw our focus to understand what is Gradient Descent.

For next few minutes we will forget that we are solving a neural network.

We will try to understand gradient descent on its own.

What is Gradient Descent ?Let’s say we have a function of many variables and we want to minimize that function.

We are going to develop a technique called “Gradient Descent” which will be used to solve such minimization problems (optimization problems).

Say we are trying to minimize a function, C(v).

This could be any real-valued function of many variables v = v1, v2, v3, … .

Let’s plot C(v) as a function of v1, v2 (say two variables for the plotting purpose).

Plotting a real-valued function C(v) of two variables v1 & v2So, what we are interested in is to find the values of v1 & v2 at which function “C” achieves global minimum.

Now looking at this function probably you can eyeball and locate the global minimum.

Yes, when v1 = v2 = 0, then C(v) which is in the 3rd axis achieves global minimum whose value is also 0.

Thus we can say the functional form of C(v) isC(v) = v1*v1 + v2*v2Now, this is a very simple function.

In reality, a function comprises of many many variables where you can eyeball and find the global minimum or global maximum (remember it is an optimization problem and it can be either maximize or minimize based on the problem statement).

So, one way to tackle these kind of problems is to use Calculus to find the minimum analytically.

We could compute derivatives and then try using them to find places where C is an extremum (either minimum or maximum).

So, computing derivatives for functions with few variables is OK, but neural networks will have thousands of weights and biases in a very complicated way.

It’s difficult to compute partial derivatives of such functions.

There is another type of functions called “Implicit” function where you cant’t compute derivatives.

So, what do you do with all these complicated functions if you were to find the extremum for those.

Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well.

Start by thinking of our function as a kind of a valley.

Imagine our function is a kind of ValleyAlso imagine a ball rolling down the slope of the valley.

Our experience tells that the ball will eventually roll to the bottom of the valley.

Perhaps we can use this idea as a way to find a minimum for the function.

We’d randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley.

Now, let’s get into the specifics.

Keeping the above function in mind, let’s think about what happens when we move the ball a small amount Δv1 in the v1 direction, and a small amount Δv2 in the v2 direction.

Calculus tells us the function “C” changes as follows -We are going to find a way of choosing Δv1 and Δv2 so as to make ΔC negative, i.

e.

we will choose them so that ball is rolling down into the valley (that’s why we are minimizing the cost function).

To figure out how to make such a choice.

With these two vectors we can now represent ΔC, which is defined earlier as a small change in the Cost function which we are trying to minimize.

Here is the one-This equation tells you that gradient vector C is a relation between small changes in v and C.

Now with these let’s select Δv so to make ΔC negative.

Here, ΔC is always decreasing if we change Δv as per the above equation.

This is exactly what we wanted.

So, we will roll the ball and change the position of the ball from v to v-prime in the following order -Changes is V in every moveThen we’ll use this update rule again, to make another move.

If we keep doing this, over and over, we’ll keep decreasing C until we reach a global minimum.

So, in Gradient Descent algorithm we repeatedly compute the gradient vector of C, and then move down the slope of the valley.

If we have done it enough we hope to reach the global minimum of the valley or the cost function.

Here the parameter eta (η) i.

e.

the learning rate has to be chosen small because you want to slowly come down the slope of the function.

If it is chosen high then it might jump to the other side of the function where ΔC > 0.

If it is too small the algorithm will not converge, meaning it will take a very long time to reach the global minimum.

The reason is that the change in “v” i.

e.

Δv is a function of the learning rate (you can see in the above equation).

We will see later how to optimize this learning rate, but I think the concept is clear about Gradient Descent.

For multi-variables the concept remains the same, but the notation changes a bit in the vector representations.

Gradient Descent to learn in a Neural NetworkThe idea is to use Gradient Descent to find the weights w and biases b which minimizes the below cost function -Cost function for our neural networkGradient Descent rule for our networkBy repeatedly applying this update rule, we can “roll down the hill” and hopefully find the global minimum of the cost function.

At that point we get the optimal value of weights (w*) and biases (b*) which are used to identify any handwritten digits.

In other words, this is a rule which can be used to learn in a neural network.

Challenges with Gradient DescentIf we look at the below objective function, in order to optimize it, we needCost function for our neural networkto compute gradient (∇C) for each individual training example and in order to do that we have to compute the gradients ∇Cx (gradient of all inputs x) separately for each training input, x, and then average them.

It is like this-Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called Stochastic Gradient Descent can be used to speed up learning.

The idea is to estimate the gradient ∇C by computing ∇Cx for a small sample of randomly chosen training inputs.

By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient ∇C, and this helps speed up gradient descent, and thus learning.

Stochastic Gradient DescentStochastic gradient descent works by randomly picking out a small number “m” of randomly chosen training inputs.

We’ll label those random training inputs X1,X2,…,Xm and refer to them as a mini-batch.

Provided the sample size “m” is large enough we expect that the average value of the ∇CXj will be roughly equal to the average over all ∇Cx, i.

e.

So, we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.

So, for the neural network learning, suppose wk and bl denote weights & biases respectively.

Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those where sums are over all the training examples Xj in the current mini batch.

Stochastic Gradient Descent equationsThen we pick out another randomly chosen mini-batch and train with those.

And so on, until we’ve exhausted the training inputs, which is said to complete an epoch of training.

At that point we start over with a new training epoch.

For example, if we have a training set of size n=60,000 as in MNIST, and choose a mini-batch size of (say) m=10, this means we’ll get a factor of 6,000 speedup in estimating the gradient!In our next article we will use this framework to implement a neural network to classify digits using Python programming.

SourcesNeural Networks & Deep Learning by Michael Nielsen.