We use a plot of the loss/error, as given by the cost function, against the number of iterations:

[Figure: training and validation error versus the number of iterations]

Note: The x-axis is the number of iterations for which we train our model (often loosely referred to as the model complexity).

What we can infer from this graph is that, as the number of iterations increases, the training error decreases.

It means that our model is able to capture the essence of the patterns present in the training dataset.

On the other hand, if you were to look at the validation error curve, which is evaluated on the cross validation dataset, the error is decreasing initially, but after a certain point, the error starts increasing.

This tells us that the same model is performing well on the training set, but not so well on the cross-validation set.

By this, we can understand that our model is not generalising well.

The idea, therefore, is to stop training when the training error is as low as possible, subject to the constraint that the validation error has not started to increase.

This sweet spot gives our model the ability to learn, rather than to memorise or perform poorly.
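The stopping rule described above can be sketched in a few lines of plain Python. This is only an illustration: the function name, the `patience` parameter and the validation losses are all made up for the example, and a real training loop would also save the model weights at the best epoch.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights we would keep.

    val_losses: validation loss measured after each epoch.
    patience:   hypothetical knob: how many non-improving epochs to
                tolerate before deciding the curve has turned upwards.
    """
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error has started to increase
    return best_epoch


# Made-up validation losses: they fall at first, then start rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
stop_at = early_stopping_epoch(val_losses)
print(stop_at)
```

If you are using Keras, the built-in tf.keras.callbacks.EarlyStopping callback (with its monitor, patience and restore_best_weights arguments) serves the same purpose.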

Options for overcoming high bias

Use a bigger network: This approach often helps because the model you are currently using, whose capacity is determined by your network size, might not be capable of separating data which is non-linear.

In simple words, your network might be trying to fit the data using a straight line, whereas the data, if we were to visualise, is non-linear in nature.

Run gradient descent for a longer time: This might help because, while performing gradient descent, we might not yet have descended far enough to approach the minimum.

Running for a longer time rarely hurts in this case, as long as we are not overfitting the model; since this is a high-bias case, it is recommended to run it a bit longer and then look at the results.

Try out some advanced optimisation algorithms: We will be discussing them in the subsequent sections.

Change the neural network architecture itself (to one which better suits the problem under consideration): This solution may or may not help, but it is worth a shot.

This is a highly domain-specific solution, so there is no guarantee that it will work out.

Options for overcoming high variance

Train on more data: We can justify this solution by induction.

Suppose you have only one training example; then, regardless of what model you use, you can fit it to the example perfectly, resulting in a no-bias case.

But as soon as you start introducing new examples, your model’s performance starts degrading.

This is because our model has not seen any “representative set” of the data of the problem which we are trying to solve.

In other words, we have not learned anything “useful.”

Regularization: Penalising weights reduces overfitting and helps the model generalise well to unseen examples.

We will discuss it in a subsequent section.

Change the neural network architecture itself (to one which better suits this problem): Same description as above (for the high-bias case).

Note: There are cases where your model might be suffering from both high bias and high variance.

In such a scenario, you are better off tackling the high bias first, using some of the above-mentioned solutions, and then tackling the high variance.
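To make the diagnosis concrete, here is a minimal sketch which labels a model from its training and validation errors. The function name and both thresholds (`target_error` and `gap_tolerance`) are assumptions made purely for illustration; sensible values depend entirely on your problem.

```python
def diagnose(train_error, val_error, target_error=0.05, gap_tolerance=0.05):
    """Label a model from its training and validation errors.

    target_error and gap_tolerance are illustrative thresholds, not
    values from the article; pick them for your own problem.
    """
    issues = []
    if train_error > target_error:
        issues.append("high bias")      # underfits even the training set
    if val_error - train_error > gap_tolerance:
        issues.append("high variance")  # fails to generalise
    return issues or ["ok"]


print(diagnose(0.20, 0.22))  # poor on both sets
print(diagnose(0.01, 0.15))  # good on training, poor on validation
print(diagnose(0.20, 0.35))  # both problems at once
```

The checks are ordered so that high bias is reported first, mirroring the advice above to tackle it before high variance.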

I have prepared a sample Jupyter Notebook (using the Keras API of the TensorFlow framework) which can be used to analyse bias and variance by varying the number of iterations.

You can refer to it here on my GitHub repository.

2. Different Optimisation Techniques

The gradient descent technique which you have seen in a few of my previous articles is the vanilla implementation.

It is also called Batch Gradient Descent because it looks at all the examples before making an update to our model.

There are several alternatives available which can considerably improve on this by speeding up the learning.

Mini-batch gradient descent: This variation of gradient descent splits the dataset into batches, each of size ‘b’ (such that ‘b’ * total_number_of_batches = ‘M’, i.e. the total number of training examples).

An important term to note here is ‘epoch.’

An epoch is one complete pass of the model over the entire dataset (containing ‘M’ examples).

In traditional batch gradient descent, the update occurs once every epoch.

Whereas, in mini-batch gradient descent, the update occurs ‘M/b’ times (which is nothing but the total number of batches) every epoch.

Consequently, the model utilising this algorithm will train faster.

However, it is crucial to note that the loss/error won’t decrease steadily at every iteration of this algorithm.

In fact, at some iterations the model’s loss/error might increase.

Nevertheless, in practice, it has been observed that the model usually ends up close to a minimum.
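The relationship between batch size and the number of updates can be seen in the following sketch, which fits a single weight with plain Python. Everything here (the function name, the toy data generated from y = 3x, the learning rate) is assumed for illustration. Because the toy data is noise-free, it mainly demonstrates the update count of ‘M/b’ per epoch, not the fluctuating loss.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size, epochs=50, lr=0.05, seed=0):
    """Fit y = w * x with mini-batch gradient descent on squared error.

    Returns the learned weight and the number of updates performed.
    """
    rng = random.Random(seed)
    w = 0.0
    updates = 0
    indices = list(range(len(xs)))

    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
            updates += 1
    return w, updates


# Toy data generated from y = 3x, so the ideal weight is 3.
xs = [0.1 * i for i in range(1, 21)]   # M = 20 examples
ys = [3.0 * x for x in xs]

w_batch, n_batch = minibatch_gradient_descent(xs, ys, batch_size=20)  # batch GD
w_mini, n_mini = minibatch_gradient_descent(xs, ys, batch_size=5)     # b = 5
w_sgd, n_sgd = minibatch_gradient_descent(xs, ys, batch_size=1)       # b = 1

print(n_batch, n_mini, n_sgd)  # updates = epochs * M / b
```

Setting the batch size to ‘M’ recovers batch gradient descent, and setting it to one gives the stochastic variant described next.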

Stochastic gradient descent: This is a special case of mini-batch gradient descent when the batch size ‘b’ is one.

The advantage of this algorithm is its very fast nature, because the updates are done very frequently.

However, clearly, it suffers from a lot of variation in loss/error at every iteration of the algorithm.

This algorithm, therefore, is usually not used right out of the box; rather, it is combined with some of the algorithms mentioned below.

Adadelta, Adagrad, Adam, RMSprop: These are four advanced algorithms which are used quite frequently while building deep neural networks.

Most of them work by incorporating exponentially weighted averages of past gradients or of their squares (Adagrad, by contrast, accumulates a plain running sum of squared gradients).

In other words, instead of taking the raw step specified by the gradient descent update, these algorithms compute (each by its own means) a loosely defined “average” direction or step size.

Cumulatively, this has the effect of moving towards the minimum, rather than moving along a zig-zag path.
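To see why averaging dampens the zig-zag, here is a small sketch of an exponentially weighted average applied to a made-up sequence of gradients. The function name, the value of `beta` and the gradients are assumptions for illustration. The real optimisers differ in what they average: Adam, for example, averages both gradients and squared gradients and adds a bias correction, which is not shown here.

```python
def exponentially_weighted_average(values, beta=0.9):
    """Return the running average v_t = beta * v_{t-1} + (1 - beta) * x_t."""
    averaged = []
    v = 0.0
    for x in values:
        v = beta * v + (1 - beta) * x
        averaged.append(v)
    return averaged


# Made-up gradients along one axis that keep flipping sign (a zig-zag)
# while their mean stays positive.
zigzag_gradients = [1.0, -0.8, 1.0, -0.8, 1.0, -0.8, 1.0, -0.8]
smoothed = exponentially_weighted_average(zigzag_gradients)

raw_swing = max(zigzag_gradients) - min(zigzag_gradients)
smoothed_swing = max(smoothed) - min(smoothed)
print(raw_swing, smoothed_swing)
```

The raw gradients swing back and forth across zero, whereas the averaged values stay small and keep the same sign, so the update keeps heading in one consistent direction.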

I have prepared a sample Jupyter Notebook (using the Keras API of the TensorFlow framework) which can be used to compare various optimisation algorithms (on the Fashion MNIST dataset).

You can refer to it here on my GitHub repository.

In the meantime, the following code snippet illustrates the same:

[Code snippet: comparison of the above optimisation algorithms]

The plots of the variation of loss with respect to the number of iterations for the above algorithms are as follows:

[Figure: loss versus number of iterations for each of the above algorithms]

Note: All of the above algorithms were trained for only 5 epochs.

As is evident from the above plots, if we were to increase the number of epochs, the loss would likely decrease even further.

3. Types of Regularization

Regularization is a technique which helps prevent your model from overfitting.

The idea behind regularization is the penalisation of weights in the model.

Consider the below diagram:

[Figure: the cost function without a regularization term]

As you can see, the cost function is only concerned with minimising the training error by adjusting the weights of the model.

The consequence of this action is that the weights get fine-tuned to the training data and therefore fail to generalise.

Furthermore, unless and until you test the performance of your model on cross validation set, you do not know whether your model is actually learning something useful or is overfitting to the training data.

This makes neural network models highly susceptible to the problem of overfitting.

Therefore, to alleviate this problem, we introduce the regularization term:

[Figure: the cost function J(θ) with the regularization term added]

Mathematically, the regularization term adds the squared weights of the model to the cost, scaled by a coefficient.

With this term in place, the cost function, J(θ), is left with only a single option to minimise the entire cost, which is reducing the weights of the model.

Intuitively, consider a function f(x) = x*k: if we want minimising the function to favour a small ‘x’, we can increase the value of ‘k’.

This same reasoning is applied in the cost function.

Note: λ is called the regularization parameter.
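As a rough sketch, the regularised cost can be computed as below. The exact scaling of the penalty is an assumption: this example uses the common λ / (2m) convention, where m is the number of training examples, and the function name and numbers are made up for illustration.

```python
def l2_regularised_cost(data_loss, weights, lam, m):
    """Return data_loss + (lam / (2 * m)) * sum of squared weights.

    data_loss: unregularised cost on the training set.
    weights:   flat list of the model's weights (biases left out).
    lam:       the regularization parameter (lambda in the text).
    m:         number of training examples (assumed scaling convention).
    """
    penalty = (lam / (2 * m)) * sum(w * w for w in weights)
    return data_loss + penalty


weights = [3.0, -4.0]          # sum of squares = 25
base = l2_regularised_cost(0.5, weights, lam=0.0, m=10)
small = l2_regularised_cost(0.5, weights, lam=0.1, m=10)
large = l2_regularised_cost(0.5, weights, lam=1.0, m=10)
print(base, small, large)
```

For the same weights, a larger λ makes the cost larger, so the optimiser is pushed harder towards smaller weights.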

Moreover, the above form of regularization is often referred to as L2 regularization.

There is yet another kind of regularization called dropout regularization, which is used frequently while building CNNs (Convolutional Neural Networks).

Dropout works as follows:

[Figure: neural network diagram illustrating dropout]

The principle behind dropout is “We can’t rely on any single feature, so we have to spread out weights.”

More formally, the following quote is taken from the original dropout paper:

In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing.

Therefore, units may change in a way that they fix up the mistakes of the other units.

This may lead to complex co-adaptations.

This in turn leads to overfitting because these co-adaptations do not generalise to unseen data.

We hypothesise that for each hidden unit, Dropout prevents co-adaptation by making the presence of other hidden units unreliable.

Therefore, a hidden unit cannot rely on other specific units to correct its mistakes.

Consider the third layer (i.e. the second hidden layer) of the previous diagram.

Each neuron in this layer is a feature, albeit an intermediate one, computed from the input features.

The value of the output neuron is decided by the values of the neurons feeding into it.

Since we will be randomly knocking out neurons, the cost function, or for that matter the gradient descent algorithm, will be less inclined to give any particular feature more weight (that is, to become biased towards a particular feature).

Hence the weights will be more evenly distributed.

Coming to the implementation details, dropout can be implemented as follows (in TensorFlow):

[Code snippet: dropout implemented in TensorFlow]

I have prepared a sample Jupyter Notebook (using the Keras API of the TensorFlow framework) illustrating dropout regularization (on the Fashion MNIST dataset).

You can refer to it here on my GitHub repository.

As we can see, dropout regularization asks for a probability value (between 0 and 1) for keeping the neurons at each layer, and this value can differ from layer to layer.

For instance, if we give the probability as 0.5 for the first hidden layer, then half of the neurons will be dropped on average.
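The effect of the keep probability can be checked with a small framework-free sketch. It assumes the commonly used “inverted dropout” variant, in which the surviving activations are scaled up by 1 / keep_prob; the function name and the made-up layer of activations are for illustration only and are not taken from the notebook.

```python
import random


def dropout(activations, keep_prob, rng):
    """Apply inverted dropout to one layer's activations.

    Each neuron is kept with probability keep_prob; survivors are scaled
    by 1 / keep_prob so the expected activation stays unchanged.
    """
    output = []
    for a in activations:
        if rng.random() < keep_prob:
            output.append(a / keep_prob)
        else:
            output.append(0.0)   # this neuron is knocked out
    return output


rng = random.Random(0)
layer = [1.0] * 10_000            # a made-up layer of activations
dropped = dropout(layer, keep_prob=0.5, rng=rng)

kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
print(kept_fraction)              # close to keep_prob
```

Note that the Keras layer tf.keras.layers.Dropout takes the fraction of units to drop (its rate argument) rather than the fraction to keep, so a keep probability of 0.8 corresponds to a rate of 0.2.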

Moreover, during each iteration of gradient descent we might be left with a different set of neurons.

As a result, the decay of the cost function will not be smooth:

[Figure: cost versus number of iterations when training with dropout]

Summary

In this article we have learned about bias-variance analysis, various optimisation techniques, different types of regularization, and finally the impact of initialisations.

Furthermore, the code which is linked throughout the article uses the basic TensorFlow framework and the Keras API, hence you would also have learned a little about these frameworks.

I hope that you were able to learn something useful in this article.

Cheers!