")And we gety [[0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]]yh [[0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]]Both match perfectly, because we have achieved 100% accuracy on our validation set.

Therefore, the function has learned to fit both the training and validation sets pretty well.

One great way to analyze the accuracy is by plotting a confusion matrix.

First, we declare a custom plotting function.

    import itertools
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix

    def plotCf(a, b, t):
        # a: actual labels, b: predicted labels, t: plot title
        cf = confusion_matrix(a, b)
        plt.imshow(cf, cmap=plt.cm.Blues, interpolation='nearest')
        plt.colorbar()
        plt.title(t)
        plt.xlabel('Predicted')
        plt.ylabel('Actual')
        class_labels = ['0', '1']
        tick_marks = np.arange(len(class_labels))  # one tick per class
        plt.xticks(tick_marks, class_labels)
        plt.yticks(tick_marks, class_labels)
        # plotting text value inside cells
        thresh = cf.max() / 2.
        for i, j in itertools.product(range(cf.shape[0]), range(cf.shape[1])):
            plt.text(j, i, format(cf[i, j], 'd'), horizontalalignment='center',
                     color='white' if cf[i, j] > thresh else 'black')
        plt.show()

(This custom confusion matrix function comes from a public Kaggle notebook created by JP.)

Then, we run the pred function again twice, and plot confusion matrices for both the training and validation sets.

    # Training set (np.int was removed from recent NumPy versions, so we use int)
    nn.X, nn.Y = x, y
    target = np.around(np.squeeze(y), decimals=0).astype(int)
    predicted = np.around(np.squeeze(nn.pred(x, y)), decimals=0).astype(int)
    plotCf(target, predicted, 'Cf Training Set')

    # Validation set
    nn.X, nn.Y = xval, yval
    target = np.around(np.squeeze(yval), decimals=0).astype(int)
    predicted = np.around(np.squeeze(nn.pred(xval, yval)), decimals=0).astype(int)
    plotCf(target, predicted, 'Cf Validation Set')

We can see even more clearly that our validation set has perfect accuracy on its 183 samples.

As for the training set, there are 19 mistakes among the 500 samples.
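
To make the cells of the matrix concrete, here is a minimal sketch that counts them by hand with NumPy. The helper name `confusion_counts` and the toy label arrays are illustrative assumptions, not part of the project code. Rows are actual classes and columns are predicted classes, matching the axis labels used in plotCf.

```python
import numpy as np

def confusion_counts(actual, predicted, n_classes=2):
    """Count how often each (actual, predicted) pair of classes occurs."""
    cf = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        cf[a, p] += 1  # row = actual class, column = predicted class
    return cf

# Toy labels (hypothetical, not taken from the tumor dataset)
actual = np.array([0, 0, 1, 1, 1, 0, 1, 0])
predicted = np.array([0, 1, 1, 1, 0, 0, 1, 0])

cf = confusion_counts(actual, predicted)
mistakes = cf.sum() - np.trace(cf)   # off-diagonal cells are the errors
accuracy = np.trace(cf) / cf.sum()   # diagonal cells are the correct predictions
print(cf)
print("mistakes:", mistakes, "accuracy:", accuracy)
```

The diagonal holds the correct predictions, so the number of mistakes is simply everything outside it.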

Now, at this point you may say that in a topic as delicate as diagnosing a tumor, setting our prediction to be 1 if the sigmoid output gives a value above 0.5 is not really good.

The network should be really confident before giving a prediction of malignancy.

I totally agree, that’s very correct.

And these are the kinds of decisions that you need to take depending on the nature of the challenge and topic you are dealing with.

Let’s then create a new variable called threshold.

It will control our confidence threshold, how close to 1 the output of the network needs to be before we decide that a tumor is malignant.

By default we set it to 0.5:

    self.threshold = 0.5

Our prediction function is now updated to use that confidence threshold.

    def pred(self, x, y):
        self.X = x
        self.Y = y
        comp = np.zeros((1, x.shape[1]))
        pred, loss = self.forward()

        # Predict 1 (malignant) only when the output exceeds the threshold
        for i in range(0, pred.shape[1]):
            if pred[0, i] > self.threshold:
                comp[0, i] = 1
            else:
                comp[0, i] = 0

        print("Acc: " + str(np.sum((comp == y) / x.shape[1])))

        return comp

Let’s now compare our results as we gradually raise the confidence threshold.

Confidence threshold: 0.5. Output values need to be higher than 0.5 for the output to be considered malignant. As seen previously, the validation accuracy is 100%, the training one is 96%.

Confidence threshold: 0.7. Output values need to be higher than 0.7 for the output to be considered malignant. The validation accuracy remains at 100%, the training one decreases a bit to 95%.

Confidence threshold: 0.8. Output values need to be higher than 0.8 for the output to be considered malignant. The validation accuracy decreases for the first time, very slightly, to 99.45%. In the confusion matrix we see that 1 single sample of the 183 is not recognized correctly. The training accuracy decreases a bit more, to 94.2%.

Confidence threshold: 0.9. Finally, in the case of 0.9, output values need to be higher than 0.9 for the output to be considered malignant. We are looking for almost complete confidence. The validation accuracy decreases a bit more, to 98.9%. In the confusion matrix we see that 2 samples of the 183 were not recognized correctly. The training accuracy decreases further, to 92.6%.

Therefore, by controlling the confidence threshold, we adapt to the specific needs of our challenge.
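
As a rough illustration of that trade-off, the sketch below sweeps the threshold over a handful of made-up sigmoid outputs. The function name `predict_with_threshold`, the probabilities and the labels are all assumptions for the example. It is a vectorized stand-in for the loop in pred, not the project code itself.

```python
import numpy as np

def predict_with_threshold(probabilities, threshold=0.5):
    """Return 1 where the sigmoid output exceeds the threshold, else 0."""
    return (probabilities > threshold).astype(int)

# Hypothetical sigmoid outputs and true labels (1 = malignant)
probs = np.array([0.97, 0.85, 0.75, 0.60, 0.30, 0.10, 0.95, 0.05])
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])

results = {}
for threshold in (0.5, 0.7, 0.8, 0.9):
    predicted = predict_with_threshold(probs, threshold)
    false_pos = int(np.sum((predicted == 1) & (labels == 0)))
    false_neg = int(np.sum((predicted == 0) & (labels == 1)))
    accuracy = float(np.mean(predicted == labels))
    results[threshold] = (accuracy, false_pos, false_neg)
    print(f"threshold={threshold}: acc={accuracy:.3f}, "
          f"false positives={false_pos}, false negatives={false_neg}")
```

On this toy data, a higher threshold removes the false alarm but starts to miss malignant samples, which is the kind of balance you have to decide on for your own problem.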

If we want to lower the loss value related to our training set (because we are failing to recognize a small percentage of the training samples), we can try to train for longer, and also use different learning rates.

For example, if we set the learning rate to 0.07 and train for 65000 iterations, we obtain:

    Cost after iteration 63500: 0.017076
    Cost after iteration 64000: 0.016762
    Cost after iteration 64500: 0.016443
    Acc: 0.9980000000000003
    Acc: 0.9945054945054945

Now, with our confidence threshold set to 0.5, the network is accurate with every sample in both sets, except with one of each.

If we raise the confidence threshold to 0.7, performance is still excellent, only 1 validation sample and 2 training samples are not predicted correctly.

Finally, if we are really demanding and set the confidence threshold to 0.9, the network fails to guess correctly 1 of the validation samples and 10 of the training ones.

Although we have done quite well, considering that we are using a basic network without regularization, it is typical for things to get much harder when you are dealing with more complex data.

Often, the loss landscape gets very complex and it’s easier to fall into the wrong local minimum or fail to converge to a good enough loss.

Also, depending on the initial conditions of the network, we may converge to a good minimum or we may get stuck at a plateau somewhere and fail to get out of it.

It’s useful at this stage to picture again our initial animation.

Navigating the Loss Landscape.

Values have been modified and scaled up to facilitate visual contrast.

Picture that landscape, full of hills and valleys, places where the loss is really high, and places where the loss gets very low.

The landscape of the loss function related to a complex scenario is often not uniform (though it can be made more smooth using different methods, but that’s a whole different topic).

It’s full of hills and valleys of different depths and angles.

The way you move around the landscape is by updating the weights of the network when you run the gradient descent algorithm, which in turn changes the loss value.

And the speed at which you move is controlled by the learning rate:

If you are moving very slowly and somehow arrive at a plateau or a valley that is not low enough, you may get stuck there.

If you move too fast, you may arrive at a low enough valley but rush through it and move away from it just as fast.
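
A tiny experiment makes the effect of the speed visible. This is only a sketch on a toy one-dimensional loss, loss(w) = w**2, which has no plateaus or side valleys. The function name `gradient_descent` and the three rates are assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize loss(w) = w**2 from `start`; return the final w."""
    w = start
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * grad   # move against the gradient
    return w

slow = gradient_descent(learning_rate=0.001)  # barely moves from the start
good = gradient_descent(learning_rate=0.1)    # approaches the minimum at 0
fast = gradient_descent(learning_rate=1.1)    # overshoots more on every step
print(slow, good, fast)
```

With the tiny rate the ball has hardly left its starting point after 20 steps, and with the large rate each step jumps further across the valley than the previous one.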

So there are some very delicate issues that have an enormous impact on how your network will perform.

The initial conditions: in what part of the landscape do you drop the ball at the beginning of the process?

The speed at which you move the ball, the learning rate.

A lot of the progress achieved recently in improving the speed with which neural networks train is connected to different techniques that dynamically manage the learning rate and also to new ways of setting those initial conditions in better ways.

Regarding the initial conditions:

Remember that each layer computes a combination of the weights and the inputs of the preceding layer (a weighted sum of the inputs) and passes that computation to that layer’s activation functions.

Those activation functions have shapes that can either accelerate or stop altogether the dynamics of the neurons, depending on the combination between the range of the inputs and the way they respond to that range.

If the sigmoid function, for example, receives values that trigger a result that is close to the extremes of its output range, the output of the activation function on that part of its range becomes really flat.

Where the function is flat, the derivative (the rate of change at that point) becomes zero or very small.

Recall that it is the derivative that helps us decide in what direction to move next.

Therefore, if the derivative is not giving us meaningful information, it will be very difficult for the network to know in what direction to move next from that point.

It is as if you had reached a plateau in the landscape and you were really confused as to where to go next, and you just kept moving in circles around that point.
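
To see how quickly the sigmoid flattens out, here is a small sketch that prints its derivative for a few inputs. The helper names are my own; the derivative formula sigmoid(z) * (1 - sigmoid(z)) is the standard one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The further the input is from 0, the flatter the curve and the smaller the derivative
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.5f}  derivative={sigmoid_derivative(z):.6f}")
```

The derivative peaks at 0.25 when the input is 0 and is already tiny by the time the input reaches 10, so a saturated unit passes almost no gradient back.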

This may happen also with ReLU, although ReLU has only 1 flat side as opposed to the 2 of Sigmoid and Tanh.

Leaky-ReLU is a variation of ReLU that slightly modifies that side of the function (the flat one) to try to prevent vanishing gradients.
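
The difference between the two is easiest to see in their derivatives. The sketch below assumes a slope of 0.01 for the leaky side, which is a common choice but still just an assumption here, and the function names are illustrative.

```python
def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 on the flat side."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU: a small slope alpha replaces the flat side."""
    return 1.0 if z > 0 else alpha

for z in (-3.0, -0.5, 0.5, 3.0):
    print(z, relu_grad(z), leaky_relu_grad(z))
```

For negative inputs ReLU gives no gradient at all, while Leaky-ReLU still lets a small signal through.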

It is therefore critical to set the initial values of our weights in the best way possible so that the computations of the units at the start of the training process produce outputs that fall within the best possible range of our activation functions.

That could make the whole difference between beginning at a really high hill of the loss landscape or way lower.

Managing the learning rate to prevent the training process from being too slow or too fast, and to adapt its value to the changing conditions of the process and of each parameter, is another complex challenge.

Talking about the many ways of dealing with the initial conditions and the learning rate would take a few articles.

I will briefly describe some of them to give an idea of some of the methods experts use to deal with these challenges.

Xavier initialization: A way of initializing our weights so that neurons won’t start in a saturated state (trapped at the delicate parts of their output ranges, where derivatives cannot provide enough information for the network to know where to go next).
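
As a hedged sketch of the idea, the snippet below compares weights drawn with a standard deviation of 1 against the Glorot/Xavier normal variant, which uses a variance of 2 / (fan_in + fan_out). The layer sizes, the |z| > 4 cut-off used to call a sigmoid unit "saturated", and the name `xavier_init` are assumptions made for the example.

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Glorot/Xavier normal initialization: variance 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
fan_in, fan_out = 400, 300             # hypothetical layer sizes
x = rng.normal(size=(fan_in, 1000))    # unit-variance inputs, 1000 samples

w_naive = rng.normal(0.0, 1.0, size=(fan_out, fan_in))
w_xavier = xavier_init(fan_in, fan_out, rng)

# Fraction of weighted sums that land in the flat regions of a sigmoid (|z| > 4)
saturated_naive = np.mean(np.abs(w_naive @ x) > 4)
saturated_xavier = np.mean(np.abs(w_xavier @ x) > 4)
print(saturated_naive, saturated_xavier)
```

With the naive weights most units start deep in the flat zones, while the scaled weights keep almost all of them in the range where the derivative is still informative.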

Learning rate annealing: high learning rates can push the algorithm to bypass and miss good minima in the loss landscape.

A gradual decrease of the learning rate can prevent that.

There are different ways to implement this decrease, including: exponential decay, step decay and 1/t decay.
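
The three decays mentioned can be sketched as simple functions of the iteration number t. The function names, the starting rate and the decay constants below are illustrative assumptions, and each formula is one common way of writing that schedule.

```python
import math

def exponential_decay(lr0, k, t):
    """lr = lr0 * exp(-k * t)"""
    return lr0 * math.exp(-k * t)

def step_decay(lr0, drop, every, t):
    """Multiply the rate by `drop` once every `every` iterations."""
    return lr0 * (drop ** (t // every))

def inverse_time_decay(lr0, k, t):
    """1/t decay: lr = lr0 / (1 + k * t)"""
    return lr0 / (1.0 + k * t)

lr0 = 0.07  # starting rate; the decay constants below are made up for the demo
for t in (0, 1000, 5000, 10000):
    print(t,
          round(exponential_decay(lr0, 1e-4, t), 5),
          round(step_decay(lr0, 0.5, 2500, t), 5),
          round(inverse_time_decay(lr0, 1e-3, t), 5))
```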

Fast.ai Lr_find(): An algorithm of the fast.ai library that finds the ideal range of values for the learning rate.

Lr_find trains the model through a few iterations.

It first tries to use a very low learning rate, and at each mini batch it changes the rate gradually until it reaches a very high value.

The loss is recorded at each iteration and a chart helps us visualize the loss against the learning rate.

We can then decide what are the optimal values of the learning rate that decrease the loss in the most efficient way.
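
The general idea of such a range test can be sketched without any library. The code below is not the fast.ai implementation: it is a toy version on loss(w) = w**2, and the name `lr_range_test` and all of its parameters are assumptions for illustration.

```python
def lr_range_test(lr_min=1e-4, lr_max=10.0, steps=100, start=5.0):
    """Sweep the learning rate exponentially while training on loss(w) = w**2."""
    w = start
    factor = (lr_max / lr_min) ** (1.0 / (steps - 1))
    lr = lr_min
    history = []                        # (learning rate, loss) pairs
    for _ in range(steps):
        history.append((lr, w ** 2))    # record the loss before the update
        w = w - lr * 2.0 * w            # one gradient descent step
        lr *= factor                    # raise the rate for the next step
    return history

history = lr_range_test()
best_lr, lowest_loss = min(history, key=lambda pair: pair[1])
print("loss is lowest around lr =", round(best_lr, 3))
print("final loss:", history[-1][1])
```

In this toy run the loss falls while the rate is moderate and then explodes once the rate gets too high. In practice people tend to pick a rate somewhat below the point where the loss bottoms out, where it is still clearly decreasing.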

Differential learning rates: Using different learning rates in different parts of our network.

SGDR, Stochastic Gradient Descent with Warm Restarts: Resetting our learning rate every x iterations.

This can help us get out of plateaus or local minima that are not low enough, if we get stuck in one of them.

A typical process is to start with a high learning rate.

You then decrease it gradually at each mini batch.

After x epochs you reset it back to its initial high value and the same process repeats.

The concept is that moving gradually from a high rate to a lower one makes sense because we first quickly move down from the high points of the landscape (initial high loss value) and then move slower to prevent bypassing the minima of the landscape (low loss value areas).

But if we get stuck at some plateau or a valley that is not low enough, restarting our rate to a high value every x iterations will help us jump out of that situation and continue exploring the landscape.
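
Here is a minimal sketch of such a schedule. It uses a cosine-shaped decrease, as in the SGDR paper, but with a fixed cycle length for simplicity. The function name and the rate values are assumptions for the example.

```python
import math

def sgdr_learning_rate(step, lr_max=0.1, lr_min=0.001, cycle_length=100):
    """Cosine annealing from lr_max to lr_min, restarting every cycle_length steps."""
    position = (step % cycle_length) / cycle_length   # progress within the cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * position))

for step in (0, 50, 99, 100, 150):
    print(step, round(sgdr_learning_rate(step), 4))
```

The rate glides down within each cycle and jumps back to its high value at step 100, which is the restart that can kick the ball out of a poor valley.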

1 Cycle Policy: A way of dynamically changing the learning rate proposed by Leslie N. Smith, in which we begin with a low rate value and gradually increase it until we reach a maximum.

Then, we proceed to gradually decrease it till the end of the process.

The initial gradual increase allows us to explore large areas of the loss landscape, increasing our chances of reaching a low area that is not bumpy; in the second part of the cycle, we settle in the low, flat area we have reached.
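
A simplified sketch of that shape follows. It only models the rise and fall of the learning rate with straight lines; the full policy has more details that are left out here. The function name, the 30% warm-up fraction and the rate values are assumptions.

```python
def one_cycle_learning_rate(step, total_steps, lr_max=0.1, lr_start=0.01, warmup_fraction=0.3):
    """Linear ramp from lr_start up to lr_max, then linear decay back down to lr_start."""
    peak_step = round(total_steps * warmup_fraction)
    if step <= peak_step:
        progress = step / peak_step
        return lr_start + (lr_max - lr_start) * progress
    progress = (step - peak_step) / (total_steps - peak_step)
    return lr_max - (lr_max - lr_start) * progress

for step in (0, 150, 300, 650, 1000):
    print(step, round(one_cycle_learning_rate(step, total_steps=1000), 4))
```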

Momentum: A variation of stochastic gradient descent that helps accelerate the path through the loss landscape while keeping the overall direction controlled.

Recall that SGD can be noisy.

Momentum averages the changes in the path, smooths that path and accelerates the movement towards the goal.
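
The sketch below compares plain gradient descent with a momentum version on a toy loss shaped like a narrow valley. The loss, the starting point, the rate and beta = 0.9 are all assumptions for the demo. The velocity here is written as a decaying sum of past gradients, which is one of several common formulations.

```python
def gradient(w):
    """Gradient of the toy loss 0.5 * (w[0]**2 + 25 * w[1]**2), a narrow valley."""
    return [w[0], 25.0 * w[1]]

def loss(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def run(learning_rate=0.03, beta=0.0, steps=200):
    """Gradient descent with momentum; beta=0 reduces to plain gradient descent."""
    w = [5.0, 5.0]
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w)
        for i in range(2):
            velocity[i] = beta * velocity[i] + g[i]   # decaying sum of past gradients
            w[i] -= learning_rate * velocity[i]
    return loss(w)

plain_loss = run(beta=0.0)
momentum_loss = run(beta=0.9)
print(plain_loss, momentum_loss)
```

With the same learning rate and the same number of steps, the momentum run ends much closer to the bottom of the valley because it keeps building speed along the shallow direction.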

Adaptive learning rates: Methods that calculate and use different learning rates for different parameters of the network.

AdaGrad (Adaptive Gradient Algorithm): Connecting with the previous point, AdaGrad is a variation of SGD that instead of using a single learning rate for all the parameters, uses a different rate for each parameter.
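
A minimal sketch of one AdaGrad update shows where the per-parameter rates come from. The function name, the gradient values and the base rate are illustrative assumptions.

```python
import math

def adagrad_step(w, grad, cache, learning_rate=0.5, eps=1e-8):
    """One AdaGrad update; cache accumulates squared gradients per parameter."""
    for i in range(len(w)):
        cache[i] += grad[i] ** 2
        w[i] -= learning_rate * grad[i] / (math.sqrt(cache[i]) + eps)
    return w, cache

w = [5.0, 5.0]
cache = [0.0, 0.0]
grad = [0.1, 10.0]   # one small and one large gradient (hypothetical values)
w, cache = adagrad_step(w, grad, cache)

# Effective learning rate of each parameter after this step
rates = [0.5 / (math.sqrt(c) + 1e-8) for c in cache]
print(w, rates)
```

The parameter with the small gradient ends up with a much larger effective rate than the one with the large gradient, so both move by about the same amount on this first step.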

Root Mean Square Propagation (RMSProp): Like AdaGrad, RMSProp uses different learning rates for each parameter, and adapts those rates depending on a moving average of the recent magnitudes of each parameter’s gradients (this helps when dealing with noisy contexts).
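
Here is a hedged sketch of RMSProp on the same kind of narrow valley used to illustrate momentum. The toy loss, the rate of 0.05 and the decay of 0.9 are assumptions, and the function names are my own.

```python
import math

def rmsprop_step(w, grad, avg_sq, learning_rate=0.05, decay=0.9, eps=1e-8):
    """One RMSProp update; avg_sq is a moving average of squared gradients."""
    for i in range(len(w)):
        avg_sq[i] = decay * avg_sq[i] + (1.0 - decay) * grad[i] ** 2
        w[i] -= learning_rate * grad[i] / (math.sqrt(avg_sq[i]) + eps)
    return w, avg_sq

def loss(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

w = [5.0, 5.0]
avg_sq = [0.0, 0.0]
initial_loss = loss(w)
for _ in range(300):
    grad = [w[0], 25.0 * w[1]]   # gradient of the toy loss
    w, avg_sq = rmsprop_step(w, grad, avg_sq)
final_loss = loss(w)
print(initial_loss, final_loss)
```

Because each gradient is divided by its own running magnitude, the steep and the shallow directions advance at a similar pace even though their raw gradients differ by a factor of 25.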

Adam: It combines some aspects of RMSprop and SGD with momentum.

Like RMSprop, it uses squared gradients to scale the learning rate, and it also uses the average of the gradient to make use of momentum.
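
The two ingredients can be seen side by side in a single update step. This is a sketch of the usual Adam update with bias correction; the function name, the learning rate and the gradient values are assumptions, while beta1 = 0.9 and beta2 = 0.999 are the commonly used defaults.

```python
import math

def adam_step(w, grad, m, v, t, learning_rate=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t starts at 1); m and v are running averages of grad and grad**2."""
    for i in range(len(w)):
        m[i] = beta1 * m[i] + (1.0 - beta1) * grad[i]         # momentum-style average
        v[i] = beta2 * v[i] + (1.0 - beta2) * grad[i] ** 2    # RMSprop-style scaling
        m_hat = m[i] / (1.0 - beta1 ** t)                     # bias correction
        v_hat = v[i] / (1.0 - beta2 ** t)
        w[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w = [5.0, 5.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
grad = [5.0, 125.0]   # hypothetical gradients of very different sizes
w, m, v = adam_step(w, grad, m, v, t=1)
print(w)
```

On the very first step both parameters move by roughly the learning rate, regardless of how different their gradients are.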

If you are new to all these names, don’t get overwhelmed.

Behind most of them are the very same roots: back-propagation and gradient descent.

Also, a lot of these methods are selected automatically for you within modern frameworks such as the fast.ai library.

It is nevertheless really useful to understand how they work, as you are then in a better position to make your own decisions and even to research and test different variations and options.

Understanding means more options

When we understand the core of the network, the basic back-propagation algorithm and the basic gradient descent process, we have more options to explore and experiment whenever we face hard challenges.

Because we understand the process, we realize for example that in deep learning, the initial place where we drop the ball within the loss landscape is key.

Some initial positions will soon push the ball (the training process) to get stuck in some part of the landscape.

Others will quickly drive us to a good minimum.

When the mystery function becomes more complex, it is the time to incorporate some of the advanced solutions I mentioned earlier.

It is also time to study in more depth the architecture of the entire network and to go deeper into the different hyper-parameters.

Navigating the landscape

The shape of our loss landscape is very much influenced by the design of the architecture of our networks as well as hyper-parameters like the learning rate, the size of our batches, the optimizer algorithm we use, etc.

For a discussion about those influences, check the paper: Visualizing the Loss Landscape of Neural Nets by Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein.

A very interesting point coming out of recent research is how skip connections in neural nets can smooth our loss landscape and make it dramatically simpler and more convex, increasing our chances to converge to a good result.

Skip connections have helped a lot to train very deep networks.

Basically, skip connections are extra connections that link nodes of separate layers, skipping one or more non-linear layers in between.
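
A small NumPy sketch shows the mechanics. The block below follows the common residual pattern of adding the input back before the final activation. The function names, the layer size and the all-zero weights are assumptions chosen to make the effect obvious.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def plain_block(x, w1, w2):
    """Two stacked layers without a skip connection."""
    return relu(w2 @ relu(w1 @ x))

def residual_block(x, w1, w2):
    """Same two layers, plus the input added back (the skip connection)."""
    return relu(w2 @ relu(w1 @ x) + x)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))
w1 = np.zeros((4, 4))   # weights at zero: the layers contribute nothing
w2 = np.zeros((4, 4))

print(plain_block(x, w1, w2).ravel())     # all zeros, the signal is lost
print(residual_block(x, w1, w2).ravel())  # relu(x), the input still flows through
```

Even when the layers in between contribute nothing, the skip connection keeps a direct path open for the signal.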

As we experiment with different architectures and parameters, we are modifying our loss landscape, making it more rugged or smooth, increasing or decreasing the number of local optima.

And as we optimize the way we initialize the parameters of the network, we are improving our starting position.

Let’s keep on exploring new ways to navigate the loss landscapes of the most fascinating challenges in the world.

This article covered the basics and from here, the sky is the limit!

Links to the 3 parts of this article: Part 1 | Part 2 | Part 3

GitHub repository with all the code of this project: javismiles/Deep-Learning-predicting-breast-cancer-tumor-malignancy (github.com). Predicting Cancer Malignancy with a 2 layer neural network coded from scratch in Python.