Why Data Should Be Normalized Before Training a Neural Network

And Why Tanh Generally Performs Better Than Sigmoid

Timo Stöttner · May 16

Among the best practices for training a Neural Network is to normalize your data to obtain a mean close to 0.

Normalizing the data generally speeds up learning and leads to faster convergence.

Also, the (logistic) sigmoid function is hardly ever used anymore as an activation function in hidden layers of Neural Networks, because the tanh function (among others) seems to be strictly superior.

While this might not be immediately evident, there are very similar reasons for why this is the case.

The tanh function is quite similar to the logistic sigmoid.

The main difference, however, is that the tanh function outputs results between -1 and 1, while the sigmoid function outputs values that are between 0 and 1 — therefore they are always positive.

I could hardly find any articles explaining why this speeds up training.

The explanations I could find were either too superficial, hardly understandable without more context, or simply wrong.

So I decided to dig deeper and write this article on what I found.

We’ll first have a look at sigmoid and tanh, and we’ll then discuss normalization based on what we’ve found.

Setting the Scene: Tanh and the Logistic Sigmoid

Obviously there are many more activation functions used in neural networks than tanh and sigmoid, but for now we’ll only have a look at the differences between the two.

(Note that the tanh function is, strictly speaking, also a sigmoid function, but in the context of neural networks the ‘sigmoid’ function usually refers to the logistic sigmoid, so I’ll follow that convention here.)

Let’s quickly have a look at the two activation functions and their derivatives to get the basics straight.

As you can see from its plot, the tanh function is centered around 0 and its values range from -1 to 1. It can be represented as

tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)

The logistic sigmoid, on the other hand, ranges from 0 to 1 and its values are therefore always positive. It can be written as

σ(x) = 1 / (1 + e⁻ˣ)

If you compare the derivatives of the two, you can see that the derivative of the tanh function, tanh′(x) = 1 − tanh²(x), tends to be much larger than the sigmoid’s derivative, σ′(x) = σ(x)(1 − σ(x)). This will become relevant later when we have a look at what happens during gradient descent.
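To make the gap concrete, here is a small NumPy sketch (my own illustration, not the article’s code) that evaluates both derivatives on a grid. The derivative of tanh peaks at 1 at x = 0, while the sigmoid’s derivative peaks at only 0.25:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # σ'(x) = σ(x)(1 − σ(x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # tanh'(x) = 1 − tanh²(x)

xs = np.linspace(-5.0, 5.0, 1001)  # grid includes x = 0
print(d_tanh(xs).max())     # 1.0  (at x = 0)
print(d_sigmoid(xs).max())  # 0.25 (at x = 0)
```

So at its steepest point, tanh propagates gradients four times as large as the sigmoid does.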

Tanh and Sigmoid as Activation Functions

In a neural network, the outputs of the nodes in one layer are used as the inputs for the nodes in the next layer.

Therefore, the activation function determines the range of the inputs to the nodes in the following layer.

If you use sigmoid as an activation function, the inputs to the nodes in the following layer will all range between 0 and 1.

If you use tanh as an activation function, they will range between -1 and 1.
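You can verify these output ranges numerically with a small NumPy check (a sketch of my own, not from the article’s code):

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 1001)

sigmoid_out = 1.0 / (1.0 + np.exp(-x))
tanh_out = np.tanh(x)

# Sigmoid outputs stay strictly within (0, 1): always positive.
print(sigmoid_out.min() > 0.0, sigmoid_out.max() < 1.0)  # True True
# Tanh outputs span (-1, 1): both signs occur.
print(tanh_out.min() < 0.0, tanh_out.max() > 0.0)        # True True
```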

Now, let’s consider a neural network that is used for binary classification.

It has a bunch of hidden layers and one node in the output layer with a sigmoid activation function.

For our discussion we are not interested in what happens in all those hidden layers for now; we are only interested in what happens in the output layer.

So let’s just treat everything else as a black box: a neural network with one output node, with the rest of the network treated as a black box. Depending on the activation functions we use in the last hidden layer, the input to our node in the output layer will vary.

Since we use a sigmoid function in the output layer, this last part of the network is basically a logistic regression.

The node receives some inputs from the previous layer, multiplies them with some weights and applies the logistic sigmoid to the result.

To understand why tanh tends to be better than sigmoid, we need to have a look at what happens during gradient descent.

Because our output node basically performs a logistic regression, we can simplify things a little by looking at gradient descent for logistic regressions.

What Happens During Gradient Descent

For binary classification, we typically use binary cross-entropy as the loss function:

L = −[y log(a) + (1 − y) log(1 − a)]

where a is the predicted output of our model for the specific training instance and y is the true class label.

For logistic regression (and therefore also for our output layer from the example above), the derivative of the loss function L with respect to a weight wᵢ is equal to

∂L/∂wᵢ = xᵢ(a − y)

where a is the predicted output, y is the true class label and xᵢ is the input feature corresponding to the weight wᵢ.
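For completeness, this gradient follows from the chain rule, writing the prediction as a = σ(z) with z = Σⱼ wⱼxⱼ + b:

```latex
\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a} = \frac{a-y}{a(1-a)},
\qquad
\frac{\partial a}{\partial z} = a(1-a),
\qquad
\frac{\partial z}{\partial w_i} = x_i
```

Multiplying the three factors, the a(1 − a) terms cancel and we are left with ∂L/∂wᵢ = xᵢ(a − y).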

For every weight wᵢ, the second factor of the partial derivative, (a − y), is the same.

The differences between the gradients of the different weights solely depend on the inputs xᵢ.

If the inputs are all of the same sign, the gradients will also be of the same sign.
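A tiny NumPy illustration of this sign-sharing effect, using made-up input values:

```python
import numpy as np

# Hypothetical inputs to the output node from the previous hidden layer.
x_from_sigmoid = np.array([0.2, 0.7, 0.9, 0.4])  # sigmoid outputs: all positive
x_from_tanh = np.array([-0.6, 0.7, -0.1, 0.4])   # tanh outputs: mixed signs

a, y = 0.8, 1.0  # predicted probability and true class label

# dL/dw_i = x_i * (a - y): the (a - y) factor is shared by all weights.
grad_sigmoid = x_from_sigmoid * (a - y)
grad_tanh = x_from_tanh * (a - y)

print(np.sign(grad_sigmoid))  # [-1. -1. -1. -1.] -> all weights move together
print(np.sign(grad_tanh))     # [ 1. -1.  1. -1.] -> weights can move independently
```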

Therefore, when sigmoid is the activation function of the nodes in the previous layer, the weights of the node can only all increase or all decrease together in a single step of gradient descent.

It is simply not possible for some weights of the node to increase while some others decrease.

If the weight vector needs to change direction, it can only do so by “zigzagging”: Adding and removing different amounts to the weights until the change in direction is completed.

This is highly inefficient.

(We will have a look at a plot illustrating this later.)

For tanh, on the other hand, the sign of our inputs xᵢ can vary — some will be below zero and others will be larger than 0.

Therefore, the directions of the updates are independent of one another.

This lets the weight vector change direction more easily.

In case this is all still quite abstract, let’s have a look at a concrete example and some visualizations of what’s going on.

The complete code related to this article can be found on GitHub.

Looking at Some Data: Is Tanh Really Better?

We’re going to construct two little networks that are exactly the same — the only difference is that one is going to use tanh as the activation function in the hidden layer and the other one is going to use sigmoid activations.

Then we compare the results and take a detailed look at what happens to the weights in the output layers.

We’ll train the networks on some randomly generated dummy data: X contains random values between -0.5 and +0.5. Its rows constitute the training examples and the columns their feature values. Y contains the class labels, which are 1 if a record’s mean is greater than 0 and 0 otherwise.
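A sketch of how such data could be generated with NumPy (the sample and feature counts are my assumptions; the article doesn’t state them):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 10  # assumed sizes

# Rows are training examples, columns are feature values in [-0.5, 0.5].
X = rng.uniform(-0.5, 0.5, size=(n_samples, n_features))

# Label is 1 if a record's mean is greater than 0, and 0 otherwise.
Y = (X.mean(axis=1) > 0).astype(float)
```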

First, let’s check if tanh actually does perform better than sigmoid.

Maybe all that superiority of tanh is just a well-established rumor.

We’ll define two simple models with one hidden layer consisting of 4 nodes and one output layer consisting of 1 node.

Their only difference is the activation function used in the hidden layer.

Let’s train them on the generated data for a few epochs to see how they perform. The tanh model learned much more quickly.

After 50 epochs, the loss of the tanh network is less than a third of the sigmoid model’s loss.
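As a stand-in for the article’s code on GitHub, here is a self-contained NumPy sketch of the comparison; the 10-4-1 layer sizes match the setup above, but the full-batch training, initialization and learning rate are my assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(hidden_act, hidden_act_grad, epochs=50, lr=1.0, seed=0):
    """Train a 10-4-1 network with binary cross-entropy; return the final loss.

    hidden_act_grad takes the *activation output* h, not the pre-activation.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(-0.5, 0.5, size=(1000, 10))
    y = (X.mean(axis=1) > 0).astype(float).reshape(-1, 1)

    W1 = rng.normal(0.0, 0.5, size=(10, 4)); b1 = np.zeros((1, 4))
    W2 = rng.normal(0.0, 0.5, size=(4, 1));  b2 = np.zeros((1, 1))

    for _ in range(epochs):
        # Forward pass.
        h = hidden_act(X @ W1 + b1)
        a = sigmoid(h @ W2 + b2)
        # Backward pass: BCE with a sigmoid output yields the (a - y) term.
        d2 = (a - y) / len(X)
        d1 = (d2 @ W2.T) * hidden_act_grad(h)
        W2 -= lr * (h.T @ d2); b2 -= lr * d2.sum(axis=0, keepdims=True)
        W1 -= lr * (X.T @ d1); b1 -= lr * d1.sum(axis=0, keepdims=True)

    # Recompute the loss with the final weights.
    a = sigmoid(hidden_act(X @ W1 + b1) @ W2 + b2)
    eps = 1e-12
    return float(-np.mean(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps)))

loss_tanh = train(np.tanh, lambda h: 1.0 - h ** 2)
loss_sigmoid = train(sigmoid, lambda h: h * (1.0 - h))
print(loss_tanh, loss_sigmoid)
```

With these settings the tanh network typically ends up with the clearly lower loss; the exact numbers depend on the initialization and learning rate.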

However, due to the fact that the derivatives of the sigmoid and tanh functions differ quite a lot in their range, this is a bit of an unfair comparison.

We used the same learning rate in both cases.

In each step of gradient descent, each weight gets updated according to

wᵢ ← wᵢ − η · ∂L/∂wᵢ

where η is the learning rate.

If the gradients are larger, the updates will be larger as well and the network will learn faster — as long as the updates don’t get too large.

So we might want to account for that by choosing a larger learning rate for the sigmoid network.

To get an understanding of the effect the learning rates have, let’s train both networks for a few epochs with different learning rates each time and compare the results.

When we plot the loss after 10 epochs of training for different learning rates, the result isn’t as clear-cut anymore, but tanh still performs better in general.

When the learning rate exceeds 6, the updates in the tanh network obviously get too large, so they overshoot the minimum.

Considering that tanh’s gradient tends to be much larger than the sigmoid’s gradient, it makes a lot of sense that the sigmoid network still performs decently while the updates for the tanh network get too large.

However, as you can see in the graph, the best loss obtained after 10 epochs with the tanh network is significantly smaller than the best loss obtained with the sigmoid network.

For learning rates above 5, the results of the sigmoid model also start to fluctuate, indicating that the learning rate is getting too large for the sigmoid network as well.

After establishing that tanh indeed seems to be better, let’s have a closer look at why this is the case.

Plotting the Weights of the Output Layer

For some intuition about why tanh performs better than sigmoid, let’s have a look at the individual weights of the output layers.

Remember, the output layer only receives values between 0 and 1 if we use a sigmoid activation function in the hidden layer.

Specifically, we are interested in how every individual update of gradient descent affects the weights of the output layer.

When plotting the weights of the output layer after each individual step of gradient descent (with sigmoid activations in the hidden layer), you can see that the weights always change in the same direction.

They either all decrease or they all increase during one step of gradient descent, resulting in a “zigzag” movement.

The magnitude of the change differs, but the sign of the change is the same.

If the weight vector needs to change its direction, e.g. if the lowest weight needs to become the highest weight, it can only do so by zigzagging up and down for quite some time.

If you plot the same weights of the output layer after each step of gradient descent for our tanh network, the weight updates appear completely independent from one another.

This makes the learning more flexible.

If the direction of the weight vector needs to change, gradient descent doesn’t need to zigzag up and down like with the sigmoid activation function; it simply updates the individual weights until you have the direction you need.

How does this all relate to Normalization?

Following from the discussion above, one reason why normalization helps should be quite clear: if you have a feature that is all positive or all negative, this will make learning harder for the nodes in the layer that follows.

They will have to zigzag like the ones following a sigmoid activation function.

If you transform your data so it has a mean close to zero, you will thereby make sure that there are both positive values and negative ones.

The second reason why normalization helps is connected to the scale of the inputs.

Normalization ensures that the magnitudes of the values a feature assumes are more or less the same.
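A typical way to achieve both properties is to standardize each feature to zero mean and unit variance. A minimal NumPy sketch with made-up values:

```python
import numpy as np

def standardize(X):
    """Scale each column (feature) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

# Two features on wildly different scales (made-up values).
X = np.array([[100.0, 0.001],
              [200.0, 0.004],
              [300.0, 0.002]])
X_norm = standardize(X)

print(X_norm.mean(axis=0))  # ~[0, 0]: both positive and negative values appear
print(X_norm.std(axis=0))   # [1, 1]: comparable magnitudes across features
```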

Recall that our steps during gradient descent, and therefore the speed of learning in nodes with logistic activation functions, depend on

∂L/∂wᵢ = xᵢ(a − y)

where xᵢ is the i-th input to the node.

The larger xᵢ, the larger the updates and vice versa.

The speed of learning is proportional to the magnitude of the inputs.

(For tanh activation functions, the gradient will be slightly different, but it will still depend on the inputs in a similar manner.)

If the inputs are of different scales, the weights connected to some inputs will be updated much faster than others.

This generally hurts the learning process — unless we know in advance what features are more important than others, in which case we can adjust the scales to have our neural network focus its learning on the more important ones.

But in practice it is quite unlikely that we can predict how this will benefit learning in advance.

To summarize, normalization helps because it ensures (a) that there are both positive and negative values used as inputs for the next layer which makes learning more flexible and (b) that the network’s learning regards all input features to a similar extent.

Also, due to the fact that the sigmoid activation outputs only positive values, which stifles learning, you should generally prefer other activation functions in the hidden layer.

(Using sigmoid activations in the output layer is completely fine, of course.)

I hope this gave you a better understanding of why you should normalize data for a neural network and why tanh is generally a superior activation function to sigmoid. If you have any feedback or questions, let me know in the comments below!