If you think about a generic loss function with only one weight, its graph is a curve with a minimum somewhere along the w axis.
You want to minimize the loss, so ideally you want your current w to slide towards that minimum.
The procedure is the following update rule:
$$w_{new} = w - \frac{\partial L}{\partial w}$$
where the first term is your current weight and the second term is the gradient of your loss function L (in this one-dimensional case, it is simply the first derivative of the loss with respect to your only weight).
Remember that the first derivative is negative when the slope of the tangent line is negative; that is why we put a minus sign between the two terms (intuition: if the slope is negative, the weight should move to the right, as in the example).
This optimization procedure is called Gradient Descent.
Now let’s add a new term to the formula:
$$w_{new} = w - \gamma \frac{\partial L}{\partial w}$$
This gamma is our learning rate, and it tells the algorithm how strongly the gradient should affect the weight.
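The problem with a small gamma is that the NN will converge very slowly (if it converges at all); this is sometimes loosely associated with the so-called ‘Vanishing Gradient’ problem, although strictly speaking that term describes gradients that shrink as they are propagated back through many layers.
On the other hand, if gamma is very big, the risk is overshooting the minimum and diverging, a behaviour reminiscent of the ‘Exploding Gradient’ scenario.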
A good strategy might be starting with a value around 0.1 and then reducing it exponentially: at some point, the value of the loss function starts decreasing in the first few iterations, and that’s the signal that the weight is moving in the right direction.
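To make the effect of the learning rate concrete, here is a minimal sketch. It assumes a hypothetical one-weight loss L(w) = (w - 3)^2, which is not taken from the article, and the names `gradient_descent` and `decay` are purely illustrative. It compares a very small gamma, a moderate one, a too-large one, and a gamma that is reduced exponentially at each step.

```python
def grad(w):
    # First derivative of the hypothetical loss L(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)


def gradient_descent(w, gamma, steps, decay=1.0):
    """Apply w <- w - gamma * dL/dw, multiplying gamma by `decay` after each step."""
    for _ in range(steps):
        w = w - gamma * grad(w)
        gamma *= decay
    return w


# Same starting weight and number of steps, different learning rates.
w_small = gradient_descent(0.0, gamma=0.001, steps=50)   # barely moves away from 0
w_good = gradient_descent(0.0, gamma=0.1, steps=50)      # ends very close to the minimum at 3
w_big = gradient_descent(0.0, gamma=1.1, steps=50)       # overshoots further at every step
w_decay = gradient_descent(0.0, gamma=0.1, steps=50, decay=0.99)  # exponentially reduced gamma

print(w_small, w_good, w_big, w_decay)
```

On a real NN the loss surface is far less regular, so these numbers only illustrate the qualitative behaviour.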
· Momentum: it is a technique used during the backpropagation phase.
As said regarding the learning rate, parameters are updated so that they can converge towards the minimum of the loss function.
This process might take too long and hurt the efficiency of the algorithm.
Hence, one possible solution is keeping track of the previous directions (that is, the gradients of the loss function with respect to the weights) and carrying them along as embedded information: this is what momentum is designed for.
It increases the speed of convergence not through the learning rate (how much a weight is updated each time) but through an embedded memory of past updates (the algorithm knows that the previous direction of that weight was, let’s say, to the right, and it keeps moving in that direction during the next propagation).
We can visualize it if we consider the projection of a two-weight loss function (specifically, a paraboloid).
You can find the source for making these 3D graphs here.
As you can see, if we add the momentum hyperparameter, the descent is faster, since the model keeps track of the past gradient directions.
If you choose high values of momentum, the algorithm will rely heavily on the past directions: this might result in an incredibly fast learning algorithm, but the risk of missing some correct ‘deviations’ is high.
The suggestion is to always start with low values and then increase them little by little.
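The speed-up that momentum gives can be sketched on a two-weight paraboloid. The loss L(w1, w2) = 0.5 * (w1^2 + 10 * w2^2), the starting point and the values of `lr` and `beta` below are all assumptions chosen for illustration, and the update used is the classical "velocity" form of momentum, which may differ in detail from what a given library implements.

```python
def loss(w):
    # Hypothetical elongated paraboloid with its minimum at (0, 0).
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def grad(w):
    # Gradient of the paraboloid with respect to the two weights.
    return [w[0], 10.0 * w[1]]


def steps_to_converge(lr, beta, start=(5.0, 5.0), tol=1e-6, max_steps=10_000):
    """Count the updates needed before the loss first drops below `tol`."""
    w = list(start)
    v = [0.0, 0.0]  # velocity: the embedded memory of past gradient directions
    for step in range(1, max_steps + 1):
        g = grad(w)
        # beta = 0 gives plain gradient descent; beta > 0 keeps part of the old direction.
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
        if loss(w) < tol:
            return step
    return max_steps


plain_steps = steps_to_converge(lr=0.01, beta=0.0)
momentum_steps = steps_to_converge(lr=0.01, beta=0.9)
print(plain_steps, momentum_steps)
```

With the same learning rate, the run with momentum should need noticeably fewer updates, because the velocity keeps accumulating along the shallow w1 direction.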
· Activation function: it is the function through which we pass our weighted sum in order to obtain a meaningful output, such as a vector of probabilities or a 0–1 output.
The major activation functions are Sigmoid (for multiclass classification, a variant of this function called SoftMax is used: it returns a vector of probabilities that sum to one), Tanh and ReLU.
Note that activation functions can be placed at any point in the NN, as many times as you want.
However, you always have to think about efficiency and speed.
For instance, the ReLU function is very quick in terms of training, while the Sigmoid is more complex and takes more time.
Hence, a good practice might be using ReLU for hidden layers and then, in the last layer, inserting your Sigmoid.
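The activation functions mentioned above can be written in a few lines of plain Python. This is only a sketch operating on single numbers (or a short list, for SoftMax); real frameworks apply the same formulas to whole tensors.

```python
import math


def sigmoid(z):
    # Squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    # Squashes any real number into the interval (-1, 1).
    return math.tanh(z)


def relu(z):
    # Keeps positive values unchanged and sets negative values to zero.
    return max(0.0, z)


def softmax(zs):
    # Subtracting the maximum first avoids overflow in exp without changing the result.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0), tanh(0.0), relu(-2.0), softmax([1.0, 2.0, 3.0]))
```

ReLU only needs a comparison, while Sigmoid needs an exponential, which is one reason ReLU is cheaper in hidden layers.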
· Minibatch size: when you are facing billions of data points, it might be inefficient (as well as counterproductive) to feed your NN with all of them at once.
A good practice is feeding it with smaller samples of your data, called batches: by doing so, every time the algorithm trains itself, it trains on a sample whose size equals the batch size.
The typical size is 32 or higher; however, keep in mind that, if the size is too big, the risk is a model that generalizes poorly and won’t fit new data well.
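Splitting a dataset into minibatches can be sketched as follows. The helper name `minibatches` and the toy dataset of 100 integers are assumptions made for the example; shuffling before slicing is a common choice, not something the article prescribes.

```python
import random


def minibatches(data, batch_size, seed=0):
    """Yield shuffled batches of `batch_size` items; the last batch may be smaller."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


data = list(range(100))  # toy dataset with 100 samples
batches = list(minibatches(data, batch_size=32))
print([len(b) for b in batches])
```

Every sample ends up in exactly one batch, so one pass over `batches` corresponds to one pass over the whole dataset.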
· Epochs: this is how many times you want your algorithm to train on your whole dataset (note that epochs are different from iterations: the latter are the number of batches needed to complete one epoch).
Again, the number of epochs depends on the kind of data and task you are facing.
An idea could be imposing a condition such that training stops when the error is close to zero.
Or, more easily, you can start with a relatively low number of epochs and then increase it progressively, tracking some evaluation metrics (like accuracy).
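The difference between epochs and iterations, and the idea of stopping when the error is close to zero, can be sketched like this. The toy model y = w * x, the data, the tolerance `tol` and the function names are all hypothetical choices for the illustration.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    # Number of batches needed to see the whole dataset once.
    return math.ceil(n_samples / batch_size)


def train(xs, ys, lr=0.05, max_epochs=1000, tol=1e-6):
    """Fit y = w * x one sample at a time; stop as soon as the mean squared error < tol."""
    w = 0.0
    for epoch in range(1, max_epochs + 1):
        for x, y in zip(xs, ys):
            # Gradient of (w * x - y)^2 with respect to w is 2 * (w * x - y) * x.
            w -= lr * 2.0 * (w * x - y) * x
        mse = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        if mse < tol:
            return w, epoch  # error is close to zero: stop before max_epochs
    return w, max_epochs


w, epochs_used = train([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(w, epochs_used, iterations_per_epoch(100, 32))
```

In practice the stopping condition is usually checked on a separate validation set rather than on the training error.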
· Dropout: this technique consists of removing some nodes so that the NN is not too heavy.
This can be implemented during the training phase.
The idea is that we do not want our NN to be overwhelmed by information, especially if we consider that some nodes might be redundant and useless.
So, while building our algorithm, we can decide, at each training step, to keep each node with probability p (called ‘keep probability’) or drop it with probability 1-p (called ‘drop probability’).
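A minimal sketch of dropout on a list of node activations follows. Note one assumption that goes beyond the text: the surviving activations are divided by the keep probability (the so-called "inverted dropout" variant), so that their expected value stays the same. The function name and the toy activations are illustrative.

```python
import random


def dropout(activations, keep_prob, rng):
    """Keep each node with probability keep_prob, drop it (set it to 0) otherwise."""
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            # Assumption: rescale survivors so the expected activation is unchanged.
            out.append(a / keep_prob)
        else:
            out.append(0.0)
    return out


rng = random.Random(0)
dropped = dropout([1.0] * 10, keep_prob=0.8, rng=rng)
print(dropped)
```

Dropout is only applied while training; at prediction time all nodes are kept.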
Strategies

Strategies are approaches and best practices we might adopt to make our algorithm perform better.
Among these are the following:
· Parameter initialization: we talked about it in the first paragraph.
· Data normalization: while inspecting your data, you might notice that some features are represented on different scales.
This might hurt the performance of your NN, since convergence becomes slower.
Normalizing data means converting all of them to the same scale, within the range [0, 1].
You can also decide to standardize your data, which means rescaling them so that they have mean equal to 0 and standard deviation equal to 1 (note that this does not by itself make them normally distributed).
While data normalization happens before training your NN, another way to normalize your data is the so-called Batch Normalization: it happens directly during training, specifically after the weighted sum and before the activation function.
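The three rescaling ideas above can be sketched on a single feature. The function names are illustrative, and the `scale`, `shift` and `eps` parameters of `batch_norm` are assumptions following the usual formulation (in a real layer, scale and shift are learned); none of the functions guards against a constant feature, which would cause a division by zero in the first two.

```python
import math
import statistics


def min_max_normalize(values):
    # Maps the smallest value to 0 and the largest to 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    # Rescales to mean 0 and standard deviation 1.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def batch_norm(pre_activations, scale=1.0, shift=0.0, eps=1e-5):
    # Standardizes the weighted sums of one node across a batch, then rescales and shifts.
    mean = statistics.fmean(pre_activations)
    var = statistics.pvariance(pre_activations)
    return [scale * (z - mean) / math.sqrt(var + eps) + shift for z in pre_activations]


print(min_max_normalize([10.0, 20.0, 30.0]))
print(standardize([1.0, 2.0, 3.0, 4.0]))
print(batch_norm([1.0, 2.0, 3.0]))
```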
· Optimization algorithm: in the previous paragraph, I mentioned gradient descent as the optimization algorithm.
However, there are many variants of it: Stochastic Gradient Descent (it minimizes the loss following the gradient descent update, but at each iteration it uses a randomly selected training sample, which is why it is called stochastic), RMSProp (which differs from the previous one in that each parameter has its own adaptive learning rate) and the Adam optimizer (roughly, RMSProp plus momentum).
Of course, this is not the full list, yet it is sufficient to understand why the Adam optimizer is often a good default choice: it combines both ideas and exposes several hyperparameters you can tune to customize your NN.
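To see how the three update rules differ, here is a sketch for a single weight. It follows the commonly published formulas for RMSProp and Adam, but the function names, the `state` dictionaries, the hyperparameter values and the hypothetical loss L(w) = (w - 3)^2 are assumptions for illustration. For simplicity the gradient here is exact; in real SGD it would come from a randomly selected sample or minibatch.

```python
import math


def grad(w):
    # Gradient of the hypothetical loss L(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)


def sgd_step(w, g, lr=0.1):
    # Plain gradient step: the same learning rate for every parameter.
    return w - lr * g


def rmsprop_step(w, g, state, lr=0.01, rho=0.9, eps=1e-8):
    # Running average of squared gradients gives each parameter its own step size.
    state["sq"] = rho * state["sq"] + (1 - rho) * g * g
    return w - lr * g / (math.sqrt(state["sq"]) + eps)


def adam_step(w, g, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum-like average of gradients (m) plus RMSProp-like average of squares (v).
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction for early steps
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


w_sgd = w_rms = w_adam = 0.0
rms_state = {"sq": 0.0}
adam_state = {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(1000):
    w_sgd = sgd_step(w_sgd, grad(w_sgd))
    w_rms = rmsprop_step(w_rms, grad(w_rms), rms_state)
    w_adam = adam_step(w_adam, grad(w_adam), adam_state)

print(w_sgd, w_rms, w_adam)
```

All three should end up near the minimum at w = 3 on this toy problem; the differences between them matter much more on real, high-dimensional losses.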
· Regularization: this strategy is pivotal if you want to keep your model simple and avoid overfitting.
The idea is that regularization adds a penalty to the model if the weights are too large or too many.
Indeed, it adds to our loss function a new term which tends to increase (hence, the loss increases too) if the re-calibration procedure increases the weights.
There are two kinds of regularization: Lasso regularization (L1) and Ridge regularization (L2).
L1 regularization tends to shrink weights to exactly zero, with the risk of getting rid of some inputs (since they will be multiplied by a null value), whereas L2 might shrink weights to very low values, but not to zero (hence inputs are preserved).
It is interesting to note that this concept is strongly related to the Information Criteria used in time series analysis.
Indeed, while maximizing the likelihood function of an autoregressive model, we might run into the same problem of overfitting, since this procedure tends to favor a larger number of parameters: that’s why it is good practice to add a penalty that grows with the number of parameters.
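The L1 and L2 penalties described above are usually written as lambda times the sum of the absolute weights, and lambda times the sum of the squared weights, respectively. A minimal sketch, where the strength `lam`, the function names and the example weights are assumptions for illustration:

```python
def l1_penalty(weights, lam):
    # Lasso: lambda times the sum of absolute values of the weights.
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    # Ridge: lambda times the sum of squared weights.
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, kind="l2"):
    # The penalty is simply added to the loss computed on the data.
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return data_loss + penalty


weights = [1.0, -2.0, 3.0]
print(regularized_loss(1.0, weights, lam=0.1, kind="l1"))
print(regularized_loss(1.0, weights, lam=0.1, kind="l2"))
```

Larger weights make both penalties grow, so the optimizer is pushed towards smaller weights unless the data loss improves enough to compensate.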
Conclusion

This article is not meant to be an exhaustive list of all the characteristic elements of a NN, but it is important to understand the key differences among them and the key ideas behind initialization and tuning.