# Demystifying Maths of Gradient Boosting

This is where the idea of boosting starts to manifest: going from a high-bias model to a low-bias model.

We start with a constant function (no other function can have more bias than a constant function, unless the dataset is so boring that even a constant model fits it) and then proceed through a number of steps to find a function that has reasonably low bias.

In some texts, you may find the model initialised with 0 (zero).

That is also a constant function, but quite possibly we can start with a slightly better option.

Let’s say the initial model is a constant function γ.

The loss function for a constant function is then:

$$L(\gamma) = \sum_{i=1}^{n} L(y_i, \gamma)$$

Therefore, γ_optimal is determined by solving the following optimization problem:

$$F_0(x) = \gamma_{optimal} = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L(y_i, \gamma)$$

This is guaranteed to be at least as good as a model initialised with zero.

Particularly for squared error loss, F_0(x) is equal to the mean of the actual y-values, i.e., F_0(x) = y_mean when squared loss is used.
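As a quick check of this claim, here is a minimal Python sketch (the y-values are made-up toy data) showing that under squared loss the best constant is the mean:

```python
# Minimal sketch (assumed toy data): for squared loss, the constant
# gamma that minimizes sum((y_i - gamma)^2) is the mean of the y-values.

def initial_model(y):
    """F_0(x): the optimal constant under squared loss is the mean."""
    return sum(y) / len(y)

y = [3.0, 5.0, 4.0, 8.0]
gamma = initial_model(y)           # F_0(x) = y_mean = 5.0

# Sanity check: the loss at the mean is no worse than at nearby constants.
loss = lambda g: sum((yi - g) ** 2 for yi in y)
assert loss(gamma) <= loss(gamma + 0.1) and loss(gamma) <= loss(gamma - 0.1)
```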

Often in texts, this model is not counted as one of the base learners, although the final model is an additive model obtained from the base learners and this constant model is also one of the addends.

There is one difference, though: all the models that are called base learners in the literature are fitted on the pseudo-residuals, not on the y-values of the actual dataset (we shall see how this is done in step 2).

Since F_0(x) is fitted on the y-values, that is probably why it is not considered one of the base learners in the literature.

At this point F_M(x) = F_0(x).

F_M(x), however, is subject to updates until all the M base learners have been added to it with suitable weights.

Step 2: This step is repeated for each base learner from m = 1 to m = M.

Note that gradient boosting adds one model at a time, one after the other.

The gradient boosting algorithm has to be configured with a suitable value of the hyperparameter M (the number of base learners) before execution.

Step 2.1. Compute pseudo-residuals

The pseudo-residual is computed for each ith training example:

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$

Note that F_{m-1}(x) is the model obtained by adding the m−1 weighted base learners and the initial constant function.

The mth base learner has not yet been added.

Each residual r_im for the ith training example corresponding to the current base learner m (which will be trained and added in step 2.2) is computed using the weighted sum of base learners 1 to m−1 plus the initial constant function (step 1).

Recall that after step 1, F_0(x) = γ does not include any term corresponding to any base learner (recall that F_0(x) is most often not called a base learner in the literature; it is treated only as an initial value of the model).

At this point, we have computed the pseudo-residual values for each training example.
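For concreteness, a small sketch of step 2.1 under squared loss (toy values assumed): with L = ½(y − F)², the negative derivative w.r.t. F(x_i) is simply y_i − F_{m−1}(x_i):

```python
# Sketch of step 2.1 for squared loss L = 0.5 * (y - F)^2, where the
# negative gradient w.r.t. F(x_i) is simply y_i - F_{m-1}(x_i).

def pseudo_residuals(y, F_prev):
    """r_im = -dL/dF evaluated at F_{m-1}(x_i), for each example i."""
    return [yi - fi for yi, fi in zip(y, F_prev)]

y = [3.0, 5.0, 4.0, 8.0]
F_prev = [5.0, 5.0, 5.0, 5.0]      # F_0(x) = y_mean from step 1
r = pseudo_residuals(y, F_prev)    # [-2.0, 0.0, -1.0, 3.0]
```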

Step 2.2. Fit a base learner on the pseudo-residuals

For this step, a new dataset is derived from the given dataset. The dataset D_modified is defined as shown in figure 3.

Fig. 3. Modified dataset

A base learner h_m(x) is trained and fitted on this dataset.
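As an illustration of fitting h_m(x) on D_modified, here is a hypothetical sketch that uses a one-split decision stump as the base learner (any regression learner would do; the x and residual values are made up):

```python
# Hypothetical step 2.2: fit a base learner on D_modified = {(x_i, r_im)}.
# A one-split decision stump stands in for any regression base learner.

def fit_stump(x, r):
    """Return h_m(x): the single-threshold stump minimizing squared error."""
    best = None
    for t in x:                                  # candidate thresholds
        left  = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm      # h_m(x)

x = [1.0, 2.0, 3.0, 4.0]
r = [-2.0, 0.0, -1.0, 3.0]                       # pseudo-residuals
h_m = fit_stump(x, r)
```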

At this point, we have everything required to determine F_m(x).

We do it as follows:

$$F_m(x) = F_{m-1}(x) + \gamma\, h_m(x)$$

Why does doing this make sense? It is straightforward to see if we compare this equation with the weight updates done in gradient descent (figure 4).

Weights in gradient descent are moved in the direction in which loss L decreases.

How much the weights should be moved is governed by α (the learning rate).

Similar to that, the function h_m(x) is fitted to the negative rate of change of the loss L w.r.t. F_m-1(x).

The function h_m(x) (which is expected to suitably approximate the behaviour of the derivative of the loss w.r.t. F_m-1(x)) represents the direction in which the loss function decreases w.r.t. F_m-1(x).

γ corresponds to the hyperparameter α in terms of utility (both determine by what amount the update should be made).

This is similar to the weight update equation in gradient descent, except that γ is trainable while α is a hyperparameter. We shall see how an optimal value of γ is learned in step 2.3.
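To make the analogy concrete, here is a toy gradient descent update on a single weight (the quadratic loss and values are assumed for illustration), mirroring the w ← w − α·∂L/∂w rule of figure 4:

```python
# Toy sketch of the analogy (assumed quadratic loss): gradient descent
# moves a weight w opposite the gradient, just as gradient boosting
# moves F_{m-1} along h_m, scaled by gamma.

def gd_step(w, grad, alpha):
    """Weight update in gradient descent: w <- w - alpha * dL/dw."""
    return w - alpha * grad

# Example: L(w) = (w - 4)^2, so dL/dw = 2 * (w - 4).
w, alpha = 0.0, 0.1
for _ in range(50):
    w = gd_step(w, 2 * (w - 4), alpha)
# w converges towards 4, the minimizer of L.
```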

Fig. 4. Weight update in gradient descent

Note that γ is the only parameter w.r.t. which F_m(x) needs to be optimized on the original dataset.

However, the base learners learn more parameters on the dataset D_modified, since they are the actual functions that are added together to give the final model.

At this point, we have a model F_m(x) which will be used in the next step to compute y_pred values.

Step 2.3. Find the optimum γ

We shall take the original dataset D (figure 5).

Fig. 5. Original dataset

We take the model F_m(x) on the original dataset D and compute the loss L. Note that this loss is a function of γ. We are supposed to find γ_optimum. This can be done by solving the following optimization problem, which minimizes the loss:

$$\gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L\big(y_i,\, F_{m-1}(x_i) + \gamma\, h_m(x_i)\big)$$

At this point we have the optimum value of γ.
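For squared loss, this one-dimensional problem even has a closed form: setting the derivative w.r.t. γ to zero gives γ = Σ r_i·h_i / Σ h_i², with r_i = y_i − F_{m−1}(x_i). A sketch (toy values assumed):

```python
# Step 2.3 sketch for squared loss: minimizing
# sum_i (y_i - F_{m-1}(x_i) - gamma * h_m(x_i))^2 over gamma gives
# gamma = sum(r_i * h_i) / sum(h_i^2), where r_i = y_i - F_{m-1}(x_i).

def optimal_gamma(y, F_prev, h_vals):
    num = sum((yi - fi) * hi for yi, fi, hi in zip(y, F_prev, h_vals))
    den = sum(hi * hi for hi in h_vals)
    return num / den

y      = [3.0, 5.0, 4.0, 8.0]
F_prev = [5.0, 5.0, 5.0, 5.0]      # F_0 from step 1
h_vals = [-1.0, -1.0, -1.0, 3.0]   # h_m(x_i) from step 2.2
gamma_m = optimal_gamma(y, F_prev, h_vals)
```

For a general loss, the same one-dimensional minimization can be done numerically instead.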

Step 2.4. Model update

The model F_m(x) is thus obtained as:

$$F_m(x) = F_{m-1}(x) + \gamma_m\, h_m(x)$$

At the end of each iteration m, F_M(x) is updated to the value of F_m(x).

Step 3.

Step 2 is run for each base learner m = 1 to M. After M iterations of step 2, we obtain the final model F_M(x).
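Putting steps 1 through 3 together, here is a self-contained sketch of the whole algorithm (toy data, squared loss, decision-stump base learners; all names are illustrative, not a library API):

```python
# End-to-end sketch (assumed toy data, squared loss, stump base learners)
# tying steps 1-3 together. Names are illustrative, not a library API.

def fit_stump(x, r):
    """One-split regression stump fitted to pseudo-residuals (step 2.2)."""
    best = None
    for t in x:
        left  = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def predict(xi, gamma0, learners):
    """F_M(x): initial constant plus the weighted base learners."""
    return gamma0 + sum(g * h(xi) for g, h in learners)

def gradient_boost(x, y, M):
    gamma0 = sum(y) / len(y)                          # step 1: F_0 = y_mean
    learners = []                                     # [(gamma_m, h_m), ...]
    for _ in range(M):
        F_prev = [predict(xi, gamma0, learners) for xi in x]
        r = [yi - fi for yi, fi in zip(y, F_prev)]    # step 2.1
        h = fit_stump(x, r)                           # step 2.2
        hv = [h(xi) for xi in x]
        gamma = (sum(ri * hi for ri, hi in zip(r, hv))
                 / sum(hi * hi for hi in hv))         # step 2.3 (squared loss)
        learners.append((gamma, h))                   # step 2.4
    return gamma0, learners

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 4.0, 8.0]
g0, ls = gradient_boost(x, y, M=3)
preds = [predict(xi, g0, ls) for xi in x]
```

Each added stump fits the residuals the previous model left behind, so the training loss shrinks with every iteration.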

Why the name ‘Gradient Boosting’?

We just saw the role ‘gradient’ plays in this algorithm: we always fit a base learner to the negative gradient of the loss function w.r.t. the model F_m-1(x).

The term ‘boosting’ refers to the fact that a high-bias model, which performs poorly on the dataset, is boosted to finally become a reasonable, and possibly a strong, classifier.

In general, boosting is a family of algorithms in which a number of ‘weak classifier’ base learners (each with an error rate just under 0.5) are combined to give a strong classifier (one with an error rate close to 0).

In gradient boosting, we start with a constant model (which may be an extremely weak classifier, depending on the dataset).

At the end of the mth iteration of learning a base learner, we still have a weak learner, but a relatively better classifier F_m(x) that progressively moves towards becoming a strong classifier.