Tree-Based Methods: ClassificationKushal ValaBlockedUnblockFollowFollowingMar 22The article is based on the Classification task by Decision Tree Algorithm, which is used more predominantly.

It briefs about various methods to split the node and also to improve the performance of the model using various techniques.

Picture Courtesy: PinterestA classification tree is very similar to the regression tree, except that the output is qualitative rather than quantitative.

Recall that in a regression setting, the predicted response for observation is given by the mean response of training observations that belong to the same terminal node.

In contrast, for classification, we predict the most commonly occurring class in the node.

The task of growing a decision tree in classification is quite similar to the task of growing a regression tree, we use recursive binary splitting for classification too.

In regression, we used RSS (Residual Sum of Squares) as the Impurity Measure to choose the terminal node and splitting point.

A natural alternative to RSS is Classification Error RateIn classification, we are interested in knowing the class proportion in a particular region.

Figure-1: Proportion of ClassesAbove equation formulates the proportion of class k in node m.

We classify the test case into the majority of the proportional class in that particular node.

Now based on this proportion, there are certain Impurity Measures namely Misclassification Error, Gini-Index, and Entropy.

Figure-2: Impurity Measures of ClassificationAll three measures are similar, but Gini-Index and Cross-Entropy are differentiable, which is an added advantage in case of Optimization Problem.

In addition, Gini-Index and Cross-Entropy are more sensitive to changes in node probabilities.

In the following code, I have done classification on Heart Data-set, which has features which determine the patient is likely to have heart disease or not.

Code-1: R-Implementation of Decision Tree.

The following code results in the output of a large tree (unpruned).

Figure-3: Output of Code-1Decision Tree results in overfitted models which are then pruned using Cost-Complexity Method.

But there are other powerful methods to boost the working of the tree-based model.

Bagging ( Bootstrap Aggregation )The decision tree has a high variance.

The standard approach to reduce the variance is to train the decision tree with independent sampled data from population set, so as to reduce the variance of the model.

But in a real-world problem, there is a limitation to get access to a huge amount of data.

So that's where Bootstrap Aggregation or Bagging method comes into the picture.

We obtain distinct datasets from the population by repeatedly sampling observations from the original data-set.

The sampling is performed with replacement and observation can occur more than once in the bootstrap dataset.

Figure-4: Bootstrapping the datasetTo apply Bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets, and average the resulting predictions.

Figure-5: Estimated Output for Regression Tree using BaggingIn a Classification problem, for a given test observation, we can record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring majority class among the B predictions.

Out-of-Bag Error Estimation: It is a metric to measure the test error of a bagged model, without the need for cross-validation.

Empirically, one can show that each bagged tree uses two-thirds of the observation.

Remaining one-third observation are called Out-of-bag (OOB) Observations.

We can predict the response for ith observation using each of the trees in which that observation was OBB.

The resulting metric will yield B/3 predictions for ith observation.

In order to obtain a single prediction, we can average this predicted response (regression) or take a majority vote (classification).

Using Bagged Models, there is an increase in accuracy with loss of interpretability of model.

Collection of Bagged Tree is relatively more difficult to comprehend than a single decision tree.

But one can obtain an overall summary of predictor using RSS ( for bagged regression trees) or Gini-Index (for bagged classification trees).

Code-2: Bagging of Decision TreeWe have built 100 Bagged Classification Tree for Heart dataset, giving us an OBB estimate error of 19.

53%.

In R programming, we can get a measure of variable importance.

Code-3: Variable Importance in Bagged ModelRandom ForestsRandom forests provide an improvement over bagged trees by way of small tweaking that decorrelates the tree.

In bagging, we build a number of bootstrapped models for prediction, A random sample of m predictors are considered as split candidates from the full set of p predictors.

The split is allowed to use only m predictors.

In random forests, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors ( typically square root of the full set).

The rationale behind this way of splitting is that if there is a strong predictor in the set along with moderate predictors.

For most of the bagged models, the strong predictor will take the top split.

So the majority of bagged models will have a similar structure which makes it highly correlated.

Averaging of many highly correlated models will not result in a substantial decrease in variance.

The main difference between bagging and random forests is the choice of predictor subset size for each split.

In the Heart Dataset, In bagging, we used 10 predictors for each split, whereas for random forest method we will use the square root of total predictors i.

e 4.

Code-4: Random Forest Model in RAs we can observe that OBB error has gone down to 17.

85% in case of this method.

BoostingSimilar to the bootstrapping or bagging technique, this method can be applied to various machine learning algorithms.

Bagging involves creating multiple copies of a dataset and fitting a decision tree in each of them and combining all of them to create a single predictor model.

Boosting works in a similar way, trees are grown sequentially i.

e each tree is grown using information from previously grown trees.

In boosting, we do not use bootstrap sampling, instead, each tree is fitted on the modified version of the original tree.

Unlike fitting a single large decision tree to the data, and getting a potentially overfit model, boosting approach learns slowly.

Given a current model, we fit a decision tree to the residuals from the model.

We fit a tree to the residuals from the model rather than the qualitative output Y.

By fitting the model to residuals, we improve the model efficiency gradually.

The shrinkage parameter controls the learning process and can be varied as per the requirement of the model.

Boosting has three tuning parameters:Number of Trees (B), Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all.

We use cross-validation to select B.

Shrinkage Parameter (Lambda), This controls the rate at which boosting learns.

Typical Values are 0.

001 or 0.

01.

Number of Splits (d) in each tree, which controls the complexity of the model.

In R, I have used the gbm library, which is used for gradient boosting.

Code-5: Gradient Boosting Method on Decision TreeWe can summarize the model in R, which gives a variable importance plot.

Figure-6: Variable Importance of Boosted TreeClosing RemarksIn this article, we have explored a Machine Learning algorithm (Decision Tree) which is used exhaustively for Classification Task.

Modifications to Decision Tree can improve its performance: Bagging and Boosting.

Nowadays, Algorithms like XGBoost, Light GBM, CatBoost are used extensively on large data-sets.

Kushal Vala, Junior Data Scientist at Datametica Solutions Pvt LtdReferences:[1] Introduction to Statistical Learning with Applications in R, Gareth James Daniela Witten, Trevor Hastie, Robert Tibshirani.

[2] Elements of Statistical Learning Edition-2, Trevor Hastie, Robert Tibshirani, Jerome Friedman.

[3] The R Book, Michael J.

Crawley, Imperial College London at Silwood Park, UK.

.