Tree-Based Methods: Regression TreesKushal ValaBlockedUnblockFollowFollowingMar 18This article gives a detailed review of the Decision Tree Algorithm used for Regression task-setting.

At the core, Decision Tree models are nested if-else conditions.

Interpretability of the result is much more pronounced than Least Squared Approach, but there is a considerable loss of accuracy involved.

To overcome that, we use strategies like Bagging, Boosting and Random Forests.

Picture Courtesy: PinterestBasics of Decision TreeIn this section, we will be working on a Pollution dataset, which consists of 7 explanatory variables, and the target being Pollution Level.

We will first make a skeleton model using the R Programming’s tree module and observe the results.

Code-1: Implementing Decision TreeIn this code, we have imported a tree module in CRAN packages, which has the functionality of Decision Trees.

The result of the above code is as follows:Figure-1: Decision Tree of Pollution DatasetDecision Tree is an upside-down schema.

In the following tree, we might interpret Industry to be the most important factor for splitting on Subspace.

Subspace is a p-dimensional space of ‘p’ explanatory variables/features unto which the regression task is to be determined.

Industry<748, Population<190, Wet Days<108, Temp<59.

35,Wind<9.

65 .

These inequalities are known as terminal nodes of a tree.

This terminal splitting occurs based on certain criteria which we will discuss in the following section.

Terminal nodes have branches, which might terminate or further split.

if Population < 190 : Pollution Level = 43.

43 , This number is based on the mean observation in training set which splits that particular region.

A Mathematical PerspectiveIn the earlier section, we used R programming to make a decision tree of a use-case.

But it is paramount to know and understand what goes behind the code.

Our algorithm has to automatically decide on splitting variables and split points, and also what topology the tree should have.

There are majorly two steps involved:We divide the predictor space set of possible feature variables into distinct and non-overlapping regions.

For every observation that falls into a region, we make the prediction which is the mean of response in the training set in that particular region.

For step 1: In theory, we can make any shape, but we choose to divide the space using high-dimensional rectangles or boxes and because of ease of simplicity and interpretation.

The goal is to find boxes which minimize the RSS.

Figure-2: RSS of Decision TreeIt is computationally infeasible to consider every possible partition of feature space into J boxes.

So to overcome that we go for a top-down, greedy approach known as recursive binary splitting.

The approach is top-down as it starts at the top of the tree ( at which all observations fall into a single region) and successively splits the predictor space into two new branches.

It is greedy because, at each step, it chooses the best split at that particular region and not for the next steps.

In order to perform Recursive binary splitting, we select the predictor and cutpoint that leads to the greatest reduction in RSS.

Figure-3: For any variable j and splitting point sWe seek the value of j and s that minimize the equation:Figure-4: RSS of Recursive SplittingCode-2: R Code for Regression TreeIn this code, we see the detailed description of the tree (Figure-2), as seen the deviance or RSS is reducing with each split.

Tree PruningThe process described in the above section is overly optimistic about training data i.

e it can overfit the data and can perform poorly on test set performance.

The overfit model ascertains high variance, but a smaller tree with fewer splits might lead to lower variance with better interpretation but with little higher bias.

One strategy would be to split the nodes only if the decrease in RSS of the split exceeds a certain threshold.

But this strategy does not guarantee better results.

Tree size is a tuning parameter governing the model’s complexity and the optimal tree size should be chosen from the data.

To overcome the overfitting, we apply Cost Complexity Pruning Algorithm.

The preferred strategy is to grow a large tree and stopping the splitting process only when some minimum node size (usually 5) is reached.

We define a subtree T that can be obtained by pruning, by collapsing the number of internal nodes.

We index the terminal nodes by m, with node m representing the Region Rm.

Let |T| denote the number of terminal nodes in T.

Figure-5: Mathematical Formulation of Cost-Complexity PruningThe tuning parameter governs the tradeoffs between tree-size and its goodness of fit.

Large values of alpha results in smaller trees and conversely.

This is analogous to the procedure in Ridge Regression, where an increase in the value of tuning parameter will decrease the weights of coefficients.

Estimation of alpha is achieved by five- or ten-fold cross-validation.

Code-3: Cost-Complexity Pruning and Manual PruningIn the tree module, there is a method called prune.

tree which gives a graph on the number of nodes vs deviance based on the cost complexity pruning.

We can even manually select the nodes based on the graph.

Figure-6: Size vs Deviance using Pruning MethodClosing RemarksIn this article, I have explained this Machine Learning algorithm for Regression Task.

In the next article, I will be exploring this for Classification Task and methods to improve the model (Bagging, Boosting, Random Forests).

Kushal Vala, Junior Data Scientist at Datametica Solutions Pvt LtdReferences:[1] Introduction to Statistical Learning with Applications in R, Gareth James Daniela Witten, Trevor Hastie, Robert Tibshirani.

[2] Elements of Statistical Learning Edition-2, Trevor Hastie, Robert Tibshirani, Jerome Friedman.

[3] The R Book, Michael J.

Crawley, Imperial College London at Silwood Park, UK.

.