Let’s assume our data has p inputs and a response for each of N observations.
To construct a regression tree:

Consider all data points, then select a splitting variable j and a split point s. Define 2 regions R₁ and R₂ based on j and s:

R₁(j, s) = {X | Xⱼ ≤ s} and R₂(j, s) = {X | Xⱼ > s}.

Seek the splitting variable j and the split point s that solve

min_{j,s} [ min_{c₁} Σ_{xᵢ∈R₁(j,s)} (yᵢ − c₁)² + min_{c₂} Σ_{xᵢ∈R₂(j,s)} (yᵢ − c₂)² ].

For any choice of j and s, the inner minimization is solved by the region means, ĉ₁ = ave(yᵢ | xᵢ ∈ R₁(j, s)) and ĉ₂ = ave(yᵢ | xᵢ ∈ R₂(j, s)). For each splitting variable j, finding the best split point s is a quick scan through the data, so it is feasible to find the best pair (j, s).
After finding the best split, partition the data into the 2 regions and repeat the splitting process on each of the 2 regions.
Repeat this process on all of the resulting regions.
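The greedy split search above can be sketched in a few lines of Python. This is a toy illustration, not production code; the helper names (`sse`, `best_split`) and the data are my own:

```python
# Toy sketch of the greedy split search for a regression tree: for every
# predictor j and candidate split point s, partition the data into
# R1 = {x : x[j] <= s} and R2 = {x : x[j] > s}, fit each region with its
# mean response, and keep the (j, s) pair with the lowest total squared error.

def sse(ys):
    """Sum of squared errors of ys around its mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(X, y):
    """Return the best (j, s, total_sse) for a single split of (X, y)."""
    best = None
    for j in range(len(X[0])):
        for s in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            if not left or not right:
                continue
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best

# Toy data: the response jumps when predictor 0 crosses 2.5.
X = [[1.0, 7.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0]]
y = [1.0, 1.1, 5.0, 5.2]
print(best_split(X, y))  # splits on predictor j = 0 at s = 2.0
```

In a full tree-growing routine, this search is repeated recursively on each resulting region.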
Clearly, we need to tune the tree size to control the model’s complexity using our data.
One simple approach is to split tree nodes only if the decrease in the sum of squares due to the split exceeds some threshold. However, this is short-sighted: a seemingly worthless split may lead to good splits below it. A better strategy is to grow a large tree, stopping the splitting process only when some minimum node size is reached.
This large tree is pruned using cost-complexity pruning.
The cost-complexity criterion is defined as

Cα(T) = Σₘ Nₘ Qₘ(T) + α|T|,

where the sum runs over the terminal nodes, T ⊂ T₀ is any tree obtained by pruning the large tree T₀, node m represents region Rₘ with Nₘ observations, Qₘ(T) is the within-node squared error, and |T| is the number of terminal nodes in T. The idea is, for each α, to find the subtree Tα that minimizes the cost complexity. Large values of α → smaller trees T.
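As a quick numeric illustration of how α trades training error against tree size (the per-leaf error totals below are made-up numbers):

```python
# Cost-complexity criterion C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|.
# We compare a hypothetical 5-leaf subtree against a 2-leaf subtree at a
# small and a large alpha; the per-leaf error values are illustrative only.

def cost_complexity(leaf_errors, alpha):
    """leaf_errors: one N_m * Q_m(T) value per terminal node."""
    return sum(leaf_errors) + alpha * len(leaf_errors)

large_tree = [1.0, 1.2, 0.8, 1.1, 0.9]  # 5 leaves, low training error
small_tree = [3.0, 3.5]                 # 2 leaves, higher training error

for alpha in (0.1, 2.0):
    c_large = cost_complexity(large_tree, alpha)
    c_small = cost_complexity(small_tree, alpha)
    print(f"alpha={alpha}: large={c_large:.1f}, small={c_small:.1f}")
```

With alpha = 0.1 the large tree wins (5.5 vs 6.7); with alpha = 2.0 the small tree wins (10.5 vs 15.0), matching the rule that large α favors smaller trees.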
To construct a classification tree: the only difference lies in the criteria for splitting nodes and pruning the tree. Here we define the proportion of class k observations in node m as

p̂ₘₖ = (1/Nₘ) Σ_{xᵢ∈Rₘ} I(yᵢ = k).

Then we classify the observations in node m to the majority class in that node, k(m) = argmaxₖ p̂ₘₖ.
There are different measures Qₘ(T) of node impurity that one can use:

- Misclassification error: 1 − maxₖ p̂ₘₖ
- Gini index: Σₖ p̂ₘₖ(1 − p̂ₘₖ)
- Cross-entropy (deviance): −Σₖ p̂ₘₖ log p̂ₘₖ

While all three measures are similar, cross-entropy and the Gini index are differentiable, hence easier for numerical optimization. Besides, these 2 measures are more sensitive to changes in the node probabilities than the misclassification error. Thus, they tend to be used to grow the tree, while the misclassification rate is preferable to guide cost-complexity pruning.
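The three impurity measures are easy to compute from the class proportions; a small sketch with illustrative two-class proportions:

```python
import math

# Node impurity measures computed from the class proportions p_mk of node m.

def misclassification(p):
    return 1.0 - max(p)

def gini(p):
    return sum(pk * (1.0 - pk) for pk in p)

def cross_entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

p = [0.7, 0.3]  # illustrative proportions of the 2 classes in node m
print(round(misclassification(p), 4))  # about 0.3
print(round(gini(p), 4))               # about 0.42
print(round(cross_entropy(p), 4))      # about 0.6109
```

Note that the Gini index and cross-entropy react to the full distribution of p̂ₘₖ rather than just the majority class, which is what makes them more sensitive measures for growing the tree.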
However, a single decision tree tends to have limited predictive power.
More complicated methods, such as random forests and boosting, usually yield better results, though there is a trade-off between interpretability and prediction accuracy.
Random forests involve building many decision trees on a bootstrapped training set.
When building each tree, a random selection of m predictors out of the full set of p predictors is drawn at each candidate split, and the split may use only one of those m predictors. This tweak to bagging overcomes the problem of a few strong predictors dominating every split: it decorrelates the trees, which makes the average of the trees more robust.
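A minimal sketch of the two sources of randomness involved (a bootstrapped sample of rows per tree, and a random subset of m predictors per split); the helper names are my own:

```python
import random

# Two sources of randomness in a random forest: each tree sees a
# bootstrapped sample of the rows, and each split may only consider a
# random subset of m out of the p predictors.

def bootstrap_indices(n, rng):
    """Row indices of one bootstrapped training set (drawn with replacement)."""
    return [rng.randrange(n) for _ in range(n)]

def candidate_features(p, m, rng):
    """The m predictors eligible for one split."""
    return rng.sample(range(p), m)

rng = random.Random(0)
p, m = 9, 3  # e.g. m close to sqrt(p), a common default for classification
print("rows for this tree:", bootstrap_indices(6, rng))
for split in range(2):
    print(f"split {split} may use predictors:", sorted(candidate_features(p, m, rng)))
```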
XGBoost stands for eXtreme Gradient Boosting; it implements gradient boosting techniques but is optimized for speed and performance.
The boosting idea involves growing trees sequentially, meaning that each tree is built based on the information from previously grown trees.
XGBoost, however, applies a more heavily regularized objective when updating trees and residuals than traditional boosting does, while leveraging the structure of the hardware to speed up computation. As a result, it is often regarded as one of the best-performing tree-based models, though its predictions are harder to interpret.
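To make the sequential idea concrete, here is a bare-bones gradient-boosting loop with one-split “stumps” and squared error, in plain Python. This is only the basic boosting skeleton, not XGBoost itself (which adds a regularized objective and systems-level optimizations); all names and data are my own:

```python
# Each round fits a one-split "stump" to the current residuals and adds a
# shrunken copy of its predictions to the ensemble, so every new tree is
# built from the information left over by the previously grown trees.

def fit_stump(x, r):
    """Best single-threshold fit of 1-D feature x to residuals r."""
    best = None
    for s in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or err < best[0]:
            best = (err, s, lm, rm)
    _, s, lm, rm = best
    return lambda xi: lm if xi <= s else rm

def boost(x, y, rounds=100, lr=0.3):
    """Sequentially fit stumps to residuals; lr is the shrinkage rate."""
    pred = [0.0] * len(y)
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return pred

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
pred = boost(x, y)
print([round(p, 2) for p in pred])  # predictions approach y as rounds grow
```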
[Figure: multiple trees of an XGBoost model projected into a single tree]

Support Vector Machines (SVM)

The idea of SVM is to find nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.
Simply put: you have 2 categories of points overlapping each other in an x–y plane, and you can’t find a straight line that perfectly separates them. However, if you map the data into a higher-dimensional space, for example by adding one more dimension to the feature space, and view it from a different angle, you will probably find a hyperplane that can separate them.
Key things to construct an SVM:

- We define our hyperplane as {x : xᵀβ + β₀ = 0}.
- We might need to transform our data (e.g., with basis expansions), and the transformation usually uses kernel functions: a linear, polynomial, radial basis, or neural network kernel can be used to transform the problem.
- Select the hyperplane that maximizes the margin (distance) M between the plane and the classification groups.
Intuitively, we are more certain about the points far from the dividing plane than the points close to the plane.
By maximizing the distance, we can also capture the situation when there are points close to the plane in the full dataset but not in the training set.
Hence, we maximize the chance of correct classification.
As the chance of perfect separation is low, we need some “slack” in the margin, allowing some points to be on the wrong side of it. We define slack variables ξ = (ξ₁, …, ξ_N) and modify the constraint yᵢ(xᵢᵀβ + β₀) ≥ M in one of two ways:

yᵢ(xᵢᵀβ + β₀) ≥ M − ξᵢ,   or   yᵢ(xᵢᵀβ + β₀) ≥ M(1 − ξᵢ).

While the first choice measures overlap in actual distance from the margin, the second measures the overlap in relative distance. The first choice leads to a nonconvex optimization problem, while the second is convex; thus, we tend to use the second choice.
The value ξᵢ in the second constraint represents the proportional amount by which the prediction from xᵢᵀβ + β₀ falls on the wrong side of its margin.
The tuning parameter for this tolerance of misclassification is called “cost” parameter C.
A large value of C yields a narrower margin, corresponding to an overfit, wiggly boundary in the original feature space. In contrast, a small value of C leads to a larger margin and a smoother boundary in the original feature space.
One more parameter that you need to specify in an SVM algorithm (R or Python) is Gamma.
It defines how far a single training example can influence the calculation of the separation line.
A low value of Gamma means that faraway points are also considered, while a high value of Gamma means that only points near the possible separation line are considered.
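The effect of Gamma is easy to see with the radial basis kernel K(x, x′) = exp(−γ‖x − x′‖²), one common kernel choice in SVM libraries. The points and values below are illustrative:

```python
import math

# With a low gamma, even a far-away point keeps a sizeable kernel value and
# thus influences the boundary; with a high gamma, its influence vanishes.

def rbf(x, z, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

origin, near, far = [0.0, 0.0], [0.5, 0.0], [3.0, 0.0]
for gamma in (0.1, 10.0):
    print(f"gamma={gamma}: near={rbf(origin, near, gamma):.4f}, "
          f"far={rbf(origin, far, gamma):.2e}")
```

At γ = 0.1 the far point still carries a kernel value of about 0.41; at γ = 10 it is effectively zero.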
SVM can also be used in regression problems, known as Support Vector Regression (SVR).
The method is very similar to SVM for classification with some minor differences.
In SVR, we introduce a new term, the “ϵ-insensitive” error measure, defined, with r being the residual, as

Vϵ(r) = 0 if |r| < ϵ, and |r| − ϵ otherwise.

What we are trying to do here is find a function f(x, ω) such that the majority of our training points fall within the ϵ-insensitive zone.
We do not care about the errors as long as they are less than ϵ.
The value of ϵ affects the number of support vectors that are used to construct the regression function.
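The ϵ-insensitive measure is a one-liner; a quick sketch with illustrative residuals:

```python
# Residuals inside the epsilon-insensitive zone cost nothing; outside it,
# the cost grows linearly with the distance beyond epsilon.

def eps_insensitive(r, eps):
    return max(0.0, abs(r) - eps)

eps = 0.5
for r in (0.2, -0.4, 0.9, -1.5):
    print(f"r={r:+.1f} -> error {eps_insensitive(r, eps):.1f}")
```

Raising ϵ widens the zero-cost zone, so fewer points end up contributing as support vectors.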
We also have slack variables ξ measuring the deviation of training samples outside the ϵ-insensitive zone; this is known as the soft margin.
The problem is then formulated as minimizing the function

(1/2)‖β‖² + C Σᵢ (ξᵢ + ξᵢ*),

subject to the ϵ-insensitive constraints, with ξᵢ and ξᵢ* measuring deviations above and below the zone.

So which one is better?

Here is the outcome of running different models using my team’s Chicago House Dataset.
Hastie, T., Tibshirani, R., & Friedman, J. The elements of statistical learning: Data mining, inference, and prediction. New York, NY, USA: Springer.
Support Vector Machine Regression.