# ML Algorithms: One SD (σ) - Decision Tree Algorithms

Sagi Shaier · Feb 4

An intro to machine learning decision tree algorithms.

The obvious question to ask when facing a wide variety of machine learning algorithms is: "which algorithm is better for a specific task, and which one should I use?"

The answer varies depending on several factors, including: (1) the size, quality, and nature of the data; (2) the available computational time; (3) the urgency of the task; and (4) what you want to do with the data.

This is one section of the many algorithms I wrote about in a previous article.

In this part I tried to display and briefly explain, as simply as possible, the main algorithms (though not all of them) that are available for decision tree tasks.

Decision Tree Algorithms:

Decision trees give us a great machine learning model that can be applied to both classification problems (predicting a discrete label, e.g. yes or no) and regression problems (predicting a continuous value).

Decision trees are tree-like models of decisions.

They are one way to display an algorithm that only contains conditional control statements.
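
The conditional-control view can be made literal: a tiny hand-written tree is just nested if/else branches (a hypothetical toy example, not learned from data):

```python
def play_outside(outlook, windy):
    """A tiny hand-written decision tree as nested conditionals.
    (Hypothetical toy rules for illustration, not learned from data.)"""
    if outlook == "sunny":      # root node tests 'outlook'
        return True
    elif outlook == "rain":     # internal node tests 'windy'
        return not windy
    else:                       # e.g. "overcast"
        return True

print(play_outside("rain", windy=False))
```

Learning algorithms like the ones below differ mainly in how they choose which test goes at each node.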

· Iterative Dichotomiser 3 (ID3)

To start with, dichotomisation means dividing into two completely opposite things.

To construct the tree, the algorithm divides the attributes into two groups: the most dominant attribute and the rest.

It calculates the entropy and information gain of each attribute.

In this way, the most dominant attribute can be found.

This algorithm builds a tree top-down, starting from the root by choosing which attribute will be tested at each given node.

Basically, each attribute is evaluated through statistical means to see which attribute splits the dataset the best.

The best attribute is made the root, with its attribute values branching out.

The process continues with the rest of the attributes.

Once an attribute is selected, it is not possible to backtrack.
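
The attribute-selection step above can be sketched in a few lines of Python: compute the entropy of the labels, then the information gain of each candidate attribute, and split on whichever gain is highest (a minimal sketch using hypothetical toy data):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on one attribute."""
    # Partition the label list by the attribute's value.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical toy data: each row is [outlook, windy]; the label is "play?".
rows = [["sunny", "no"], ["sunny", "yes"], ["rain", "no"], ["rain", "yes"]]
labels = ["yes", "yes", "yes", "no"]

# ID3 would split the root on whichever attribute has the highest gain.
print(information_gain(rows, labels, 0))  # outlook
print(information_gain(rows, labels, 1))  # windy
```

The greedy character of ID3 is visible here: the gain is computed once per node, the best attribute is taken, and the choice is never revisited.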

Some things to consider:

· ID3 is a precursor to both the C4.5 and C5.0 algorithms

· ID3 doesn't guarantee an optimal solution; it can get stuck in local optima

· ID3 can overfit the training data (to avoid overfitting, smaller decision trees should be preferred over larger ones)

· ID3 uses a greedy approach, selecting the best attribute to split the dataset on at each iteration (one possible improvement is to use backtracking during the search for the optimal decision tree)

· ID3 is harder to use on continuous data

· ID3 doesn't work with missing data

· C4.5 and C5.0 (different versions of a powerful approach)

C4.5, Quinlan's next iteration, is a newer version of ID3.

The new features (versus ID3) are: (i) it accepts both continuous and discrete features; (ii) it handles incomplete data points; (iii) it solves the over-fitting problem with a bottom-up technique usually known as "pruning" (pruning refers to the removal of those branches in the decision tree that do not contribute significantly to the decision process); and (iv) different weights can be applied to the features that comprise the training data.

C5.0 is the most recent Quinlan iteration.

This implementation is covered by patent and, probably as a result, is rarely implemented (outside of commercial software packages).

Some things to consider:

· If you find that ID3 works well for you, try C4.5 or C5.0

C4.5 improvements over ID3:

· Can work with both discrete and continuous attributes

· Doesn't care if you have missing attribute values

· Prunes trees (replacing irrelevant branches with leaf nodes)

C5.0 improvements over C4.5:

· Several orders of magnitude faster

· Memory efficiency

· Smaller decision trees

· Boosting (more accuracy)

· Ability to weight different attributes

· Winnowing (reducing noise)

· Classification and Regression Tree (CART)

The term Classification And Regression Tree (CART) is an umbrella term that refers to both regression and classification decision trees.

The CART implementation is very similar to C4.5; the one main difference is that CART constructs the tree based on a numerical splitting criterion applied recursively to the data, whereas C4.5 includes the intermediate step of constructing rule sets.
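
scikit-learn's tree classes are based on an optimized version of CART, so they offer a quick way to see these binary, numeric splits in practice (a sketch assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# scikit-learn's trees use an optimized version of CART: every internal node
# is a binary split on a numeric threshold, chosen to minimize an impurity
# criterion (Gini by default).
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(export_text(tree))   # each non-leaf line is a binary threshold split
print(tree.score(X, y))    # training accuracy
```

Printing the tree with `export_text` shows that every internal node is a threshold test on a single numeric feature, which is exactly CART's splitting scheme.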

Some things to consider:

· CART splits variables based on numeric values

· Chi-squared Automatic Interaction Detection (CHAID)

CHAID builds non-binary trees (i.e., trees where more than two branches can attach to a single root or node).

Both CHAID and CART will construct trees, where each (non-terminal) node identifies a split condition.

Hence, both types of algorithms can be applied to regression or classification problems.

CHAID creates all possible cross tabulations for each categorical predictor until the best outcome is achieved and no further splitting can be performed.

CHAID builds a predictive model (a tree), to help determine how variables best merge to explain the outcome in the given dependent variable.

The main difference between CHAID and CART is that CHAID uses multiway splits (more than two nodes), whereas CART does binary splits (each node is split into two daughter nodes).

Also, CHAID guards against overfitting: a node is only split if a significance criterion is fulfilled.
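
That significance criterion can be sketched with a chi-squared test of independence on a predictor/outcome cross-tabulation (a minimal sketch with made-up counts, assuming SciPy is installed; full CHAID additionally merges similar categories and applies a Bonferroni correction, which is omitted here):

```python
from scipy.stats import chi2_contingency

# CHAID-style significance check (sketch): cross-tabulate a categorical
# predictor against the outcome and only split the node when the chi-squared
# test of independence is significant. (Counts below are made up.)
crosstab = [[30, 10],   # predictor category "A": 30 yes, 10 no
            [12, 28]]   # predictor category "B": 12 yes, 28 no
chi2, p_value, dof, expected = chi2_contingency(crosstab)

alpha = 0.05
split_here = p_value < alpha  # split only if the association is significant
print(chi2, p_value, split_here)
```

If the p-value exceeds the threshold, the node is left as a leaf, which is how CHAID limits tree growth.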

Some things to consider:

· In CHAID analysis, nominal, ordinal, and continuous data can be used, where continuous predictors are split into categories with approximately equal numbers of observations

· CHAID is particularly useful when you are looking for patterns in datasets with lots of categorical variables

· Decision Stump

A machine learning model consisting of a one-level decision tree; a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves).

A decision stump makes a prediction based on the value of just a single input feature.

Due to its simplicity, a stump on its own often demonstrates low predictive performance; however, it can be very effective when boosted with algorithms like AdaBoost.

The decision stump can be used as a base model in many machine learning ensemble methods, such as bagging and boosting.
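
A stump and its boosted version can be compared in a few lines with scikit-learn (a sketch on synthetic data, assuming scikit-learn is installed; `max_depth=1` is what makes the tree a stump):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (hypothetical, just for illustration).
X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_depth=1 gives a decision stump: one split on one feature.
stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)

# The same weak learner boosted: a depth-1 tree is AdaBoost's default base.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(stump.score(X_te, y_te), boosted.score(X_te, y_te))
```

On most datasets the boosted ensemble of stumps noticeably outperforms the single stump, which is why stumps are such a common weak learner for boosting.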

Some things to consider:

· Easy to train

· Reduces overfitting

· Will do really well in ensemble mode

· M5

M5 combines a conventional decision tree with the possibility of linear regression functions at the nodes.

Besides accuracy, it can take tasks with very high dimension — up to hundreds of attributes.

An M5 model tree is a decision tree learner for regression tasks, meaning that it is used to predict values of a numerical response variable Y.

While the M5 tree employs the same approach as the CART tree in choosing mean squared error as the impurity function, it does not assign a constant to each leaf node; instead, it fits a multivariate linear regression model there.

While M5 trees can perform well, the trees can also have large overlap.
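
M5 itself is not available in scikit-learn, but its core idea, a regression tree that fits a linear model in each leaf instead of predicting a constant, can be approximated by fitting per-leaf regressions after growing a tree (an illustrative sketch, not Quinlan's full algorithm, which also smooths and prunes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Sketch of an M5-style model tree: grow a small regression tree, then fit a
# linear regression inside each leaf instead of predicting the leaf mean.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.abs(X[:, 0]) + rng.normal(0, 0.1, size=300)  # piecewise-linear target

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
leaf_of = tree.apply(X)                              # leaf id for each sample
leaf_models = {leaf: LinearRegression().fit(X[leaf_of == leaf],
                                            y[leaf_of == leaf])
               for leaf in np.unique(leaf_of)}

def predict(X_new):
    """Route each sample to its leaf, then use that leaf's linear model."""
    leaves = tree.apply(X_new)
    return np.array([leaf_models[leaf].predict(row.reshape(1, -1))[0]
                     for leaf, row in zip(leaves, X_new)])

mse_constant = np.mean((tree.predict(X) - y) ** 2)  # CART-style constant leaves
mse_linear = np.mean((predict(X) - y) ** 2)         # M5-style linear leaves
print(mse_constant, mse_linear)
```

Since an ordinary least-squares line in a leaf can never fit the training data worse than that leaf's mean, the model tree's training error is at most that of the constant-leaf tree.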

Some things to consider:

· M5 can be used with linear regression functions at the nodes

Until next time,

Bobcat.