Decision Tree Classifier from Scratch

So, let’s define our Node class first.

```python
class Node:
    def __init__(self):
        self.split = None
        self.feature_values = None
        self.leaf = False
        self.left = None
        self.right = None
        self.result = None
        self.gini = 0
        self.data = None
        self.level = 0

    def __repr__(self):
        return '[' + str(self.split) + ' ' + str(self.feature_values) + ']'
```

There are quite a few attributes.

Below is an explanation of each attribute:

- split : The feature on which the node is split. E.g., a node can be split on Taste, Texture, or Temperature.
- feature_values : The values to be considered while splitting the node. E.g., if the split is based on the feature 'Taste', feature_values can be ['Sweet', 'Salty']. In that case, all the foods whose taste is either Sweet or Salty go to the left subtree, and all the foods with taste 'Spicy' go to the right subtree.
- leaf : Tells whether the node is a leaf node or not. A leaf node is a node that does not have any child nodes.
- left and right : The left and right child nodes. Both are None for a leaf node.
- result : Tells whether the output is True or False at this node. It is used when predicting the output value for a new set of data.
- gini : The value of the Gini index at this node. In this tutorial, I'm using the Gini index to decide the best split.
- data : The data at this node.
- level : The depth of the node in the tree.

Now, let's see the DecisionTree class.

```python
import numpy as np

class DecisionTree:
    # Takes the data and the target column
    def __init__(self, data, target='Eat', depth=None, v=0):
        self.data = data
        self.target = target
        self.root = None
        self.depth = depth
        self.v = v

    # This helper function calculates the gini index of a dataset
    @staticmethod
    def giniIndex(dset, target):
        return 1 - (len(dset[dset[target] == True]) / len(dset))**2 \
                 - (len(dset[dset[target] == False]) / len(dset))**2

    # This method builds the decision tree
    def build(self, data, level=0):
        if self.v == 1:
            print('======================================')
            print('Building Tree for the data')
            print('======================================')
            print(data)
        left_dataset = None
        right_dataset = None
        min_gini = np.inf
        best_feature = None
        feature_values = None
        gini = len(data) * DecisionTree.giniIndex(data, self.target)
        if self.v == 1:
            print('GiniIndex for total data = ', gini)
        node = Node()
        node.data = data
        node.level = level
        node.gini = gini
        # If the maximum depth is reached, make it a leaf node
        if self.depth is not None and level == self.depth:
            node.leaf = True
            node.result = len(data[data[self.target] == True]) >= len(data[data[self.target] == False])
            return node
        # If there is no impurity in this data, make it a leaf node
        if gini == 0:
            if self.v == 1:
                print('The data is pure, no split is needed')
            node.gini = 0
            node.leaf = True
            node.result = len(data[data[self.target] == True]) >= len(data[data[self.target] == False])
            return node
        for feature in data.columns:
            # We need not split on the target column
            if feature == self.target:
                continue
            # Find all unique values for the feature
            unique = data[feature].unique()
            if self.v == 1:
                print('Evaluating all possible splits for the feature "', feature, '"')
                print('All the values for this feature are ', unique)
            # Initialize gini, left and right datasets and best feature values
            tmngini = np.inf
            tldset = None
            trdset = None
            tbftr = None
            # We can't split on a single value; there must be at least
            # 2 unique values to be able to split
            if len(unique) == 1:
                print('Ignoring this feature as it has only a single unique value')
                continue
            # Find the best values for a split on the given feature
            for st in range(1, 2**len(unique) - 1):
                # Use st as a bitmask over the unique values
                lvals = [unique[i] for i, c in enumerate(bin(st)[2:][::-1]) if c == '1']
                # Find the left dataset
                lset = data[data[feature].isin(lvals)]
                rvals = list(set(unique) - set(lvals))
                # Find the right dataset
                rset = data[data[feature].isin(rvals)]
                # Avoid dealing with duplicate sets
                if len(lvals) > len(rvals):
                    continue
                # Find the gini index for the left split
                lgini = DecisionTree.giniIndex(lset, self.target)
                # Find the gini impurity for the right split
                rgini = DecisionTree.giniIndex(rset, self.target)
                # Find the total weighted gini
                tgini = len(lset) * lgini + len(rset) * rgini
                if self.v == 1:
                    print('***** split based on values ', lvals, '*****')
                    print('Left dataset')
                    print(lset)
                    print('Right dataset')
                    print(rset)
                    print('Weighted Gini for this split ', tgini)
                # Update the minimum gini for this feature
                if tgini < tmngini:
                    tmngini = tgini
                    tldset = lset
                    trdset = rset
                    tbftr = lvals
            if self.v == 1:
                print('Best gini for ', feature, ' is ', tmngini)
            # Update the overall minimum gini
            if tmngini < min_gini:
                min_gini = tmngini
                left_dataset = tldset
                right_dataset = trdset
                feature_values = tbftr
                best_feature = feature
        # No improvement in the gini value after the split: make it a leaf node
        if min_gini >= gini:
            node.leaf = True
            node.result = len(data[data[self.target] == True]) >= len(data[data[self.target] == False])
            return node
        node.min_gini = min_gini
        node.feature_values = feature_values
        node.split = best_feature
        if self.v == 1:
            print('Best split is "', best_feature, '" values are ', feature_values, ' and GiniIndex is ', min_gini)
        # Build the tree for the left dataset
        if left_dataset is not None:
            node.left = self.build(left_dataset, level + 1)
        # Build the tree for the right dataset
        if right_dataset is not None:
            node.right = self.build(right_dataset, level + 1)
        # If neither subtree was built, it has to be a leaf
        if node.left is None and node.right is None:
            node.leaf = True
        return node

    def fit(self):
        self.root = self.build(self.data)
        return self.root

    def __predict__(self, s, root):
        if root is None:
            return False
        if root.leaf:
            return root.result
        if s[root.split] in root.feature_values:
            return self.__predict__(s, root.left)
        return self.__predict__(s, root.right)

    def predict(self, s):
        return self.__predict__(s, self.root)
```

I'll try to explain each method.

```python
def __init__(self, data, target='Eat', depth=None, v=0):
    self.data = data
    self.target = target
    self.root = None
    self.depth = depth
    self.v = v
```

This is the constructor.

We initialize the tree with the data and the target variable. In this case, our target variable is 'Eat'. depth is the maximum allowed depth while building the tree; this parameter is used to avoid overfitting the tree. v stands for verbose: if we want to see the tree-building process, we pass v=1.

```python
def giniIndex(dset, target):
    return 1 - (len(dset[dset[target] == True]) / len(dset))**2 \
             - (len(dset[dset[target] == False]) / len(dset))**2
```

This is the helper function used to calculate the Gini index of the data. We'll use the Gini index to find the best split. The formula for the Gini index is

Gini = 1 - (p_1^2 + p_2^2 + ... + p_C^2)

(formula from http://www.learnbymarketing.com/481/decision-tree-flavors-gini-info-gain/), where C represents the number of classes and p_i is the fraction of samples belonging to class i. In our case C = 2, since it is a binary classifier.
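As a quick sanity check, the same computation can be restated in plain Python on a hand-made list of boolean labels (the numbers below are illustrative, not from the tutorial's dataset):

```python
def gini(labels):
    # Gini index for a list of boolean labels: 1 - p_true^2 - p_false^2
    p_true = sum(labels) / len(labels)
    p_false = 1 - p_true
    return 1 - p_true**2 - p_false**2

# 3 True and 1 False: 1 - 0.75**2 - 0.25**2 = 0.375
print(gini([True, True, True, False]))
# A pure set has gini 0 -- the signal the build method uses to stop splitting
print(gini([True, True, True]))
```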

Next is the tree-building method. This method is a bit large, so I'll give a high-level walkthrough of it.

```python
left_dataset = None
right_dataset = None
min_gini = np.inf
best_feature = None
feature_values = None
```

Here I'm initializing variables for the left dataset (left_dataset), the right dataset (right_dataset), the best gini index (min_gini), the best feature (best_feature), and its feature values (feature_values).

```python
gini = len(data) * DecisionTree.giniIndex(data, self.target)
```

Find the gini index for the entire dataset.

```python
node = Node()
node.data = data
node.level = level
node.gini = gini
```

Create a new node and set its data, level, and gini index.

```python
if self.depth is not None and level == self.depth:
    node.leaf = True
    node.result = len(data[data[self.target] == True]) >= len(data[data[self.target] == False])
    return node
```

If the maximum depth is reached, we make this node a leaf. Here result is True if there are more True-valued targets than False-valued targets, and False otherwise. This gives us the prediction of this leaf node and will be used when predicting the target for new data: while evaluating the target value for a new sample, the target is simply the 'result' value of the leaf node the sample ends up in. If 'result' is True, the target is predicted as True; if 'result' is False, the target is predicted as False.

```python
# If there is no impurity in this data, make it a leaf node
if gini == 0:
    node.leaf = True
    node.result = len(data[data[self.target] == True]) >= len(data[data[self.target] == False])
    return node
```

If the Gini index of the data is 0, the data is pure. It has to be made a leaf node, since there is nothing left to split on. We also set the result based on the count of target values: if there are more True-valued targets, result is set to True; otherwise, it is set to False.

========================================

I think the snippet below is the most complicated of all. To evaluate the best split, we need to iterate through all features.

```python
for feature in data.columns:
    # We need not split on the target column
    if feature == self.target:
        continue
```

If the feature name is the same as the target, we skip it.

```python
unique = data[feature].unique()
```

Find all the unique values for a feature. E.g., data['Taste'].unique() gives ['Salty', 'Spicy', 'Sweet']. Now we need to evaluate which split is the best by iterating through all possible splits. There are 6 possible splits for the Taste feature: ['Salty'], ['Spicy'], ['Sweet'], ['Salty' or 'Spicy'], ['Salty' or 'Sweet'], and ['Spicy' or 'Sweet']. We need to evaluate these 6 splits and find the best one.
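The inner loop enumerates these subsets with a bitmask. As a standalone sketch (with the unique values hard-coded for illustration), the enumeration looks like this:

```python
unique = ['Salty', 'Spicy', 'Sweet']

splits = []
for st in range(1, 2**len(unique) - 1):
    # Read st as a bitmask: bit i set means unique[i] goes to the left side
    bits = bin(st)[2:][::-1]
    lvals = [unique[i] for i, c in enumerate(bits) if c == '1']
    rvals = [v for v in unique if v not in lvals]
    splits.append((lvals, rvals))
    print(lvals, '|', rvals)
```

This prints all 6 candidate splits; the len(lvals) > len(rvals) check in the actual loop then skips the mirror-image duplicates.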

```python
# Initialize gini, left and right datasets and best feature values
tmngini = np.inf
tldset = None
trdset = None
tbftr = None
```

Initialize temporary variables to store the gini index, left dataset, right dataset, and feature values of the best split for this feature.

```python
lvals = [unique[i] for i, c in enumerate(bin(st)[2:][::-1]) if c == '1']
# Find the left dataset
lset = data[data[feature].isin(lvals)]
rvals = list(set(unique) - set(lvals))
# Find the right dataset
rset = data[data[feature].isin(rvals)]
# Avoid dealing with duplicate sets
if len(lvals) > len(rvals):
    continue
# Find the gini index for the left split
lgini = DecisionTree.giniIndex(lset, self.target)
# Find the gini impurity for the right split
rgini = DecisionTree.giniIndex(rset, self.target)
# Find the total weighted gini
tgini = len(lset) * lgini + len(rset) * rgini
```

For each available subset of feature values, split the data into a left dataset and a right dataset. Find the weighted gini index by combining the gini indices of the left and right subsets, then keep the feature subset that gives the minimum gini index. In the given example, the Taste feature with the ['Salty'] feature_values gives the best gini index of 1.0. By best, I mean the minimum gini index.
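To see why the weighted sum picks out the better split, here is a tiny numeric sketch with invented class counts (not the tutorial's data):

```python
def gini(n_true, n_false):
    # Gini index from raw class counts
    n = n_true + n_false
    return 1 - (n_true / n)**2 - (n_false / n)**2

# Candidate split A: left has 3 True / 0 False, right has 1 True / 4 False
tgini_a = 3 * gini(3, 0) + 5 * gini(1, 4)
# Candidate split B: left has 2 True / 2 False, right has 2 True / 2 False
tgini_b = 4 * gini(2, 2) + 4 * gini(2, 2)
# The split with the smaller weighted gini (here A) is kept
print(tgini_a, tgini_b)
```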

```python
# Build the tree for the left dataset
if left_dataset is not None:
    node.left = self.build(left_dataset, level + 1)
# Build the tree for the right dataset
if right_dataset is not None:
    node.right = self.build(right_dataset, level + 1)
```

Now, repeat the tree-building process recursively for the left and right datasets.

```python
def __predict__(self, s, root):
    if root is None:
        return False
    if root.leaf:
        return root.result
    if s[root.split] in root.feature_values:
        return self.__predict__(s, root.left)
    return self.__predict__(s, root.right)
```

This is the predict method that predicts the target value for new data. We simply traverse from the root of the tree based on each node's split and feature_values.

Once we hit a leaf node, its 'result' attribute gives our prediction.

========================================

I tried to find the accuracy by testing on the same dataset that I trained the tree with (resubstitution error). I got 100% accuracy, because I overfit the model by constructing the tree down to the deepest leaf node. This clearly must not be done in practice; my only intention is to prove that the tree has been built successfully.
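To make the prediction walk concrete, here is a minimal standalone sketch of the same traversal on a hand-built two-level tree (the stripped-down Node and the sample values are made up for illustration):

```python
class Node:
    # Minimal stand-in carrying only the attributes the traversal reads
    def __init__(self, split=None, feature_values=None, leaf=False,
                 result=None, left=None, right=None):
        self.split = split
        self.feature_values = feature_values
        self.leaf = leaf
        self.result = result
        self.left = left
        self.right = right

def predict(s, root):
    # Same logic as __predict__: descend left when the sample's value
    # for the split feature is in feature_values, else descend right
    if root is None:
        return False
    if root.leaf:
        return root.result
    if s[root.split] in root.feature_values:
        return predict(s, root.left)
    return predict(s, root.right)

# Root splits on Taste: Sweet/Salty go left (Eat=True), Spicy goes right
root = Node(split='Taste', feature_values=['Sweet', 'Salty'],
            left=Node(leaf=True, result=True),
            right=Node(leaf=True, result=False))

print(predict({'Taste': 'Sweet'}, root))  # True
print(predict({'Taste': 'Spicy'}, root))  # False
```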

[Screenshot of the output]

Code link: https://github.com/prudvinit/MyML/blob/master/lib/tree/DecisionTreeClassifier.py