It’s just a metric.
It’s not important to know how it came to be.
It’s important to know how to read it and what it tells us, which we just did above.
Entropy is a measure of disorder or uncertainty and the goal of machine learning models and Data Scientists in general is to reduce uncertainty.
Now we know how to measure disorder.
Next we need a metric to measure the reduction of this disorder in our target variable/class given additional information( features/independent variables) about it.
This is where Information Gain comes in.
Mathematically it can be written as:Information Gain from X on YWe simply subtract the entropy of Y given X from the entropy of just Y to calculate the reduction of uncertainty about Y given an additional piece of information X about Y.
This is called Information Gain.
The greater the reduction in this uncertainty, the more information is gained about Y from X.
Let me illustrate with an example using a simple contingency table and then I’ll bring all of it together with how decisions trees use entropy and information gain to decide what feature to split their nodes on as they are being trained on a data set.
Example: Contingency TableContingency TableHere our target variable is Liability which can take on two values “Normal” and “High” and we only have one feature called Credit Rating which can take on values “Excellent”, “Good” and “Poor”.
There are a total of 14 observations.
7 of them belong to the Normal Liability class and 7 belong to High Liability Class.
So it’s an even split by itself.
Summing across the top row we can see there are 4 observations that have value Excellent for the feature credit rating.
Furthermore , I can also see how my target variable is split for “Excellent” Credit Rating.
For observations that take have value “Excellent” for their credit rating there are 3 that belong to the Normal Liability class and only 1 that belongs to the High Liability class.
I can similarly figure out such values for other values of Credit Rating from the contingency table.
For this illustration , I will use this contingency table to calculate the entropy of our target variable by itself and then calculate the entropy of our target variable given additional information about the feature, credit rating.
This will allow me to calculate how much additional information does “Credit Rating” provide for my target variable “Liability”.
Let’s get to it then.
The entropy of our target variable is 1, at maximum disorder due to the even split between class label “Normal” and “High”.
Our next step is to calculate the entropy of our target variable Liability given additional information about credit score.
For this we will calculate the entropy for Liability for each value of Credit Score and add them using a weighted average of the proportion of observations that end up in each value.
Why we use a weighted average will become clearer when we discuss this in the context of decision trees.
We got the entropy for our target variable given the feature Credit Rating.
Now we can compute the Information Gain on Liability from Credit Rating to see how informative this feature is.
Knowing the Credit Rating helped us reduce the uncertainty around our target variable, Liability!.Isn’t that what a good feature is supposed to do?.Provide us information about our target variable?.Well that’s exactly how and why decision trees use entropy and information gain to determine which feature to split their nodes on to get closer to predicting the target variable with each split and also to determine when to stop splitting the tree!.( in addition to hyper-parameters like max depth of course).
Let’s see this in action with another example using decision trees.
Example: Decision TreeConsider an example where we are building a decision tree to predict wether a loan given to a person would result in a write-off or not.
Our entire population consists of 30 instances.
16 belong to the write-off class and the other 14 belong to the non-write-off class.
We have two features, namely “Balance” that can take on two values -> “< 50K” or “>50K” and “Residence” that can take on three values -> “OWN”, “RENT” or “OTHER”.
I’m going to show you how a decision tree algorithm would decide what attribute to split on first and what feature provides more information, or reduces more uncertainty about our target variable out of the two using the concepts of Entropy and Information Gain.
Feature 1: BalanceProvost, Foster; Fawcett, Tom.
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic ThinkingThe dots are the data points with class right-off and the stars are the non-write-offs.
Splitting the parent node on attribute balance gives us 2 child nodes.
The left node gets 13 of the total observations with 12/13 ( 0.
92 probability) observations from the write-off class and only 1/13( 0.
08 probability) observations from the non-write of class.
The right node gets 17 of the total observation with 13/17( 0.
76 probability) observations from the non-write-off class and 4/17 ( 0.
24 probability) from the write-off class.
Let’s calculate the entropy for the parent node and see how much uncertainty the tree can reduce by splitting on Balance.
Splitting on feature ,“Balance” leads to an information gain of 0.
37 on our target variable.
Let’s do the same thing for feature, “Residence” to see how it compares.
Feature 2: ResidenceProvost, Foster; Fawcett, Tom.
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic ThinkingSplitting the tree on Residence gives us 3 child nodes.
The left child node gets 8 of the total observations with 7/8 (0.
88 probability) observations from the write-off class and only 1/8 (0.
12 probability) observations from the non-write-off class.
The middle child nodes gets 10 of the total observations with 4/10 (0.
4 probability) observations of the write-off class and 6/10( 0.
6 probability) observations from the non-write-off class.
The right child node gets 12 of the total observations with 5/12 ( 0.
42 probability) observations from the write-off class and 7/12 ( 0.
58 ) observations from the non-write-off class.
We already know the entropy for the parent node.
We simply need to calculate the entropy after the split to compute the information gain from “Residence”The information gain from feature, Balance is almost 3 times more than the information gain from Residence!.If you go back and take a look at the graphs you can see that the child nodes from splitting on Balance do seem purer than those of Residence.
However the left most node for residence is also very pure but this is where the weighted averages come in play.
Even though that node is very pure, it has the least amount of the total observations and a result contributes a small portion of it’s purity when we calculate the total entropy from splitting on Residence.
This is important because we’re looking for overall informative power of a feature and we don’t want our results to be skewed by a rare value in a feature.
By itself the feature, Balance provides more information about our target variable than Residence.
It reduces more disorder in our target variable.
A decision tree algorithm would use this result to make the first split on our data using Balance.
From here on, the decision tree algorithm would use this process at every split to decide what feature it is going to split on next.
In a real world scenario , with more than two features the first split is made on the most informative feature and then at every split the information gain for each additional feature needs to be recomputed because it would not be the same as the information gain from each feature by itself.
The entropy and information gain would have to be calculated after one or more splits have already been made which would change the results.
A decision tree would repeat this process as it grows deeper and deeper till either it reaches a pre-defined depth or no additional split can result in a higher information gain beyond a certain threshold which can also usually be specified as a hyper-parameter!There you have it!.You now know what entropy and information gain are and how they are computed.
You understand how a decision tree either by itself or in a tree based ensemble decides on the best order of features to split on and decides when to stop when it trains itself on given data.
If you every have to explain the intricacies of how decision trees work to someone, hopefully you won’t do too bad.
I hope you were able to get some value from this post.
If there is anything that I missed or something was inaccurate or if you have absolutely any feedback , please let me know in the comments.
I would greatly appreciate it.