Using q instead of p requires H(p, q) - H(p) extra bits.

We call this value the relative entropy or Kullback-Leibler (KL) divergence.

If we plug in the equations for cross entropy and entropy and then collect terms, we get the KL-Divergence:

KL(p || q) = H(p, q) - H(p) = Σ_x p(x) log( p(x) / q(x) )

Again, this gives us the extra number of bits (or nats) needed given that we're using q in place of p.

The KL-Divergence is always greater than or equal to 0, and it equals 0 only when q equals p.

Therefore, it’s also a common objective function to minimize.

Minimizing the KL-Divergence also minimizes the cross entropy, which we've already seen is equivalent to maximum likelihood estimation.

Mutual Information

We discussed earlier that some words aren't independent of each other, and that the dependence causes us to get redundant information when we observe them both ("football" and "score" for example).

We’d like some measure of how much information two variables share.

Recall that if x and y are independent variables, then their joint distribution is p(x, y) = p(x)p(y).

If they aren’t independent, then we can’t factor the joint distribution in this way.

We don’t have any redundancy when the variables are independent, and as the variables get more dependent, the amount of redundant information increases.

To quantify this, we use the mutual information of x and y:

I(x; y) = KL( p(x, y) || p(x)p(y) ) = Σ_{x, y} p(x, y) log( p(x, y) / ( p(x) p(y) ) )

The mutual information is the amount of extra information needed if we’re representing the true joint distribution with the independent factorization.

If the variables are independent, then the KL-Divergence will be 0, otherwise it will grow as the variables increase in redundancy.
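To make this concrete, here's a small sketch (pure Python; the helper name is my own) that computes the mutual information directly from a joint distribution table:

```python
import math

def mutual_information(joint):
    """I(x; y) = KL(p(x,y) || p(x)p(y)) in bits, from a joint table.

    `joint` is a 2D list where joint[i][j] = p(x=i, y=j).
    """
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]  # p(x,y) = p(x)p(y) -> I = 0
dependent   = [[0.5, 0.0], [0.0, 0.5]]      # x determines y   -> I = 1 bit
```

The independent table factors exactly, so the KL-Divergence (and hence the mutual information) is 0; in the dependent table, knowing x tells you y completely, giving a full bit of shared information.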

Mutual information has been frequently used to perform feature selection in machine learning.

For a given feature, we can measure the feature’s mutual information with the class labels.

If the mutual information is high, then the feature is a strong indicator of the class.

For example, if author A always includes their name in their documents, then the mutual information between their name and the class will be extremely high.

Similarly, if we have too many features to consider, we can use mutual information between features to remove those that are redundant.

If author A always includes their name and their home town, then we can safely remove their home town from our vocabulary and still perform well on the task.
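A minimal sketch of that feature-selection idea, estimating mutual information between a feature and the class labels from paired observations (the data and function name here are hypothetical, purely for illustration):

```python
import math
from collections import Counter

def feature_label_mi(feature_values, labels):
    """Estimate I(feature; label) in bits from empirical counts."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))  # counts of (value, label)
    pf = Counter(feature_values)                  # marginal feature counts
    pl = Counter(labels)                          # marginal label counts
    return sum(c / n * math.log2((c / n) / ((pf[f] / n) * (pl[l] / n)))
               for (f, l), c in joint.items())

# Hypothetical documents: 'has_name' perfectly tracks author A vs. B,
# while 'noise' is unrelated to the label.
has_name = [1, 1, 1, 0, 0, 0]
noise    = [1, 0, 1, 0, 1, 0]
authors  = ['A', 'A', 'A', 'B', 'B', 'B']

feature_label_mi(has_name, authors)  # high: strong indicator of the class
feature_label_mi(noise, authors)     # near zero: safe to drop
```

Ranking features by this score and keeping the top k is one simple selection strategy; the same function applied to pairs of features flags redundant ones, as in the name/home-town example above.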

Conclusion

In this post, we've covered the main concepts from Information Theory which are directly applicable to machine learning.

This hasn’t been an exhaustive treatment in any sense, but these are concepts that I’ve personally seen come up time and time again in practice.

For more reading, I suggest checking out Christopher Bishop’s book Pattern Recognition and Machine Learning.

The entire book is a gold mine of knowledge, but the concepts I’ve discussed can all be found in Chapter 1 under the information theory section.

David MacKay’s book Information Theory, Inference, and Learning Algorithms is also very popular.

See you next time!