Cross-entropy: From an Information theory point of viewashutosh nayakBlockedUnblockFollowFollowingJun 21Cross-entropy is a widely used loss function in machine learning for classification problems.

Information theory is used widely but the rationale behind using it is not explained in classes.

In this blog, we go through an intuitive understanding of the information theory and finally connecting it with the cross-entropy function.

History: Telecommunications were done using unreliable channels where transmitted signals were not often equal to the received signals as they were corrupted by noise.

Shannon proposed that an encoder-decoder system can be developed which could retrieve transmitted signals with minimum corruption where the amount of corruption is uncertain.

It could be done by redundancy, that is if a signal is aaa, send three signals, each aaa, if one is corrupted as aab, the other 2 can confirm that the transmitted signal was aaa by consensus.

All these were sent in bits and channel rate r = the number of useful bits/ total bits sent.

Shannon proved that the error could be 0 even though the r is >> 0 (he proved it to be 0.

5).

Shannon defined information content of a signal a with information x in Equation (1).

He claimed that it is the right way to measure information content.

From (1), it can be seen that information content is the biggest for an uncertain event.

The information content of the ensemble of transmitted signals is given in Equation (2).

Shannon claimed h(x) should be the compressed file length we should aspire to encode the information to be uniquely identifiable.

He proved that we cannot encode the information in a number of bits < h(x).

Higher uncertainty = higher value of h(x), which means more redundancy is required to encode information (which makes sense, right).

For example, there is much more to learn as to why a bridge collapsed by studying a collapsed bridge (a rare event) than studying perfect bridges.

(1) information content of x = h(x=a) = ln( 1/P(x=a) )(2) h(x) = sum over all a( P(x=a) ln( 1/P(x=a)) )h(x) is maximized for ensemble when all have equal probabilities.

Let's take an example before we move ahead.

For example, if I am thinking of a number between 1–100 and ask you to guess it in the minimum number of trials, “best first” question is “if the number is less than 50”.

This is the best question as it gives the maximum information about my number (eliminates 50 options).

If your first question is “is the number less than 1”, there is a high chance that you did not gain enough information if the answer is no (as you are left with 99 options).

Thus information content is maximized when we select a number 50 which divides the set into equal probabilities.

By Shannon’s principle, providing 1 bit of information should reduce the uncertainty by 1/2.

Entropy: is defined as the expected information content (equation 2).

It is also a measure of uncertainty in the information.

If the uncertainty is high, entropy is high.

For example, if we have a loaded coin which lands on heads every time, there is 0 entropy as there is no information from a coin toss (as it will always land on heads).

If we have to encode a piece of information of length l, we cannot encode it in a number of bits less than the entropy (equation 2).

The average length of code used to convey the message cannot be less than the “bits” of information sent.

To uniquely identify a signal of length l, we can use 2^l bits (0 or 1).

The ideal length for a piece of information i is shown in Equation(3).

To be uniquely identifiable (based on Kraft inequality), the length of the encoding is shown in Equation (4).

(3) l_i = h(x_i) = ln( 1/p_i)(4) l_i = ln( 1/q_i) -ln(Z) where Z is a normalization constant > 0 and ≤ 1Sum of total bits required L = sum over all i (p_i x l_i).

Using l_i value from (4), we reach to L ≥h(X) + KL(p||q) where KL(p||q) is the Kullback-Liebler divergence.

Thus KL divergence is the extra length of encodings required to represent a set of information against ideal bits of encoding.

Now that we understand the history behind KL divergence, let's have a statistical view.

Kullback-Liebler divergence is the measure of “how different is P from Q”.

It is the difference between the information contained in the two distributions using Equation (2), generally measured from the perspective of desired distribution (true classes in classification problem).

The KL divergence is the information content + extra bits required to code the new distribution from the perspective of true distribution.

Now, having set the pieces, we look at cross-entropy.

Cross-entropy = sum over all i (Y_i x ln(Y_i/Y_hat_i).

Summing it for all i and simplification by separating and discarding the constant numerator inside ln() term, cross-entropy remains at sum over all i (Y_i x ln(1/Y_hat_i) )If true distribution is Y and predicted is Y_hat, and Y does not equal to Y_hat, as we never know the true distribution, cross-entropy does not equal to entropy.

difference = -sum over all i (Y_i ( log (Y_hat_i) -log(Y_i) ) = KL (Y||Y_hat)Thus, cross entropy is actually a measure of entropy + KL divergence between the true distribution and the distribution from your machine learning model.

References:Prof David Mackay lecture series on information theoryYoutube video on KL divergence+cross-entropy.