Sigmoid Activation and Binary Crossentropy: A Less Than Perfect Match?

Let’s see.


Checking individual samples’ raw output values in binary classification networks

I trained three different networks on the task of categorizing dogs vs. cats, using a subset of the 2013 Kaggle competition data set (2000 training images, 1000 for validation), following the example of F. Chollet (Deep Learning with Python. Manning Publications Co, Shelter Island, New York).



Yes, cats and dogs again, for the sake of ease and focusing on the issue at hand.

The figures and numbers below stem from a simple, hand-crafted convnet: four pairs of 2D convolution/max pooling layers, followed by a single dropout and two dense layers, all with relu activation.

The last, single-element output layer had no activation, and as the loss function I used the above-mentioned Keras wrapper for TensorFlow’s sigmoid_cross_entropy_with_logits.

First, let’s find out whether individual images can in fact result in extreme raw values of the output layer.

After training, I ran the same images as used for training through the network and obtained their raw output from the last layer.

Additionally, I computed the sigmoid-transformed output, as well as the BCE values derived from both outputs.
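The two ways of computing BCE that are compared here can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used for the figures: bce_from_logit follows the numerically stable formula documented for TensorFlow's sigmoid_cross_entropy_with_logits, and bce_from_sigmoid mimics the sigmoid + binary_crossentropy route by clipping the probability to [eps, 1 - eps], with eps = 1e-7 assumed as the clipping constant. All function names are hypothetical.

```python
import math

EPS = 1e-7  # assumed clipping constant (Keras backend default epsilon)


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_from_logit(x, label):
    """BCE computed directly from the raw output x (stable form)."""
    return max(x, 0.0) - x * label + math.log1p(math.exp(-abs(x)))


def bce_from_sigmoid(x, label, eps=EPS):
    """BCE computed from the sigmoid output, clipped to [eps, 1 - eps]."""
    p = min(max(sigmoid(x), eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Raw outputs in a moderate range: both routes give the same loss.
for x in (-3.0, 0.0, 2.5, 8.0):
    for label in (0, 1):
        a = bce_from_logit(x, label)
        b = bce_from_sigmoid(x, label)
        print(f"x={x:5.1f} label={label}  logit-BCE={a:.6f}  sigmoid-BCE={b:.6f}")
```

As long as the raw outputs stay in a moderate range like the one used in the loop, the two columns agree to within floating-point rounding.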

This is what I got after training for eight epochs, so with relatively little learning having taken place:

Figure 4: Left and center, distribution of outputs of the last layer after training the network for eight epochs. Input for prediction consisted of the same 2000 images as used for training. Left, raw values; center, sigmoid-transformed values. Right, scatter plot of BCE values computed from sigmoid output vs. those computed from raw output. Batch size = 1.

Obviously, in the initial phase of training, we are outside the danger zone; raw last-layer output values are bounded by ca. [-3, 8] in this example, and BCE values computed from raw and sigmoid outputs are identical.

Also nice to see is the strong ‘squashing’ effect of the sigmoid (Fig. 4, center).

What is the picture like when the network is fully trained (here defined as not having shown a reduced loss in 15 consecutive epochs)?

Figure 5: Same depiction as in Fig. 4, but after full training of the network (test accuracy of ca. …). Same conventions as in Fig. 4 apply. Note clipping of BCE values computed from sigmoid outputs (right).

Aha, we see a much clearer separation between the classes, as expected.

And a small number of images did in fact result in extreme logit values that fall into the clipping range.

Let’s focus on class 1 (dogs; orange color in the figures); the argument runs similarly for the other class.

None of the samples result in raw output values more negative than ca. -4, so again that is fine.

However, a certain number of samples — the doggiest dogs — reach raw output values larger than about 16.

Accordingly, the associated BCE, computed via sigmoid + Keras’s binary_crossentropy, is clipped at ca. 10⁻⁷ (Fig. 5, right; see also Fig. …).


This is a really small value.
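Where the clipping sets in follows from the clipping constant alone. The short calculation below assumes eps = 1e-7, consistent with the floor of ca. 10⁻⁷ seen above, and uses float64 arithmetic, so the numbers are illustrative rather than an exact reproduction of the float32 values in the figure.

```python
import math

EPS = 1e-7  # assumed clipping constant, matching the ca. 1e-7 floor above

# sigmoid(x) exceeds 1 - EPS once x > log((1 - EPS) / EPS).
threshold = math.log((1.0 - EPS) / EPS)
print(f"clipping starts at a raw output of about {threshold:.2f}")

# Smallest BCE a class-1 sample can reach via the clipped sigmoid route.
floor = -math.log(1.0 - EPS)
print(f"clipped BCE floor: {floor:.3e}")

# Without clipping, the loss keeps shrinking as the raw output grows.
for x in (16.0, 20.0, 30.0):
    unclipped = math.log1p(math.exp(-x))  # BCE from raw output, label 1
    clipped = max(unclipped, floor)       # same loss with the floor applied
    print(f"x={x:4.1f}  unclipped BCE={unclipped:.3e}  clipped BCE={clipped:.3e}")
```

The threshold comes out slightly above 16, which fits the observation that only raw outputs larger than about 16 end up on the clipped plateau.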

Would we expect learning to happen in a systematically different fashion if BCE values of doggy dogs and catty cats were smaller (and individually different) when computed without clipping-induced limits? Probably not. Particularly if we use reasonable batch sizes, the samples with intermediate or low raw output values will dominate the loss.

Figure 6 illustrates this for the same data as above with a batch size of 4, which is still really on the low side.

Figure 6: Scatter plot of BCE values computed from sigmoid output vs. those computed from raw output of the fully trained network with batch size = 4.
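Why clipping matters so little once losses are averaged over a batch can be seen in a toy calculation. The four raw outputs below are made-up values for class-1 samples (two extreme, two moderate), not data from the experiments, and eps = 1e-7 is again an assumed clipping constant.

```python
import math

EPS = 1e-7  # assumed clipping constant


def bce_raw(x):
    """BCE of a class-1 sample, computed from the raw output x >= 0."""
    return math.log1p(math.exp(-x))


def bce_clipped(x, eps=EPS):
    """Same loss, floored as if the sigmoid output were clipped at 1 - eps."""
    return max(bce_raw(x), -math.log(1.0 - eps))


# Hypothetical batch of four dogs: two very "doggy", two less certain.
batch = [20.0, 18.0, 2.0, 0.5]

mean_raw = sum(bce_raw(x) for x in batch) / len(batch)
mean_clipped = sum(bce_clipped(x) for x in batch) / len(batch)
rel_diff = (mean_clipped - mean_raw) / mean_raw

print(f"mean BCE from raw outputs:     {mean_raw:.9f}")
print(f"mean BCE from clipped sigmoid: {mean_clipped:.9f}")
print(f"relative difference:           {rel_diff:.2e}")
```

The two moderate samples contribute losses many orders of magnitude larger than the clipping floor, so the batch mean barely changes whichever way the extreme samples are handled.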

Results were qualitatively similar with VGG16- and Inception V3-based networks pretrained on the ImageNet data set (trained without fine-tuning of the convolutional parts).


Conclusions

First of all, let’s reiterate that fears of numerical under- or overflow due to the combination of sigmoid activation in the last layer and BCE as the loss function are unjustified, at least in Keras using the TensorFlow backend.

I am not familiar with other frameworks (yet), but I would be very surprised if they did not feature similar precautions.
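What such a precaution guards against can be shown without any framework. In the sketch below (plain Python, float64; the function names are hypothetical and eps = 1e-7 is an assumed clipping constant), a naive BCE breaks down for a large raw output because the sigmoid output rounds to exactly 1 and log(0) is evaluated, whereas the clipped version and the version computed from the raw output both stay finite.

```python
import math


def sigmoid(x):
    """Plain logistic function (fine for the positive x used below)."""
    return 1.0 / (1.0 + math.exp(-x))


def naive_bce(x, label):
    """No precautions: fails when sigmoid(x) rounds to exactly 0 or 1."""
    p = sigmoid(x)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def clipped_bce(x, label, eps=1e-7):
    """Sigmoid output clipped to [eps, 1 - eps] before taking logs."""
    p = min(max(sigmoid(x), eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def logit_bce(x, label):
    """Stable form computed directly from the raw output."""
    return max(x, 0.0) - x * label + math.log1p(math.exp(-abs(x)))


x, label = 40.0, 0  # a cat that the network confidently calls a dog

try:
    print("naive:  ", naive_bce(x, label))
except ValueError as err:
    print("naive:   failed with", repr(err))

print("clipped:", clipped_bce(x, label))  # finite, capped near -log(1e-7)
print("logit:  ", logit_bce(x, label))    # finite, close to x itself
```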

Based on the ‘experiments’ above with the venerable cats_vs_dogs data set, it appears that sigmoid + BCE is fine also in terms of precision.

Particularly if you use a reasonable batch size and properly scaled data, it should not matter how BCE is computed.

However, this is just one data set, with very few models tested on it.

So, my tentative summary: If

- you know or suspect that raw outputs of many of your samples at the last layer attain extreme values, and
- your batch size is very small, and/or
- you want to exclude numerical imprecision as a possible (if unlikely) cause of trouble,

it can’t harm to compute BCE from raw outputs.

Otherwise, just stick with sigmoid+BCE.

Comments, suggestions, experience with other frameworks? Happy to hear about it.

