Bayesian CNN model on MNIST data using Tensorflow-probability (compared to CNN)

LU ZOU
Jan 29

Motivation

I've recently been reading about Bayesian neural networks (BNNs), where traditional backpropagation is replaced by Bayes by Backprop.

This was introduced by Blundell et al (2015) and then adopted by many researchers in recent years.

Instead of a point estimate for each weight, a BNN approximates a distribution over the weights, commonly a Gaussian/normal distribution with two parameters (mean and standard deviation), based on prior information and data.

The prediction uses the posterior distribution of weights.

Backpropagation then updates the mean and standard deviation of each weight distribution rather than a single weight value.

This way, the model can provide uncertainty estimates of the weights and predictions.
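To make the idea concrete, here is a minimal NumPy sketch (not taken from the original post) of a weight matrix whose entries each follow a Gaussian distribution. The names `mu`, `rho`, and `sample_weights` and all numeric values are illustrative assumptions; keeping the standard deviation positive through a softplus follows the parameterization used by Blundell et al (2015).

```python
import numpy as np

rng = np.random.default_rng(0)

# Variational parameters for a 3x2 weight matrix (illustrative values).
mu = np.zeros((3, 2))          # means of the weight distributions
rho = np.full((3, 2), -3.0)    # unconstrained parameter; sigma = softplus(rho)


def softplus(z):
    """Map any real number to a positive value."""
    return np.log1p(np.exp(z))


def sample_weights(mu, rho, rng):
    """Draw one weight sample via w = mu + sigma * eps, with eps ~ N(0, 1)."""
    sigma = softplus(rho)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


# Each forward pass uses a fresh weight sample, so repeated predictions differ.
x = np.ones((1, 3))
outputs = np.stack([x @ sample_weights(mu, rho, rng) for _ in range(1000)])
print("mean of outputs:", outputs.mean(axis=0))
print("std of outputs: ", outputs.std(axis=0))
```

The spread of `outputs` is what a BNN reports as predictive uncertainty; a network with point-estimate weights would return the same value on every pass.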

The BNN approach can be applied to any neural network model, but here I'm interested in its application to convolutional neural networks (CNNs).

So far, there are several existing packages in Python that implement Bayesian CNN.

For example, Shridhar et al 2018 used PyTorch (also see their blogs), Thomas Wiecki 2017 used PyMC3, and Tran et al 2016 introduced the package Edward, which was later merged into TensorFlow Probability (Tran et al 2018).

This blog will use TensorFlow Probability to implement Bayesian CNN and compare it to regular CNN, using the famous MNIST data.

Human accuracy on the MNIST data is about 97.5% to 98%.

A single-layer neural network model will be used as the baseline model.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import warnings
# warnings.simplefilter(action="ignore")
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import matplotlib.pyplot as plt
from matplotlib import figure
from matplotlib.backends import backend_agg
import seaborn as sns
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
tf.logging.set_verbosity(tf.logging.ERROR)

# Dependency imports
import matplotlib
import tensorflow_probability as tfp
matplotlib.use("Agg")
%matplotlib inline

Import data

Tensorflow's built-in MNIST API saves you a lot of effort on manipulating the MNIST data, so that you can focus on model development.

It lets you import the data in different shapes and with one-hot encoded labels, makes selecting batches easy, and provides images that are already normalized.

The MNIST data are imported here in three versions:

- images reshaped to a 784 (28*28) vector, labels one-hot coded
- images not reshaped (28*28*1 images), labels not one-hot coded (integers 0–9)
- images not reshaped, labels one-hot coded
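The relationship between these three versions can be sketched with plain NumPy. This is an illustration on made-up arrays, not the MNIST API itself; the names `flat_images`, `int_labels`, and the values in them are assumptions for the example.

```python
import numpy as np

# Four fake "images" stored as flat 784-vectors, plus integer labels.
flat_images = np.arange(4 * 784, dtype=np.float32).reshape(4, 784)
int_labels = np.array([6, 0, 9, 3])

# Flat vectors -> image tensors: restore height, width, and a channel axis.
conv_images = flat_images.reshape(-1, 28, 28, 1)

# Integer labels -> one-hot rows by indexing an identity matrix.
onehot_labels = np.eye(10, dtype=np.float32)[int_labels]

# One-hot rows -> integer labels again.
recovered = onehot_labels.argmax(axis=1)

print(conv_images.shape)    # (4, 28, 28, 1)
print(onehot_labels.shape)  # (4, 10)
print(recovered)            # [6 0 9 3]
```

The flat version suits a fully connected model, while the 28*28*1 version keeps the spatial layout that convolutional layers need.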

# data_dir must point to a local folder where the MNIST files are (or will be) stored;
# it is not defined in the snippet shown here.
mnist_onehot = input_data.read_data_sets(data_dir, one_hot=True)
mnist_conv = input_data.read_data_sets(data_dir, reshape=False, one_hot=False)
mnist_conv_onehot = input_data.read_data_sets(data_dir, reshape=False, one_hot=True)

# display an image
img_no = 485
one_image = mnist_conv_onehot.train.images[img_no].reshape(28, 28)
plt.imshow(one_image, cmap='gist_gray')
print('Image label: {}'.format(np.argmax(mnist_conv_onehot.train.labels[img_no])))

Image label: 6

Baseline model

As a baseline model, a neural network with a single fully connected layer and no hidden layer is built.

This is equivalent to a multinomial logistic regression model.

The model flattens the image, ignoring the connections between neighboring pixels in the image.
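As a rough illustration of what this baseline computes, the forward pass of a multinomial logistic regression can be written in a few lines of NumPy. This is a sketch on random inputs rather than MNIST, and the helper names `softmax` and `cross_entropy` are my own, not part of the original code.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(probs, onehot):
    """Mean negative log-probability of the true class."""
    return -np.mean(np.sum(onehot * np.log(probs + 1e-12), axis=1))


rng = np.random.default_rng(0)
x = rng.random((5, 784))                 # 5 flattened "images" (random values)
y_true = np.eye(10)[[3, 1, 4, 1, 5]]     # one-hot labels

W = np.zeros((784, 10))                  # zero initialisation, as in the post
b = np.zeros(10)

probs = softmax(x @ W + b)               # class probabilities per image
loss = cross_entropy(probs, y_true)
accuracy = np.mean(probs.argmax(axis=1) == y_true.argmax(axis=1))

print(probs[0])   # uniform: every class has probability 0.1
print(loss)       # close to log(10), about 2.3026
```

With all-zero weights every class is equally likely, so the starting loss is log(10); training moves `W` and `b` away from zero to reduce it.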

The baseline model actually does a good job, reaching around 91–93% accuracy.

The code is from the Udemy course “Complete Guide to Tensorflow for Deep Learning with Python”.

# define placeholders
x = tf.placeholder(tf.float32, shape=[None, 28*28])
y_true = tf.placeholder(tf.float32, shape=[None, 10])

# define variables: weights and bias
W = tf.Variable(tf.zeros([28*28, 10]))
b = tf.Variable(tf.zeros([10]))

# create graph operations
y = tf.matmul(x, W) + b

# define loss function
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y))

# define optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5)
train = optimizer.minimize(cross_entropy)

# create session
epochs = 5000  # number of training batches, not full passes over the data
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for step in range(epochs):
        batch_x, batch_y = mnist_onehot.train.next_batch(50)
        sess.run(train, feed_dict={x: batch_x, y_true: batch_y})

    # EVALUATION
    correct_preds = tf.equal(tf.argmax(y, 1), tf.argmax(y_true, 1))
    acc = tf.reduce_mean(tf.cast(correct_preds, tf.float32))
    print('Accuracy on test set: {}'.format(
        sess.run(acc, feed_dict={x: mnist_onehot.test.images, y_true: mnist_onehot.test.labels})))

Accuracy on test set: 0.9124000072479248

CNN model

The CNN model is a simple version of the following:

- Convolutional layer (32 kernels)
- Max pooling
- Convolutional layer (64 kernels)
- Max pooling
- Flattening layer
- Fully connected layer (1024 output units)
- Dropout layer (50% dropping rate)
- Fully connected layer (10 output units, one for each digit)

The number of kernels, dropout rate, and output units are arbitrary here, without any parameter tuning.
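To see how the layer sizes fit together, the following sketch traces the spatial size through the stack and counts trainable parameters. It assumes 5x5 kernels with 'SAME' padding and 2x2 pooling with stride 2, as in the code for this model; the helper function names are my own.

```python
import math


def same_pad_output(size, stride):
    """Spatial output size for 'SAME' padding: ceil(size / stride)."""
    return math.ceil(size / stride)


def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of a square-kernel convolutional layer."""
    return kernel * kernel * in_ch * out_ch + out_ch


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


size = 28
size = same_pad_output(size, 1)   # conv 5x5, 32 kernels -> 28
size = same_pad_output(size, 2)   # max pool 2x2         -> 14
size = same_pad_output(size, 1)   # conv 5x5, 64 kernels -> 14
size = same_pad_output(size, 2)   # max pool 2x2         -> 7
flat = size * size * 64           # length of the flattened feature vector

total = (conv_params(5, 1, 32) + conv_params(5, 32, 64)
         + dense_params(flat, 1024) + dense_params(1024, 10))

print(flat)    # 3136
print(total)   # 3274634
```

Almost all of the parameters sit in the first fully connected layer (3136 inputs by 1024 outputs), which is typical for this kind of architecture.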

After 5000 batches, the test accuracy reached roughly 98.7% to 99%, exceeding the human accuracy quoted above.

x = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])
y_true = tf.placeholder(tf.float32, shape=[None, 10])
hold_prob = tf.placeholder(tf.float32)

cnn = tf.keras.Sequential()
cnn.add(tf.keras.layers.Conv2D(32, kernel_size=5, padding='SAME', activation=tf.nn.relu))
cnn.add(tf.keras.layers.MaxPooling2D(pool_size=[2, 2], strides=[2, 2], padding="SAME"))
cnn.add(tf.keras.layers.Conv2D(64, kernel_size=5, padding='SAME', activation=tf.nn.relu))
cnn.add(tf.keras.layers.MaxPooling2D(pool_size=[2, 2], strides=[2, 2], padding="SAME"))
cnn.add(tf.keras.layers.Flatten())
cnn.add(tf.keras.layers.Dense(1024, activation=tf.nn.relu))
# Note: Keras Dropout interprets its argument as the fraction of units to drop.
cnn.add(tf.keras.layers.Dropout(hold_prob))
cnn.add(tf.keras.layers.Dense(10))

y_pred = cnn(x)

cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred))
optimizer = tf.train.AdamOptimizer(learning_rate=0.0001)
train = optimizer.minimize(cross_entropy)

steps = 5000
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for i in range(steps+1):
        batch_x, batch_y = mnist_conv_onehot.train.next_batch(50)
        sess.run(train, feed_dict={x: batch_x, y_true: batch_y, hold_prob: 0.5})

        # PRINT OUT A MESSAGE EVERY 500 STEPS
        if i % 500 == 0:
            matches = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y_true, 1))
            acc = tf.reduce_mean(tf.cast(matches, tf.float32))
            print('Step {}: accuracy={}'.format(
                i, sess.run(acc, feed_dict={x: mnist_conv_onehot.test.images,
                                            y_true: mnist_conv_onehot.test.labels,
                                            hold_prob: 1.0})))

Step 0: accuracy=0.20010000467300415
Step 500: accuracy=0.9563999772071838
Step 1000: accuracy=0.973800003528595
Step 1500: accuracy=0.9807999730110168
Step 2000: accuracy=0.9815000295639038
Step 2500: accuracy=0.9854000210762024
Step 3000: accuracy=0.9864000082015991
Step 3500: accuracy=0.9868000149726868
Step 4000: accuracy=0.9886000156402588
Step 4500: accuracy=0.9894999861717224
Step 5000: accuracy=0.9865999817848206

Bayesian CNN

I chose TensorFlow Probability to implement the Bayesian CNN purely for convenience and familiarity with TensorFlow.

This package uses the Flipout gradient estimator to minimize the negative ELBO as the loss.

It computes the integration when deriving the posterior distribution.
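The loss has two parts: the negative log-likelihood of the batch and a KL term that pulls the weight posterior towards the prior. The sketch below shows that structure for scalar Gaussian weights, using the closed-form KL between two normal distributions. It is a simplified illustration, not the TensorFlow Probability implementation; the function names and the example numbers (including the training-set size of 55000) are assumptions.

```python
import math


def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for scalars."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


def neg_elbo(batch_nll, posterior, prior, num_train_examples):
    """Per-example negative ELBO: mean NLL plus total KL scaled by dataset size."""
    mu_p, sigma_p = prior
    kl_total = sum(kl_gaussians(mu_q, sigma_q, mu_p, sigma_p)
                   for mu_q, sigma_q in posterior)
    return batch_nll + kl_total / num_train_examples


# Two weights with (mean, std) posteriors and a standard normal prior.
posterior = [(1.0, 1.0), (0.0, 1.0)]
loss = neg_elbo(batch_nll=2.0, posterior=posterior, prior=(0.0, 1.0),
                num_train_examples=55000)   # assumed training-set size

print(kl_gaussians(1.0, 1.0, 0.0, 1.0))  # 0.5
print(loss)
```

Dividing the KL term by the number of training examples keeps it on the same per-example scale as the averaged log-likelihood.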

Other implementations may be more efficient; for example, Shridhar et al 2018 applied the Local Reparameterization Trick to avoid the integration by sampling from an approximation of the posterior distribution.

The code is modified from the provided example here.

The plotting code is taken from the original example and is therefore not displayed here.

images = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])
labels = tf.placeholder(tf.float32, shape=[None,])
hold_prob = tf.placeholder(tf.float32)

# Hyperparameters, defined first so that learning_rate exists when the optimizer is built.
learning_rate = 0.001   # initial learning rate
max_step = 5000         # number of training steps to run
batch_size = 50         # batch size
viz_steps = 500         # frequency at which to save visualizations
num_monte_carlo = 50    # network draws to compute predictive probabilities

# define the model
neural_net = tf.keras.Sequential([
    tfp.layers.Convolution2DReparameterization(32, kernel_size=5, padding="SAME", activation=tf.nn.relu),
    tf.keras.layers.MaxPooling2D(pool_size=[2, 2], strides=[2, 2], padding="SAME"),
    tfp.layers.Convolution2DReparameterization(64, kernel_size=5, padding="SAME", activation=tf.nn.relu),
    tf.keras.layers.MaxPooling2D(pool_size=[2, 2], strides=[2, 2], padding="SAME"),
    tf.keras.layers.Flatten(),
    tfp.layers.DenseFlipout(1024, activation=tf.nn.relu),
    tf.keras.layers.Dropout(hold_prob),
    tfp.layers.DenseFlipout(10)])

logits = neural_net(images)

# Compute the -ELBO as the loss, averaged over the batch size.
labels_distribution = tfp.distributions.Categorical(logits=logits)
neg_log_likelihood = -tf.reduce_mean(labels_distribution.log_prob(labels))
kl = sum(neural_net.losses) / mnist_conv.train.num_examples
elbo_loss = neg_log_likelihood + kl

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(elbo_loss)

# Build metrics for evaluation. Predictions are formed from a single forward
# pass of the probabilistic layers. They are cheap but noisy predictions.
# tf.metrics.accuracy is a streaming metric: it accumulates over every batch
# passed through accuracy_update_op, here the training batches.
predictions = tf.argmax(logits, axis=1)
accuracy, accuracy_update_op = tf.metrics.accuracy(labels=labels, predictions=predictions)

init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())

with tf.Session() as sess:
    sess.run(init_op)

    # Run the training loop.
    for step in range(max_step+1):
        images_b, labels_b = mnist_conv.train.next_batch(batch_size)
        # Held-out batch; not used in the code shown here.
        images_h, labels_h = mnist_conv.validation.next_batch(mnist_conv.validation.num_examples)

        _ = sess.run([train_op, accuracy_update_op],
                     feed_dict={images: images_b, labels: labels_b, hold_prob: 0.5})

        if (step == 0) | (step % 500 == 0):
            loss_value, accuracy_value = sess.run(
                [elbo_loss, accuracy],
                feed_dict={images: images_b, labels: labels_b, hold_prob: 0.5})
            print("Step: {:>3d} Loss: {:.3f} Accuracy: {:.3f}".format(step, loss_value, accuracy_value))

Step: 0 Loss: 161.928 Accuracy: 0.140
Step: 500 Loss: 135.825 Accuracy: 0.858
Step: 1000 Loss: 117.817 Accuracy: 0.907
Step: 1500 Loss: 99.129 Accuracy: 0.927
Step: 2000 Loss: 80.596 Accuracy: 0.938
Step: 2500 Loss: 63.682 Accuracy: 0.946
Step: 3000 Loss: 48.857 Accuracy: 0.950
Step: 3500 Loss: 36.574 Accuracy: 0.954
Step: 4000 Loss: 27.315 Accuracy: 0.957
Step: 4500 Loss: 20.480 Accuracy: 0.959
Step: 5000 Loss: 15.652 Accuracy: 0.961

Results

The regular CNN takes a shorter time to run and achieves better accuracy, compared to the Bayesian CNN using the same model structure.

However, the one advantage the Bayesian CNN brings is a measure of uncertainty for the weights and predictions.

The following plots show how the parameters of the weight posterior distributions converge over the training steps.

At the beginning, the priors dominate the distributions, so they all look similar; by the end, the posteriors differ because they are driven by the data.

Recall the layers in the model:

- Layer 0: Convolutional layer (32 kernels)
- Layer 2: Convolutional layer (64 kernels)
- Layer 5: Fully connected layer (1024 output units)
- Layer 7: Fully connected layer (10 output units, one for each digit)

The graphs below show the uncertainties of prediction at training steps 1, 500 and 5000 (from left to right).

Step 1 shows higher uncertainties; after 500 training batches, the predictions become more confident in general, except for some unclear handwriting.

For example, the second-to-last case is difficult even for humans to be certain about (3 or 5?).

After step 5000, the model's confidence has improved significantly.
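Prediction uncertainty of this kind is usually estimated by running several stochastic forward passes and looking at how much the class probabilities vary. Here is a minimal NumPy sketch of that idea with a toy one-layer model; it is not the plotting code from the original example, and `mc_predict`, the toy shapes, and the random parameters are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Softmax over the last axis, shifted for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def mc_predict(x, mu, sigma, num_draws, rng):
    """Average class probabilities over `num_draws` sampled weight matrices."""
    draws = []
    for _ in range(num_draws):
        w = mu + sigma * rng.standard_normal(mu.shape)  # one posterior sample
        draws.append(softmax(x @ w))
    draws = np.stack(draws)                  # (num_draws, batch, classes)
    return draws.mean(axis=0), draws.std(axis=0)


rng = np.random.default_rng(0)
x = rng.random((2, 4))                       # 2 toy inputs with 4 features
mu = rng.standard_normal((4, 3))             # toy posterior means, 3 classes
sigma = np.full((4, 3), 0.5)                 # toy posterior standard deviations

mean_probs, std_probs = mc_predict(x, mu, sigma, num_draws=50, rng=rng)
print(mean_probs)   # each row sums to 1
print(std_probs)    # spread across draws: larger means less certain
```

A large standard deviation for a class means the sampled networks disagree about it, which is what an ambiguous digit (such as the 3-or-5 case) would produce.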