Guide to Coding a Custom Convolutional Neural Network in TensorFlow Core
Tutorial for Developing in the Low-Level API
Andrew Kruger
Mar 12
The following demonstrates how to use the low-level TensorFlow Core to create Convolutional Neural Network (ConvNet) models without high-level APIs such as Keras.
The goal of this tutorial is to provide a better understanding of the background processes in a deep neural network and to demonstrate how to use TensorFlow to create custom code.
This tutorial will show how to load the MNIST handwritten digit dataset into a data iterator, use graphs and sessions, create a novel ConvNet architecture, train the model with different options, make predictions, and save the trained model.
The complete code will then be provided along with the equivalent model in Keras, both to allow a direct comparison for those who use Keras and to show how powerful high-level APIs are for creating neural networks.
MNIST Dataset
MNIST is a set of 28×28 grayscale images of handwritten numbers, with 60,000 images in the train set and 10,000 images in the test set.
First I’ll load and process the MNIST images.
The model will expect the input shape to be [batch_size, height, width, channels].
Since the images are grayscale (single-channel), they have shape [60000,28,28] so they need a channel dimension added to have shape [60000,28,28,1].
They also have type uint8 (pixel values range from 0–255) so they need to be scaled to range between 0–1 by dividing by 255.
The first image is then displayed as an example.
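The two preprocessing steps above can be sketched in NumPy as follows. This is only an illustration on a small synthetic array standing in for the MNIST images (the names raw_images and images are assumptions, not taken from the original code).

```python
import numpy as np

# Synthetic stand-in for the MNIST training images: uint8 pixels in [0, 255].
rng = np.random.default_rng(0)
raw_images = rng.integers(0, 256, size=(6, 28, 28), dtype=np.uint8)

# Add a trailing channel axis: [N, 28, 28] -> [N, 28, 28, 1].
images = raw_images[..., np.newaxis]

# Convert to float32 and scale the pixel values from [0, 255] to [0, 1].
images = images.astype(np.float32) / 255.0

print(images.shape, images.dtype)
```

The same two lines would apply unchanged to the full [60000,28,28] array.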
The labels are integer values (e.g. 0, 1, 2), but they represent categorical values, so they need to be one-hot encoded (e.g. [1,0,0], [0,1,0], [0,0,1]) for training.
Keras has an encoder to_categorical that I’ll use to convert the labels (Scikit-learn also has OneHotEncoder as another option).
As will be seen below, we will not need to convert the test labels to be one-hot encoded.
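To make the encoding concrete, here is a minimal NumPy sketch of one-hot encoding. The helper one_hot is a hypothetical stand-in written for illustration; it is not the Keras to_categorical implementation, although it should produce the same kind of array for integer labels.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a [len(labels), n_classes] float32 array with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.size, n_classes), dtype=np.float32)
    # Set the column given by each label to 1 in the matching row.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example = one_hot([0, 1, 2], n_classes=3)
print(example)
```

Taking the argmax of each row recovers the original integer label, which is why the test labels can be left as integers.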
Graphs and Sessions
There are two parts to making a model with TensorFlow: creating a graph, and running the graph in a session.
The graph outlines the computation dataflow.
It organizes when and how operations will be performed on tensors (multi-dimensional data arrays).
The graph can be created inside or outside of a session, but the graph can only be used inside a session.
It’s in the session that tensors are initialized, the operations are performed, and the models are trained.
Data Iterator
To demonstrate a dataflow graph and session, I’ll create a dataset iterator.
Since the MNIST images and ground truth labels are slices of NumPy arrays, a dataset can be created by passing the arrays into the method tf.
We can then create an iterator for the dataset.
We want it to return a batch_size number of images and labels each time it’s run, and to repeat for an indefinite number of epochs.
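As a rough analogy only, the batching and repeating behaviour can be mimicked with a plain Python generator. This is not the TensorFlow dataset API: batch_iterator is a hypothetical helper, and it assumes batches wrap around the end of the data when repeating. Like a graph node, defining the generator produces no data until it is actually run.

```python
import numpy as np

def batch_iterator(images, labels, batch_size):
    """Yield (images, labels) batches in order, repeating indefinitely."""
    n = len(images)
    start = 0
    while True:
        # Wrap around the end of the data so that every batch is full.
        idx = [(start + i) % n for i in range(batch_size)]
        yield images[idx], labels[idx]
        start = (start + batch_size) % n

# Small stand-in dataset: 10 blank images labelled 0..9.
images = np.zeros((10, 28, 28, 1), dtype=np.float32)
labels = np.arange(10)

iterator = batch_iterator(images, labels, batch_size=4)
first_images, first_labels = next(iterator)
second_images, second_labels = next(iterator)
third_images, third_labels = next(iterator)
print(first_labels, second_labels, third_labels)
```

Creating a new generator restarts from the first batch, which loosely mirrors what happens when an iterator is re-initialized in a new session.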
We have now defined a dataflow graph, and to get a new batch of images and labels, we can run data_batch in a session.
But, not yet.
The dataflow graph is made, but the images haven’t actually been passed in yet.
To do this, the iterator needs to be initialized in a session.
Now a session is running and the first batch of images can be retrieved just by running data_batch.
The images will have shape [batch_size, height, width, channels] and the labels have shape [batch_size, classes].
We can check this by:
Images shape: (128, 28, 28, 1)
Labels shape: (128, 10)
We can display the first image in the first two batches by running data_batch twice.
In the first batch, the first image was of a 5.
In the second batch, the first image was of a 1.
The session can then be closed by sess.
However, keep in mind that the information will be lost when the session is closed.
For example, if we close and restart the session as shown below, the data iterator will restart from the beginning.
(Note the iterator needs to be initialized in each session.)
Because the session was closed and a new session was created, the data iterator restarted from the beginning and displayed the same image again.
With “with”
A session can also be started and automatically closed by a “with” statement.
At the end of the “with” block, the session is closed as indicated.
ConvNet Model
The following will demonstrate how to use TensorFlow to build a basic ConvNet.
The architecture has four convolution layers.
The first two layers have 16 filters, the second two have 32 filters, and all filters are of size 3×3.
Each of the four convolution layers also has a bias added and a ReLU activation.
The last two layers are fully-connected (dense) layers.
Weights and Biases
The initial weights in a ConvNet need to be randomized values for symmetry breaking so the network will be able to learn.
The xavier_initializer “is designed to keep the scale of the gradients roughly the same in all layers” and is typically used in initializing weights for models.
We can create the weights for the layers by using this initializer with tf.
Each convolution layer has filters with shape [filter_height, filter_width, in_channels, out_channels].
Since the dense layers are fully connected, and do not have a 3×3 filter, their shape is simply [in_channels, out_channels].
Biases are also created, each having the same size as the out_channels of the corresponding layer and initialized with zeros.
I’ll create weights and biases dictionaries for organization and simplicity.
Because the MNIST images are grayscale, the in_channels for the first layer is 1.
The output layer needs to have out_channels as 10 because there are 10 classes.
The number of filters in the other layers can be tuned for performance or speed, but the in_channels of each layer needs to be the same as the out_channels of the previous layer.
The size of 7*7*32 for the first dense layer will be explained when the ConvNet is created below.
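The shapes described above can be laid out as a pair of dictionaries. The sketch below uses NumPy rather than TensorFlow variables, and it makes several assumptions: only the key 'd1' appears in the text, so the other keys are hypothetical, the hidden size n_dense is a made-up value, and xavier_uniform is the standard Glorot uniform formula rather than TensorFlow's own initializer.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(shape):
    """Glorot/Xavier uniform init: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    receptive_field = int(np.prod(shape[:-2]))  # 1 for dense layers, 9 for 3x3 filters
    fan_in = receptive_field * shape[-2]
    fan_out = receptive_field * shape[-1]
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape).astype(np.float32)

n_dense = 64  # hypothetical hidden size; the text does not state it

# Conv shapes: [filter_height, filter_width, in_channels, out_channels].
# Dense shapes: [in_channels, out_channels].
weight_shapes = {
    'c1': (3, 3, 1, 16),
    'c2': (3, 3, 16, 16),
    'c3': (3, 3, 16, 32),
    'c4': (3, 3, 32, 32),
    'd1': (7 * 7 * 32, n_dense),
    'out': (n_dense, 10),
}

weights = {name: xavier_uniform(shape) for name, shape in weight_shapes.items()}
# One zero-initialized bias per output channel of each layer.
biases = {name: np.zeros(shape[-1], dtype=np.float32)
          for name, shape in weight_shapes.items()}

for name in weight_shapes:
    print(name, weights[name].shape, biases[name].shape)
```

Keeping the shapes in one dictionary makes it easy to check that each layer's in_channels matches the previous layer's out_channels.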
Convolution Layers
TensorFlow has a tf.nn.conv2d function that can be used for convolving the tensors with the weights.
To simplify the convolutional layers, I’ll create a function that takes the input data x and applies a 2D convolution with weights W, adds a bias b, then applies the ReLU activation.
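To show what such a layer computes, here is a naive NumPy version of convolution, bias, and ReLU. It is a sketch for intuition only, not the TensorFlow function: conv2d_relu is a hypothetical name, and it assumes a stride of 1 with zero padding that keeps the height and width unchanged (which matches the layer sizes described in this tutorial, but is an assumption about the original code).

```python
import numpy as np

def conv2d_relu(x, W, b):
    """Naive stride-1, zero-padded 2D convolution followed by bias and ReLU.

    x: [batch, height, width, in_channels]
    W: [filter_height, filter_width, in_channels, out_channels]
    b: [out_channels]
    """
    fh, fw, _, out_c = W.shape
    batch, h, w, _ = x.shape
    ph, pw = fh // 2, fw // 2  # symmetric padding, valid for odd filter sizes
    x_pad = np.pad(x, ((0, 0), (ph, ph), (pw, pw), (0, 0)))
    out = np.zeros((batch, h, w, out_c), dtype=np.float32)
    for i in range(fh):
        for j in range(fw):
            # Shifted view of the input aligned with filter position (i, j).
            patch = x_pad[:, i:i + h, j:j + w, :]
            # Contract over in_channels: [batch, h, w, in] x [in, out].
            out += np.tensordot(patch, W[i, j], axes=([3], [0]))
    return np.maximum(out + b, 0.0)

x = np.ones((1, 4, 4, 1), dtype=np.float32)
W = np.ones((3, 3, 1, 2), dtype=np.float32)
b = np.array([0.0, -100.0], dtype=np.float32)
y = conv2d_relu(x, W, b)
print(y.shape)
```

With all-ones inputs, the first output channel simply counts how many input pixels fall under the filter, and the large negative bias on the second channel shows the ReLU clipping values to zero.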
Model Graph
Another function can be used to make the model graph.
This will take in the MNIST images as data and use the weights and biases in the different layers.
As shown in the architecture, the layers decrease in height and width after the second and fourth convolutional layers.
Max pooling will slide a window and use only the single maximum value within the region.
By using a 2×2 window (ksize=[1,2,2,1]) and stride of 2 (strides=[1,2,2,1]), the dimensions will be reduced by half.
This helps reduce the model size while keeping the most important features.
For odd input sizes, the output size is the ceiling of the input size divided by 2 (e.g. ceil(7/2) = 4).
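The pooling arithmetic can be checked with a short NumPy sketch. The functions pooled_size and max_pool_2x2 are hypothetical helpers, and the ceiling behaviour assumes the pooling pads odd-sized inputs at the bottom and right edges rather than discarding the last row and column.

```python
import math
import numpy as np

def pooled_size(size, stride=2):
    """Output length along one axis when odd sizes are padded: ceil(size / stride)."""
    return math.ceil(size / stride)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a float array [batch, height, width, channels]."""
    batch, h, w, c = x.shape
    out_h, out_w = pooled_size(h), pooled_size(w)
    # Pad odd sizes with -inf so that padded cells never win the max.
    x_pad = np.pad(
        x,
        ((0, 0), (0, out_h * 2 - h), (0, out_w * 2 - w), (0, 0)),
        constant_values=-np.inf,
    )
    # Split height and width into (blocks, 2) and take the max inside each block.
    return x_pad.reshape(batch, out_h, 2, out_w, 2, c).max(axis=(2, 4))

x = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)
pooled = max_pool_2x2(x)
print(pooled[0, :, :, 0])
print(pooled_size(28), pooled_size(14), pooled_size(7))
```

Each 2×2 block is replaced by its largest value, so a 28×28 input becomes 14×14 and then 7×7 after a second pooling step.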
After the fourth convolution layer, the tensor needs to be reshaped before the fully connected layer so it is flattened.
The weights for the fully connected layer are not for a 2D convolution, but rather just a matrix multiplication so the conv2d function is not used.
An example dropout layer is added between the last two layers to help reduce overfitting, where the probability of dropping an activation is 0.
The dropout layer should only be used during training, so a training flag is included to bypass the layer if the model is being used for predictions.
If the dropout layer were included during predictions, the output from the model would be inconsistent and have a lower accuracy due to the randomly dropped activations.
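The role of the training flag can be illustrated with a small NumPy sketch of dropout. The dropout function and the rate of 0.25 are assumptions chosen for illustration; the sketch uses the common "inverted" form, where the kept values are rescaled so the expected output stays the same.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Zero a random fraction `rate` of x and rescale the rest; identity when not training."""
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    # Each element is kept independently with probability keep_prob.
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
x = np.ones((4, 8))

out_train = dropout(x, rate=0.25, training=True, rng=rng)   # random zeros, others scaled up
out_infer = dropout(x, rate=0.25, training=False, rng=rng)  # unchanged
print(out_train)
print(out_infer)
```

Running the training branch twice gives different outputs for the same input, which is exactly the inconsistency to avoid when making predictions.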
The shapes for the different layers are given in the code comments.
Starting with a 28×28 image, the size is reduced by half twice by the two max pooling layers so its width and height are reduced to size (28/2)/2 = 7.
Before being used in the dense layers, the output of the fourth convolutional layer needs to be flattened with a reshape.
Since there are 32 filters in the fourth convolutional layer, each flattened image has 7×7×32 = 1568 values (which is why that size was used as the first dimension of weights['d1']).
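The shape bookkeeping above can be traced in a few lines of plain Python. The function trace_shapes is a hypothetical helper, and it assumes the convolutions preserve height and width so that only the pooling steps after the second and fourth layers shrink the image.

```python
import math

def trace_shapes(height=28, width=28, filters=(16, 16, 32, 32), pool_after=(2, 4)):
    """Return the (height, width, channels) shape after each convolution layer,
    including any 2x2 pooling that directly follows it."""
    shapes = []
    for layer, n_filters in enumerate(filters, start=1):
        # The convolution itself is assumed to keep height and width unchanged.
        if layer in pool_after:
            height, width = math.ceil(height / 2), math.ceil(width / 2)
        shapes.append((height, width, n_filters))
    return shapes

shapes = trace_shapes()
final_h, final_w, final_c = shapes[-1]
flat_size = final_h * final_w * final_c
print(shapes)
print(flat_size)
```

The last entry gives the 7×7×32 block that is flattened before the first dense layer.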
Building the ConvNet Graph
Next we can construct the dataflow graph that will be used for training the model.
An important concept is using placeholders, so I will first create the ConvNet using them.
However, I will also show how the model can be made without placeholders (see Training Without Placeholders below).
Placeholders
We need to define how the data from the data iterator will be fed into the model graph.
For this, we can make a placeholder that indicates that some tensor with shape [batch_size, height, width, channels] will be fed into the conv_net function.
We can set the height and width to 28 and the channels to 1, and leave the batch size as variable by setting it to None.
We can now use the Xtrain placeholder with the weights and biases dictionaries as input to the conv_net function to get the output logits.
This step is going to allow us to train the model by feeding an image or batches of images into the Xtrain placeholder (see Feeding the Images below).
Loss
The model is classifying an image based on a set of 10 mutually-exclusive classes.
During training, the logits can be converted to a relative probability that the image belongs to each class by using the softmax, then the loss can be calculated by the cross-entropy of the softmax.
These can both be done in one step using TensorFlow’s softmax cross entropy.
Since we are measuring the loss for a batch of images, we can use tf.reduce_mean to get the mean loss for the batch.
We can again use a placeholder ytrain for the graph that will be used to feed in the MNIST labels.
Recall the labels are one-hot encoded and there is one label for each image in the batch, so the shape should be [batch_size, n_classes].
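For reference, the loss being computed can be written out in NumPy. This is a sketch of the math only, not TensorFlow's implementation; softmax_cross_entropy is a hypothetical helper and the logits are made-up values.

```python
import numpy as np

def softmax_cross_entropy(logits, onehot_labels):
    """Per-example cross-entropy between one-hot labels and softmax(logits)."""
    # Subtract the row maximum for numerical stability; softmax is unchanged by this shift.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(onehot_labels * log_probs).sum(axis=1)

logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
onehot = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

per_example = softmax_cross_entropy(logits, onehot)
loss = per_example.mean()  # mean loss over the batch
print(per_example, loss)
```

The second example has equal logits for all three classes, so its loss is log(3), the loss of a uniform guess; the first example is lower because the correct class has the largest logit.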
Optimizer
The loss is used to update the model weights and biases by backpropagating the loss through the network.
A learning rate is used to scale down the updates, which keeps the weights from diverging or jumping around the optimal values.
The Adam optimizer was derived from a combination of Adagrad and RMSProp with momentum, and has proven to consistently work well with ConvNets.
Here I use the Adam optimizer with a learning rate of 1e-4.
The gradients are then calculated and used to update the weights by running the minimize method.
The train_op is now the key to training the model since it runs optimizer.
To train the model, the images just need to be fed into the graph, and then train_op is run to update the weights based on the loss.
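To give a feel for what a single optimizer update does, here is the textbook Adam step written in NumPy and applied to a toy one-parameter problem. This is a sketch of the published update rule with commonly used default hyperparameters, not TensorFlow's implementation; adam_step and the toy loss are assumptions made for illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize (w - 3)^2 starting from w = 0.
w, m, v = 0.0, 0.0, 0.0
history = []
for t in range(1, 101):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t)
    history.append(w)

print(history[0], history[-1])
```

Because the update is normalized by the gradient magnitude, each early step moves the parameter by roughly the learning rate, which is why a small value such as 1e-4 gives slow but stable progress.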
Training the Model
Accuracy
We can use the test images to measure the accuracy during the training.
For each image, we perform the model inference and use the class with the maximum logit value as the prediction.
The accuracy is then the fraction of the predictions that are correct.
We need to use the integer values instead of the one-hot encodings for comparing the predictions to the labels, so we need to convert them by using the argmax (this is why test_labels was not converted to one-hot encoding, otherwise it would have to be converted back).
The accuracy can be measured with TensorFlow’s accuracy operator, which has two outputs.
The first output is the accuracy without updating the metrics, and the second output is the one we’ll use, which returns the accuracy with updated metrics.
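The idea of an accuracy that is updated batch by batch can be sketched in NumPy with a running total and count. RunningAccuracy is a hypothetical class written for illustration and only approximates how the TensorFlow operator behaves.

```python
import numpy as np

class RunningAccuracy:
    """Accumulates correct predictions (total) and examples seen (count) across batches."""

    def __init__(self):
        self.total = 0.0
        self.count = 0.0

    def result(self):
        """Current accuracy without changing the accumulated metrics."""
        return self.total / self.count if self.count else 0.0

    def update(self, logits, int_labels):
        """Add a batch to the metrics and return the updated accuracy."""
        predictions = np.argmax(logits, axis=1)  # class with the maximum logit
        self.total += float(np.sum(predictions == np.asarray(int_labels)))
        self.count += float(len(int_labels))
        return self.result()

metric = RunningAccuracy()
acc_after_first = metric.update(np.array([[0.1, 0.9], [0.8, 0.2]]), [1, 1])
acc_after_second = metric.update(np.array([[0.3, 0.7], [0.6, 0.4]]), [1, 0])
print(acc_after_first, acc_after_second)
```

The labels are compared as integers, so no one-hot encoding is needed on the test side.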
Initializing Variables
The local variables (temporary variables such as the total and count metrics in the accuracy) and global variables (e.g. model weights and biases) need to be initialized so they can be used in the session.
Feeding the Images
Recall that for each batch, a new set of images and ground truths is created that needs to be fed into the placeholders Xtrain and ytrain.
As shown above, we can get the batch_images and batch_labels from data_batch.
They can then be fed into the placeholders by creating a dictionary that has the placeholders as keys and the data as values, then passing the dictionary into the feed_dict parameter when running train_op in the sess.
Training Session
By calling the train_op in the session, all steps in the graph are run as outlined above.
I’ll use tqdm to view the training progress during each epoch.
After each epoch, the accuracy is measured and printed out.
The model will continue training for the number of epochs given.
After 5 epochs, this model has an accuracy of about 0.
Training Without Placeholders
The example above demonstrated placeholders for feeding in the images, but this graph could be made even more concise by directly using batch_images in the conv_net and batch_labels into the labels parameter of tf.
Then the model could be trained without having to feed any tensors into the graph as shown here.
Accessing Other Nodes
The different nodes in the graph can be accessed by calling them in a session, either individually or in a list.
If a list of nodes is run, then a list of outputs will be given.
For example, let’s say we want to see the loss for each image batch during training.
This can be done by running the loss in the session in a list with train_op.
Here I will use tqdm to print the loss for each batch during the epoch.
Predictions
Test Predictions
We can use the test_predictions from the test accuracy to visualize the performance in classifying some example images.
Here I’ll display the first 25 test images and use the integer predictions as the titles.
Predicting New Image Batches
If you want to predict a new set of images, the predictions can be made by simply running the conv_net function on a batch of images in a session.
For example, the following can be used to get the predictions for the first 25 images in the test set.
Then the predictions can be used just the same as above for displaying the images.
Predicting a Single Image
Recall that the model expects the input to have shape [batch_size, height, width, channels].
This means, if a prediction is being made on a single image, the image needs to have its dimensions expanded to have the first dimension (the batch_size) as one.
Here the 1000th image in the test set is predicted.
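The dimension expansion can be sketched in NumPy as follows. The array test_images here is a blank stand-in for the real test set, and the index is only an example.

```python
import numpy as np

# Blank stand-in for the preprocessed test images: [N, 28, 28, 1].
test_images = np.zeros((1001, 28, 28, 1), dtype=np.float32)

idx = 1000  # example index into the stand-in array
single_image = test_images[idx]                      # shape (28, 28, 1)
batch_of_one = np.expand_dims(single_image, axis=0)  # shape (1, 28, 28, 1)

print(single_image.shape, batch_of_one.shape)
```

Slicing with test_images[idx:idx + 1] would give the same four-dimensional shape without an explicit expand step.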
Prediction Probabilities
The normalized probability distribution of the predictions can be calculated using the softmax on the logits from the conv_net model.
The model predicts with high probability that the number is a 9.
There are numbers that aren’t as easy for the model to predict.
For example, here’s another image (idx=3062) in the test set.
While the model does predict an 8 as the most likely class, it had a probability of only 0.48 and there were several other numbers with increased probabilities.
Here’s an example of an incorrect prediction (idx=259).
The model incorrectly predicted a 0, with the number 6 being the second highest prediction.
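Ranking the classes by probability, as in the examples above, can be sketched in NumPy. The logits below are made-up values chosen so that class 0 ranks first and class 6 second; they are not the model's actual output for any test image.

```python
import numpy as np

def softmax(logits):
    """Convert logits to probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical logits for one image over the 10 digit classes.
logits = np.array([[2.0, -1.0, 0.5, 0.0, -0.5, 0.3, 1.5, -2.0, 0.1, -0.3]])
probs = softmax(logits)[0]

# Classes sorted from most to least likely.
ranking = np.argsort(probs)[::-1]
top_class, runner_up = int(ranking[0]), int(ranking[1])
print(top_class, runner_up)
print(np.round(probs, 3))
```

Looking at the runner-up and the gap between the top two probabilities is a simple way to spot images the model is unsure about.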
Saving the Model
As discussed, the predictions can be made as long as the session is still open.
If the session is closed, the trained weights and biases will be lost.
The session and model graph, along with the trained weights and biases, can be saved as a checkpoint by using tf.
Complete Code Example
Combining the information above, the following would be an example script to create, train, and save a ConvNet and display the predictions for a sample of handwritten digits.
Equivalent Model in Keras
Keras is a high-level API that can be used to vastly simplify the above code by using its Sequential model API (the Functional API is also an option for more complex neural networks).
This is shown as a helpful comparison for those with experience using this API, and to demonstrate its value to those who have not used it.
Notice that a session does not need to be used.
Also, the model weights and biases do not need to be defined outside the model.
They are created when adding the layers, and their shapes are automatically calculated.
Summary
The TensorFlow graphs create the computational dataflow for how the model will be trained, and the sessions are used to do the actual training.
This article demonstrated different steps and options used in creating the graphs and running them in a training session.
When you have a general understanding of how to create the graphs and use them in sessions, it becomes easier to develop custom neural networks and use TensorFlow Core to meet your specific needs.
Also, as can be seen, Keras is much more concise and it’s clear why this API is favored for testing new neural networks.
However, many of the steps may be black boxes for new AI developers, and this tutorial was meant to help show what’s happening in the background.