HandCrafting an Artificial Neural Network
In this article, I have implemented fully vectorized code for an Artificial Neural Network with Dropout and L2 Regularization.
Tirth Patel, Feb 22
In this article, I have implemented fully vectorized Python code for an Artificial Neural Network and tested it on multiple datasets.
Further, dropout and L2 regularization techniques are implemented and explained in detail.
Before reading on, it is highly recommended to review the basic working of an Artificial Neural Network, including forward propagation and backward propagation.
This article is divided into 10 sections:
1. Introduction
2. Prerequisites
3. Importing our Libraries
4. Coding our Activation functions and their Derivatives
5. Our Neural Network Class
6. Initializing Weights and Biases
7. Forward Propagation
8. Cost Function
9. Backpropagation
10. Predicting Labels of a new dataset
Also, any sort of feedback is highly appreciated.
Introduction
The Artificial Neural Network is one of the most beautiful and basic concepts of Supervised Deep Learning.
It can be used to perform multiple tasks like binary or categorical classification.
It seems pretty easy to understand and implement until you start coding it up.
While coding such a network, small problems pop up that lead to big mistakes, and working through them helps you understand concepts that you previously missed.
So, in this article, I have tried to implement an Artificial Neural Network in a way that will hopefully save you the days of work needed to properly code and understand each concept of the topic.
I will be using standard notation and symbols throughout the article.
This article is quite dense, so if you are not comfortable with neural networks and their notation, you will probably find it difficult to understand everything.
So, I would suggest giving it some time and going slowly, referring to the resources I have provided in the article.
The entire code is available on GitHub in the repository tirthasheshpatel/Neural-Network (a handmade neural network for demonstration and teaching purposes).
Prerequisites
I assume that you know what neural networks are and how they learn.
The article will be fairly easy to follow if you are comfortable with Python and libraries like NumPy.
A good knowledge of linear algebra and calculus is also needed to get comfortably through the forward propagation and backpropagation sections.
Moreover, I would highly suggest going through the videos of the courses by Andrew Ng on Coursera.
Importing our libraries
Now, we can start coding a neural network.
The first thing is to import all the libraries that we will need to implement our network.
We will use pandas to import and clean our dataset.
NumPy is the most important library here; we use it for matrix algebra and the other numerical calculations.
The standard-library modules warnings, time, sys, and os are used only occasionally.
Coding our Activation functions and their Derivatives
We will need activation functions later in the article to perform forward propagation.
Also, we need derivatives of the activation functions during backpropagation.
So, let’s code some activation functions up.
We have coded four of the most popular activation functions.
First is the regular old sigmoid activation function.
Then we have ReLU, or “Rectified Linear Unit”.
We will mostly be using this activation function.
Note that ReLU is not differentiable at 0; we define its derivative there to be 0.
We also have an extended version of ReLU called Leaky ReLU.
It works just like ReLU but keeps a small nonzero slope for negative inputs, and it can give better results on some datasets (not necessarily all).
Then we have the tanh (hyperbolic tangent) activation function.
It is also widely used and almost always superior to sigmoid in hidden layers.
Also, PHI and PHI_PRIME are the Python dictionaries containing the activation functions and their derivatives, respectively.
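A minimal sketch of the four activation functions and their derivatives might look like the following. Only the dictionary names PHI and PHI_PRIME come from the article; the function names, the dictionary keys, and the Leaky ReLU slope of 0.01 are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    # The derivative is taken to be 0 at z == 0, as stated in the text.
    return (np.asarray(z) > 0).astype(float)

def lrelu(z, alpha=0.01):
    # alpha is the slope for negative inputs (assumed value).
    return np.where(z > 0, z, alpha * z)

def lrelu_prime(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

def tanh(z):
    return np.tanh(z)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

# Look-up tables so that a layer's activation can be chosen by name.
PHI = {"sigmoid": sigmoid, "relu": relu, "lrelu": lrelu, "tanh": tanh}
PHI_PRIME = {"sigmoid": sigmoid_prime, "relu": relu_prime,
             "lrelu": lrelu_prime, "tanh": tanh_prime}
```

Every function works element-wise on NumPy arrays, so the same code applies to a whole layer for all examples at once.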
Our Neural Network Class
In this section, we will create and initialize our Neural Network class.
Firstly, we will decide which parameters to use during initialization.
We need:
- Number of neurons in each layer
- Activation function we want to use in each layer
- Our matrix of features (X, with features along the rows and examples along the columns)
- Labels corresponding to the matrix of features (y, which is a row vector)
- Method to initialize our weights and biases
- Loss function to use
Keeping this in mind, we can code the class of our Neural Network.
Once we have a properly documented class, we can proceed to initialize the other variables of the network.
We will use ‘self.m’ to store the number of examples in our dataset.
‘self.n’ will store the number of neurons in each layer.
‘self.ac_funcs’ is the Python list of activation functions of each layer.
‘self.cost’ will store the logged values of the cost function as we train our network.
‘self.acc’ will store the logged accuracy achieved on the dataset after training.
Having initialized all the variables of our network, let’s move further to initialize the weights and biases of our network.
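As a rough sketch, a constructor taking the parameters listed above could look like this. The attribute names m, n, ac_funcs, cost, and acc follow the text; the class name, the argument names, their defaults, and the choice to store the input size as the first entry of n are assumptions.

```python
import numpy as np

class NeuralNetwork:
    """Skeleton of the network class (hypothetical names and defaults)."""

    def __init__(self, layer_dims, ac_funcs, X, y,
                 init_method="gaussian", loss_func="b_ce"):
        # X has shape (features, examples); y has one column per example.
        self.X = X
        self.y = y
        self.m = X.shape[1]                        # number of examples
        self.n = [X.shape[0]] + list(layer_dims)   # neurons per layer, input first
        self.ac_funcs = list(ac_funcs)             # one activation name per layer
        self.init_method = init_method             # how to draw initial weights
        self.loss = loss_func                      # name of the loss function
        self.cost = []                             # cost logged during training
        self.acc = []                              # accuracy logged during training
```

For example, `NeuralNetwork([32, 16, 1], ["relu", "relu", "sigmoid"], X, y)` would describe two hidden layers followed by a single sigmoid output.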
Initializing Weights and Biases
The interesting part starts now.
We know that the weights cannot all be initialized to zero: every neuron in a layer would then compute the same output and receive the same gradient, so the network never learns.
So we need some way to break the symmetry and make our neural network learn.
We can draw our random values from a Gaussian (normal) distribution.
As this distribution has a mean of zero, the weights are centered on zero, and with a small spread they stay small.
Small, zero-centered weights keep sigmoid and tanh units away from their flat, saturated regions, so the network starts learning quickly.
We can use the np.random.randn() function to generate random values from the standard normal distribution.
Two lines of code are enough to initialize our weights and biases: the weights are set to random values from the normal distribution, and the biases are initialized to zeros.
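A minimal sketch of this initialization is shown here, assuming n is the list of layer sizes with the input size first. The function name and the 0.01 scaling factor are assumptions; the article only states that the weights are drawn from a normal distribution and the biases are zeros.

```python
import numpy as np

def initialize_parameters(n, scale=0.01):
    """Return lists of weights and biases for layer sizes n (input size first)."""
    # W[l] maps the activations of layer l to layer l + 1.
    W = [np.random.randn(n[l], n[l - 1]) * scale for l in range(1, len(n))]
    # One bias per neuron, broadcast over the examples.
    B = [np.zeros((n[l], 1)) for l in range(1, len(n))]
    return W, B
```

Calling `initialize_parameters([784, 32, 16, 10])` would give weight matrices of shapes (32, 784), (16, 32), and (10, 16).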
Forward Propagation
First, let’s understand forward propagation without any regularization.
Z is the linear hypothesis of each neuron: the weighted sum of the previous layer’s activations plus a bias.
Once we calculate Z, we apply the activation function f to the Z values to get the activations y of each neuron in each layer.
This is the ‘pure vanilla’ forward propagation.
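In vectorized form, these two steps are commonly written as follows. The notation is the standard one and is an assumption here: A stands for the activations that the text calls y, the superscript [l] is the layer index, and X holds one example per column.

```latex
\begin{aligned}
Z^{[l]} &= W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[0]} = X \\
A^{[l]} &= f^{[l]}\left(Z^{[l]}\right), \qquad l = 1, \dots, L
\end{aligned}
```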
But as stated in the paper by N. … (2014), dropout is an amazing technique to improve the generalization of the neural network and make it more robust.
So, let’s first get some intuition of Dropout regularization.
The Essence of Dropout Regularization
Dropout, as its name suggests, refers to “deactivating” some neurons in our neural network and training the rest of the neurons.
To improve performance, we could train tens or hundreds of neural networks with different hyperparameter values, get the output of all the networks, and take their mean to get our final results.
This process is computationally very expensive and is rarely practical.
Hence, we need a way to do something similar at a much lower computational cost.
Dropout regularization achieves a similar effect in a very inexpensive and simple way.
In fact, dropout is such an easy and simple way to improve performance that it has gained a lot of attention and is used in numerous other deep learning models.
To implement dropout, we will use the following approach: for each neuron, we draw a random value from a Bernoulli distribution to decide whether the neuron is kept, and then perform regular forward propagation with the surviving neurons.
Note that we don’t apply dropout when predicting values on a new dataset, that is, at test time.
Code to implement Dropout
We will have keep_prob as the probability of survival of the neurons in each layer.
For every neuron we draw a uniform random number between 0 and 1 and keep the neuron only if that number is below keep_prob, so each neuron survives with probability keep_prob.
Suppose its value is 0.8.
It means that we will deactivate about 20% of the neurons in each layer and train the remaining 80%.
Note that a new random set of neurons is deactivated in every iteration.
This helps the neurons learn features that generalize better to unseen data.
A really intuitive explanation is given in the paper.
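The masking step can be sketched as follows. The function name and the rng argument are hypothetical. The division by keep_prob ("inverted dropout") is also an assumption: it keeps the expected activation unchanged so that nothing needs rescaling at test time, but the article does not say whether its code does this.

```python
import numpy as np

def dropout(a, keep_prob, rng):
    """Apply a Bernoulli(keep_prob) mask to the activations a."""
    # True with probability keep_prob, independently for every entry.
    mask = rng.random(a.shape) < keep_prob
    # Zero the dropped neurons and rescale the survivors (inverted dropout).
    return a * mask / keep_prob, mask
```

With keep_prob = 0.8, roughly 80% of the entries survive, and each survivor is scaled by 1 / 0.8.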
We first initialize the lists that will store the Z and A values.
We then append the linear values of the first layer to Z and the activations of the first layer’s neurons to A.
Here, PHI is a python dictionary containing the activation functions that we coded earlier.
We similarly calculate the values of Z and A for all other layers using a for loop.
Note that we don’t apply dropout in the input layer.
We finally return the calculated values of Z and A.
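Put together, forward propagation with dropout might be sketched like this. The function signature is an assumption, the first layer is folded into the loop instead of being handled separately, dropout is skipped on the output layer, and the surviving activations are rescaled by 1 / keep_prob (inverted dropout); none of these details are confirmed by the text. PHI is reduced to two entries to keep the block self-contained.

```python
import numpy as np

PHI = {
    "relu": lambda z: np.maximum(0.0, z),
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),
}

def forward_prop(X, W, B, ac_funcs, keep_prob=1.0, rng=None):
    """Return the lists Z and A holding the values of every layer."""
    rng = np.random.default_rng() if rng is None else rng
    Z, A = [], []
    a_prev = X                       # the input layer is never dropped
    last = len(W) - 1
    for l, (w, b) in enumerate(zip(W, B)):
        z = w @ a_prev + b           # linear step
        a = PHI[ac_funcs[l]](z)      # activation step
        if keep_prob < 1.0 and l != last:
            mask = rng.random(a.shape) < keep_prob
            a = a * mask / keep_prob
        Z.append(z)
        A.append(a)
        a_prev = a
    return Z, A
```

Leaving keep_prob at 1.0 turns dropout off, which is the behaviour needed at prediction time.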
Cost function
We will use the standard binary/categorical cross-entropy cost function.
We have coded our cost function with L2 Regularization.
The parameter lambda is known as the “penalization parameter”.
It penalizes large weights, which keeps them from growing rapidly and helps the network generalize better.
Here, ‘a’ contains the activation values of the output layer.
We also have the function _cost_derivative to calculate the derivative of the cost function with respect to the activations of the output layer.
We would need that later during backpropagation.
Backpropagation
We need a handful of formulas to perform backpropagation.
We will implement this on a deep neural network.
We will use the fully vectorized forms of these formulas.
Once you understand these formulas, we can go ahead and code them.
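For reference, the vectorized backpropagation equations are commonly written as follows. The notation is an assumption rather than the article's own figure: the script L is the per-example loss, ⊙ is the element-wise product, m is the number of examples, and the λ/m term is the gradient of an L2 penalty of the form λ/(2m) times the sum of squared weights.

```latex
\begin{aligned}
dZ^{[L]} &= \frac{\partial \mathcal{L}}{\partial A^{[L]}} \odot f^{[L]\prime}\left(Z^{[L]}\right) \\
dW^{[l]} &= \frac{1}{m}\, dZ^{[l]} A^{[l-1]\top} + \frac{\lambda}{m} W^{[l]} \\
db^{[l]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)} \\
dZ^{[l-1]} &= \left(W^{[l]\top} dZ^{[l]}\right) \odot f^{[l-1]\prime}\left(Z^{[l-1]}\right) \\
W^{[l]} &\leftarrow W^{[l]} - \alpha\, dW^{[l]}, \qquad b^{[l]} \leftarrow b^{[l]} - \alpha\, db^{[l]}
\end{aligned}
```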
We take epochs, alpha (learning rate), _lambda, keep_prob, and interval as parameters of our function to implement backpropagation.
A description of each of them is given in the documentation comment.
We start with forward propagation.
Then we calculate the derivative of our cost function as delta.
Now, for each layer, we calculate delta_w and delta_b containing the derivative of the cost function with respect to the weights and biases of our network.
Then we update delta, weights, and biases according to their respective formulas.
After updating the weights and biases starting from the last layer to the second layer, we update the weights and biases of the first layer.
We do this for several iterations until the values of weights and biases converge.
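Important Note: A big mistake possible here is updating delta after updating the weights and biases.
The delta of the previous layer must be computed with the weights as they were before the update; otherwise the gradients are wrong, which can lead to a very bad case of the vanishing/exploding gradient problem.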
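The training loop described above can be sketched as follows for the binary cross-entropy case. To keep the block short it leaves out dropout and the logging controlled by interval, and it handles the first layer inside the same loop; the function names and signatures are assumptions, while epochs, alpha, and _lambda follow the text.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

PHI = {"relu": lambda z: np.maximum(0.0, z), "sigmoid": _sigmoid}
PHI_PRIME = {
    "relu": lambda z: (z > 0).astype(float),
    "sigmoid": lambda z: _sigmoid(z) * (1.0 - _sigmoid(z)),
}

def forward(X, W, B, ac_funcs):
    Z, A = [], []
    a = X
    for w, b, f in zip(W, B, ac_funcs):
        z = w @ a + b
        a = PHI[f](z)
        Z.append(z)
        A.append(a)
    return Z, A

def train(X, y, W, B, ac_funcs, epochs=100, alpha=0.1, _lambda=0.0):
    """Full-batch gradient descent; W and B are updated in place."""
    m = X.shape[1]
    for _ in range(epochs):
        Z, A = forward(X, W, B, ac_funcs)
        a_out = np.clip(A[-1], 1e-12, 1.0 - 1e-12)
        # Derivative of the cost w.r.t. the output activations, then w.r.t. Z.
        delta = (-(y / a_out) + (1.0 - y) / (1.0 - a_out)) \
            * PHI_PRIME[ac_funcs[-1]](Z[-1])
        for l in range(len(W) - 1, -1, -1):
            a_prev = A[l - 1] if l > 0 else X
            delta_w = (delta @ a_prev.T) / m + (_lambda / m) * W[l]
            delta_b = np.sum(delta, axis=1, keepdims=True) / m
            if l > 0:
                # Propagate delta with W[l] as it was BEFORE the update.
                delta = (W[l].T @ delta) * PHI_PRIME[ac_funcs[l - 1]](Z[l - 1])
            W[l] = W[l] - alpha * delta_w
            B[l] = B[l] - alpha * delta_b
    return W, B
```

Note the order inside the inner loop: delta for the previous layer is computed before W[l] is overwritten, which is exactly the pitfall the important note warns about.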
Most of our work is done here, but we still need to code a function that can predict results on a new dataset.
Hence, as our last step, we will code a function to predict labels of a new dataset.
Predicting Labels of a new dataset
This step is pretty straightforward.
We just need to perform forward propagation but without Dropout Regularization.
We do not apply dropout regularization at test time, as we need all the neurons of all the layers to provide us with proper results and not just some random values.
We return the activations of the output layer as the result.
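A prediction function along these lines might look like the sketch below: a plain forward pass with no dropout mask that returns the output-layer activations. The function name and signature are assumptions, and PHI is reduced to two entries so that the block runs on its own.

```python
import numpy as np

PHI = {
    "relu": lambda z: np.maximum(0.0, z),
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),
}

def predict(X_new, W, B, ac_funcs):
    """Return the output-layer activations for X_new (features x examples)."""
    a = X_new
    for w, b, f in zip(W, B, ac_funcs):
        a = PHI[f](w @ a + b)   # every neuron is used; no dropout here
    return a
```

To turn the activations into labels, one would typically threshold them at 0.5 for binary classification, or take the index of the largest activation in each column for categorical classification.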
Entire Code
The entire code to implement an Artificial Neural Network yourself is available in the GitHub repository mentioned earlier.
I have added certain pieces of code for printing the cost and accuracy of our network as we train it.
Other than that, everything is the same.
Congratulations! We have finally finished coding our neural network.
Now, we can test our network on different datasets.
Testing our Neural Network
We will test our network on the famous MNIST dataset for digit classification.
We will only use 8000 images to train our network and predict on 2000 other images.
You can get the dataset on Kaggle.
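Since the network expects features along the rows and examples along the columns, the data has to be reshaped before training. One way to prepare such a split is sketched here, assuming the images are already loaded as an array of shape (examples, 784) with integer labels. The function name, the scaling by 255, and the one-hot encoding of the labels are assumptions, not steps taken from the article.

```python
import numpy as np

def prepare_split(pixels, labels, n_train=8000, n_test=2000,
                  num_classes=10, seed=0):
    """Shuffle, scale, and split into train and test sets (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(pixels.shape[0])[: n_train + n_test]
    # Transpose to (features, examples) and scale pixel values to [0, 1].
    X = pixels[idx].T.astype(float) / 255.0
    # One-hot encode the labels as (classes, examples).
    Y = np.eye(num_classes)[labels[idx]].T
    return X[:, :n_train], Y[:, :n_train], X[:, n_train:], Y[:, n_train:]
```

With the defaults, this yields a training matrix of shape (784, 8000) and a test matrix of shape (784, 2000).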
I have trained a neural network with two hidden layers of 32 and 16 neurons.
I have used the ReLU activation function in both hidden layers.
I trained the network for 2500 epochs with penalization parameter 1.0 and learning rate 0.1.
[Figure: Cost vs. Epochs]
We achieve pretty good accuracy on both the training and test sets.
We could likely achieve higher accuracy by tuning the hyperparameters with techniques such as grid search or randomized search.
Also, feel free to try different values of hyperparameters, activation functions, and datasets.
If you think the code could be improved, do share your suggestions on GitHub or here in the comment section.
Any sort of feedback is highly appreciated.
Some Challenges for ya
If you have understood the code for the neural network that I have provided above, then here are a few more changes you can make to improve it.
- Try to code the softmax activation function and get it working.
- Say I want to deactivate 30% of the neurons in the first layer and 50% in the second. Try to code a network in which I can use different values of keep_prob for each layer.
- Try implementing the mini-batch gradient descent algorithm. It works out really nicely for handwritten digit classification.
I hope you enjoyed the article and challenges.
Wish you a happy data science journey!