There are some standard rules we can follow when deciding how many neurons (those circles in each layer) to include in each layer.

And in fact, we are going to deviate from the original plan above and make changes to the network based on our needs.

Instead of using 3 input neurons we use 4, instead of using 4 neurons in the hidden layer we use 8, and instead of using 2 output neurons we use 3.

These are all calculated decisions.

Let’s discuss how we decided to do so.

The size of our input layer is decided by the number of features in our data set.

For example, if we are trying to classify the traditional Iris data set, we include one input neuron per desired feature.

In the Iris data set, this means using all four of the data set's well-known features: sepal length, sepal width, petal length, and petal width.
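With scikit-learn, loading those four features is a one-liner, and the shape of the data confirms the input-layer size:

```python
from sklearn.datasets import load_iris

# Load the classic Iris data set
iris = load_iris()
X, y = iris.data, iris.target

# Four features per observation -> four input neurons
print(iris.feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(X.shape)  # (150, 4)
```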

[Figure: a few rows of our data set, identifying features and outputs]

Ultimately, hidden layers will come down to trial and error.

Hidden layers are a contentious topic with a mountain of research behind them.

Generally, a single hidden layer is sufficient.

0 hidden layers can only work for linearly separable situations.

1 hidden layer allows mapping from one finite space to another.

2 hidden layers are suitable for situations that involve arbitrary decision boundaries.

The term “deep learning” is derived from how deep your model is, i.e. how many hidden layers the model employs.

The number of neurons in the hidden layer(s) also involves trial and error, but there are some popular cheat sheets floating around for choosing the number of neurons in a hidden layer, such as the following:

Nh = Ns / (α · (Ni + No))

where Nh is the number of neurons in the hidden layer, Ns is the number of samples in the training data, α (alpha) is an arbitrary scaling factor, Ni is the number of input neurons, and No is the number of output neurons.
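As a quick check of the rule of thumb, the numbers for the Iris setup work out as follows (the choice of α here is an assumption, since the factor is arbitrary):

```python
def hidden_neurons(n_samples, n_in, n_out, alpha=2):
    """Rule-of-thumb estimate: Nh = Ns / (alpha * (Ni + No))."""
    return n_samples / (alpha * (n_in + n_out))

# 150 Iris samples, 4 input neurons, 3 output neurons, alpha = 2
print(hidden_neurons(150, 4, 3, alpha=2))  # about 10.7 hidden neurons
```

A slightly larger α lands near the 8 hidden neurons we chose above.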

This is a very baseline approach; choosing the number of layers and neurons in each layer is a bit more nuanced than our broad example above, and the following article is useful for a deeper dive. The advanced user should employ more sophisticated methods for picking the number of neurons in the hidden layer: https://towardsdatascience.com/beginners-ask-how-many-hidden-layers-neurons-to-use-in-artificial-neural-networks-51466afa0d3e

For the output layer, the number of neurons will vary based on our desired outcome.

In the case of a regressor the number of neurons will be one.

And for multiclass classification with softmax, more than one neuron will exist: one node per class label.

If we are not using softmax we’ll still use a single node for classification.

But we are going to use softmax so let’s see how it would look for the Iris data set.

In the Iris data set we want to determine if a single observation is Virginica, Versicolor, or Setosa.

In this case we would want our output layer to include 3 neurons, one for each of the 3 classes.
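Those class labels then need to be one-hot encoded so that each sample's target lines up with the 3 output neurons. A minimal NumPy sketch (Keras also offers `to_categorical` for the same purpose):

```python
import numpy as np

# Iris class labels: 0 = Setosa, 1 = Versicolor, 2 = Virginica
y = np.array([0, 1, 2, 1])

# One row per sample, one column per class; a 1 marks the sample's class
one_hot = np.eye(3)[y]
print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```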

Let’s briefly discuss softmax and discover why it is such a popularly used activation function in the output layer.

Softmax is used for various multiclass classification methods, neural networks being one such popular usage.

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)

At its core, softmax allows us to suppress lower values and highlight larger values.

We use softmax because the output of the function allows us to examine a probability distribution over various different outcomes.

This is obviously very powerful because it enables many of the complex machine learning and deep learning problems we are able to solve through multiclass classification.

This makes softmax great for use on the Iris dataset.

For something like binary classification, we would see sigmoid referenced as sigmoid is simply a specific case of softmax where the number of classes is reduced to two.
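To make the “suppress lower values, highlight larger values” behavior concrete, here is a minimal NumPy implementation of softmax:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # the largest score claims most of the probability mass
print(probs.sum())  # sums to (numerically) 1: a valid probability distribution
```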

In our hidden layer we pick ReLU as the activation function.

ReLU is the most used activation function in the hidden layer (and for good reason).

ReLU results in faster learning versus many of the popular choices because it is cheap to compute and its gradient does not saturate for positive inputs.

ReLU makes sense to me coming from an Electrical Engineering background as it is a similar concept to a rectifier circuit.

In ReLU, the negative components are set to 0.

The ReLU function is simply f(x) = max(0, x): flat at zero for negative inputs and linear for positive inputs.

For a deeper dive into activation functions and to empower your choice each time you build your networks, check out this great article.
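The rectifying behavior is one line of NumPy:

```python
import numpy as np

def relu(x):
    # Negative components are set to 0; positives pass through unchanged
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))  # negatives clipped to 0
```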

Now it’s time to pick an optimizer.

We select Adam because of its popularity and its relevance to my previous article on moving averages.

That article was on moving averages in finance but the idea stays the same.

Adam is chosen because of its efficiency and low memory requirements.

You may have heard of stochastic gradient descent.

Adam is more efficient for a number of reasons.

First, unlike stochastic gradient descent, which applies a single fixed learning rate to all parameters, Adam adapts a learning rate per parameter using exponential moving averages.

Specifically, Adam maintains moving averages of both the gradient (the first moment) and the squared gradient (the second moment), and uses them to adapt the learning rate throughout the learning process.

You can check out the paper introducing Adam here.
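To illustrate those moment estimates, here is a toy sketch of a single Adam update (not the library implementation; the hyperparameter defaults are the ones given in the Adam paper):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient grad at step t."""
    m = beta1 * m + (1 - beta1) * grad      # EMA of the gradient (1st moment)
    v = beta2 * v + (1 - beta2) * grad**2   # EMA of squared gradient (2nd moment)
    m_hat = m / (1 - beta1**t)              # bias correction for the warm-up steps
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0])
m = v = np.zeros(1)
theta, m, v = adam_step(theta, np.array([0.5]), m, v, t=1)
print(theta)  # parameter nudged opposite the gradient, scaled by the moments
```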

To quickly recap what we’ve designed so far: 1 input layer, 1 hidden layer, and 1 output layer; 4 input neurons for the 4 features; and 8 neurons in the hidden layer, chosen based on the number of input neurons, the number of output neurons, and the size of the data set.

And finally, 3 neurons in the output layer for the 3 desired classes.
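The original code block did not survive formatting; here is a sketch of what a baseline-model function for this network might look like in Keras (the function name and details beyond the architecture described above are assumptions):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

def baseline_model():
    """Build the network described above: 4 inputs -> 8 hidden (ReLU) -> 3 outputs (softmax)."""
    model = Sequential([
        Input(shape=(4,)),               # 4 input neurons, one per feature
        Dense(8, activation='relu'),     # hidden layer
        Dense(3, activation='softmax'),  # one output neuron per class
    ])
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
```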

We’ve also picked our activation functions (we default to the most popular: ReLU for the hidden layer and softmax for the output layer), and our optimizer is the Adam algorithm.

The function above builds our baseline model.

We can call this function any time we want to create a new model.

You can see the function above builds the exact network we described earlier in the article.

The rationale for these decisions is also explained above.

These next 3 lines of code are equally simple.

We first initialize a model.

We then go ahead and train our model using the Iris training data.

We first pick an arbitrary number of epochs and set verbose to 1, as this allows us to see whether our chosen number of epochs is too high, causing us to waste excess time training.
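Those lines might look like the following (a self-contained sketch: the model-builder is repeated so the snippet runs on its own, and the 80/20 split, random seed, and 200-epoch starting point are assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# One-hot encode the labels and hold out a test split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, np.eye(3)[iris.target], test_size=0.2, random_state=42)

def baseline_model():
    model = Sequential([Input(shape=(4,)),
                        Dense(8, activation='relu'),
                        Dense(3, activation='softmax')])
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

# The simple lines of interest: initialize a model, then train it on the Iris data
model = baseline_model()
history = model.fit(X_train, y_train, epochs=200, verbose=1)
```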

We do however need to be careful to avoid overfitting.

The article linked above goes more in depth on the topic of epochs and overfitting.

Or maybe our number of epochs is too low and there is no convergence.

Each epoch the entire dataset is passed forward and backward through the neural network a single time.

The weights are updated on each pass.

[Figure: training process]

We can see the accuracy greatly improves and converges around epoch 150.

We adjust our number of epochs accordingly.

Finally, we can perform classification on test data using the various features.

[Figure: Iris classification with accuracy > 94%, showcasing highly separable features]

Sources

Neural network – Wikipedia (en.wikipedia.org)
Artificial neuron – Wikipedia (en.wikipedia.org)
Adam: A Method for Stochastic Optimization (arxiv.org)
The Number of Hidden Layers – Heaton Research (www.heatonresearch.com)
Deep learning in neural networks: An overview (www.sciencedirect.com)