The concepts involved in neural network modelling for a non-specialist

Faizan Ahmad, Jan 10

With so many buzzwords flying around, such as artificial intelligence (AI) or deep learning, it can be daunting to have no familiarity with the terms and simply remain in awe of the big black box of intelligent machines and magic-like algorithms.

While researchers in the field continue to extend existing developments and explore new paradigms in amazingly clever ways, it is important to note that at the end of the day, it all boils down to maths: one does not need any specialist knowledge to understand the fundamentals.

In fact, many AI data scientists and software engineers come from diverse backgrounds and have upskilled themselves in the area.

The intention of this article is two-fold.

Firstly, it is an introduction for those who are not data scientists but are keen to get some insight into how computing systems learn by replicating the basic action of neurons in a brain.

Secondly, it is a discussion for beginner-level coders about the essential concepts involved, which are often implemented without understanding the reasoning behind them.

What is a neural network made of?

Think of it as a graph of nodes connected to one another, as shown in Figure 1.

In its simplest form, the layers at either end form the main inputs and outputs, whereas the layers in the middle are linked such that the outputs of one become the inputs for the subsequent layer.

As we will see later, these nodes are called neurons because they behave as such.

Figure 1: Illustration of a simple neural network

In practice, there could be many layers and many nodes per layer. For example, DeepMind's AlphaGo, which beat the world champions of Go in 2017, had over 17 thousand neurons for the input only.

There are also different ways to connect the layers. For instance, a recurrent neural network (which deals with sequential information) has nodes where the output at one stage is fed into the same node as an input for the next stage.

What happens at a node?

Simply put, the output of a node depends on a linear combination of the inputs.

This is a weighted sum of the inputs plus a constant, where the values of the weights and the constant term are unknown (i.e. they form the parameters of the neural network model).

The equation below, along with Figure 2, shows the case if there were only two inputs:

y = f(w_1*x_1 + w_2*x_2 + c)

where f stands for ‘a function of’, x_n is the n-th input, w_n is the n-th weight of that input, and c is the constant for the node with output y.
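As a minimal sketch of the equation above, the output of a node can be computed in a few lines of Python. The function name `node_output`, the example numbers and the default choice of f (which simply returns its argument) are illustrative assumptions, not values from any trained model.

```python
def node_output(inputs, weights, constant, f=lambda z: z):
    """Return f(w_1*x_1 + w_2*x_2 + ... + c) for a single node."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    # Weighted sum of the inputs plus the constant term.
    z = sum(w * x for w, x in zip(weights, inputs)) + constant
    return f(z)


# Two inputs, as in the equation above (all numbers are made up).
y = node_output(inputs=[2.0, 3.0], weights=[0.5, -1.0], constant=1.0)
print(y)  # 0.5*2.0 + (-1.0)*3.0 + 1.0 = -1.0
```

Passing a different function as `f` changes how the weighted sum is turned into the output of the node.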

Figure 2: Example of a node with only two inputs

What does it mean for a neuron to get activated?

Essentially, the neuron is active if its output is positive.

Keeping things simple again, the activation of a neuron can be explained with the concept of a cut-off or threshold.

If the weighted sum plus the constant in the equation above is negative, the output is clipped to zero, as seen in Figure 3(a).

Figure 3(a): Example of an activation function of a neuron

To view it from another angle, if we take the constant to the other side of the equation, we need the weighted sum of the incoming signals to be above a certain threshold to make the output non-zero, i.e. to be activated.

w_1*x_1 + w_2*x_2 + … ≤ -c → Node is inactive

w_1*x_1 + w_2*x_2 + … > -c → Node is active

Figure 3(b) demonstrates how this replicates the mechanism of neuronal activity in a brain, where billions of electrically charged cells transmit information through electrochemical signals.

Figure 3(b): Depiction of neuronal activity in a network

Note that what I have described above is a ReLU, or rectified linear unit, function for activation.

Other functions include sigmoid and tanh but follow a similar concept for neuron activation.
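The sketch below puts the ReLU described above next to the sigmoid and tanh functions, and expresses the threshold view of activation as a small helper. The helper name `is_active` and the sample values are assumptions made for illustration only.

```python
import math


def relu(z):
    """Rectified linear unit: negative values are clipped to zero."""
    return max(0.0, z)


def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def is_active(inputs, weights, constant):
    """A ReLU node is active when the weighted sum exceeds -constant."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return weighted_sum > -constant


# Compare the three activation functions on a few sample values.
for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), round(sigmoid(z), 3), round(math.tanh(z), 3))
```

With ReLU, `is_active` returns True exactly when the output of the node is greater than zero, which is the threshold condition written out earlier.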

What could the inputs & outputs look like in practice?

Let's consider the example where the aim is to predict flight delays.

The following past information would be potentially useful in determining the delay for a particular flight, and could therefore be used as the input to a learning system.

- The airports (both departure and arrival);
- The airline;
- The day of the week;
- The scheduled departure time;
- The scheduled arrival time;
- Weather conditions: temperature, wind speed, rainfall, snowfall, etc. for the respective airports at the time of take-off and landing;
- Weather conditions for the flight route;
- Information on airport traffic, and so on.

Using the actual departure and arrival times, the outputs can be computed.

Departure delay = Actual departure time - Scheduled departure time

Additional delay = Arrival delay - Departure delay

As with any mathematical formulation, care needs to be taken when setting up the problem in an AI context.

For example, note that the departure delay (= actual - scheduled) would be a part of the arrival delay.

To make the outputs independent, the additional delay in arrival would be a more suitable output to use instead.
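The two outputs can be computed directly from the timestamps. The following is a small sketch using Python's standard `datetime` module; the flight times are invented, and the helper `delay_minutes` is a hypothetical name introduced here.

```python
from datetime import datetime


def delay_minutes(scheduled, actual):
    """Delay in minutes; negative if the flight was early."""
    return (actual - scheduled).total_seconds() / 60.0


# A made-up flight record, purely for illustration.
scheduled_departure = datetime(2019, 1, 10, 9, 0)
actual_departure = datetime(2019, 1, 10, 9, 25)
scheduled_arrival = datetime(2019, 1, 10, 11, 0)
actual_arrival = datetime(2019, 1, 10, 11, 40)

departure_delay = delay_minutes(scheduled_departure, actual_departure)
arrival_delay = delay_minutes(scheduled_arrival, actual_arrival)
# Remove the part of the arrival delay already explained by departing late.
additional_delay = arrival_delay - departure_delay

print(departure_delay, arrival_delay, additional_delay)  # 25.0 40.0 15.0
```

Here the flight left 25 minutes late and arrived 40 minutes late, so only 15 minutes of delay were added after departure.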

Why do we have to scale numerical information?

In addition to general practices of cleaning the data, such as making the units consistent or removing outliers, many AI problems require certain pre-processing steps to generate valid results. This involves transforming both the input and output data.

When different types of data fields are used simultaneously, they all have to be normalised to avoid any bias.

To see what this means, let's refer back to the equation for the output of a node.

In our case of flight delays, for example, it should not matter whether we use Celsius or Fahrenheit as the units for temperature (say x_1), but clearly using 25 degrees Celsius or the equivalent 77 degrees Fahrenheit would give different results if this number is fed in directly.

As we will see later, building the model means determining the relative weights in this equation, which could give undue significance to one input variable over another if not treated properly.

It is important to note that it is not only about the choice of units; it also applies to dimensionless fields, e.g. the number of flights arriving or departing the airport in the same time-slot, compared to the fraction of check-in gates open at the departure airport.

Therefore, to make it a level playing field, each input needs to be scaled.

There are many ways to do so, e.g. by normalising it to be between 0 and 1, or by standardising it to have a mean of zero and a standard deviation of 1.
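Both approaches can be sketched with the standard library alone. The function names and the temperature values below are assumptions for illustration; the point is that the same temperatures give identical scaled values whether they start out in Celsius or Fahrenheit.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so that they lie between 0 and 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardise(values):
    """Shift and rescale values to mean zero and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# The same made-up temperatures expressed in two different units.
celsius = [10.0, 15.0, 25.0, 30.0]
fahrenheit = [c * 9.0 / 5.0 + 32.0 for c in celsius]

print(min_max_scale(celsius))
print(min_max_scale(fahrenheit))
```

Note that both helpers divide by the spread of the data, so this sketch would fail on a field where every value is identical; such a field carries no information and would normally be dropped.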

How do we encode non-numerical information?

Any qualitative or categorical information would naturally need to be converted to a numerical form in order to be used in the set-up described above.

For instance, an important piece of information to add in our prediction would be the associated airline for each flight (the hypothesis being that airlines differ in their punctuality or efficiency irrespective of external factors).

Let's say there are three airlines A, B and C.

We cannot simply assign them the numbers 1, 2 and 3, for a key reason that has to do with the differences among them.

If we do so, we are telling the system that airlines A and B are more similar (2 - 1 = 1) than A and C (3 - 1 = 2).

They might in fact be similar, but in that case we use the metric of similarity (e.g. the size of the airline company) as a separate variable.

However, here we are encoding their identity, which is unique and equally different from all the others.

One correct way to do so is to one-hot encode using dummy variables, where each category differs from all the others by the same amount, i.e. a change of two bits (one variable goes from 0 to 1, and another from 1 to 0).

…    d1  d2  d3
A :   1   0   0
B :   0   1   0
C :   0   0   1

where d1, d2 and d3 are dummy variables to represent the three airlines.

For instance, to change from A to B, dummy variables d1 and d2 flip their values.

In essence, the information has been converted to a binary format.

d1 is 1 if the airline is A, otherwise 0.

d2 is 1 if the airline is B, otherwise 0.

d3 is 1 if the airline is C, otherwise 0.
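The definitions of d1, d2 and d3 translate almost word for word into code. This is a minimal sketch; the function name `one_hot_encode` is an assumption, and in practice a library routine would usually be used instead.

```python
def one_hot_encode(category, categories):
    """Return a list of dummy variables with a single 1 for `category`."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    return [1 if category == c else 0 for c in categories]


airlines = ["A", "B", "C"]  # hypothetical airline labels
for airline in airlines:
    d1, d2, d3 = one_hot_encode(airline, airlines)
    print(airline, d1, d2, d3)
```

Each airline ends up with exactly one dummy variable set to 1, so no airline is treated as numerically larger or smaller than another.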

Yet another way to grasp this is through visualisation, as shown in Figure 4.

If distance is defined as the change in the number of bits, then A, B and C will form an equilateral triangle with each side equal to two units.

If there were a fourth airline D, the result would be a pyramid-like structure with a triangular base as before, in which every edge is two units long, giving a height of 2*sqrt(2/3), or about 1.63 units.
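The equal spacing can be checked numerically. The sketch below counts the number of differing bits between every pair of airlines under dummy-variable encoding and, for contrast, the difference implied by simply numbering them 1, 2, 3, 4. The helper `bit_distance` and the four labels are assumptions made for illustration.

```python
from itertools import combinations


def bit_distance(code_a, code_b):
    """Number of positions in which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


labels = ["A", "B", "C", "D"]  # hypothetical airlines
one_hot = {
    label: [1 if i == j else 0 for j in range(len(labels))]
    for i, label in enumerate(labels)
}
integer = {label: i + 1 for i, label in enumerate(labels)}

# Columns: pair, distance under dummy variables, distance under numbering.
for a, b in combinations(labels, 2):
    print(a, b, bit_distance(one_hot[a], one_hot[b]), abs(integer[a] - integer[b]))
```

Every pair is two bits apart under dummy variables, whereas the numbered version makes some pairs look further apart than others.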

Figure 4: Visualising the conversion of categorical information into numerical code

How does the system learn and predict?

The learning process in a nutshell is as follows.

For each subset of data (e.g. all the information for a set of flights), the neural network calculates the outputs at all its nodes using the weights from the previous subset (for the very first one, we can initialise the values randomly), and compares its output with the actual output (as this is past data, we already know how much the flight was delayed).

It then does something called backpropagation, in which it re-adjusts the weights to match its output to the true value, or in other words, it solves an optimisation problem for the weights to reduce this error.
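To make the idea of re-adjusting the weights concrete, here is a minimal sketch of gradient descent for a single node with two inputs and no activation function. It is only the simplest special case of what backpropagation does across many layers, not a full implementation; the data, the learning rate and the number of passes are all made-up assumptions.

```python
import random

random.seed(0)  # fixed seed so that repeated runs give the same numbers

# Made-up training data generated from y = 2*x_1 - 3*x_2 + 1.
data = []
for _ in range(50):
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    data.append(((x1, x2), 2 * x1 - 3 * x2 + 1))

# Start from random weights and a zero constant.
w1, w2, c = random.uniform(-1, 1), random.uniform(-1, 1), 0.0
learning_rate = 0.1


def mean_squared_error():
    """Average squared gap between the node's output and the true value."""
    return sum((w1 * x1 + w2 * x2 + c - y) ** 2 for (x1, x2), y in data) / len(data)


initial_error = mean_squared_error()
for _ in range(200):  # passes over the data
    for (x1, x2), y_true in data:
        error = (w1 * x1 + w2 * x2 + c) - y_true
        # Gradient of the squared error with respect to each parameter
        # (the constant factor 2 is absorbed into the learning rate).
        w1 -= learning_rate * error * x1
        w2 -= learning_rate * error * x2
        c -= learning_rate * error
final_error = mean_squared_error()

print(round(w1, 2), round(w2, 2), round(c, 2))
print(initial_error, final_error)
```

Each update nudges the weights in the direction that shrinks the error for the current data point, so the learnt values should approach the 2, -3 and 1 used to generate the data.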

This process is depicted in Figure 5.

Figure 5: Schematic diagram of the learning process using a neural network

If it has a substantial amount of meaningful data, it may be able to find the weights and constants of the network that best represent the true picture (e.g. the complex, multi-factor system dictating flight delays in the real world).

To check this, the actual output is compared to the predicted output for a fraction of the past data that has not been fed to it before.

If this is satisfactory, it could then be deployed to predict future outcomes.
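Holding back part of the past data for this check can be sketched as follows. The 20% test fraction, the helper names and the stand-in "model" are assumptions for illustration; a real project would use a trained network in place of the stand-in.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and hold back a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mean_absolute_error(predicted, actual):
    """Average size of the prediction error, ignoring its sign."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up records: (scheduled hour, additional delay in minutes).
records = [(hour, 2.0 * hour) for hour in range(10)]
train, test = train_test_split(records)

# Stand-in for a trained model: here it simply predicts 2 * hour + 1.
predictions = [2.0 * hour + 1.0 for hour, _ in test]
actuals = [delay for _, delay in test]
print(len(train), len(test), mean_absolute_error(predictions, actuals))  # 8 2 1.0
```

Because the test records were never used for learning, the error measured on them gives a fairer idea of how the model would behave on future flights.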

Can this be compared to the human learning process?

Think about how we learnt what a ball is, when we were toddlers.

We could hear an adult say the word ‘ball’ when we looked at something round.

This happened several times such that a round object (the image being the input) corresponded to the word ‘ball’ (the sound being the output).

Over time, we saw, along with many other things at the same time, balls of different colours, sizes, either stationary or moving, till we eventually figured out (or predicted the next time we saw one!) that a round object is probably called a ball.

In the same way, an artificial neural network updates itself with each batch of data, learning step by step, and typically requires large amounts of data to learn the underlying mechanism properly.

It could thus be of value in learning about situations that are not easily perceptible by humans.

It could fail (and does fail) — not only because of the limitations of the model or scarcity of data, but also if there is no predictive power in the nature of the data.

For example, adding in the average age of the crew or details about the snack menu on the flight is very unlikely to lead us anywhere in predicting the flight delay!

Final thoughts

When applying AI methods in practice, our hypothesis about the problem that we are trying to solve is crucial: there is no magic wand, nor a fortune-telling globe involved.

In an actual case, we assume that there are a number of factors that are collectively responsible for an outcome of interest, but the way they interact with one another is complex and not easily discernible.

Therefore, we decide to use an AI technique like a neural network to investigate.

We then make an informed choice of the inputs, test our hypotheses and iterate as needed.

It is not a passive, hands-off, all-in approach — it requires critical thinking and active decision-making.

This is original work carried out in my spare time.

Please let me know of any errors, omissions or improvements.
