Speech, DNA chains, and time series. What do they have in common?

The answer is “sequential.” The data is not static or discontinuous, but flowing and successive.

But what do we mean by “sequential” or “consecutive” when it comes to data? Let’s recall what we’ve been doing with other models like ANNs.

We had sample data and passed it through the layers from left to right.

This means we input all the data at one time, and it travels toward the output layer.

This was feedforward propagation, with only one direction of flow.

In the case of an RNN, however, the data aren’t all inputted at the same time.

As you can see in the picture on the right, we input X1 first and then feed X2 together with the result of the X1 computation.

In the same way, X3 is computed with the result from the X2 computation stage.

Therefore, when it comes to data, ‘sequential’ means there is an order in time between the data points.

With an ANN, there is no concept of order among X1, X2, and X3.

We just input them all at once.

In the case of an RNN, however, they are inputted at different times.

Therefore, if we change the order, the result becomes significantly different.

A sentence will lose its meaning.

And when it comes to DNA, this change might create… a mutant.

So the appeal of the RNN is that it can connect each data point with the previous ones.

This means that the model starts to care about the past and what is coming next.

As the recurrent units hold the past values, we can refer to this as memory.

We are now able to capture the real meaning of ‘context’ in data.

The structure of Recurrent Neural Networks

Now with this basic intuition, let’s go deeper into the structure of the RNN.

This is a simple RNN with one shallow layer.

Our model is now going to take two values: the X input value at time t and the output value A from the previous cell (at time t-1).

Please take a look at equation ①.

There are weights and a bias, just as with the simple ANN.

The only difference is the additional input value A0.

And there are two different outputs from the cell.

The output A1, which goes to the next unit (②), and the final output Y1 of the unit cell (③).

Don’t get stressed by all these subscripts.

They are just indicating to what value the weight belongs.

Now let’s move to the next units.
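To make equations ①–③ concrete, here is a minimal NumPy sketch of one forward pass through a few time steps. The dimensions, the initialization scale, and the plain linear output are illustrative assumptions, not values taken from the diagram.

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)  # ① new hidden state
    y_t = Wya @ a_t + by                          # ③ cell output (a softmax could follow)
    return a_t, y_t                               # a_t also flows to the next cell (②)

# illustrative dimensions and random weights
rng = np.random.default_rng(0)
n_x, n_a, n_y = 3, 5, 2
Waa = rng.standard_normal((n_a, n_a)) * 0.1
Wax = rng.standard_normal((n_a, n_x)) * 0.1
Wya = rng.standard_normal((n_y, n_a)) * 0.1
ba, by = np.zeros(n_a), np.zeros(n_y)

a = np.zeros(n_a)                          # A0
for x_t in rng.standard_normal((4, n_x)):  # X1..X4, fed in order
    a, y = rnn_cell_forward(x_t, a, Waa, Wax, Wya, ba, by)
```

Notice that the same weights are reused at every time step; only the hidden state `a` carries information from one step to the next.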

Simple and cool, right? Now we can predict the future.

If it’s about the stock market, we can predict the stock price of a company.

We might guess that with a big dataset, say, ten years of price history, we’ll get better accuracy.

So the longer the data, the better the outcome, right? But the truth is, this model ain’t as ideal as we’d expect.

The real thing starts from here

The idea of remembering the past is fantastic.

But there is one critical problem in back-propagation.

Back-propagation is a step for going backward to update the weights of each layer.

To update the weights, we take the gradient of the cost function and keep multiplying the gradients at the given layers using the chain rule.

The actual backward propagation in an RNN is a bit more complex than this diagram, but let’s skip the details for simplicity. (For example, the real backward propagation takes not only the final output Yt but also all the other outputs Y used by the cost function.)

Imagine the gradients are bigger than 1.

The updated values become too big to use for optimization.

This is called exploding gradients.

But this isn’t a severe problem, because we can clip the gradients to a fixed range they can’t exceed.

The real problem occurs when the gradients are smaller than 1.

If we keep multiplying the values lower than 1, the result becomes smaller and smaller.

After some steps, there will be no significant difference in outcome, and it can’t make any update in weights.

It is called vanishing gradients.

It means the back-propagation effect can’t go far enough to reach the early stage of layers.

This is why people say RNN has a bad memory.

If the length of the input values gets longer, we can’t expect actual optimization.

This is a really critical problem because the power of neural networks comes from updating weights.
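A toy calculation illustrates both failure modes. This is not a full RNN backward pass; the per-step factors (0.9 and 1.1) and the 50 time steps are made up purely to show what repeated multiplication does.

```python
# Repeatedly multiplying a per-step gradient factor across 50 time steps.
grad = 1.0
for _ in range(50):
    grad *= 0.9   # local gradient < 1 -> vanishing
print(grad)       # ~0.005: almost no update reaches the early time steps

grad = 1.0
for _ in range(50):
    grad *= 1.1   # local gradient > 1 -> exploding
print(grad)       # ~117: far too large for a stable update
```

The second case can be clipped; the first one cannot be fixed by clipping, which is why vanishing gradients are the real problem.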

Would there be other ways to fix this problem?

Let’s just forget about it!

To be honest, it’s very hard for people with bad memories to hold on to good memories.

We all know that.

We can’t remember too many things.

The same goes for our model.

Instead of dragging along all the past, maybe it’ll be better to remember selectively.

This is like choosing only the important information and forgetting the rest.

Having all those past values causes the vanishing gradients.

Therefore we’ll add an additional step to the simple RNN; the result is called Gated Recurrent Units (GRUs).

The diagram on the left shows the computation inside the unit cell of RNN.

There’s nothing new here.

It’s just showing that it takes two input values and returns two output values after the computation.

On the right side, you can see one small change inside the box.

The green box.

What is this? This is a gate controller.

This box determines whether we should remember or not.

We’ll compute the tanh activation (①) as we did with the RNN, but this value won’t be used right away.

It’s like a candidate.

Here we’ll also get the new value (②) for the gate.

Since it goes through the sigmoid function, the output value always falls between 0 and 1.

So by multiplying it with the ① value, we’re going to decide whether we’ll use it or not.

When it’s 0: “no use” or “no update” (keep the previous value in this case). When it’s 1: “use” or “update.” This is like opening and closing a gate.
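The candidate-plus-gate step can be sketched in NumPy as below. Note that this implements exactly the simplified gating just described (a full GRU also has a second, “reset” gate), and the weight names and sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell_forward(x_t, a_prev, Wc, Wu, bc, bu):
    concat = np.concatenate([a_prev, x_t])
    c_tilde = np.tanh(Wc @ concat + bc)  # ① candidate value
    u = sigmoid(Wu @ concat + bu)        # ② update gate, always in (0, 1)
    # gate decides: near 1 -> use the candidate, near 0 -> keep the previous value
    a_t = u * c_tilde + (1.0 - u) * a_prev
    return a_t

# illustrative demo with random weights
rng = np.random.default_rng(0)
n_x, n_a = 3, 5
Wc = rng.standard_normal((n_a, n_a + n_x)) * 0.1
Wu = rng.standard_normal((n_a, n_a + n_x)) * 0.1
bc, bu = np.zeros(n_a), np.zeros(n_a)

a = np.zeros(n_a)
for x_t in rng.standard_normal((4, n_x)):
    a = gru_cell_forward(x_t, a, Wc, Wu, bc, bu)
```

When the gate stays near 0 for many steps, the old value passes through almost unchanged, which is what lets gradients survive longer than in a plain RNN.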

The gate and the gate

Now we finally arrive at LSTM, Long Short-Term Memory networks.

Actually, LSTM was proposed earlier than GRUs.

But I introduced GRUs first because the structure of LSTM, with two more gates, is more complex than that of GRUs.

Additionally, there is a new concept here called the ‘cell state.’ But you’re already used to this idea: it keeps the value containing the memories aside from the hidden state value A.

Let’s have a look at what LSTM has in detail.

The blue box and the green box (the input gate) in the middle are the same as in GRUs.

There are two additional green boxes here: the forget gate and the output gate.

The computation will also be the same, so we’re going to get the output from the tanh function (①) and the input gate value from the sigmoid function (②).

But the multiplying step is a bit different this time.

Instead of taking only the result of the input gate, we also consider the forget gate value coming from the left.

Let’s see the equation in line 3 on the right.

We have the forget gate, and this will be used for updating the cell state value like what you can see in line 4.

Lastly there is one gate left, the output gate.

And as you can see in line 6, we get the A value at time t by multiplying the output gate value and the tanh activation value of C at time t.
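Assuming the standard LSTM formulation (forget, input, and output gates plus the cell state, matching lines 3–6 above), one step can be sketched like this; the weight names and dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_forward(x_t, a_prev, c_prev, W, b):
    """One LSTM step; W and b hold the parameters of the four computations."""
    concat = np.concatenate([a_prev, x_t])
    c_tilde = np.tanh(W["c"] @ concat + b["c"])  # ① candidate value
    i = sigmoid(W["i"] @ concat + b["i"])        # ② input gate
    f = sigmoid(W["f"] @ concat + b["f"])        # forget gate (line 3)
    c_t = f * c_prev + i * c_tilde               # cell state update (line 4)
    o = sigmoid(W["o"] @ concat + b["o"])        # output gate
    a_t = o * np.tanh(c_t)                       # hidden state (line 6)
    return a_t, c_t

# illustrative demo with random weights
rng = np.random.default_rng(0)
n_x, n_a = 3, 5
W = {k: rng.standard_normal((n_a, n_a + n_x)) * 0.1 for k in "cifo"}
b = {k: np.zeros(n_a) for k in "cifo"}

a, c = np.zeros(n_a), np.zeros(n_a)
for x_t in rng.standard_normal((4, n_x)):
    a, c = lstm_cell_forward(x_t, a, c, W, b)
```

Note how the cell state `c_t` is updated only by elementwise multiplication and addition, which gives the memory a more direct path through time than the hidden state alone.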

With these three gates, LSTM has a stronger memory ability.

Those are all about controlling which parts or what amount it should remember.

It’s so robust and effective that it’s really popular among the sequence models.

Then could we say it’s better than GRUs in all cases? As they say, there’s no silver bullet; GRUs have their own strengths as well.

Their structure is simpler than LSTM’s, so it’s sometimes more appropriate to use GRUs when building a big model.

Conclusion

What we’ve discussed so far was the sequence model with one shallow layer.

Then what would it be like if we added more layers? If adding more layers goes horizontally in an ANN or CNN, it goes vertically in an RNN.

But the sequence model is already a big model with only one or two layers, so it may end up overfitting if we add a few more layers on top of that.

Therefore, applying regularization techniques such as dropout, or normalization techniques such as batch normalization, is required.
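As a rough NumPy sketch of the idea (not a recipe for a real model), here is a two-layer vertical stack with inverted dropout applied between the layers; all sizes and the dropout rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_layer(xs, Waa, Wax, ba):
    """Run a simple RNN layer over a whole sequence, returning all hidden states."""
    a = np.zeros(Waa.shape[0])
    outs = []
    for x_t in xs:
        a = np.tanh(Waa @ a + Wax @ x_t + ba)
        outs.append(a)
    return outs

def dropout(a, rate):
    # inverted dropout: randomly zero units and rescale the survivors
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)

# 4 time steps, input size 3, two stacked layers of 5 units each
xs = rng.standard_normal((4, 3))
W1aa, W1ax = rng.standard_normal((5, 5)) * 0.1, rng.standard_normal((5, 3)) * 0.1
W2aa, W2ax = rng.standard_normal((5, 5)) * 0.1, rng.standard_normal((5, 5)) * 0.1

h1 = rnn_layer(xs, W1aa, W1ax, np.zeros(5))   # lower layer runs over the input
h1 = [dropout(a, 0.2) for a in h1]            # regularize between layers
h2 = rnn_layer(h1, W2aa, W2ax, np.zeros(5))   # upper layer runs over the lower layer's outputs
```

The hidden states of the lower layer become the input sequence of the upper layer; that is what “going vertically” means here.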

This post was mainly about understanding the structure of each model from simple RNN to LSTM.

The mathematical expressions we walked through can be complicated at first sight.

After you get familiar with the overall structures of the models, however, it becomes easy to understand.

Because math is just a numeric expression that represents a concept in a clear and effective manner.

If you can see the meaning behind all those numbers, then maths will become fun and interesting too.

As always, I’ve also brought some additional resources.

When you’re ready to take a further step with sequence models, I highly recommend checking out these articles as well.

Understanding LSTM Networks by Christopher Olah: Such a great article on understanding the structure of RNN and LSTM.

And Colah’s blog is really popular.

You can see many other materials which have this blog as a reference.

Must-Read Tutorial to Learn Sequence Modeling by Pulkit Sharma: I already shared his work in a previous post, and this is his series on sequence models.

I highly recommend checking out his other series as well if you haven’t.

Recurrent Neural Networks by Example in Python by Will Koehrsen: A gentle guide from the top writer of Medium.

This will be a great start for building your first RNN in Python.

Thank you for reading and I hope you found this post interesting.

If there’s anything that needs to be corrected, please share your insight with us.

I’m always open to talk so feel free to leave comments below and share your thoughts.

I’ll come back with another exciting project next time.

Until then, happy machine learning!