", df_ge.isna().sum())

Normalizing the data

The data is not normalized and the range for each column varies, especially Volume.

Normalizing the data helps the algorithm converge, i.e. find a local/global minimum efficiently.

I will use MinMaxScaler from scikit-learn.

But before that we have to split the dataset into training and testing datasets.

Also I will convert the DataFrame to ndarray in the process.
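The split-then-scale step described above can be sketched roughly as follows. This is a minimal illustration, not the article's exact code: the random stand-in data `df_demo`, the column list `train_cols`, and the 80/20 split are assumptions made for the example. The scaler is fitted on the training rows only and then reused on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): 100 rows of random "stock" values,
# with Volume on a much larger scale than the price columns.
rng = np.random.default_rng(0)
train_cols = ["Open", "High", "Low", "Close", "Volume"]
df_demo = pd.DataFrame(rng.random((100, 5)) * [50, 50, 50, 50, 1e6],
                       columns=train_cols)

# shuffle=False keeps the rows in time order, which matters for time series.
df_train, df_test = train_test_split(df_demo, train_size=0.8,
                                     test_size=0.2, shuffle=False)

# Fit the scaler on training data only, then apply it to the test data.
# .values converts the DataFrame slice to an ndarray.
scaler = MinMaxScaler()
x_train = scaler.fit_transform(df_train.loc[:, train_cols].values)
x_test = scaler.transform(df_test.loc[:, train_cols].values)

print(x_train.shape, x_test.shape)
```

Note that the scaled test values can fall slightly outside [0, 1], since the scaler only knows the minimum and maximum of the training rows.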

Converting data to time-series and supervised learning problem

This is quite important and somewhat tricky.

This is where knowledge of LSTMs is needed.

I will give a brief description of the key concepts needed here, but I strongly recommend reading Andrej Karpathy’s blog here, which is considered one of the best resources on LSTMs out there, and this.

Or you can watch Andrew Ng’s video too (which, by the way, mentions Andrej’s blog too).

LSTMs consume input in the format [batch_size, time_steps, features], a 3-dimensional array.

Batch size is the number of input samples you want your neural net to see before it updates the weights.

So let’s say you have 100 samples (input dataset) and you want to update weights every time your NN has seen an input.

In that case batch size would be 1 and total number of batches would be 100.

Likewise, if you wanted your network to update weights after it has seen all the samples, the batch size would be 100 and the number of batches would be 1.
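The arithmetic behind these two cases is just a ceiling division. A tiny sketch (the helper name `num_batches` is made up for illustration):

```python
import math

def num_batches(n_samples, batch_size):
    """Number of weight updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

print(num_batches(100, 1))    # 100 batches of 1 sample each
print(num_batches(100, 100))  # 1 batch holding all 100 samples
```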

As it turns out, using a very small batch size slows down training, while using too large a batch size (like the whole dataset) reduces the model’s ability to generalize to different data and also consumes more memory.

On the other hand, a large batch takes fewer steps to find the minimum of your objective function.

So you have to try out various values on your data and find the sweet spot.

It’s quite a big topic.

We will see how to search for these in a somewhat smarter way in the next article.

Time steps define how many units back in time you want your network to see.

For example, if you were working on a character prediction problem where you have a text corpus to train on, and you decide to feed your network 6 characters at a time, then your time step is 6.

In our case we will be using 60 as the time step, i.e. we will look at 2 months of data to predict the next day’s price.

Features is the number of attributes used to represent each time step.

Consider the character prediction example above, and assume that you use a one-hot encoded vector of size 100 to represent each character.

Then the feature size here is 100.
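To make the three dimensions concrete, here is a small sketch of the character example: 6 time steps, one-hot vectors of size 100, and an arbitrarily chosen batch of 2 sequences. The random character ids are stand-in data, not a real corpus.

```python
import numpy as np

batch_size, time_steps, features = 2, 6, 100

# Stand-in data (assumption): random character ids in [0, 100).
rng = np.random.default_rng(0)
char_ids = rng.integers(0, features, size=(batch_size, time_steps))

# Indexing the identity matrix turns each id into a one-hot row vector.
one_hot = np.eye(features)[char_ids]

print(one_hot.shape)  # (2, 6, 100) -> [batch_size, time_steps, features]
```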

Now that we have the terminology somewhat cleared up, let’s convert our stock data into a suitable format.

Let’s assume, for simplicity, that we chose 3 as our time step (we want our network to look back on 3 days of data to predict the price on the 4th day). Then we would form our dataset like this:

Samples 0 to 2 would be our first input and the Close price of sample 3 would be its corresponding output value; both are enclosed by the green rectangle.

Similarly, samples 1 to 3 would be our second input and the Close price of sample 4 would be its output value; these are represented by the blue rectangle.

And so on.

So far, each input is a matrix of shape (3, 5), 3 being the time step and 5 being the number of features.

Now think about how many such input-output pairs are possible in the image above: 5.

Now bring the batch size into the picture.

Let’s assume we choose a batch size of 2.

Then input-output pair 1 (green rectangle) and pair 2 (blue rectangle) would constitute batch one.

And so on.

Here is the Python code snippet to do this:

‘y_col_index’ is the index of your output column.
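A minimal sketch of such a sliding-window conversion is given below. It assumes a module-level `TIME_STEPS` constant (set to 3 here to match the simplified example) and exposes it as an optional `time_steps` argument, which is an addition for illustration. The toy matrix `demo` is made-up data with 8 rows and 5 columns.

```python
import numpy as np

TIME_STEPS = 3  # simplified look-back window from the example above

def build_timeseries(mat, y_col_index, time_steps=TIME_STEPS):
    """Turn a (rows, features) matrix into windowed inputs and targets.

    x[i] holds rows i .. i+time_steps-1; y[i] is the value of column
    y_col_index in the row right after that window.
    """
    n_samples = mat.shape[0] - time_steps
    n_features = mat.shape[1]
    x = np.zeros((n_samples, time_steps, n_features))
    y = np.zeros((n_samples,))
    for i in range(n_samples):
        x[i] = mat[i:i + time_steps]
        y[i] = mat[i + time_steps, y_col_index]
    return x, y

# Toy data (assumption): 8 "days" with 5 features each; column 3 plays Close.
demo = np.arange(40, dtype=float).reshape(8, 5)
x_demo, y_demo = build_timeseries(demo, y_col_index=3)
print(x_demo.shape, y_demo.shape)  # (5, 3, 5) (5,)
```

With 8 rows and a time step of 3, the loop produces 8 - 3 = 5 input-output pairs, each input being a (3, 5) matrix.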

Now suppose that after converting the data into supervised learning format, as shown above, you have 41 samples in your training dataset but your batch size is 20. Then you will have to trim your training set to remove the odd samples left out.
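One simple way to do this trimming, sketched under the assumption that dropping the trailing samples is acceptable, is to cut off the remainder of the division by the batch size:

```python
import numpy as np

def trim_dataset(mat, batch_size):
    """Drop trailing samples so that len(mat) is a multiple of batch_size."""
    n_drop = mat.shape[0] % batch_size
    return mat[:-n_drop] if n_drop > 0 else mat

# 41 samples with a batch size of 20: one sample is dropped.
samples = np.zeros((41, 3, 5))
trimmed = trim_dataset(samples, 20)
print(trimmed.shape)  # (40, 3, 5)
```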

I will look for a better way to go around this, but for now this is what I have done:

Now, using the above functions, let’s form our train, validation and test datasets:

    x_t, y_t = build_timeseries(x_train, 3)
    x_t = trim_dataset(x_t, BATCH_SIZE)
    y_t = trim_dataset(y_t, BATCH_SIZE)
    x_temp, y_temp = build_timeseries(x_test, 3)
    x_val, x_test_t = np.split(trim_dataset(x_temp, BATCH_SIZE), 2)
    y_val, y_test_t = np.split(trim_dataset(y_temp, BATCH_SIZE), 2)

Now that our data is ready we can concentrate on building the model.

Creating model

We will be using an LSTM for this task, which is a variation of the Recurrent Neural Network.

Creating an LSTM model is as simple as this:

Now that you have your model compiled and ready to be trained, train it as shown below.
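As a rough illustration of what building, compiling and training such a model can look like in Keras, here is a minimal sketch. It is not the tuned configuration discussed in this article: the layer sizes, dropout rate, optimizer, batch size and the random stand-in data are all assumptions chosen only so the example runs quickly.

```python
import numpy as np
from tensorflow import keras

TIME_STEPS = 60   # look-back window used in this article
N_FEATURES = 5    # assumption: Open, High, Low, Close, Volume
BATCH_SIZE = 20   # illustrative value, not a tuned one

# A small, non-stateful LSTM regressor (layer sizes are assumptions).
model = keras.Sequential([
    keras.Input(shape=(TIME_STEPS, N_FEATURES)),
    keras.layers.LSTM(32),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1),  # one output: the next day's scaled Close price
])
model.compile(loss="mean_squared_error", optimizer="adam")

# Random stand-in data shaped like [samples, time_steps, features].
rng = np.random.default_rng(0)
x_demo = rng.random((40, TIME_STEPS, N_FEATURES)).astype("float32")
y_demo = rng.random((40,)).astype("float32")

# shuffle=False keeps the samples in their original time order.
history = model.fit(x_demo, y_demo, epochs=1, batch_size=BATCH_SIZE,
                    shuffle=False, verbose=0)
print(model.output_shape)
```

The `history.history["loss"]` list holds the training loss per epoch, which is what a training vs validation loss plot would be drawn from (validation loss appears once `validation_data` is passed to `fit`).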

If you are wondering what values to use for parameters like epochs, batch size etc., don’t worry; we will see how to figure those out in the next article.

Training this model (with fine-tuned hyperparameters) gave a best error of 3.27e-4 and a best validation error of 3.7e-4.

Here is what the training loss vs validation loss looked like:

Figure: Training error vs Validation error

This is how the prediction looked with the above model:

Figure: prediction vs real data

I have found that this configuration for the LSTM works the best out of all the combinations I have tried (for this dataset), and I have tried more than 100.

So the question is: how do you land on the perfect (or, in almost all cases, close to perfect) architecture for your neural network?

This leads us to our next and important section, to be continued in the next article.

You can find all the complete programs on my GitHub profile here.
