Predicting time series with deep learning, R and Apache MXNetAnkit KhediaBlockedUnblockFollowFollowingJan 18Multivariate time series forecasting is one of the most commonly encountered problems with various applications such as weather forecasting, stock price prediction, real estate predictions.

Recently, deep learning techniques have been applied to solve this class of problems.

In this blog, I will show how Apache MXNet R package (MXNet-R) can be used can be used to model and solve time series forecasting problems.

MXNet-R is a binding of Apache MXNet deep learning back-end with the R language as a front-end.

MXNet-R allows R enthusiasts to leverage deep learning in their research projects and applications, staying within the R eco-system, while leveraging the powerful Apache MXNet deep learning framework.

In this blog, we will train a model to learn a time series (linear combination of sine waves) to model a multi-dimensional synthetic dataset.

With this dataset we will demonstrate how you can use MXNet-R to train on the data to create a model, and then use the model to forecast time series.

The code below has been tested on MacOS, running on CPU with MXNet 1.

3.

0.

To get started, visit the MXNet install page and install the latest MXNet-R package based on your system configuration.

Currently, MXNet-R supports Linux, Windows and OSX.

The next step requires to create synthetic datasets and then we will discuss steps required to achieve our goal.

Full code for the task can be found in gist.

Dataset CreationWe will create a dataset where each dimension will be a time series sine wave and the output will be linear combination of these six time series data.

We will try to predict the output based on previous time series data for the output value.

Time series for each dimensionPre-processing stepsWe start with loading R libraries required for the task, followed by cleaning and normalizing the data.

Please install the required packages, if not already installed, using the R console.

library("readr")library("dplyr")library("mxnet")library("abind")Now we will start with creating the synthetic datasets.

We will create six sine wave time series and the seventh time series will be linear combination of previous six to create a multidimensional dataset where each of the previous sine wave serves as one dimension.

# generating datasetslen<-44000df <- data.

frame(dim1= numeric(len),dim2= numeric(len),dim3= numeric(len),dim4= numeric(len),dim5= numeric(len),dim6= numeric(len),dim7= numeric(len))df$dim1 <- sin(pi/12 * (1:(len)))df$dim2 <- sin(5 + pi/6 * (1:(len)))df$dim3 <- sin(15 + pi/8 * (1:(len)))df$dim4 <- sin(55 + pi/10 * (1:(len)))df$dim5 <- sin(2 + pi/16 * (1:(len)))df$dim6 <- sin(1+ pi/7 * (1:(len)))df$dim7 <- 7*df$dim1+ 5*df$dim2+ 17*df$dim3+ 25*df$dim4 + 19*df$dim5+ 21*df$dim6We also normalize each of the feature set to a range [0,1] as normalizing helps in getting rid of noise and network learns better.

# Normalizing featuresdf <- matrix(as.

matrix(df), ncol = ncol(df), dimnames = NULL)rangenorm <- function(x) { (x – min(x))/(max(x) – min(x))}df <- apply(df, 2, rangenorm)df <- t(df)For using multi-dimensional data with MXNet-R, we need to convert training data to the form n_dim x seq_len x num_samples.

For one-to-one RNN flavors, labels should be of the form seq_len x num_samples, while for many-to-one flavor, the labels should be of the form (1 x num_samples).

Please note that MXNet-R currently supports only these two flavors of RNN.

One-to-one RNNs has one output at each timestep i.

e.

length of input sequence and output sequence are equal (ex.

Part of Speech tagging) whereas many-to-one RNNs have one output for a sequence (ex.

sentiment analysis for a sequence of words).

Different flavors of RNNs supported by MXNet-R [Diagram Source]We will use n_dim = 7, seq_len = 100, and num_samples = 430 because the dataset has 430 samples, each the length of 100 timestamps.

We have 7 time series as input features so each input has dimension of seven at each time step.

n_dim <- 7seq_len <- 100num_samples <- 430The next step shows how to format data and labels in the form required by the RNNs in MXNet-R.

We will extract only required data and the labels.

# extract only required data from datasettrX <- df[1:n_dim, 1:(seq_len * num_samples)]# the label data(next output value) should be one time # step ahead of the current output valuetrY <- df[7, 2:(1+(seq_len * num_samples))]We will reshape the matrices to convert into the input form required by the RNN network.

trainX <- trXdim(trainX) <- c(n_dim, seq_len, num_samples)trainY <- trYdim(trainY) <- c(seq_len, num_samples)Training the networkOnce we have the formatted dataset, we will create the neural network, using symbolic programming, and will then train the newly created network.

We start with setting up samples.

We will take the first 300 samples for training and the next 100 samples for evaluation.

We will also keep aside few samples for testing purposes.

batch.

size <- 32train_ids <- 1:300eval_ids <- 301:400The next step is to create data iterators for feeding into the RNN network.

Data iterators in MXNet are iterator objects.

In MXNet, data iterators return a batch of data as DataBatch on each call to next.

A DataBatch often contains n training examples and their corresponding labels.

## create data iteratorstrain.

data <- mx.

io.

arrayiter(data = trainX[,,train_ids, drop = F], label = trainY[, train_ids], batch.

size = batch.

size, shuffle = TRUE)eval.

data <- mx.

io.

arrayiter(data = trainX[,,eval_ids, drop = F], label = trainY[, eval_ids], batch.

size = batch.

size, shuffle = FALSE)Now we will create the neural network using RNN symbols.

We will set the number of RNN layers to be 2 and the number of hidden units to be 50 since the dataset at hand is quite small and simple and we don’t need a very complex model to learn it.

We will use LSTM units to learn the sequences.

Long Short-Term Memory (LSTMs) are RNNs used for learning long temporal sequences.

## Create the symbol for RNNsymbol <- rnn.

graph(num_rnn_layer = 2, num_hidden = 50, input_size = NULL, num_embed = NULL, num_decode = 1, masking = F, loss_output = "linear", dropout = 0.

5, ignore_label = -1, cell_type = "lstm", output_last_state = T, config = "one-to-one")Next step involves setting up the loss function.

We will use mean squared error as our metric for loss and we can define it as shown below.

mx.

metric.

mse.

seq <- mx.

metric.

custom("MSE", function(label, pred) { label = mx.

nd.

reshape(label, shape = -1) pred = mx.

nd.

reshape(pred, shape = -1) res <- mx.

nd.

mean(mx.

nd.

square(label – pred)) return(as.

array(res))})We will set up context, optimizer and initializer for the network and custom callback functions for each epoch.

We used CPU as we were running on Mac.

However, you MXNet-R also has GPU support for Ubuntu and Windows.

ctx <- mx.

cpu()initializer <- mx.

init.

Xavier(rnd_type = "gaussian", factor_type = "avg", magnitude = 1)optimizer <- mx.

opt.

create("adadelta", rho = 0.

9, eps = 1e-06, wd = 1e-06, clip_gradient = 1, rescale.

grad = 1/batch.

size)logger <- mx.

metric.

logger()epoch.

end.

callback <- mx.

callback.

log.

train.

metric(period = 10, logger = logger)Now, we are all set to start training the network.

Training may take a while, depending on the hardware you are running on.

It took me 234 seconds running on MacBook Pro with 2.

3 GHz Intel Core i5.

## train the networksystem.

time(model <- mx.

model.

buckets(symbol = symbol, train.

data = train.

data, eval.

data = eval.

data, num.

round = 200, ctx = ctx, verbose = TRUE, metric = mx.

metric.

mse.

seq, initializer = initializer, optimizer = optimizer, batch.

end.

callback = NULL, epoch.

end.

callback= epoch.

end.

callback))The output of the above code:Start training with 1 devices[1] Train-MSE=0.

0942884422838688[1] Validation-MSE=0.

0346374846994877[2] Train-MSE=0.

0371618691831827[2] Validation-MSE=0.

0358746200799942.

.

.

.

[199] Train-MSE=0.

00409387513063848[199] Validation-MSE=0.

00288719875970855[200] Train-MSE=0.

00403149246703833[200] Validation-MSE=0.

00306809868197888 user system elapsed 234.

265 20.

535 134.

364Training vs Validation errorFrom the above plot, we can see how training as well as validation error decreased gradually and network was able to learn the time-series.

Predicting future valuesNow we will use the trained network for inference.

We start with extracting state symbols for the trained model.

## Extracting the state symbols for RNNinternals <- model$symbol$get.

internals()sym_state <- internals$get.

output(which(internals$outputs %in% "RNN_state"))sym_state_cell <- internals$get.

output(which(internals$outputs %in% "RNN_state_cell"))sym_output <- internals$get.

output(which(internals$outputs %in% "loss_output"))symbol <- mx.

symbol.

Group(sym_output, sym_state, sym_state_cell)We will try to predict the timeseries values of our first test sample, the 401st sample.

pred_length <- 100predicted <- numeric()We need to initialize the states of the RNN.

Hence, we will pass the 400th sample through the network to get the values for RNN states and use it to start the prediction of the next 100 timestamps.

data <- mx.

nd.

array(trainX[, , 400, drop = F])label <- mx.

nd.

array(trainY[, 400, drop = F])We create data iterators for the input, please note that the label is required to create the iterator and will not be used in the inference.

You can use dummy values too in the label.

infer.

data <- mx.

io.

arrayiter(data = data, label = label, batch.

size = 1, shuffle = FALSE)infer <- mx.

infer.

rnn.

one(infer.

data = infer.

data, symbol = symbol, arg.

params = model$arg.

params, aux.

params = model$aux.

params, input.

params = NULL, ctx = ctx)We will now iterate time-step by timestep to generate each pollution value, starting at the beginning of the 401st sample.

Here we use the RNN state values generated from the ground truth data of the previous timestep to do the next time-step prediction.

The other way to predict the next time-step is by using the predicted values as inputs.

This is known as “auto-regressive inference”, as we will see after this example.

actual <- trainY[, 401]## Iterate one by one over timestampsfor (i in 1:pred_length) { data <- mx.

nd.

array(trainX[, i, 401, drop = F]) label <- mx.

nd.

array(trainY[i, 401, drop = F]) infer.

data <- mx.

io.

arrayiter(data = data, label = label, batch.

size = 1, shuffle = FALSE) ## use previous RNN state values infer <- mx.

infer.

rnn.

one(infer.

data = infer.

data, symbol = symbol, ctx = ctx, arg.

params = model$arg.

params, aux.

params = model$aux.

params, input.

params = list(rnn.

state=infer[[2]], rnn.

state.

cell = infer[[3]])) pred <- infer[[1]] predicted <- c(predicted, as.

numeric(as.

array(pred)))}predictedcontains the predicted values whereas actual contains actual values.

Let’s try to visualize the prediction with respect to actual values.

We generated 401st sample as discussed above and the results are given below.

Predicted and actual values for 401st time sampleThe model is quite simple yet learns the the time series data pretty well.

We can also repeat the above experiment for the 401st time sample with auto-regressive inference.

In this example, we will use the output from one time-step as input for next time-step.

The initial setup for getting RN states would be same as above.

## setting up RNN statespred_length <- 100predicted <- numeric()data <- mx.

nd.

array(trainX[, , 400, drop = F])label <- mx.

nd.

array(trainY[, 400, drop = F])infer.

data <- mx.

io.

arrayiter(data = data, label = label, batch.

size = 1, shuffle = FALSE)infer <- mx.

infer.

rnn.

one(infer.

data = infer.

data, symbol = symbol, arg.

params = model$arg.

params, aux.

params = model$aux.

params, input.

params = NULL, ctx = ctx)However, the inference loop can be changed appropriately for the task as shown below.

pred <- as.

numeric(as.

array(infer[[1]])[1,100])actual <- trainY[, 401]## Iterate one by one over timestampsfor (i in 1:pred_length) { data_auto <- mx.

nd.

array(trainX[, i, 401, drop = F]) m <- as.

array(data_auto) m[1,,] <- as.

numeric(as.

array(pred)) data_auto <- mx.

nd.

array(m) label <- mx.

nd.

array(trainY[i, 401, drop = F]) infer.

data <- mx.

io.

arrayiter(data = data_auto, label = label, batch.

size = 1, shuffle = FALSE) ## use previous RNN state values infer <- mx.

infer.

rnn.

one(infer.

data = infer.

data, symbol = symbol, ctx = ctx, arg.

params = model$arg.

params, aux.

params = model$aux.

params, input.

params = list(rnn.

state=infer[[2]], rnn.

state.

cell = infer[[3]])) pred <- infer[[1]] predicted <- c(predicted, as.

numeric(as.

array(pred)))}The inference result looks as follows.

Predicted and actual values for 401st time sample using auto-regressive inference methodThe auto-regressive inference does deviate from the actual predictions as expected but the model learns the seasonality and correlation with the covariate for the 100 next time steps, not bad!You can improve further with more tweaking of hyper-parameters, and the network architecture using more complex state-of-the-art models like DeepAR and LSTNet.

You can find the entire code for the above task in gist.

Call to ActionThere’s a lot more you can do with MXNet and MXNet-R — head out to the tutorials to check it out!.MXNet is a powerful deep learning framework with which you can save the trained models and use it with other MXNet language bindings and/or Model Server.

ReferencesThe Unreasonable Effectiveness of Recurrent Neural NetworksMusings of a Computer Scientist.

karpathy.

github.

ioapache/incubator-mxnetLightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler…github.

comTime-Series with custom RNN cellThe resulting data is of shape [1, 100, 500] corresponding to [num_features, seq_length, samples].

In multivariate…jeremiedb.

github.

io.. More details