Extreme Event Forecasting with LSTM Autoencoders

Improve forecasting performance by developing a strong neural network architecture

Marco Cerliani, May 21

Dealing with extreme event prediction is a frequent nightmare for every Data Scientist.

Looking around I found very interesting resources that deal with this problem.

Personally, I fell in love with the approach released by Uber researchers.

In their papers (two versions are available here and here) they developed an ML solution for daily prediction of traveler demand.

Their methodology caught my attention for its ingenuity, clear explanation and easy implementation.

So my purpose is to reproduce their approach in Python.

I’m very satisfied with this challenge, and in the end I improved my knowledge of regression forecasting.

The most important takeaways from this post can be summarized as:

- Develop a stable approach to evaluate and compare Keras models (avoiding at the same time the problem of the weight seed generator);
- Implement a simple and clever LSTM Autoencoder for new feature creation;
- Improve forecast performance for time series with easy tricks (see the step above);
- Deal with nested datasets, i.e. problems where we have observations that belong to different entities (for example, time series of different stores/engines/people and so on). In this sense we develop a single high-performance model for all of them!

But keep calm and let’s proceed step by step.

PROBLEM OVERVIEW

At Uber, accurate prediction of completed trips (particularly during special events) provides a series of important benefits: more efficient driver allocation, resulting in decreased wait times for riders, budget planning and other related tasks.

In order to reach highly accurate predictions of driver demand for ride sharing, Uber researchers developed a high-performance model for time series forecasting.

They are able to fit a single model, in one shot, on many heterogeneous time series coming from different locations/cities.

This process makes it possible to extract relevant temporal patterns.

In the end they were able to forecast demand, generalizing across different locations/cities and outperforming classical forecasting methods.

THE DATASET

For this task Uber made use of an internal dataset of daily trips among different cities, including additional features, i.e. weather information and city-level information.

They aimed to forecast the next day demand from a fixed window of past observations.

Unfortunately we don’t have this kind of data at our disposal, so we, as Kaggle fans, chose the nice Avocado Prices Dataset.

This data shows historical avocado prices, for two different types, and sales volume in multiple US markets.

Our choice was due to the need for a nested dataset with temporal dependency: we have a time series for each US market, 54 in total, a number that grows to 108 if we consider one time series for each type (conventional and organic).
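To make the idea of a nested dataset concrete, here is a minimal sketch of how flat rows can be grouped into one series per (market, type) pair. The rows, the tuple layout and the helper name `split_into_series` are illustrative assumptions, not the real dataset or the code used in this post.

```python
from collections import defaultdict

# Hypothetical rows: (date, region, type, average_price); not the real data.
rows = [
    ("2017-01-01", "Albany", "conventional", 1.10),
    ("2017-01-01", "Albany", "organic", 1.55),
    ("2017-01-08", "Albany", "conventional", 1.12),
    ("2017-01-01", "Boston", "conventional", 1.30),
    ("2017-01-08", "Boston", "conventional", 1.28),
    ("2017-01-08", "Albany", "organic", 1.60),
]

def split_into_series(rows):
    """Group prices by (region, type), ordered by date within each group."""
    series = defaultdict(list)
    # ISO date strings sort chronologically, so sorting the tuples is enough.
    for date, region, kind, price in sorted(rows):
        series[(region, kind)].append(price)
    return dict(series)

series = split_into_series(rows)
print(len(series))                     # number of distinct series
print(series[("Albany", "organic")])   # prices of one series, oldest first
```

With the full data, the same grouping would give the 108 series mentioned above (54 markets times 2 types).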

This data structure is highlighted as important by Uber researchers because it allows our model to discover important relations that are not immediately visible.

Correlation among series also brings advantages for our LSTM Autoencoder during the feature extraction process.

To build our model we utilized time series of prices at our disposal up to the end of 2017.

The first 2 months of 2018 are stored and used as test set.
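A time-based split like this one can be sketched in a few lines. The observations and the cut-off constant `TEST_START` below are made up for illustration; only the rule (2017 and earlier for training, 2018 for testing) comes from the text.

```python
from datetime import date

# Hypothetical weekly observations: (observation date, price).
observations = [
    (date(2017, 12, 17), 1.20),
    (date(2017, 12, 24), 1.25),
    (date(2017, 12, 31), 1.22),
    (date(2018, 1, 7), 1.30),
    (date(2018, 2, 25), 1.27),
]

TEST_START = date(2018, 1, 1)  # assumed cut-off: everything from 2018 on is test

# Split by date, never at random, so the test set lies strictly in the future.
train = [(d, p) for d, p in observations if d < TEST_START]
test = [(d, p) for d, p in observations if d >= TEST_START]
```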

For our analysis we will also take into consideration all the provided regressors.

The observations have a weekly frequency, so our purpose is: given a fixed past window (4 weeks) of features, predict the upcoming weekly price.
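The windowing step can be sketched with NumPy as follows. The helper `make_windows` and the toy arrays are assumptions made for illustration; in practice the windows would be built separately for each series, so that no window mixes two markets.

```python
import numpy as np

def make_windows(features, target, window=4):
    """Return (X, y): X[i] holds `window` consecutive rows of `features`,
    y[i] is the target value of the step right after that window."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n_samples = len(target) - window
    X = np.stack([features[i:i + window] for i in range(n_samples)])
    y = target[window:]
    return X, y

# Toy series for one market: 10 weeks, 2 features per week (price, volume).
prices = np.arange(10, dtype=float)
feats = np.column_stack([prices, prices * 100])
X, y = make_windows(feats, prices, window=4)
print(X.shape, y.shape)
```

The resulting `X` has the (samples, time steps, features) layout that Keras LSTM layers expect.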

Figure: Train (blue) and Test (orange) overview of avocado prices

Due to the absence of exponential growth and trending behavior, we don’t need to scale our price series.

MODELING

In order to solve our prediction task, we replicate the novel model architecture proposed by Uber, which provides a single model for heterogeneous forecasting.

As the figure below shows, the model first primes the network by automatic feature extraction, training an LSTM Autoencoder, which is critical to capture complex time-series dynamics at scale.

Feature vectors are then concatenated with the new input and fed to the LSTM Forecaster for prediction.

Our forecasting workflow is easy to picture: we have our initial windows of weekly prices for different markets.

We start by training our LSTM Autoencoder on them; next we detach the encoder and use it as a feature creator.

The second and final step is to train an LSTM prediction model for forecasting.

Based on the real/existing regressors and the previously generated artificial features, we are able to provide next week’s avocado price prediction.

Figure: from Time-series Extreme Event Forecasting with Neural Networks at Uber

We can easily recreate this logic with Keras.

Our LSTM Autoencoder is composed of a simple LSTM encoder layer, followed by another simple LSTM decoder.

Don’t forget the TimeDistributed layer at the end.

You will understand the utility of dropout during evaluation; at this point it is harmless, trust me!

    from keras.layers import Input, LSTM, Dense, TimeDistributed
    from keras.models import Model

    # Encoder: LSTM over the input window, dropout kept active via training=True
    inputs_ae = Input(shape=(sequence_length, 1))
    encoded_ae = LSTM(128, return_sequences=True, dropout=0.3)(inputs_ae, training=True)
    # Decoder: second LSTM, then one reconstructed value per time step
    decoded_ae = LSTM(32, return_sequences=True, dropout=0.3)(encoded_ae, training=True)
    out_ae = TimeDistributed(Dense(1))(decoded_ae)

    # Train the autoencoder to reconstruct its own input (X -> X)
    sequence_autoencoder = Model(inputs_ae, out_ae)
    sequence_autoencoder.compile(optimizer='adam', loss='mse', metrics=['mse'])
    sequence_autoencoder.fit(X, X, batch_size=16, epochs=100, verbose=2, shuffle=True)

Next we extract the features and concatenate the result with the other variables.

At this point I made a small deviation from the Uber solution: they suggest manipulating the feature vectors extracted by the encoder by aggregating them via an ensemble technique (e.g., averaging).

I decided to leave them as they are.

I made this choice because it achieved better results in my experiments.

    import numpy as np

    # Reuse the trained encoder as a feature extractor
    encoder = Model(inputs_ae, encoded_ae)
    XX = encoder.predict(X)
    # Append the external regressors F along the feature axis
    XXF = np.concatenate([XX, F], axis=2)

In the end, the prediction model is another simple LSTM-based neural network:

    inputs1 = Input(shape=(X_train1.shape[1], X_train1.shape[2]))
    lstm1 = LSTM(128, return_sequences=True, dropout=0.3)(inputs1, training=True)
    lstm1 = LSTM(32, return_sequences=False, dropout=0.3)(lstm1, training=True)
    dense1 = Dense(50)(lstm1)
    out1 = Dense(1)(dense1)

    model1 = Model(inputs1, out1)
    model1.compile(loss='mse', optimizer='adam', metrics=['mse'])
    model1.fit(X_train1, y_train1, epochs=30, batch_size=128, verbose=2, shuffle=True)

EVALUATION

Finally we are almost ready to see some results and make predictions.

The last steps involve the creation of a rival model and the consequent definition of a robust forecasting methodology for comparing results.

Personally, I think the best way to evaluate two different procedures is to make them as similar as possible, in order to draw attention only to the points of real interest.

In this implementation I want to show evidence of the power of the LSTM Autoencoder as a tool for creating relevant features for time series forecasting.

So, to evaluate the quality of our methodology, I decided to develop a new model for price forecasting with the same structure as our previous forecasting NN.

    # Rival forecaster: same architecture as model1, different input features
    inputs2 = Input(shape=(X_train2.shape[1], X_train2.shape[2]))
    lstm2 = LSTM(128, return_sequences=True, dropout=0.3)(inputs2, training=True)
    lstm2 = LSTM(32, return_sequences=False, dropout=0.3)(lstm2, training=True)
    dense2 = Dense(50)(lstm2)
    out2 = Dense(1)(dense2)

    model2 = Model(inputs2, out2)
    model2.compile(loss='mse', optimizer='adam', metrics=['mse'])
    model2.fit(X_train2, y_train2, epochs=30, batch_size=128, verbose=2, shuffle=True)

The only difference between model1 and model2 is the features they receive as input: model1 receives the encoder output plus the external regressors; model2 receives the past raw prices plus the external regressors.
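The difference between the two inputs is easiest to see from the array shapes. The sketch below uses random arrays as stand-ins: the sample count, the number of regressors and the variable names are assumptions, while the window of 4 steps and the 128 encoder units come from the code in this post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, window, n_regressors, n_encoded = 8, 4, 3, 128

raw_prices = rng.normal(size=(n_samples, window, 1))             # past prices
regressors = rng.normal(size=(n_samples, window, n_regressors))  # external regressors
# Stand-in for encoder.predict(raw_prices): random values with the encoder's output shape.
encoded = rng.normal(size=(n_samples, window, n_encoded))

# model1 input: encoder features + regressors; model2 input: raw prices + regressors.
X_model1 = np.concatenate([encoded, regressors], axis=2)
X_model2 = np.concatenate([raw_prices, regressors], axis=2)
print(X_model1.shape, X_model2.shape)
```

Both inputs share the same samples and time steps; only the last (feature) axis differs, which is what makes the comparison between the two models fair.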

UNCERTAINTY ESTIMATION

Time series forecasting is difficult by nature because of the extreme variability of the domain of interest.

In addition, if you build a model based on a neural network, your results are also subject to the internal weight initialization.

To overcome these drawbacks, a number of approaches exist for uncertainty estimation, from Bayesian ones to those based on bootstrap theory.

In their work, Uber researchers combine bootstrap and Bayesian approaches to produce a simple, robust and tight uncertainty bound with good coverage and provable convergence properties.

This technique is extremely simple and practical… indirectly we have already implemented it!

As you can see in the figure below, during the feedforward process dropout is applied to all layers in both the encoder and the prediction network.

As a result, the random dropout in the encoder perturbs the input intelligently in the embedding space, which accounts for potential model misspecification and gets further propagated through the prediction network.
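The effect of keeping dropout active at prediction time can be illustrated without Keras. The following is a toy NumPy sketch, not the network used in this post: the two-layer model, its random weights and the helper names `dropout` and `stochastic_forward` are all assumptions chosen to show the mechanism.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def stochastic_forward(x, w_hidden, w_out, rate, rng):
    """One forward pass of a tiny two-layer network with dropout left on."""
    hidden = np.tanh(dropout(x, rate, rng) @ w_hidden)
    return dropout(hidden, rate, rng) @ w_out

rng = np.random.default_rng(42)
x = rng.normal(size=(5, 6))            # 5 samples, 6 input features
w_hidden = rng.normal(size=(6, 16))    # fixed weights standing in for a trained model
w_out = rng.normal(size=(16, 1))

# Repeat the stochastic pass many times and summarise the spread per sample.
passes = np.stack([stochastic_forward(x, w_hidden, w_out, 0.3, rng)
                   for _ in range(200)])
mean_pred = passes.mean(axis=0)
std_pred = passes.std(axis=0)
print(mean_pred.shape, std_pred.shape)
```

Each pass samples a different dropout mask, so the same input yields a distribution of outputs; its mean serves as the prediction and its standard deviation as a measure of uncertainty.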

Figure: from Time-series Extreme Event Forecasting with Neural Networks at Uber

In Python terms, we simply have to add dropout to the layers of our neural network and keep it active during prediction (Keras normally disables dropout at prediction time).

Here is the simplified function I used, which combines dropout activation, feature concatenation and prediction in one shot.

    from keras import backend as K

    def stoc_drop1(r):
        # Backend functions that take the input plus the learning-phase flag
        enc = K.function([encoder.layers[0].input, K.learning_phase()],
                         [encoder.layers[-1].output])
        NN = K.function([model1.layers[0].input, K.learning_phase()],
                        [model1.layers[-1].output])

        # Stochastic encoder pass, then append the external regressors
        enc_pred = np.vstack(enc([x_test, r]))
        enc_pred = np.concatenate([enc_pred, f_test], axis=2)
        # Stochastic forecaster pass on the concatenated features
        NN_pred = NN([enc_pred, r])
        return np.vstack(NN_pred)

To complete the evaluation we must call the above function repeatedly and store the results.

I also compute the score of the predictions at each iteration (I chose Mean Absolute Error).

    import tqdm
    from sklearn.metrics import mean_absolute_error

    scores1 = []
    for i in tqdm.tqdm(range(0, 100)):
        scores1.append(mean_absolute_error(y_test1, stoc_drop1(0.5)))

    print(np.mean(scores1), np.std(scores1))

We must set the number of times we compute the evaluation (100 in our case) and the dropout probability (I chose 0.5 at each layer).

With the scores stored, we are able to compute the mean, standard deviation and relative uncertainty of the MAE.
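As a small worked example of these summary statistics, the sketch below uses invented MAE values. Taking the "relative uncertainty" to be the standard deviation divided by the mean is an assumption made here, since the post does not spell out the formula.

```python
import statistics

# Hypothetical MAE values from repeated stochastic evaluations (not real results).
scores = [0.119, 0.117, 0.118, 0.120, 0.116]

mean_mae = statistics.mean(scores)
std_mae = statistics.pstdev(scores)        # population std, like np.std's default
relative_uncertainty = std_mae / mean_mae  # assumed definition: std divided by mean

print(round(mean_mae, 4), round(std_mae, 4), round(relative_uncertainty, 4))
```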

FORECASTING AND RESULTS

We replicate the same procedure for our ‘rival model’, made up of only the LSTM prediction network.

After averaging the scores and computing the uncertainty, the final results are: 0.118 MAE (0.0012 MAE uncertainty) for the LSTM Autoencoder + LSTM Forecaster and 0.124 MAE (0.0015 MAE uncertainty) for the single LSTM Forecaster.

We register an overall improvement of about 5% in forecast accuracy with a similar degree of uncertainty.
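The size of the improvement follows directly from the two MAE values reported above; the variable names in this short check are illustrative.

```python
mae_autoencoder = 0.118   # LSTM Autoencoder + LSTM Forecaster (from the text)
mae_baseline = 0.124      # single LSTM Forecaster (from the text)

# Relative reduction in MAE with respect to the baseline model.
relative_improvement = (mae_baseline - mae_autoencoder) / mae_baseline
print(f"{relative_improvement:.1%}")  # prints 4.8%
```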

We can assert that our LSTM Autoencoder is a good weapon for extracting important unseen features from time series.

Below I also report the scoring performance on single markets for both organic and conventional avocado types.

Figure: Performance (MAE) comparison on test data

During training I also excluded one entire market (the ‘Albany’ region).

I did this because I want to test the power of our network on an unseen series.
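Holding out an entire entity can be sketched as a simple filter on the series keys. The dictionary below is a made-up stand-in for the real data, and the names `all_series` and `HELD_OUT_REGION` are assumptions.

```python
# Hypothetical mapping from (region, type) to a weekly price series.
all_series = {
    ("Albany", "conventional"): [1.10, 1.12, 1.15],
    ("Albany", "organic"): [1.55, 1.60, 1.58],
    ("Boston", "conventional"): [1.30, 1.28, 1.31],
    ("Boston", "organic"): [1.80, 1.79, 1.83],
}

HELD_OUT_REGION = "Albany"

# The held-out market never reaches training; it is used only for evaluation.
train_series = {k: v for k, v in all_series.items() if k[0] != HELD_OUT_REGION}
unseen_series = {k: v for k, v in all_series.items() if k[0] == HELD_OUT_REGION}
```

Filtering on the region removes both the conventional and the organic series of that market, so the model sees neither of them during training.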

We again register an improvement in performance in both the organic and conventional market segments.

Figure: Performance (MAE) comparison on unseen time series

SUMMARY

In this post I replicate an end-to-end neural network architecture developed at Uber for special event forecasting.

I want to emphasize: the power of the LSTM Autoencoder in the role of feature extractor; the scalability of this solution, which generalizes well and avoids training a separate model for every time series; the ability to provide a stable and useful method for neural network evaluation.

I also remark that this kind of solution is well suited when you have at your disposal an adequate number of time series that share common behaviours… It’s not important that these are immediately visible; the Autoencoder finds them for us.

CHECK MY GITHUB REPO

Keep in touch: Linkedin

REFERENCES

[1] Deep and Confident Prediction for Time Series at Uber: Lingxue Zhu, Nikolay Laptev

[2] Time-series Extreme Event Forecasting with Neural Networks at Uber: Nikolay Laptev, Jason Yosinski, Li Erran Li, Slawek Smyl