The importance of context in data sets: A short experimentUsing four forecasting methods in the same time series to show performance differencesTies van den EndeBlockedUnblockFollowFollowingFeb 11These days, data scientists are spoilt for choice when it comes to selecting which method they will use on a database.
Academic literature has moved well beyond the well-known and once widely-used Ordinary Least Squares method.
Contrary to most other fields of research, new theories and methods are usually implemented into business applications relatively quickly, and as it stands today, the field of Time Series Forecasting is arguably the best-researched and most-used one.
This article is not a deep dive into a single method that works best under certain circumstances.
Instead, I would like to show you that no method is unambiguously preferred in every scenario.
In fact, I have set up a short experiment to show you the importance of the context of a dataset, and the importance of full-scale retraining.
My demonstration makes use of four different methods and a publicly available dataset: the hourly footfall in the centre of York, made available by the U.
The data ranges from March 30th, 2009 until December 17th, 2018.
Overview of MethodsA short disclaimer first: in order to fully justify my point, I would need to do a thorough investigation with several models and preprocessing methods, which is outside of the scope of this article.
But a preliminary conclusion can be drawn even when restricting research to the following four methods: Averaging, Exponential Smoothing, Random Forest and Gradient Boosting Machine.
AveragingA simple composite averaging method.
First, average values per day of the week are obtained after aggregating the data at the daily level.
Next, percentages attributable to hours within a day are computed, enabling us to redistribute the data from daily forecasts to hourly forecasts.
Exponential SmoothingExponential smoothing entails assigning exponentially decreasing weights over time, which means this model explicitly takes into account past events.
To train the model, a smoothing parameter β (0 < β < 1) needs to be estimated.
Random ForestTraining a Random Forest implies repeatedly selecting a subset of the data and a random selection of the features to train a regression tree.
Subsequently, predictions are constructed by averaging predictions made using each individual regression tree.
Gradient Boosting MachineGradient Boosting Machines are trained by iteratively combining so-called weak learners into one single strong learner.
Each iteration aims to further minimise the error term of earlier trained learners by estimating a regression tree.
Note that, in order to apply machine learning techniques to a time series such as this dataset, feature engineering cannot be ignored.
By the very nature of machine learning, temporal considerations are not taken into account, so data scientists should indicate the time dependencies through the features.
Particularly when the data shows a trend, autoregression (including lagged values) can be valuable, as machine learning techniques usually do not extrapolate well.
In this exercise, for both the Random Forest and the Gradient Boosting Machine methods, I opted for including these time indicators: year, month, and day of week.
ResultsI first used data up until December 9th, 2017 to train our various models.
I let the models predict one week ahead.
Based on these results, the Random Forest model seems to have the highest predictive power, and so would constitute the preferred model.
Next, I trained all the models on the same period of the year, just one year later (until December 9th, 2018), and let them predict one week ahead again.
Even though the forecasts target the same period of the year, just one year further, the outcomes and subsequent conclusions now are strikingly different.
And that is my main point.
Changing the context can lead to substantially different outcomes and conclusions even within a relatively simple dataset.
One conclusion drawn from this experiment is the importance of full-scale retraining.
Usually, retraining boils down to applying the exact same method, preprocessing, and feature engineering to a slightly different dataset.
However, as shown above, persistent usage of a once chosen approach in time will lead to incorrect conclusions.
What might be more important is that it demonstrates that the context of the dataset influences results far beyond the predictive power of any single model.
It implies that each modeling assignment requires a range of methods to be taken into consideration.
The performance of each method (and the method selection) will greatly depend on the context of the problem at hand.