Two thoughts on the question “Are time series models considered part of Machine Learning or not?”

Skander Hannachi, Ph.D.

Apr 30

This is a very brief note that I decided to write after yet again coming across some form of the question “Is time series analysis part of machine learning?” or “Is time series analysis considered supervised learning?” in a discussion forum.

This question is obviously a very broad one, and is to some extent subjective.

I got into an argument at work a few months back with a colleague on whether fitting an exponential smoothing model constituted “learning from data” or not, even though we were in complete agreement over how the model fitting and the technology we were using worked.

I do not pretend to do this broad question justice, or to provide a comprehensive answer to it.

Instead, I will present two useful ways of looking at the relationship between time series analysis and ML methods.

I find them useful because keeping them in mind will help you decide which approach to take when solving a time series analysis problem.

Moreover, the distinctions I describe here avoid any sophistry along the lines of “Is linear regression considered machine learning?”, etc.

Statistical methods are parametric, ML methods are not:

In the time series community, most people (following for example Makridakis, Spiliotis, and Assimakopoulos) would divide time series analysis techniques into two categories: statistical methods like ARIMA, Holt-Winters, Theta, etc., and ML methods like Neural Networks, Support Vector Regression or Random Forests.

Using this classification, the main difference between the two categories is that in the former case the models are parametric, i.e. a known function is assumed to represent the data (for example exponential smoothing: Y*(t) = αY(t-1) + (1-α)Y*(t-1)), and we just need to fit the right parameters to that function.

In the latter case we don’t make any assumptions about the shape of the function that represents our data, and we rely on the universal approximation properties of our algorithm to find the best fit to our time series. (Strictly speaking, most ML models are parametric as well, but in a looser, broader sense. For our purposes we can consider them non-parametric in the sense that they can approximate any function to an arbitrary level of precision and complexity.)
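To make “we just need to fit the right parameters” concrete, here is a minimal sketch of fitting the single parameter α of simple exponential smoothing. It is an illustration under assumptions, not how any particular library does it: the function names are made up for this note, Y*(0) is simply set to the first observation, and a plain squared-error grid search stands in for a proper likelihood-based fit.

```python
def ses_one_step_forecasts(y, alpha, initial_level):
    """One-step-ahead forecasts for simple exponential smoothing.

    forecasts[t] is the forecast of y[t]; forecasts[0] is the assumed Y*(0).
    """
    forecasts = [initial_level]
    for t in range(1, len(y)):
        # Y*(t) = alpha * Y(t-1) + (1 - alpha) * Y*(t-1)
        forecasts.append(alpha * y[t - 1] + (1 - alpha) * forecasts[t - 1])
    return forecasts


def sum_squared_errors(y, alpha, initial_level):
    """In-sample squared error of the one-step forecasts (first point excluded)."""
    forecasts = ses_one_step_forecasts(y, alpha, initial_level)
    return sum((obs - pred) ** 2 for obs, pred in zip(y[1:], forecasts[1:]))


def fit_alpha(y, grid_size=101):
    """Pick the alpha in [0, 1] with the smallest in-sample squared error."""
    initial_level = y[0]  # assumption: initialise Y*(0) with the first observation
    candidates = [i / (grid_size - 1) for i in range(grid_size)]
    return min(candidates, key=lambda a: sum_squared_errors(y, a, initial_level))


# Made-up data, purely for illustration
series = [12.0, 13.5, 12.8, 14.1, 15.0, 14.6, 15.9, 16.4, 16.1, 17.3]
alpha_hat = fit_alpha(series)
```

The shape of the function is fixed in advance; the data only decides one number. That is the sense in which the model is parametric.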

Why is this useful?

When choosing whether to go with statistical methods or with ML, you should ask yourself:

Is the stochastic process underlying my time series data complex enough that it warrants some type of universal approximator to model it?

Do I have enough data points and a high enough signal-to-noise ratio in my time series so that I can fit a complex non-parametric model?

Do the compute resources available allow for ML-based methods, since fitting simple parametric models requires less computation? (This is especially relevant if you plan on automating your process and running it in a production environment.)

Sequential Methods vs. Pure Auto-Regressive methods:

A second interesting classification was provided by Bergmeir, Hyndman and Koo.

In their paper, they divide time series methods into sequential methods and pure auto-regressive methods.

Sequential models, like exponential smoothing models: Here the relation between the predicted value Y*(t) and the past values Y(t-1), Y(t-2), etc. is recursive, and with each new time step the model “consumes” an additional lagged value.

Ultimately the entire time series needs to be used if one wants a complete expression of the model in terms of the data.

One can see this for example with simple exponential smoothing: Y*(t) = αY(t-1) + (1-α)Y*(t-1).

If we want to expand this expression in terms of the available data, we have to unravel it: Y*(t) = αY(t-1) + (1-α)(αY(t-2) + (1-α)Y*(t-2)), etc., until we end up with a model of the form: Y*(t) = αY(t-1) + α(1-α)Y(t-2) + α(1-α)²Y(t-3) + α(1-α)³Y(t-4) + … + (1-α)^t Y*(0).

(Note that Y*(0) is an un-observable variable which is why these methods typically use maximum likelihood estimators).
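A quick way to convince yourself of the unrolling is to compute the forecast both ways and compare. The sketch below is only a numerical check: the function names, the value of α and the starting value Y*(0) are illustrative assumptions. Starting the recursion at Y*(0), the weight on Y*(0) after t steps works out to (1-α)^t.

```python
def ses_recursive(y, alpha, initial_forecast, t):
    """Forecast Y*(t) by applying Y*(k) = alpha*Y(k-1) + (1-alpha)*Y*(k-1) for k = 1..t."""
    forecast = initial_forecast  # Y*(0) is unobserved, so it has to be assumed or estimated
    for k in range(1, t + 1):
        forecast = alpha * y[k - 1] + (1 - alpha) * forecast
    return forecast


def ses_unrolled(y, alpha, initial_forecast, t):
    """The same forecast written as a weighted sum over every past observation."""
    weighted_sum = sum(alpha * (1 - alpha) ** (k - 1) * y[t - k] for k in range(1, t + 1))
    return weighted_sum + (1 - alpha) ** t * initial_forecast


y = [3.0, 5.0, 4.0, 6.0, 7.0]
alpha, y_star_0 = 0.3, 3.0  # illustrative values, not estimates
recursive_value = ses_recursive(y, alpha, y_star_0, t=5)
unrolled_value = ses_unrolled(y, alpha, y_star_0, t=5)
```

The two values agree up to floating point error, and the unrolled version makes it visible that every observation back to the start of the series carries some weight.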

You see here why the model is considered “sequential”: the only way to fit the parameters to the data is to calculate everything in sequence.

Exponential smoothing models, ARIMA models with an MA component, and most state space models fit into the sequential category.

Typically you fit such models with Maximum Likelihood methods and/or Kalman filters.

“Pure” auto-regressive methods are any time series model where the forecast value Y*(t) is a function of a fixed number of n past lags of the time series variable Y(t): Y*(t) = f(Y(t-1),Y(t-2),…Y(t-n)).

f(…) can be linear or non-linear; the key is that the number of past lags used for prediction is always fixed.

ML models all fall into this category, but so do some ARIMA models (i.e. ARIMA(p,d,q) models where q=0, so without a moving average component).

You don’t need the entire sequence to estimate f(Y(t-1),Y(t-2),…Y(t-n)), and you use only observable data points to fit/train your model, hence OLS or Gradient Descent can be used.
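As a rough sketch of what this looks like in practice, the series can be cut into rows of n lagged inputs and one target, after which fitting f is an ordinary regression problem. The example below assumes a linear f with no intercept and solves the least squares problem through the normal equations with a small hand-written solver; the helper names are hypothetical, and in real work you would use a numerical library instead.

```python
def make_lagged_rows(series, n_lags):
    """Turn a series into (inputs, target) rows: [Y(t-1), ..., Y(t-n)] -> Y(t)."""
    rows = []
    for t in range(n_lags, len(series)):
        lags = [series[t - k] for k in range(1, n_lags + 1)]
        rows.append((lags, series[t]))
    return rows


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear_ar(series, n_lags):
    """Ordinary least squares for Y(t) = sum_k coef[k-1] * Y(t-k), no intercept."""
    rows = make_lagged_rows(series, n_lags)
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(x[i] * x[j] for x, _ in rows) for j in range(n_lags)] for i in range(n_lags)]
    xty = [sum(x[i] * target for x, target in rows) for i in range(n_lags)]
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free series that follows Y(t) = 0.6*Y(t-1) + 0.3*Y(t-2)
series = [1.0, 2.0]
for _ in range(30):
    series.append(0.6 * series[-1] + 0.3 * series[-2])

coefficients = fit_linear_ar(series, n_lags=2)
```

Each row uses only observed values, and no row needs to know what came before its own n lags. Swapping the linear fit for a neural network or a random forest changes f, not the way the rows are built.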

Why is this useful?

The main result of Bergmeir, Hyndman and Koo’s paper is that for sequential models, normal cross-validation is not applicable, because the order in which the model sees the data is crucial to how the model is fitted.

You would have to resort to time series cross-validation, which is trickier to do, or give up cross-validation altogether in favor of another model selection method.

For pure auto-regressive methods, on the other hand, normal cross-validation is applicable, provided that the errors in your model are uncorrelated.

So whether or not you can use cross validation will depend on the approach you use.
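The difference between the two validation schemes is easiest to see in how the row indices are split. The sketch below is one simple way to generate both kinds of split, with function names and parameters chosen for this note rather than taken from any library: shuffled K-fold for lag-embedded rows of a pure auto-regressive model, and a rolling-origin split, one common form of time series cross-validation, where the training set always precedes the test set.

```python
import random


def k_fold_splits(n_rows, n_folds, seed=0):
    """Standard K-fold: shuffle row indices, then hold out one fold at a time."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    return [
        (sorted(i for f in folds[:k] + folds[k + 1:] for i in f), sorted(folds[k]))
        for k in range(n_folds)
    ]


def rolling_origin_splits(n_rows, min_train, horizon=1):
    """Time series CV: train on everything before an origin, test on what follows."""
    splits = []
    for origin in range(min_train, n_rows - horizon + 1):
        train = list(range(origin))
        test = list(range(origin, origin + horizon))
        splits.append((train, test))
    return splits


# For a pure auto-regressive model, each row is one (lags, target) pair
shuffled = k_fold_splits(n_rows=10, n_folds=5)

# For a sequential model, the order of the observations must be preserved
ordered = rolling_origin_splits(n_rows=10, min_train=6)
```

In the shuffled splits a test row can sit in the middle of the training rows, which is only acceptable when the model does not depend on seeing the data in order. In the rolling-origin splits every training index is smaller than every test index.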

Incidentally, the term “pure” auto-regressive models is used because ARIMA or BSTS can have auto-regressive components, but are still overall sequential models.

One final note: Facebook’s increasingly popular Prophet model doesn’t fit in either of the categories that Bergmeir et al. propose, since it is a GAM-style model that is fitted directly against the time variable.

It is closer to the sequential models in spirit in the sense that normal cross-validation wouldn’t work that well with it, given the explicitly time dependent nature of the model.

Neither does Prophet fit very well in either of the categories I mention in my first classification: on one hand it is parametric and very statistical in spirit, being based on GAMs; on the other hand the second term in the Prophet model is a Fourier expansion, which makes it kind of similar in spirit to some kernel-based ML methods.

Maybe that’s why it’s becoming so popular: it has a “best of both worlds” vibe going for it.
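To illustrate what “fitted directly against the time variable” with a Fourier expansion means, here is a small sketch of the kind of features such a seasonal term is built from. This is not Prophet’s code: the function name, the period and the order are assumptions for illustration, and the seasonal term is assumed to be a weighted sum of these sine and cosine features.

```python
import math


def fourier_features(t, period, order):
    """Return [sin(2*pi*1*t/P), cos(2*pi*1*t/P), ..., sin(2*pi*N*t/P), cos(2*pi*N*t/P)]."""
    features = []
    for n in range(1, order + 1):
        angle = 2.0 * math.pi * n * t / period
        features.extend([math.sin(angle), math.cos(angle)])
    return features


# Weekly seasonality (period of 7 days) with 3 harmonics, evaluated on day 3
weekly = fourier_features(3.0, period=7.0, order=3)
```

Note that the inputs are functions of t itself, not lagged values of Y, which is why the model sits outside both the sequential and the pure auto-regressive categories.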

References:

S. Makridakis, E. Spiliotis, V. Assimakopoulos, “Statistical and Machine Learning forecasting methods: Concerns and ways forward”, PLoS ONE, 2018.

C. Bergmeir, R. J. Hyndman, B. Koo, “A note on the validity of cross-validation for evaluating time series prediction”, Monash University Dept of Econometrics and Business Statistics Working Paper, 2015.

S. J. Taylor, B. Letham, “Forecasting at scale”, The American Statistician, 2018.